Description of MapReduce concept:
http://java.dzone.com/articles/how-hadoop-mapreduce-works
- Idea of mapper & reducer
- Data flow for a single MapReduce job
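The same data flow can be mimicked in a plain shell before any Hadoop is involved, which is also a handy way to debug streaming scripts; mapper.py, reducer.py and input.txt below are just placeholders for whatever scripts and data you use:
Code Block
# local stand-in for the streaming flow: map -> shuffle/sort -> reduce
cat input.txt | ./mapper.py | sort | ./reducer.py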
Hadoop - my first MapReduce (M/R) code
- install VMware on the local machine
- import Hadoop server from ...
- fire VM up
- To use Hadoop Streaming, add a system variable:
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
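A quick sanity check that the jar is really there (the exact file name differs between CDH versions), and optionally making the variable permanent:
Code Block
ls -l $SJAR                              # verify the streaming jar exists at that path
echo "export SJAR=$SJAR" >> ~/.bashrc    # optional: persist it for new shells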
Now you can write M/R in any language.
- upload Shakespeare text to HDFS (Hadoop Distributed File System)
cd ~/git/data   (is distributed w/ Cloudera VM)
tar vzxf shakespeare.tar.gz
- check nothing is in HDFS: hadoop fs -ls /user/training
- add text file to HDFS: hadoop fs -put input /user/training/inputShak
- and check again: hadoop fs -ls /user/training
- Execute M/R job under hadoop using 'cat' & 'wc':
Code Block
hadoop jar $SJAR \
  -mapper cat \
  -reducer wc \
  -input inputShak \
  -output outputShak
- inspect output in: hadoop fs -cat outputShak/p*
175376 948516 5398106
(the three numbers are wc's output: lines, words, and bytes of the Shakespeare input)
Task: write own Map & Reduce counting frequency of words:
mapper : read text data from stdin, write "<key> <value>" to stdout (<key>=word, <value>=1)
example: $ echo "foo foo quux labs foo bar quux" | ./mapper.py
- Python Mapper
Code Block
title mapp1.py
borderStyle solid
#!/usr/bin/env python
# my 1st mapper: writes <word> 1
import sys

data = sys.stdin.readlines()
for ln in data:
    L = ln.split()
    for key in L:
        if len(key) > 1:
            print key, 1
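The mapper can be tried out locally before submitting anything to Hadoop; with the echo example from above it should print one "<word> 1" pair per line (expected output shown as comments, assuming the script was made executable):
Code Block
chmod +x mapp1.py
echo "foo foo quux labs foo bar quux" | ./mapp1.py
# foo 1
# foo 1
# quux 1
# labs 1
# foo 1
# bar 1
# quux 1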
reducer : read a stream of "<word> 1" from stdin, write "<word> <count>" to stdout
Code Block
title redu1.py
borderStyle solid
#!/usr/bin/env python
# my 1st reducer: reads <word> <value> pairs, sums values of the same consecutive key, writes <word> <sum>
import sys

data = sys.stdin.readlines()
myKey = ""
myVal = 0
for ln in data:
    L = ln.split()
    nw = len(L) / 2          # a line may carry several <word> <value> pairs
    for i in range(nw):
        key = L[2 * i]
        val = int(L[2 * i + 1])
        if myKey == key:
            myVal = myVal + val
        else:
            if len(myKey) > 0:
                print myKey, myVal,
            myKey = key
            myVal = val
if len(myKey) > 0:
    print myKey, myVal,
- Execute:
Code Block
hadoop jar $SJAR \
  -mapper $(pwd)/mapp1.py \
  -reducer $(pwd)/redu1.py \
  -input inputShak \
  -output outputShak3
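The results can then be inspected the same way as for the cat/wc job (the p* glob matches the part files written by the reducers):
Code Block
hadoop fs -ls outputShak3
hadoop fs -cat outputShak3/p*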
Cloudera - my 1st EC2 cluster deployed
- follow instructions at http://archive.cloudera.com/docs/ec2.html
- Item 2.1: I uploaded 3 tar files: 'client script', 'boto', and 'simplejson'. Un-tarred all 3: tar vzxf ....
- execute twice, in the 'boto' and 'simplejson' directories:
Code Block
sudo python setup.py install
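A quick way to confirm both libraries are importable after the install (run it outside the source directories so the installed copies are picked up):
Code Block
python -c "import boto, simplejson"    # no output means both imported fine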
- move hadoop-ec2 to a permanent place & add it to the path (for easier use)
Code Block
tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta..tar.gz
sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
export PATH=$PATH:$HADOOP_EC2_HOME
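A quick check that the shell now finds the script from its new location:
Code Block
which hadoop-ec2    # should resolve to a path under $HADOOP_EC2_HOME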
- export the environment variables by hand:
AWS_ACCESS_KEY_ID - Your AWS Access Key ID
AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
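Concretely, with placeholder values (the real ones come from your AWS account's credentials page):
Code Block
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx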
- create a directory called ~/.hadoop-ec2 w/ file ec2-clusters.cfg with content:
Code Block
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
- fire up a cluster of 1 server + 2 nodes:
hadoop-ec2 launch-cluster my-hadoop-cluster 2
- define a bunch of variables in /usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh
1+2 machines are up and running:
Code Block
Waiting for master to start (Reservation:r-d0d424b8) ...............................................
master i-1cbe3974 ami-6159bf08 ec2-67-202-49-216.compute-1.amazonaws.com ip-10-244-182-63.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:31:17.000Z us-east-1c
Waiting for slaves to start .............................................
slave i-f2be399a ami-6159bf08 ec2-67-202-14-97.compute-1.amazonaws.com ip-10-244-182-127.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:32:19.000Z us-east-1c
slave i-f4be399c ami-6159bf08 ec2-75-101-201-30.compute-1.amazonaws.com ip-10-242-19-81.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:32:19.000Z us-east-1c
Waiting for jobtracker to start
Waiting for 2 tasktrackers to start .................................................1.............2.
Browse the cluster at http://ec2-67-202-49-216.compute-1.amazonaws.com/
To login to the cluster execute:
hadoop-ec2 login my-hadoop-cluster
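Once logged in on the master, a quick smoke test (nothing cluster-specific assumed here) shows that HDFS and the jobtracker are answering:
Code Block
hadoop fs -ls /      # HDFS responds
hadoop job -list     # jobtracker responds (no jobs yet)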
To terminate the cluster execute:
hadoop-ec2 terminate-cluster my-hadoop-cluster
and say 'yes'!!
To delete the EC2 security groups:
hadoop-ec2 delete-cluster my-hadoop-cluster
To actually browse them I need to:
- execute ./hadoop-ec2 proxy my-hadoop-cluster
Resulting with:
export HADOOP_EC2_PROXY_PID=20873;
echo Proxy pid 20873;
- and add the proxy to FireFox - so far it did not work.
Functional packages I made
- Very simple word count example which works TAR BALL
- Compute PageRank TAR BALL for medium-set wiki-pages from the Harvard class by Hanspeter Pfister; the code still has problems in deployment: it crashes on reduce if the full 11M lines are processed on a 10-node EC2 cluster
- here is the tar-ball of the final version of my code
- to upload it more often I used the command:
scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .
./mySetup.sh -f l -v11 -D
it contains:
Code Block
training  495 2009-11-11 20:25 abcd-pages
training  290 2009-11-11 20:25 cleanup.py
training 1374 2009-11-14 19:47 mappPR.py
training 2302 2009-11-14 18:30 pageRankCommon.py
training 2648 2009-11-14 18:31 pageRankCommon.pyc
training 1034 2009-11-14 19:25 reduPR.py
training 7251 2009-11-14 11:33 runPageRank.sh
training 1806 2009-11-14 18:34 wiki2mappPR.py
- to upload the data set to hadoop HDFS by hand I did:
hadoop fs -put wL-pages-iter0 wL-pages-iter0
- to execute the full map/reduce job w/ 3 iterations (cleanup all, write raw file, use 4map+2red, init Map, 3 x M/R, final sort):
./runPageRank.sh -X -w -D 4.2 -m -I 0.3 -i -f
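For context, the quantity such an iterative M/R job computes is the standard PageRank: with damping factor d and N pages in total, each iteration updates every page p as
PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outdeg(q)
(how exactly mappPR.py/reduPR.py split this work between map and reduce is not spelled out here).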
More advanced instructions about running Hadoop on Amazon EC2:
http://wiki.apache.org/hadoop/AmazonEC2