{panel}
Description of MapReduce concept:
http://java.dzone.com/articles/how-hadoop-mapreduce-works
{panel}

Idea of mapper & reducer
!P1.png!

Data flow for a single MapReduce job
!P4.png!


h5. Hadoop - my first MapReduce (M/R) code

# install VMware on the local machine
# import Hadoop server from [http://www.cloudera.com/hadoop-training-virtual-machine]
# fire VM up
# To use Hadoop Streaming add a system variable:
*export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar*
Now you can write M/R in any language.
# upload Shakespeare text to HDFS (Hadoop Distributed File System)
\\
*cd \~/git/data* (distributed w/ the Cloudera VM)
*tar vzxf shakespeare.tar.gz*
_check nothing is in HDFS:_ *hadoop fs \-ls /user/training*
# _add text file to HDFS_ *hadoop fs \-put input /user/training/inputShak*
_and check again_ *hadoop fs \-ls /user/training*
# Execute M/R job under Hadoop using 'cat' & 'wc'
{code}
hadoop jar $SJAR \
-mapper cat \
-reducer wc \
-input inputShak \
-output outputShak
{code}
## inspect output in *hadoop fs \-cat outputShak/p\**
the result ('wc' output: lines, words, bytes) is: 175376    948516   5398106
# h6. Task: write own Map & Reduce counting frequency of words:

*mapper*: read text data from stdin
write "<key> <value>" to stdout (<key>=word, <value>=1)
example:
$ echo "foo foo quux labs foo bar quux" \| ./mapper.py

## Python Mapper
{code:title=mapp1.py|borderStyle=solid}
#!/usr/bin/env python
# my 1st mapper: writes <word> 1
import sys
data = sys.stdin.readlines()
for ln in data:
    L=ln.split()
    for key in L:
        if len(key)>1:
            print key,1
{code}
*reducer*: read a stream of "<word> 1" from stdin
write "<word> <count>" to stdout
{code:title=redu1.py|borderStyle=solid}
#!/usr/bin/env python
# my 1st reducer: reads <word> <value> pairs, sums the values for the same consecutive key, writes <word> <sum>
import sys
data = sys.stdin.readlines()
myKey=""
myVal=0
for ln in data:
    #print ln,
    L=ln.split()
    #print L
    nw=len(L)/2
    for i in range(nw):
        #print i
        key=L[0+2*i]
        val=int(L[2*i+1])
        #print key,val,nw
        if myKey==key:
            myVal=myVal+val
        else:
            if len(myKey)>0:
                print myKey, myVal,
            myKey=key
            myVal=val

if len(myKey)>0:
    print myKey, myVal,
{code}
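Before submitting to Hadoop, the pair can be tested locally by emulating the shuffle/sort step with the Unix 'sort' command (assuming both scripts are executable and in the current directory):
{code}
echo "foo foo quux labs foo bar quux" | ./mapp1.py | sort | ./redu1.py
{code}
which should print *bar 1 foo 3 labs 1 quux 2* on a single line (the trailing commas in the reducer's print statements suppress the newlines).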
## Execute:
{code}
hadoop jar $SJAR \
-mapper $(pwd)/mapp1.py \
-reducer $(pwd)/redu1.py \
-input inputShak \
-output outputShak3
{code}
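## inspect the word counts as before with *hadoop fs \-cat outputShak3/p\**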

h5. Cloudera - my 1st EC2 cluster deployed

# follow the instructions at [http://archive.cloudera.com/docs/ec2.html]
## Item 2.1. I uploaded 3 tar files: 'client script', 'boto', and 'simplejson'.
##* Un-tarred all 3: tar vzxf ....
##* execute twice, in the 'boto' and 'simplejson' directories:
{code}sudo python setup.py install{code}
##* move hadoop-ec2 to permanent place & add to path (for easier use)
{code}
tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
export PATH=$PATH:$HADOOP_EC2_HOME
{code}
##* export by hand the environment variables:
AWS_ACCESS_KEY_ID - Your AWS Access Key ID
AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
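For example (placeholders only - substitute your own credentials):
{code}
export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
{code}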
##* create a directory called *\~/.hadoop-ec2* w/ file *ec2-clusters.cfg* with content:
{code}
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
{code}
##* fire up a cluster of 1 server + 2 nodes
*hadoop-ec2 launch-cluster my-hadoop-cluster 2*
##* define a bunch of variables in */usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh*

1+2 machines are up and running:
{code}
Waiting for master to start (Reservation:r-d0d424b8)
...............................................
master	i-1cbe3974	ami-6159bf08	ec2-67-202-49-216.compute-1.amazonaws.com	ip-10-244-182-63.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:31:17.000Z	us-east-1c
Waiting for slaves to start
.............................................
slave	i-f2be399a	ami-6159bf08	ec2-67-202-14-97.compute-1.amazonaws.com	ip-10-244-182-127.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
slave	i-f4be399c	ami-6159bf08	ec2-75-101-201-30.compute-1.amazonaws.com	ip-10-242-19-81.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
Waiting for jobtracker to start

Waiting for 2 tasktrackers to start
.................................................1.............2.
Browse the cluster at http://ec2-67-202-49-216.compute-1.amazonaws.com/
{code}
To login to the cluster execute:
*hadoop-ec2 login my-hadoop-cluster*


To terminate the cluster execute:
*hadoop-ec2 terminate-cluster my-hadoop-cluster*
and say 'yes' \!\!

To delete the EC2 security groups:
*hadoop-ec2 delete-cluster my-hadoop-cluster*
To actually browse them I need to:
# execute ./hadoop-ec2 proxy my-hadoop-cluster
Resulting in:
export HADOOP_EC2_PROXY_PID=20873;
echo Proxy pid 20873;
# and add the proxy to Firefox - so far it did not work.


h4. Functional packages I made


# Very simple word count example which works [TAR BALL|^hadoop-simple.tar.gz]
# Compute PageRank [TAR BALL|^wikiMapReducePageRankJan12.tar.gz] for the medium-size wiki-pages set from the Harvard class by *Hanspeter Pfister*; the code still has problems in deployment. It crashes on reduce if the full 11M lines are processed on a 10-node EC2 cluster
# here is the tar-ball of the final version of my code
#* to upload it more often I used the command
*scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .*
*./mySetup.sh \-f l \-v11 \-D*
it contains:
{code}
 training  495 2009-11-11 20:25 abcd-pages
 training  290 2009-11-11 20:25 cleanup.py
 training 1374 2009-11-14 19:47 mappPR.py
 training 2302 2009-11-14 18:30 pageRankCommon.py
 training 2648 2009-11-14 18:31 pageRankCommon.pyc
 training 1034 2009-11-14 19:25 reduPR.py
 training 7251 2009-11-14 11:33 runPageRank.sh
 training 1806 2009-11-14 18:34 wiki2mappPR.py
{code}
#* to upload the data set to HDFS by hand I did
*hadoop fs \-put wL-pages-iter0 wL-pages-iter0*
#* to execute full map/reduce job w/ 3 iterations:
clean up all, write raw file, use 4 map + 2 reduce tasks, init Map, 3 x M/R, final sort
*/runPageRank.sh \-X \-w \-D 4.2 \-m \-I 0.3 \-i \-f*
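For orientation, here is a minimal sketch of one streaming PageRank iteration. This is a generic illustration, *not* the actual mappPR.py / reduPR.py from the tar-ball above: it assumes each input line has the form "<page> <rank> <outlink> <outlink> ...", uses a damping factor of 0.85, ignores dangling-node corrections, and the file name pageRankStep.py is made up.
{code:title=pageRankStep.py|borderStyle=solid}
#!/usr/bin/env python
# Generic sketch of one PageRank iteration for Hadoop Streaming
# (illustration only - not the original mappPR.py / reduPR.py).
# Assumed input line format: <page> <rank> <outlink> <outlink> ...
import sys

DAMP = 0.85   # damping factor

def do_map():
    # emit the link structure plus an equal share of the rank for every outlink
    for ln in sys.stdin:
        L = ln.split()
        if len(L) < 2:
            continue
        page, rank, links = L[0], float(L[1]), L[2:]
        print "%s\tLINKS\t%s" % (page, " ".join(links))
        if links:
            share = rank / len(links)
            for dest in links:
                print "%s\tRANK\t%.6f" % (dest, share)

def do_reduce():
    # keys arrive sorted; sum the incoming shares per page and re-attach its outlinks
    cur, links, total = None, [], 0.0
    def emit(page, rank, lks):
        print "%s %.6f %s" % (page, (1.0 - DAMP) + DAMP * rank, " ".join(lks))
    for ln in sys.stdin:
        key, tag, rest = ln.rstrip("\n").split("\t", 2)
        if cur is not None and key != cur:
            emit(cur, total, links)
            links, total = [], 0.0
        cur = key
        if tag == "LINKS":
            links = rest.split()
        else:
            total += float(rest)
    if cur is not None:
        emit(cur, total, links)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        do_reduce()
    else:
        do_map()
{code}
One iteration would then be launched like the earlier streaming jobs, e.g. *hadoop jar $SJAR \-mapper "pageRankStep.py map" \-reducer "pageRankStep.py reduce" \-input wL-pages-iter0 \-output wL-pages-iter1* (names and arguments are hypothetical).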


h4. More advanced instructions about *Running Hadoop on Amazon EC2*

[http://wiki.apache.org/hadoop/AmazonEC2]