{panel}
Description of the MapReduce concept:
[http://java.dzone.com/articles/how-hadoop-mapreduce-works]
{panel}
Idea of mapper & reducer:
!P1.png!
Data flow for a single MapReduce job:
!P4.png!

h5. Hadoop - my first MapReduce (M/R) code
# install VMware on the local machine
# import the Hadoop server image from [http://www.cloudera.com/hadoop-training-virtual-machine]
# fire the VM up
# To use Hadoop Streaming, add a system variable: \\ *export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar* \\ Now you can write M/R code in any language.
# upload the Shakespeare text to HDFS (Hadoop Distributed File System)
#* *cd \~/git/data* _(distributed w/ the Cloudera VM)_ \\ *tar vzxf shakespeare.tar.gz*
#* _check nothing is in HDFS:_ *hadoop fs \-ls /user/training*
#* _add the text files to HDFS:_ *hadoop fs \-put input /user/training/inputShak*
#* _and check again:_ *hadoop fs \-ls /user/training*
# Execute an M/R job under Hadoop using 'cat' as the mapper & 'wc' as the reducer:
{code}
hadoop jar $SJAR \
  -mapper cat \
  -reducer wc \
  -input inputShak \
  -output outputShak
{code}
## inspect the output with *hadoop fs \-cat outputShak/p\**
The output is the 'wc' summary of the whole corpus (lines, words, characters): 175376 948516 5398106

h6. Task: write my own Map & Reduce counting the frequency of words
*mapper*: reads text data from stdin, writes "<key> <value>" to stdout (<key>=word, <value>=1)
example:
$ echo "foo foo quux labs foo bar quux" \| ./mapper.py
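Given that format, the example pipeline should print one "<word> 1" pair per input word, i.e.:
{code}
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
{code}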
## Python Mapper
{code:title=mapp1.py|borderStyle=solid}
#!/usr/bin/env python
# my 1st mapper: writes "<word> 1" for every word read from stdin
import sys

data = sys.stdin.readlines()
for ln in data:
    L = ln.split()
    for key in L:
        if len(key) > 1:        # skip 1-character tokens
            print key, 1
{code}
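A side note, not part of the original exercise: the same mapper can read stdin one line at a time instead of buffering everything with readlines(), which scales better once the split handed to one mapper gets large. The file name mapp2.py is made up; it keeps the same Python 2 style as mapp1.py.
{code:title=mapp2.py (hypothetical variant)|borderStyle=solid}
#!/usr/bin/env python
# line-streaming variant of mapp1.py: avoids holding the whole input in memory
import sys

for ln in sys.stdin:
    for key in ln.split():
        if len(key) > 1:        # same filter as mapp1.py: skip 1-character tokens
            print key, 1        # emit "<word> 1" for the reducer
{code}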
*reducer*: reads a stream of "<word> 1" pairs from stdin and writes "<word> <count>" to stdout. Hadoop Streaming sorts the mapper output by key before the reducer sees it, so it is enough to sum the values of consecutive identical keys.
{code:title=redu1.py|borderStyle=solid}
#!/usr/bin/env python
# my 1st reducer: reads "<word> <value>" pairs, sums the values of consecutive
# identical keys and writes "<word> <sum>"
import sys

data = sys.stdin.readlines()
myKey = ""
myVal = 0
for ln in data:
    L = ln.split()
    nw = len(L) / 2
    for i in range(nw):
        key = L[2*i]
        val = int(L[2*i + 1])
        if myKey == key:
            myVal = myVal + val
        else:
            if len(myKey) > 0:
                print myKey, myVal,   # trailing comma: keep the output on one line
            myKey = key
            myVal = val
if len(myKey) > 0:
    print myKey, myVal,               # flush the last key
{code}
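Also not part of the original exercise: for a quick local check without Hadoop's sort phase, a reducer can instead accumulate counts in a dictionary. The file name redu2.py is made up; it trades the streaming behaviour of redu1.py for simplicity (all distinct words are kept in memory).
{code:title=redu2.py (hypothetical variant)|borderStyle=solid}
#!/usr/bin/env python
# dictionary-based word-count reducer: does not rely on the input being sorted by key
import sys

counts = {}
for ln in sys.stdin:
    L = ln.split()
    for i in range(len(L) / 2):
        key = L[2*i]
        val = int(L[2*i + 1])
        counts[key] = counts.get(key, 0) + val
for key in sorted(counts):
    print key, counts[key]
{code}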
## Execute:
{code}
hadoop jar $SJAR \
  -mapper $(pwd)/mapp1.py \
  -reducer $(pwd)/redu1.py \
  -input inputShak \
  -output outputShak3
{code}
h5. Cloudera - my 1st EC2 cluster deployed
# follow the instructions at [http://archive.cloudera.com/docs/ec2.html]
## Item 2.1: I uploaded 3 tar files: the 'client script', 'boto', and 'simplejson'.
##* Un-tarred all 3: *tar vzxf ....*
##* executed, once in the 'boto' and once in the 'simplejson' directory:
{code}
sudo python setup.py install
{code}
##* move hadoop-ec2 to a permanent place & add it to the path (for easier use):
{code}
tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
export PATH=$PATH:$HADOOP_EC2_HOME
{code}
##* exported the AWS credentials by hand as environment variables: \\ AWS_ACCESS_KEY_ID - your AWS Access Key ID \\ AWS_SECRET_ACCESS_KEY - your AWS Secret Access Key
##* create a directory called *\~/.hadoop-ec2* w/ a file *ec2-clusters.cfg* with this content:
{code}
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
{code}
##* fire up a cluster of 1 server + 2 nodes:
*hadoop-ec2 launch-cluster my-hadoop-cluster 2*
##* define a bunch of variables in */usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh*
##* 1+2 machines are up and running:
{code}
Waiting for master to start (Reservation:r-d0d424b8)
...............................................
master i-1cbe3974 ami-6159bf08 ec2-67-202-49-216.compute-1.amazonaws.com ip-10-244-182-63.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:31:17.000Z us-east-1c
Waiting for slaves to start
.............................................
slave i-f2be399a ami-6159bf08 ec2-67-202-14-97.compute-1.amazonaws.com ip-10-244-182-127.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:32:19.000Z us-east-1c
slave i-f4be399c ami-6159bf08 ec2-75-101-201-30.compute-1.amazonaws.com ip-10-242-19-81.ec2.internal running janAmazonKey c1.medium 2009-10-26T02:32:19.000Z us-east-1c
Waiting for jobtracker to start
Waiting for 2 tasktrackers to start
.................................................1.............2.
Browse the cluster at http://ec2-67-202-49-216.compute-1.amazonaws.com/
{code}
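A quick way to double-check what is running, independent of the hadoop-ec2 scripts, is to query EC2 directly with the boto package installed above. This is only a sketch; the file name listInstances.py is made up, and it relies on the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables exported earlier.
{code:title=listInstances.py (hypothetical helper)|borderStyle=solid}
#!/usr/bin/env python
# sanity check: list my EC2 instances via boto (credentials taken from the environment)
import boto

conn = boto.connect_ec2()
for reservation in conn.get_all_instances():
    for inst in reservation.instances:
        print inst.id, inst.state, inst.instance_type, inst.public_dns_name
{code}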
To log in to the cluster execute: *hadoop-ec2 login my-hadoop-cluster*
To terminate the cluster execute:
*hadoop-ec2 terminate-cluster my-hadoop-cluster*
and say 'yes' \!\!
To delete the EC2 security groups:
*hadoop-ec2 delete-cluster my-hadoop-cluster*
To actually browse the cluster web pages I need to:
# execute *./hadoop-ec2 proxy my-hadoop-cluster*
Resulting in:
export HADOOP_EC2_PROXY_PID=20873;
echo Proxy pid 20873;
# and add the proxy to Firefox - so far it has not worked.
h4. Functional packages I made
# A very simple word count example which works: [TAR BALL|^hadoop-simple.tar.gz]
# Compute PageRank [TAR BALL|^wikiMapReducePageRankJan12.tar.gz] for the medium-size set of wiki-pages from the Harvard class by *Hanspeter Pfister*; the code still has problems in deployment. It crashes on reduce if the full 11M lines are processed on a 10-node EC2 cluster.
# here is the tar-ball of the final version of my code
#* to upload it I most often used the commands: \\ *scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .* \\ *./mySetup.sh \-f l \-v11 \-D*
#* it contains (a generic sketch of a PageRank mapper/reducer iteration follows after this list):
{code}
training  495 2009-11-11 20:25 abcd-pages
training  290 2009-11-11 20:25 cleanup.py
training 1374 2009-11-14 19:47 mappPR.py
training 2302 2009-11-14 18:30 pageRankCommon.py
training 2648 2009-11-14 18:31 pageRankCommon.pyc
training 1034 2009-11-14 19:25 reduPR.py
training 7251 2009-11-14 11:33 runPageRank.sh
training 1806 2009-11-14 18:34 wiki2mappPR.py
{code}
#* to upload the data set to Hadoop HDFS by hand I did *hadoop fs \-put wL-pages-iter0 wL-pages-iter0*
#* to execute the full map/reduce job w/ 3 iterations (clean up all, write the raw file, use 4 mappers + 2 reducers, init Map, 3 x M/R, final sort): \\ *./runPageRank.sh \-X \-w \-D 4.2 \-m \-I 0.3 \-i \-f*
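The actual mappPR.py / reduPR.py are inside the tar-ball above. For orientation only, a single PageRank iteration as a Hadoop Streaming mapper/reducer pair generically looks like the sketch below; the input line format "<page> <rank> <outlink> <outlink> ...", the LINKS marker and the damping factor d=0.85 are assumptions of this sketch, not taken from the tar-ball.
{code:title=generic PageRank mapper (illustration only)|borderStyle=solid}
#!/usr/bin/env python
# one PageRank iteration, map side:
# assumed input line: "<page> <rank> <outlink1> <outlink2> ..."
import sys

for ln in sys.stdin:
    L = ln.split()
    if len(L) < 2:
        continue
    page = L[0]
    rank = float(L[1])
    links = L[2:]
    # pass the link structure through so the reducer can re-emit it
    print page, 'LINKS', ' '.join(links)
    # give each outlink an equal share of this page's current rank
    if len(links) > 0:
        share = rank / len(links)
        for target in links:
            print target, share
{code}
{code:title=generic PageRank reducer (illustration only)|borderStyle=solid}
#!/usr/bin/env python
# one PageRank iteration, reduce side: input arrives sorted by page;
# sum the incoming rank shares and apply damping in the simplified
# per-page form  newRank = (1-d) + d * sum
import sys

D = 0.85
curPage = None
total = 0.0
links = ''

def flush(page, s, lk):
    if page is not None:
        print page, (1.0 - D) + D * s, lk

for ln in sys.stdin:
    L = ln.split()
    page = L[0]
    if page != curPage:
        flush(curPage, total, links)      # emit the previous page
        curPage, total, links = page, 0.0, ''
    if len(L) > 1 and L[1] == 'LINKS':
        links = ' '.join(L[2:])           # recover the link structure
    else:
        total = total + float(L[1])       # accumulate an incoming share
flush(curPage, total, links)              # emit the last page
{code}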
h4. More advanced instructions about *Running Hadoop on Amazon EC2*: [http://wiki.apache.org/hadoop/AmazonEC2]