...
- install VMware on the local machine
- import the Hadoop server VM from http://www.cloudera.com/hadoop-training-virtual-machine
- fire up the VM
- To use Hadoop Streaming, add an environment variable pointing at the streaming jar:
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
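A quick sanity check that the jar is where the variable points (the exact version string may differ between VM releases):
$ ls -l $SJAR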
Now you can write M/R in any language.
- upload the Shakespeare text to HDFS (Hadoop Distributed File System)
cd ~/git/data
tar vzxf shakespeare.tar.gz
check that nothing is in HDFS yet
hadoop fs -ls /user/training
add the unpacked text file to HDFS and check again
hadoop fs -put input /user/training/inputShak        (source) (target)
hadoop fs -ls /user/training
- Execute an M/R job using 'cat' as mapper & 'wc' as reducer
hadoop jar $SJAR \
-mapper cat \
-reducer wc \
-input inputShak \
-output outputShak
- inspect the output:
hadoop fs -cat outputShak/p*
175376 948516 5398106
(these are wc's counts: lines, words, and bytes of the whole corpus)
- Task: write your own Map & Reduce counting the frequency of words:
  mapper :  read text data from stdin
            write "<key> <value>" to stdout (<key>=word, <value>=1)
            example:
            $ echo "foo foo quux labs foo bar quux" | ./mapper.py
  reducer : read a stream of "<word> 1" pairs from stdin
            write "<word> <count>" to stdout
- Python Mapper (mapp1.py):
    #!/usr/bin/env python
    # my 1st mapper: writes "<word> 1" for every word longer than 1 character
    import sys
    data = sys.stdin.readlines()
    for ln in data:
        L = ln.split()
        for key in L:
            if len(key) > 1:
                print key, 1
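Piping the example line through the mapper (assuming the script is saved as mapp1.py and made executable) should emit one "<word> 1" pair per line:
$ echo "foo foo quux labs foo bar quux" | ./mapp1.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1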
- Python Reducer (redu1.py):
    #!/usr/bin/env python
    # my 1st reducer: reads "<word> <value>" pairs, sums values over runs of
    # the same consecutive key (input must be sorted by key, which Hadoop's
    # shuffle phase guarantees), writes "<word> <sum>"
    import sys
    data = sys.stdin.readlines()
    myKey = ""
    myVal = 0
    for ln in data:
        L = ln.split()
        nw = len(L) / 2              # several "<word> <value>" pairs may share a line
        for i in range(nw):
            key = L[2 * i]
            val = int(L[2 * i + 1])
            if myKey == key:
                myVal = myVal + val
            else:
                if len(myKey) > 0:
                    print myKey, myVal,
                myKey = key
                myVal = val
    if len(myKey) > 0:
        print myKey, myVal,          # flush the last key
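The whole chain can be simulated locally before going to the cluster; the sort stands in for Hadoop's shuffle/sort phase (script names as assumed above):
$ echo "foo foo quux labs foo bar quux" | ./mapp1.py | sort | ./redu1.py
bar 1 foo 3 labs 1 quux 2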
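- run the same word count on the cluster (a sketch, not from the original notes: the output directory name outputShakWC is arbitrary, and -file ships each script to the worker nodes):
hadoop jar $SJAR \
-mapper mapp1.py \
-reducer redu1.py \
-file mapp1.py \
-file redu1.py \
-input inputShak \
-output outputShakWC
hadoop fs -cat outputShakWC/p*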