Hadoop - my first MapReduce (M/R) code
- install VMware on the local machine
- import Hadoop server from http://www.cloudera.com/hadoop-training-virtual-machine

- fire VM up
- To use Hadoop Streaming, set an environment variable pointing at the streaming jar:
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
Now you can write M/R code in any language that reads stdin and writes stdout
- upload Shakespeare text to HDFS (Hadoop Distributed File System)
cd ~/git/data
tar vzxf shakespeare.tar.gz
check that nothing is in HDFS yet
hadoop fs -ls /user/training
add the unpacked text to HDFS and check again
hadoop fs -put input /user/training/inputShak
(arguments of -put: local source, then HDFS target)
hadoop fs -ls /user/training
- Execute M/R job using 'cat' & 'wc'
hadoop jar $SJAR \
-mapper cat \
-reducer wc \
-input inputShak \
-output outputShak
- inspect the output:
hadoop fs -cat outputShak/p*
175376 948516 5398106
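The three numbers are what `wc` prints with no options: line, word, and byte counts of everything the reducer received. A minimal Python sketch of the same computation follows, assuming a single reducer sees all of the input; the function name `wc_counts` is made up for illustration and is not part of Hadoop.

```python
def wc_counts(data):
    """Return (lines, words, bytes) for a bytes object, like `wc` with no options."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # whitespace-separated tokens
    return lines, words, len(data)


if __name__ == "__main__":
    sample = b"to be or not to be\nthat is the question\n"
    print(wc_counts(sample))
```

Used as a streaming reducer, `wc` does not know about keys at all; it simply counts whatever arrives on its stdin.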
- Task: write your own Map & Reduce scripts that count word frequencies:
mapper : reads text data from stdin,
writes "<key> <value>" to stdout (<key>=word, <value>=1)
example:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
- Python Mapper
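#!/usr/bin/env python
# my 1st mapper: writes "<word> 1" for every word read from stdin
import sys

for ln in sys.stdin:              # stream line by line instead of loading everything
    for key in ln.split():        # split on any whitespace
        if len(key) > 1:          # skip single-character tokens
            print("%s 1" % key)   # this form works under Python 2 and 3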
reducer : read a stream of "<word> 1" from stdin
write "<word> <count>" to stdout
#!/usr/bin/env python
# my 1st reducer: reads "<word> <value>" pairs, sums the values of
# consecutive identical keys, writes "<word> <sum>" (one pair per line)
import sys

myKey = ""
myVal = 0
for ln in sys.stdin:
    L = ln.split()
    nw = len(L) // 2                  # number of <word> <value> pairs on this line
    for i in range(nw):
        key = L[2*i]
        val = int(L[2*i+1])
        if myKey == key:
            myVal = myVal + val       # same key as before: keep summing
        else:
            if len(myKey) > 0:
                print("%s %d" % (myKey, myVal))   # key changed: emit the finished sum
            myKey = key
            myVal = val
# emit the last key after the input ends
if len(myKey) > 0:
    print("%s %d" % (myKey, myVal))
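The reducer only compares each key with the previous one, which works because Hadoop sorts the mapper output by key before handing it to the reducer. The flow can be sketched locally in Python as below; `sorted()` stands in for Hadoop's shuffle/sort, and the function names `map_words`, `reduce_counts`, and `run_local` are made up for this illustration.

```python
def map_words(lines):
    """Mapper step: emit (word, 1) for every word longer than one character."""
    for ln in lines:
        for word in ln.split():
            if len(word) > 1:
                yield word, 1


def reduce_counts(pairs):
    """Reducer step: sum the values of consecutive identical keys.

    The input must already be sorted by key.
    """
    current, total = None, 0
    for key, val in pairs:
        if key == current:
            total += val
        else:
            if current is not None:
                yield current, total
            current, total = key, val
    if current is not None:
        yield current, total


def run_local(lines):
    """Map, sort by key (stand-in for Hadoop's shuffle/sort), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, count in run_local(["foo foo quux labs foo bar quux"]):
        print("%s %d" % (word, count))
```

For the example sentence this prints bar 1, foo 3, labs 1, quux 2, one pair per line. Without the sort step the three "foo" entries would not be adjacent and the reducer would emit "foo" more than once.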
- Execute (both scripts must be executable, e.g. chmod +x map1.py redu1.py):
hadoop jar $SJAR \
-mapper $(pwd)/map1.py \
-reducer $(pwd)/redu1.py \
-input inputShak \
-output outputShak3
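Once the job has finished, the word counts can be read back with hadoop fs -cat outputShak3/p* in the same way as before. As a possible next step, the sketch below ranks words by frequency. It assumes the output consists of whitespace-separated "<word> <count>" pairs, one or several per line; the function name `top_words` and the sample data are hypothetical.

```python
def top_words(lines, n=10):
    """Parse '<word> <count>' pairs and return the n most frequent as (word, count)."""
    counts = []
    for ln in lines:
        tokens = ln.split()
        # accept one or several "<word> <count>" pairs per line
        for i in range(0, len(tokens) - 1, 2):
            counts.append((tokens[i], int(tokens[i + 1])))
    # highest count first; ties are broken alphabetically
    counts.sort(key=lambda wc: (-wc[1], wc[0]))
    return counts[:n]


if __name__ == "__main__":
    sample = ["bar 1", "foo 3", "labs 1", "quux 2"]
    for word, count in top_words(sample, n=2):
        print("%s %d" % (word, count))
```

To run it on the real job output, replace `sample` with `sys.stdin` (after `import sys`) and pipe the hadoop fs -cat command into the script.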