This example has been copied from here.


Instructions for running a simple Hadoop map/reduce job that counts words, using a user-supplied Python mapper and reducer
1) -
2) Upload tar-ball and unpack mapper & reducer, place both in this directory, make them executable.
  chmod +x reducer.py
  chmod +x mapper.py
3) Unpack the big text file and upload it to HDFS, e.g.
  unzip 4300.zip
  hadoop dfs -copyFromLocal 4300.txt gutenberg
  hadoop dfs -ls
4) Find the location of the Java streaming jar and store it in a shell variable as a shortcut
 export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-14.cloudera.CH0_3-streaming.jar
 
5) Run Hadoop with two '-file' options (one for the mapper, one for the reducer)
 hadoop jar $SJAR -file $(pwd)/mapper.py -mapper $(pwd)/mapper.py -file $(pwd)/reducer.py -reducer  $(pwd)/reducer.py -input gutenberg -output out2
If it does not work, drop the '-file' options and try again.
6) get results back:
 hadoop fs -ls
 hadoop fs -get out2/part-00000 outX0
7) To control the number of mappers (3) and reducers (5), add the '-jobconf' options below (the map count is only a hint to Hadoop):
 hadoop jar $SJAR -file $(pwd)/mapper.py -mapper $(pwd)/mapper.py -file $(pwd)/reducer.py -reducer  $(pwd)/reducer.py  -jobconf mapred.map.tasks=3 -jobconf mapred.reduce.tasks=5 -input gutenberg -output out4
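
The mapper.py and reducer.py from the tar-ball are not reproduced on this page. As a rough sketch of what such a pair typically does, the Python 3 code below assumes whitespace tokenization and tab-separated "word<TAB>count" records (tab is the default key/value separator in Hadoop streaming). The names map_lines, reduce_lines and run_locally, and the 'map'/'reduce' command-line argument, are hypothetical and chosen only for this illustration. The real scripts may differ.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer step: sum the counts of consecutive records that share a key.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine, without a cluster."""
    return list(reduce_lines(sorted(map_lines(lines))))


# Hypothetical dispatch for this sketch: the first argument selects the role.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Calling run_locally on a few lines of text is a cheap way to check the logic before submitting a job to the cluster. In a real streaming job the two roles would normally live in separate scripts, as with mapper.py and reducer.py above.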
