Simple Hadoop example, tested on an EC2 1+3 Cloudera cluster

This example has been copied from here.

Instructions for running a simple Hadoop map/reduce job that counts words, using a user-supplied Python mapper and reducer:
1) -
2) Upload the tar-ball, unpack the mapper and reducer, place both in this directory, and make them executable:
chmod +x reducer.py
chmod +x mapper.py
3) unpack big text file, upload to HDFS, e.g.
unzip 4300.zip
hadoop dfs -copyFromLocal 4300.txt gutenberg
hadoop dfs -ls
4) find the location of the Hadoop streaming jar and store it in a shell variable
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-14.cloudera.CH0_3-streaming.jar

5) run Hadoop with two '-file' options, one for each script
hadoop jar $SJAR -file $(pwd)/mapper.py -mapper $(pwd)/mapper.py -file $(pwd)/reducer.py -reducer $(pwd)/reducer.py -input gutenberg -output out2
If it does not work, omit the two '-file' options and try again.
6) get results back:
hadoop fs -ls
hadoop fs -get out2/part-00000 outX0
7) to control the number of mappers (3) and reducers (5), add the two '-jobconf' options (Hadoop treats the mapper count only as a hint):
hadoop jar $SJAR -file $(pwd)/mapper.py -mapper $(pwd)/mapper.py -file $(pwd)/reducer.py -reducer $(pwd)/reducer.py -jobconf mapred.map.tasks=3 -jobconf mapred.reduce.tasks=5 -input gutenberg -output out4
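
The mapper.py and reducer.py from the tar-ball are not reproduced on this page. As a rough illustration of what a streaming word-count pair usually looks like, here is a minimal sketch that keeps both stages in one file. The function names map_lines and reduce_lines, and the 'map'/'reduce' command-line argument, are hypothetical and chosen only for this example; the real scripts may differ.

```python
#!/usr/bin/env python
"""Word-count sketch for Hadoop streaming (illustrative, not the tar-ball scripts)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper stage: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Reducer stage: sum the counts of each word.

    The input must already be sorted by word, which Hadoop does
    between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# Hypothetical convention: pass 'map' or 'reduce' as the only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in stage(sys.stdin):
        print(record)
```

Assuming the sketch is saved as wordcount.py (a made-up name), the whole job can be checked locally before submitting it to the cluster, with the Unix sort standing in for Hadoop's shuffle: cat 4300.txt | python wordcount.py map | sort | python wordcount.py reduce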
