Hadoop - my first MapReduce (M/R) code
- install VMware on the local machine
- import Hadoop server from http://www.cloudera.com/hadoop-training-virtual-machine
- fire up the VM
- To use Hadoop Streaming, add an environment variable pointing at the streaming jar:
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
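A quick sanity check that the jar is really where the variable points (the exact version string may differ on other VM images):
ls -l $SJAR    # path/version may differ on other VM images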
Now you can write M/R jobs in any language.
- upload Shakespeare text to HDFS (Hadoop Distributed File System)
cd ~/git/data
tar vzxf shakespeare.tar.gz
check nothing is in HDFS yet:
hadoop fs -ls /user/training
add the unpacked text files to HDFS and check again:
hadoop fs -put input /user/training/inputShak    (source) (target)
hadoop fs -ls /user/training
- Execute M/R job using 'cat' & 'wc'
hadoop jar $SJAR \
  -mapper cat \
  -reducer wc \
  -input inputShak \
  -output outputShak
- inspect the output:
hadoop fs -cat outputShak/p*
175376 948516 5398106
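The three numbers are wc's usual line, word, and character counts for the whole corpus. A rough local cross-check, assuming the tarball unpacked into ./input as used by the -put above (totals should be close, though the character count can differ slightly because streaming re-inserts key/value tab separators):
cat input/* | wc    # local line/word/char counts over the same texts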
- Task: write my own Map & Reduce counting the frequency of words:
mapper : read text data from stdin
         write "<key> <value>" to stdout (<key>=word, <value>=1)
example:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
reducer : read a stream of "<word> 1" from stdin
          write "<word> <count>" to stdout
- Python Mapper (mapp1.py):
#!/usr/bin/env python
# my 1st mapper: writes "<word> 1" for every word read from stdin
import sys
data = sys.stdin.readlines()
for ln in data:
    L = ln.split()
    for key in L:
        if len(key) > 1:   # skip 1-character tokens
            print key, 1
- Python Reducer (redu1.py):
#!/usr/bin/env python
# my 1st reducer: reads "<word> <value>" pairs, sums values of the same
# consecutive key, writes "<word> <sum>"
import sys
data = sys.stdin.readlines()
myKey = ""
myVal = 0
for ln in data:
    L = ln.split()
    nw = len(L) / 2              # number of <word> <value> pairs on this line
    for i in range(nw):
        key = L[2 * i]
        val = int(L[2 * i + 1])
        if myKey == key:
            myVal = myVal + val
        else:
            if len(myKey) > 0:   # a new key starts - emit the finished one
                print myKey, myVal
            myKey = key
            myVal = val
if len(myKey) > 0:               # flush the last key
    print myKey, myVal
- Execute:
hadoop jar $SJAR \
  -mapper $(pwd)/mapp1.py \
  -reducer $(pwd)/redu1.py \
  -input inputShak \
  -output outputShak3
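To eyeball the result, the part files can be sorted by the count column; a quick look (the -k2 field number assumes the "<word> <count>" line format above):
hadoop fs -cat outputShak3/part-* | sort -k2,2 -rn | head    # 10 most frequent words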
Cloudera - my 1st EC2 cluster deployed
- follow the instructions at http://archive.cloudera.com/docs/ec2.html
- Item 2.1: I uploaded 3 tar files: the 'client script', 'boto', and 'simplejson'.
- Un-tarred all 3: tar vzxf ....
- execute twice, once in the 'boto' and once in the 'simplejson' directory:
sudo python setup.py install
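A one-line smoke test (my own check, not from the Cloudera doc) that both packages are importable afterwards:
python -c "import boto, simplejson"    # exits silently when both are installed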
- move hadoop-ec2 to a permanent place & add it to the path (for easier use)
tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
export PATH=$PATH:$HADOOP_EC2_HOME
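If the PATH change took effect, the shell should now resolve the client script from anywhere:
which hadoop-ec2    # should print the script's path under /opt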
- exported the environment variables by hand:
AWS_ACCESS_KEY_ID - Your AWS Access Key ID
AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
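For example (placeholder values - substitute your own credentials):
export AWS_ACCESS_KEY_ID=<your-access-key-id>          # placeholder
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>  # placeholder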
- create a directory called ~/.hadoop-ec2 w/ a file ec2-clusters.cfg with content:
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1c
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
- fire up a cluster of 1 master + 2 worker nodes:
hadoop-ec2 launch-cluster my-hadoop-cluster 2
- define a bunch of variables in /usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh
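Once the cluster reports ready, the client script's login subcommand (described in the same Cloudera doc) should open a shell on the master:
hadoop-ec2 login my-hadoop-cluster    # ssh session on the master node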
...
- execute ./hadoop-ec2 proxy my-hadoop-cluster
Resulting in:
export HADOOP_EC2_PROXY_PID=20873;
echo Proxy pid 20873;
- and add the proxy to Firefox - so far it has not worked.
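Assuming the two printed lines above are eval'ed into the current shell, the exported PID at least makes the tunnel easy to verify and tear down (plain shell, not from the Cloudera doc):
ps -p $HADOOP_EC2_PROXY_PID    # is the proxy still running?
kill $HADOOP_EC2_PROXY_PID     # shut it down when finished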