
Description of MapReduce concept:

http://java.dzone.com/articles/how-hadoop-mapreduce-works

Idea of mapper & reducer

Data flow for a single MapReduce job
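
To make the data flow concrete, here is a minimal pure-Python sketch (my own illustration, no Hadoop involved) of the map -> group-by-key -> reduce steps for word counting; the function names are made up for this note:

  #!/usr/bin/env python
  # toy word count: map emits (word, 1), the framework groups pairs by key,
  # reduce sums the values for each key
  from itertools import groupby
  from operator import itemgetter

  def map_phase(lines):
      for line in lines:
          for word in line.split():
              yield (word, 1)

  def reduce_phase(pairs):          # pairs must arrive sorted by key
      for word, group in groupby(pairs, key=itemgetter(0)):
          yield (word, sum(v for k, v in group))

  text = ["foo foo quux", "labs foo bar quux"]
  shuffled = sorted(map_phase(text))   # stands in for Hadoop's shuffle/sort
  for word, count in reduce_phase(shuffled):
      print word, count                # bar 1, foo 3, labs 1, quux 2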

Hadoop - my first MapReduce (M/R) code
  1. install VMware on the local machine
  2. import Hadoop server from http://www.cloudera.com/hadoop-training-virtual-machine
  3. fire VM up
  4. To use Hadoop Streaming add a system variable:
    export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
    Now you can write M/R jobs in any language
  5. upload Shakespeare text to HDFS (Hadoop Distributed File System)
    cd ~/git/data (distributed with the Cloudera VM)
    tar vzxf shakespeare.tar.gz
    check nothing is in HDFS:  hadoop fs -ls /user/training
  6. add the text files to HDFS:   hadoop fs -put input /user/training/inputShak
    and check again:   hadoop fs -ls /user/training
  7. Execute M/R job under hadoop using 'cat' & 'wc'
    hadoop jar $SJAR \
    -mapper cat \
    -reducer wc \
    -input inputShak \
    -output outputShak
    
    1. inspect the output:
      hadoop fs -cat outputShak/p*
      175376   948516   5398106    (lines, words, bytes from wc)
  8. Task: write own Map & Reduce counting frequency of words:

mapper : read text data from stdin
write "<key> <value>" to stdout (<key>=word, <value>=1)
example:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py

    1. Python Mapper
      mapp1.py
      #!/usr/bin/env python
      # my 1st mapper: writes <word> 1
      import sys
      data = sys.stdin.readlines()
      for ln in data:
          L=ln.split()
          for key in L:
              if len(key)>1:
                  print key,1
      
      
      reducer : read a stream of "<word> 1" from stdin
      write "<word> <count>" to stdout
      redu1.py
      #!/usr/bin/env python
      # my 1st reducer: reads <word> <value>, sums values for consecutive identical keys, writes <word> <sum>
      import sys
      data = sys.stdin.readlines()
      myKey=""
      myVal=0
      for ln in data:
          #print ln,
          L=ln.split()
          #print L
          nw=len(L)/2
          for i in range(nw):
              #print i
              key=L[0+2*i]
              val=int(L[2*i+1])
              #print key,val,nw
              if myKey==key:
                  myVal=myVal+val
              else:
                  if len(myKey)>0:
                      print myKey, myVal,
                  myKey=key
                  myVal=val
      
      if len(myKey)>0:
          print myKey, myVal,
      
    2. Execute on Hadoop (a quick local pipe test of the two scripts is sketched right after this list):
      hadoop jar $SJAR \
      -mapper $(pwd)/mapp1.py \
      -reducer $(pwd)/redu1.py \
      -input inputShak \
      -output outputShak3
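
A quick local sanity check of the two scripts from step 8, outside Hadoop; the Unix sort stands in for Hadoop's shuffle/sort between map and reduce:

  chmod +x mapp1.py redu1.py
  echo "foo foo quux labs foo bar quux" | ./mapp1.py | sort | ./redu1.py
  # expected (on one line, because redu1.py prints with a trailing comma):
  # bar 1 foo 3 labs 1 quux 2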
      
Cloudera - my 1st EC2 cluster deployed
  1. follow instruction http://archive.cloudera.com/docs/ec2.html
    1. Item 2.1: I uploaded 3 tar files: the 'client scripts', 'boto', and 'simplejson'.
      • Un-tarred all 3: tar vzxf ....
      • execute twice, in the 'boto' and 'simplejson' directories:
        sudo python setup.py install 
      • move hadoop-ec2 to a permanent place & add it to the path (for easier use)
        tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
         sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
         export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
         export PATH=$PATH:$HADOOP_EC2_HOME
        
      • export by hand the environment variables (see the sketch after this list):
        AWS_ACCESS_KEY_ID - Your AWS Access Key ID
        AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
      • create a directory called ~/.hadoop-ec2 w/ file ec2-clusters.cfg with content:
        [my-hadoop-cluster]
        ami=ami-6159bf08
        instance_type=m1.small
        key_name=janAmazonKey2
        availability_zone=us-east-1a
        private_key=/home/training/.ec2/id_rsa-janAmazonKey2
        ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
        
      • fire up a cluster of 1 master + 2 nodes
        hadoop-ec2 launch-cluster my-hadoop-cluster 2
      • define a bunch of variables in  /usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh
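
The credential exports mentioned above look like this (placeholder values, not real keys):

  export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
  export AWS_SECRET_ACCESS_KEY=xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX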

1+2 machines are up and running:

Waiting for master to start (Reservation:r-d0d424b8)
...............................................
master	i-1cbe3974	ami-6159bf08	ec2-67-202-49-216.compute-1.amazonaws.com	ip-10-244-182-63.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:31:17.000Z	us-east-1c
Waiting for slaves to start
.............................................
slave	i-f2be399a	ami-6159bf08	ec2-67-202-14-97.compute-1.amazonaws.com	ip-10-244-182-127.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
slave	i-f4be399c	ami-6159bf08	ec2-75-101-201-30.compute-1.amazonaws.com	ip-10-242-19-81.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
Waiting for jobtracker to start

Waiting for 2 tasktrackers to start
.................................................1.............2.
Browse the cluster at http://ec2-67-202-49-216.compute-1.amazonaws.com/

To log in to the cluster execute:
hadoop-ec2 login my-hadoop-cluster

To terminate cluster execute:
hadoop-ec2 terminate-cluster my-hadoop-cluster
and say 'yes' !!

To delete the EC2 security groups:
hadoop-ec2 delete-cluster my-hadoop-cluster
To actually browse the cluster web pages I need to:

  1. execute ./hadoop-ec2 proxy my-hadoop-cluster
    Resulting in:
    export HADOOP_EC2_PROXY_PID=20873;
    echo Proxy pid 20873;
  2. and add the proxy to Firefox - so far this has not worked.

Functional packages I made

  1. Very simple word-count example which works (TAR BALL)
  2. Compute PageRank (TAR BALL) for the medium-size set of wiki pages from the Harvard class by Hanspeter Pfister; the code still has problems in deployment. It crashes in the reduce phase if the full 11M lines are processed on a 10-node EC2 cluster (the per-iteration map/reduce idea is sketched after this list).
  3. here is a tar-ball of the final version of my code
    • to upload it I most often used the command
      scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .
      ./mySetup.sh -f l -v11 -D
      it contains:
       training  495 2009-11-11 20:25 abcd-pages
       training  290 2009-11-11 20:25 cleanup.py
       training 1374 2009-11-14 19:47 mappPR.py
       training 2302 2009-11-14 18:30 pageRankCommon.py
       training 2648 2009-11-14 18:31 pageRankCommon.pyc
       training 1034 2009-11-14 19:25 reduPR.py
       training 7251 2009-11-14 11:33 runPageRank.sh
       training 1806 2009-11-14 18:34 wiki2mappPR.py
      
    • to upload the data set to HDFS by hand I did
      hadoop fs -put wL-pages-iter0 wL-pages-iter0
    • to execute the full map/reduce job w/ 3 iterations:
      cleanup all, write raw file, use 4 mappers + 2 reducers, init map, 3 x M/R, final sort
      ./runPageRank.sh -X -w -D 4.2 -m -I 0.3 -i -f
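
For reference, the idea inside one PageRank iteration in streaming form. This is only a stripped-down sketch of the standard algorithm, not the actual mappPR.py / reduPR.py from the tar-ball; the input format and damping factor d=0.85 are assumptions made for this sketch:

  #!/usr/bin/env python
  # mapper sketch: input line "<page> <rank> <out1> <out2> ..."
  # re-emits the link list (marked with '#') and a rank share for every outlink
  import sys
  for ln in sys.stdin:
      L = ln.split()
      if len(L) < 2: continue
      page, rank, outlinks = L[0], float(L[1]), L[2:]
      print page, '#' + ','.join(outlinks)
      for out in outlinks:
          print out, rank / len(outlinks)

  #!/usr/bin/env python
  # reducer sketch: sums the rank shares per page, re-attaches the link list,
  # and applies damping: rank = (1-d) + d * sum
  import sys
  d = 0.85
  myKey, mySum, myLinks = "", 0.0, ""
  def emit(key, s, links):
      if key: print key, (1.0 - d) + d * s, links
  for ln in sys.stdin:
      L = ln.split()
      if len(L) < 2: continue
      key, val = L[0], L[1]
      if key != myKey:
          emit(myKey, mySum, myLinks)
          myKey, mySum, myLinks = key, 0.0, ""
      if val.startswith('#'):
          myLinks = val[1:].replace(',', ' ')
      else:
          mySum += float(val)
  emit(myKey, mySum, myLinks)

The reducer output has the same format as the mapper input, which is what allows chaining the job for the 3 iterations mentioned above.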

More advanced instructions about Running Hadoop on Amazon EC2

http://wiki.apache.org/hadoop/AmazonEC2
