
Hadoop - my first MapReduce (M/R) code
  1. install VMware on the local machine
  2. import Hadoop server from http://www.cloudera.com/hadoop-training-virtual-machine
  3. fire VM up
  4. To use Hadoop Streaming, set an environment variable pointing to the streaming jar:
    export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
    Now you can write M/R in any language that reads stdin and writes stdout
  5. upload Shakespeare text to HDFS (Hadoop Distributed File System)
    cd ~/git/data
    tar vzxf shakespeare.tar.gz
    check that nothing is in HDFS yet
    hadoop fs -ls /user/training
    add the unpacked text files to HDFS: hadoop fs -put (source) (target)
    hadoop fs -put input /user/training/inputShak
  6. check again that the files are now in HDFS
    hadoop fs -ls /user/training
  7. Execute M/R job using 'cat' & 'wc'
    hadoop jar $SJAR \
    -mapper cat \
    -reducer wc \
    -input inputShak \
    -output outputShak
    1. inspect the output; wc reports line, word, and byte counts
      hadoop fs -cat outputShak/p*
      175376 948516 5398106
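What the 'cat' & 'wc' job computes can be sketched locally in Python. This is only an illustration, not how Hadoop runs the job: the function names `cat_mapper` and `wc_reducer` and the two sample lines are made up here, a single reducer is assumed, and any separator characters the framework may add between key and value are ignored.

```python
def cat_mapper(lines):
    """Identity mapper: pass every input line through unchanged, like `cat`."""
    for line in lines:
        yield line


def wc_reducer(lines):
    """Count lines, whitespace-separated words, and bytes, like `wc`."""
    n_lines = n_words = n_bytes = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_bytes += len(line.encode("utf-8"))
    return n_lines, n_words, n_bytes


# hypothetical sample input standing in for the Shakespeare files
sample = ["to be or not to be\n", "that is the question\n"]

# sorted() stands in for the sort Hadoop performs between map and reduce
counts = wc_reducer(sorted(cat_mapper(sample)))
print("%d %d %d" % counts)  # prints: 2 10 40
```

The three numbers have the same meaning as the three numbers in the job output above: lines, words, bytes.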
  8. Task: write your own Map & Reduce to count word frequencies:
    mapper: read text data from stdin
    write "<key> <value>" to stdout (<key>=word, <value>=1)
    example:
    $ echo "foo foo quux labs foo bar quux" | ./mapp1.py
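    1. Python Mapper
      mapp1.py
      #!/usr/bin/env python
      # my 1st mapper: writes "<word> 1" for every word longer than one character
      import sys

      # read stdin line by line so large inputs are streamed, not held in memory
      for ln in sys.stdin:
          for key in ln.split():
              if len(key) > 1:
                  # single-argument print() works in both Python 2 and 3
                  print("%s 1" % key)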
      
      
      reducer:
      read a stream of "<word> 1" from stdin (sorted by word)
      write "<word> <count>" to stdout
      redu1.py
      #!/usr/bin/env python
      # my 1st reducer: reads "<word> <value>" pairs, sums the values of
      # consecutive identical keys, writes "<word> <sum>"
      import sys

      myKey = ""
      myVal = 0
      for ln in sys.stdin:
          L = ln.split()
          # a line may hold several "<word> <value>" pairs
          nw = len(L) // 2
          for i in range(nw):
              key = L[2 * i]
              val = int(L[2 * i + 1])
              if myKey == key:
                  myVal = myVal + val
              else:
                  # key changed: emit the finished word before starting the next
                  if len(myKey) > 0:
                      print("%s %d" % (myKey, myVal))
                  myKey = key
                  myVal = val

      # emit the last word
      if len(myKey) > 0:
          print("%s %d" % (myKey, myVal))
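Before submitting the job, the map, sort, and reduce stages can be tried out in a single Python process. The sketch below is an assumption-laden stand-in, not the scripts above: the names `map_words`, `reduce_counts`, and `word_count` are hypothetical, `sorted()` plays the role of Hadoop's shuffle/sort, and `itertools.groupby` replaces the hand-written consecutive-key loop of the reducer.

```python
from itertools import groupby


def map_words(lines):
    """Emit (word, 1) for every word longer than one character."""
    for line in lines:
        for word in line.split():
            if len(word) > 1:
                yield (word, 1)


def reduce_counts(sorted_pairs):
    """Sum the values of consecutive pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    # sorted() stands in for the sort Hadoop performs between map and reduce
    pairs = sorted(map_words(lines))
    return list(reduce_counts(pairs))


result = word_count(["foo foo quux labs foo bar quux"])
for word, total in result:
    print("%s %d" % (word, total))
# prints:
# bar 1
# foo 3
# labs 1
# quux 2
```

The reducer only works because equal words arrive next to each other; without the sort step, `foo` would be reported more than once.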
        
      