...

    1. Python Mapper
      Code Block: mapp1.py
      #!/usr/bin/env python
      # my 1st mapper: writes <word> 1
      import sys
      data = sys.stdin.readlines()
      for ln in data:
          L=ln.split()
          for key in L:
              if len(key)>1:
                  print key,1
      
      
      Reducer: reads a stream of "<word> 1" pairs from stdin
      and writes "<word> <count>" to stdout.
      Code Block: redu1.py
      #!/usr/bin/env python
      # my 1st reducer: reads <word> <value> pairs, sums values over consecutive identical keys, writes <word> <sum>
      import sys
      data = sys.stdin.readlines()
      myKey=""
      myVal=0
      for ln in data:
          #print ln,
          L=ln.split()
          #print L
          nw=len(L)/2
          for i in range(nw):
              #print i
              key=L[0+2*i]
              val=int(L[2*i+1])
              #print key,val,nw
              if myKey==key:
                  myVal=myVal+val
              else:
                  if len(myKey)>0:
                      print myKey, myVal
                  myKey=key
                  myVal=val
      
      if len(myKey)>0:
          print myKey, myVal
      
    2. Execute:
      Code Block
      hadoop jar $SJAR \
      -mapper $(pwd)/mapp1.py \
      -reducer $(pwd)/redu1.py \
      -input inputShak \
      -output outputShak3
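Before submitting the job, the mapper/reducer pair can be sanity-checked locally, since a streaming job is logically equivalent to `cat input | mapper | sort | reducer`. Below is a minimal Python 3 simulation of that pipeline; the function names and the sample text are illustrative only, not taken from mapp1.py/redu1.py (which are Python 2 scripts):

```python
# Simulate "cat input | mapper | sort | reducer" for the word-count job.

def mapper(lines):
    # emit (word, 1) for every word longer than one character,
    # mirroring the len(key)>1 filter in mapp1.py
    for ln in lines:
        for key in ln.split():
            if len(key) > 1:
                yield (key, 1)

def reducer(pairs):
    # pairs arrive sorted by key; sum values of consecutive identical keys,
    # mirroring the myKey/myVal logic in redu1.py
    my_key, my_val = None, 0
    for key, val in pairs:
        if key == my_key:
            my_val += val
        else:
            if my_key is not None:
                yield (my_key, my_val)
            my_key, my_val = key, val
    if my_key is not None:
        yield (my_key, my_val)

text = ["to be or not to be", "be brief"]
shuffled = sorted(mapper(text))   # stands in for Hadoop's sort/shuffle phase
counts = dict(reducer(shuffled))
print(counts)                     # {'be': 3, 'brief': 1, 'not': 1, 'or': 1, 'to': 2}
```

If the local output looks right, the same scripts should behave identically under Hadoop streaming, since the framework only adds the distributed sort between the two stages.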
      

...

  1. compute PageRank for the medium-size set of wiki pages from the Harvard class by Hanspeter Pfister; the code still has problems deploying on the EC2 cluster
  2. here is a tarball of the final version of my code
    • to upload it repeatedly I used the commands
      scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .
      ./mySetup.sh -f l -v11 -D
      it contains
      Code Block
       training  495 2009-11-11 20:25 abcd-pages
       training  290 2009-11-11 20:25 cleanup.py
       training 1374 2009-11-14 19:47 mappPR.py
       training 2302 2009-11-14 18:30 pageRankCommon.py
       training 2648 2009-11-14 18:31 pageRankCommon.pyc
       training 1034 2009-11-14 19:25 reduPR.py
       training 7251 2009-11-14 11:33 runPageRank.sh
       training 1806 2009-11-14 18:34 wiki2mappPR.py
      
    • to upload the data set to Hadoop HDFS by hand I ran
      hadoop fs -put wL-pages-iter0 wL-pages-iter0
    • to execute the full map/reduce job with 3 iterations
      (clean up all, write raw file, use 4 mappers + 2 reducers, init map, 3 x M/R, final sort):
      ./runPageRank.sh -X -w -D 4.2 -m -I 0.3 -i -f
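The tarball's mappPR.py / reduPR.py are not reproduced here; as a rough illustration of what one PageRank map/reduce iteration computes, here is a hypothetical self-contained sketch. All names, the data layout, and the damping factor 0.85 are assumptions, not taken from the actual scripts:

```python
# Hypothetical sketch of one PageRank map/reduce iteration.
# pages: {page: (rank, [outlinks])}; ranks sum to 1 over the graph.

D = 0.85  # damping factor (assumed; common default)

def pr_map(pages):
    # emit the graph structure, plus each page's rank contribution
    # split evenly among its outlinks
    for page, (rank, links) in pages.items():
        yield page, ("links", links)
        for dest in links:
            yield dest, ("rank", rank / len(links))

def pr_reduce(records, n_pages):
    # regroup by page, sum incoming contributions, apply damping
    links, incoming = {}, {}
    for page, (tag, val) in records:
        if tag == "links":
            links[page] = val
        else:
            incoming[page] = incoming.get(page, 0.0) + val
    return {page: ((1 - D) / n_pages + D * incoming.get(page, 0.0), links[page])
            for page in links}

# one iteration on a tiny 2-page cycle; ranks stay at the 0.5/0.5 fixed point
pages = {"A": (0.5, ["B"]), "B": (0.5, ["A"])}
pages = pr_reduce(pr_map(pages), n_pages=len(pages))
```

In the real job each iteration is one Hadoop streaming pass (mappPR.py then reduPR.py), and runPageRank.sh chains the 3 iterations by feeding each reducer's HDFS output back in as the next mapper's input.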

...