
Description of the MapReduce concept: http://java.dzone.com/articles/how-hadoop-mapreduce-works

[image] Idea of mapper & reducer

[image] Data flow for a single MapReduce job
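As a mental model of the figures above (a local, in-memory sketch, not Hadoop code): map each record to (key, value) pairs, group the pairs by key (this is what Hadoop's shuffle/sort phase does), then reduce each group.

Code Block

    # toy word count: map -> shuffle/sort -> reduce, all in one process
    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        for word in line.split():
            yield word, 1                      # <key>=word, <value>=1

    def reducer(word, counts):
        return word, sum(counts)               # <key>=word, <value>=total

    lines = ["foo foo quux", "labs foo bar quux"]
    pairs = sorted(p for ln in lines for p in mapper(ln))   # map + shuffle/sort
    result = [reducer(k, [v for _, v in g])
              for k, g in groupby(pairs, key=itemgetter(0))]
    print(result)   # [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]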

Hadoop - my first MapReduce (M/R) code

  1. install VMware on the local machine
  2. import the Hadoop server from http://www.cloudera.com/hadoop-training-virtual-machine
  3. fire the VM up
  4. To use Streaming, add a system variable:

     export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar

     Now you can write M/R in any language.
  5. upload the Shakespeare text to HDFS (Hadoop Distributed File System):


     cd ~/git/data    (is distributed w/ the Cloudera VM)
     tar vzxf shakespeare.tar.gz

     check nothing is in HDFS:
     hadoop fs -ls /user/training

     add the text file to HDFS:
     hadoop fs -put input /user/training/inputShak


     and check again:
     hadoop fs -ls /user/training

  6. Execute an M/R job under Hadoop using 'cat' & 'wc':

     Code Block

     hadoop jar $SJAR \
     -mapper cat \
     -reducer wc \
     -input inputShak \
     -output outputShak

     inspect the output in:
     hadoop fs -cat outputShak/p*

     175376   948516   5398106

     (the three numbers are wc's usual line, word, and character counts for the whole corpus)
  7. Task: write my own Map & Reduce counting the frequency of words:

     mapper: read text data from stdin,
     write "<key> <value>" to stdout (<key>=word, <value>=1)

     example:
     $ echo "foo foo quux labs foo bar quux" | ./mapper.py

     Python Mapper

     Code Block: mapp1.py
      #!/usr/bin/env python
      # my 1st mapper: writes <word> 1
      import sys
      data = sys.stdin.readlines()
      for ln in data:
          L = ln.split()
          for key in L:
              if len(key) > 1:      # skip 1-character tokens
                  print key, 1
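     For the example command above, this mapper should emit one "<word> 1" pair per token:

     Code Block

     foo 1
     foo 1
     quux 1
     labs 1
     foo 1
     bar 1
     quux 1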

     reducer: read a stream of "<word> 1" from stdin,
     write "<word> <count>" to stdout

     Code Block: redu1.py

      #!/usr/bin/env python
      # my 1st reducer: reads <word> <value> pairs, sums the values of the
      # same consecutive key, writes <word> <sum>
      import sys
      data = sys.stdin.readlines()
      myKey = ""
      myVal = 0
      for ln in data:
          L = ln.split()
          nw = len(L) / 2
          for i in range(nw):
              key = L[2 * i]
              val = int(L[2 * i + 1])
              if myKey == key:
                  myVal = myVal + val
              else:
                  if len(myKey) > 0:
                      print myKey, myVal
                  myKey = key
                  myVal = val

      if len(myKey) > 0:
          print myKey, myVal
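     Hadoop Streaming sorts the mapper output by key before it reaches the reducer, which is why summing consecutive keys is enough. The pair of scripts can therefore be tested locally by putting a sort between them:

     Code Block

     $ echo "foo foo quux labs foo bar quux" | ./mapp1.py | sort | ./redu1.py
     bar 1
     foo 3
     labs 1
     quux 2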

  8. Execute:

     Code Block

     hadoop jar $SJAR \
     -mapper $(pwd)/mapp1.py \
     -reducer $(pwd)/redu1.py \
     -input inputShak \
     -output outputShak3
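     Note that both scripts must be executable before the job is submitted, and the result lands in HDFS (a quick check, assuming the usual part-file naming):

     Code Block

     chmod +x mapp1.py redu1.py
     hadoop fs -cat outputShak3/part-00000 | head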

Cloudera - my 1st EC2 cluster deployed
  1. follow the instructions at http://archive.cloudera.com/docs/ec2.html

     Item 2.1. I uploaded 3 tar files: 'client script', 'boto', and 'simplejson'.

       • Un-tarred all 3: tar vzxf ....
       • executed twice, in the 'boto' and 'simplejson' directories:

         Code Block

         sudo python setup.py install

       • moved hadoop-ec2 to a permanent place & added it to the path (for easier use):

         Code Block

         tar vzxf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
         sudo mv cloudera-for-hadoop-on-ec2-py-0.3.0-beta /opt/
         export HADOOP_EC2_HOME=/opt/cloudera-for-hadoop-on-ec2-py-0.3.0-beta
         export PATH=$PATH:$HADOOP_EC2_HOME

       • exported the environment variables by hand:
         AWS_ACCESS_KEY_ID - Your AWS Access Key ID
         AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key

       • created a directory called ~/.hadoop-ec2 w/ a file ec2-clusters.cfg with the content:

         Code Block

         [my-hadoop-cluster]
         ami=ami-6159bf08
         instance_type=m1.small
         key_name=janAmazonKey2
         availability_zone=us-east-1a
         private_key=/home/training/.ec2/id_rsa-janAmazonKey2
         ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no

       • fired up a cluster of 1 server + 2 nodes:

         hadoop-ec2 launch-cluster my-hadoop-cluster 2

       • defined a bunch of variables in /usr/src/hadoop/contrib/ec2/bin/hadoop-ec2-env.sh

1+2 machines are up and running:

Code Block

Waiting for master to start (Reservation:r-d0d424b8)
...............................................
master	i-1cbe3974	ami-6159bf08	ec2-67-202-49-216.compute-1.amazonaws.com	ip-10-244-182-63.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:31:17.000Z	us-east-1c
Waiting for slaves to start
.............................................
slave	i-f2be399a	ami-6159bf08	ec2-67-202-14-97.compute-1.amazonaws.com	ip-10-244-182-127.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
slave	i-f4be399c	ami-6159bf08	ec2-75-101-201-30.compute-1.amazonaws.com	ip-10-242-19-81.ec2.internal	running	janAmazonKey	c1.medium	2009-10-26T02:32:19.000Z	us-east-1c
Waiting for jobtracker to start

Waiting for 2 tasktrackers to start
.................................................1.............2.
Browse the cluster at http://ec2-67-202-49-216.compute-1.amazonaws.com/

To login to the cluster execute:
hadoop-ec2 login my-hadoop-cluster

To terminate the cluster execute:
hadoop-ec2 terminate-cluster my-hadoop-cluster
and say 'yes'!!

To delete the EC2 security groups:
hadoop-ec2 delete-cluster my-hadoop-cluster


To actually browse them I need to:

  1. execute ./hadoop-ec2 proxy my-hadoop-cluster, resulting in:

     export HADOOP_EC2_PROXY_PID=20873;
     echo Proxy pid 20873;

  2. and add the proxy to Firefox - so far it has not worked.
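One way to consume that output (a sketch based on the export lines above; the SOCKS port is whatever the script reports):

Code Block

# the proxy command prints shell export statements; eval them so the
# tunnel's PID is recorded, then kill it when done browsing
eval "$(./hadoop-ec2 proxy my-hadoop-cluster)"
kill $HADOOP_EC2_PROXY_PID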

Functional packages I made

  1. Very simple word count example which works TAR BALL
  2. Compute PageRank TAR BALL for the medium-size set of wiki-pages from the Harvard class by Hanspeter Pfister; the code still has problems in deployment: it crashes on reduce if the full 11M lines are processed on a 10-node EC2 cluster (a sketch of one PageRank M/R step follows after this list)
  3. here is the tar-ball of the final version of my code
     • to upload it more often I used the command:
       scp balewski@deltag5.lns.mit.edu:"0x/mySetup.sh" .
       ./mySetup.sh -f l -v11 -D


     • it contains:

       Code Block

       training  495 2009-11-11 20:25 abcd-pages
       training  290 2009-11-11 20:25 cleanup.py
       training 1374 2009-11-14 19:47 mappPR.py
       training 2302 2009-11-14 18:30 pageRankCommon.py
       training 2648 2009-11-14 18:31 pageRankCommon.pyc
       training 1034 2009-11-14 19:25 reduPR.py
       training 7251 2009-11-14 11:33 runPageRank.sh
       training 1806 2009-11-14 18:34 wiki2mappPR.py

     • to upload the data set to hadoop HDFS by hand I did:
       hadoop fs -put wL-pages-iter0 wL-pages-iter0
     • to execute the full map/reduce job w/ 3 iterations (clean up all, write the raw file, use 4 map + 2 reduce tasks, init Map, 3 x M/R, final sort):
       ./runPageRank.sh -X -w -D 4.2 -m -I 0.3 -i -f
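For reference, one iteration of streaming PageRank can be written as a mapper/reducer pair along these lines (a sketch only, not the actual contents of mappPR.py/reduPR.py; the record format "page<TAB>rank<TAB>out1,out2,..." and the single-file layout are assumptions):

Code Block

    #!/usr/bin/env python
    # pagerank_step.py -- one PageRank iteration for Hadoop Streaming (sketch)
    # usage: pagerank_step.py map   or   pagerank_step.py red
    import sys

    def mapper():
        for line in sys.stdin:
            page, rank, links = line.rstrip("\n").split("\t")
            out = links.split(",") if links else []
            # re-emit the link structure so the reducer can rebuild the graph
            print("%s\tL\t%s" % (page, links))
            # spread this page's rank evenly over its outlinks
            for dst in out:
                print("%s\tR\t%.10f" % (dst, float(rank) / len(out)))

    def reducer(d=0.85):
        # streaming delivers lines sorted by key, so one pass suffices
        def emit(page, total, links):
            # simplified per-page damping: (1-d) + d * sum(contributions)
            print("%s\t%.10f\t%s" % (page, (1.0 - d) + d * total, links))
        cur, total, links = None, 0.0, ""
        for line in sys.stdin:
            page, kind, val = line.rstrip("\n").split("\t")
            if page != cur:
                if cur is not None:
                    emit(cur, total, links)
                cur, total, links = page, 0.0, ""
            if kind == "L":
                links = val
            else:
                total += float(val)
        if cur is not None:
            emit(cur, total, links)

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

A runPageRank.sh-style driver would submit this three times, feeding each iteration's -output directory to the next iteration's -input.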

More advanced instructions about running Hadoop on Amazon EC2:

http://wiki.apache.org/hadoop/AmazonEC2
