
EBS permanent disk - (is not reliable for me)

Nov 14: The EBS disk loses information after it is disconnected and reconnected. I used the following commands:

{code}
mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls
{code}

The file structure seems to be fine, but when I try to read the files I get this error for some of them:

{code}
[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2
{code}

Note: the EBS disk was partitioned and formatted on exactly the same operating system in the previous session:

{code}
fdisk /dev/sdf
mkfs.ext3 /dev/sdf1
{code}


File transfer

Nov 13:
  • scp RCF --> Amazon: 3 MB/sec, ~GB files
  • scp Amazon --> Amazon: 5-8 MB/sec, ~GB files

  • Problem with large (~0.5+ GB) file transfers. There are 2 types of disks:
    • local volatile /mnt, of size ~140 GB
    • permanent EBS storage (size ~$$$)
  • scp of a binary (xxx.gz) to the EBS disk resulted in corruption (gunzip would complain). Once, the file size was off by 1 bit (of 0.4 GB). It was random: transfers would succeed after several trials. If multiple scp transfers were made simultaneously, it got worse.
  • Once I changed the destination to the /mnt disk and did one transfer at a time, all problems were gone: I scp'd 3 files of 1 GB without a glitch. Later I copied the files from /mnt to the EBS disk (took ~5 minutes per GB).
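Because the corruption was only noticed when gunzip complained, one way to catch it earlier is to compare a checksum computed on both ends of a transfer. The following is a minimal sketch, not something that was used in these tests; the function names `file_sha256` and `same_content` and the 1 MB chunk size are illustrative choices.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for multi-GB files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files have identical SHA-256 digests."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Running `file_sha256` on the source machine and again on the destination after scp, then comparing the two hex strings, would flag a damaged copy before it is used as job input.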

Nov 14: a transfer of 1 GB RCF <--> Amazon takes ~5 minutes.

Launching nodes

Nov 13:
  • Matt's customized Ubuntu w/o STAR software: 4-6 minutes, the smallest machine, $0.10
  • default public Fedora from EC2: ~2 minutes
  • launching a Cloudera cluster, 1+4 or 1+10, seems to take a similar time of ~5 minutes

Nov 14:
  • There is a limit of 20 on the number of EC2 machines I could launch at once with the command:
    hadoop-ec2 launch-cluster my-hadoop-cluster19
    '20' would not work. This is my configuration:

{code}
> cat */.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
{code}

Make sure to assign the proper zone (availability_zone) if you use an EBS disk.
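An EBS volume can only be attached to an instance in the same availability zone, so it can help to check the zone in the configuration before launching. The `%(private_key)s` form in `ssh_options` matches the interpolation syntax of Python's `configparser`; whether hadoop-ec2 reads the file this way is an assumption here, and the section name `[my-hadoop-cluster]` is also assumed, since the listing shows only the options. A minimal sketch:

```python
import configparser

# The section header is an assumption; the options are copied from the listing.
CONFIG_TEXT = """
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
"""

parser = configparser.ConfigParser()  # basic %(name)s interpolation is the default
parser.read_string(CONFIG_TEXT)
cluster = parser["my-hadoop-cluster"]

zone = cluster["availability_zone"]
ssh_options = cluster["ssh_options"]  # %(private_key)s is expanded on access

print(zone)
print(ssh_options)
```

The printed zone is the one the EBS volume would have to live in for the attach to succeed.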

Computing speed

Task description

I have exercised the Cloudera AMI package and requested 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to find iteratively the solution of the matrix equation:

A*X = X

where A is a square matrix of dimension N, with N equal to the number of Wikipedia pages pointed to by any Wikipedia page. X is a vector of the same dimension describing the ultimate weight of a given page (the PageRank value). The N of my problem was 1e6..1e7.
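A common way to solve A*X = X iteratively is to multiply repeatedly: start from equal weights, replace X by A*X, renormalise, and stop when X no longer changes. The sketch below illustrates this on a made-up 3-page graph in plain Python; the function name `power_iteration`, the tolerance and the toy matrix are assumptions for illustration, and no damping factor is used because the notes do not mention one.

```python
def power_iteration(matrix, tol=1e-10, max_iter=200):
    """Iterate X <- A*X until X stops changing; X is kept normalised to sum 1."""
    n = len(matrix)
    x = [1.0 / n] * n  # start from equal weights
    for _ in range(max_iter):
        new_x = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(new_x)
        new_x = [value / total for value in new_x]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x


# Toy link graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# A[i][j] is the share of page j's weight that is passed on to page i.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

ranks = power_iteration(A)
print([round(r, 3) for r in ranks])
```

For this toy graph the fixed point can be checked by hand: X = (0.4, 0.2, 0.4) satisfies A*X = X. At N of 1e6..1e7 the matrix is never stored densely like this; only the link lists are kept, which is what makes a map/reduce formulation attractive.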

I was given a dump of all Wikipedia pages (HM5,6) in the format:

<page><title>The Title</title><text>The page body</text></page>

with one line of text per page. The (human-typed) content was extremely non-homogeneous and multilingual, with many random characters and typos.


I wrote 4 Python string-processing functions:
  1. init: converts the input text to <key,value> format (my particular choice of the meaning)
  2. map and reduce functions, run as a pair over multiple iterations
  3. finish: exports the final list of pages ordered by PageRank

I allocated the smallest (least expensive) CPUs at EC2: ami=ami-6159bf08, instance_type=m1.small

The goal was to perform all init + N_iter + finish steps using 10 nodes and the Hadoop framework.
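The code used in these tests is not reproduced in these notes. Purely as an illustration of how an init / map / reduce / finish chain for PageRank can be organised, here is a small in-memory sketch. Everything in it is an assumption: the <key,value> layout, the `[[Target]]` link pattern (standard wiki link markup), the function names, and the `run_iteration` helper that stands in for Hadoop's shuffle step.

```python
import re
from collections import defaultdict

PAGE_RE = re.compile(r"<page><title>(.*?)</title><text>(.*?)</text></page>")
LINK_RE = re.compile(r"\[\[([^\]|#]+)")  # [[Target]] or [[Target|label]]


def init(line):
    """Turn one input line into (title, (rank, outgoing_links))."""
    match = PAGE_RE.search(line)
    if match is None:
        return None  # skip lines that do not follow the expected format
    title, body = match.groups()
    links = [target.strip() for target in LINK_RE.findall(body)]
    return title, (1.0, links)


def map_fn(title, value):
    """Emit the page's link list plus an equal share of its rank per link."""
    rank, links = value
    yield title, ("links", links)
    for target in links:
        yield target, ("share", rank / len(links))


def reduce_fn(title, values):
    """Sum the incoming shares for one page and re-attach its link list."""
    links, rank = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            rank += payload
    return title, (rank, links)


def run_iteration(records):
    """One map + shuffle + reduce pass, done in memory instead of by Hadoop."""
    grouped = defaultdict(list)
    for title, value in records:
        for key, emitted in map_fn(title, value):
            grouped[key].append(emitted)
    return [reduce_fn(title, values) for title, values in grouped.items()]


def finish(records):
    """Return (title, rank) pairs ordered by decreasing rank."""
    return sorted(((t, v[0]) for t, v in records), key=lambda p: -p[1])


lines = [
    "<page><title>A</title><text>see [[B]] and [[C|the C page]]</text></page>",
    "<page><title>B</title><text>see [[C]]</text></page>",
    "<page><title>C</title><text>see [[A]]</text></page>",
]
records = [rec for rec in map(init, lines) if rec is not None]
for _ in range(2):
    records = run_iteration(records)
print(finish(records))
```

In this layout a page that appears only as a link target gets its own record after the first pass, so the number of records can grow between the input and the first iteration.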

Test 1: execution of the full chain (init + 2 iterations + finish) using a ~10% subset of Wikipedia pages (enwiki-20090929-one-page-per-line-part3)

  • The unzipped file had a size of 2.2 GB (ASCII) and contained 1.1M lines (original pages), which pointed to 14M pages (outgoing links, including self-references, non-unique). After the 1st iteration the number of lines (pages which are pointed to by any of the original pages) grew to 5M pages and stabilized.
  • I brought the part3.gz file to the master node and unzipped it on the /mnt disk, which has enough space (took a few minutes).
  • I stuck to the default choice of running 20 mappers and 10 reducers for every step (for the 10-node cluster).


  • Timing results:
  1. copy local file to HDFS: ~2 minutes
  2. init: 410 sec
  3. map/reduce iter 0: 300 sec
  4. map/reduce iter 1: 180 sec
  5. finish: 190 sec

Total time was 20 minutes; 11 CPUs were involved.
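The step timings add up to the quoted total. A quick check, with the "~2 minutes" HDFS copy entered as 120 s (the dictionary labels and the `nodes` variable are just names for this sketch):

```python
# Step durations in seconds, taken from the Test 1 timing list.
steps = {
    "copy to HDFS": 120,
    "init": 410,
    "map/reduce iter 0": 300,
    "map/reduce iter 1": 180,
    "finish": 190,
}
nodes = 11  # 1 master + 10 worker nodes

total_sec = sum(steps.values())
total_min = total_sec / 60
node_hours = nodes * total_sec / 3600  # machine time summed over all nodes

print(f"total: {total_sec} s = {total_min:.0f} min")
print(f"node-hours: {node_hours:.2f}")
```

This gives 1200 s, i.e. 20 minutes of wall-clock time, or about 3.67 node-hours across the 11 machines.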

Test 2: execution of a single map/reduce step on 27M linked pages, using the full set of Wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code, which could slow it down by ~20%-30%.

  • The unzipped file had a size of 21 GB (ASCII) and contained 9M lines (original pages), which pointed to 142M pages (outgoing links, including self-references, non-unique). After the 1st iteration (which I ran serially on a different machine) the number of lines (pages which are pointed to by any of the original pages) grew to 27M pages.
  • I brought the 1 GB output of iteration 1 to the master node and unzipped it on the /mnt disk (took 5 min for scp and 5 min for unzip).
  • I ran 20 mappers and 10 reducers for every step (for the 10-node cluster).


  • Timing results:
  1. copy local file to HDFS: ~10 minutes. Hadoop decided to divide the data into 40 sets (and to issue 40 map jobs).
  2. 3 map jobs finished after 8 minutes.
  3. 5 map jobs finished after 16 minutes.
  4. 16 map jobs finished after 29 minutes.
  5. All 40 map jobs finished after 42 minutes (one of the map jobs was restarted during this time).
  6. Reduce failed for all 10 jobs after ~5 minutes, all 10 ~simultaneously.
  7. Hadoop tried twice to reissue the 10 sort + 10 reduce jobs, and it failed again after another ~5 minutes.

At this stage I killed the cluster. It was consuming 11 CPU-hours per hour and I had no clue how to debug it. I suspect some internal memory (HDFS?) limit was not sufficient to hold the sort results after the map tasks. My estimate is that the 3 GB of unzipped input could grow by a factor of a few; maybe there is a 10 GB limit I should change (or pay extra?).