h4. EBS permanent disk (is not reliable for me)

Nov 14 :
The EBS disk loses information after it is disconnected and reconnected. I used the following commands:
{code}
mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls
{code}
The file structure seems to be fine, but when I try to read the files I get this error for some of them:
{code}
[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2
{code}
Note: the EBS disk was partitioned & formatted on exactly the same operating system in the previous session:
{code}
fdisk /dev/sdf
mkfs.ext3 /dev/sdf1
{code}
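To tell whether the files are already damaged before the volume is detached or only become unreadable after the re-attach, a checksum manifest is one way to narrow it down. This is only a minimal sketch of that idea, not something from the original session; it assumes python on the node and keeps the manifest on /mnt, off the EBS disk being tested:

{code}
# Hypothetical helper (not part of the original workflow): build a sha1 manifest of
# /storage before detaching the EBS volume, then re-run in "verify" mode after
# re-mounting to see which files changed or became unreadable.
import hashlib, os, sys

ROOT = "/storage"                  # mount point used above
MANIFEST = "/mnt/storage.sha1"     # keep the manifest OFF the disk being tested

def sha1_of(path, blocksize=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            h.update(chunk)
    return h.hexdigest()

def build():
    with open(MANIFEST, "w") as out:
        for dirpath, _, names in os.walk(ROOT):
            for name in names:
                p = os.path.join(dirpath, name)
                out.write("%s  %s\n" % (sha1_of(p), p))

def verify():
    bad = 0
    for line in open(MANIFEST):
        digest, path = line.rstrip("\n").split("  ", 1)
        try:
            ok = (sha1_of(path) == digest)
        except IOError as e:          # e.g. the Input/output errors seen above
            print("READ ERROR %s (%s)" % (path, e))
            bad += 1
            continue
        if not ok:
            print("CHANGED    %s" % path)
            bad += 1
    print("%d problem file(s)" % bad)

if __name__ == "__main__":
    build() if sys.argv[1:] == ["build"] else verify()
{code}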
h4. file transfer

Nov 13 :
*scp RCF --> Amazon : 3 MB/sec, ~GB files
*scp Amazon --> Amazon : 5-8 MB/sec, ~GB files
Problem w/ large (~0.5+ GB) file transfers. There are 2 types of disks:
- local volatile /mnt, of size ~140GB
- permanent EBS storage (size ~$$$)
- scp of a binary (xxx.gz) to the EBS disk resulted in corruption (gunzip would complain). Once the file size was off by 1 bit (of 0.4GB). It was random; multiple transfers would succeed after several trials. If multiple scp transfers were made simultaneously it would get worse (see the sketch after this list).
- Once I changed the destination to the /mnt disk and did one transfer at a time, all problems were gone - I scp'd 3 files of 1GB w/o a glitch. Later I copied the files from /mnt to the EBS disk (took ~5 minutes per GB).
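Since the corruption only showed up when gunzip ran, one way to catch it at transfer time is to compare checksums on both ends and retry the copy. This is only an illustration of that idea, not the procedure actually used here; host names and paths are placeholders, and it assumes md5sum is available on both machines:

{code}
# Illustrative only: scp a file, then compare md5 on source and destination,
# retrying a few times. Host name and paths below are placeholders.
import subprocess

def run(cmd):
    return subprocess.check_output(cmd).decode().strip()

def local_md5(path):
    return run(["md5sum", path]).split()[0]        # "digest  path" -> digest

def remote_md5(host, path):
    return run(["ssh", host, "md5sum", path]).split()[0]

def scp_verified(src, dest_host, dest_path, tries=3):
    want = local_md5(src)
    for attempt in range(1, tries + 1):
        subprocess.check_call(["scp", src, "%s:%s" % (dest_host, dest_path)])
        if remote_md5(dest_host, dest_path) == want:
            return True
        print("checksum mismatch on attempt %d, retrying" % attempt)
    return False

if __name__ == "__main__":
    # example: push one ~1GB archive to the master node, one file at a time
    ok = scp_verified("wiki-part3.gz", "ec2-master", "/mnt/wiki-part3.gz")
    print("transfer verified" if ok else "giving up")
{code}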
Nov 14 :
Transfer of 1GB from RCF <--> Amazon takes ~5 minutes.
h4. Launching nodes

Nov 13 :
*Matt's customized Ubuntu w/o STAR software : 4-6 minutes (the smallest machine, $0.10)
*default public Fedora from EC2 : ~2 minutes
*launching a Cloudera cluster of 1+4 or 1+10 seems to take a similar time of ~5 minutes
Nov 14 :
*there is a limit of 20 on the # of EC2 machines I could launch at once with the command hadoop-ec2 launch-cluster my-hadoop-cluster19 ; '20' would not work. This is my configuration:
{code}
> cat */.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
{code}
Make sure to assign the proper availability zone if you use an EBS disk.
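The reason for the zone setting: an EBS volume can only be attached to an instance running in the same availability zone. Purely for reference, here is a present-day boto3 sketch of that check (boto3 did not exist at the time of these notes; the volume and instance ids are placeholders):

{code}
# Sketch with boto3 (a modern AWS SDK, not what was used here): verify that an
# instance and an EBS volume sit in the same availability zone before attaching.
import boto3

VOLUME_ID = "vol-xxxxxxxx"      # placeholder
INSTANCE_ID = "i-xxxxxxxx"      # placeholder

ec2 = boto3.client("ec2", region_name="us-east-1")

vol_zone = ec2.describe_volumes(VolumeIds=[VOLUME_ID])["Volumes"][0]["AvailabilityZone"]
inst = ec2.describe_instances(InstanceIds=[INSTANCE_ID])
inst_zone = inst["Reservations"][0]["Instances"][0]["Placement"]["AvailabilityZone"]

if vol_zone != inst_zone:
    raise SystemExit("zone mismatch: volume in %s, instance in %s" % (vol_zone, inst_zone))

# zones match (e.g. both us-east-1a, as in ec2-clusters.cfg above); safe to attach
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device="/dev/sdf")
{code}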
h4. Computing speed

h5. Task description
I have exercised the Cloudera AMI package and requested 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages.
The abstract definition of the task is to iteratively find the solution of the matrix equation A*X=X, where A is a square matrix of dimension N equal to the # of wikipedia pages pointed to by any wikipedia page, and X is the vector of the same dimension describing the ultimate weight of a given page (the PageRank value). The N of my problem was 1e6..1e7.
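In other words, X is an eigenvector of A with eigenvalue 1, and the iterative solution is plain power iteration. A toy dense-matrix illustration of the scheme (numpy, tiny N, made-up link matrix); the real computation works on sparse link lists via map/reduce as described below:

{code}
# Toy illustration of solving A*X = X by repeated multiplication (power iteration).
# The real N is 1e6..1e7 and A is sparse, so this dense form only shows the scheme.
import numpy as np

# column-stochastic link matrix for 4 pages (each column sums to 1)
A = np.array([[0.0, 0.5, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])

X = np.full(4, 0.25)                     # start from a uniform weight vector
for it in range(50):
    X_new = A.dot(X)
    if np.abs(X_new - X).sum() < 1e-9:   # stop when the weights stabilize
        break
    X = X_new

print("iterations:", it, "page weights:", X)
{code}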
I was given a dump of all Wikipedia pages ... in the format <page><title>The Title</title><text>The page body</text></page>, one line of text per page. The (human-typed) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
...
wrote
...
4
...
python
...
string
...
processing
...
functions:
...
- init converting input text to <key,value>
...
- format
...
- (my
...
- particular
...
- choice
...
- of
...
- the
...
- meaning
...
- )
...
- mapp
...
- and
...
- reduce
...
- functions,
...
- run
...
- in
...
- pair,
...
- multiple
...
- iterations
...
- finish function exporting final list of pages ordered by page rank.
- I allocated the smallest (least expensive) CPUs at EC2 : ami=ami-6159bf08, instance_type=m1.small

The goal was to perform all ini + N_iter + fin steps using 10 nodes & the hadoop framework.
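For concreteness, the mapp/reduce pair presumably runs under Hadoop Streaming, i.e. as plain stdin/stdout filters. The sketch below is only a schematic reconstruction of that structure, not the actual code used here; the input line layout, the file name and the rank-splitting rule are assumptions:

{code}
# Schematic Hadoop Streaming mapper/reducer for one PageRank iteration.
# Assumed input line format (not necessarily the layout used here):
#   page \t rank \t out1,out2,...
# Roughly invoked as (pr_step.py is a made-up name for this file):
#   hadoop jar hadoop-streaming.jar -mapper 'python pr_step.py map' \
#       -reducer 'python pr_step.py reduce' -input iterK -output iterK+1 -file pr_step.py
import sys

def do_map():
    for line in sys.stdin:
        page, rank, links = line.rstrip("\n").split("\t")
        outs = [l for l in links.split(",") if l]
        # keep the link structure so the next iteration still has it
        print("%s\t#LINKS\t%s" % (page, links))
        share = float(rank) / len(outs) if outs else 0.0
        for target in outs:
            print("%s\t%f" % (target, share))

def do_reduce():
    cur, total, links = None, 0.0, ""
    def emit():
        if cur is not None:
            print("%s\t%f\t%s" % (cur, total, links))
    for line in sys.stdin:            # streaming delivers lines sorted by key
        parts = line.rstrip("\n").split("\t")
        page = parts[0]
        if page != cur:
            emit()
            cur, total, links = page, 0.0, ""
        if parts[1] == "#LINKS":
            links = parts[2]
        else:
            total += float(parts[1])
    emit()

if __name__ == "__main__":
    do_map() if sys.argv[1:] == ["map"] else do_reduce()
{code}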
h5. Test 1: execution of the full chain (ini + 2 iter + fin) using a ~10% sub-set of wikipedia pages (enwiki-20090929-one-page-per-line-part3)

- the unzipped file had a size of 2.2GB ASCII and contained 1.1M lines (original pages) which pointed to 14M pages (outgoing links, including self references, non-unique). After the 1st iteration the # of lines (pages which are pointed to by any of the originals) grew to 5M pages and stabilized.
- I brought the part3.gz file to the master node & unzipped it on the /mnt disk, which has enough space (took a few minutes)
- I stuck with the default choice to run 20 mappers and 10 reducers for every step (for the 10-node cluster)
h5. Timing results

- copy local file to HDFS : ~2 minutes
- init : 410 sec
- map/reduce iter 0 : 300 sec
- map/reduce iter 1 : 180 sec
- finish : 190 sec

Total time was 20 minutes; 11 CPUs were involved.
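For scale, a back-of-the-envelope summary of what those timings amount to (simple arithmetic on the numbers above):

{code}
# Back-of-the-envelope throughput for Test 1, using the numbers quoted above.
input_gb  = 2.2     # unzipped input size
total_min = 20.0    # wall clock for ini + 2 iterations + fin
machines  = 11      # 1 master + 10 nodes

throughput_mb_s = input_gb * 1024 / (total_min * 60)
cpu_hours = machines * total_min / 60.0

print("aggregate throughput ~%.1f MB/s" % throughput_mb_s)   # ~1.9 MB/s
print("~%.1f CPU-hours consumed" % cpu_hours)                # ~3.7 CPU-hours
{code}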
h5. Test 2: execution of a single map/reduce step on 27M linked pages, using the full set of wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3)

I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.
- the unzipped file had a size of 21 GB ASCII and contained 9M lines (original pages) which pointed to 142M pages (outgoing links, including self references, non-unique). After the 1st iteration (which I ran serially on a different machine) the # of lines (pages which are pointed to by any of the originals) grew to 27M pages.
- I brought the 1GB output of iteration 1 to the master node & unzipped it on the /mnt disk (took 5 minutes for scp and 5 for unzip)
- I ran 20 mappers and 10 reducers for every step (for the 10-node cluster)
h5. Timing results

- copy local file to HDFS : ~10 minutes. Hadoop decided to divide the data into 40 sets (and so issued 40 map jobs)
- 3 map jobs finished after 8 minutes
- 5 map jobs finished after 16 minutes
- 16 map jobs finished after 29 minutes
- all 40 map jobs finished after 42 minutes (one of the map jobs was restarted during this time)
- reduce failed for all 10 jobs after ~5 minutes, all 10 ~simultaneously
- hadoop tried twice to reissue the 10 sort + 10 reduce jobs and it failed again after another ~5 minutes
At this stage I killed the cluster. It was consuming 11 CPU/hour and I had no clue how to debug it. I suspect some internal memory (HDFS?) limit was not sufficient to hold the sort results after the map tasks. My estimate is that the 3GB of unzipped input could grow by a factor of a few - maybe there is a 10GB limit I should change (or pay extra for?).
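A rough way to quantify that suspicion from the numbers above: estimate how much intermediate map output each of the 10 reducers has to fetch and sort. The expansion factor below is only a guess, matching the "factor of a few" above:

{code}
# Rough estimate of the shuffle volume per reducer in Test 2, from the numbers above.
# The expansion factor of the map output relative to its input is a guess.
input_gb  = 3.0    # estimated unzipped input to the map step (from above)
expansion = 3.0    # assumed growth of map output vs. input ("factor of a few")
reducers  = 10

per_reducer_gb = input_gb * expansion / reducers
print("~%.1f GB of map output per reducer to fetch and sort" % per_reducer_gb)
# Compare this against the memory/disk limits suspected above for the
# simultaneous failure of all 10 reduce jobs.
{code}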