ESTIMATED TIME: 18 hours.
DUE DATE: Wednesday, June 2 at Midnight.
RIGHT NOW: Immediately complete the steps to set up your Amazon Web Services (AWS) account (here). Account activation can take a couple of days, so start right away so the account is ready when you need it. You will start the assignment on your local machine and use the cluster for the large runs at the end.
TURN IN INSTRUCTIONS: Turn in eight files (details below) using the regular catalyst dropbox.
GROUPS: You may work with a partner on this assignment. If you do work with a partner, one member of the group should turn in a single project with everyone's name on it, and all members of the group will receive the same score. You should also include a short readme.txt file listing the members of the group and giving a short summary of who did what. Everyone in the group is responsible for the material regardless of how you organize the work.
HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
PIG LATIN: Pig Latin is a language for the analysis of extremely large datasets. The motivation for Pig Latin is that many people who analyze such datasets are entrenched procedural programmers, who find the declarative, SQL style unnatural. The map-reduce paradigm was a success with these programmers, but it is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin therefore aims for the "sweet spot" between the declarative style of SQL and the procedural style of map-reduce (see the paper "Pig Latin: A Not-So-Foreign Language for Data Processing"). Pig Latin is implemented in PIG, a system that compiles Pig Latin into physical plans that are then executed over Hadoop.
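To give a flavor of the step-by-step style, here is a tiny Pig Latin fragment (the file name, field names, and types are made up for illustration; they are not part of the assignment):
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
by_age = GROUP adults BY age;
counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS num_users;
DUMP counts;
Each statement names an intermediate result, which is what makes the dataflow feel procedural even though PIG decides how to execute it.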
AWS: Amazon Web Services (AWS) is a collection of remote computing services (also called web services) offered over the Internet by Amazon.com (taken from the wiki page). One subset of these, the cloud services, lets any person run a cluster of designated machines and execute any scripts or programs that the cloud services support.
In the scope of this class, we are interested in their Elastic MapReduce service, which enables anyone to easily run hadoop-related scripts (including PIG) with the click of a button. Elastic MapReduce relies on the Elastic Compute Cloud (EC2) to set up a cluster of machines with hadoop (v0.18.3) and pig (v0.3.0) installed. Moreover, scripts or data can be accessed or stored in a centralized location via the Simple Storage Service (S3). We will be dealing primarily with Elastic MapReduce and only touching on S3, since we will be accessing the larger data (~40 MB and ~6 GB) from an S3 bucket (location/folder). The details of doing this are provided in the problems. Finally, here is a link to the documentation on the different web services provided by AWS; links to the getting started and developer guides are mentioned below for reference purposes only (you are not required to read them, since the project instructions provide specific details or links at the appropriate places).
# cmd to be run on the AWS master node after ssh-ing into it.
% ls
# cmd to be run on the local computer
$ ls
$ cd project4
$ tar -xzf pigtutorial.tar.gz
$ tar -xzf hadoop-0.18.3.tar.gz
$ chmod u+x ~/project4/hadoop-0.18.3/bin/hadoop
$ export PIGDIR=~/project4/pigtmp
The variable JAVA_HOME should be set to point to your system's Java directory. On the PC Lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
$ export HADOOP=~/project4/hadoop-0.18.3
$ export HADOOPSITEPATH=~/project4/hadoop-0.18.3/conf/
$ export PATH=$HADOOP/bin/:$PATH
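As noted above, if JAVA_HOME is not already set you can export it yourself; the path below is just a placeholder, so substitute the location of the JDK on your machine:
$ export JAVA_HOME=</path/to/your/jdk>
$ echo $JAVA_HOME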
However, there is a complication involving the scripts that are meant to be run on the hadoop cluster: the commands that specify input and output files use relative paths. To be safe, you should specify absolute paths.
First create a /user/hadoop directory on the AWS cluster (although it should already be there):
% hadoop dfs -mkdir /user/hadoop
Then do a listing of this directory (you should see some output and no exceptions/errors):
% hadoop dfs -ls /user/hadoop
The Pig scripts also use relative paths that you will have to change to absolute paths.
First, we will be loading excite.log.bz2 via S3. Hence, in script1-hadoop.pig you would change:
raw = LOAD 'excite.log.bz2' USING PigStorage....
to
raw = LOAD 's3n://uw-cse444-proj4/excite.log.bz2' USING PigStorage....
Secondly, we are going to store the results in HDFS, so in the same script the line:
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...
becomes
STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING ...
In all your scripts that run on the cluster you will have to use absolute path names.
Note that on the cluster you do not need to run
% java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
You can and must instead type:
% pig -l . script1-hadoop.pig
Once you have successfully run the script, the output will be stored in HDFS on the master node at the absolute path you specified in STORE. That is, for script1 the output will be stored at '/user/hadoop/script1-hadoop-results'. Before you can copy this to your local machine, you will have to copy the directory from HDFS to the master node's local file system; then you can run scp to copy it to your machine.
% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results
This will create a directory script1-hadoop-results with part-* files in it.
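If you want the results as a single file for turn-in or plotting, one simple option (a suggestion, not a required step) is to concatenate the part files on the master node before copying them over:
% cat script1-hadoop-results/part-* > script1-hadoop-results.txt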
$ chmod 600 </path/to/saved/keypair/file.pem>
You will be starting the cluster using the Management Console (GUI) and ssh-ing into the master node.
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
After you are done, terminate the cluster.
For the purposes of this assignment, we are just going to use scp to copy data to and from the master node.
$ tar -czvf file.tar.gz folder/
$ tar -czvf file.tar.gz file1 file2 file3 ... filen
$ tar -xvzf file.tar.gz
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
Or if you have the SSH_OPTS var set then:
$ scp $SSH_OPTS local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<complete_file_path> .
Or if you have the SSH_OPTS var set then:
$ scp $SSH_OPTS hadoop@<master.public-dns-name.amazonaws.com>:<complete_file_path> .
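For reference, here is one way you might set SSH_OPTS so that the commands above work as written (the keypair path is yours to fill in; ServerAliveInterval=10 is the same option as before, written without spaces so the variable expands cleanly):
$ export SSH_OPTS='-o ServerAliveInterval=10 -i </path/to/saved/keypair/file.pem>'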
Follow one of the two choices below.
$ cd ~/.mozilla/firefox
$ cd <profile.default>
$ cp foxyproxy.xml foxyproxy_bak.xml
$ cp /cse/courses/cse444/aws/foxyproxy/foxyproxy.xml foxyproxy.xml
Depending on how you set up the proxy, follow one of the two choices below.
ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>Keep this window running in the background (minimize it).
http://[master_dns_name]:9100/ # web UI for MapReduce job tracker(s)
http://[master_dns_name]:9101/ # web UI for HDFS name node(s) [OPTIONAL]
% lynx http://localhost:9100/ # web UI for MapReduce job tracker(s)
After you have completed the Preliminaries: Run the first example script in the tutorial on the cluster (script1-hadoop.pig), and then answer the following questions (also see hints below):
1.1 How many MapReduce jobs are generated by script1?
1.2 How many map tasks are within the first MapReduce job? How many maps are within later MapReduce jobs? (Note that this number will be small because the dataset is small.)
1.3 How long does each job take? How long does the entire script take?
1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?
Hint 1: Use the job-tracker at http://[master_dns_name]:9100 to see the number of map and reduce tasks for your MapReduce jobs.
Hint 2: To see the schema for intermediate results, you can use PIG's interactive command line client Grunt, which you can launch in local mode with the following command:
$ java -cp pig.jar org.apache.pig.Main -x local
When using Grunt, two commands that you may want to know about are dump and describe.
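For example, a short Grunt session might look like the following (the field names follow the Pig tutorial's excite-small.log sample; treat this as an illustration rather than the exact schema of every intermediate result):
grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grunt> describe raw;
grunt> clean = FILTER raw BY query MATCHES '.+';
grunt> dump clean;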
What you need to turn in:
Required: Submit your answers to problems 1.1 - 1.4 in a file named problem1-answers.txt.
Write a pig script that creates a histogram showing the distribution of user activity levels.
That is, a point (x, y) in the plot means that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or contain only URLs.
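If you have not used a Pig UDF before, the tutorial scripts show the pattern; roughly (the jar path assumes you run from the pigtmp directory, as in script1):
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);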
A few comments to help you get started:
$ chmod +x plot.sh
The script generates a PNG image of the plot in your current directory. Your PIG_RESULTS_FILE needs to be tab-separated and have two columns, x and y. The data also needs to be sorted numerically by x. Although it is also possible to sort using PIG, we recommend that you simply run Unix's sort -n input > output after your job has completed (by default, sorting in PIG is alphabetic).
./plot.sh PIG_RESULTS_FILE
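Putting those steps together, the post-processing might look like this (the directory and file names are illustrative; yours may differ):
$ cat problem2-hadoop-results/part-* | sort -n > problem2-results.txt
$ ./plot.sh problem2-results.txt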
What you need to turn in:
Required: Submit your pig program in problem2.pig.
Run your program on excite.log.bz2, and submit your computed result file (problem2-results.txt) and your PNG plot (problem2-results.png).
The astro dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years from shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken every even timestep. Each snapshot consists of the state of each particle in the simulation.
The following is an example of what the data looks like:
snapshot, #Time, partId, #mass, position0, position1, position2, velocity0, velocity1, velocity2, eps, phi, rho, temp, hsmooth, metals
Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId), and the data tracks the state of particles over time.
2, 0.290315, 0, 2.09808e-08, -0.489263, -0.47879, -0.477001, 0.1433, 0.0721365, 0.265767, , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
2, 0.290315, 1, 2.09808e-08, -0.48225, -0.481107, -0.480114, 0.0703595, 0.142529, 0.0118989, , -0.0269804, 1.79008, 662495, 9.6e-06, 0
We created three files of such data with different sizes. tiny.txt has only a couple of rows and might be useful when writing the script. medium.txt is about 1.3 MB and might be useful when testing your script locally. Finally, we have files containing data from 11 snapshots (2, 12, 22, ..., 102), called dbtest128g.00<snap_shot_num>.txt.f, that have the actual data (~530 MB each). tiny.txt and medium.txt are contained in project4.tar.gz. The names of the different files are:
dbtest128g.00002.txt.f
dbtest128g.00012.txt.f
dbtest128g.00022.txt.f
dbtest128g.00032.txt.f
dbtest128g.00042.txt.f
dbtest128g.00052.txt.f
dbtest128g.00062.txt.f
dbtest128g.00072.txt.f
dbtest128g.00082.txt.f
dbtest128g.00092.txt.f
dbtest128g.00102.txt.f
All these files together can be loaded in PIG scripts via:
raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00*' USING PigStorage...;
Or you can load two (or more) at a time using the UNION operator (say we want the pair (2, 12)):
raw02 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage....;
raw12 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00012.txt.f' USING PigStorage....;
raw = UNION raw02, raw12;
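When working locally, you may find it convenient to declare the schema up front. Here is a sketch (the delimiter and the types are assumptions based on the sample rows above, so double-check them against the actual files):
raw = LOAD 'medium.txt' USING PigStorage(',') AS
      (snapshot:int, time:double, partId:int, mass:double,
       position0:double, position1:double, position2:double,
       velocity0:double, velocity1:double, velocity2:double,
       eps:double, phi:double, rho:double, temp:double, hsmooth:double, metals:double);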
3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.
For each of the following datasets (available on the cluster), what is the level of parallelism when running your script? (Use the web interface to determine the number of concurrently executing map tasks for each dataset; this is the level of parallelism.) What can you say about the scalability of the system?
Run the script on the snapshots at
For each of the 4 cases above, launch your script on the cluster and check in the web interface how many map tasks are created. You can then immediately cancel your script again.
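If you launched the script with pig from the master node, note that cancelling the client (Ctrl-C) may leave the MapReduce job running; to be sure it is gone you can kill it by job id (the id appears in the job tracker UI; these are standard Hadoop commands, shown here as a suggestion):
% hadoop job -list
% hadoop job -kill <job_id>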
What you need to turn in:
Required: 1. Submit your pig program in problem3.1.pig.
2. Submit your answers to the above questions in problem3.1-answers.txt.
3.2 For each pair of subsequent snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles that increased their velocity and the number of particles that reduced their velocity. Count a particle that neither increased nor decreased its velocity as decelerate.
Your output should have the following format:
2.0, accelerate, 2
2.0, decelerate, 3
12.0, decelerate, 16
22.0, accelerate, 2
...
The first column is the start snapshot (int or float), the second column is accelerate or decelerate, and the third column is the count of particles. It is OK to leave out rows with 0 counts. The results you turn in should be based on the large dbtest128g datasets stored on S3 (e.g., the dbtest128g.00002.txt.f and dbtest128g.00012.txt.f files for the pair (2, 12)).
Hints:
What you need to turn in:
Required: 1. Submit your pig program in problem3.2.pig.
2. Submit your computed result file (problem3.2-results.txt).