Project 4 - problems

Problem 1: Getting started with Pig

After you have completed the Preliminaries, run the first example script from the tutorial (script1-hadoop.pig) on an AWS cluster with 5 nodes, and then answer the following questions (also see the hints below):

1.1 How many MapReduce jobs are generated by script1?

1.2 How many map tasks are within the first MapReduce job? How many map tasks are within the later MapReduce jobs? (Note that this number will be small, no more than the number of nodes in the cluster, because the dataset is small.)

1.3 How long does each job take? How long does the entire script take?

1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client Grunt, which you can launch with the following command:

$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local

When using Grunt, two commands that you may want to know about are dump and describe. To see a list of other commands, type help.
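
For example, you can paste the first few statements of script1 into Grunt and inspect an intermediate relation. The sketch below follows the tutorial's script1 (the relation names and UDFs come from it); the local paths to excite-small.log and tutorial.jar are assumptions, so adjust them to match your setup.

-- a minimal sketch of inspecting script1's intermediate relations in Grunt
REGISTER ./tutorial.jar;                    -- path to the tutorial UDF jar (assumption)
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
DESCRIBE clean2;   -- prints the schema of clean2
DUMP clean2;       -- runs the pipeline and prints the tuples of clean2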

What you need to turn in:
Required: Submit your answers to problems 1.1 - 1.4 in a file named problem1-answers.txt.

Problem 2: Simple tweak to tutorial

Write a Pig script that creates a histogram showing the distribution of user activity levels.

That is, each point (x, y) that you generate means that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or contain only URLs.

Run this script against the excite.log.bz2 dataset on AWS. You can use any size cluster you like, but at least 5 nodes is recommended. Then copy the results to your local machine and use gnuplot to plot the histogram points.

A few comments to help you get started:
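
Since the starter comments do not appear in this copy, here is a rough sketch of one possible structure (not a required or complete solution). It assumes the excite log is tab-delimited with (user, time, query) columns, as in the tutorial, and that tutorial.jar is registered for the NonURLDetector UDF; the input path is an assumption, so point it at excite.log.bz2 on S3 or at a local copy.

REGISTER ./tutorial.jar;
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);   -- input path is an assumption
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

-- x: the total number of queries issued by each user
by_user = GROUP clean BY user;
user_counts = FOREACH by_user GENERATE group AS user, COUNT(clean) AS x;

-- y: the number of users who each issued exactly x queries
by_x = GROUP user_counts BY x;
histogram = FOREACH by_x GENERATE group AS x, COUNT(user_counts) AS y;

STORE histogram INTO 'problem2-results';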

What you need to turn in:
Required: Submit your Pig program in problem2.pig.
Run your program on excite.log.bz2, and submit both your computed result file (problem2-results.txt) and your PNG plot (problem2-results.png).

Problem 3: Run the script on the much larger astronomy data

The astronomy dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years, from shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken every even timestep. Each snapshot consists of the state of each particle in the simulation.

The following is an example of what the data looks like:

snapshot, time, partId, mass, position0, position1, position2, velocity0, velocity1, velocity2, eps, phi, rho, temp, hsmooth, metals
2, 0.290315, 0, 2.09808e-08, -0.489263, -0.47879, -0.477001, 0.1433, 0.0721365, 0.265767, , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
2, 0.290315, 1, 2.09808e-08, -0.48225, -0.481107, -0.480114, 0.0703595, 0.142529, 0.0118989, , -0.0269804, 1.79008, 662495, 9.6e-06, 0

Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId) and the data tracks the state of particles over time.

We created several files of this data with different sizes. The first two are in the project4.tar.gz archive:

We also have files containing the actual data for 11 snapshots (2, 12, 22, …, 102), named dbtest128g.00<snap_shot_num>.txt.f (~530 MB each). These files are stored in Amazon S3 under the following names:

dbtest128g.00002.txt.f
dbtest128g.00012.txt.f
dbtest128g.00022.txt.f
dbtest128g.00032.txt.f
dbtest128g.00042.txt.f
dbtest128g.00052.txt.f
dbtest128g.00062.txt.f
dbtest128g.00072.txt.f
dbtest128g.00082.txt.f
dbtest128g.00092.txt.f
dbtest128g.00102.txt.f

All these files together can be loaded in Pig scripts on the AWS cluster via:

raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00*' USING PigStorage...;

Or you can load two (or more) files at a time and combine them with the UNION operator (say we want the pair (2, 12)):

raw02 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage....;
raw12 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00012.txt.f' USING PigStorage....;
raw = UNION raw02, raw12;

3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.
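
A minimal skeleton for such a script might look like the following sketch. The 16-field schema follows the sample rows shown above; the comma delimiter and the exact input path are assumptions (substitute medium.txt when testing locally), and if the fields carry whitespace after the commas you may need to strip it before casting.

-- load one snapshot file; the path and delimiter are assumptions
raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage(',')
      AS (snapshot:int, time:float, partId:long, mass:float,
          position0:float, position1:float, position2:float,
          velocity0:float, velocity1:float, velocity2:float,
          eps:float, phi:float, rho:float, temp:float, hsmooth:float, metals:float);

-- count the particles that appear in each snapshot
by_snapshot = GROUP raw BY snapshot;
counts = FOREACH by_snapshot GENERATE group AS snapshot, COUNT(raw) AS num_particles;

STORE counts INTO 'problem3.1-results';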

Then, run the script on a 5-node AWS cluster, using the large dbtest128g datasets stored in S3, for each of the following combinations of snapshots:

  1. timestep 2
  2. timestep 2 and 12
  3. timestep 2, 12, and 22
  4. timestep 2, 12, 22, and 32

For each of the 4 cases above, what is the level of parallelism when running your script? Determine this by launching the script on each group of datasets, and using the job tracker to find the number of concurrently executing map tasks for each job; this is the job's level of parallelism. (You can then immediately cancel the job; you don't need to turn in the job results.) What can you say about the scalability of the system?

What you need to turn in:
Required:

  1. Submit your Pig program in problem3.1.pig.
  2. Submit your answers to the above questions in problem3.1-answers.txt.

3.2 For each pair of consecutive snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles that increased their speed (magnitude of velocity) and the number of particles that decreased their speed. If a particle's speed is unchanged, count it as decelerated.

Your output should have the following format:

2.0, accelerate, 2
2.0, decelerate, 3
12.0, decelerate, 16
22.0, accelerate, 2
...

The first column denotes the start snapshot (int or float), the second column is either accelerate or decelerate, and the third column is the number of particles. It's OK to leave out rows with 0 counts. The results you turn in should be based on the large dbtest128g datasets stored in S3 (e.g., the dbtest128g.00002.txt.f and dbtest128g.00012.txt.f files in the case of (2, 12)).

Some hints:
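
Since no hints survive in this copy, here is a hedged sketch of one possible approach for a single pair, (2, 12); the same pattern repeats for the other pairs. It assumes raw holds the UNION of the two corresponding files loaded with the 16-field schema from Problem 3.1, that SQRT is available as a Pig built-in (on older Pig versions it lives in piggybank), and that a particle with unchanged speed is counted as decelerate, per the problem statement.

-- speed = magnitude of the velocity vector, per particle and snapshot
speeds = FOREACH raw GENERATE snapshot, partId,
         SQRT(velocity0 * velocity0 + velocity1 * velocity1 + velocity2 * velocity2) AS speed;

s02 = FILTER speeds BY snapshot == 2;
s12 = FILTER speeds BY snapshot == 12;

-- pair up the two states of each particle
paired = JOIN s02 BY partId, s12 BY partId;
labeled = FOREACH paired GENERATE
          s02::snapshot AS start_snapshot,
          (s12::speed > s02::speed ? 'accelerate' : 'decelerate') AS change;

-- count particles per (start snapshot, accelerate/decelerate)
grouped = GROUP labeled BY (start_snapshot, change);
result = FOREACH grouped GENERATE FLATTEN(group), COUNT(labeled);

STORE result INTO 'problem3.2-results';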

What you need to turn in:
Required:

  1. Submit your Pig program in problem3.2.pig.
  2. Run it on AWS (any size cluster, suggested size 15 nodes), and submit your computed result file (problem3.2-results.txt).