Project 4 - problems

Problem 1: Getting started with Pig

After you have completed the Preliminaries, run the first example script from the tutorial (script1-hadoop.pig) on an AWS cluster with 5 nodes, and then answer the following questions (also see the hints below):

1.1 How many MapReduce jobs are generated by script1?

1.2 How many map tasks are within the first MapReduce job? How many map tasks are within the later MapReduce jobs? (Note that this number will be small, no more than the number of nodes in the cluster, because the dataset is small.)

1.3 How long does each job take? How long does the entire script take?

1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client Grunt, which you can launch with the following command:

$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local

When using Grunt, two commands that you may want to know about are dump and describe. To see a list of other commands, type help.
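
For example, you can paste the first few statements of script1 into Grunt and inspect an intermediate relation. The sketch below follows the tutorial's script1 (the relation names and UDFs come from it); the local paths to excite-small.log and tutorial.jar are assumptions, so adjust them to match your setup.

-- a minimal sketch of inspecting script1's intermediate relations in Grunt
REGISTER ./tutorial.jar;                    -- path to the tutorial UDF jar (assumption)
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
DESCRIBE clean2;   -- prints the schema of clean2
DUMP clean2;       -- runs the pipeline and prints the tuples of clean2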

What you need to turn in:
Required: Submit your answers to problems 1.1 - 1.4 in a file named problem1-answers.txt.

Problem 2: Simple tweak to tutorial

Write a Pig script that creates a histogram showing the distribution of user activity levels.

That is, each point (x, y) that you generate means that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or contain only URLs.

Run this script against the excite.log.bz2 dataset on AWS. You can use any size cluster you like, but at least 5 nodes is recommended. Then copy the results to your local machine and use gnuplot to plot the histogram points.

A few comments to help you get started:
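
Since the starter comments do not appear in this copy, here is a rough sketch of one possible structure (not a required or complete solution). It assumes the excite log is tab-delimited with (user, time, query) columns, as in the tutorial, and that tutorial.jar is registered for the NonURLDetector UDF; the input path is an assumption, so point it at excite.log.bz2 on S3 or at a local copy.

REGISTER ./tutorial.jar;
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);   -- input path is an assumption
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

-- x: the total number of queries issued by each user
by_user = GROUP clean BY user;
user_counts = FOREACH by_user GENERATE group AS user, COUNT(clean) AS x;

-- y: the number of users who each issued exactly x queries
by_x = GROUP user_counts BY x;
histogram = FOREACH by_x GENERATE group AS x, COUNT(user_counts) AS y;

STORE histogram INTO 'problem2-results';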

What you need to turn in:
Required: Submit your Pig program in problem2.pig.
Run your program on excite.log.bz2, and submit both your computed result file (problem2-results.txt) and your PNG plot (problem2-results.png).

Problem 3: Run the script on the much larger astronomy data

The astronomy dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years, from shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken every even timestep. Each snapshot consists of the state of each particle in the simulation.

The following is an example of what the data looks like:

snapshot, time, partId, mass, position0, position1, position2, velocity0, velocity1, velocity2, eps, phi, rho, temp, hsmooth, metals
2, 0.290315, 0, 2.09808e-08, -0.489263, -0.47879, -0.477001, 0.1433, 0.0721365, 0.265767, , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
2, 0.290315, 1, 2.09808e-08, -0.48225, -0.481107, -0.480114, 0.0703595, 0.142529, 0.0118989, , -0.0269804, 1.79008, 662495, 9.6e-06, 0

Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId) and the data tracks the state of particles over time.

We created several files of this data with different sizes. The first two are in the project4.tar.gz archive:

We also have files containing the actual data for 11 snapshots (2, 12, 22, …, 102), named dbtest128g.00<snap_shot_num>.txt.f (~530 MB each). These files are stored in Amazon S3 under the following names:

dbtest128g.00002.txt.f
dbtest128g.00012.txt.f
dbtest128g.00022.txt.f
dbtest128g.00032.txt.f
dbtest128g.00042.txt.f
dbtest128g.00052.txt.f
dbtest128g.00062.txt.f
dbtest128g.00072.txt.f
dbtest128g.00082.txt.f
dbtest128g.00092.txt.f
dbtest128g.00102.txt.f

All these files together can be loaded in Pig scripts on the AWS cluster via:

raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00*' USING PigStorage...;

Or you can load two (or more) files at a time and combine them with the UNION operator (say we want the pair (2, 12)):

raw02 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage....;
raw12 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00012.txt.f' USING PigStorage....;
raw = UNION raw02, raw12;

3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.
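
A minimal skeleton for such a script might look like the following sketch. The 16-field schema follows the sample rows shown above; the comma delimiter and the exact input path are assumptions (substitute medium.txt when testing locally), and if the fields carry whitespace after the commas you may need to strip it before casting.

-- load one snapshot file; the path and delimiter are assumptions
raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage(',')
      AS (snapshot:int, time:float, partId:long, mass:float,
          position0:float, position1:float, position2:float,
          velocity0:float, velocity1:float, velocity2:float,
          eps:float, phi:float, rho:float, temp:float, hsmooth:float, metals:float);

-- count the particles that appear in each snapshot
by_snapshot = GROUP raw BY snapshot;
counts = FOREACH by_snapshot GENERATE group AS snapshot, COUNT(raw) AS num_particles;

STORE counts INTO 'problem3.1-results';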

Then, run the script on a 5-node AWS cluster, using the large dbtest128g datasets stored in S3, for each of the following combinations of snapshots:

  1. timestep 2
  2. timestep 2 and 12
  3. timestep 2, 12, and 22
  4. timestep 2, 12, 22, and 32

For each of the 4 cases above, what is the level of parallelism when running your script? Determine this by launching the script on each group of datasets, and using the job tracker to find the number of concurrently executing map tasks for each job; this is the job's level of parallelism. (You can then immediately cancel the job; you don't need to turn in the job results.) What can you say about the scalability of the system?

What you need to turn in:
Required:

  1. Submit your Pig program in problem3.1.pig.
  2. Submit your answers to the above questions in problem3.1-answers.txt.

3.2 For each pair of consecutive snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles that increased their speed (magnitude of velocity) and the number of particles that decreased their speed. If a particle's speed is unchanged, count it as decelerated.

Your output should have the following format:

2.0, accelerate, 2
2.0, decelerate, 3
12.0, decelerate, 16
22.0, accelerate, 2
...

The first column denotes the start snapshot (int or float), the second column is either accelerate or decelerate, and the third column is the number of particles. It's OK to leave out rows with 0 counts. The results you turn in should be based on the large dbtest128g datasets stored in S3 (e.g., the dbtest128g.00002.txt.f and dbtest128g.00012.txt.f files in the case of (2, 12)).

Some hints:
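
Since no hints survive in this copy, here is a hedged sketch of one possible approach for a single pair, (2, 12); the same pattern repeats for the other pairs. It assumes raw holds the UNION of the two corresponding files loaded with the 16-field schema from Problem 3.1, that SQRT is available as a Pig built-in (on older Pig versions it lives in piggybank), and that a particle with unchanged speed is counted as decelerate, per the problem statement.

-- speed = magnitude of the velocity vector, per particle and snapshot
speeds = FOREACH raw GENERATE snapshot, partId,
         SQRT(velocity0 * velocity0 + velocity1 * velocity1 + velocity2 * velocity2) AS speed;

s02 = FILTER speeds BY snapshot == 2;
s12 = FILTER speeds BY snapshot == 12;

-- pair up the two states of each particle
paired = JOIN s02 BY partId, s12 BY partId;
labeled = FOREACH paired GENERATE
          s02::snapshot AS start_snapshot,
          (s12::speed > s02::speed ? 'accelerate' : 'decelerate') AS change;

-- count particles per (start snapshot, accelerate/decelerate)
grouped = GROUP labeled BY (start_snapshot, change);
result = FOREACH grouped GENERATE FLATTEN(group), COUNT(labeled);

STORE result INTO 'problem3.2-results';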

What you need to turn in:
Required:

  1. Submit your Pig program in problem3.2.pig.
  2. Run it on AWS (any size cluster, suggested size 15 nodes), and submit your computed result file (problem3.2-results.txt).