Project 4: Hadoop and PIG

ESTIMATED TIME: 18 hours.

DUE DATE: Wednesday, August 19 at 11 pm. No late assignments will be accepted, even if you have late days remaining because it is the end of the summer quarter.

TURN IN INSTRUCTIONS: Turn in eight txt files (details below) to the Catalyst drop box.

UPDATE 8/11: Because of crunch time at the end of the short summer quarter, we're simplifying things a little:

You may work with others in a small group (2-4) on this project. If you do, one member of the group should turn in a single project with everyone's name on it and all members of the group will receive the same score. If you work with a group, you should also include a short readme.txt file listing the members of the group and giving a short summary of who did what. Everyone in the group is responsible for the material regardless of how you organize the work.
You only need to get the code running locally, not on the cluster. That gives a basic introduction to this style of programming, which is the core of the assignment. Anything that you get running on the cluster counts as extra credit, and you are encouraged to do as much as you can, but it is not required.

HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

PIG LATIN: Pig Latin is a language for the analysis of extremely large datasets. The motivation for Pig Latin is the fact that many people who analyze [extremely large datasets] are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The map-reduce paradigm was a success with these programmers, but it is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. So Pig Latin aims for the "sweet spot" between the declarative style of SQL and the procedural style of map-reduce. Pig Latin: A Not-So-Foreign Language for Data Processing
Pig Latin is implemented in PIG, a system which compiles Pig Latin into physical plans that are then executed over Hadoop.

Preliminaries

To get more familiar with MapReduce and PIG, we recommend that you first skim through the following two research papers:
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004
- C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD, 2008
You will need access to the Hadoop cluster. Follow the instructions here to get access to our department's IBM/Google cluster. You will create a new account to use for the cluster. We will refer to the login name for this account as hadoopuser and we will refer to the login name for you UW account as uwuser. You never type in either of these, you replace them with your own uw and hadoop user names. Your hadoop login name can be the same as your uw id, but it will be visible on the cluster, which is run by organizations outside of UW CSE. Use a different hadoop login name if that is a concern. In the instructions below, we will point out the extra steps you need to take if they are not the same.
The following instructions assume you are working on a Linux machine in a bash shell window.
Download project4.tar.gz to your home directory, and unzip it. (Warning: this file is about 20MB, so you want to have a fast connection when you grab it.) This will create a directory called project4. Henceforth, we assume that you have, indeed, downloaded the file to your home directory. Change directory into project4, and unzip pig and hadoop.
```
cd project4
tar -xzf pigtutorial.tar.gz
tar -xzf hadoop-0.17.2.tar.gz
```

Make sure the hadoop script is executable:

chmod u+x ~/project4/hadoop-0.17.2/bin/hadoop

Set a few environment variables. You will need them later. You must set these environment variables EACH time, OR you can put these commands in the .bashrc, .profile, or other appropriate configuration file in your home directory.
```
export PIGDIR=~/project4/pigtmp
export HADOOP=~/project4/hadoop-0.17.2
export HADOOPSITEPATH=~/project4/hadoop-0.17.2/conf/
export PATH=$HADOOP/bin/:$PATH
```
The variable JAVA_HOME should be set to point to your system's Java directory. On the PC Lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
Important: To succesfully connect to the cluster, you will need to use SSH to open a SOCKS proxy to the cluster through a gateway. Open a new terminal (i.e. command line prompt), and run the following:
```
ssh -D 6789 -L 50128:127.0.0.1:50128 hadoopuser@univgw.hipods.ihost.com
```
Be sure to execute this command before using the cluster. Minimize this window, and keep it open forever! Don't close it. You will need it to connect to the cluster, as well as for job-tracking.
To create your directory on the hadoop cluster, run the command:
```
hadoop dfs -mkdir /user/hadoopuser
```
Do this on your local machine (i.e. not in the terminal window you opened for Step 5). Do a listing of this directory with the command:
```
hadoop dfs -ls /user/hadoopuser
```
Note the owner of the file. Then check to see where your default directory is on the cluster by running the command:
```
hadoop dfs -ls .
```
You will probably see /user/uwuser listed. If this uwuser is not the same as hadoopuser, then every time a command or script that is run on the cluster uses a relative path you will have to substitute an absolute path. This applies to the scripts in the Pig Tutorial
Read and complete the Pig Tutorial. Skip over all the installation instructions (you already did that above). Go straight to the section entitled "Pig Scripts: Local Mode". Try to run Pig Script 1 in both local mode and hadoop mode. If you have any problems at this point, please contact a TA or talk to your classmates.
A complication involves the scripts that are meant to be run on the hadoop cluster. The commands that specify input and output files use relative addresses. If your hadoopuser name is different from your uwuser name, then you have to put in absolute paths so these files will be in your /user/hadoopuser/ directory.

The following command is shown on the tutorial:
```
hadoop fs –copyFromLocal excite.log.bz2 .
```
For one thing, there is an "ndash" character instead of a "dash. If you cut and paste, be sure to replace that character. For another thing, if your hadoop user name is not your uw user name, you will have to change the command to:
```
hadoop fs -copyFromLocal excite.log.bz2 /user/hadoopuser/
```
This is because the "." character (dot) is a relative address that your hadoop program will fill in by referring to your uwuser name. The Pig scripts also use relative addresses that you will have to change if your hadoop name does not match your uw name.

In script1-hadoop.pig you would change:
```
raw = LOAD 'excite.log.bz2' USING PigStorage....
```
to
```
raw = LOAD '/user/hadoopuser/excite.log.bz2' USING PigStorage....
```
Also, in the same script, the line:
```
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...
```
becomes
```
STORE ordered_uniq_frequency INTO '/user/hadoopuser/script1-hadoop-results' USING ...
```
In all your scripts that run on the cluster you will have to use absolute path names.

Also, be aware that the hadoop cluster is a complex environment with many users. A command that fails once may run the next time, even though you have not changed anything.
Set up FoxyProxy, so that you can view the progress of your hadoop jobs with a webtool. Make sure that your SSH connection is still open. To set up FoxyProxy, you must:
1. Download and install FoxyProxy from here .
2. Select menu item "Tools" > "FoxyProxy" > "Options".
3. In the "Global Settings", check "Use SOCKS for DNS lookups".
4. Back in the tab "Proxies", click "Add New Proxy".
5. In "General" tab, make sure the proxy is enabled. Give it a name, e.g., "Google/IBM cluster"
6. In "Proxy Details" tab, select "Manual Proxy Configuration", enter "127.0.0.1" for "Host Name", 6789 for "Port", check "SOCKS Proxy?", and make sure "SOCKS v5" is selected.
7. Take a look at hadoop-0.17.2/conf/hadoop-site.xml. Find the ip adddress for the job-tracker (see the "mapred.job.tracker" property). Remember to remove the port number at the end of the ip address when you copy & paste it.
8. In "URL Patterns" tab, click "Add New Pattern". Given it a name (e.g., "default"), and enter the pattern "http://JOBTRACKER_IP_ADDRESS:*/*". Substitute the real IP address.
9. Click "OK" to finish setting up the proxy.
10. Change "Mode" to "Use proxies based on their pre-defined patterns and priorities".
To view the JobTracker, simply go to: http://JOBTRACKER_IP_ADDRESS:50030/. If this does not work, double-check that you are still running the SSH SOCKS proxy (see step 5).

Problem 1: Getting started with PIG

After you have completed the Preliminaries: Run the first example script in the tutorial on the cluster (script1-hadoop.pig), and then answer the following questions (also see hints below):

1.1 How many MapReduce jobs are generated by script1?

1.2 How many map tasks are within the first MapReduce job? How many maps are within later MapReduce jobs? (Note that this number will be small because the dataset is small.)

1.3 How long does each job take? How long does the entire script take?

1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?

Hint 1: Use the job-tracker at http://JOBTRACKER_IP_ADDRESS:50030/ to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use PIG's interactive command line client Grunt, which you can launch with the following command
java -cp pig.jar org.apache.pig.Main -x local
When using grunt, two commands that you may want to know about are dump and describe.

What you need to turn in:
Required: Submit your answers to problems 1.1 and 1.4 in a file named problem1-answers.txt . (Although 1.1 is easier done on the cluster, you should be able to figure it out by thinking about how it translates into Map-Reduce jobs.)

Extra credit: Submit your answers to problems 1.2 and 1.3 in a file named problem1-bonus.txt .

Problem 2: Simple tweak to tutorial

Write a pig script that creates a histogram showing the distribution of user activity levels.
- The x-axis is the total number of requests for a user, and
- The y-axis is the total number of users that made this number of requests.
Use gnuplot to plot the results.

So, for a point (x,y) that we plot, we mean to say that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or only contain URLs.

A few comments to help you get started:
- Always start locally on a small dataset such as excite-small.log. You don't need to use Hadoop to write your program.
- To get more familiar with the PIG Latin commands, we suggest that you also take a look at the PigLatin Wiki.
- After you run your job on the cluster, you will need to copyToLocal (i.e. move the files to your local directory) and "cat" together all files named part-....
- We have already prepared a script which makes it easier for you to run gnuplot. Use the files plot.sh and plot.gnu as follows:
```
	chmod +x plot.sh
	./plot.sh PIG_RESULTS_FILE
	
```
  The script generates a PNG image of the plot in your current directory. Your PIG_RESULTS_FILE needs to be tab-separated and have two columns, x and y. The data also needs to be (numerically) sorted by x. Although it is also possible to sort using PIG, we recommend that you simply run Unix' sort -n input > output after your job has completed (by default sorting in PIG is alphabetic).
What you need to turn in:
Required: Submit your pig program in problem2.txt. Run your program on excite-small.log, and submit your computed result file (named problem2-results.txt), and your PNG plot (named problem2-results.png).

Extra credit: Run your program on excite.log.bz2, and submit your computed result file (problem2-bonus-results.txt), and your PNG plot (problem2-bonus-results.png).
Problem 3: Script on the much larger astronomy data

The astro dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken every even timestep. Each snapshot consists of the state of each particle in the simulation.

The following is an example of what the data looks like:
```
snapshot, #Time   , partId, #mass,        position0, position1, position2, velocity0, velocity1, velocity2, eps, phi,        rho,     temp,    hsmooth, metals
2,        0.290315, 0,      2.09808e-08, -0.489263,  -0.47879,  -0.477001, 0.1433,    0.0721365, 0.265767,     , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
2,        0.290315, 1,      2.09808e-08, -0.48225,   -0.481107, -0.480114, 0.0703595, 0.142529,  0.0118989,    , -0.0269804, 1.79008, 662495,  9.6e-06, 0
```
Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId) and the data tracks the state of particles over time.

We created three files of such data with different sizes. tiny.txt has only a couple of rows and might be useful when writing the script. medium.txt has about 1.3 MB and might be useful when testing your script locally. Finally, we have large.txt which has 6 GB. All files contain data from 11 snapshots (2, 12, 22, .., 102). tiny.txt, and medium.txt are contained in project4.tar.gz. large.txt is available on the cluster (in /user/nodira).

3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.

For each of the following datasets (available on the cluster), what is the level of parallelism when running your script (Use the web interface to determine the number of concurrently executing map tasks for each dataset; this is the level of parallelism)? What can you say about the scalability of the system?

Run the script on these files (available on the cluster):
large-2.txt (contains only snapshot at timestep 2)
large-2-12.txt (contains only snapshots at timesteps 2 and 12)
large-2-12-22.txt (contains only snapshots at timesteps 2, 12, 22)
large-2-12-22-32.txt (contains only snapshots at timesteps 2, 12, 22, 32)

For each of the 4 datasets, launch your script on the cluster and check in the web interface how many map tasks are created. You can then immediately cancel your script again.

What you need to turn in:
Required: Submit your pig program in problem3.1.txt.

Extra credit: Submit your answers to the above questions in problem3.1-answers.txt.

3.2 For each pair of subsequent snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles which increased their velocity, and the number of particles which reduced their velocity. If the particle neither increased nor decreased their velocity, count it as decelerate.
Your output should have the following format
```
2.0, accelerate, 2
2.0, decelerate, 3
12.0, decelerate, 16
22.0, accelerate, 2
...
```
The first column denotes the start snapshot (int or float), the second column accelerate or decelerate, and the third column a count of the number of particles. It's ok to leave out rows with 0 counts. The results you turn in should be based on the medium.txt dataset.

Hints:
- Again, start locally on the tiny.txt dataset.
- For this problem, you will need to use a variety of PIG commands. Look them up here.
- velocity0-2 refer to the x,y,z dimensions of velocity, so you can determine the magnitude by the Euclidean distance formula.
What you need to turn in:
Required: Submit your pig program in problem3.2.txt. Run your program on medium.txt, and submit your computed result file (named problem3.2-results.txt).

Extra credit: Run your program on large-2-12.txt, and submit your computed result file (problem3.2-bonus-results.txt).

Project 4: Hadoop and PIG

Preliminaries

Problem 1: Getting started with PIG

Problem 2: Simple tweak to tutorial

Problem 3: Script on the much larger astronomy data