ESTIMATED TIME: 18 hours.
DUE DATE: Friday, June 5th, 11:45pm
TURN IN INSTRUCTIONS: Turn in eight txt files (details below) to the Catalyst drop box.
HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
PIG: Pig Latin is a declarative, SQL-style language for ad-hoc analysis of extremely large datasets. The motivation for Pig Latin is that many of the people who analyze extremely large datasets are programmers comfortable with MapReduce, yet the MapReduce paradigm is low-level and rigid and leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin is implemented in PIG, a system that compiles Pig Latin into physical plans that are then executed over Hadoop.
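To give a feel for the language, here is a tiny Pig Latin fragment (the file name and field names are invented for illustration only and are not part of this assignment). It loads a tab-separated log, groups records by user, and counts records per group:

raw = LOAD 'example.log' USING PigStorage('\t') AS (user, time, query);
grouped = GROUP raw BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(raw) AS n;
STORE counts INTO 'example-counts';

Each statement names an intermediate relation; PIG compiles the whole chain into one or more MapReduce jobs only when an output is requested (via STORE or dump).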
cd project4
tar -xzf pigtutorial.tar.gz
tar -xzf hadoop-0.17.2.tar.gz
setenv PIGDIR ~/project4/pigtmp
setenv HADOOP ~/project4/hadoop-0.17.2
setenv HADOOPSITEPATH ~/project4/hadoop-0.17.2/conf/
setenv PATH $HADOOP/bin/:$PATH
chmod u+x $HADOOP/bin/hadoop

The bash equivalent:
export PIGDIR=~/project4/pigtmp
export HADOOP=~/project4/hadoop-0.17.2
export HADOOPSITEPATH=~/project4/hadoop-0.17.2/conf/
export PATH=$HADOOP/bin/:$PATH
chmod u+x $HADOOP/bin/hadoop
ssh -D 6789 -L 50128:127.0.0.1:50128 yourusername@64.88.164.202

Be sure to execute this command before using the cluster. Minimize this window and keep it open for as long as you are working; do not close it. You will need it to connect to the cluster, as well as for job tracking.
After you have completed the Preliminaries: Run the first example script in the tutorial on the cluster (script1-hadoop.pig), and then answer the following questions (also see hints below):
1.1 How many MapReduce jobs are generated by script1?
1.2 How many map tasks are within the first MapReduce job? How many maps are within later MapReduce jobs? (Note that this number will be small because the dataset is small.)
1.3 How long does each job take? How long does the entire script take?
1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?
Hint 1: Use the job-tracker at http://JOBTRACKER_IP_ADDRESS:50030/ to see the number of map and reduce tasks for your MapReduce jobs.
Hint 2: To see the schema for intermediate results, you can use Pig's interactive command-line client, Grunt, which you can launch with the following command:
java -cp pig.jar org.apache.pig.Main -x local
When using Grunt, two commands that you may want to know about are dump and describe.
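For example, a hypothetical Grunt session (the LOAD statement mirrors the tutorial's excite-small.log schema; adjust the path and fields to whatever you are actually inspecting) might look like:

grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grunt> describe raw;
grunt> dump raw;

describe prints the schema of a relation, while dump actually executes the plan and prints the tuples, so only dump small relations.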
What you need to turn in:
Required: Submit your answers to problems 1.1 and 1.4 in a file named problem1-answers.txt.
(Although 1.1 is more easily answered on the cluster, you should be able to figure it out by thinking about how the script translates into MapReduce jobs.)
Extra credit: Submit your answers to problems 1.2 and 1.3 in a file named problem1-bonus.txt.
Write a Pig script that creates a histogram showing the distribution of user activity levels.
A point (x, y) on the plot means that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or contain only URLs.
A few comments to help you get started:
chmod +x plot.sh
./plot.sh PIG_RESULTS_FILE

The script generates a PNG image of the plot in your current directory. Your PIG_RESULTS_FILE needs to be tab-separated and have two columns, x and y. The data also needs to be numerically sorted by x. Although it is possible to sort in Pig as well, we recommend that you simply run the Unix command sort -n input > output after your job has completed (by default, sorting in Pig is alphabetic rather than numeric).
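One possible shape for the histogram script is sketched below. This is only a rough outline, assuming the log is tab-separated with (user, time, query) fields as in the tutorial and that tutorial.jar sits in your working directory; paths and alias names are placeholders:

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
-- drop queries that are empty or contain only URLs
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
-- number of queries per user
by_user = GROUP clean BY user;
user_counts = FOREACH by_user GENERATE group AS user, COUNT(clean) AS num_queries;
-- number of users at each activity level
by_level = GROUP user_counts BY num_queries;
histogram = FOREACH by_level GENERATE group AS num_queries, COUNT(user_counts) AS num_users;
STORE histogram INTO 'problem2-out';

The default STORE output is tab-separated, which is what plot.sh expects; remember to sort the result numerically with sort -n before plotting, as described above.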
What you need to turn in:
Required: Submit your pig program in problem2.txt. Run your program on excite-small.log,
and submit your computed result file (named problem2-results.txt), and your PNG plot (named problem2-results.png).
Extra credit: Run your program on excite.log.bz2, and submit your
computed result file (problem2-bonus-results.txt), and your PNG plot (problem2-bonus-results.png).
The astro dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years, from shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken at every even timestep. Each snapshot records the state of every particle in the simulation.
The following is an example of what the data looks like:
snapshot, #Time, partId, #mass, position0, position1, position2, velocity0, velocity1, velocity2, eps, phi, rho, temp, hsmooth, metals
2, 0.290315, 0, 2.09808e-08, -0.489263, -0.47879, -0.477001, 0.1433, 0.0721365, 0.265767, , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
2, 0.290315, 1, 2.09808e-08, -0.48225, -0.481107, -0.480114, 0.0703595, 0.142529, 0.0118989, , -0.0269804, 1.79008, 662495, 9.6e-06, 0

Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId), and the data tracks the state of particles over time.
We created three files of such data with different sizes. tiny.txt has only a couple of rows and might be useful while writing the script. medium.txt is about 1.3 MB and might be useful when testing your script locally. Finally, large.txt is 6 GB. All files contain data from 11 snapshots (2, 12, 22, ..., 102). tiny.txt and medium.txt are contained in project4.tar.gz; large.txt is available on the cluster (in /user/nodira).
3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.
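A minimal sketch of such a script, assuming the fields are comma-separated and appear in the order shown in the example above (file and output paths are placeholders):

particles = LOAD 'medium.txt' USING PigStorage(',')
    AS (snapshot, time, partId, mass, pos0, pos1, pos2, vel0, vel1, vel2, eps, phi, rho, temp, hsmooth, metals);
-- one group per snapshot, then count the particles in each group
by_snap = GROUP particles BY snapshot;
counts = FOREACH by_snap GENERATE group AS snapshot, COUNT(particles) AS num_particles;
STORE counts INTO 'problem3.1-out';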
For each of the following datasets (available on the cluster), what is the level of parallelism when running your script? Use the web interface to determine the number of concurrently executing map tasks for each dataset; this is the level of parallelism. What can you say about the scalability of the system?
Run the script on these files (available on the cluster):
large-2.txt (contains only the snapshot at timestep 2)
large-2-12.txt (contains only snapshots at timesteps 2 and 12)
large-2-12-22.txt (contains only snapshots at timesteps 2, 12, 22)
large-2-12-22-32.txt (contains only snapshots at timesteps 2, 12, 22, 32)
For each of the four datasets, launch your script on the cluster and check in the web interface how many map tasks are created. You can then cancel your script right away.
What you need to turn in:
Required: Submit your pig program in problem3.1.txt.
Extra credit: Submit your answers to the above questions in problem3.1-answers.txt.
3.2 For each pair of subsequent snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles that increased their velocity and the number of particles that reduced their velocity. If a particle neither increased nor decreased its velocity, count it as decelerate.
Your output should have the following format:

2.0, accelerate, 2
2.0, decelerate, 3
12.0, decelerate, 16
22.0, accelerate, 2
...

The first column denotes the start snapshot (int or float), the second column accelerate or decelerate, and the third column a count of the number of particles. It's ok to leave out rows with 0 counts. The results you turn in should be based on the medium.txt dataset.
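One possible approach, sketched below, is to compute the squared speed of each particle in each snapshot, then pair every snapshot with the following one by shifting the snapshot number down by 10 and joining on (snapshot, partId). This is only a sketch: it assumes the file is comma-separated, that consecutive snapshots are exactly 10 timesteps apart (2, 12, ..., 102), and that your Pig version supports typed schemas and the bincond (?:) operator; paths and alias names are placeholders.

particles = LOAD 'medium.txt' USING PigStorage(',')
    AS (snapshot:double, time, partId, mass, pos0, pos1, pos2,
        vel0:double, vel1:double, vel2:double, eps, phi, rho, temp, hsmooth, metals);
-- squared speed is enough: comparing squared speeds gives the same ordering as comparing speeds
cur = FOREACH particles GENERATE snapshot, partId, vel0*vel0 + vel1*vel1 + vel2*vel2 AS speed2;
-- relabel each snapshot with the number of the snapshot 10 timesteps earlier, so (2,12), (12,22), ... line up
nxt = FOREACH particles GENERATE snapshot - 10.0 AS snapshot, partId, vel0*vel0 + vel1*vel1 + vel2*vel2 AS speed2;
pairs = JOIN cur BY (snapshot, partId), nxt BY (snapshot, partId);
-- per the problem statement, ties count as decelerate
labeled = FOREACH pairs GENERATE cur::snapshot AS snapshot,
    (nxt::speed2 > cur::speed2 ? 'accelerate' : 'decelerate') AS change;
grouped = GROUP labeled BY (snapshot, change);
result = FOREACH grouped GENERATE FLATTEN(group), COUNT(labeled) AS num_particles;
-- write comma-separated output to match the format shown above
STORE result INTO 'problem3.2-out' USING PigStorage(',');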
What you need to turn in:
Required: Submit your pig program in problem3.2.txt. Run your program on medium.txt,
and submit your computed result file (named problem3.2-results.txt).
Extra credit: Run your program on large-2-12.txt, and submit your
computed result file (problem3.2-bonus-results.txt).