CSE 544 Homework 3: Hadoop and Pig Latin

Objectives:
To get experience with running data analytics on the cloud.
Assignment tools:
Amazon Web Services (AWS), Pig Latin, Hadoop
Due date:
Thursday, November 19th, 2015, at 11:45pm. Turn it in here.
What to turn in:
See below.
Starter code:
Here.

Amazon AWS account

Please follow the instructions on this page to sign up for an account with free credits that you can use for this assignment. Use your CSE / UW email address when you sign up.

Check the instructions. NOTE: It will take you a good 60 minutes to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud today!

Warmup code

Please download the project archive from here. You will find example.pig in hw3.tar.gz. example.pig is a Pig Latin script that loads the billion triple dataset that we will use in this assignment and parses it into triples of the form (subject, predicate, object). It then groups the triples by their object attribute and sorts the groups in descending order by the number of tuples in each group. Follow the README.txt: it explains how to run the sample program example.pig.
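To give you a rough idea of its structure, here is a minimal sketch of that kind of group-and-count query. This is not the actual example.pig: the input and output paths below are placeholders, and splitting on single spaces is a simplification that mishandles quoted objects containing spaces.

-- Sketch only: placeholder paths, simplified parsing.
raw = LOAD 'btc-2010-chunk-000' USING TextLoader AS (line:chararray);
ntriples = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, ' ', 4)) AS (subject, predicate, object, context);
objects = GROUP ntriples BY object;
count_by_object = FOREACH objects GENERATE group AS object, COUNT(ntriples) AS cnt;
ordered = ORDER count_by_object BY cnt DESC;
STORE ordered INTO '/user/hadoop/example-results';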

Description

In this assignment, we will perform some basic analysis over a large graph, coming from the billion triple dataset, the 2010 version. This is an RDF dataset that contains a billion (give or take a few) triples from the Semantic Web. Some Webpages on the Web have a machine-readable description of their semantics stored as RDF triples: our dataset was obtained by a crawler that extracted all RDF triples from the Web.

RDF data is represented in triples of the form:

subject  predicate  object  [context]

The [context] is not part of the triple, but is sometimes added to tell where the data is coming from. For example, here are two concrete "triples" taken from the file btc-2010-chunk-200 (they are actually "quads" because they have the context too):

<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .

<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .

The first says that the Webpage <http://www.last.fm/user/ForgottenSound> has the nickname "ForgottenSound"; the second describes the maker of another webpage. foaf stands for Friend of a Friend. Confused? Don't worry. You don't need to know what they mean; you just need to analyze the graph.

There were 317 files of 2GB each in the billion triple dataset. We uploaded them to Amazon S3, which took about two days, and there were many errors: only 251 files uploaded correctly, for a total of about 550 GB of data. This is enough for us to play with.

In this assignment, we will compute the out-degree sequence of the graph. The out-degree of a node is the number of edges leaving that node; the out-degree sequence is a scatter plot of points (x,y) where x is an out-degree and y is the number of nodes with that out-degree. It is sometimes called the out-degree histogram.

You will access the datasets in S3 through Pig (using the LOAD command -- see example.pig). Also, please note the following:

Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

In other words, if s3n:// is giving you trouble, try the same URI with s3:// instead.
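For example, if a statement like the first one below gives errors, the same statement with the s3 scheme may work (the bucket and file names here are only illustrations, not necessarily the assignment buckets):

-- If this fails:
raw = LOAD 's3n://some-bucket/some-file' USING TextLoader AS (line:chararray);
-- try this instead:
raw = LOAD 's3://some-bucket/some-file' USING TextLoader AS (line:chararray);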

1. Warmup Question: Compute the Out-Degree Histogram on cse344-test-file, or btc-2010-chunk-000

Using cse344-test-file (or the larger btc-2010-chunk-200), write a Pig script to compute the out-degree sequence.

For example, if the point (x,y)=(30, 372) were in your output, it would mean that 372 subjects had exactly 30 outgoing edges. Recall that a triple is (subject, predicate, object); so to compute the out-degree, you first want to group by subject. Run your script on an AWS cluster and record the MapReduce job information (cluster size, number of MapReduce jobs, runtimes, number of reduce tasks per job).
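If you get stuck on the structure, here is a rough sketch of the two grouping steps in Pig, assuming the triples have already been parsed into (subject, predicate, object) fields as in example.pig; the alias names and output path are placeholders, not a complete solution:

-- Assumes an alias 'ntriples' with fields (subject, predicate, object).
by_subject = GROUP ntriples BY subject;
out_degree = FOREACH by_subject GENERATE group AS subject, COUNT(ntriples) AS degree;
by_degree = GROUP out_degree BY degree;
histogram = FOREACH by_degree GENERATE group AS degree, COUNT(out_degree) AS num_subjects;
STORE histogram INTO '/user/hadoop/hw3-histogram';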

Copy the results to your local machine. Generate a log-log scatter plot of the histogram points, using Excel, Google Charts, OpenOffice, gnuplot, or a similar tool. Save the plot in an image format such as jpeg or png and turn it in.

What to turn in:

Nothing. This was a warmup step.

2. Compute a Histogram on the Entire Dataset

Compute the out-degree histogram from Question 1 on the entire 0.5TB dataset. Use 19 nodes: this is the maximum number allowed (since you are using one node for the master). For bigger clusters, you need to ask explicit permission from Amazon (don't bother...).

You need to modify the load instruction to:

raw = LOAD 's3://uw-cse344' USING TextLoader as (line:chararray);

Note: this query will take more than 4 hours to run (with a high variance!). Plan accordingly, and monitor carefully: if anything looks wrong, abort, fix, restart.

When you are done, appreciate how relatively quick and easy it was to analyze a 0.5TB graph!

What to turn in:

Run your program on the entire dataset (uw-cse344) and submit a hw3.tar.gz tarball with four files:

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is describe. To see a list of other commands, type help.
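For example, a grunt session might look roughly like this (the input file name is a placeholder, and the exact describe output depends on your own aliases):

$ pig                 # starts grunt; add -x local to run against local files
grunt> raw = LOAD 'btc-2010-chunk-000' USING TextLoader AS (line:chararray);
grunt> describe raw;
raw: {line: chararray}
grunt> help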