CSE 544 Homework 3: Hadoop and Pig Latin

Objectives:
To get experience with running data analytics on the cloud.
Assignment tools:
Amazon Web Services (AWS), Pig Latin, Hadoop
Due date:
Thursday, November 19th, 2015, at 11:45pm. Turn it in here.
What to turn in:
See below.
Starter code:
Here.

Amazon AWS account

Please follow the instructions on this page to sign up for an account with free credits that you can use for this assignment. Use your CSE / UW email address when you sign up.

Check the instructions. NOTE: It will take you a good 60 minutes to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud today!

Warmup code

Please download the project archive from here. You will find example.pig in hw3.tar.gz. example.pig is a Pig Latin script that loads the billion triple dataset that we will use in this assignment and parses it into triples of the form (subject, predicate, object). It then groups the triples by their object attribute and sorts the groups in descending order by the number of tuples in each group. Follow the README.txt: it explains how to run the sample program example.pig.
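To give you a rough idea of its structure, here is a minimal sketch of that kind of group-and-count query. This is not the actual example.pig: the input and output paths below are placeholders, and splitting on single spaces is a simplification that mishandles quoted objects containing spaces.

-- Sketch only: placeholder paths, simplified parsing.
raw = LOAD 'btc-2010-chunk-000' USING TextLoader AS (line:chararray);
ntriples = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, ' ', 4)) AS (subject, predicate, object, context);
objects = GROUP ntriples BY object;
count_by_object = FOREACH objects GENERATE group AS object, COUNT(ntriples) AS cnt;
ordered = ORDER count_by_object BY cnt DESC;
STORE ordered INTO '/user/hadoop/example-results';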

Description

In this assignment, we will perform some basic analysis over a large graph, coming from the billion triple dataset, the 2010 version. This is an RDF dataset that contains a billion (give or take a few) triples from the Semantic Web. Some Webpages on the Web have a machine-readable description of their semantics stored as RDF triples: our dataset was obtained by a crawler that extracted all RDF triples from the Web.

RDF data is represented in triples of the form:

subject  predicate  object  [context]

The [context] is not part of the triple, but is sometimes added to tell where the data is coming from. For example, here are two concrete "triples" taken from the file btc-2010-chunk-200 (they are actually "quads" because they have the context too):

<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .

<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .

The first says that the Webpage <http://www.last.fm/user/ForgottenSound> has the nickname "ForgottenSound"; the second describes the maker of another webpage. foaf stands for Friend of a Friend. Confused? Don't worry. You don't need to know what they mean; you just need to analyze the graph.

There were 317 files of 2GB each in the billion triple dataset. We uploaded them to Amazon S3, which took about two days, and there were many errors: only 251 files uploaded correctly, for a total of about 550 GB of data. This is enough for us to play with.

In this assignment, we will compute the out-degree sequence of the graph. The out-degree of a node is the number of edges leaving that node; the out-degree sequence is a scatter plot of points (x,y) where x is an out-degree and y is the number of nodes with that out-degree. It is sometimes called the out-degree histogram.

You will access the datasets in S3 through Pig (using the LOAD command -- see example.pig). Also, please note the following:

Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

In other words, if s3n:// is giving you trouble, try the same URI with s3:// instead.
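For example, if a statement like the first one below gives errors, the same statement with the s3 scheme may work (the bucket and file names here are only illustrations, not necessarily the assignment buckets):

-- If this fails:
raw = LOAD 's3n://some-bucket/some-file' USING TextLoader AS (line:chararray);
-- try this instead:
raw = LOAD 's3://some-bucket/some-file' USING TextLoader AS (line:chararray);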

1. Warmup Question: Compute the Out-Degree Histogram on cse344-test-file, or btc-2010-chunk-000

Using cse344-test-file (or the larger btc-2010-chunk-200), write a Pig script to compute the out-degree sequence.

For example, if the point (x,y)=(30, 372) were in your output, it would mean that 372 subjects had exactly 30 outgoing edges. Recall that a triple is (subject, predicate, object); so to compute the out-degree, you first want to group by subject. Run your script on an AWS cluster and record the MapReduce job information (cluster size, number of MapReduce jobs, runtimes, number of reduce tasks per job).
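If you get stuck on the structure, here is a rough sketch of the two grouping steps in Pig, assuming the triples have already been parsed into (subject, predicate, object) fields as in example.pig; the alias names and output path are placeholders, not a complete solution:

-- Assumes an alias 'ntriples' with fields (subject, predicate, object).
by_subject = GROUP ntriples BY subject;
out_degree = FOREACH by_subject GENERATE group AS subject, COUNT(ntriples) AS degree;
by_degree = GROUP out_degree BY degree;
histogram = FOREACH by_degree GENERATE group AS degree, COUNT(out_degree) AS num_subjects;
STORE histogram INTO '/user/hadoop/hw3-histogram';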

Copy the results to your local machine. Generate a log-log scatter plot of the histogram points, using Excel, Google Charts, OpenOffice, gnuplot, or a similar tool. Save the plot in an image format such as jpeg or png and turn it in.

What to turn in:

Nothing. This was a warmup step.

2. Compute a Histogram on the Entire Dataset

Compute the out-degree histogram from Question 1 on the entire 0.5TB dataset. Use 19 nodes: this is the maximum number allowed (since you are using one node for the master). For bigger clusters, you need to ask explicit permission from Amazon (don't bother...).

You need to modify the load instruction to:

raw = LOAD 's3://uw-cse344' USING TextLoader as (line:chararray);

Note: this query will take more than 4 hours to run (with a high variance!). Plan accordingly, and monitor carefully: if anything looks wrong, abort, fix, restart.

When you are done, appreciate how relatively quick and easy it was to analyze a 0.5TB graph!

What to turn in:

Run your program on the entire dataset (uw-cse344) and submit a hw3.tar.gz tarball with four files:

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is describe. To see a list of other commands, type help.
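For example, a grunt session might look roughly like this (the input file name is a placeholder, and the exact describe output depends on your own aliases):

$ pig                 # starts grunt; add -x local to run against local files
grunt> raw = LOAD 'btc-2010-chunk-000' USING TextLoader AS (line:chararray);
grunt> describe raw;
raw: {line: chararray}
grunt> help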