Homework 6: Hadoop and Pig

Due date: Tuesday, May 31st

ESTIMATED TIME: Up to 10 hours, though likely less.

AMAZON CODES: Check your AWS code.

AWS SETUP: Check instructions.

WARMUP CODE: Download the project archive, hw6.tar.gz. Follow README.txt: it instructs you on how to run a sample program..

TURN IN INSTRUCTIONS: Turn in eleven files to the Catalyst dropbox.

Useful Links

Pig Latin wiki page

Homework Description

The billion triple dataset is an RDF dataset that that contains a billion (add or take a few) triples from the Semantic Web. Some Webpages on the Web have a machine-readable description of their semantics stored as RDF triples: our dataset was obtained by a crawler that extracted all RDF triples from the Web.

RDF data is represented in triples of the form:

		subject  predicate  object  [context]

The [context] is not part of the triple, but is sometimes added to tell where the data is coming from. For example, file btc-2010-chunk-200 contains the two "triples" (they are actually "quads" because they have the context too):

<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .
<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .

The first says that Webpage <http://www.last.fm/user/ForgottenSound> has the nickname "ForgottenSound"; the second describes the maker of another webpage. foaf stands for Friend of a Friend. Confused ? You don't need to know what they mean; some of the many triples refer to music, http://dbtune.org, others refer to company relationships, etc. For our purpose, these triples are just a large collection of triples. There were 317 2GB files in the billion triple dataset when we downloaded it. We uploaded them to Amazon's Web Services in S3: there were some errors, and only 251 uploaded correctly, for a total of about 550 GB of data.

You will access the following datasets in S3, throught pig (using the LOAD command -- see example.pig)

s3n://uw-cse344-test/cse344-test-file -- 250KB. This is used in example.pig

s3n://uw-cse344/btc-2010-chunk-000 -- 2GB. You will use this dataset in questions 1, 2, 3..

s3n://uw-cse344 -- 0.5TB. This directory contains 251 files btc-2010-chunk-000 to btc-2010-chunk-317 (since only 251 of the original 318 files uploaded correctly). You will use this in question 4.

Problem 1: Getting started with Pig, on chunk-000

Note: You will need to create directories and copy data from the Hadoop filesystem for all problems. You can find instructions to do this here.

Modify example.pig to use btc-2010-chunk-000 file instead of cse344-test-file. Run on an AWS cluster with 10 nodes, and answer the following questions (also see hints below):

1.1 How many MapReduce jobs are generated by example.pig?

1.2 How many reduce tasks are within the first MapReduce job? How many reduce tasks are within later MapReduce jobs?

1.3 How long does each job take? How long does the entire script take?

1.4 What is the schema of the tuples after each of the following commands in example.pig?

After the command ntriples = ...
After the command objects = ...
After the command count_by_object = ...

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is describe . To see a list of other commands, type help.

What you need to turn in:
Run your program on btc-2010-chunk-000, and submit your answers to problems 1.1 - 1.4 in a file named problem1-answers.txt.

Problem 2: Compute a Histogram on chunk-000

Using the 'btc-2010-chunk-000' file, write a Pig script that groups tuples by the subject column, and creates/stores histogram data showing the distribution of counts per subject, then generate a scatter-plot of this histogram. The histogram consists of:

The x-axis is the counts associated with the subjects, and
The y-axis is the total number of subjects associated with each particular count.

So, for each point (x,y) that we generate, we mean to say that y subjects each had x tuples associated with them after we group by subject. Run your script on an AWS cluster with 5 nodes, and record the mapreduce jobs information (# MapReduce jobs, runtimes, # reduce tasks per job). Copy the results to your local machine. Generate a log-log scatter-plot graph, using either excel or gnuplot to plot the histogram points (we recommend excel). Save, and turn in, the plot in some image format, e.g. jpeg or png.

A few comments to help you get started:

To get more familiar with the Pig Latin commands, we suggest that you also take a look at the Pig Latin Wiki Page.
Copying files. After you run your job on the cluster, you will need to copyToLocal (i.e. move the files to your local directory) and "cat" together all files named part-*. Once the results have been cat-ed together, copy them back to your local machine, see instructions.
Generating the plot. If you use excel, then: (a) import the tab-separated text file in excel, (b) generate a scatter-plot, (c) click on each axis and make it logarithmic. If you use gnuplot, we have prepared a script which makes it easier for you to run gnuplot. Use the files plot.sh and plot.gnu as follows:
```
chmod +x plot.sh
./plot.sh PIG_RESULTS_FILE
    
```
The script generates a PNG image of the plot in your current directory. Your PIG_RESULTS_FILE needs to be tab-separated and have two columns, x and y. The data also needs to be (numerically) sorted by x. Although it is also possible to sort using Pig, we recommend that you simply run Unix' sort -n input > output after your job has completed (by default sorting in Pig is alphabetic).

Note: this script took about 30 minutes with 5 nodes.

What you need to turn in:
Run your program on btc-2010-chunk-000, and submit four files: (a) your Pig program in problem2.pig. (b) your scatter-plot in problem2.png, or problem2.jpeg (or some other picture format), (c) your computed result file (problem2-results.txt), (d) your MapReduce jobs information (problem2-answers.txt).

Problem 3: Compute a Join on chunk-000

Use the file 'btc-2010-chunk-000'. In this problem we will consider the subgraph consisting of triples whose subject matches rdfabout.com: for that, filter on subject matches '.*rdfabout\\.com.*'. Find all chains of lengths 2 in this subgraph. More precisely, return all sextuples (subject, predicate, object, subject2, predicate2, object2) where object=subject2. Suggestions on how to proceed:

First filter the data so you only have tuples whose subject matches 'rdfabout.com'.
Make another copy of the filtered collection (it's best to re-label the subject, predicate, and objects, for example to subject2, predicate2, object2).
Now join the two copies:
- the first copy of the 'rdfabout.com' collection should match on object.
- the second copy of the 'rdfabout.com' collection should match on subject2.
Remove duplicate tuples from the result of the join
Order the results by the predicate from the first copy

Run this script on an AWS cluster with as many nodes as you like. Add a comment to your pig script describing the number of nodes you used and how long it took your script to run.

Note: this script took about 25 minutes with 10 nodes.

What you need to turn in:
Run your program on btc-2010-chunk-000, and submit two files: (a) your Pig program in problem3.pig, and (b) your computed result file (problem3-results.txt).

Problem 4: Compute a Histogram on the Entire Dataset

Compute the histogram in Problem 2 on the entire dataset. Use 20 instances.

You need to modify the load instruction to:

raw = LOAD 's3n://uw-cse344' USING TextLoader as (line:chararray);

Note: this query will take more than 4 hours to run. Plan accordingly, and monitor carefully: if anything looks wrong, abort, fix, restart.

What you need to turn in:
Run your program on the entire dataset uw-cse344 and submit four files: (a) your Pig program in problem4.pig. (b) your scatter-plot in problem4.png, or problem4.jpeg (or some other picture format), (c) your computed result file (problem4-results.txt), (d) your MapReduce jobs information (problem4-answers.txt).