CSE 544 Homework 2: Database Systems

Objectives:: To get familiar with the internals of a database system, parallel query environment, and to practice database statistics.
Assignment tools:: Java, Amazon Web Services (AWS)
Due date:: Sunday, May 6th, 2012, at 11:59pm. Turn it in here.
What to turn in:: Specific instructions are given for Part A, Part B, and Part C (in red).

Part A: SimpleDB

Acknowledgement

This assignment comes from Prof. Sam Madden's 6.830 class at MIT and from Prof. Magda Balazinksa's CSE544-Autumn 2009 class at University of Washington.

Turn-in Instructions

Please submit your code (see below), along with a short README file that describes your approach. You should:

Describe any design decisions you made, including your choice of page eviction policy. If you used something other than a nested-loops join, describe the tradeoffs of the algorithm you chose.
Discuss and justify any changes you made to the API.
Describe any missing or incomplete elements of your code.
Describe how long you spent on the lab, and whether there was anything you found particularly difficult or confusing.

To submit your code, please create a hw2_A.tar.gz tarball (such that, untarred, it creates a 544-hw2/src/simpledb directory with your code) and submit it to the dropbox. We will use the latest version you submit that arrives before the deadline. Please also attach the README as a PDF or a text file.

Grading

60% of your grade on this problem will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.

Important: before testing, we will replace your build.xml, HeapFileEncoder.java (you need not worry about this for this particular assignment), and the entire contents of the test/ directory with our version of these files! This means you cannot change the format of .dat files! You should therefore be careful changing our APIs. This also means you need to test whether your code compiles with our test programs. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:

$ tar zxvf 544-hw2.tar
$ cd ./544-hw2
[replace build.xml, HeapFileEncoder.java, and test]
$ ant test
$ ant systemtest
[additional tests]

An additional 40% of your grade will be based on our subjective evaluation of your code (a good README file will help here).

Assignment goal

In this assignment, you will write parts of a basic database management system called SimpleDB. In this lab assignment, you will write a set of operators for SimpleDB to implement selections, joins, and aggregates. Additionally, the system, as provided to you has a very simple buffer pool management policy that does not deal with the problem that arises when we reference more pages than we can fit in memory over the lifetime of the database. In this assignment, you will also design an eviction policy to flush stale pages from the buffer pool. We will not ask you to add transactions, locking, and concurrent queries because quarters are so short. However, we invite you to think how you would add such functionality into the system (after we cover transactions in class, later in this quarter).

SimpleDB is written in Java. We have provided you with a set of mostly unimplemented classes and interfaces. You will need to write the code for these classes. We will grade your code by running a set of system tests written using JUnit. We have also provided a number of unit tests, which we will not use for grading but that you may find useful in verifying that your code works.

The extra credit questions ask you to implement table modifications (e.g., insert and delete records).

You do not need to implement transactions or locking in this lab.

The remainder of this document describes the basic architecture of SimpleDB, gives some suggestions about how to start coding, and discusses how to hand in your assignment.

We strongly recommend that you start as early as possible on this assignment. It requires you to write a fair amount of code!

Extra Credit

The parts labeled Extra credit will count for an additional 5%. Please work on them only if you have spare time!

Important: The extra credit questions appear in the middle of the assignment instructions. Make sure to read past these questions even if you don't do the extra credit!!

0. Find bugs, be patient, earn candy bars

SimpleDB is a relatively complex piece of code. It is very possible you are going to find bugs, inconsistencies, and bad, outdated, or incorrect documentation, etc.

We ask you, therefore, to do this assignment with an adventurous mindset. Don't get mad if something is not clear, or even wrong; rather, try to figure it out yourself or send us a friendly email. We promise to help out by posting bug fixes, new tarballs, etc., as bugs and issues are reported.

And if you find a bug in our code, we'll give you a candy bar from the Benson store (see “submitting a bug” at the end of the instructions)!

1. Getting started

These instructions are written for any Unix-based platform (e.g., Linux, MacOS, etc.). Because the code is written in Java, it should work under Windows as well, though the directions in this document may not apply.

We have included Section 1.2 on using the project with Eclipse.

Download the code from here and untar it. For example:

$ wget http://www.cs.washington.edu/education/courses/544/11wi/assign2/hw2.tgz
$ tar zxvf hw2.tgz
$ cd 544-hw2

SimpleDB uses the Ant build tool to compile the code and run tests. Ant is similar to make, but the build file is written in XML and is somewhat better suited to Java code. Most modern Linux distributions include Ant.

To help you during development, we have provided a set of unit tests in addition to the end-to-end tests that we use for grading. These are by no means comprehensive, and you should not rely on them exclusively to verify the correctness of your project.

To run the unit tests use the test build target:

$ cd 544-hw2
$ # run all unit tests
$ ant test
$ # run a specific unit test
$ ant runtest -Dtest=TupleTest

You should see output similar to:

# build output...
test:
 [junit] Running simpledb.TupleTest
 [junit] Testsuite: simpledb.TupleTest
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.036 sec
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.036 sec
# ... stack traces and error reports ...

The output above indicates that no errors occurred during compilation; this is because this part of the code has been already implemented for you. As you complete parts of the assignment, you will work towards passing additional unit tests. If you wish to write new unit tests as you code, they should be added to the test/simpledb directory.

For more details about how to use Ant, see the manual. The Running Ant section provides details about using the ant command. However, the quick reference table below should be sufficient for working on the assignments.

Command	Description
`ant`	Build the default target (for simpledb, this is dist).
`ant -projecthelp`	List all the targets in `build.xml` with descriptions.
`ant dist`	Compile the code in src and package it in `dist/simpledb.jar`.
`ant test`	Compile and run all the unit tests.
`ant runtest -Dtest=testname`	Run the unit test named `testname`.
`ant systemtest`	Compile and run all the system tests.
`ant runsystest -Dtest=testname`	Compile and run the system test named `testname`.

1.1. Running end-to-end tests

We have also provided a set of end-to-end tests that will eventually be used for grading. These tests are structured as JUnit tests that live in the test/simpledb/systemtest directory. To run all the system tests, use the systemtest build target:

$ ant systemtest

# ... build output ...

systemtest:

[junit] Running simpledb.systemtest.ScanTest
[junit] Testsuite: simpledb.systemtest.ScanTest
[junit] Tests run: 3, Failures: 0, Errors: 3, Time elapsed: 0.237 sec
[junit] Tests run: 3, Failures: 0, Errors: 3, Time elapsed: 0.237 sec
[junit]
[junit] Testcase: testSmall took 0.017 sec
[junit] 	Caused an ERROR
[junit] implement this
[junit] java.lang.UnsupportedOperationException: implement this
[junit] 	at simpledb.HeapFile.id(HeapFile.java:46)
[junit] 	at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:90)
[junit] 	at simpledb.systemtest.SystemTestUtil.matchTuples(SystemTestUtil.java:83)
[junit] 	at simpledb.systemtest.ScanTest.validateScan(ScanTest.java:30)
[junit] 	at simpledb.systemtest.ScanTest.testSmall(ScanTest.java:41)
# ... more error messages ...

This indicates that this test failed, showing the stack trace where the error was detected. To debug, start by reading the source code where the error occurred. When the tests pass, you will see something like the following:

$ ant systemtest

# ... build output ...

[junit] Testsuite: simpledb.systemtest.ScanTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 7.278 sec
[junit]
[junit] Testcase: testSmall took 0.937 sec
[junit] Testcase: testLarge took 5.276 sec
[junit] Testcase: testRandom took 1.049 sec

BUILD SUCCESSFUL
Total time: 52 seconds

1.1.1 Creating dummy tables

It is likely you'll want to create your own tests and your own data tables to test your own implementation of SimpleDB. You can create any .txt file and convert it to a .dat file in SimpleDB's HeapFile format using the command:

$ ant dist

$ java -jar dist/simpledb.jar convert file.txt N

where file.txt is the name of the file and N is the number of columns in the file. Notice that file.txt has to be in the following format:

int1,int2,...,intN
int1,int2,...,intN
int1,int2,...,intN
int1,int2,...,intN

where each intN is a non-negative integer.

To view the contents of a table, use the print command.

$ java -jar dist/simpledb.jar print file.dat N

where file.dat is the name of a table created with the convert command, and N is the number of columns in the file.

1.2. Working in Eclipse

Eclipse is a graphical software development environment that you might be more comfortable with working in. The instructions we provide were generated by using Eclipse 3.4.0 (Ganymede) for Java Developers (not the enterprise edition) with Java 1.5.0_13 on Ubuntu 7.10. They should also work under Windows or on MacOS.

Setting the Assignment Up in Eclipse

Once Eclipse is installed, start it, and note that the first screen asks you to select a location for your workspace (we will refer to this directory as $W).
On the file system, copy hw2.tar.gz to $W/hw2.tar.gz. Un-GZip and un-tar it, which will create a directory $W/544-hw2 (to do this, you can type tar -pzxvf hw2.tar.gz).
With Eclipse running, select File->New->Project->Java->Java Project, and push Next ( You may also be able to do directly: File->New->Java Project).
Enter “544-hw2” as the project name.
On the same screen that you entered the project name, select “Create project from existing source,” and browse to $W/544-hw2.
Click Finish, and you should be able to see “544-hw2” as a new project in the Project Explorer tab on the left-hand side of your screen (if you just installed Eclipse, make sure to close the “Welcome” window). Opening this project reveals the directory structure discussed above - implementation code can be found in “src,” and unit tests and system tests found in “test.”
Running Individual Unit and System Tests

To run a unit test or system test (both are JUnit tests, and can be initialized the same way), go to the Package Explorer tab on the left side of your screen. Under the “544-hw2” project, open the “test” directory. Unit tests are found in the “simpledb” package, and system tests are found in the “simpledb.systemtests” package. To run one of these tests, select the test (they are all called *Test.java - don't select TestUtil.java or SystemTestUtil.java), right click on it, select “Run As,” and select “JUnit Test.” This will bring up a JUnit tab, which will tell you the status of the individual tests within the JUnit test suite, and will show you exceptions and other errors that will help you debug problems.

Running Ant Build Targets

If you want to run commands such as “ant test” or “ant systemtest,” right click on build.xml in the Package Explorer. Select “Run As” and then “Ant Build…” (note: select the option with the ellipsis (…), otherwise you won't be presented with a set of build targets to run). Then, in the “Targets” tab of the next screen, check off the targets you want to run (probably “dist” and one of “test” or “systemtest”). This should run the build targets and show you the results in Eclipse's console window.

Adding External Libraries to the Build Path

Eclipse users will have to take a few more steps for their code to compile. First, in the package explorer, right click the project name (probably 544-hw2) and select Refresh. Then right click the project name again (under package explorer), and select Properties. In the dialog that appears, choose Java Build Path on the left-hand-side, then click on the Libraries tab on the right-hand-side. Push the Add External JARs… button, navigate to the lib directory of your project and select zql.jar and jline-0.9.94.jar, and push OK, followed by OK. Your code should now compile.

1.3. Implementation hints

Before beginning to write code, we strongly encourage you to read through this entire document to get a feel for the high-level design of SimpleDB.

You will need to fill in any piece of code that is not implemented. It will be obvious where we think you should write code. You may need to add private methods and/or helper classes. You may change APIs, but make sure our grading tests still run and make sure to mention, explain, and defend your decisions in your writeup.

In addition to the methods that you need to fill out for this assignment, the class interfaces contain numerous methods that you need not implement in this assignment. These will either be indicated per class:

 // Not necessary for assignment 1.
 public class Insert implements DbIterator {

or per method:

 public boolean deleteTuple(Tuple t) throws DbException {

 // Some code goes here
 // Not necessary for assignment 1
 return false;
 }

The code that you submit should compile without having to modify these methods.

We suggest exercises along this document to guide your implementation, but you may find that a different order makes more sense for you. We will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test and systemtest. See the “Grading” section at the end of this document for a complete discussion of grading and list of the tests you will need to pass.

Here's a rough outline of one way you might proceed with your SimpleDB implementation; more details on the steps in this outline, including exercises, are given in Section 2 below.

Implement the operators Filter and Join and verify that their corresponding tests work. The Javadoc comments for these operators contain details about how they should work. We have given you implementations of Project and OrderBy which may help you understand how other operators work.
Implement IntAggregator and StringAggregator. Here, you will write the logic that actually computes an aggregate over a particular field across multiple groups in a sequence of input tuples. Use integer division for computing the average, since SimpleDB only supports integers. StringAggegator only needs to support the COUNT aggregate, since the other operations do not make sense for strings.
Implement the Aggregate operator. As with other operators, aggregates implement the DbIterator interface so that they can be placed in SimpleDB query plans. Note that the output of an Aggregate operator is an aggregate value of an entire group for each call to next(), and that the aggregate constructor takes the aggregation and grouping fields.
[Extra credit] Implement the methods related to tuple insertion and deletion.
[Extra credit] Implement the Insert and Delete operators. Like all operators, Insert and Delete implement DbIterator, accepting a stream of tuples to insert or delete and outputting a single tuple with an integer field that indicates the number of tuples inserted or deleted. These operators will need to call the appropriate methods in BufferPool that actually modify the pages on disk. Check that the tests for inserting and deleting tuples work properly.
- Note that SimpleDB does not implement any kind of consistency or integrity checking, so it is possible to insert duplicate records into a file and there is no way to enforce primary or foreign key constraints.
Implement a page eviction policy in the BufferPool. You do not need to worry about transactions at this point. If you do not implement tuple insertion/deletions, you do NOT need to worry about writing modified pages back to disk. Your database will be a read-only database.

At this point you should be able to pass all of the tests in the ant systemtest target, which is the goal of this lab. Note: if you are not doing the extra credit questions, tests related to inserting and deleting tuples will not work for you.

You'll also be able to use the provided SQL parser to run SQL queries against your database! See Section 2.7 for a brief tutorial and a description.

Finally, you might notice that the iterators in this lab extend the AbstractDbIterator class instead of implementing the DbIterator interface. Because the implementation of next/hasNext is often repetitive, annoying, and error-prone, AbstractDbIterator implements this logic generically, and only requires that you implement a simpler readNext. Feel free to use this style of implementation, or just implement the DbIterator interface if you prefer. To implement the DbIterator interface, remove extends AbstractDbIterator from iterator classes, and in its place put implements DbIterator.

1.4. Transactions, locking, and recovery

As you look through the interfaces that we have provided you, you will see a number of references to locking, transactions, and recovery. You do not need to support these features. We will not be implementing this part of SimpleDB this quarter. (We may use it in subsequent quarters.) The test code we have provided you with generates a fake transaction ID that is passed into the operators of the query it runs; you should pass this transaction ID into other operators and the buffer pool.

2. SimpleDB Architecture and Implementation Guide

2.1. Filter and Join

Recall that SimpleDB DbIterator classes implement the operations of the relational algebra. You will now implement two operators that will enable you to perform queries that are slightly more interesting than a table scan.

Filter: This operator only returns tuples that satisfy a Predicate that is specified as part of its constructor. Hence, it filters out any tuples that do not match the predicate.
Join: This operator joins tuples from its two children according to a JoinPredicate that is passed in as part of its constructor. We only require a simple nested loops join, but you may explore more interesting join implementations. Describe your implementation in your lab writeup.

Exercise 1. Implement the skeleton methods in:
src/simpledb/Predicate.java
src/simpledb/JoinPredicate.java
src/simpledb/Filter.java
src/simpledb/Join.java

At this point, your code should pass the unit tests in PredicateTest, JoinPredicateTest, FilterTest, and JoinTest. Furthermore, you should be able to pass the system tests FilterTest and JoinTest.

2.2. Aggregates

An additional SimpleDB operator implements basic SQL aggregates with a GROUP BY clause. You should implement the five SQL aggregates (COUNT, SUM, AVG, MIN, MAX) and support grouping. You only need to support aggregates over a single field, and grouping by a single field. In order to calculate aggregates, we use an Aggregator interface which merges a new tuple into the existing calculation of an aggregate. The Aggregator is told during construction what operation it should use for aggregation. Subsequently, the client code should call Aggregator.merge() for every tuple in the child iterator. After all tuples have been merged, the client can retrieve a DbIterator of aggregation results. Each tuple in the result is a pair of the form (groupValue, aggregateValue), unless the value of the group by field was Aggregator.NO_GROUPING, in which case the result is a single tuple of the form (aggregateValue).

Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you do not need to worry about the situation where the number of groups exceeds available memory.

Exercise 2. Implement the skeleton methods in:
src/simpledb/IntAggregator.java
src/simpledb/StringAggregator.java
src/simpledb/Aggregate.java

At this point, your code should pass the unit tests IntAggregatorTest, StringAggregatorTest, and AggregateTest. Furthermore, you should be able to pass the AggregateTest system test.

2.3. HeapFile Mutability [Extra credit – If you skip Extra Credit go to 2.5]

Now, we will begin to implement methods to support modifying tables. We begin at the level of individual pages and files. There are two main sets of operations: adding tuples and removing tuples.

Removing tuples: To remove a tuple, you will need to implement deleteTuple. Tuples contain RecordIDs which allow you to find the page they reside on, so this should be as simple as locating the page a tuple belongs to and modifying the headers of the page appropriately.

Adding tuples: The addTuple method in HeapFile.java is responsible for adding a tuple to a heap file. To add a new tuple to a HeapFile, you will have to find a page with an empty slot. If no such pages exist in the HeapFile, you need to create a new page and append it to the physical file on disk. You will need to ensure that the RecordID in the tuple is updated correctly.

Exercise 3. Implement the remaining skeleton methods in:
src/simpledb/HeapPage.java
src/simpledb/HeapFile.java (note that you do not necessarily need to implement writePage at this point.)

To implement HeapPage, you will need to modify the header bitmap for methods such as addTuple() and deleteTuple(). You may find that the getNumEmptySlots() and getSlot() methods serve as useful abstractions. Note that there is a setSlot() method provided as an abstraction to modify the filled or cleared status of a tuple in the page header.

Note that it is important that the HeapFile.addTuple() and HeapFile.deleteTuple() methods access pages using the BufferPool.getPage() method; otherwise, your implementation of transactions in the next lab will not work properly [Note: we will not do the next lab this quarter but still make sure to go through the BufferPool].

In src/simpledb/BufferPool.java:

Exercise 3 (cont.) Implement the following skeleton methods: 
insertTuple()
deleteTuple()

These methods should call the appropriate methods in the HeapFile that belong to the table being modified (this extra level of indirection is needed to support other types of files — like indices — in the future).

At this point, your code should pass the unit tests in HeapPageWriteTest and HeapFileWriteTest. We have not provided additional unit tests for HeapFile.deleteTuple() or BufferPool.

2.4. Insertion and deletion [Extra credit – If you skip Extra Credit go to 2.5]

Now that you have written all of the HeapFile machinery to add and remove tuples, you will implement the Insert and Delete operators.

For plans that implement insert and delete queries, the top-most operator is a special Insert or Delete operator that modifies the pages on disk. These operators return the number of affected tuples. This is implemented by returning a single tuple with one integer field, containing the count.

Insert: This operator adds the tuples it reads from its child operator to the tableid specified in its constructor. It should use the BufferPool.insertTuple() method to do this.
Delete: This operator deletes the tuples it reads from its child operator from the tableid specified in its constructor. It should use the BufferPool.deleteTuple() method to do this.

Exercise 4. Implement the skeleton methods in:
src/simpledb/Insert.java
src/simpledb/Delete.java

At this point, your code should pass the unit tests in InsertTest. We have not provided unit tests for Delete. Furthermore, you should be able to pass the InsertTest and DeleteTest system tests.

2.5. Page eviction

The existing implementation of SimpleDB provided to you does not correctly observe the limit on the maximum number of pages in the buffer pool defined by the constructor argument numPages. Now, you will choose a page eviction policy and instrument any previous code that reads or creates pages to implement your policy.

When more than numPages pages are in the buffer pool, one page should be evicted from the pool before the next is loaded. The choice of eviction policy is up to you; it is not necessary to do something sophisticated. Describe your policy in the lab writeup.

Important: If you are not doing the extra credit and thus are not implementing inserting/deleting tuples, you don't need to worry about writing dirty pages back to disk! All you have to do for this question is to implement the evictPage() method. You may also need to modify your getPage() method. You do NOT need to implement flushAllPages() nor flushPage().

If you are doing the extra credit, notice that BufferPool asks you to implement a flushAllPages() method. This is not something you would ever need in a real implementation of a buffer pool. However, we need this method for testing purposes. You should never call this method from any real code. Because of the way we have implemented ScanTest.testCache, you will need to ensure that your flushPage() and flushAllPages() methods do not evict pages from the buffer pool to properly pass this test. flushAllPages() should call flushPage() on all pages in the BufferPool, and flushPage() should write any dirty page to disk and mark it as not dirty, while leaving it in the BufferPool. The only method which should remove a page from the buffer pool is evictPage(), which should call flushPage() on any dirty page it evicts.

Note that we also use flushAllPages() in the query parser. Since we do not implement transactions in this assignment, there is no guarantee that any changes have been made durable when SimpleDB exits. Since it is fun to see changes actually appear on disk, we force all dirty pages to disk before quitting.

Exercise 5.
Fill in the evictPage() method and update your getPage() method in the file:

src/simpledb/BufferPool.java

At this point, your code should pass the EvictionTest system test.

Since we will not be checking for any particular eviction policy, this test works by creating a BufferPool with 16 pages (NOTE: while DEFAULT_PAGES is 100, we are initializing the BufferPool with less!), scanning a file with many more than 16 pages, and seeing if the memory usage of the JVM increases by more than 2MB. If you do not implement an eviction policy correctly, you will not evict enough pages, and will go over the size limitation, thus failing the test.

For those doing the extra credit, also fill in the flushPage() and flushAllPages() methods and appropriately fill in or modify any additional helper methods.

If you did not implement writePage() in HeapFile.java above, you will also need to do that here.

Part B: Hadoop and PigLatin

AMAZON CODES: You should have received your Amazon code by email. Please email the TA or instructor if you did not get the code.

AWS SETUP: Check the instructions. NOTE: It will take you a good 60 minutes to go through all these instructions without even trying to run example.pig at the end. But they are worth it. You are learning how to use the Amazon cloud, which is by far the most popular cloud today!

WARMUP CODE: Download the project archive from here. You will find example.pig in hw2.tar.gz. example.pig is a Pig Latin script that loads and parses the billion triple dataset that we will use in this assignment into triples: (subject, predicate, object). Then it groups the triples by their object attribute and sorts them in descending order based on the count of tuple in each group. Follow the README.txt: it instructs you on how to run the sample program example.pig .

USEFUL LINKS: The Pig Latin wiki page

Description

We live in a "big data" era. A large fraction of this data takes the form of gigantic graphs: A social network is a graph where vertices represent people and edges represent friendships. The Web is a graph where vertices represent pages and edges represent hyperlinks between pages. These graphs are very large and are difficult to study. One of the key challenges is that many graph algorithms are difficult to parallelize.

In this assignment, we will perform some basic analysis over one such graph. This graph is representative of other important graphs. The graph that we will study comes from the billion triple dataset. This is an RDF dataset that contains a billion (add or take a few) triples from the Semantic Web. Some Webpages on the Web have a machine-readable description of their semantics stored as RDF triples: our dataset was obtained by a crawler that extracted all RDF triples from the Web.

RDF data is represented in triples of the form:

		subject  predicate  object  [context]

The [context] is not part of the triple, but is sometimes added to tell where the data is coming from. For example, file btc-2010-chunk-200 contains the two "triples" (they are actually "quads" because they have the context too):

<http://www.last.fm/user/ForgottenSound> <http://xmlns.com/foaf/0.1/nick> "ForgottenSound" <http://rdf.opiumfield.com/lastfm/friends/life-exe> .
<http://dblp.l3s.de/d2r/resource/publications/journals/cg/WestermannH96> <http://xmlns.com/foaf/0.1/maker> <http://dblp.l3s.de/d2r/resource/authors/Birgit_Westermann> <http://dblp.l3s.de/d2r/data/publications/journals/cg/WestermannH96> .

The first says that Webpage <http://www.last.fm/user/ForgottenSound> has the nickname "ForgottenSound"; the second describes the maker of another webpage. foaf stands for Friend of a Friend. Confused? You don't need to know what they mean; some of the many triples refer to music, http://dbtune.org, others refer to company relationships, etc. For our purpose, these triples are just a large collection of triples. There were 317 2GB files in the billion triple dataset when we downloaded it. We uploaded them to Amazon's Web Services in S3: there were some errors, and only 251 uploaded correctly, for a total of about 550 GB of data.

This graph is similar in size to the web graph. As part of this assignment, we will compute the out-degree of each node in the graph. The out-degree of a node is the number of edges coming out of the node. This is an important property. If a graph is random, the out-degree of nodes will follow an exponential distribution (i.e., the number of nodes with degree d should be exp(- c*d) for some constant c). We will write the script in Question 1, where we will run it on a small data sample. We will run the script on the big graph in Question 2. What is very interesting is that we will find the distribution of node out-degrees to follow a power law (1/d^k for some constant k and it will look roughly like a straight-line on a graph with logarithmic scales on both the x and y axes) instead of an exponential distribution. If you look at Figures 2 and 3 in this paper, you will find that the degrees of web pages on the web, in general, follow a similar power law distribution. This is very interesting because it means that the Web and the semantic Web cannot be modeled as random graphs; instead, they need a different theoretical model.

We will do all this on a very real 0.5TB graph!

You will access the following datasets in S3, through pig (using the LOAD command -- see example.pig)

s3n://uw-cse344-test/cse344-test-file -- 250KB. This is used in example.pig. Always use this file for debugging your scripts first!
s3n://uw-cse344 -- 0.5TB. This directory contains 251 files btc-2010-chunk-000 to btc-2010-chunk-317 (since only 251 of the original 318 files uploaded correctly). You will use this in question 2.

Warmp Question: Compute a Histogram on cse344-test-file

Using the 'cse344-test-file' file, write a Pig script that groups tuples by the subject column, and creates/stores histogram data showing the distribution of counts per subject, then generate a scatter-plot of this histogram. The histogram consists of:

The x-axis is the counts associated with the subjects, and
The y-axis is the total number of subjects associated with each particular count.

So, for each point (x,y) that we generate, we mean to say that y subjects each had x tuples associated with them after we group by subject. Run your script on an AWS cluster and record the mapreduce jobs information (cluster size, # MapReduce jobs, runtimes, # reduce tasks per job). Copy the results to your local machine. Generate a log-log scatter-plot graph, using either excel or gnuplot to plot the histogram points. Save, and turn in, the plot in some image format, e.g. jpeg or png.

A few comments to help you get started:

We expect that your script will (1) group the input data by subject and count the tuples associated with each subject then (2) group the results by these intermediate counts (x-axis values) and compute the final counts (y-axis values).
To get more familiar with the Pig Latin commands, we suggest that you also take a look at the Pig Latin Wiki Page.
Copying files. After you run your job on the cluster, you will need to copyToLocal (i.e. move the files to your local directory) and "cat" together all files named part-*. Once the results have been cat-ed together, copy them back to your local machine. Alternatively, you can use the "hadoop dfs -getmerge" command. See instructions.
Generating the plot. If you use excel, then: (a) import the tab-separated text file in excel, (b) generate a scatter-plot, (c) click on each axis and make it logarithmic. If you use gnuplot, we have prepared a script which makes it easier for you to run gnuplot. Use the files plot.sh and plot.gnu as follows:
```
chmod +x plot.sh
./plot.sh PIG_RESULTS_FILE
    
```
The script generates a PNG image of the plot in your current directory. Your PIG_RESULTS_FILE needs to be tab-separated and have two columns, x and y. The data also needs to be (numerically) sorted by x. You can sort either using Pig or simply run Unix' sort -n input > output after your job has completed (by default sorting in Pig is alphabetical).

DEBUGGING:

Since you are using the small test file in this question, you can run a small, 1-node cluster.
In this question, we are debugging the script. The output of this question is thus not going to be terribly interesting. In fact, your scatterplot should only have two points: (1,1) and (3,333).
To debug a Pig Latin script, try to run Pig as follows:
```
pig -x local  
```
Run all commands as you normally would, except for store. You need to store your results locally:
```
store my_final_output into '/tmp/finaloutput' using PigStorag()
```
Take the wort try literally: we had reports that the -local option did not work in the past.
Once you are done debugging in local mode, try to run your script by issuing real MapReduce jobs. That is run with "pig" instead of "pig -x local" (remember to change the store command).

What you need to turn in: Nothing. This was a warmup step.

Question: Compute a Histogram on the Entire Dataset

Compute the histogram in Question 1 on the entire 0.5TB dataset. Use 19 nodes: this is your maximum number allowed (since you are using one node for the master). For bigger clusters, you need to ask explicit permission form Amazon.

You need to modify the load instruction to:

raw = LOAD 's3n://uw-cse344' USING TextLoader as (line:chararray);

Note: this query will take more than 4 hours to run. Plan accordingly, and monitor carefully: if anything looks wrong, abort, fix, restart.

When you are done, appreciate how relatively quick and easy it was to analyze a 0.5TB graph!

What you need to turn in:
Run your program on the entire dataset uw-cse344 and submit a hw2_B.tar.gz tarball with four files: (a) your Pig program in problemB2.pig. (b) your scatter-plot in problemB2.png, or problemB2.jpeg (or some other picture format), (c) your computed result file (problemB2-results.txt), (d) your MapReduce jobs information (problemB2-answers.txt). For this, answer the following:

2.1 How many MapReduce jobs are generated?

2.2 How many reduce tasks are within the first MapReduce job? How many reduce tasks are within later MapReduce jobs?

2.3 How long does each job take? How long does the entire script take?

2.4 What is the schema of the tuples after each command?

Hint 1: Use the job tracker to see the number of map and reduce tasks for your MapReduce jobs.

Hint 2: To see the schema for intermediate results, you can use Pig's interactive command line client grunt, which you can launch by running Pig without specifying an input script on the command line. When using grunt, a command that you may want to know about is describe . To see a list of other commands, type help.

Part C: Database Statistics

In this part you will answer some simple questions on paper (i.e. in a text file).

Consider the relations R(A,B),S(B,C),T(C,D) and the following histograms on R.A and T.D:

R.A	0..999	1000..1999	2000..2999	3000..3999	4000..4999
	10⁴	2·10⁴	3·10⁴	2·10⁴	2·10⁴

T.D	0..2499	2500..2699	2700..3999	4000..7999
	10⁴	10⁴	10⁴	10⁴

Question 1: Histogram type

For each of the two histograms indicate whether it is an eq-width or an eq-depth histogram.

Question 2: Simple estimation

Estimate the number of tuples returned by the relational algebra expression: sigma_500<=A<3500

Question 3: Complex Estimation

Assume that we have the two histograms above, and the following database statistics:

T (R) = 10⁵ T(S) = 6 · 10⁶ T(T) = 4 · 10⁴ V(R, B) = V(S, B) = 3 · 10³ V(S, C) = V(T, C) = 2 · 10⁴

Estimate the number of tuples returned by the following SQL query:

SELECT * FROM R, S, T WHERE R.A = 2432 and R.B = S.B and S.C = T.C and T.D = 1234

What you need to turn in:
A text file with your answers.