In this homework, you are required to implement the algorithm described below on a big data analytics system of your choice.
Out: Nov 1, 2014
Due: Nov 13, 2014
The goal of this problem is to compute the least common ancestor (LCA) of two academic papers.
Ancestor: An ancestor a of a paper p is any paper that is transitively cited by p. So if p is the Scatter paper by Glendenning, et al., then the CFS paper by Dabek, et al. is an ancestor: Scatter cites BigTable, which cites Chord, which cites CFS.
Common Ancestor: A common ancestor a of two papers p1 and p2 is any paper that is an ancestor of both papers. If p1 is the Scatter paper and p2 is the Phoenix paper by Chen, Wang, et al., then CFS is a common ancestor of p1 and p2: Phoenix cites Vivaldi, which cites CFS.
Least Common Ancestor: The least common ancestor a* of p1 and p2 is the unique ancestor of both papers that is smallest according to the following rules:
- Let a be an ancestor of both p1 and p2, let d1 (the depth of p1) be the number of citation hops in the shortest path from p1 to a, and let d2 be defined analogously for p2. Then the depth of ancestor a, denoted da, is the larger of d1 and d2. If a and b are two ancestors of both p1 and p2, and da < db, then a < b.
- Let ya be the publication year of ancestor a, and yb be defined similarly for b. If da = db and ya < yb, then a < b.
- If da = db and ya = yb, then break the tie by choosing the ancestor with the smaller id.
Because no two papers have the same id, there is at most one least common ancestor a* for any two papers.
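In other words, the three rules define a lexicographic ordering on (depth, year, id) triples. Below is a minimal single-machine sketch of the definition; the dict-shaped cites and year inputs are illustrative assumptions, not a required interface.
"""lca_sketch.py"""
# A direct, unoptimized rendering of the definition above. Assumes
# `cites` maps each paper id to the list of ids it cites, and `year`
# maps each paper id to its publication year (illustrative shapes).
from collections import deque

def depths_from(p, cites):
    """BFS over citation edges: fewest hops from p to each reachable paper."""
    depth = {p: 0}
    queue = deque([p])
    while queue:
        u = queue.popleft()
        for v in cites.get(u, []):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth

def lca(p1, p2, cites, year):
    d1 = depths_from(p1, cites)
    d2 = depths_from(p2, cites)
    # Treat a paper as not being its own ancestor; consult example.py
    # for the intended convention on this boundary case.
    common = (set(d1) & set(d2)) - {p1, p2}
    if not common:
        return None
    # The three rules are exactly lexicographic order on (depth, year, id).
    a = min(common, key=lambda x: (max(d1[x], d2[x]), year[x], x))
    return (a, max(d1[a], d2[a]), year[a])  # (a*, depth, year)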
Example: In the above examples, Scatter has a depth of 3 and Phoenix has a depth of 2 (with respect to the CFS paper). When considering CFS as a common ancestor of the two papers, the depth is 3 and the year is 2001.
NB: The Scatter and Phoenix citation graphs were expanded manually, so the given paths may not be the shortest ones. If there is a shorter path from Scatter to CFS, then the depth may be smaller. Take the example with a grain of salt.
The data for this problem is a citation graph. It contains two tables: papers and cites.
papers: two columns, id and year, signifying that the paper with the specified id was published in the specified year.
cites: two columns, p1 and p2, signifying that paper p1 cites paper p2.
The data can be downloaded via the AWS S3 links: papers, cites.
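As a concrete illustration, the tables can be read into the in-memory maps used by the sketch above. This assumes header-less CSV rows of the form id,year and p1,p2; verify the exact format against the downloaded files.
"""load_sketch.py"""
# Assumes header-less CSV: papers.csv rows are "id,year" and
# cites.csv rows are "p1,p2" -- check the downloaded files.
import csv

def load_tables(papers_path, cites_path):
    year, cites = {}, {}
    with open(papers_path) as f:
        for pid, y in csv.reader(f):
            year[int(pid)] = int(y)
    with open(cites_path) as f:
        for p1, p2 in csv.reader(f):
            cites.setdefault(int(p1), []).append(int(p2))
    return year, cites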
Definition: Given a citation graph as defined by the papers and cites tables, and a set of seed papers s, find the least common ancestor, if one exists, for every pair of papers in s. Represent this answer set with the 5-tuple (p1, p2, a, depth, year), where p1 < p2, a is the least common ancestor of p1 and p2, depth = max(d1, d2), and year is the publication year of a.
Note that we only need to produce the answer tuple where p1 < p2, since the tuple (p2, p1, a, depth, year) provides no extra information.
Variable-sized: Define a seed set SN to be all papers with id <= N. You can vary the size of the problem instance by varying N.
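For instance, given a year map like the one loaded above, the seed set and the pairs to answer could be enumerated as follows:
"""seeds_sketch.py"""
# Enumerate the seed set S_N and the paper pairs to answer.
from itertools import combinations

def seed_pairs(year, N):
    seeds = sorted(pid for pid in year if pid <= N)
    # combinations() emits each pair exactly once with p1 < p2,
    # matching the required output convention.
    return combinations(seeds, 2)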
The given program example.py computes the least common ancestors (LCAs) for the set SN with N = 50. (The example was tested using Python 2.7; you will need to install the networkx library, likely via pip install networkx.) The example does not scale well.
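For intuition only (example.py's actual code may differ), a single-machine networkx approach might compute ancestor depths like this:
"""networkx_sketch.py"""
# Illustrative only -- example.py's actual code may differ.
# networkx 1.x (the 2014-era release) returns a dict of hop counts here.
import networkx as nx

# Edge (p1, p2) means p1 cites p2; this tiny graph is hypothetical.
G = nx.DiGraph([(1, 2), (2, 3)])

# Fewest citation hops from seed paper 1 to every ancestor it reaches.
depths = nx.single_source_shortest_path_length(G, 1)
print depths  # {1: 0, 2: 1, 3: 2}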
Your assignment is to implement the algorithm for finding the least common ancestor on any big data analytics system of your choice. Pointers to some of the systems you can use, and instructions on running them on EC2, are provided below.
Your deliverable should include the following: the largest N for which you can compute LCAs for SN in 1 hour. We will publish a leaderboard based on these results, and the winning teams will receive prizes and local fame!
We provide the following systems that you can choose from to solve the assignment challenge: the Spark family (Spark, SparkSQL, and GraphX), GraphLab, and Cloudera's Impala, all of which are open source and well documented. Get familiar with the APIs of the system of your choice by searching the documentation online. You will need to install the system on a computing platform, EC2 for example.
To get bootstrapped, we provide a hello-world example in Python, for Spark and for SparkSQL, that counts the number of rows containing "12". Skip reading the rest of "Systems" if you're already familiar with the APIs of your system of choice.
Make sure you set the path to cites.csv (for example, https://s3-us-west-2.amazonaws.com/550.cs.washington.edu/cites.csv) and download and install the Spark binaries. Run with the command path/to/spark/bin/pyspark spark_example.py, and within a few seconds you should see that 1,638,576 lines containing "12" were found.
"""spark_example.py"""
from pyspark import SparkContext
sc = SparkContext("local", "ExampleSpark")
citesFile = "rawdata/cites.csv"
citeData = sc.textFile(citesFile)
numWords = citeData.filter(lambda s: '12' in s).count()
print "Lines with '12': %i" % (numWords)
Running the SQL query SELECT count(row) FROM cites WHERE row LIKE '%12%' would give you the same result (the % wildcards are needed; LIKE '12' would match only rows that are exactly "12").
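A minimal SparkSQL version might look as follows. This is a sketch against the Spark 1.x-era SQLContext API, matching this assignment's vintage; newer Spark releases replace inferSchema with createDataFrame.
"""sparksql_example.py"""
# Spark 1.x-era sketch; adapt the SQLContext calls for newer releases.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "ExampleSparkSQL")
sqlContext = SQLContext(sc)

# Wrap each CSV line in a Row so it can be queried as a one-column table.
lines = sc.textFile("rawdata/cites.csv").map(lambda s: Row(row=s))
sqlContext.inferSchema(lines).registerTempTable("cites")

result = sqlContext.sql(
    "SELECT count(row) FROM cites WHERE row LIKE '%12%'")
print result.collect()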
Each group should email TA Sophia Wang an account name of your choice (first come, first served; a random number will be appended in the case of conflicts). Sophia will send you a URL, username, and password to log into your EC2 account, along with an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for running EC2 instances automatically.
Create a key pair by logging into your account, clicking on "EC2" in the dashboard, and then clicking on "Key Pairs" on the left sidebar. Then create and download a key, and make sure that you set the permissions of the private key file to 600: chmod 600 /path/to/keypair/file.pem
Start from the console page:
1. Click on "EC2", then on "Launch Instance".
2. Select the system of your choice and choose an instance type (choose t2.micro for testing purposes).
3. Click on "Next: Configure Instance Details", then "Review and Launch", then "Launch".
4. Select the key pair you created, check the checkbox, and click on "Launch Instances".
This process can take a few minutes; click on "Check Instances" to see the progress. After the instance turns green (running), select it and click on "Connect" at the top. This will show you how to log into the EC2 instance you created.
After you're done using the instance, remember to right-click it and select "Stop" or "Terminate". "Stop" saves the virtual machine state of your instance so that you can reuse it next time; "Terminate" discards the instance and everything on it.
Search for the tutorials of the system of your choice and follow their instructions. For example, Spark provides a script that automatically creates the number of instances you ask for (instructions). Note that in those instructions, keypair means the name of your key pair and key-file means the path to your private key file.
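For instance, launching a small cluster with the Spark 1.x spark-ec2 script looks roughly like ./spark-ec2 -k keypair -i /path/to/key-file.pem -s 2 launch my-cluster, where the slave count and cluster name here are placeholders you should replace.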
At this point, you should be able to automatically run EC2 instances on the system of your choice.