In this homework, you are required to implement the algorithm described below on a big data analytics system of your choice.

Out: Nov 1, 2014
Due: Nov 13, 2014

Algorithm: Least Common Ancestor (LCA)


Problem definition

The goal of this problem is to compute the least common ancestor (LCA) of two academic papers.

Ancestor: An ancestor a of a paper p is any paper that is transitively cited by p. So if p is the Scatter paper by Glendenning, et al., then the CFS paper by Dabek, et al. is an ancestor: Scatter cites BigTable, which cites Chord, which cites CFS.

Common Ancestor: A common ancestor a of two papers p1 and p2 is any paper that is an ancestor of both papers. If p1 is the Scatter paper from above and p2 is the Phoenix paper by Chen, Wang, et al., then CFS is a common ancestor of p1 and p2: Phoenix cites Vivaldi, which cites CFS.

Least Common Ancestor: The least common ancestor a* of p1 and p2 is the unique ancestor of both papers that is smallest according to the following rules:

  • If a is an ancestor of both p1 and p2, let d1 be the number of citation hops in the shortest path from p1 to a, and let d2 be defined analogously for p2. Then the depth of ancestor a, denoted da, is the larger of d1 and d2. If a and b are two ancestors of both p1 and p2, and da < db, then a < b.
  • Let ya be the publication year of ancestor a, and yb defined similarly for b. If da = db and ya < yb, then a < b.
  • If da = db and ya = yb, then break the tie by choosing as a* the ancestor with the smaller id.
  • Because no two papers have the same id, there is at most one least common ancestor a* for any two papers.
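
These three rules collapse into a lexicographic ordering on the tuple (depth, year, id). Here is a minimal single-machine sketch of the ordering; the dict-of-lists graph representation and helper names are illustrative assumptions, not a required interface:

    from collections import deque

    def ancestor_depths(p, cites):
        """BFS from paper p along citation edges.
        Returns {paper: hops on the shortest citation path from p}."""
        depth = {p: 0}
        queue = deque([p])
        while queue:
            u = queue.popleft()
            for v in cites.get(u, []):
                if v not in depth:  # first visit = fewest hops
                    depth[v] = depth[u] + 1
                    queue.append(v)
        return depth

    def lca(p1, p2, cites, year):
        d1 = ancestor_depths(p1, cites)
        d2 = ancestor_depths(p2, cites)
        # Common ancestors are reachable from both papers in >= 1 hop.
        common = [a for a in d1 if a in d2 and d1[a] > 0 and d2[a] > 0]
        if not common:
            return None  # no common ancestor exists
        # Rules 1-3 amount to a lexicographic order on (depth, year, id).
        return min(common, key=lambda a: (max(d1[a], d2[a]), year[a], a))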

Example: In the examples above, the path from Scatter to CFS has depth 3 and the path from Phoenix to CFS has depth 2. When considering CFS as a common ancestor of the two papers, its depth is max(3, 2) = 3 and its year is 2001.

NB: The Scatter and Phoenix citation paths were traced manually, and the given paths may not be the shortest ones. If there is a shorter path from Scatter to CFS, then CFS's depth would be smaller. Take the example with a grain of salt.


Data

The data for this problem is a citation graph. It contains two tables: papers and cites.

papers: Two columns, id and year, signifying that the paper with the specified id was published in the specified year.

cites: Two columns, p1 and p2, signifying that paper p1 cites paper p2.

The data can be downloaded via the AWS S3 links: papers, cites.
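
For local experimentation, the two tables can be loaded into memory along these lines (a sketch only: the papers.csv file name and the assumption that the CSVs have no header row are mine):

    import csv

    # year[id] = publication year of paper id; assumes rows of the form "id,year".
    with open("rawdata/papers.csv") as f:
        year = {int(p): int(y) for p, y in csv.reader(f)}

    # cites[p1] = list of papers that p1 cites; assumes rows of the form "p1,p2".
    cites = {}
    with open("rawdata/cites.csv") as f:
        for p1, p2 in csv.reader(f):
            cites.setdefault(int(p1), []).append(int(p2))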


Variable-sized problem instances

Definition: Given a citation graph as defined by the papers and cites tables, and a set of seed papers s, find the least common ancestor, if one exists, for every pair of papers in s. Represent this answer set as a set of 5-tuples (p1, p2, a, depth, year), where p1 < p2, a is the least common ancestor of p1 and p2, depth = max(d1, d2), and year is the publication year of a.

Note that we only need to produce the answer tuple where p1 < p2, since the tuple (p2, p1, a, depth, year) would provide no extra information.

Variable-sized: Define a seed set SN to be all papers with id <= N. You can vary the size of the problem instance by varying N.
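
For example, assuming the papers table has been loaded into the year dict sketched above, the pairs to answer for SN can be enumerated as follows:

    from itertools import combinations

    N = 50
    seeds = sorted(p for p in year if p <= N)
    # combinations yields each unordered pair exactly once, with p1 < p2.
    pairs = list(combinations(seeds, 2))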


Example

The given program example.py computes the least common ancestors (LCAs) for the set SN with N = 50. (The example was tested using Python 2.7; you will need to install the networkx library, likely via pip install networkx.) The example does not scale well.


Assignment

Your assignment is to implement the algorithm for finding the least common ancestor on a big data analytics system of your choice. Pointers to some of the systems that you can use, and instructions on using them on EC2, are provided below.

Your deliverable should include the following:

• Code for the assignment
• A writeup describing your experiences in building the system, including:
  • A performance evaluation of your implementation. For example, how much time does it take to compute the answer as you increase the data size? How does the execution time decrease as you increase the number of nodes? (We will likely limit the number of slave nodes in your experiment to 19 for now, but email us if you want to experiment with a much larger cluster size.)
  • An explanation of why you observe the performance that you do. Is it scaling as well as you expect? If not, why not?
  • Comments on your big data analytics system of choice and whether it was a good fit for this algorithm. If it was not a good fit, which other system would have been a better fit?
• On your big data technology of choice, how large an N can you compute LCAs for SN in 1 hour? We will publish a leaderboard based on these results, and the winning teams will receive prizes and local fame!
• As with the Paxos assignment, you will meet with the TA/instructor to discuss your submission and demonstrate your code.

Systems


We provide the following systems that you can choose from to solve the assignment challenge: the Spark family (Spark, SparkSQL, and GraphX), GraphLab, and Cloudera's Impala, all of which are open source and well documented. Get familiar with the APIs of the system of your choice by searching the documentation online. You will need to install the system on a computing platform, for example EC2.

To get you bootstrapped, we provide a hello world example in Python, for Spark and SparkSQL, that counts the number of rows containing "12". Skip the rest of "Systems" if you're already familiar with the APIs of your system of choice.


Spark

Make sure you set the path to cites.csv (for example, https://s3-us-west-2.amazonaws.com/550.cs.washington.edu/cites.csv) and download/install the Spark binaries. Run with the command path/to/spark/bin/pyspark spark_example.py, and you should see that 1,638,576 lines containing "12" are found in a few seconds.

    """spark_example.py"""
    from pyspark import SparkContext
    sc = SparkContext("local", "ExampleSpark")
    
    citesFile = "rawdata/cites.csv"
    citeData = sc.textFile(citesFile)
    numWords = citeData.filter(lambda s: '12' in s).count()
    print "Lines with '12': %i" % (numWords)

SparkSQL

Running the SQL query SELECT count(row) FROM cites WHERE row LIKE '%12%' (where row holds the raw line) would give you the same result.
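
A sketch of what this might look like in code, assuming a Spark 1.1-era SparkSQL API (inferSchema and registerTempTable; the exact calls vary across Spark versions):

    """sparksql_example.py (sketch; API names vary across Spark versions)"""
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local", "ExampleSparkSQL")
    sqlContext = SQLContext(sc)

    # Wrap each raw CSV line in a Row so SparkSQL can infer a one-column schema.
    lines = sc.textFile("rawdata/cites.csv").map(lambda l: Row(row=l))
    sqlContext.inferSchema(lines).registerTempTable("cites")

    result = sqlContext.sql("SELECT count(row) FROM cites WHERE row LIKE '%12%'")
    print "Lines with '12': %i" % result.collect()[0][0]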


Setting up EC2


Getting an account

Each group should email TA Sophia Wang an account name of your choice (first come, first served; a random number will be appended to the end in case of conflicts). Sophia will send you a URL, username, and password to log into your EC2 account, as well as an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for programmatically running EC2 instances.


Setting up an EC2 key pair

This can be done by logging into your account, clicking on "EC2" in the dashboard, and then clicking on "Key Pairs" in the left sidebar. Then create and download a key, and make sure that you set the permissions of the private key file to 600: chmod 600 /path/to/keypair/file.pem


Launching an EC2 instance manually

Starting from the console page, click on "EC2", click on "Launch Instance", select the system of your choice, choose an instance type (choose t2.micro for testing purposes), click on "Next: Configure Instance Details", click on "Review and Launch", click on "Launch", select the key pair you created, check the checkbox, and click on "Launch Instances". This process can take a few minutes; click on "Check Instances" to see the progress. After the instance turns green (running), select the instance and click on "Connect" at the top. This will give you information on how to log in to the EC2 instance you created.

After you're done using the instance, remember to right-click the instance and click on "Stop" or "Terminate". "Stop" saves the virtual machine state of your instance so that you can reuse it next time; "Terminate" deletes the instance and its state permanently.


Launching an EC2 instance automatically

Search for the tutorials for the system of your choice, and follow the instructions in them. For example, Spark provides a script that automatically creates a number of instances of your choice (instructions). Note that in those instructions, keypair means the name of your key pair and key-file means the path to your key pair file.
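
If you prefer to script the launch yourself, a rough sketch using the boto library (pip install boto) follows; the AMI id and key pair name below are placeholders, and boto picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from your environment:

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-west-2")
    reservation = conn.run_instances(
        "ami-XXXXXXXX",          # placeholder: an AMI for your chosen system
        key_name="my-keypair",   # placeholder: the key pair you created earlier
        instance_type="t2.micro")
    print "Launched %s" % reservation.instances[0].id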


At this point, you should be able to automatically run EC2 instances on the system of your choice.