Objectives:

  • To learn how to use AWS.
  • To learn how to run distributed queries using Spark.

Assignment tools: Amazon Web Services (AWS) and Spark
Follow these instructions to set up your account and cluster.

Due date: Monday, November 27, 2017 by 11:00pm.
Turn in your solution using the assignment drop box linked from the main course web page.

What to turn in:
Submit two files. First, hw6-queries.sql should contain your queries and comments showing the results. Second, hw6-histogram.{pdf|jpg|png} should be an image (in one of those three formats) showing the histogram you obtain from your sixth query below.

IMPORTANT: Setting up the Spark cluster for this assignment can take a few hours (or longer if you run into problems), so make sure you start early.

Homework Description

In this homework, we will use Amazon AWS to create a Spark cluster and investigate Freebase data. Freebase is a knowledge graph, similar to Wikipedia, that holds topics and facts about the world. (Although the public-facing version of Freebase is no longer supported, Google still uses the data.) For more information about how Freebase data is organized, see https://developers.google.com/freebase/guide/basic_concepts.

At a high level, the data is represented in triples (or quadruples) of the form:

subject  predicate  object  [context]

The subject is usually a machine ID (MID), a unique ID that looks like /m/#### (the #### represents some number of alphanumeric characters). The predicate explains what the object is in relation to the subject.
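
In Spark SQL, you can treat this data as a four-column table. The sketch below is illustrative only and is not part of the required setup: the view name freebase and the column names are assumptions, and your cluster setup instructions may already register the data under a different name.

-- Sketch only: expose the tab-separated dump as a four-column table.
-- "freebase" is a hypothetical name; /data/spark_data.tsv is the HDFS path
-- where the setup instructions put the data.
CREATE TEMPORARY VIEW freebase (subject STRING, predicate STRING, object STRING, context STRING)
USING csv
OPTIONS (path '/data/spark_data.tsv', sep '\t');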

For example, if you issued a query to retrieve all tuples with the subject MID of /m/0284r5q, you would see the following:

subject     predicate                       object            context
___________________________________________________________________________________________
/m/0284r5q  /type/object/key                /wikipedia/en_id  9,327,603
/m/0284r5q  /type/object/key                /wikipedia/en     Flyte_$0028chocolate_bar$0029
/m/0284r5q  /type/object/type               /common/topic 
/m/0284r5q  /type/object/type               /food/candy_bar
/m/0284r5q  /food/candy_bar/manufacturer    /m/01kh5q 
/m/0284r5q  /food/candy_bar/sold_in         /m/09c7w0
/m/0284r5q  /type/object/name               /lang/en          Flyte
/m/0284r5q  /common/topic/image             /m/04v6jtv
...

The two tuples with the /type/object/type predicate indicate that MID /m/0284r5q has two types: "topic" and "candy bar". For the tuple with the predicate /type/object/name, the object is the English language and the context is Flyte, which means that MID /m/0284r5q is the candy bar called Flyte in English.
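
As a sketch, the lookup described above could be written in Spark SQL as follows (again using the hypothetical table name freebase from the earlier sketch; substitute whatever name your setup registered):

-- Retrieve every tuple whose subject is the MID /m/0284r5q.
SELECT subject, predicate, object, context
FROM freebase
WHERE subject = '/m/0284r5q';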

This is just a subset of the data for a specific topic. In this homework, you will be looking at data for all topics in Freebase. This dataset is 35 GB, which is why we need Amazon AWS to help. As a comparison, the flights data you used in assignment 2 was 0.11 GB.

Running Spark Queries [20 pts]

Follow these instructions to launch a Spark cluster on AWS and connect to the Zeppelin UI. Once you have done so, you are ready to run queries on the data.

For each question below, write your query in a file called hw6-queries.sql with a comment indicating the question number. Then, copy your query into the Zeppelin notebook, run it on Spark, and copy the result into your hw6-queries.sql file as a comment under the query. For example, for question 1, it should look like this:

-- Question 1
SELECT...
FROM...
WHERE...
/*
Answer to query
*/
  1. What is the total number of rows in the data? [2 pts]
  2. What is the number of distinct predicates in the data? [2 pts]
  3. In the example in the description, we showed some tuples with the subject of MID /m/0284r5q. What are all the tuples with the subject of MID /m/0284r5q? [3 pts]
  4. How many travel destinations does Freebase have? To do this, we'll use the type /travel/travel_destination. In particular, we want to find the number of subjects that have a /type/object/type predicate with the object equal to /travel/travel_destination. [3 pts]
  5. Building on the previous query, which 20 travel destinations have the most tourist attractions? Return the location name and count. [4 pts]

    Use the /travel/travel_destination/tourist_attractions predicate to find the tourist attractions for each destination. Use the /type/object/name predicate and /lang/en object to get the name of each location (the name will be the context of the tuple with predicate /type/object/name and object /lang/en). Sort your result by the number of tourist attractions from largest to smallest, then by destination name alphabetically, and return only the top 20.

  6. Generate a histogram of the number of distinct predicates per subject. [6 pts]

    This is asking for more than the count of distinct predicates per subject; it is asking for the distribution of those counts (i.e., for each number of distinct predicates, how many subjects have that many). For your answer, still put the query in hw6-queries.sql, but instead of copying the result as a comment, make a chart of your results in Zeppelin (the little icons below the query let you toggle output modes). Take a screenshot of a bar chart of your histogram and submit it as hw6-histogram.{pdf|jpg|png}.

You can now shut down your cluster. You will not need it to answer any of the remaining questions.

Additional Questions [8 pts]

For each question below except the last, choose one answer. Include your answer (along with the question number) as a comment at the end of your hw6-queries.sql file.

  1. In the setup code, you ran this command to put data in HDFS for Spark to read:
    hadoop fs -put /data/freebase-datadump-quadruples.tsv /data/spark_data.tsv

    By default, Spark looks in HDFS for data, but you can actually tell Spark to read files locally, rather than from HDFS. For this to work, what additional preprocessing step would you need to take, before even opening your Zeppelin notebook, to prepare the data?

    1. Transfer the file to the /tmp/ directory for Spark to access.
    2. Replicate the data to be on all the nodes in the cluster.
    3. Format the data to be in RDD format.
    4. Move the data into memory so it can be read by Spark.

  2. How is Spark different from MapReduce?
    1. Spark can read data that is compressed while Hadoop MapReduce cannot.
    2. While computing map(f), if a single call to f(x) runs out of memory, then Spark will throw an OutOfMemory error while Hadoop will successfully finish.
    3. Spark writes intermediate results to memory while Hadoop writes to disk.
    4. Spark can run Machine Learning algorithms while Hadoop cannot.
  3. Which of the following is NOT a good use case of MapReduce?
    1. Counting the number of times each word occurs in a massive collection of documents.
    2. Computing matrix multiplication on two massive matrices.
    3. Returning files that contain a certain word in a massive collection of documents.
    4. Running a large number of transactions for a major bank.
  4. In a simple MapReduce job with m mapper tasks and r reducer tasks, how many output files do you get?
    1. 1
    2. r
    3. m
    4. m + r
  5. One of the key features of MapReduce and Spark is their ability to cope with server failure. For each statement below, indicate whether it is true or false.
    1. In MapReduce, every map task and every reduce task is replicated across several workers.
    2. When a server fails, Spark recomputes the RDD partitions at that server from parent RDD partitions.
    3. In Spark, when the programmer calls the persist() function, the data is stored on disk, to facilitate recovery from failure.
    4. When a server running a reduce task fails, MapReduce restarts that reduce task either at the same or another server, reusing data stored in local files at the mappers.

Extra Credit [4 pts]

Download the data from Amazon (you can use the command scp once you are logged in to the master node). Ingest the data into PostgreSQL or SQL Server. Run the query from question 3 and time the results. Turn in a file hw6-extracredit.txt with the database system you used, the answer to the query, and the runtime in minutes.
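
For example, a minimal PostgreSQL sketch (the table and column names are assumptions, the file path is wherever you copied the dump, and escaping in the dump may require extra COPY options):

-- Hypothetical schema mirroring the quadruple layout.
CREATE TABLE freebase (
    subject   TEXT,
    predicate TEXT,
    object    TEXT,
    context   TEXT
);

-- In psql: load the tab-separated dump (COPY's default text format is tab-delimited),
-- then turn on client-side timing before running the question 3 query.
\copy freebase FROM 'freebase-datadump-quadruples.tsv'
\timing on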