Objectives:
Assignment tools: Amazon Web Services (AWS) and Spark
Follow these instructions to set up your
account and cluster.
Due date: Monday, November 27, 2017 by 11:00pm.
Turn in your solution using the assignment drop box linked from the main course
web page.
What to turn in:
Submit two files. First, hw6-queries.sql should contain your queries
and comments showing the results. Second, hw6-histogram.{pdf|jpg|png}
should be an image (in one of those three formats) showing the histogram you
obtained from your sixth query below.
In this homework, we will use Amazon AWS to create a Spark cluster and investigate Freebase data. Freebase is a knowledge graph, like Wikipedia, holding topics and facts about the world. (Although the public-facing version of Freebase is no longer supported, Google still uses the data.) If you want more information about how Freebase data is organized, see https://developers.google.com/freebase/guide/basic_concepts.
At a high level, the data is represented in triples (or quadruples) of the form:
subject predicate object [context]
The subject is usually a machine ID (MID), a unique ID that looks like /m/#### (the #### represents some number of alphanumeric characters). The predicate explains what the object is in relation to the subject.
For example, if you issued a query to retrieve all tuples with the subject MID of /m/0284r5q, you would see the following:
subject       predicate                      object              context
---------------------------------------------------------------------------------------------
/m/0284r5q    /type/object/key               /wikipedia/en_id    9,327,603
/m/0284r5q    /type/object/key               /wikipedia/en       Flyte_$0028chocolate_bar$0029
/m/0284r5q    /type/object/type              /common/topic
/m/0284r5q    /type/object/type              /food/candy_bar
/m/0284r5q    /food/candy_bar/manufacturer   /m/01kh5q
/m/0284r5q    /food/candy_bar/sold_in        /m/09c7w0
/m/0284r5q    /type/object/name              /lang/en            Flyte
/m/0284r5q    /common/topic/image            /m/04v6jtv
...
The two /type/object/type predicates indicate that MID /m/0284r5q has two types: "topic" and "candy bar". For the predicate /type/object/name, the object is the English language and the context is Flyte, which means that MID /m/0284r5q is the candy bar called Flyte in English.
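As a point of reference, the result above could be produced by a query along the lines of the sketch below. This assumes the data has been registered in Spark as a table named freebase with columns subject, predicate, object, and context; the actual table and column names depend on how you load the data in your Zeppelin notebook.

-- Sketch only; the table and column names are assumptions.
SELECT subject, predicate, object, context
FROM freebase
WHERE subject = '/m/0284r5q';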
This is just a subset of the data for a specific topic. In this homework, you will be looking at data for all topics in Freebase. This dataset is 35 GB, which is why we need Amazon AWS to help. As a comparison, the flights data you used in assignment 2 was 0.11 GB.
Follow these instructions to launch a Spark cluster on AWS and connect to the Zeppelin UI. Once you have done so, you are ready to run queries on the data.
For each question below, write your query in a file called hw6-queries.sql with a comment indicating the question number. Then, copy your query into the Zeppelin notebook, run it on Spark, and copy the result into your hw6-queries.sql file as a comment under the query. For example, for question 1, it should look like this:
-- Question 1
SELECT...
FROM...
WHERE...
/* Answer to query */
Use the /travel/travel_destination/tourist_attractions predicate to find the tourist attractions for each destination. Use the /type/object/name predicate and the /lang/en object to get the name of each location (the name is the context of the tuple whose predicate is /type/object/name and whose object is /lang/en). Sort your result by the number of tourist attractions from largest to smallest, then by destination name alphabetically, and return only the top 20.
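One possible shape for this query is a self-join of the quadruples table, sketched below. This is not the required solution: it again assumes a table named freebase with columns subject, predicate, object, and context, and you should verify and adjust it against your own setup.

-- Sketch only; table and column names are assumptions.
SELECT n.context AS destination_name, COUNT(a.object) AS num_attractions
FROM freebase a
JOIN freebase n ON n.subject = a.subject
WHERE a.predicate = '/travel/travel_destination/tourist_attractions'
  AND n.predicate = '/type/object/name'
  AND n.object = '/lang/en'
GROUP BY n.context
ORDER BY num_attractions DESC, destination_name ASC
LIMIT 20;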
This is asking for more than a count of the number of distinct predicates per subject; it is asking for the distribution of that count. For your answer, still put the query in hw6-queries.sql, but instead of copying the result as a comment, make a chart of your results in Zeppelin (the little icons below the query let you toggle output modes). Take a screenshot of a bar chart of your histogram and submit it as hw6-histogram.{pdf|jpg|png}.
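A common way to compute such a distribution is a two-level aggregation: first count the distinct predicates for each subject, then count how many subjects fall into each bucket. A sketch of that pattern, again under the assumption of a table named freebase, is:

-- Sketch only; table and column names are assumptions.
SELECT num_predicates, COUNT(*) AS num_subjects
FROM (
  SELECT subject, COUNT(DISTINCT predicate) AS num_predicates
  FROM freebase
  GROUP BY subject
) per_subject
GROUP BY num_predicates
ORDER BY num_predicates;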
You can now shut down your cluster. You will not need it to answer any of the remaining questions.
For each question below except the last, choose one answer. Include your answer (along with the question number) as a comment at the end of your hw6-queries.sql file.
hadoop fs -put /data/freebase-datadump-quadruples.tsv /data/spark_data.tsv
By default, Spark looks for data in HDFS, but you can tell Spark to read files from the local filesystem instead. For this to work, what additional preprocessing step would I need to take to prepare the data before even opening my Zeppelin notebook?
Download the data from Amazon (you can use scp once you are logged in to the master node). Ingest the data into PostgreSQL or SQL Server. Run the query from question 3 and time it. Turn in a file hw6-extracredit.txt with the database system you used, the answer to the query, and the runtime in minutes.
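If you attempt this with PostgreSQL, a minimal loading sketch might look like the following; the table name, column types, file path, and COPY options are assumptions and may need adjusting to match the actual format of the dump.

-- Hypothetical schema for the quadruples file.
CREATE TABLE freebase (
  subject   text,
  predicate text,
  object    text,
  context   text
);

-- Load the tab-separated dump; text format uses tab as its default delimiter.
COPY freebase FROM '/data/freebase-datadump-quadruples.tsv' WITH (FORMAT text);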