HW 2

Due date: Tuesday, Oct 30th at 11:59pm.

Objectives: Learn how to use MapReduce with Hive offered as a cloud service. We will use Amazon Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud. In this assignment, we will set up a Hadoop cluster using EMR, ingest data, and run some queries.

Assignment tools: Amazon Elastic MapReduce (EMR) on Amazon Web Services. Unlike Redshift, Amazon EMR is supported in AWS Educate. For this assignment, deploy an EMR cluster in the AWS Educate classroom environment.

IMPORTANT: Use m4.large instances for this homework. Remember to terminate the cluster when you are done so that you are not charged for idle instances.

What to turn in: You will turn in the SQL for each query, the runtime of each query, the number of rows returned, and the first two rows of the result set (or all rows if a query returns fewer than two). Submit everything as a single pdf or docx file.

How to submit the assignment: In your GitLab repository, you should see a directory called hw2. Put your report in that directory. Remember to git add, git commit, and git push. You can add your report early and keep updating and pushing it as you do more work. We will collect the final version after the deadline passes. If you need extra time on an assignment, let us know. This is a graduate course, so we are reasonably flexible with deadlines, but please do not overuse this flexibility. Use extra time only when you truly need it.

Getting started with Amazon EMR is covered in Section 3. Follow the instructions in Section 3 to deploy an EMR cluster with one master node and two core nodes, running release emr-5.17.0. As in Homework 1, ingest the data and run queries.
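If you prefer the AWS CLI to the console walkthrough in Section 3, a cluster of the same shape can be created roughly as follows. This is a sketch, not part of the required instructions: the cluster name, key pair, and log bucket are placeholders you must replace, and flags may vary slightly with your CLI version.

```shell
# Sketch only: replace the key pair name and log bucket with your own.
# 3 instances = 1 master + 2 core nodes, per the assignment.
aws emr create-cluster \
  --name "hw2-hive" \
  --release-label emr-5.17.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-log-bucket/emr-logs/
```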

  1. Ingest Data (10 points): Ingest the 10 GB TPC-H dataset. The Hive-formatted data is in the S3 bucket at uwdb/tpch/uniform/10GB-HIVE, with a subfolder for each of the eight tables: customer, lineitem, nation, orders, part, partsupp, region, supplier. Refer to the CREATE TABLE statements from hw1 for the table schemas.
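As an illustration of the ingest step, an external Hive table over the S3 data might be declared as follows. This is a sketch: the column names and types come from the standard TPC-H schema (mirror your hw1 CREATE TABLE statements), and it assumes the files are '|'-delimited text, the usual TPC-H layout.

```sql
-- Hypothetical sketch: columns follow the standard TPC-H schema; adjust to
-- match your hw1 definitions. Assumes '|'-delimited text files in S3.
CREATE EXTERNAL TABLE region (
  r_regionkey INT,
  r_name      STRING,
  r_comment   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://uwdb/tpch/uniform/10GB-HIVE/region/';
```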

  2. Run Queries (25 points): Run the following queries:

    1. What is the total number of parts offered by each supplier? The query should return the name of the supplier and the total number of parts.
    2. What is the cost of the most expensive part by any supplier? The query should return only the price of that most expensive part. No need to return the name.
    3. What is the cost of the most expensive part for each supplier? The query should return the name of the supplier and the cost of the most expensive part but you do not need to return the name of that part.
    4. What is the total number of customers per nation? The query should return the name of the nation and the number of unique customers.
    5. What is the number of parts shipped between October 10, 1996 and November 10, 1996 for each supplier? The query should return the name of the supplier and the number of parts.
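The queries above all follow a similar join-plus-aggregate shape. As a non-graded illustration of that pattern in HiveQL (column names taken from the standard TPC-H schema; adjust to your own table definitions), counting orders per customer looks like:

```sql
-- Illustrative only; this is NOT one of the five graded queries.
-- Column names follow the standard TPC-H schema.
SELECT c.c_name, COUNT(*) AS num_orders
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
GROUP BY c.c_name;
```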

  3. Comparison with Redshift (15 points): Compare the time and cost to ingest data and run queries for the TPC-H 10 GB dataset. For this step you should be able to reuse the runtimes from hw1 and estimate the cost of running these queries.
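For the cost estimate, a simple per-query calculation is the hourly per-instance price (EC2 plus the EMR surcharge) times the cluster size times the query runtime in hours. The rates below are placeholders, not official figures; look up the current EC2 and EMR prices for m4.large in your region. Only the arithmetic is the point.

```python
# Back-of-the-envelope EMR query cost:
# (EC2 rate + EMR surcharge) per instance-hour x instances x runtime in hours.
# The rates below are PLACEHOLDERS -- check current AWS pricing for your region.
EC2_RATE = 0.10   # assumed $/hour for one m4.large instance
EMR_RATE = 0.03   # assumed EMR surcharge, $/hour per instance

def query_cost(runtime_seconds, num_instances=3):
    """Estimated dollar cost of running one query on the cluster."""
    hours = runtime_seconds / 3600.0
    return (EC2_RATE + EMR_RATE) * num_instances * hours

print(round(query_cost(120), 4))  # a 2-minute query on 3 nodes -> 0.013
```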