Due date: Tuesday, Oct 30th at 11:59pm.
Objectives: Learn how to use MapReduce with Hive offered as a cloud service. We will use Amazon Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud. In this assignment, you will set up a Hadoop cluster using EMR, ingest data, and run some queries.
Assignment tools: Amazon Elastic MapReduce (EMR) on Amazon Web Services. Unlike Redshift, Amazon EMR is supported in AWS Educate. For this assignment, deploy an EMR cluster in the AWS Educate classroom environment.
IMPORTANT: Use m4.large instances for this homework. Remember to terminate the cluster when you are done so that you are not charged for it while it sits idle.
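The termination reminder above can also be scripted. The following is a minimal sketch, assuming you have boto3 installed and AWS credentials that are allowed to manage EMR (which may not be the case in every AWS Educate setup). The function name `terminate_cluster` and the cluster ID shown in the usage note are hypothetical; the console remains the simplest way to terminate a cluster.

```python
def terminate_cluster(emr_client, cluster_id):
    """Request termination of an EMR cluster and return its reported state.

    emr_client is assumed to be a boto3 EMR client, e.g. boto3.client("emr").
    cluster_id is the cluster's ID as shown in the EMR console.
    """
    # Ask EMR to shut the cluster down. The call returns immediately;
    # the actual shutdown takes a little while.
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])

    # Read back the cluster state so you can confirm that the request
    # was accepted (typically a terminating or terminated state).
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]
```

A call would look like `terminate_cluster(boto3.client("emr"), "j-XXXXXXXXXXXXX")`, with the placeholder replaced by your own cluster ID. Whichever method you use, check the EMR console afterwards to confirm that the cluster is no longer running.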
What to turn in: For each query, turn in the SQL, the run time, the number of rows returned, and the first two rows of the result set (or all rows if the query returns fewer than two). Submit everything as a single PDF or DOCX file.
How to submit the assignment: In your GitLab repository, you should see a directory called hw2. Put your report in that directory. Remember to git add, git commit, and git push. You can add your report early and keep updating and pushing it as you do more work. We will collect the final version after the deadline passes. If you need extra time on an assignment, let us know. This is a graduate course, so we are reasonably flexible with deadlines, but please do not overuse this flexibility. Use extra time only when you truly need it.
Getting started with Amazon EMR is covered in Section 3. Follow the instructions in Section 3 to deploy an EMR cluster with one master node and two core nodes, running release emr-5.17.0. As in Homework 1, ingest the data and run the queries.
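For reference, the cluster described above can be written down as a boto3 request. This is only a sketch of the same configuration and does not replace the instructions in Section 3. It assumes boto3 is installed, that your AWS Educate credentials permit creating clusters from code, and that the default EMR roles (`EMR_DefaultRole` and `EMR_EC2_DefaultRole`) exist in your account. The cluster name `hw2-emr`, the region `us-east-1`, and the helper names are illustrative choices, not part of the assignment.

```python
def build_cluster_config(name="hw2-emr"):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    The values mirror the assignment: release emr-5.17.0, Hive installed,
    one master node and two core nodes, all of type m4.large.
    """
    return {
        "Name": name,  # illustrative name, pick your own
        "ReleaseLabel": "emr-5.17.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.large",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m4.large",
                    "InstanceCount": 2,
                },
            ],
            # Keep the cluster running so you can log in and run queries;
            # this also means you must terminate it yourself when done.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Create the cluster and return its ID. Requires valid AWS credentials."""
    import boto3  # imported here so the config can be inspected without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_config())
    return response["JobFlowId"]
```

Calling `build_cluster_config()` on its own is a quick way to check the settings against Section 3 before launching anything. The sketch omits networking and SSH key settings, which you will likely need in order to log in to the master node.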