Project 4: Hadoop and Pig

Due: Wednesday, August 18 at 11:00 pm - NO LATE WORK accepted

ESTIMATED TIME: Up to 18 hours, though likely less.

RIGHT NOW: Immediately complete the steps to set up your Amazon Web Services (AWS) account. Account setup may take a couple of days to go through, so do it right away to make sure it is ready when you need it. Once you have AWS access, read through and complete all the Preliminaries to make sure you can access AWS and run Pig scripts there. You will start the project on your local machine and use an AWS cluster for large runs at the end.

STARTER CODE: Download the project archive, project4.tar.gz. It contains Hadoop, Pig, and the data files and scripts you need for this project. Note: it is about 20 MB compressed.

TURN IN INSTRUCTIONS: Turn in eight files (details below) to the Catalyst dropbox. If you are turning in the assignment early, please notify the TA by email so we can start grading as early as possible. Do not attach the project files to the email; those go in the dropbox.

GROUPS: We strongly recommend you work with a partner on this assignment. If you do work with a partner, one member of the group should turn in a single project with everyone's name on it, and all members of the group will receive the same score. The group should also include a short readme.txt file listing its members and giving a short summary of who did what. Everyone in the group is responsible for all of the material, regardless of how you divide the work.

Where to go from here

The remainder of the instructions for this project consists of three parts:

NOTE: Keep your AWS usage low so that you do not use up the AWS credits we provide and get charged real money. Once you finish the Pig tutorial on AWS, we recommend you do the following: