Project 4: Hadoop and Pig

Due: Friday, December 10 at 11:00 pm - NO LATE WORK accepted

ESTIMATED TIME: Up to 18 hours, though likely less.

RIGHT NOW:

STARTER CODE: Download the project archive, project4.tar.gz. It contains Hadoop, Pig, and the data files and scripts you need for this project. Note: it is about 20 MB compressed.

TURN IN INSTRUCTIONS: Turn in eight files (details on the problems page) to the Catalyst dropbox.

GROUPS: We strongly recommend you work with a partner on this assignment. If you do work with a partner, one member of the group should turn in a single project with everyone's name on it; all members of the group will receive the same score. You should also include a short readme.txt file that lists the members of the group and briefly summarizes who did what. Everyone in the group is responsible for all of the material, regardless of how you divide the work.

Where to go from here

The remainder of the instructions for this project consists of three parts:

NOTE: Keep your AWS usage low so that you do not use up the AWS credits we gave you and get charged real money. Once you finish the Pig tutorial on AWS, we recommend you do the following: