AWS Setup

If you would like a video demonstration of generating a key pair and setting up a test cluster, go here for Windows and here for Mac. The clusters set up in these videos use small instances for demo purposes only. You will likely need a larger instance type for the homework.

Setting up your AWS Account

Go to https://console.aws.amazon.com/console/home and sign in to the AWS console. You now need to apply your credit code. Click on your name in the upper right corner of the console, select Billing and Cost Management, and then click on Credits. Follow the instructions to redeem your code.

Be aware that if you exceed the $100 credit, Amazon will charge your credit card without warning. Normally, this credit is more than enough for this homework assignment (if you are interested in the charges, see AWS charges: currently, AWS charges about 5 cents/node/hour for the default "small" node size). However, you must remember to manually terminate the AWS cluster (called a job flow) when you are done: if you just close the browser, the job flow continues to run, and Amazon will continue to charge you for days or weeks, exhausting your credit and then charging a large amount to your credit card. Remember to terminate the AWS cluster.
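To get a feel for how quickly a forgotten cluster can use up the credit, here is a rough sketch of the arithmetic in shell. It assumes the approximate rate of 5 cents/node/hour quoted above; the function name cluster_cost_cents and the 10-node, one-week scenario are made up for illustration, and actual AWS pricing may differ.

```shell
#!/bin/sh
# Rough cost estimate in whole cents, assuming the ~5 cents/node/hour rate
# quoted above for the default "small" node size (real pricing may differ).
RATE_CENTS_PER_NODE_HOUR=5
CREDIT_CENTS=10000   # the $100 credit, expressed in cents

# cluster_cost_cents NODES HOURS -> prints the estimated cost in cents
cluster_cost_cents() {
    echo $(( $1 * $2 * RATE_CENTS_PER_NODE_HOUR ))
}

# Hypothetical scenario: a 10-node cluster left running for one week (168 hours).
nodes=10
hours=168
cost=$(cluster_cost_cents "$nodes" "$hours")
echo "Estimated cost: $((cost / 100)) dollars"

# Whole hours a cluster of this size can run before the credit is used up.
hours_left=$(( CREDIT_CENTS / (nodes * RATE_CENTS_PER_NODE_HOUR) ))
echo "Credit lasts about $hours_left hours with $nodes nodes"
```

Under these assumptions, a 10-node cluster left running for a week would cost about $84, and the $100 credit would be gone after roughly 200 hours, which is why terminating the cluster matters.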

Setting up an EC2 Key Pair

Note: Some students have had problems running job flows because no active key was found. If this happens to you, go to the AWS security credentials page and make sure that you see a key listed under the access keys; if not, click Create a new Access Key.

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

  1. After setting up your account, follow Amazon's instructions to create a key pair. Follow the instructions in the section "Creating Your Key Pair Using Amazon EC2". (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)
  2. Download and save the .pem private key file to disk. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.
  3. Make sure only you can access the .pem file. If you do not change the permissions, you will get an error message later:
    $ chmod 600 </path/to/saved/keypair/file.pem>
  4. Note: The chmod command in step 3 will NOT work on Windows 7 with Cygwin. Windows 7 does not allow file permissions to be changed through this mechanism, and they must be changed for ssh to work. So if you must use Windows, you should use PuTTY as your ssh client. In this case, you will also have to convert the key file into PuTTY format. For more information, see Amazon's instructions on connecting to an EC2 instance using PuTTY. The remaining steps for connecting to an EC2 instance can be followed once you start an AWS cluster in the next section.
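If you want to confirm that the permission change in step 3 took effect, the following is a minimal sketch you can try on Linux or Mac. It uses a throwaway placeholder file (demo_key.pem, a hypothetical name) instead of your real key, so it is safe to run; substitute your own </path/to/saved/keypair/file.pem> when checking the real file.

```shell
#!/bin/sh
# Sketch: restrict a private key file to its owner and confirm the result.
# demo_key.pem is a hypothetical stand-in for </path/to/saved/keypair/file.pem>.
key_file="${TMPDIR:-/tmp}/demo_key.pem"

: > "$key_file"          # create an empty placeholder file
chmod 644 "$key_file"    # simulate a freshly downloaded, too-permissive file

chmod 600 "$key_file"    # owner read/write only, as in step 3

# The first 10 characters of `ls -l` output are the file type and mode bits.
mode=$(ls -l "$key_file" | cut -c1-10)
echo "$mode"
```

On a typical Linux or Mac filesystem this should print -rw-------, meaning only the owner can read or write the file. If group or other permissions still appear, ssh is likely to reject the key with the error mentioned in step 3.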

Starting an AWS Cluster and running Pig Interactively

To run a Pig job on AWS, you need to start up an AWS cluster using the web Management Console and connect to the Hadoop master node. Follow the steps below. You may also find Amazon's interactive Pig tutorial useful, but note that the screenshots are slightly out of date.

To set up and connect to a Pig cluster, perform the following steps: