AWS Setup

Setting up your AWS account

Note: Amazon will ask you for your credit card information during the setup process. This is normal.

  1. Go to http://aws.amazon.com/ and sign up:
    1. You may sign in using your existing Amazon account or you can create a new account by selecting "I am a new user."
    2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
    3. Once you have created an Amazon Web Services account, you may need to accept a telephone call to verify your identity. If you don't have a mobile number or don't want to give one, some students have used Google Voice successfully. You need Access Identifiers to make valid web service requests.
  2. Go to http://aws.amazon.com/ and sign in.
  3. You should have received your AWS credit code by email or in class. With this code, go to http://aws.amazon.com/awscredits/. This step gives you a $100 credit towards AWS. Be aware that if you exceed the credit, Amazon will charge your credit card without warning. Normally, the credit is more than enough for this homework assignment (if you are interested in the charges, see AWS charges: currently, AWS charges about 10 cents/node/hour for the default "small" node size). However, you must remember to manually terminate your AWS clusters when you are done: if you just close the browser, the clusters continue to run, and Amazon will continue to charge you for days or weeks, exhausting your credit and then charging a large amount to your credit card. Remember to terminate the AWS cluster.
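
To see how quickly a forgotten cluster can use up the credit, here is a rough sketch of the arithmetic. It uses the $100 credit and the approximate rate of 10 cents/node/hour quoted above. The function name credit_hours and the cluster sizes are hypothetical and chosen only for illustration; actual AWS pricing may differ.

```shell
# Rough estimate only. The rate is the approximate figure quoted in the text,
# and the cluster sizes passed in below are assumptions, not assignment values.
credit_hours() {
    nodes="$1"            # number of nodes in the cluster (hypothetical)
    credit_cents=10000    # $100 credit, expressed in cents
    rate_cents=10         # about 10 cents per node per hour
    # Whole hours until the credit is used up
    echo $(( credit_cents / (rate_cents * nodes) ))
}

credit_hours 10   # prints 100
credit_hours 20   # prints 50
```

Under these assumptions, a 10-node cluster left running would exhaust the credit in roughly four days.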

Setting up an EC2 key pair

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

  1. Go to the AWS security credentials page and make sure that you see a key under Access Keys. If not, click Create a new Access Key.
  2. Go to the EC2 Management Console. On the top panel, towards the right, choose N. Virginia as the region. We will create our cluster in N. Virginia because the input data is stored in that region. Then click "Key Pairs" on the navigation panel, and click the "Create Key Pair" button. Enter a key pair name and click "Yes". (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)
  3. Download and save the .pem private key file to disk. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.
  4. Make sure only you can access the .pem file. If you do not change the permissions, you will get an error message later:
    $ chmod 400 </path/to/saved/keypair/file.pem>
Note: This step will NOT work on Windows 7 with Cygwin. Windows 7 does not allow file permissions to be changed through this mechanism, and they must be changed for ssh to work. If you must use Windows, use PuTTY as your ssh client. PuTTY should already be installed on all the lab machines and remote desktop Windows machines available to you. In this case, you will also have to convert the key file into PuTTY format. For more information, see Amazon's instructions on connecting to an EC2 instance using PuTTY.
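
On Linux or Mac OS X, the permission change in step 4 can be wrapped in a small helper that also shows the result, so you can confirm it worked before trying ssh. This is only a sketch: the function name secure_key is hypothetical, and the path is the same placeholder used above.

```shell
# Hypothetical helper: make a key file readable by its owner only,
# then print its permission string so the change can be checked.
secure_key() {
    key_file="$1"
    if [ ! -f "$key_file" ]; then
        echo "no such file: $key_file" >&2
        return 1
    fi
    chmod 400 "$key_file" || return 1
    # The first field of ls -l is the permission string
    ls -l "$key_file" | awk '{print $1}'
}

# Usage (replace the placeholder with your own path):
#   secure_key </path/to/saved/keypair/file.pem>
```

After a successful run, the printed permission string should begin with -r--------, meaning only the owner can read the file.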

Starting a MapReduce Cluster and running Pig Interactively

To run a Pig job on AWS, you need to start up a cluster using the Elastic MapReduce Management Console and connect to the Hadoop master node. Follow the steps below. You may also find Amazon's interactive Pig tutorial useful, but note that the screenshots are slightly out of date.

To set up and connect to a Pig cluster, perform the following steps: