Project 4 - AWS setup and cluster usage help

Note: Some of the sample commands on this page are quite long. Enter each one on a single line, even though it may be broken across multiple lines here for convenience when you view or print the page.

Setting up your AWS account

  1. Go to http://aws.amazon.com/ and sign up:
    1. You may sign in with an existing Amazon account or create a new one by selecting "I am a new user." We suggest creating a separate account for this project rather than reusing an existing Amazon customer account.
    2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
    3. Once you have created an Amazon Web Services account, check your email and complete the confirmation step. You need Access Identifiers to make valid web service requests.
  2. Welcome to Amazon Web Services. Before doing anything, get an AWS Credit Coupon number from us for your use in CSE 444. The $100 coupon should be sufficient to cover AWS charges for this project. (Currently, AWS charges about 10 cents/node/hour for the default "small" node size.)
    1. You should receive your AWS claim code by email. If you don't receive it by the end of the day on Friday, August 6, please email us.
    2. Go to http://aws.amazon.com/awscredits/.
    3. Enter your claim code and click Redeem. The credits will be added to your AWS account.
  3. Sign up for Amazon Elastic MapReduce. Amazon Elastic MapReduce uses Amazon Elastic Compute Cloud to run your job flows and Amazon Simple Storage Service to store and access your data. After completing the sign-up process, you will have signed up to use Amazon Elastic Compute Cloud and Simple Storage Service.
    1. Go to http://aws.amazon.com/elasticmapreduce/.
    2. Sign up. (Note: you must provide a credit card number; however, charges are applied to your AWS Credit Coupon balance before your credit card. This project should not use up your entire credit balance.)

Setting up an EC2 key pair

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

  1. As part of your account setup, create a keypair using Amazon's instructions. (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)
  2. Instead of calling the key MyFirstKeyPair.pem, give it a descriptive name and save it in a convenient location. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.
  3. Make sure only you can access the .pem file, just to be safe:
    $ chmod 600 </path/to/saved/keypair/file.pem>
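    On Linux or Mac OS X, you can confirm the new permissions with:
    $ ls -l </path/to/saved/keypair/file.pem>
    The permissions column should read -rw-------, meaning only you can read and write the file.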

Running jobs on AWS

To run a Pig job on AWS, you need to start up an AWS cluster using the web Management Console, then ssh into the Hadoop master node, as follows:

  1. Complete Section 1 and Section 2 in Amazon's interactive Pig tutorial. Note that for your final runs, you should set your cluster to have at least 5 nodes, rather than the 1 node suggested on that page.
  2. You should now be able to connect to the master node using SSH:
    $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
    
    where <master.public-dns-name.amazonaws.com> is the master node's Public DNS Name listed in the Management Console.

    From here, you can run Pig and Hadoop jobs that will automatically use all the nodes in your cluster.
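    For example, one quick sanity check (a sketch; the bucket and file names below are placeholders for your own input data in S3) is to start the Pig interactive shell, Grunt, and print a few lines of an input file:
    $ pig
    grunt> raw = LOAD 's3n://<your-bucket>/<your-input-file>' USING TextLoader() AS (line:chararray);
    grunt> first10 = LIMIT raw 10;
    grunt> DUMP first10;
    grunt> quit;
    The DUMP statement launches a small MapReduce job on the cluster and prints the first 10 lines of the input.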

Terminating a running cluster

After you are done, shut down the AWS cluster:

  1. Go to the Management Console.
  2. Select the job in the list.
  3. Click the Terminate button (it should be right below "Your Elastic MapReduce Job Flows").
  4. Wait for a while and recheck until the job state becomes TERMINATED.

Copying scripts/data to/from the master node

We will use the scp program (a cousin of SSH and SFTP) to copy files to and from the master node.

Copying to the AWS master node:

  1. After you have a cluster running, get the master node's public DNS name from the Management Console.
    Let's call this <master.public-dns-name.amazonaws.com>.
  2. Now to copy local_file to the home folder on the master node (the folder you start in when you ssh in), use:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:
    
    Don't forget the colon (:) at the end of the master node's DNS name.
    If you'd like to copy files to a different directory <dest_dir>, put the directory name after the colon:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
    
    The path <dest_dir> can be absolute, or relative to the master node's home folder.
  3. Now, on the AWS master node, you can cd into <dest_dir> and you should see your file there.
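
For example, to copy a hypothetical script named myscript.pig into a directory named scripts under the master node's home folder (both names are placeholders; substitute your own), you could run:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> myscript.pig hadoop@<master.public-dns-name.amazonaws.com>:scripts
Note that the destination directory must already exist on the master node; you can create it with mkdir after ssh-ing in.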

Copying from the AWS master node to the local computer:

  1. Once your job has completed or you want to save an updated version of your script, copy your file back to the local computer by running this command on the local machine:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .
    
    where <file_path> can be absolute or relative to the master node's home folder.
  2. The file should be copied into your current directory ('.') on your local computer.

Copying multiple files

One way to copy multiple files is to tar them up, copy the tarball over, and un-tar the tarball on the other side.
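
For example (a sketch with hypothetical file names; substitute your own), run this on the local machine:
    $ tar czf myfiles.tar.gz myscript.pig mydata.txt
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> myfiles.tar.gz hadoop@<master.public-dns-name.amazonaws.com>:
Then, after ssh-ing into the master node:
    $ tar xzf myfiles.tar.gz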

Monitoring Hadoop jobs with the job tracker

You can monitor the running Hadoop jobs on your AWS cluster through the master node's job tracker web UI. One way to access the job tracker is directly from the master node using the lynx text-based browser, with this command at the master node's shell prompt:

% lynx http://localhost:9100/

However, we recommend accessing it with Firefox on your local machine. To do this, you must first configure Firefox to use a proxy when connecting to the master node, and then tunnel the job tracker traffic through a SOCKS proxy running over your SSH connection to the master node. This has two main steps:

Set up Firefox to use FoxyProxy

  1. Install the FoxyProxy extension for Firefox.
  2. Copy the foxyproxy.xml configuration file from the project4/ folder into your Firefox profile folder.
  3. If the previous step doesn't work for you, try deleting the foxyproxy.xml you copied into your profile, and using Amazon's instructions to set up FoxyProxy manually.

Tunneling the job tracker UI via a SOCKS proxy

  1. Open a new local terminal window and create the SSH SOCKS tunnel to the master node using the following:
    $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
    (The -N option tells ssh not to start a remote shell, just keep the connection open; the -D 8157 option tells ssh to run a SOCKS proxy listening on local port 8157.)
    Keep this window running in the background (minimize it) until you are finished with the proxy.
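    If you want to verify that the tunnel is working before configuring Firefox, one option (assuming curl is installed on your local machine) is to fetch the job tracker page through the proxy from another local terminal:
    $ curl --socks5-hostname localhost:8157 http://<master.public-dns-name.amazonaws.com>:9100/
    If the tunnel is up, this prints the job tracker page's HTML.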

  2. Now enable FoxyProxy in Firefox and access the job tracker UI using these URLs (per Amazon's instructions):