Project 4 - AWS setup and cluster usage help

Read this page together with the Preliminaries - the two pages often refer to each other!

Note: Some of the sample commands on this page are quite long. When you enter them, type each one on a single line, even though it may be broken across multiple lines when you view or print this web page.

Setting up your AWS account

  1. Go to http://aws.amazon.com/ and sign up:
    1. You may sign in with your existing Amazon account, or create a new account by selecting "I am a new user." For this project, you and your partner should probably each create a new, separate Amazon account for AWS use.
    2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
    3. Once you have created an Amazon Web Services account, check your email and complete the confirmation step. You need Access Identifiers to make valid web service requests.
  2. Welcome to Amazon Web Services. Now, sign up for Amazon Elastic MapReduce. Elastic MapReduce uses Elastic Compute Cloud (EC2) to run your job flows and Simple Storage Service (S3) to store and access your data. After completing the sign-up process, you will have signed up to use EC2, S3, and Elastic MapReduce.
    1. Go to http://aws.amazon.com/elasticmapreduce/.
    2. Sign up. (You do need to provide a credit card number; however, charges are drawn from your AWS credit coupon balance before your credit card is billed, and this project shouldn't use up your entire credit balance.)
  3. Before trying to run any Hadoop jobs, get an AWS Credit Coupon number from us for your team's use in CSE 444. The $100 coupon should be sufficient to cover AWS charges for this project. (Currently, AWS charges about 10 cents/node/hour for the default "small" node size.) This code must be applied to an individual partner's account; you will have to share use of that account for any jobs on the AWS clusters.
    1. Wait for your AWS claim code. We will send it to you by noon on Tuesday, November 23.
    2. Decide which partner's account you want to apply the code to.
    3. Go to http://aws.amazon.com/awscredits/ and sign in as the partner you want to apply the code to.
    4. Enter the claim code and click Redeem. The credits will be added to your AWS account.

Setting up an EC2 key pair

To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:

  1. After setting up your account, follow Amazon's instructions to create a key pair. Follow the instructions in section "Having AWS create the key pair for you," subsection "AWS Management Console." (Don't do this in Internet Explorer, or you might not be able to download the .pem private key file.)
  2. Download and save the .pem private key file to disk. We will reference the .pem file as </path/to/saved/keypair/file.pem> in the following instructions.
  3. Make sure only you can access the .pem file, just to be safe:
    $ chmod 600 </path/to/saved/keypair/file.pem>
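    To double-check, list the file; the permissions shown should begin with -rw------- (owner read/write only):
    $ ls -l </path/to/saved/keypair/file.pem>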

Running jobs on AWS

To run a Pig job on AWS, you need to start up an AWS cluster using the web Management Console, then connect to the Hadoop master node, as follows:

  1. Complete Section 1 and Section 2 in Amazon's interactive Pig tutorial. Note that for your final runs, you should set your cluster to have at least 5 nodes, rather than the 1 node suggested on that page.
  2. You should now be able to connect to the master node using SSH:
    $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
    
    where <master.public-dns-name.amazonaws.com> is the master node's Public DNS Name listed in the Management Console.

    From here, you can run Pig and Hadoop jobs that will automatically use all the nodes in your cluster.
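    For example, you can start the interactive Grunt shell, or run a Pig script that you have copied to the master node (myscript.pig below is just a hypothetical file name):
    $ pig
    $ pig myscript.pig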

Terminating a running cluster

After you are done, shut down the AWS cluster:

  1. Go to the Management Console.
  2. Select the job in the list.
  3. Click the Terminate button (it should be right below "Your Elastic MapReduce Job Flows").
  4. Wait a while and recheck until the job state becomes TERMINATED.

Copying files to or from the master node

We will use the scp program (a cousin of SSH and SFTP) to copy files to and from the master node.

Copying to the AWS master node:

  1. After you have a cluster running, get the master node's public DNS name from the Management Console.
    Let's call this <master.public-dns-name.amazonaws.com>.
  2. Now to copy local_file to the home folder on the master node (the folder you start in when you ssh in), use:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:
    
    Don't forget the colon (:) at the end of the master node's DNS name.
    If you'd like to copy files to a different directory <dest_dir>, put the directory name after the colon:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
    
    The path <dest_dir> can be absolute, or relative to the master node's home folder.
  3. Now, on the AWS master node, you can cd into <dest_dir> and you should see your file there.

Copying from the AWS master node to the local computer:

  1. To copy files from the master node back to your computer, run this command on the local computer:
    $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .
    
    where <file_path> can be absolute or relative to the master node's home folder.
  2. The file should be copied into your current directory ('.') on your local computer.

Copying multiple files

The easiest way to copy multiple files with scp is to put them all into the same directory, and then use scp's -r option to copy that directory tree recursively, much like regular cp's -r option.
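
For example, assuming your files are in a local directory named <local_dir>, the following should copy the whole directory into the master node's home folder:

$ scp -r -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> <local_dir> hadoop@<master.public-dns-name.amazonaws.com>:

This creates <local_dir> (and everything inside it) under the home folder on the master node; the same -r option also works when copying a directory from the master node back to your local computer.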

Monitoring Hadoop jobs with the job tracker

You can monitor the running Hadoop jobs on your AWS cluster using the master node's job tracker web UI. One way to access the job tracker is directly from the master node with the lynx text-based browser. To do this, run this command at the master node's shell prompt:

% lynx http://localhost:9100/

Remember to open a separate ssh connection to the master node so you can run this command simultaneously with the actual Pig job.

If you prefer a graphical browser, you can access the job tracker with Firefox on your local machine. To do this, you first set up Firefox to use a proxy when connecting to the master node, and then tunnel the job tracker traffic through a SOCKS proxy running over your SSH connection to that node. This has two main steps:

Set up Firefox to use FoxyProxy

  1. Install the FoxyProxy extension for Firefox.
  2. Copy the foxyproxy.xml configuration file from the project4/ folder into your Firefox profile folder (see the example after this list).
  3. If the previous step doesn't work for you, try deleting the foxyproxy.xml you copied into your profile, and using Amazon's instructions to set up FoxyProxy manually.
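
For step 2, the exact location of your Firefox profile folder depends on your operating system; on Linux, for example, profiles usually live under ~/.mozilla/firefox/, so the copy might look like this (<profile> stands for whatever your profile folder is named):

$ cp project4/foxyproxy.xml ~/.mozilla/firefox/<profile>/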

Tunneling the job tracker UI via a SOCKS proxy

  1. Open a new local terminal window and create the SSH SOCKS tunnel to the master node using the following:
    $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
    (The -N option tells ssh not to run a remote command or start a shell, and the -D 8157 option tells ssh to run a SOCKS proxy listening on local port 8157.)

    The resulting SSH window will appear to hang, without any output; this is normal as SSH has not started a shell on the master node, but just created the tunnel over which proxied traffic will run.

    Keep this window running in the background (minimize it) until you are finished with the proxy, then close the window to shut the proxy down.

  2. Now enable FoxyProxy in Firefox and access the job tracker UI using these URLs (per Amazon's instructions):
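    For example, based on the port used in the lynx command above, the job tracker should be reachable at a URL of this form:
    http://<master.public-dns-name.amazonaws.com>:9100/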