Project 4 - AWS setup and cluster usage help
Read this page together with the Preliminaries - the two pages often refer to each other!
Note: Some of the sample commands on this page are quite long. Enter each
command on a single line, even if it appears broken across several lines
when you view or print the web page.
Setting up your AWS account
- Go to http://aws.amazon.com/
and sign up:
- You may sign in using your existing Amazon account or
you can create a new account by selecting "I am a new user."
For this project, both you and your partner should probably
create new, separate Amazon accounts for AWS use.
- Enter your contact information and confirm your
acceptance of the AWS Customer Agreement.
- Once you have created an Amazon Web Services Account,
check your email and complete the confirmation step. You need Access Identifiers
to make valid web service requests.
- Welcome to Amazon Web Services. Now, sign up for Amazon Elastic MapReduce.
Elastic MapReduce uses Elastic Compute Cloud (EC2) to run your job
flows and Simple Storage Service (S3) to store and access your data.
After completing the sign-up process, you will have signed up to use
EC2, S3, and Elastic MapReduce.
- Go to http://aws.amazon.com/elasticmapreduce/.
- Sign up. (You do need to give your credit card
number; however, your AWS credit coupon balance will be charged
before your credit card. This project shouldn't use all your AWS
credit balance.)
- Before trying to run any Hadoop jobs, get an AWS Credit
Coupon number from us for your team's use in CSE 444. The $100 coupon
should be sufficient to cover AWS charges for this project. (Currently,
AWS charges about 10 cents/node/hour for the default "small" node size.)
The coupon code must be applied to a single partner's account, so you
will both have to share that account for any jobs on the AWS clusters.
- Wait for your AWS claim code. We will send it to you by noon on Tuesday, November 23.
- Decide which partner's account you want to apply the code to.
- Go to http://aws.amazon.com/awscredits/ and sign in with the account
of the partner you chose.
- Enter the claim code and click Redeem. The credits will be added to
your AWS account.
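As a rough budgeting aid, the sketch below works out how many hours of cluster time the $100 coupon covers at the approximate rate of 10 cents/node/hour quoted above. The 5-node figure is the minimum cluster size this page suggests for final runs. The function name is made up for illustration, and the estimate ignores any other charges (storage, for example), so treat it as an upper bound rather than a billing prediction.

```shell
#!/bin/sh
# Rough budget estimate. All amounts are in cents to keep the math integral.
COUPON_CENTS=10000   # the $100 coupon
RATE_CENTS=10        # approximate charge per node per hour (may change)
NODES=5              # cluster size suggested for final runs

cluster_hours() {
    # Hours of cluster time the coupon covers for a cluster of $1 nodes.
    echo $(( COUPON_CENTS / (RATE_CENTS * $1) ))
}

echo "A ${NODES}-node cluster can run for about $(cluster_hours "$NODES") hours"
```

Under these assumptions a 5-node cluster costs about 50 cents per hour, so the coupon covers roughly 200 cluster-hours. A cluster left running by accident eats into that budget just as a working one does.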
Setting up an EC2 key pair
To connect to an Amazon EC2 node, such as the master nodes for the
Hadoop clusters you will be creating, you need an SSH key pair.
To create and install one, do the following:
- After setting up your account, follow
Amazon's instructions to create a key pair. Follow the instructions in
section "Having AWS create the key pair for you," subsection "AWS Management
Console." (Don't do this in Internet Explorer,
or you might not be able to download the .pem private key file.)
- Download and save the .pem private key file to disk. We will refer to
the .pem file as </path/to/saved/keypair/file.pem> in the following
instructions.
- Make sure only you can access the .pem file, just to be safe:
$ chmod 600 </path/to/saved/keypair/file.pem>
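If you want to confirm that the permissions took effect, a small check along the following lines may help. It is only a sketch: the helper name key_is_private is made up for this example, and the default KEY path is the placeholder used on this page, so set KEY to wherever you actually saved the file.

```shell
#!/bin/sh
# Placeholder default; set KEY to the real location of your .pem file.
KEY="${KEY:-/path/to/saved/keypair/file.pem}"

key_is_private() {
    # Succeeds only when the mode string is exactly -rw------- (mode 600).
    [ "$(ls -ld "$1" | cut -c1-10)" = "-rw-------" ]
}

if [ -f "$KEY" ]; then
    if key_is_private "$KEY"; then
        echo "OK: only the owner can read or write $KEY"
    else
        echo "Fixing permissions on $KEY"
        chmod 600 "$KEY"
    fi
else
    echo "No key file found at $KEY"
fi
```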
Running jobs on AWS
To run a Pig job on AWS, you need to start up an AWS cluster using the
web Management Console, then connect to the Hadoop master node,
as follows:
- Complete Section 1 and Section 2 in
Amazon's interactive Pig tutorial. Note that for your final runs, you should
set your cluster to have at least 5 nodes, rather than the 1 node suggested
on that page.
- You should now be able to connect to the master node using SSH:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
where <master.public-dns-name.amazonaws.com> is the master node's Public DNS Name
listed in the Management Console.
From here, you can run Pig and Hadoop jobs that will automatically
use all the nodes in your cluster.
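Because the same long ssh command comes up repeatedly, you may find it convenient to wrap it in a shell function. The following is one possible sketch, not part of the official instructions: the names KEY, MASTER, DRY_RUN, and emr_ssh are invented here, and the default values are the placeholders used on this page.

```shell
#!/bin/sh
# Placeholder defaults; replace with your key file and your master node's
# Public DNS Name from the Management Console.
KEY="${KEY:-/path/to/saved/keypair/file.pem}"
MASTER="${MASTER:-master.public-dns-name.amazonaws.com}"

emr_ssh() {
    # Connect to the master node as user hadoop. Extra arguments are passed
    # to ssh, so "emr_ssh ls" runs ls remotely instead of opening a shell.
    # When DRY_RUN is non-empty, print the command instead of running it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo ssh -o "ServerAliveInterval 10" -i "$KEY" "hadoop@$MASTER" "$@"
    else
        ssh -o "ServerAliveInterval 10" -i "$KEY" "hadoop@$MASTER" "$@"
    fi
}
```

After loading this into your shell, running emr_ssh with no arguments opens an interactive session on the master node. Setting DRY_RUN=1 first lets you inspect the command without connecting.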
Terminating a running cluster
After you are done, shut down the AWS cluster:
- Go to the Management Console.
- Select the job in the list.
- Click the Terminate button (it should be right below "Your
Elastic MapReduce Job Flows").
- Wait a while, then recheck until the job state becomes TERMINATED.
Copying files to or from the master node
We will use the scp program (a cousin of SSH and SFTP)
to copy files to and from the master node.
Copying to the AWS master node:
- After you have a cluster running, get the master node's public DNS name
from the Management Console. Let's call this
<master.public-dns-name.amazonaws.com>.
- Now, to copy local_file to the home folder on the master node
(the folder you start in when you ssh in), use:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:
Don't forget the colon (:) at the end of the master node's DNS name.
If you'd like to copy files to a different directory <dest_dir>,
put the directory name after the colon:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
The path <dest_dir> can be absolute, or relative to the
master node's home folder.
- Now on the AWS master node, you can cd into <dest_dir>,
and you should see your file there.
Copying from the AWS master node to the local computer:
- To copy files from the master node back to your computer,
run this command on the local computer:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .
where <file_path> can be absolute or relative to the
master node's home folder.
- The file should be copied into your current directory
('.') on your local computer.
Copying multiple files
The easiest way to copy multiple files with scp is to put them all into
the same directory, and then use scp's -r option to copy that directory
tree recursively, similarly to regular cp's -r option:
- To copy files on your computer to the master node:
$ mkdir to-copy
$ cp file1 file2... to-copy/
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -r to-copy/ hadoop@<master.public-dns-name.amazonaws.com>:
- To copy files on the master node to your computer:
# on the master node
% mkdir to-copy
% cp file1 file2... to-copy/
# on local computer
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:to-copy/ .
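The copy commands above all share the same options, so they can be wrapped in a pair of helper functions. This is a sketch under assumptions: the names emr_scp, emr_put, emr_get, KEY, MASTER, and DRY_RUN are invented for illustration, and the defaults are this page's placeholders.

```shell
#!/bin/sh
# Placeholder defaults; replace with your key file and master node name.
KEY="${KEY:-/path/to/saved/keypair/file.pem}"
MASTER="${MASTER:-master.public-dns-name.amazonaws.com}"

emr_scp() {
    # Shared options: keep-alive, key file, and -r so directories also work.
    # When DRY_RUN is non-empty, print the command instead of running it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo scp -o "ServerAliveInterval 10" -i "$KEY" -r "$@"
    else
        scp -o "ServerAliveInterval 10" -i "$KEY" -r "$@"
    fi
}

emr_put() {
    # emr_put LOCAL_PATH...  copies into the master node's home folder.
    # The trailing colon after the host name is required.
    emr_scp "$@" "hadoop@$MASTER:"
}

emr_get() {
    # emr_get REMOTE_PATH [LOCAL_DIR]  copies back to the local computer.
    # LOCAL_DIR defaults to the current directory ('.').
    emr_scp "hadoop@$MASTER:$1" "${2:-.}"
}
```

With these loaded, "emr_put to-copy/" uploads the directory and "emr_get to-copy/" downloads it again. Since scp accepts several source paths before the destination, "emr_put file1 file2" should also work without first gathering the files into one directory.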
Monitoring Hadoop jobs with the job tracker
You can monitor the running Hadoop jobs on your AWS cluster using
the master node's job tracker web UI. You can access the job tracker
directly from the master node by using the lynx text-based browser.
To do this, use this command at the master node's shell prompt:
% lynx http://localhost:9100/
Remember to open a separate ssh connection to the master node
so you can run this command simultaneously with the actual Pig job.
If you prefer a graphical browser, you can access the job tracker
with Firefox on your local machine. To do this you must first set up Firefox
to use a proxy when connecting to the master node. Then you need to tunnel
the job tracker site through a proxy running over your SSH connection to the
master node. This has two main steps:
Set up Firefox to use FoxyProxy
- Install the FoxyProxy extension for Firefox.
- Copy the foxyproxy.xml configuration file from the project4/ folder
into your Firefox profile folder.
- If the previous step doesn't work for you, try deleting the
foxyproxy.xml you copied into your profile, and using
Amazon's instructions to set up FoxyProxy manually.
Tunneling the job tracker UI via a SOCKS proxy
- Open a new local terminal window and create the SSH SOCKS tunnel
to the master node using the following:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
(The -N option tells ssh not to start a shell, and the -D 8157 option
tells ssh to start the proxy and have it listen on port 8157.)
The resulting SSH window will appear to hang, without any output;
this is normal as SSH has not started a shell on the master node, but
just created the tunnel over which proxied traffic will run.
Keep this window running in the background (minimize it) until
you are finished with the proxy, then close the window to shut
the proxy down.
- Now enable FoxyProxy in Firefox and access the job tracker UI
using these URLs (per Amazon's instructions):
- For the job tracker:
http://<master.public-dns-name.amazonaws.com>:9100/
- For HDFS management:
http://<master.public-dns-name.amazonaws.com>:9101/
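If the pages do not load in Firefox, it can help to test the tunnel separately from the browser setup. The sketch below builds the two URLs above and defines an optional check that requests a page through the SOCKS proxy with curl. Using curl here is an assumption of this example rather than part of the course instructions, and the names ui_url, check_ui, MASTER, and PROXY_PORT are invented for illustration.

```shell
#!/bin/sh
# Placeholder default; replace with your master node's Public DNS Name.
MASTER="${MASTER:-master.public-dns-name.amazonaws.com}"
PROXY_PORT=8157   # must match the port given to ssh -D

ui_url() {
    # Map a UI name to its URL on the master node.
    case "$1" in
        jobtracker) port=9100 ;;
        hdfs)       port=9101 ;;
        *) echo "unknown UI: $1" >&2; return 1 ;;
    esac
    echo "http://$MASTER:$port/"
}

check_ui() {
    # Print only the HTTP status code for the page, fetched through the
    # SOCKS proxy. The SSH tunnel must already be running for this to work.
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
         --socks5-hostname "localhost:$PROXY_PORT" "$(ui_url "$1")"
}

echo "Job tracker: $(ui_url jobtracker)"
echo "HDFS:        $(ui_url hdfs)"
```

With the tunnel open in another window, "check_ui jobtracker" printing 200 suggests the tunnel itself works, which narrows any remaining problem down to the FoxyProxy configuration.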