HW 7 - AWS setup and cluster usage help
Note: Some of the sample commands on this page are quite long. Enter each
command on a single line, even if it appears broken across multiple lines
when you view or print this page.
Setting up your AWS account
- Go to http://aws.amazon.com/
and sign up:
- You may sign in with your existing Amazon account, or create a new
account by selecting "I am a new user."
We suggest setting up a new Amazon account for this project, separate
from any existing Amazon customer account you might have.
- Enter your contact information and confirm your
acceptance of the
AWS Customer Agreement.
- Once you have created an Amazon Web Services account,
check your email for the confirmation step. You need Access Identifiers
to make valid web service requests.
- Welcome to Amazon Web Services. Before doing anything, get an AWS Credit
Coupon number from us for your use in CSEP 544. The $100 coupon
should be sufficient to cover
AWS charges
for this project.
(Currently, AWS charges about 10 cents/node/hour for the default "small" node size.)
- Retrieve your AWS claim code from here.
- Go to
http://aws.amazon.com/awscredits/.
- Enter your claim code and click Redeem. The credits will be added to
your AWS account.
- Sign up for Amazon Elastic MapReduce. Amazon
Elastic MapReduce uses Amazon Elastic Compute Cloud to run your job
flows and Amazon Simple Storage Service to store and access your data.
After completing the sign-up process, you will have signed up to use
Amazon Elastic Compute Cloud and Simple Storage Service.
- Go to http://aws.amazon.com/elasticmapreduce/.
- Sign up. (Note: you will need to provide your credit card
number; however, your AWS Credit Coupon balance will be charged
before your credit card. This project should not use up your entire
AWS credit balance.)
Setting up an EC2 key pair
To connect to an Amazon EC2 node, such as the master nodes for the
Hadoop clusters you will be creating, you need an SSH key pair.
To create and install one, do the following:
- As part of your account setup, create a keypair using
Amazon's instructions under section "To create an Amazon EC2 key pair". (Don't do this in Internet Explorer,
or you might not be able to download the .pem private key file.)
- Instead of calling the key MyFirstKeyPair.pem, give it a
descriptive name and save it in a convenient location. We will reference
the .pem file as
</path/to/saved/keypair/file.pem>
in the following instructions.
- Make sure only you can access the .pem file, just to be safe:
$ chmod 600 </path/to/saved/keypair/file.pem>
Running jobs on AWS
To run a Pig job on AWS, you need to start up an AWS cluster using the
web Management Console, then ssh into the Hadoop master node,
as follows:
- Complete Section 1 and Section 2 in
Amazon's interactive Pig tutorial. Note that for your final runs, you should
set your cluster to have at least 5 nodes, rather than the 1 node suggested
on that page.
- You should now be able to connect to the master node using SSH:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
where <master.public-dns-name.amazonaws.com> is the master node's Public DNS Name
listed in the Management Console.
From here, you can run Pig and Hadoop jobs that will automatically
use all the nodes in your cluster.
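For example, once logged into the master node, you can start Pig's interactive
Grunt shell, or run a saved script directly. (The name example.pig below is just
a placeholder for your own script file, not something provided with the homework.)
$ pig                  # start the interactive Grunt shell; jobs run across the cluster
$ pig example.pig      # or run a Pig script non-interactively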
Terminating a running cluster
After you are done, shut down the AWS cluster:
- Go to the
Management Console.
- Select the job in the list.
- Click the Terminate button (it should be right below "Your
Elastic MapReduce Job Flows").
- Wait for a while and recheck until the job state becomes TERMINATED.
Copying scripts/data to/from the master node
We will use the scp program (a cousin of SSH and SFTP)
to copy files to and from the master node.
Copying to the AWS master node:
- After you have a cluster running, get the master node's
public DNS name from the
Management Console.
Let's call this <master.public-dns-name.amazonaws.com>.
- Now to copy file_to_copy to the home folder on the master node
(the folder you start in when you ssh in), use:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> file_to_copy hadoop@<master.public-dns-name.amazonaws.com>:
Don't forget the colon (:) at the end of the master node's DNS name.
If you'd like to copy files to a different directory <dest_dir>,
put the directory name after the colon:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> file_to_copy hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
The path <dest_dir>
can be absolute, or relative to the
master node's home folder.
- Now on the AWS master node, you can cd into the <dest_dir>
and you should see your file there.
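As a quick sketch (assuming you copied file_to_copy to some <dest_dir> as above),
the check on the master node might look like this:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
$ cd <dest_dir>    # on the master node
$ ls               # file_to_copy should appear in the listing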
Copying from the AWS master node to the local computer:
- Once your job has completed or you want to save an
updated version of your script, copy your file back to the local
computer by running this command on the local machine:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .
where <file_path>
can be absolute or relative to the
master node's home folder.
- The file should be copied into your current directory
('.') on your local computer.
Copying multiple files
One way to copy multiple files is to tar them up, copy the tarball over,
and un-tar the tarball on the other side.
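For example, here is a sketch assuming your files are in a local folder named
my_scripts (an illustrative name, not part of the homework):
$ tar czf my_scripts.tar.gz my_scripts/    # create a compressed tarball locally
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> my_scripts.tar.gz hadoop@<master.public-dns-name.amazonaws.com>:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
$ tar xzf my_scripts.tar.gz    # run on the master node to unpack into ~/my_scripts/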
Monitoring Hadoop jobs with the job tracker
You can monitor the running Hadoop jobs on your AWS cluster
using the master node's job tracker web UI.
You can access the job tracker from the master node directly
by using the lynx
text-based browser. To do this,
use this command at the master node's shell prompt:
% lynx http://localhost:9100/
However, we recommend accessing it with Firefox on your local machine.
To do this, you must first configure Firefox to use a proxy when connecting
to the master node, and then tunnel the job tracker site through a SOCKS
proxy running over your SSH connection to the master node.
This has two main steps:
Set up Firefox to use FoxyProxy
- Install the
FoxyProxy extension for Firefox.
- Copy the foxyproxy.xml configuration file from the hw7/ folder into your
Firefox profile folder (a sketch of this step appears after this list).
- If the previous step doesn't work for you, try deleting the foxyproxy.xml
you copied into your profile, and instead use
Amazon's instructions to set up FoxyProxy manually.
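As a rough sketch of the copy step above, assuming a Linux machine where Firefox
profiles typically live under ~/.mozilla/firefox/ (the profile folder name varies
by installation, and Windows and Mac OS keep profiles elsewhere):
$ cp hw7/foxyproxy.xml ~/.mozilla/firefox/<your-profile-folder>/    # then restart Firefox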
Tunneling the job tracker UI via a SOCKS proxy
- Open a new local terminal window and create the SSH SOCKS tunnel
to the master node using the following:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
(The -N option tells ssh not to start a shell, and the -D 8157 option
tells ssh to start the proxy and have it listen on port 8157.)
Keep this window running in the background (minimize it) until
you are finished with the proxy.
- Now enable FoxyProxy in Firefox and access the job tracker UI
using these URLs (per
Amazon's instructions):
- For the job tracker:
http://<master.public-dns-name.amazonaws.com>:9100/
- For HDFS management:
http://<master.public-dns-name.amazonaws.com>:9101/