Note: Amazon will ask you for your credit card information during the setup process. This is normal.
Note: Some students have had problems running job flows because no active access key was found. Go to the AWS Security Credentials page and make sure that you see a key listed under Access Keys; if not, click Create a New Access Key.
To connect to an Amazon EC2 node, such as the master nodes for the Hadoop clusters you will be creating, you need an SSH key pair. To create and install one, do the following:
Create a key pair and save the private key (.pem) file on your machine; we will refer to its location as </path/to/saved/keypair/file.pem> in the following instructions. Then restrict the file's permissions so that ssh will accept it:
$ chmod 600 </path/to/saved/keypair/file.pem>
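To double-check the permissions, a quick optional check (the path below is just the placeholder used throughout these instructions):
$ ls -l </path/to/saved/keypair/file.pem>
The listing should show -rw------- (owner read/write only); if it does not, re-run the chmod command above.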
To run a Pig job on AWS, you need to start up an AWS cluster using the web Management Console and connect to the Hadoop master node. Follow the steps below. You may also find Amazon's interactive Pig tutorial useful, but note that the screenshots are slightly out of date.
To set up and connect to a Pig cluster, perform the following steps:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
Use hadoop as the username when connecting to the master DNS node, as in the ssh example above and elsewhere below. Once connected, start Pig:
$ pig
You should now see the grunt> prompt. In this homework we will use Pig only interactively. (The alternative is to have Pig read the program from a file, such as example.pig.) You are now ready to return to the homework assignment.
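As an aside, if you later prefer the non-interactive route mentioned above, a minimal sketch is to copy the script to the master node and run it directly (assuming example.pig, the starter script provided with the assignment, is in your home directory there):
$ pig example.pig
Everything in this homework can be done by pasting commands at the grunt> prompt, so this is optional.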
To exit Pig, type quit at the grunt> prompt. To terminate the ssh session, type exit at the unix prompt; after that you must terminate the AWS cluster (see next). You are required in this homework to monitor the running Hadoop jobs on your AWS cluster using the master node's job tracker web UI.
By far the easiest way to do this is to use ssh tunneling.
ssh -L 9100:localhost:9100 -L 9101:localhost:9101 -i ~/.ssh/<your pem file> hadoop@<master DNS>
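With this tunnel open, the two -L flags forward your local ports 9100 and 9101 to the same ports on the master node, so you can browse the job tracker and HDFS manager from your own machine at:
http://localhost:9100/
http://localhost:9101/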
There are two other ways to do this: using lynx or using your own browser with a SOCKS proxy.
To use lynx, open an ssh connection to the AWS master node and type:
% lynx http://localhost:9100/
Navigation in lynx: up/down arrows = move through the links (the current link is highlighted); enter = follow a link; left arrow = return to the previous page.
To use your own browser instead, set up a SOCKS proxy with FoxyProxy: copy the foxyproxy.xml configuration file from the hw6/ folder into your Firefox profile folder. An alternative to using the foxyproxy.xml you copied into your profile is to follow Amazon's instructions to set up FoxyProxy manually. If you use Amazon's instructions, be careful to use port 8888 instead of the port in the instructions.
Whichever way you configure FoxyProxy, start the SOCKS proxy itself by opening a separate ssh connection from your local machine:
$ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8888 hadoop@<master.public-dns-name.amazonaws.com>
(The -N option tells ssh not to start a shell, and the -D 8888 option tells ssh to start the proxy and have it listen on port 8888.) With the proxy running and FoxyProxy enabled, open the job tracker and the HDFS manager in your browser at:
http://<master.public-dns-name.amazonaws.com>:9100/
http://<master.public-dns-name.amazonaws.com>:9101/
The job tracker lets you see which MapReduce jobs are executing in your cluster, along with details on the number of map and reduce tasks that are running or already completed.
Note that at this point in the instructions you will not see any MapReduce jobs running, but you should see that your cluster has the capacity to run a couple of map and reduce tasks on your one instance.
The HDFS manager gives you more low-level details about your cluster and all the log files for your jobs.
Later, in the assignment, we will show you how to launch MapReduce jobs through Pig. You will basically write Pig Latin scripts that will be translated into MapReduce jobs (see lecture notes). Some of these jobs can take a long time to run. If you decide that you need to interrupt a job before it completes, here is the way to do it:
If you want to kill Pig, first type CTRL-C, which kills Pig only. Next, kill the Hadoop job as follows: from the job tracker interface, find the Hadoop job_id, then type:
% hadoop job -kill job_id
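If you would rather stay on the command line, you can also look up the job ID directly from the master node; a small sketch (the job ID shown is only an illustration, use whatever -list prints for your job):
% hadoop job -list
% hadoop job -kill job_201802030000_0001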
You do not need to kill any jobs at this point.
However, you can now exit pig (just type "quit") and exit your ssh session. You can also kill the SSH SOCKS tunnel to the master node.
When you are done running Pig scripts, make sure to ALSO terminate your job flow. This is a step that you need to do in addition to stopping pig and Hadoop (if necessary) above.
This step shuts down your AWS cluster:
Pay attention to this step. If you fail to terminate your job flow and only close the browser or log off AWS, your cluster will continue to run, and AWS will continue to charge you: for hours, days, weeks, and when your credit is exhausted, it charges your credit card. Make sure you don't leave the console until you have confirmation that the job flow is terminated.
You can now shut down your cluster.
Please check your balance regularly!!!
To avoid unnecessary charges, terminate your job flows when you are not using them.
USEFUL: AWS customers can now use billing alerts to help monitor the charges on their AWS bill. You can get started today by visiting your Account Activity page to enable monitoring of your charges. Then, you can set up a billing alert by simply specifying a bill threshold and an e-mail address to be notified as soon as your estimated charges reach the threshold.
For the next step, you need to start a new cluster as follows. Hopefully it will go very quickly this time:
We will now get into more details about running Pig scripts.
Your pig program stores the results in several files in a directory. You have two options: (1) store these files in the Hadoop File System, or (2) store these files in S3. In both cases you need to copy them to your local machine.
This is done through the following Pig command (used in example.pig):
store count_by_object_ordered into '/user/hadoop/example-results' using PigStorage();
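For reference, PigStorage() with no argument writes plain tab-separated text, one tuple per line, split across part-* files (one per reduce task). If you ever wanted a different field separator, a variant (not needed for this assignment) would be:
store count_by_object_ordered into '/user/hadoop/example-results' using PigStorage(',');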
Before you run the pig query, you need to (A) create the /user/hadoop directory. After you run the query you need to (B) copy this directory to the local directory of the AWS master node, then (C) copy this directory from the AWS master node to your local machine.
You will need to do this for each new job flow that you create.
To create a /user/hadoop directory on the AWS cluster's HDFS file system, run this from the AWS master node:
% hadoop dfs -mkdir /user/hadoop
Check that the directory was created by listing it with this command:
% hadoop dfs -ls /user/hadoop
You may see some output from either command, but you should not see any errors.
You can also do this directly from grunt with the following command.
grunt> fs -mkdir /user/hadoop
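You can check the directory from grunt in the same way; the fs shortcut accepts the same options as hadoop fs on the command line:
grunt> fs -ls /user/hadoop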
Now you are ready to run your first sample program. Take a look at the starter code that we provided in hw6.tar.gz. Copy and paste the content of example.pig at the grunt> prompt.
(We give more details about this program back in hw6.html).
Note: The program may appear to hang with a 0% completion time... go check the job tracker. Scroll down. You should see a MapReduce job running with some non-zero progress.
Note 2: Once the first MapReduce job gets to 100%... if your grunt terminal still appears to be suspended... go back to the job tracker and make sure that the reduce phase is also 100% complete. It can take some time for the reducers to start making any progress.
Note 3: The example generates more than 1 MapReduce job... so be patient.
The result of a Pig script is stored in the Hadoop directory specified by the store command. That is, for example.pig, the output will be stored at /user/hadoop/example-results, as specified in the script.
HDFS is separate from the master node's file system, so before you can copy this to your local machine, you must copy the directory from HDFS to the master node's Linux file system:
% hadoop dfs -copyToLocal /user/hadoop/example-results example-results
This will create a directory example-results with part-* files in it, which you can copy to your local machine with scp.
You can then concatenate all the part-* files to get a single results file, perhaps sorting the results if you like.
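For example, a minimal sketch on your local machine, assuming the default tab-separated PigStorage output and that the counts you want to sort by are in the second column (adjust the column number to your actual schema):
$ cat example-results/part-* > example-results.txt
$ sort -t$'\t' -k2,2 -nr example-results.txt > example-results-sorted.txt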
An easier option may be to use:
% hadoop fs -getmerge /user/hadoop/example-results example-results
This command takes a source directory and a destination file as input, and concatenates the files in the source directory into a single destination file on the local file system.
Use hadoop dfs -help or see the hadoop dfs guide to learn how to manipulate HDFS. (Note that hadoop fs is the same as hadoop dfs.)
To copy one file from the AWS master node to your local machine, type the following on your local computer:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<file_path> .
where <file_path> can be absolute or relative to the AWS master node's home folder. The file will be copied into your current directory ('.') on your local computer. To copy a whole directory, such as example-results, type the following on your local computer:
$ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:example-results .
To use the second approach (storing the results in S3), go to your AWS Management Console, click on Create Bucket, and create a new bucket (= directory). Give it a name that you don't mind being public, and do not use any special characters, including underscores. Let's say you call it supermanhw6. Click on Actions, Properties, Permissions, and make sure you have all the permissions.
Modify the store command of example.pig to:
store count_by_object_ordered into 's3n://supermanhw6/example-results';
Run your Pig program. When it terminates, you should see the new directory example-results in your S3 console. Click on individual files to download them. The number of files depends on the number of reduce tasks and may vary from one to a few dozen. The only disadvantage of using S3 is that you have to click on each file separately to download it.
Note that S3 is permanent storage, and you are charged for it. You can safely store all your query answers for several weeks without exceeding your credit; at some point in the future remember to delete them.
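If you prefer the command line, you can also inspect or clean up the S3 results from the master node with the same hadoop fs commands used above; a hedged sketch, using the bucket name from this example:
% hadoop fs -ls 's3n://supermanhw6/example-results'
% hadoop fs -rmr 's3n://supermanhw6/example-results'
The -rmr command deletes the directory recursively, so only run it once you no longer need the results.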