Using the IBM/Google Hadoop cluster

1 Getting access to the IBM/Google cluster

Step 1: Acquire credentials from Cloud Administration to access the cluster

Using a web browser, navigate to http://www.cs.washington.edu/lab/facilities/hadoop.html and follow the instructions on that page to create your user account.

Then open a UNIX terminal window and use SSH to log in to the cloud gateway server hadoop.cs.washington.edu, entering the username you chose on that page.

Your password is: pwd4cloud

You will be prompted to choose a new password; select one and enter it. The system will log you out automatically as soon as the new password is set.

There is a second server behind the gateway on which you must also reset your password. This machine still has the default password "pwd4cloud" associated with your account. To set this password:

  1. ssh in to hadoop.cs.washington.edu again, using your username and the new password you just set.
  2. From there, ssh into 10.1.133.1. We will refer to this machine as the "submission node."
  3. Your password here is pwd4cloud. Change it to one of your liking (for simplicity, the same one as on hadoop.cs.washington.edu); you will be logged out as soon as it is set.
  4. Log out of hadoop.cs.washington.edu by typing "exit" and hitting Enter.
Step 2: Log in to the submission node

Log in to username@hadoop.cs.washington.edu. Use the password you set earlier. From there, log in to the submission node: username@10.1.133.1. Use the (second) password you set earlier.
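
In a terminal, the two hops look like this (replace username with your own login; this is simply the sequence described above):

ssh username@hadoop.cs.washington.edu
ssh 10.1.133.1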

You are now logged in to a machine where you can run Hadoop. Hadoop is installed in:

/hadoop/hadoop-0.18.2-dev

We will refer to this directory as $HADOOP, below.
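
If you would like the commands below to work verbatim, you can set $HADOOP as a shell variable on the submission node (an optional convenience, not a required part of the setup):

HADOOP=/hadoop/hadoop-0.18.2-dev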

Step 3: Test your connection to Hadoop itself by browsing the DFS

After you have logged in to the submission node, type the following command:

$HADOOP/bin/hadoop dfs -ls /

It should return a listing similar to this:

Found 3 items
drwxr-xr-x - hadoop supergroup 0 2008-10-31 13:56 /shared
drwxrwxrwx - hadoop supergroup 0 2008-10-30 23:49 /tmp
drwxr-xr-x - hadoop supergroup 0 2008-10-30 22:23 /user
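
You can also list your own home directory on the DFS, which lives under /user followed by your username (the path below uses username as a placeholder; if you have not uploaded anything yet, the listing may be empty or the directory may not exist yet):

$HADOOP/bin/hadoop dfs -ls /user/username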

2 Uploading Data to the Cluster

To copy a directory from your local machine to the cluster, first zip the directory (you can also use WinZip):

zip -r mydir.zip mydir

Afterwards, copy the file by running

scp mydir.zip username@hadoop.cs.washington.edu:

(don't forget the ":" at the end)
This will copy the file to the gateway machine.

Now ssh to hadoop.cs.washington.edu and repeat this process to copy the file to 10.1.133.1:

scp mydir.zip 10.1.133.1:

Since your username is the same on both machines, you do not need to specify it again here.

Now log in to the submission node (ssh 10.1.133.1) and unzip the file you just copied:

unzip mydir.zip

This should recreate your mydir/ directory, with all of its files, on the submission node.

We must now copy the data files into the Distributed File System (HDFS) itself.

In your terminal, change into the mydir/ directory you just unzipped. Then execute:
$HADOOP/bin/hadoop dfs -copyFromLocal excite.log.bz2 .

This copies the local excite.log.bz2 file (the first argument) into the target directory (the second argument, here ".") on the DFS. Relative paths on the DFS are resolved against /user/$USERNAME/ unless you supply an absolute path.
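
To check that the upload worked, or to copy an entire directory to the DFS in one step, something along these lines should also work (the directory name is just the example from above):

$HADOOP/bin/hadoop dfs -ls
$HADOOP/bin/hadoop dfs -copyFromLocal mydir mydir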

3 Running Hadoop and PIG programs on the cluster

First, set your $HADOOPSITEPATH variable. It must point to the directory which contains the hadoop-site.xml configuration file. Set it by running

HADOOPSITEPATH=/hadoop/hadoop-0.18.2-dev/conf

Next, make sure you have Java on your path by running

PATH=/opt/ibm/java-i386-60/bin:$PATH
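
If you plan to submit several jobs in one session, it may be convenient to set both variables together; exporting them (or appending these lines to your shell startup file) is an optional convenience, not a required step:

export HADOOPSITEPATH=/hadoop/hadoop-0.18.2-dev/conf
export PATH=/opt/ibm/java-i386-60/bin:$PATH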

If you have copied over your PIG tutorial examples, you can run them by executing

java -Xmx512m -cp "${HADOOPSITEPATH}:pig.jar" org.apache.pig.Main script1-hadoop.pig
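
If you have a plain Hadoop MapReduce job packaged as a jar (rather than a PIG script), the usual way to submit it is with the hadoop jar command; the jar name, main class, and paths below are hypothetical placeholders, not files provided on the cluster:

$HADOOP/bin/hadoop jar wordcount.jar org.myorg.WordCount input-dir output-dir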

After your script completes, copy the results from HDFS back to the submission node's local file system by running

$HADOOP/bin/hadoop dfs -copyToLocal script1-hadoop-results/part-00000 .

or copy the entire folder

$HADOOP/bin/hadoop dfs -copyToLocal "script1-hadoop-results/*" .

Don't forget the period at the end.

Note that there might be multiple part-files in the output folder. You will have to cat them together to get the final result.
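
For example, assuming you copied the whole script1-hadoop-results folder to your current directory as shown above, you could merge the pieces locally with:

cat part-* > results.txt

Alternatively, you can concatenate them straight out of HDFS by running $HADOOP/bin/hadoop dfs -cat "script1-hadoop-results/part-*" > results.txt.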

To copy data from the submission node back to hadoop.cs.washington.edu, you will have to use the IP address of that machine, e.g.

scp results.txt 10.1.143.245:

To copy it from hadoop.cs.washington.edu to another department machine, say attu, you again have to use the IP address (use ping hostname to find it). For example, to copy to attu, type

scp results.txt username@128.208.1.138:

Important: If you want to re-run your job, you must delete the output directory first, or else your program will exit with an error message. To delete directories in HDFS:
$HADOOP/bin/hadoop dfs -rmr dirname
e.g.,
$HADOOP/bin/hadoop dfs -rmr script1-hadoop-results

Tip: Only a few commands are available on the submission node. If you would like to edit a file, use the nano text editor.

4 Viewing System State in the Browser

The following steps are necessary for viewing more detailed information about your submitted tasks (such as the number of Maps and Reduces), as well as the state of the DFS as a whole. You need this information for Problem 1 and Problem 3.1.

Step 1: Start SOCKS Proxy

Our cluster, in the 10.1.133.X network space, is not directly accessible from outside; it must be reached through a SOCKS proxy tunneled over the gateway node, hadoop.cs.washington.edu. You must now set up this proxy connection.

If you are using OS X or Linux, or have Cygwin installed on Windows, open a terminal and type:

ssh -D 2600 username@hadoop.cs.washington.edu

Replace username with your login name. When prompted for your password, enter it. This opens an ssh session to the gateway node. You will not use this session directly -- just minimize the window. (It forwards traffic over local port 2600 in the background.)

If you are a Windows user and have PuTTY installed:

  1. Start PuTTY and, in the "PuTTY Configuration" window, go to Connection -- SSH -- Tunnels.
  2. Type "2600" in the "Source Port" box, click "Dynamic," then click "Add." An entry for "D 2600" should appear in the Forwarded Ports panel.
  3. Go back to the Session category, type hadoop.cs.washington.edu in the Host Name box, and log in with your username and password.
  4. Once you are logged in, you do not need to do anything else with this window; just minimize it, and it will forward SOCKS traffic in the background.

If you are using Windows and do not have an ssh client, download PuTTY:

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

You need to set up a proxy ssh connection any time you need to access the web interface.
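
If you would like to verify that the tunnel is working before configuring your browser, one quick optional check -- assuming curl is installed on your local machine -- is to fetch the JobTracker page from Step 3 through the proxy:

curl --socks5 localhost:2600 http://10.1.133.3:50030/

If the tunnel is up, this prints the HTML of the Hadoop JobTracker status page.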

Step 2: Configure FoxyProxy so that you can see the system state

The Hadoop system exposes information about its health, current load, etc., on two web services hosted behind the firewall. Through this interface you can view this information, browse the DFS, and read the error logs from services and jobs.

You must use Firefox to access these sites.

Download the FoxyProxy Firefox extension at: http://foxyproxy.mozdev.org/downloads.html

Install the extension and restart Firefox. If it prompts you to configure FoxyProxy, click "Yes." If not, go to Tools -- FoxyProxy -- Options.

Set the "Mode" to "Use proxies based on their pre-defined patterns and priorities" In the "Proxies" tab, click "Add New Proxy"

In the "Global Settings" tab of the top-level FoxyProxy Options window, select "Use SOCKS proxy for DNS lookups".

Click "OK" to exit all the options.

You will now be able to surf the web regularly, while still redirecting the appropriate traffic through the SOCKS proxy to access cluster information.

Step 3: Test FoxyProxy configuration

Visit the URL http://10.1.133.3:50030/. It displays information about each job that is running on the cluster: the number of map tasks, the number of reduce tasks, etc.

In case you are interested (not necessary for the project assignment), you can also visit the URL http://10.1.133.3:50070/ to get more information about the DFS: the number of cluster nodes, their capacity, whether they are alive, etc.

This page shows information about the state of the DFS as a whole; if it loads, the "10.1.133..." rule was set up correctly. Click on one of the "XenHost-???" links in the middle of the page. If it takes you to a view of the DFS, you have set up the "xenhosts" patterns correctly. Hurray!

Acknowledgements: Thanks to Aaron Kimball, who wrote most of the instructions in this tutorial.