Instructions for accessing and using the Hadoop cluster

 

Welcome to the UW/IBM Hadoop cluster!

 

How to access the cluster

 

The cluster must be reached through a gateway machine (see e-mail)

 

We have created accounts on both the gateway and the cluster for each of you

The account names are:

- USERNAME (you'll get this in your email)

 

To access these accounts, ssh to GATEWAY (you'll get this in your email) and log in with the case-sensitive password:

PASSWORD (you'll get this in your email)

 

Please change your password immediately after you log in. The machine should force you to change it, but in case it doesn't, you can do so manually with the passwd command.

 

Once you are on the gateway, you can reach the cluster by ssh'ing to the cluster controller machine at CONTROLLER (see e-mail). Log in with the same account name and default password as above. After changing your password there to something more secure, you will be ready to run your first job on Hadoop!
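
For example, the whole login flow looks like this (GATEWAY, CONTROLLER, and USERNAME stand for the actual names from your e-mail):

ssh USERNAME@GATEWAY        # log in to the gateway with the e-mailed password
passwd                      # change your password on the gateway
ssh USERNAME@CONTROLLER     # hop from the gateway to the cluster controller
passwd                      # change your password on the cluster as well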

 

 

How to use Hadoop

 

Hadoop is installed in the directory /hadoop/hadoop-0.13.1. We have already started and configured the cluster for you. To run a basic Hadoop application, go to that directory and run:

 

bin/hadoop jar hadoop-0.13.1-examples.jar pi 25 25

 

This will start a job that uses the computing power of the entire cluster to calculate pi.

Hopefully the value will be close to 3.14! (If you want more accuracy at the expense of running time, increase the arguments from 25 to, say, 100; the first argument sets the number of map tasks and the second the number of samples each map computes.)
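
For example:

bin/hadoop jar hadoop-0.13.1-examples.jar pi 100 100

This runs 100 map tasks with 100 samples each, so the estimate should come out noticeably closer to 3.14159.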

 

Another simple test that showcases how Hadoop reads and writes data is Grep. Make two directories called input and output in your home directory, and put the files you wish to search in the input directory. Then go to the Hadoop directory and run:

 

bin/hadoop dfs -put $HOME/input input

 

to place your input into the Hadoop Distributed File System (DFS). Now Hadoop can access your input. You then run:

 

bin/hadoop jar hadoop-0.13.1-examples.jar grep input output <string>

 

where <string> is the phrase you wish to search for. (The Grep example treats <string> as a regular expression, so single words work as-is.)
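
For example, to search your input files for the word hadoop:

bin/hadoop jar hadoop-0.13.1-examples.jar grep input output hadoop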

 

Hadoop will then determine where <string> appears in your files and write the results to the output directory on its DFS. You can retrieve this information in human-readable form by executing:

 

bin/hadoop dfs -get output output

cat output/*
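
If you just want a quick look at the results, you can also print them straight out of the DFS without copying them back first:

bin/hadoop dfs -cat output/*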

 

Other examples that might help you write Hadoop applications are in the Java package org.apache.hadoop.examples. I strongly recommend downloading the Hadoop source code from http://lucene.apache.org/hadoop/version_control.html and looking at how these example applications were written; a rough sketch of the command-line compile-and-run cycle follows below. Another tool you can use is our Eclipse plugin, described in the next section.
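
If you write a job of your own, the cycle is roughly: compile against the Hadoop core jar, package your classes into a jar, and submit it with bin/hadoop jar. (MyJob and myjob.jar here are hypothetical names; the classpath assumes the core jar sits in the standard 0.13.1 install directory.)

mkdir classes                 # holds the compiled .class files
javac -classpath /hadoop/hadoop-0.13.1/hadoop-0.13.1-core.jar -d classes MyJob.java
jar cf myjob.jar -C classes . # package the classes into a jar
bin/hadoop jar myjob.jar MyJob input output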

 

How to set up Eclipse to facilitate writing Hadoop applications

 

We have written an Eclipse plugin that facilitates the use of Hadoop through the Eclipse IDE. To install it, first download Hadoop 0.13.1 from http://lucene.apache.org. Then copy hadoop-eclipse-plugin.jar from /projects/instr/07au/cse454/hadoop-eclipse-plugin.jar (NOT the one that comes with the distribution) into the plugins directory of your Eclipse installation. (Those of you interested in running Hadoop applications with scripts written in non-Java languages should also grab hadoop-streaming.jar from the same directory.)
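
For example, assuming ECLIPSE_HOME points at your Eclipse installation:

cp /projects/instr/07au/cse454/hadoop-eclipse-plugin.jar $ECLIPSE_HOME/plugins/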

 

With the jar in place, start Eclipse and open the MapReduce view. The plugin requires Java 1.5+ and Eclipse 3.2+. You can now create Mappers, Reducers, and Drivers from coding templates, submit the applications you write to the Hadoop cluster (by providing the cluster's network information), and track the progress of your jobs in real time. See the cheat sheets included with the plugin for more details.