Project 4: Hadoop and PIG

ESTIMATED TIME: 12 hours.

HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

PIG: Pig Latin is a declarative, SQL-style language for ad-hoc analysis of extremely large datasets. The motivation for Pig Latin is the fact that many people who analyze extremely large datasets are entrenched programmers who are fluent in MapReduce. The MapReduce paradigm, however, is low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin is implemented in PIG, a system which compiles Pig Latin into physical plans that are then executed over Hadoop.

This assignment consists of three parts:

  1. Hadoop on local machine. First, you set up Hadoop on your local machine, and build several Hadoop programs which you run on a small sample dataset locally.
  2. PIG on local machine. Next, you set up PIG locally, write several simple PIG scripts and test them on the same dataset.
  3. Running jobs on the cluster. Then, you set up an account on our department's IBM/Google cluster, and run the programs that you created in the first two parts of this project on a large dataset, namely the entire Wikipedia.

To get more familiar with MapReduce and PIG, we recommend that you first skim through the following two research papers: "MapReduce: Simplified Data Processing on Large Clusters" (Dean and Ghemawat, OSDI 2004) and "Pig Latin: A Not-So-Foreign Language for Data Processing" (Olston et al., SIGMOD 2008).

1. Hadoop on local machine

1.0 Requirements

1.1 Setting up Hadoop

Open a UNIX terminal window and download Hadoop from http://hadoop.apache.org/core/. Make sure to get version 0.18.1. You can directly obtain the release archive by running

wget http://mirror.its.uidaho.edu/pub/apache/hadoop/core/hadoop-0.18.1/hadoop-0.18.1.tar.gz

Unzip this someplace on your computer (e.g. /tmp/USERNAME/hadoop). Remember where you put it. We'll call this directory $HADOOP.

Even if you do not run Hadoop locally (you should!), you will need a copy of the Hadoop .jar so that you can compile against it.

1.2 Running a simple Hadoop example

Inputs and Outputs

The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
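To make the Writable contract concrete, here is a minimal standalone sketch that mimics it using only java.io. A real Hadoop key class would implement org.apache.hadoop.io.WritableComparable instead; the class name IntPairWritable is made up for illustration.

```java
import java.io.*;

// Standalone analogue of Hadoop's Writable/WritableComparable contract.
// write() serializes the fields, readFields() deserializes them, and
// compareTo() is what lets the framework sort keys.
public class IntPairWritable implements Comparable<IntPairWritable> {
    private int first;
    private int second;

    public IntPairWritable() {}                       // framework needs a no-arg constructor
    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // corresponds to Writable.write(DataOutput)
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // corresponds to Writable.readFields(DataInput)
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // WritableComparable adds ordering so the framework can sort by key
    public int compareTo(IntPairWritable o) {
        if (first != o.first) return Integer.compare(first, o.first);
        return Integer.compare(second, o.second);
    }

    public int getFirst()  { return first; }
    public int getSecond() { return second; }

    public static void main(String[] args) throws IOException {
        // round-trip: serialize to a byte buffer, then read back
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new IntPairWritable(42, 7).write(new DataOutputStream(buf));

        IntPairWritable back = new IntPairWritable();
        back.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.getFirst() + "," + back.getSecond()); // prints 42,7
    }
}
```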

Input and Output types of a Map/Reduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Counting Words with Hadoop

A straightforward example of how to use Hadoop is to count occurrences of words in files.

Suppose we had two files named A.txt and B.txt. Their contents are below:

A.txt
This is the A file, it has words in it
B.txt
Welcome to the B file, it has words too

The algorithm we use to count the words must fit the MapReduce model. We will implement the following pseudo-code algorithm:

Mapper: takes as input a line from a document

foreach word w in line:
   emit(word, 1)

Reducer: takes as input a key (word) and a set of values (all of which will be "1")

sum = 0
foreach v in values:
   sum = sum + v
emit(word, sum)

The mappers are given all of the lines from each of the files, one line at a time. We break them apart into words and emit (word, 1) pairs -- indicating that at that instant, we saw a given word once. The reducer then collects all of the "1"s arranged by common words, and sums them up into a final count for the word.

So the mapper outputs will be:

Mapper for A.txt:

< This, 1 >
< is, 1 >
< the, 1 >
< A, 1 >
< file, 1 >
< it, 1 >
< has, 1 >
< words, 1 >
< in, 1 >
< it, 1 >

Mapper for B.txt:

< Welcome, 1 >
< to, 1 >
< the, 1 >
< B, 1 >
< file, 1 >
< it, 1 >
< has, 1 >
< words, 1 >
< too, 1 >

Each word gets its own call to reduce(). The reducer for the word "it" will see key="it"; values=iterator[1, 1, 1] as its input, and will emit as output:

< it, 3 >

as we expect.
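The whole dataflow above can be simulated in a few lines of plain Java, with a TreeMap standing in for the shuffle/sort phase. No Hadoop is involved, and the class and method names are made up for illustration:

```java
import java.util.*;

// Plain-Java simulation of the word-count dataflow: the inner loop plays
// the role of map() emitting (word, 1) pairs, the grouped TreeMap plays
// the role of the shuffle/sort grouping values by key, and the summing
// loop plays the role of reduce().
public class WordCountSim {
    public static Map<String, Integer> countWords(List<String> lines) {
        // "shuffle": collect the emitted 1's under their word
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                      // one map() call per line
            for (String word : line.split("\\s+")) {     // punctuation stays attached,
                grouped.computeIfAbsent(word,            // just as with StringTokenizer
                    k -> new ArrayList<>()).add(1);
            }
        }
        // "reduce": sum the values for each key
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "This is the A file, it has words in it",
            "Welcome to the B file, it has words too");
        System.out.println(countWords(lines)); // "it" -> 3, "the" -> 2, ...
    }
}
```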

Running the Word Count Example

Given an input text, WordCount uses Hadoop to produce a summary of the number of occurrences of each word in a set of documents.

The Hadoop code for the word counter is actually pretty short, and it is presented in full below.

Note that the reduce operation (addition) is associative and commutative, so the reducer can be composed with itself with no adverse effects. To lower the amount of data transferred from mapper to reducer nodes, the Reduce class is also used as the combiner for this task: all the 1's for a word on a single machine are summed into a subcount, and the subcount is sent to the reducer instead of the list of 1's it represents.

Step 1: Create a new project
Step 2: Create the Mapper Class

The listings below put all the classes in a package named 'wordCount'. To add the package, in the Project Explorer, right-click on your project name and then click New * Package. Give this package a name of your choosing (e.g., 'wordCount').

Now in the Project Explorer, right-click on your package, then click New * Class. Give this class a name of your choosing (e.g., WordCountMapper).

Word Count Map:

A Java Mapper class is defined in terms of its input and intermediate <key, value> pairs. To declare one, subclass from MapReduceBase and implement the Mapper interface. The Mapper interface provides a single method: public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter). The map function takes four parameters which in this example correspond to:

  1. LongWritable key - the byte-offset of the current line in the file
  2. Text value - the line from the file
  3. OutputCollector output - this provides the .collect() method to output a <key, value> pair
  4. Reporter reporter - allows us to retrieve some information about the job (like the current filename)

The Hadoop system divides the (large) input data set into logical "records" and then calls map() once for each record. How much data constitutes a record depends on the input data type; for text files, a record is a single line of text. The main method is responsible for setting the output key and value types.

Since in this example we want to output <word, 1> pairs, the output key type will be Text (a basic string wrapper, with UTF8 support), and the output value type will be IntWritable (a serializable integer class). (These are wrappers around String and Integer designed for compatibility with Hadoop.)

For the word counting problem, the map code takes in a line of text and for each word in the line outputs a string/integer key/value pair: <word, 1>.

The Map code below accomplishes that by...

  1. Parsing each word out of value. For the parsing, the code delegates to a utility StringTokenizer object that implements hasMoreTokens() and nextToken() to iterate through the tokens.
  2. Calling output.collect(word, value) to output a <key, value> pair for each word.

Listing 1: The word counter Mapper class:

package wordCount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
  implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while(itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
    }
}

When run on many machines, each mapper gets part of the input -- so for example with 100 gigabytes of data on 200 mappers, each mapper would get roughly its own 500 megabytes of data to go through. On a single mapper, map() is called going through the data in its natural order, from start to finish.

The Map phase outputs <key, value> pairs, but what data makes up the key and value is totally up to the Mapper code. In this case, the Mapper uses each word as a key, so the reduction below ends up with pairs grouped by word. We could instead have chosen to use the word length as the key, in which case the data in the reduce phase would have been grouped by the lengths of the words being counted.

In fact, the map() code is not required to call output.collect() at all. It may have its own logic to prune out data simply by omitting the collect call. Pruning things in the Mapper is efficient, since it is highly parallel and already has the data in memory. By shrinking its output, we shrink the expense of organizing and moving the data in preparation for the Reduce phase.
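As a sketch of pruning in the mapper, the following plain-Java simulation emits (word, 1) pairs only for words outside a stopword list; skipping the emit call is all it takes to shrink the mapper's output. The stopword list and class name are invented for illustration:

```java
import java.util.*;

// Simulates a mapper that prunes its output: for each word it either
// "collects" a (word, 1) pair or silently drops it. Dropped words never
// need to be shuffled to a reducer.
public class PruningMapperSim {
    // hypothetical stopword list, for illustration only
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "is", "to", "in", "it", "has"));

    public static List<String[]> map(String line) {
        List<String[]> emitted = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(word)) continue; // prune: no collect() call
            emitted.add(new String[] { word, "1" });
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 10 input words, but only 4 pairs survive the pruning
        List<String[]> out = map("This is the A file it has words in it");
        System.out.println(out.size() + " pairs emitted");
    }
}
```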

Step 3: Create the Reducer Class

In the Package Explorer, perform the same process as before to add a new class. This time, add a class named "WordCountReducer".

Defining a Reducer is just as easy. Subclass MapReduceBase and implement the Reducer interface: public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter). The reduce() method is called once for each key; the values parameter contains all of the values for that key. The Reduce code looks at all the values and then outputs a single "summary" value. Given all the values for the key, the Reduce code typically iterates over the values and either concatenates them in some way to make a large summary object, or combines and reduces them in some way to yield a short summary value.

The reduce() method produces its final value in the same manner as map() did, by calling output.collect(key, summary). In this way, the Reduce specifies the final output value for the (possibly new) key. It is important to note that when running over text files, the input key is the byte-offset within the file. If the key is propagated to the output, even for an identity map/reduce, the file will be filled with the offset values. Not only does this use up a lot of space, but successive operations on this file will have to eliminate them. For text files, make sure you don't output the key unless you need it (be careful with the IdentityMapper and IdentityReducer).

Word Count Reduce:

The word count Reducer takes in all the <word, 1> key/value pairs output by the Mapper for a single word. For example, if the word "foo" appears 4 times in our input corpus, the pairs look like: <foo, 1>, <foo, 1>, <foo, 1>, <foo, 1>. Given all those <key, value> pairs, the reduce outputs a single integer value. For the word counting problem, we simply add all the 1's together into a final sum and emit that: <foo, 4>.

Listing 2: The word counter Reducer class:

package wordCount;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
  implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

        int sum = 0;
        while (values.hasNext()) {
          // read the value: it may be a combiner subcount, not just a 1
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Step 4: Create the Driver Program

You now require a final class to tie it all together, which provides the main() function for the program. Using the same process as before, add a class to your program named "WordCount".

Given the Mapper and Reducer code, the short main() below starts the Map-Reduce job. The Hadoop system picks up a number of settings from the command line on its own, and the main() then specifies a few key parameters of the problem in the JobConf object, such as which Map and Reduce classes to use and the format of the input and output files. Other parameters, e.g. the number of machines to use, are optional; the system will determine good values for them if not specified.

Listing 3: The driver class:

package wordCount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(wordCount.WordCount.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // input and output directories (not files)
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        conf.setMapperClass(wordCount.WordCountMapper.class);
        conf.setReducerClass(wordCount.WordCountReducer.class);
        conf.setCombinerClass(wordCount.WordCountReducer.class);

        client.setConf(conf);
        try {
          JobClient.runJob(conf);
        } catch (Exception e) {
          e.printStackTrace();
        }
    }
}

You will note that in addition to the Mapper and Reducer, we have also set the combiner class to be WordCountReducer. Since addition (the reduction operation) is commutative and associative, we can perform a "local reduce" on the outputs produced by a single Mapper before the intermediate values are shuffled (expensive I/O) to the Reducers.

For example, if machine A emits <foo, 1>, <foo, 1> and machine B emits <foo, 1>, a Combiner can be executed on machine A, which emits <foo, 2>. This value, along with the <foo, 1> from machine B, will both go to the Reducer node -- we have saved bandwidth but preserved the computation. (This is why our reducer actually reads the value out of its input, instead of simply assuming the value is 1.)
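A quick plain-Java check of this argument: summing machine A's pre-combined subcount together with machine B's 1 gives the same total as summing all the 1's directly. The class and method names are illustrative:

```java
import java.util.*;

// Demonstrates why reusing the reducer as a combiner is safe here:
// integer addition is associative and commutative, so reducing partial
// sums gives the same answer as reducing the raw 1's.
public class CombinerCheck {
    // the same summing logic used by both reducer and combiner
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // without a combiner: all three 1's travel to the reducer
        int direct = reduce(Arrays.asList(1, 1, 1));
        // with a combiner on machine A: <foo, 2> plus machine B's <foo, 1>
        int combined = reduce(Arrays.asList(reduce(Arrays.asList(1, 1)), 1));
        System.out.println(direct + " == " + combined); // prints 3 == 3
    }
}
```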

Step 5: Run the Program Locally

We can test this program locally before running it on the cluster. This is the fastest and most convenient way to debug. You can use the debugger integrated into Eclipse, and use files on your local file system.

Before we run the task, we must provide it with some data to run on. In your project workspace (e.g., c:\path\to\workspace\projectname), create a directory named "input". (You can create a directory in the Project Explorer by right-clicking the project and selecting New * Folder, then naming it "input.") Put some text files of your choosing or creation in there. In a pinch, a copy of the source code you just wrote will work.

Now run the program in Eclipse. Right click the driver class in the Project Explorer, click Run As... and then Java Application.

The console should display progress information on how the process runs. When it is done, right click the project in the Project Explorer and click Refresh (or press F5). You should see a new folder named "output" with one entry named "part-00000". Open this file to see the output of the MapReduce job.

If you are going to re-run the program, you must delete the output folder (right click it, then click Delete) first, or you will receive an error.

If you receive the following error message:

08/10/02 14:56:14 INFO mapred.MapTask: io.sort.mb = 100
08/10/02 14:56:14 WARN mapred.LocalJobRunner: job_local_0001
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:371)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
        at wordCount.WordCount.main(WordCount.java:30)

... then add the following line to your main() method, somewhere before the call to runJob():

conf.set("io.sort.mb", "10");

and run the program again. Remember to remove or comment out this line before running on the cluster. An equally valid alternative is to edit the "Run Configuration" used by Eclipse: in the "VM Arguments" box, add the argument "-Xmx512m" to give the JVM more memory to use.

When run correctly, you should have an output file with lines each containing a word followed by its frequency.

1.3 TASK 1: Finding the most popular infobox classes in Wikipedia

Motivation: Semantifying Wikipedia

Much research today is concerned with building the Semantic Web, which will enable not only next-generation search (e.g., "Which U.S. presidents were born in Texas?") but also better browsing and visualization. A first step towards this goal is to semantify Wikipedia.

Currently, Wikipedia does not contain semantic annotations, but Wikipedians have started to create infoboxes --- short summaries of the key facts --- for many articles. For example, look at our university's Wikipedia article and see the infobox on the right: it contains information such as the founding year and the number of students. Other university articles have similar boxes, since they use the same template. For other kinds of entities, we see different templates with different attributes.

Unfortunately, infoboxes are not used consistently and your first task is to find out how frequently different infobox classes (templates) have been used.

Wikipedia Data

For our experiments we downloaded the latest dump of all articles in the English Wikipedia (7.6 million pages), from mid-October 2008. It consists of three TSV (tab-separated values) files, each of which contains a database table:

page.txt
pageId namespace title           restrictions counter isRedirect isNew random               timestamp      latest    len
------------------------------------------------------------------------------------------------------------------------
     6         0 AmericanSamoa                      0          1     1 0.012136827455833554 20070525171206 133452270  48
     8         0 AppliedEthics                      0          1     1 0.5709511071909219   20070525171209 133452279  49
    10         0 AccessibleComputing                0          1     1 0.12219194346107543  20070525171212 133452289  57
Most fields in this table will not be relevant to us, but we would like to point out title, the (unique) title of each page, and latest, which is a foreign key into table text. Note that anybody can edit a Wikipedia article and Wikipedia keeps track of all previous revisions. Table text contains the source for each revision, and the field latest in table page points to the latest revision of a particular article.
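In your mapper, value.toString() will hand you one such tab-separated line. A minimal sketch of splitting it into fields follows; the class name is made up, and the column indices assume the layout shown above:

```java
// Sketch of parsing one tab-separated line of page.txt. The column order
// is taken from the table above: pageId, namespace, title, restrictions,
// counter, isRedirect, isNew, random, timestamp, latest, len.
public class PageLineParser {
    public static String[] parse(String line) {
        // the -1 limit keeps empty fields (e.g. an empty restrictions column)
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String line = "6\t0\tAmericanSamoa\t\t0\t1\t1\t0.0121"
                    + "\t20070525171206\t133452270\t48";
        String[] f = parse(line);
        String title  = f[2];   // page title
        String latest = f[9];   // foreign key into text.txt
        System.out.println(title + " -> " + latest); // prints AmericanSamoa -> 133452270
    }
}
```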

revision.txt
revisionId pageId    textId comment               userId user  timestamp      minorEdit deleted len  parentId
-------------------------------------------------------------------------------------------------------------
133452270       6 133452270 Revert edit(s) by ... 241822 Gurch 20070525171206         1       0 NULL     NULL
133452279       8 133452279 Revert edit(s) by ... 241822 Gurch 20070525171209         1       0 NULL     NULL
133452289      10 133452289 Revert edit(s) by ... 241822 Gurch 20070525171212         1       0 NULL     NULL
We will not use table revision.txt for our project.

text.txt
textId     text
---------------------------------------------------------------------
133452270  #REDIRECT [[American Samoa]]{{R from CamelCase}}
133452279  #REDIRECT [[Applied ethics]] {{R from CamelCase}}
133452289  #REDIRECT [[Computer accessibility]] {{R from CamelCase}}
We will assume that revisionId and textId are equal. We will also be using text, which contains the source of a page in Wiki markup language.

Combined, the three tables consume about 20 GB of space. You will later run your programs on the complete tables on the IBM/Google cluster. The cluster is -- of course -- capable of running jobs several orders of magnitude larger than this, but in our class more than 50 students are competing for resources and there is another class using the same cluster; so we would like to keep the workload low.

When you write your Hadoop programs, you should always test locally on a smaller dataset. We therefore created smaller versions of the above tables containing only the top 1,000 and top 10,000 records.

You can download these here:

Finding the most popular infobox classes

Your task is to generate a list of infobox classes with counts of how frequently they are used in Wikipedia. The list you generate should be sorted in decreasing order of frequency. An example list (made up, of course) is

45345 Infobox Planet in the Universe
12242 Infobox Bankrupt Hedge Fund Manager
934 Infobox Ugly Looking Person

To help you get started, we are providing a few starter classes. Please download these here and fill in the empty methods.

HINT: Hadoop always sorts the output by key.

HINT: You can run multiple map/reduce jobs in a sequence.

HINT: Look at Hadoop's javadocs and at the examples in the Hadoop package you downloaded.
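As a starting point for extracting template names from an article's wiki markup, here is a hedged regex sketch. The class name and pattern are ours, not part of the starter code; real Wikipedia markup has nested templates and inconsistent capitalization, so expect to refine this:

```java
import java.util.regex.*;

// Pulls the first infobox template name out of wiki markup. Infoboxes
// appear as "{{Infobox Some Name | field = ... }}", so we capture from
// "Infobox" up to the first '|', '}', or newline.
public class InfoboxExtractor {
    private static final Pattern INFOBOX =
        Pattern.compile("\\{\\{\\s*(Infobox[^|}\\n]*)", Pattern.CASE_INSENSITIVE);

    // returns the template name, or null if the page has no infobox
    public static String extract(String wikiText) {
        Matcher m = INFOBOX.matcher(wikiText);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String text = "{{Infobox University\n|name = ...\n}} article body";
        System.out.println(extract(text)); // prints Infobox University
    }
}
```

In your mapper you would emit (templateName, 1) pairs and sum them in a reducer, then sort by count in a second job (the captured name keeps the article's original capitalization, so you may want to normalize case before emitting).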

1.4 TASK 2: Counting number of categories per article

In your second task, you should output the number of categories that each article is assigned to. Your output should include the article title and the category count; it does not need to be sorted. An example output is

Ruschein 3
Men_Against_the_Sea 4

Again, we are providing some starter code.
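Category links appear in the wiki markup as [[Category:...]]. A hedged sketch of counting them follows; the class name and regex are ours, not part of the starter code, and they ignore some edge cases (e.g. category links inside comments):

```java
import java.util.regex.*;

// Counts occurrences of "[[Category:" in an article's wiki markup.
// Matching only the opening of the link also handles sort keys such as
// [[Category:Fiction|*]].
public class CategoryCounter {
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[\\s*Category\\s*:", Pattern.CASE_INSENSITIVE);

    public static int count(String wikiText) {
        Matcher m = CATEGORY.matcher(wikiText);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String text = "...[[Category:Islands]] text [[Category:Fiction|*]]...";
        System.out.println(count(text)); // prints 2
    }
}
```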

2. PIG on local machine

2.0 Requirements

2.1 Setting up PIG

Open a UNIX terminal window and download the latest version of PIG using subversion.

svn co http://svn.apache.org/repos/asf/incubator/pig/trunk

A new folder named trunk is created in your current directory. Relevant to us are file trunk/pig.jar and folder trunk/tutorial.

2.2 Running a simple PIG example

To get started with PIG, we will walk through the examples in the PIG tutorial. The examples process a small set of query logs which are contained in trunk/tutorial/data.

Before we can run PIG, we need to compile a few classes that are called in our example. To compile, navigate to trunk/tutorial and run ant from the command line. Ant will initiate the build using the settings of build.xml located in this directory, and will write the output to trunk/tutorial/build. The output contains a Jar file named pigtutorial.jar, which we will later need to put on the class path when we run the tutorial scripts.

The following table lists all files we need:

File                                       Description
---------------------------------------------------------------------------------------------------
trunk/pig.jar                              PIG Jar file
trunk/tutorial/build/pigtutorial.jar       User-defined functions (UDFs) and Java classes
trunk/tutorial/scripts/script1-local.pig   PIG Script 1, Query Phrase Popularity (local mode)
trunk/tutorial/scripts/script1-hadoop.pig  PIG Script 1, Query Phrase Popularity (Hadoop cluster)
trunk/tutorial/scripts/script2-local.pig   PIG Script 2, Temporal Query Phrase Popularity (local mode)
trunk/tutorial/scripts/script2-hadoop.pig  PIG Script 2, Temporal Query Phrase Popularity (Hadoop cluster)
trunk/tutorial/data/excite-small.log       Log file, Excite search engine (local mode)
trunk/tutorial/data/excite.log.bz2         Log file, Excite search engine (Hadoop cluster)

The user-defined functions (UDFs) contained in pigtutorial.jar and called from our scripts are described here:

UDF             Description
--------------------------------------------------------------------------
ExtractHour     Extracts the hour from the record.
NGramGenerator  Composes n-grams from the set of words.
NonURLDetector  Removes the record if the query field is empty or a URL.
ScoreGenerator  Calculates a "popularity" score for the n-gram.
ToLower         Changes the query field to lowercase.
TutorialUtil    Divides the query string into a set of words.

Example 1: Query Phrase Popularity

We begin by taking a look at script1-local.pig.

REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
STORE ordered_uniq_frequency INTO 'script1-local-results.txt' USING PigStorage();

Let's now walk through this script line by line.

3. Running jobs on the cluster

3.1 Getting access to the IBM/Google cluster

Step 1: Acquire credentials from Cloud Administration to access the cluster

Using a web browser, navigate to http://www.cs.washington.edu/lab/facilities/hadoop.html and follow the instructions on that page to create your user account.

Then open a UNIX terminal window, and using SSH log in to the cloud gateway server hadoop.cs.washington.edu. Enter your username as you chose it on this site.

Your password is: pwd4cloud

You will be prompted to choose a new password; select one and enter it. It will automatically log you out immediately after you set it.

There is a second server behind the gateway which you must also reset your password on. This machine will still have the default password "pwd4cloud" attached to your name. To set this password:

  1. ssh in to hadoop.cs.washington.edu again, using your username and the new password you set yourself.
  2. ssh into 10.1.133.1. We will refer to this as the "submission node."
  3. Your password here is pwd4cloud. Change this password to one of your liking (for simplicity, the same one as on hadoop.cs.washington.edu); it will log you out as soon as it is set.
  4. Log out of hadoop.cs.washington.edu by typing "exit" and hitting enter.

Step 2: Start SOCKS Proxy

Our cluster, in the 10.1.133.X network space, is not directly accessible; we must access it through a SOCKS proxy connected to the gateway node, hadoop.cs.washington.edu. You must now configure a proxy connection to allow you to make this connection.

If you are using OS X or Linux (or have cygwin installed on Windows), open a terminal and type:

ssh -D 2600 username@hadoop.cs.washington.edu

replacing username with your login name. When prompted for your password, enter it. You will see an ssh session to this node. You will not use the ssh session directly -- just minimize the window. (It is forwarding traffic over port 2600 in the background.)

If you are a Windows user and have PuTTY installed, start PuTTY, and in the "PuTTY Configuration" window, go to Connection -- SSH -- Tunnels. Type "2600" in the "Source Port" box, click "Dynamic," then click "Add." An entry for "D 2600" should appear in the Forwarded Ports panel. Go back to the Session category, and type hadoop.cs.washington.edu in the Host Name box. Log in with your username and password. Once you are logged in, you do not need to do anything else with this window; just minimize it, and it will forward SOCKS traffic in the background.

If you are using Windows and do not have an ssh client, download PuTTY. It is free. Download putty.exe from this page:

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

You need to set up a proxy ssh connection any time you need to access the web interface (see steps 3 and 4), or if you are trying to connect directly to the cluster from your own machine.

Step 3: Configure FoxyProxy so that you can see the system state

The Hadoop system will expose information about its health, current load, etc, on two web services that are hosted behind the firewall. You can view this information, as well as browse the DFS and read the error log files from services and jobs via this interface.

You must use Firefox to access these sites.

Download the FoxyProxy Firefox extension at: http://foxyproxy.mozdev.org/downloads.html

Install the extension and restart Firefox. If it prompts you to configure FoxyProxy, click "yes." If not, go to Tools * FoxyProxy * Options.

Set the "Mode" to "Use proxies based on their pre-defined patterns and priorities." In the "Proxies" tab, click "Add New Proxy" and point it at the SOCKS proxy you started in Step 2 (host localhost, port 2600, SOCKS v5). Then add URL patterns for the cluster addresses, so that requests to 10.1.133.* and the XenHost-* machine names are sent through the proxy.

In the "Global Settings" tab of the top-level FoxyProxy Options window, select "Use SOCKS proxy for DNS lookups".

Click "OK" to exit all the options.

You will now be able to surf the web regularly, while still redirecting the appropriate traffic through the SOCKS proxy to access cluster information.

Step 4: Test FoxyProxy configuration

Visit the URL http://10.1.133.0:50070/

It should show you information about the state of the DFS as a whole. This lets you know that the "10.1.133..." rule was set up correctly. Click on one of the "XenHost-???" links in the middle of the page. If it takes you to a view of the DFS, you have set up the "xenhosts" patterns correctly. Hurray!

Step 5: Log in to the submission node

If you are using a lab machine, then you do not have the permissions to edit the hosts file, which is necessary for a direct connection to the Hadoop service. Therefore, you will need to perform all your actual Hadoop instructions on the submission node, which is configured to talk directly to the Hadoop system.

Log in to username@hadoop.cs.washington.edu. Use the password you set earlier. From there, log in to the submission node: username@10.1.133.1. Use the (second) password you set earlier.

You are now logged in to a machine where you can run Hadoop. Hadoop is installed in:

/hadoop/hadoop-0.18.2-dev

We will refer to this directory as $HADOOP, below.

Step 6: Test your connection to Hadoop itself by browsing the DFS

Open up a terminal (Windows users: cygwin window).

Change to the $HADOOP directory where Hadoop is installed (this is /hadoop/hadoop-0.18.2-dev on the submission node).

Type the following command (assuming "$" is the prompt):

$ bin/hadoop dfs -ls /

It should return a listing similar to this:

Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-01 17:56 /tmp
drwxr-xr-x - hadoop supergroup 0 2008-10-02 05:10 /user

3.2 Uploading Data to the Cluster
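Uploading boils down to copying your input files into HDFS with the dfs shell. A typical sequence from the submission node might look like the following sketch; the paths and USERNAME are placeholders, so substitute your own account and data files:

```shell
cd /hadoop/hadoop-0.18.2-dev

# create a directory for your input data in HDFS
# (relative paths resolve under /user/USERNAME)
bin/hadoop dfs -mkdir input

# copy a local file into that directory
bin/hadoop dfs -put /tmp/USERNAME/page.txt input/

# verify that the file arrived
bin/hadoop dfs -ls input
```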

Acknowledgements: Many of the installation instructions in this tutorial were written by Aaron Kimball.