ESTIMATED TIME: 12 hours.
HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
PIG: Pig Latin is a declarative, SQL-style language for ad-hoc analysis of extremely large datasets. The motivation for Pig Latin is that many of the people who analyze such datasets are entrenched procedural programmers who are comfortable with MapReduce. The MapReduce paradigm, however, is low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin is implemented in PIG, a system that compiles Pig Latin into physical plans that are then executed over Hadoop.
This assignment consists of three parts:
To get more familiar with MapReduce and PIG, we recommend that you first skim through the following two research papers:
- UNIX terminal or Cygwin if you are using Windows
- Java 1.5 or higher
Open a UNIX terminal window and download Hadoop from http://hadoop.apache.org/core/. Make sure to get version 0.18.1. You can directly obtain the release archive by running
wget http://mirror.its.uidaho.edu/pub/apache/hadoop/core/hadoop-0.18.1/hadoop-0.18.1.tar.gz
Unzip this someplace on your computer (e.g. /tmp/USERNAME/hadoop). Remember where you put it. We'll call this directory $HADOOP.
Even if you do not run Hadoop locally (you should!), you will need to have a copy of the .jar so that you can compile against it.
Counting Words with Hadoop

The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
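To make the serialization contract concrete, here is a minimal plain-Java sketch shaped like Hadoop's WritableComparable: the class serializes its fields to a DataOutput, reads them back from a DataInput, and is comparable so the framework can sort keys. The class name WordKey is ours, for illustration only; a real key would implement org.apache.hadoop.io.WritableComparable instead of plain Comparable.

```java
import java.io.*;

// Shaped like Hadoop's WritableComparable: write(), readFields(), compareTo().
public class WordKey implements Comparable<WordKey> {
    private String word = "";

    public void set(String w) { word = w; }
    public String get() { return word; }

    // Serialize this key's fields to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    // Deserialize: overwrite this object's fields from the stream.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    // Keys must sort so the framework can group values by key.
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    public static void main(String[] args) throws IOException {
        WordKey k1 = new WordKey();
        k1.set("hadoop");

        // Round-trip through a byte buffer, as the framework would.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        k1.write(new DataOutputStream(buf));
        WordKey k2 = new WordKey();
        k2.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(k2.get()); // prints "hadoop"
    }
}
```

The round trip in main() is exactly what happens when Hadoop moves keys between machines: only the bytes travel, and readFields() reconstitutes the object on the other side.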
Input and Output types of a Map/Reduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Running the Word Count Example

A straight-forward example of how to use Hadoop is to count occurrences of words in files.
Suppose we had two files named A.txt and B.txt. Their contents are below:

A.txt
This is the A file, it has words in it

B.txt
Welcome to the B file, it has words too

The algorithm we use to count the words must fit in MapReduce. We will implement the following pseudo-code algorithm:
Mapper: takes as input a line from a document

    foreach word w in line:
        emit(word, 1)

Reducer: takes as input a key (word) and a set of values (all of which will be "1")

    sum = 0
    foreach v in values:
        sum = sum + v
    emit(word, sum)

The mappers are given all of the lines from each of the files, one line at a time. We break them apart into words and emit (word, 1) pairs -- indicating that at that instant, we saw a given word once. The reducer then collects all of the "1"s arranged by common words, and sums them up into a final count for the word.
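Before writing any Hadoop code, the pseudo-code above can be sanity-checked in plain Java: simulate the shuffle by grouping the (word, 1) pairs by key, then sum each group exactly as the reducer would. This is a sketch of the algorithm only (modern Java, no Hadoop machinery; the names WordCountSketch, map, reduce, and wordCount are ours):

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: one (word, 1) pair per word, per line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // "Reduce" phase: sum the values collected for one word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Drive the two phases, grouping pairs by key in between (the "shuffle").
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> p : map(line))
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), reduce(e.getValue()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "This is the A file, it has words in it",
                "Welcome to the B file, it has words too");
        System.out.println(wordCount(lines).get("it")); // prints 3
    }
}
```

Running this on the two example lines gives "it" a count of 3, matching the reducer output shown below.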
So the mapper outputs will be:
Mapper for A.txt:
< This, 1 >
< is, 1 >
< the, 1 >
< A, 1 >
< file, 1 >
< it, 1 >
< has, 1 >
< words, 1 >
< in, 1 >
< it, 1 >

Mapper for B.txt:
< Welcome, 1 >
< to, 1 >
< the, 1 >
< B, 1 >
< file, 1 >
< it, 1 >
< has, 1 >
< words, 1 >
< too, 1 >

Each word will have its own reduce process. The reducer for the word "it" will see: key="it"; values=iterator[1, 1, 1] as its input, and will emit as output:

< it, 3 >

as we expect.
Step 1: Create a new project

Given an input text, WordCount uses Hadoop to produce a count of the occurrences of each word in the input documents.
The Hadoop code below for the word counter is actually pretty short. It is presented in this document.
Note that the reduce operation is associative and commutative: it can be composed with itself with no adverse effects. To lower the amount of data transferred from mapper to reducer nodes, the Reduce class is also used as the combiner for this task. All the 1's for a word on a single machine will be summed into a subcount; the subcount is sent to the reducer instead of the list of 1's it represents.
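The claim that the combiner preserves the result can be checked directly: summing per-machine subcounts gives the same total as summing all the raw 1's, precisely because addition is associative and commutative. A small plain-Java sketch (the class and method names are ours):

```java
import java.util.*;

public class CombinerSketch {
    // The reduce operation: sum a list of counts.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        // Machine A saw "foo" twice, machine B saw it once.
        List<Integer> machineA = Arrays.asList(1, 1);
        List<Integer> machineB = Arrays.asList(1);

        // Without a combiner: all three 1's travel to the reducer.
        int direct = sum(Arrays.asList(1, 1, 1));

        // With a combiner: each machine pre-sums locally, and only the
        // subcounts (2 and 1) travel to the reducer.
        int combined = sum(Arrays.asList(sum(machineA), sum(machineB)));

        System.out.println(direct == combined); // prints true
    }
}
```

A reduce operation that is not associative and commutative (say, subtraction) would fail this check, which is why not every Reducer can double as a combiner.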
Step 2: Create the Mapper Class
- Click File * New * Project
- Select Java Project from the list
- Give the project a name of your choosing (e.g., "WordCountExample")
- Then click Next. You should now be on the Java Settings screen. Click the Libraries tab.
- Now click Add External Jars
- In the dialog box, navigate to where you unzipped Hadoop ($HADOOP). Select hadoop-0.18.1-core.jar. Click Ok.
- Click Add External Jars again.
- Navigate to $HADOOP again (it should already be there), and then to the lib subdirectory.
- Select all of the jar files in the directory and click Ok.
- When you are ready, click Finish. It will now create the project and you can compile against Hadoop.
Step 3: Create the Reducer Class

The listings below put all the classes in a package named 'wordCount'. To add the package, in the Project Explorer, right click on your project name and then click New * Package. Give this package a name of your choosing (e.g., 'wordCount').
Now in the Project Explorer, right click on your package, then click New * Class. Give this class a name of your choosing (e.g., WordCountMapper).
Word Count Map:

A Java Mapper class is defined in terms of its input and intermediate <key, value> pairs. To declare one, subclass from MapReduceBase and implement the Mapper interface. The Mapper interface provides a single method: public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter). The map function takes four parameters, which in this example correspond to:
- LongWritable key - the byte-offset of the current line in the file
- Text value - the line from the file
- OutputCollector output - this has the .collect() method to output a <key, value> pair
- Reporter reporter - allows us to retrieve some information about the job (like the current filename)
The Hadoop system divides the (large) input data set into logical "records" and then calls map() once for each record. How much data constitutes a record depends on the input data type; for text files, a record is a single line of text. The main method is responsible for setting output key and value types.
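For text input, the (key, value) pairs handed to map() are (byte offset, line). The sketch below shows how those offsets arise, assuming one byte per character and '\n' line separators (Hadoop's TextInputFormat handles the real cases, including multi-byte characters and records that span file splits); the class and method names are ours:

```java
import java.util.*;

public class LineRecords {
    // Split text into (byteOffset, line) records, the way a text file is
    // presented to map() (assuming one byte per character here).
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            if (!line.isEmpty()) records.put(offset, line);
            offset += line.length() + 1; // +1 for the '\n' separator
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> recs = toRecords("This is line one\nand line two\n");
        System.out.println(recs); // {0=This is line one, 17=and line two}
    }
}
```

The second record starts at offset 17 because the first line is 16 bytes plus its newline; these offsets are exactly the LongWritable keys the word-count mapper receives and ignores.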
Since in this example we want to output <word, 1> pairs, the output key type will be Text (a basic string wrapper, with UTF8 support), and the output value type will be IntWritable (a serializable integer class). (These are wrappers around String and Integer designed for compatibility with Hadoop.)
For the word counting problem, the map code takes in a line of text and for each word in the line outputs a string/integer key/value pair: <word, 1>.
The Map code below accomplishes that by...
- Parsing each word out of value. For the parsing, the code delegates to a utility StringTokenizer object that implements hasMoreTokens() and nextToken() to iterate through the tokens.
- Calling output.collect(word, value) to output a <key, value> pair for each word.
When run on many machines, each mapper gets part of the input -- so, for example, with 100 Gigabytes of data on 200 mappers, each mapper would get roughly its own 500 Megabytes of data to go through. On a single mapper, map() is called going through the data in its natural order, from start to finish. The Map phase outputs <key, value> pairs, but what data makes up the key and value is totally up to the Mapper code. In this case, the Mapper uses each word as a key, so the reduction below ends up with pairs grouped by word. We could instead have chosen to use the word-length as the key, in which case the data in the reduce phase would have been grouped by the lengths of the words being counted. In fact, the map() code is not required to call output.collect() at all. It may have its own logic to prune out data simply by omitting collect. Pruning things in the Mapper is efficient, since it is highly parallel, and already has the data in memory. By shrinking its output, we shrink the expense of organizing and moving the data in preparation for the Reduce phase.

Listing 1: The word counter Mapper class:

package wordCount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
Step 4: Create the Driver Program

In the Package Explorer, perform the same process as before to add a new class. This time, add a class named "WordCountReducer".
Defining a Reducer is just as easy. Subclass MapReduceBase and implement the Reducer interface: public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter). The reduce() method is called once for each key; the values parameter contains all of the values for that key. The Reduce code looks at all the values and then outputs a single "summary" value. Given all the values for the key, the Reduce code typically iterates over all the values and either concatenates them in some way to make a large summary object, or combines and reduces them in some way to yield a short summary value.

The reduce() method produces its final value in the same manner as map() did, by calling output.collect(key, summary). In this way, the Reduce specifies the final output value for the (possibly new) key. It is important to note that when running over text files, the input key is the byte-offset within the file. If the key is propagated to the output, even for an identity map/reduce, the file will be filled with the offset values. Not only does this use up a lot of space, but successive operations on this file will have to eliminate them. For text files, make sure you don't output the key unless you need it (be careful with the IdentityMapper and IdentityReducer).
Word Count Reduce:

The word count Reducer takes in all the <word, 1> key/value pairs output by the Mapper for a single word. For example, if the word "foo" appears 4 times in our input corpus, the pairs look like: <foo, 1>, <foo, 1>, <foo, 1>, <foo, 1>. Given all those <key, value> pairs, the reduce outputs a single integer value. For the word counting problem, we simply add all the 1's together into a final sum and emit that: <foo, 4>.
Listing 2: The word counter Reducer class:
package wordCount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up each count for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}
Step 5: Run the Program Locally

You now need a final class to tie it all together, one that provides the main() function for the program. Using the same process as before, add a class to your program named "WordCount".
Given the Mapper and Reducer code, the short main() below starts the Map-Reduction running. The Hadoop system picks up a bunch of values from the command line on its own, and the main() also specifies a few key parameters of the problem in the JobConf object, such as which Map and Reduce classes to use and the format of the input and output files. Other parameters, e.g. the number of machines to use, are optional; the system will determine good values for them if not specified.
You will note that in addition to the Mapper and Reducer, we have also set the combiner class to be WordCountReducer. Since addition (the reduction operation) is commutative and associative, we can perform a "local reduce" on the outputs produced by a single Mapper, before the intermediate values are shuffled (expensive I/O) to the Reducers. E.g., if machine A emits <foo, 1>, <foo, 1> and machine B emits <foo, 1>, a Combiner can be executed on machine A, which emits <foo, 2>. This value, along with the <foo, 1> from machine B, will both go to the Reducer node -- we have now saved bandwidth but preserved the computation. (This is why our reducer actually reads the value out of its input, instead of simply assuming the value is 1.)

Listing 3: The driver class:

package wordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
public class WordCount {

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(wordCount.WordCount.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // input and output directories (not files)
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        conf.setMapperClass(wordCount.WordCountMapper.class);
        conf.setReducerClass(wordCount.WordCountReducer.class);
        conf.setCombinerClass(wordCount.WordCountReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
We can test this program locally before running it on the cluster. This is the fastest and most convenient way to debug. You can use the debugger integrated into Eclipse, and use files on your local file system.
Before we run the task, we must provide it with some data to run on. In your project workspace (e.g., c:\path\to\workspace\projectname), create a directory named "input". (You can create a directory in the Project Explorer by right clicking the project and selecting New * Folder, then naming it "input".) Put some text files of your choosing or creation in there. In a pinch, a copy of the source code you just wrote will work.
Now run the program in Eclipse. Right click the driver class in the Project Explorer, click Run As... and then Java Application.
The console should display progress information on how the process runs. When it is done, right click the project in the Project Explorer and click Refresh (or press F5). You should see a new folder named "output" with one entry named "part-00000". Open this file to see the output of the MapReduce job.
If you are going to re-run the program, you must delete the output folder (right click it, then click Delete) first, or you will receive an error.
If you receive the following error message:

08/10/02 14:56:14 INFO mapred.MapTask: io.sort.mb = 100
08/10/02 14:56:14 WARN mapred.LocalJobRunner: job_local_0001
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:371)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
    at wordCount.WordCount.main(WordCount.java:30)

... then add the following line to your main() method, somewhere before the call to runJob():

conf.set("io.sort.mb", "10");

and run the program again. Remember to remove or comment this line out before running on the cluster. An equally valid alternative approach is to edit the "Run Configuration" used by Eclipse: in the "VM Arguments" box, add the argument "-Xmx512m" to give it more memory to use.

When run correctly, you should have an output file with lines each containing a word followed by its frequency.
Wikipedia Data

Much research today is concerned with building the Semantic Web, which will enable next-generation search mechanisms (e.g. "What U.S. presidents were born in Texas?"), but also better browsing and visualization. A first step towards this goal is to semantify Wikipedia.
Currently, Wikipedia does not contain semantic annotations, but Wikipedians have started to create infoboxes --- short summaries of the key facts --- for many articles. For example, look at our university's Wikipedia article and see the infobox on the right: it contains information such as the founding year and the number of students. Other university articles have similar boxes; they all use the same template. For other kinds of entities, we see different templates with different attributes.
Unfortunately, infoboxes are not used consistently and your first task is to find out how frequently different infobox classes (templates) have been used.
Finding the most popular infobox classes

For our experiments we downloaded the latest dump of all articles in the English Wikipedia (7.6 million pages), from mid-October 2008. It consists of three TSV (tab-separated values) files, each of which contains a database table:
page.txt

Most fields in this table will not be relevant to us, but we would like to point out title, the (unique) title of each page, and latest, which is a foreign key into table text. Note that anybody can edit a Wikipedia article and Wikipedia keeps track of all previous revisions. Table text contains the source for each revision, and the field latest in table page points to the latest revision of a particular article.
pageId  namespace  title                restrictions  counter  isRedirect  isNew  random                timestamp       latest     len
------------------------------------------------------------------------------------------------------------------------
6       0          AmericanSamoa                      0        1           1      0.012136827455833554  20070525171206  133452270  48
8       0          AppliedEthics                      0        1           1      0.5709511071909219    20070525171209  133452279  49
10      0          AccessibleComputing                0        1           1      0.12219194346107543   20070525171212  133452289  57
revision.txt

We will not use table revision.txt for our project.
revisionId  pageId  textId     comment                userId  user   timestamp       minorEdit  deleted  len   parentId
-------------------------------------------------------------------------------------------------------------
133452270   6       133452270  Revert edit(s) by ...  241822  Gurch  20070525171206  1          0        NULL  NULL
133452279   8       133452279  Revert edit(s) by ...  241822  Gurch  20070525171209  1          0        NULL  NULL
133452289   10      133452289  Revert edit(s) by ...  241822  Gurch  20070525171212  1          0        NULL  NULL
text.txt

We will assume that revisionId and textId are equal. We will also be using text, which contains the source of a page in Wiki markup language.
textId     text
---------------------------------------------------------------------
133452270  #REDIRECT [[American Samoa]] {{R from CamelCase}}
133452279  #REDIRECT [[Applied ethics]] {{R from CamelCase}}
133452289  #REDIRECT [[Computer accessibility]] {{R from CamelCase}}

Combined, the three tables consume about 20 GB of space. You will later run your programs on the complete tables on the IBM/Google cluster. The cluster is -- of course -- capable of running jobs several orders of magnitude larger than this, but in our class more than 50 students are competing for resources and there is another class using the same cluster; so we would like to keep the workload low.
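Since the dumps are plain TSV, each line handed to map() can be split on tab characters; for text.txt the first field is textId and the second is the wiki markup. A small sketch of that parsing step (the class and method names are ours; using a split limit of 2 is an assumption that keeps any tabs inside the markup intact):

```java
public class TsvParse {
    // Split one line of text.txt into its two fields: textId and markup.
    // The limit of 2 means only the first tab separates fields.
    static String[] parseTextRow(String line) {
        return line.split("\t", 2);
    }

    public static void main(String[] args) {
        String row = "133452270\t#REDIRECT [[American Samoa]] {{R from CamelCase}}";
        String[] fields = parseTextRow(row);
        System.out.println(fields[0]); // the textId field
    }
}
```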
When you write your Hadoop programs, you should always test locally on a smaller dataset. We therefore created smaller versions of the above tables containing only the top 1000, and top 10,000 records.
You can download these here:
- page1000.txt (0.1 MB)
- revision1000.txt (0.1 MB)
- text1000.txt (12 MB)
- page10000.txt (8 MB)
- revision10000.txt (1MB)
- text10000.txt (135 MB)
Your task is to generate a list of infobox classes with counts of how frequently they are used in Wikipedia. The list you generate should be sorted in decreasing order of frequency. An example list (made up, of course) is

45345 Infobox Planet in the Universe
12242 Infobox Bankrupt Hedge Fund Manager
934 Infobox Ugly Looking Person

To help you get started, we are already providing a few starter classes. Please download these here and fill in the empty methods.

HINT: Hadoop always sorts outputs by the key.
HINT: You can run multiple map/reduce jobs in a sequence.
HINT: Look at Hadoop's javadocs, and at the examples in the Hadoop package you downloaded.
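One way to structure the mapper for this task is to scan each article's markup for template openings of the form "{{Infobox ...}}" and emit the template name as the key. The regex below is a deliberate simplification we are assuming (real infobox markup varies in capitalization, whitespace, and nesting), meant only to show the shape of the map step; the class and method names are ours:

```java
import java.util.*;
import java.util.regex.*;

public class InfoboxSketch {
    // Matches "{{Infobox <name>" and captures the class name up to a
    // newline, pipe, or closing brace. Simplified on purpose.
    private static final Pattern INFOBOX =
            Pattern.compile("\\{\\{\\s*(Infobox[^|}\\n]*)");

    // The "map" step: emit one infobox class name per occurrence.
    static List<String> infoboxClasses(String markup) {
        List<String> names = new ArrayList<>();
        Matcher m = INFOBOX.matcher(markup);
        while (m.find()) names.add(m.group(1).trim());
        return names;
    }

    public static void main(String[] args) {
        String markup = "{{Infobox University\n|students=40000}} some text";
        System.out.println(infoboxClasses(markup)); // [Infobox University]
    }
}
```

With the class name as the map output key, the reduce step is the same summation as in word count, and a second job can re-sort the (class, count) pairs by count.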
In your second task, you should output the number of categories that each article is assigned to. Your output should include the article title and the category count; it does not need to be sorted. An example output is

Ruschein 3
Men_Against_the_Sea 4

Again, we are providing some starter code.
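For the category-count task, an article's markup contains links of the form [[Category:Name]], so the mapper can count them directly and emit (title, count). A hedged sketch of the counting step (the regex is our simplified assumption; real category links may carry a sort key after a pipe, which still counts as one category; the class and method names are ours):

```java
import java.util.regex.*;

public class CategoryCount {
    // Matches "[[Category:...]]" links; a sort key such as
    // "[[Category:Novels|Sea]]" still counts as one category.
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[\\s*Category\\s*:[^\\]]*\\]\\]");

    // Count the category links in one article's markup.
    static int categoryCount(String markup) {
        Matcher m = CATEGORY.matcher(markup);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String markup = "text [[Category:Islands]] more [[Category:Novels|Sea]]";
        System.out.println(categoryCount(markup)); // prints 2
    }
}
```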
- UNIX terminal or Cygwin if you are using Windows
- Java 1.5 or higher
- The JAVA_HOME environment variable is set to the root of your Java installation
Open a UNIX terminal window and download the latest version of PIG using subversion.
svn co http://svn.apache.org/repos/asf/incubator/pig/trunk
A new folder named trunk is created in your current directory. Relevant to us are the file trunk/pig.jar and the folder trunk/tutorial.
Example 1: Query Phrase Popularity

To get started with PIG, we will walk through the examples in the PIG tutorial. The examples process a small set of query logs which are contained in trunk/tutorial/data.
Before we can run PIG, we need to compile a few classes that are called in our example. To compile, navigate to trunk/tutorial and run ant from the command line. Ant will initiate the build using the settings of build.xml located in this directory, and will write the output to trunk/tutorial/build. The output contains a Jar file named pigtutorial.jar, which we will later need to put on the class path when we run the tutorial scripts.
The following table lists all files we need:

File                                       Description
trunk/pig.jar                              PIG Jar file
trunk/tutorial/build/pigtutorial.jar       User-defined functions (UDFs) and Java classes
trunk/tutorial/scripts/script1-local.pig   PIG Script 1, Query Phrase Popularity (local mode)
trunk/tutorial/scripts/script1-hadoop.pig  PIG Script 1, Query Phrase Popularity (Hadoop cluster)
trunk/tutorial/scripts/script2-local.pig   PIG Script 2, Temporal Query Phrase Popularity (local mode)
trunk/tutorial/scripts/script2-hadoop.pig  PIG Script 2, Temporal Query Phrase Popularity (Hadoop cluster)
trunk/tutorial/data/excite-small.log       Log file, Excite search engine (local mode)
trunk/tutorial/data/excite.log.bz2         Log file, Excite search engine (Hadoop cluster)

The user-defined functions (UDFs) which are contained in pigtutorial.jar and are called from our scripts are described here:

UDF             Description
ExtractHour     Extracts the hour from the record.
NGramGenerator  Composes n-grams from the set of words.
NonURLDetector  Removes the record if the query field is empty or a URL.
ScoreGenerator  Calculates a "popularity" score for the n-gram.
ToLower         Changes the query field to lowercase.
TutorialUtil    Divides the query string into a set of words.
We begin by taking a look at script1-local.pig.
REGISTER ./tutorial.jar;
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
ngramed2 = DISTINCT ngramed1;
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
STORE ordered_uniq_frequency INTO 'script1-local-results.txt' USING PigStorage();

Let's now walk through this script line by line.
- Register the tutorial JAR file so that the included UDFs can be called in the script.

REGISTER ./tutorial.jar;

- Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.

raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);

- Call the NonURLDetector UDF to remove records if the query field is empty or a URL.

clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

- Call the ToLower UDF to change the query field to lowercase.

clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;

- Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field.

houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;

- Call the NGramGenerator UDF to compose the n-grams of the query.

ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;

- Use the DISTINCT command to get the unique n-grams for all records.

ngramed2 = DISTINCT ngramed1;

- Use the GROUP command to group records by n-gram and hour.

hour_frequency1 = GROUP ngramed2 BY (ngram, hour);

- Use the COUNT function to get the count (occurrences) of each n-gram.

hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;

- Use the GROUP command to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.

uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;

- For each group, identify the hour in which this n-gram is used with a particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.

uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));

- Use the FOREACH-GENERATE command to assign names to the fields.

uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;

- Use the FILTER command to remove all records with a score less than or equal to 2.0.

filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;

- Use the ORDER command to sort the remaining records by hour and score.

ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);

- Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: hour, ngram, score, count, mean.

STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();
Step 2: Start SOCKS Proxy

Using a web browser, navigate to http://www.cs.washington.edu/lab/facilities/hadoop.html and follow the instructions on that page to create your user account.
Then open a UNIX terminal window, and using SSH log in to the cloud gateway server hadoop.cs.washington.edu. Enter your username as you chose it on this site.
Your password is: pwd4cloud
You will be prompted to choose a new password; select one and enter it. It will automatically log you out immediately after you set it.
There is a second server behind the gateway which you must also reset your password on. This machine will still have the default password "pwd4cloud" attached to your name. To set this password:
- ssh in to hadoop.cs.washington.edu again. Use your username and your new password you set yourself
- ssh into 10.1.133.1. We will refer to this as the "submission node."
- Your password here is pwd4cloud. Change this password to one of your liking (for simplicity, the same one as hadoop.cs.washington.edu); it will log you out as soon as it is set.
- log out of hadoop.cs.washington.edu by typing "exit" and hitting enter.
Step 3: Configure FoxyProxy so that you can see the system state

Our cluster, in the 10.1.133.X network space, is not directly accessible; we must access it through a SOCKS proxy connected to the gateway node, hadoop.cs.washington.edu. You must now configure a proxy connection to allow you to make this connection.
If you have cygwin installed (or are using OSX/Linux), open a cygwin terminal, and type:
ssh -D 2600 username@hadoop.cs.washington.edu
replacing username with your login name. When prompted for your password, enter it. You will see an ssh session to this node. You will not use the ssh session directly -- just minimize the window. (It is forwarding traffic over port 2600 in the background.)
If you are a Windows user and have PuTTY installed, start PuTTY, and in the "PuTTY Configuration" window, go to Connection -- SSH -- Tunnels. Type "2600" in the "Source Port" box, click "Dynamic," then click "Add." An entry for "D 2600" should appear in the Forwarded Ports panel. Go back to the Session category, and type hadoop.cs.washington.edu in the Host Name box. Log in with your username and password. When this has logged in, you do not need to do anything else with this window; just minimize it, and it will forward SOCKS traffic in the background.

If you are using Windows and do not have an ssh client, download PuTTY. It is free. Download putty.exe from this page:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
You need to set up a proxy ssh connection any time you need to access the web interface (see steps 3 and 4), or if you are trying to connect directly to the cluster from your own machine.
Step 4: Test FoxyProxy configuration

The Hadoop system will expose information about its health, current load, etc., on two web services that are hosted behind the firewall. You can view this information, as well as browse the DFS and read the error log files from services and jobs, via this interface.
You must use Firefox to access these sites.
Download the FoxyProxy Firefox extension at: http://foxyproxy.mozdev.org/downloads.html
Install the extension and restart Firefox. If it prompts you to configure FoxyProxy, click "yes." If not, go to Tools * FoxyProxy * Options.
Set the "Mode" to "Use proxies based on their pre-defined patterns and priorities." In the "Proxies" tab, click "Add New Proxy".
In the "Global Settings" tab of the top-level FoxyProxy Options window, select "Use SOCKS proxy for DNS lookups".
- Make sure "Enabled" is checked
- Proxy Name: "UW Hadoop" (or something reasonable)
- Under "Proxy Details," select Manual Proxy Configuration.
- hostname: localhost.
- port: 2600
- SOCKS proxy? should be checked
- Select the radio button for SOCKS v5
- Under URL Patterns, click Add New Pattern
- Pattern name: UW Private IPs:
- URL Pattern: http://10.1.133.*:*/*
- Select "whitelist" and "Wildcards"
- Click Add New Pattern again
- Pattern name: xenhosts
- URL pattern: http://xenhost-*:*/*
- Select "whitelist" and "Wildcards"
Click "OK" to exit all the options.
You will now be able to surf the web regularly, while still redirecting the appropriate traffic through the SOCKS proxy to access cluster information.
Step 5: Log in to the submission node

Visit the URL http://10.1.133.0:50070/
It should show you information about the state of the DFS as a whole. This lets you know that the "10.1.133..." rule was set up correctly. Click on one of the "XenHost-???" links in the middle of the page. If it takes you to a view of the DFS, you have set up the "xenhosts" patterns correctly. Hurray!
Step 6: Test your connection to Hadoop itself by browsing the DFS

If you are using a lab machine, then you do not have the permissions to edit the hosts file, which is necessary for a direct connection to the Hadoop service. Therefore, you will need to perform all your actual Hadoop instructions on the submission node, which is configured to talk directly to the Hadoop system.
Log in to username@hadoop.cs.washington.edu. Use the password you set earlier. From there, log in to the submission node: username@10.1.133.1. Use the (second) password you set earlier.
You are now logged in to a machine where you can run Hadoop. Hadoop is installed in:
/hadoop/hadoop-0.18.2-dev
We will refer to this directory as $HADOOP, below.
Open up a terminal (Windows users: cygwin window).
Change to the $HADOOP directory where you installed Hadoop (this is /hadoop/hadoop-0.18.2-dev on the submission node).
Type the following command (assuming "$" is the prompt):
$ bin/hadoop dfs -ls /
It should return a listing similar to this:
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-01 17:56 /tmp
drwxr-xr-x - hadoop supergroup 0 2008-10-02 05:10 /user
Acknowledgements: Many of the installation instructions in this tutorial were written by Aaron Kimball.