Project 4 - preliminaries and Pig tutorial

Technologies you'll be using

Hadoop: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Pig Latin: Pig Latin is a language for the analysis of extremely large datasets. Its designers claim that many people who analyze such datasets are entrenched procedural programmers who find the declarative, SQL style unnatural. The MapReduce paradigm was a success with these programmers, but it is too low-level and rigid, and it leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin therefore aims for the "sweet spot" between the declarative style of SQL and the procedural style of MapReduce (see the paper Pig Latin: A Not-So-Foreign Language for Data Processing).
Pig Latin is implemented in Pig, a system which compiles Pig Latin into physical plans that are then executed over Hadoop.
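
To give a flavor of the language, here is a minimal, hypothetical Pig Latin script; the file name and field names are invented for illustration and are not part of the project. It loads a tab-delimited log, groups it by user, and counts the records in each group:

  -- hypothetical example: count the number of visits per user in a tab-delimited log
  visits = LOAD 'visits.log' USING PigStorage('\t') AS (user, url, time);
  grouped = GROUP visits BY user;
  counts = FOREACH grouped GENERATE group, COUNT(visits);
  STORE counts INTO 'visit-counts' USING PigStorage();

Each statement names an intermediate relation, so the data flow reads procedurally even though the individual operations (LOAD, GROUP, FOREACH, STORE) are declarative.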

AWS: Amazon Web Services (AWS) is a collection of computing services offered over the Internet by Amazon.com (taken from the AWS Wikipedia page). One subset of these, the cloud computing services, lets anyone run a cluster of machines and execute any scripts or programs that those services support.

For this project, you'll use the AWS Elastic Map Reduce service, which enables anyone to easily run Hadoop programs (including Pig) with the click of a button. Elastic Map Reduce relies on Elastic Compute Cloud (EC2) to set up a cluster of machines along with Hadoop and Pig. Moreover, programs and data can be stored in and accessed from a centralized location via the Simple Storage Service (S3). We will be working primarily with Elastic Map Reduce and only touching on S3, since we will be reading the larger datasets (~40 MB and ~6 GB) from an S3 bucket (a location/folder in S3). The details of doing this are provided in the problems.
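
As a preview of how this looks in practice, reading a dataset from S3 in a Pig script amounts to using an s3n:// path where a plain file name would otherwise go. The statement below is only an illustration (the schema in the AS clause is hypothetical), though the bucket and file are the ones used in the tutorial later on:

  raw = LOAD 's3n://uw-cse444-proj4/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);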

Amazon provides documentation on the various AWS computing services; in particular, documentation for the specific AWS products we use is linked below, for reference only (not required reading, because the project instructions should provide enough information or links thereto):

Getting started guides

Developer guides

Before you run anything…

  1. To get more familiar with MapReduce and Pig, we recommend that you first skim through the following two research papers:
  2. Now set up an AWS account for this project, by following the instructions in the following sections:
    1. Setting up your AWS account
    2. Setting up an EC2 keypair
    3. Monitoring Hadoop jobs with the job tracker (read the introductory material, but only complete the section titled "Setting up Firefox to use FoxyProxy")
  3. Once your AWS account is set up and activated for Elastic Map Reduce, try setting up an AWS cluster and running the Pig tutorial (below).

NOTE: As previously stated, you should minimize your use of AWS in order to save money. After finishing the Pig tutorial, you should avoid using a cluster until you have tested your code locally against the smaller datasets.

Setting up an AWS Hadoop cluster and trying Pig

Note: Some of the sample commands on this page are quite long. When you enter them, type each one on a single line, even though it may be broken across multiple lines for convenience when you view or print the web page.

The following instructions assume you are working on a Linux computer in the undergrad labs using a bash terminal.

In the commands shown below, a '%' sign before a command means it is supposed to be run on the AWS master node (after ssh-ing into it) while '$' means your local computer (attu or a lab Linux machine). That is:

# cmd to be run on the AWS master node after ssh-ing into it.
% ls
# cmd to be run on the local computer
$ ls
  1. Download project4.tar.gz to your home directory and extract it. (Warning: this file is about 20MB, so you want to have a fast connection when you grab it.) Extracting it will create a directory called project4. Henceforth, we assume that you have, indeed, downloaded the file to your home directory. Extract the archive, change directory into project4, and then extract Pig and Hadoop:
    $ tar -xzf project4.tar.gz
    $ cd project4
    $ tar -xzf pigtutorial.tar.gz
    $ tar -xzf hadoop-0.18.3.tar.gz
    
  2. Make sure the hadoop program is executable:
    $ chmod u+x ~/project4/hadoop-0.18.3/bin/hadoop
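    (Optional) To sanity-check the unpacked distribution, you can ask Hadoop to print its version; assuming the chmod above worked and Java is set up, a command like the following should report 0.18.3:
    $ ~/project4/hadoop-0.18.3/bin/hadoop version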
  3. Set a few environment variables. You will need them later. You must set these environment variables each time you open a new shell, or you can put these commands in the .bashrc, .profile, or other appropriate configuration file in your home directory.
    $ export PIGDIR=~/project4/pigtmp
    $ export HADOOP=~/project4/hadoop-0.18.3
    $ export HADOOPSITEPATH=~/project4/hadoop-0.18.3/conf/
    $ export PATH=$HADOOP/bin/:$PATH
    
    The variable JAVA_HOME should be set to point to your system's Java directory. On the undergrad lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
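
    If you do have to set JAVA_HOME, point it at the directory that contains Java's bin/ subdirectory. The path below is only an example; substitute whatever is correct for your machine:
    # example only -- use your own machine's Java directory
    $ export JAVA_HOME=/usr/lib/jvm/java-6-sun
    $ export PATH=$JAVA_HOME/bin:$PATH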

  4. Now read and complete the Pig tutorial, with the following caveats:

    1. Skip over all the installation instructions (you already did that above). Go straight to the section entitled "Pig Scripts: Local Mode." (Type the commands shown yourself, rather than just copying and pasting - sometimes the dashes on the web page are different from the ones recognized by Linux!) First try running Pig Script 1 in local mode on your machine (or a Linux lab machine); a sample invocation is sketched below. This is how you will develop and test your scripts locally, before actually trying them on the cluster.
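
      For reference, the tutorial's local-mode invocation looks roughly like the following, run from the pigtmp directory (check the tutorial itself for the exact command, since the class path and options may differ in your copy):
      $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig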

    2. Now start a new cluster of 1 node using the instructions in "Running an AWS job" and ssh into the master node.

    3. Copy Pig scripts 1 and 2 and tutorial.jar over to the master node (you don't need to copy excite.log.bz2; we have already uploaded it to S3 for your convenience), using the instructions in the section "Copying files to or from the master node"; a sample command is sketched below. If you have any problems at this point, please contact a TA or talk to your classmates.
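
      A typical copy command looks something like the one below. The key-pair file name and master node address are placeholders that you must replace with your own values (on Elastic Map Reduce, the login user on the master node is typically hadoop):
      $ scp -i ~/your-keypair.pem script1-hadoop.pig script2-hadoop.pig tutorial.jar hadoop@<master-node-DNS>:~/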

    4. Now you can connect to the AWS cluster! But you are not ready to run the Pig tutorial scripts yet.

      In the Pig scripts, the input and output files are named using relative paths into the Hadoop Distributed File System (HDFS). Before running them on the AWS cluster, you need to modify these paths to make them absolute paths in HDFS and S3, then copy the new scripts again to the master node:

      1. First create a /user/hadoop directory on the AWS cluster's HDFS file system, if it is not already there. From the master node, run:
        % hadoop dfs -mkdir /user/hadoop
        
        Check that the directory was created by listing it with this command:
        % hadoop dfs -ls /user/hadoop
        
        You may see some output from either command, but you should not see any errors.

      2. Next, modify the tutorial Pig scripts to load the input file excite.log.bz2 from S3. To do this, change:
        raw = LOAD 'excite.log.bz2' USING PigStorage....
        to
        raw = LOAD 's3n://uw-cse444-proj4/excite.log.bz2' USING PigStorage....
      3. Finally, we need to specify an absolute rather than relative HDFS path for the Pig scripts' output, so in script1-hadoop.pig, the line:
        STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...
        becomes
        STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING ...
        Make the analogous change in script2-hadoop.pig.

      4. Copy the new versions of the Pig scripts to the master node.

      In all your Pig scripts that run on the cluster, you will have to use absolute S3 or HDFS file paths, just as with the tutorial scripts above.

    5. Now, to run a script, do not use the command shown in the tutorial:
      % java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
      Instead, type:
      % pig -l . script1-hadoop.pig
    6. While the script is running, make sure you can access the job tracker UI. (The instructions are available in the section "Monitoring Hadoop jobs with the job tracker".)

    7. Copy back the data from the script using the instructions in the section "Copying files to or from the master node".

      Once you have successfully run a script, the output will be stored in a directory (not a file) in the AWS cluster's HDFS file system, located at the absolute path you specify in the Pig Latin STORE command. That is, for script1-hadoop.pig, the output will be stored at /user/hadoop/script1-hadoop-results . HDFS is separate from the master node's file system, so before you can copy this to your local machine, you must copy the directory from HDFS to the master node's Linux file system:

      % hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results
      This will create a directory script1-hadoop-results with part-* files in it, which you can copy to your local machine with scp. You can then concatenate all the part-* files to get a single results file, perhaps sorting the results if you like.
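
      For example (the key-pair file name and master node address are again placeholders for your own values):
      $ scp -r -i ~/your-keypair.pem hadoop@<master-node-DNS>:~/script1-hadoop-results .
      $ cat script1-hadoop-results/part-* | sort > script1-hadoop-results.txt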

      Use hadoop dfs -help or see the hadoop dfs guide to learn how to manipulate HDFS. (Note that hadoop fs is the same as hadoop dfs.)

    8. Now terminate the cluster (using the instructions in the section "Terminating a running cluster") when the script has finished and you are convinced by the results. Note: all data on the master node and the cluster's HDFS will be lost! Make sure you've copied all data you want to keep back to your local machine.

The Pig tutorial above is an example of how you will run your finished scripts on the cluster. One way to do this project is to complete all of the problems locally (or the parts that can be done locally), and only then start a cluster and run all of the Hadoop/Pig parts on it. For example, you can complete the whole Hadoop part (problems 1-3) using a cluster of 15 nodes in about 3 hours.

Also note that AWS bills for nodes by the hour and rounds up, so if you start up a cluster, we suggest killing it about 10 minutes before the end of an hour (to give adequate time for shutdown). Moreover, if you start a cluster and kill it within a few minutes of starting it, you will still be billed for a complete hour.