Project 4: Hadoop and PIG

LAST UPDATED TIME OF THIS DOCUMENT: December 5, 2009 at 11:30 PM.

ESTIMATED TIME: 18 hours.

CHANGE HISTORY
* For problem 3.2, we want the results based on the large dbtest128g datasets rather than the medium.txt dataset.

DUE DATE: Saturday, December 12 at 11 pm. No late assignments will be accepted, even if you have late days remaining because it is the end of the autumn quarter.

RIGHT NOW: Immediately complete the steps to set up your Amazon Web Services (AWS) account (here). That will take a couple of days to go through, so you want to do that right away so it is ready when you need it. You will start the assignment on your local machine and use the cluster for large runs at the end.

TURN IN INSTRUCTIONS: Turn in eight txt files (details below) using the regular catalyst dropbox.

GROUPS: You may work with a partner on this assignment. If you do work with a partner, one member of the group should turn in a single project with everyone's names on it, and all members of the group will receive the same score. You should also include a short readme.txt file that lists the members of the group and gives a short summary of who did what. Everyone in the group is responsible for the material regardless of how you organize the work.

HADOOP: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

PIG LATIN: Pig Latin is a language for the analysis of extremely large datasets. The motivation for Pig Latin is the fact that many people who analyze [extremely large datasets] are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The map-reduce paradigm was a success with these programmers, but it is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. So Pig Latin aims for the "sweet spot" between the declarative style of SQL and the procedural style of map-reduce. (From Pig Latin: A Not-So-Foreign Language for Data Processing.)
Pig Latin is implemented in PIG, a system which compiles Pig Latin into physical plans that are then executed over Hadoop.
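For a flavor of the language, here is a tiny, hypothetical Pig Latin script (the input file, field names, and delimiter are invented for illustration and are not part of this assignment):

    -- load a hypothetical tab-delimited log of (user, query) pairs
    raw = LOAD 'sample.log' USING PigStorage('\t') AS (user:chararray, query:chararray);
    -- keep only the non-empty queries
    nonempty = FILTER raw BY query != '';
    -- group by user and count each user's queries
    grouped = GROUP nonempty BY user;
    counts = FOREACH grouped GENERATE group AS user, COUNT(nonempty) AS n;
    STORE counts INTO 'sample-counts' USING PigStorage('\t');

Each statement names an intermediate relation, so the script reads like a step-by-step dataflow rather than a single declarative query; PIG then compiles the whole thing into one or more MapReduce jobs.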

AWS: The Amazon Web Services (AWS) are a collection of remote computing services (also called web services) offered over the Internet by Amazon.com (taken from the wiki page). One subset of these, the cloud services, lets anyone run a cluster of designated machines and execute any scripts or programs that those services support.

In the scope of this class, we are interested in their Elastic MapReduce service, which enables anyone to easily run hadoop-related scripts (including PIG) with a click of a button. Elastic MapReduce relies on Elastic Compute Cloud (EC2) to set up a cluster of machines as well as hadoop (v0.18.3) and pig (v0.3.0). Moreover, scripts or data can be accessed or stored in a centralized location via the Simple Storage Service (S3). We will be primarily dealing with Elastic MapReduce and only touching on S3, since we will be accessing the larger data (~40 MB and ~6 GB) from an S3 bucket (location/folder). The details of doing this are provided in the problems. Finally, here is a link to their documentation on the different web services provided by AWS, while links to the getting started and developer guides are listed below for reference only (they are not required reading, since the project instructions provide specific details or links at the appropriate places).

   Getting Started Guides

    1. Elastic Map Reduce
    2. S3
    3. EC2

   Developer Guides

    1. Elastic Map Reduce
    2. S3
    3. EC2 

Preliminaries

  1. To get more familiar with MapReduce and PIG, we recommend that you first skim through the following two research papers:
  2. Now set up an AWS account by completing the instructions in the sections below titled: 
    1. "Setting up Your AWS Account"
    2. "Setting up an EC2 keypair"
    3. "Setting up the proxy to view job tracker"
  3. After completing the tutorial (which involves setting up an Elastic MapReduce cluster of 5 nodes), please complete all the problems on your local machine using the hadoop and pig in the project4.tar.gz file. Only when you are sure about the correctness of the scripts (i.e., you have tested them locally) should you execute them on the AWS cluster. Once the cluster is running, copy the scripts over; when you are done with them, copy the required files back to your local machine and don't forget to terminate the cluster.
  4. The following instructions assume you are working on a Linux machine on one of the lab PCs using a bash shell window.
  5. In the commands referenced below, a '%' sign before a command means it is supposed to be run on the AWS master node (after ssh-ing into it), while a '$' sign means it should be run on your local computer (attu or a lab linux machine). That is:
    # cmd to be run on the AWS master node after ssh-ing into it.
    % ls
    # cmd to be run on the local computer
    $ ls
  6. Download project4.tar.gz to your home directory, and unzip it. (Warning: this file is about 20MB, so you want to have a fast connection when you grab it.) This will create a directory called project4. Henceforth, we assume that you have, indeed, downloaded the file to your home directory. Change directory into project4, and unzip pig and hadoop.
    $ cd project4
    $ tar -xzf pigtutorial.tar.gz
    $ tar -xzf hadoop-0.18.3.tar.gz
  7. Make sure the hadoop script is executable:
    $ chmod u+x ~/project4/hadoop-0.18.3/bin/hadoop
  8. Set a few environment variables. You will need them later. You must set these environment variables EACH time, OR you can put these commands in the .bashrc, .profile, or other appropriate configuration file in your home directory.
    $ export PIGDIR=~/project4/pigtmp
    $ export HADOOP=~/project4/hadoop-0.18.3
    $ export HADOOPSITEPATH=~/project4/hadoop-0.18.3/conf/
    $ export PATH=$HADOOP/bin/:$PATH
    The variable JAVA_HOME should be set to point to your system's Java directory. On the PC Lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
  9. Now complete the Pig tutorial.
    1. Read and complete the Pig Tutorial
      1. Skip over all the installation instructions (you already did that above). Go straight to the section entitled "Pig Scripts: Local Mode". Try to first run Pig Script 1 in local mode on your machine (or a linux lab machine). This is how you will be developing and testing the scripts locally on your computer, before actually trying them on the cluster.
      2. Now start a new cluster of 5 nodes using the instructions in "Running jobs on AWS" (you can follow either the GUI way or the command line, if you are so inclined) and ssh into the master node.
      3. Now copy pig scripts 1 and 2 and tutorial.jar over to the master node (you don't need to copy excite.log.bz2; we will be accessing it via an S3 bucket which is set up beforehand for your convenience), using the instructions in the section "Copying scripts/data to/from the master node". If you have any problems at this point, please contact a TA or talk to your classmates.

        However, there is a complication involving the scripts that are meant to be run on the hadoop cluster. The commands that specify input and output files use relative addresses. You should specify absolute paths to be safe.

        First create a /user/hadoop directory on the AWS cluster (although it should already be there):

        % hadoop dfs -mkdir /user/hadoop
        Do a listing of this directory (you should see some output and no exceptions/errors):

        % hadoop dfs -ls /user/hadoop

        The Pig scripts also use relative addresses that you will have to change to absolute addresses.

        First, we will be loading the excite.log.bz2 via S3. Hence in script1-hadoop.pig you would change:

        raw = LOAD 'excite.log.bz2' USING PigStorage....
        to
        raw = LOAD 's3n://uw-cse444-proj4/excite.log.bz2' USING PigStorage....

        Secondly, we are going to store the results on hdfs, so in the same script, the line:

        STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...
        becomes
        STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING ...

        In all your scripts that run on the cluster you will have to use absolute path names.

      4. Now to run the scripts, instead of the following command as shown in the tutorial:
        % java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
        Instead, you must simply type:
        % pig script1-hadoop.pig
      5. While the script is running, make sure you can access the job tracker UI (the instructions are available in the section "Tunneling the job tracker UI").
      6. Copy back the data from the script using the instructions in the section "Copying scripts/data to/from the master node".

        Once you have successfully run the script, the output will be stored in the hdfs on the master node at the absolute path you specified in STORE. That is, for script1, the output will be stored at '/user/hadoop/script1-hadoop-results'. Before you can copy this to your local machine, you will have to copy the directory from the hdfs to the master node's local filesystem, and then you can run scp to copy it to your machine.

        % hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results
        This will create a directory script1-hadoop-results with part-* files in it.
      7. Now terminate the cluster (using the instructions in the section "Terminating a running cluster") when the script has finished and you are satisfied with the results.
  10. The above pig tutorial is an example of how we are going to run our finished/developed scripts on the cluster. One way to do this project is to complete all the problems locally (or the parts which can be done locally) and then start a cluster and finish all the hadoop/pig parts that need to be run on the cluster. For example, you can complete the whole hadoop part using a cluster of 15 nodes in 3 hours (this includes problems 1-3).
  11. Also note that AWS bills for nodes on an hourly basis, so if you have started a cluster, we suggest killing it about 10 minutes before the end of an hour (to give adequate time for shutdown). Moreover, if you start a cluster and kill it within a few minutes of starting it, you will still be billed for a complete hour.

    Setting up your AWS account

    1.  Go to http://aws.amazon.com/ and sign up:
      1. You may sign in using your existing Amazon account or you can create a new account by selecting "I am a new user." We suggest you consider setting up a new Amazon account for this project, separate from any existing Amazon customer account you might have.
      2. Enter your contact information and confirm your acceptance of the AWS Customer Agreement.
      3. Once you have created an Amazon Web Services Account, check your email for your confirmation step. You need Access Identifiers to make valid web service requests.
    2. Welcome to Amazon Web Services. Before doing anything else, get an AWS Credit Coupon number from us for your use in CSE 444. This number should be sufficient to cover AWS charges for this project.
      1. Check your code by clicking on this link. (Please ignore the warning about the untrusted connection; for more details, visit this link.) If you don't see your code or have problems with the link, email the TAs immediately.
      2. Go to http://aws.amazon.com/awscredits
      3. Enter your claim code and click Redeem. The credits will be added to your AWS account.
      4. (Note: AWS does charge for some activities; therefore, you need an AWS Credit Coupon so that they will not charge your credit card.)
    3. Sign up for Amazon Elastic MapReduce. Amazon Elastic MapReduce uses Amazon Elastic Compute Cloud to run your job flows and Amazon Simple Storage Service to store and access your data. After completing the sign-up process, you will have signed up to use Amazon Elastic Compute Cloud and Simple Storage Service.
      1. Go to http://aws.amazon.com/elasticmapreduce/
      2. Sign up (note: you need to give your credit card number; however, if you were issued an AWS Credit Coupon, they will charge your AWS credit balance before charging your credit card. This project shouldn't use all of your AWS credit balance.)

    Setting up an EC2 key pair

    1. As part of your account setup, create a keypair using the instructions on Amazon here.
    2. Instead of calling it MyFirstKeyPair.pem, give it a descriptive name and save it to a convenient location. We will reference this as </path/to/saved/keypair/file.pem> in the following instructions.
    3. Make sure the pem file has adequate permissions; just to be safe, run the following on it:
      $ chmod 600 </path/to/saved/keypair/file.pem>

    Running jobs on AWS

    Follow one of the two choices below. That is, you have the option of either starting the cluster using the Management Console (GUI) and ssh-ing into the machine, or starting it via the ruby command line.

    Starting the cluster via an interactive pig session in Management Console

    1. Complete Section 1 and Section 2 on the page Interactive pig session.
    2. Assuming you have done step 1, you should now be able to SSH into the machine by following step 6 in the section below.

    Starting the cluster via the ruby command line

    1. If you are on the lab linux machines, you can use the ruby client installed here. Otherwise, the ruby client is available for download via this link and the install instructions are available here.
    2. Edit the PATH variable to include the ruby client path:
    3. $ export PATH=$PATH:/cse/courses/cse444/aws/ruby_client
      $ export SSH_OPTS="-i </path/to/saved/keypair/file.pem> -o StrictHostKeyChecking=no -o ServerAliveInterval=30"
    4. Create a file called credentials.json with the following content in it:
    5. {
      "access_id": "<insert your aws access id here>",
      "private_key": "<insert your aws secret access key here>",
      "keypair": "<insert the name of your amazon ec2 keypair here>",
      }
      To view your security credentials, navigate to the Security credentials page at http://aws.amazon.com/security-credentials. "IMPORTANT: Your Secret Access Key is a secret, which only you and AWS should know. It is important to keep it confidential to protect your account. Store it securely in a safe place. Never include it in your requests to AWS, and never e-mail it to anyone. Do not share it outside your organization, even if an inquiry appears to come from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key." (copied directly from the AWS page here)
    6. Now instantiate a cluster of five nodes (a pig interactive session) using:
      $ elastic-mapreduce --create --alive --name "Testing PIG -- $USER" \
      --num-instances 5 --pig-interactive
      Created job flow j-36U2JMAE73054
    7. Using the job id returned above (the j-* string), you can get the job details via
    8. $ elastic-mapreduce --describe --jobflow <job_flow_id>
      {
      "JobFlows": [
      {
      "LogUri": null,
      "Name": "Development Job Flow",
      "ExecutionStatusDetail": {
      "EndDateTime": 1237948135.0,
      "CreationDateTime": 1237947852.0,
      "LastStateChangeReason": null,
      "State": "COMPLETED",
      "StartDateTime": 1237948085.0
      },
      "Steps": [],
      "Instances": {
      "Ec2KeyName": null,
      "InstanceCount": 5.0,
      "Placement": {
      "AvailabilityZone": "us-east-1a"
      },
      "KeepJobFlowAliveWhenNoSteps": false,
      "MasterInstanceType": "m1.small",
      "SlaveInstanceType": "m1.small",
      "MasterPublicDnsName": "ec2-67-202-3-73.compute-1.amazonaws.com",
      "MasterInstanceId": "i-39325750"
      },
      "JobFlowId": "j-36U2JMAE73054"
      }
      ]
      }
    9. Now wait until the "State" field changes to RUNNING, and then, using the public dns name of the master node, you can ssh into the machine with:
      $ ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
      
      Or
      $ elastic-mapreduce --job-flow-id j-ABABABABAB --ssh --key-pair </path/to/saved/keypair/file.pem>

    Terminating a running cluster

    After you are done, terminate the cluster.

    1. If you started your cluster using the GUI
      1. Go to the Elastic map reduce management console
      2. Select the job
      3. Click Terminate (it should be right below "Your Elastic MapReduce Job Flows")
      4. Wait for a while and recheck until the status becomes terminated
    2. If you started your job using the ruby command line client, you can either terminate the jobs using step 1 above, or use the command line client as follows:
      1. If you remember/have the job id, skip to step 2; otherwise get the job id using:
        $ elastic-mapreduce --list --active
      2. And terminate the instance with:
        $ elastic-mapreduce --terminate --jobflow j-ABABABASABA
      3. To be 100% sure, you can terminate all your running jobs via:
        $ elastic-mapreduce --list --active --terminate
      4. Finally, wait for a minute and make sure there are no active jobs returned by step 1 above. (You can also check this in the GUI.)

    Copying scripts/data to/from the master node

    For the purposes of this assignment we are just going to use scp to copy data to and from the master node.

    1. Copying multiple files

      1. To copy multiple files, you should 'tar' them up, copy the tar-ed file over and 'untar' the tar-ed file on the other side.
      2. To tar the files in a folder (will tar all the files and subfolders present in folder 'folder'):
         $ tar -czvf file.tar.gz folder/
      3. To tar multiple files:
         $ tar -czvf file.tar.gz file1 file2 file3 ... filen
      4. To untar the files:
         $ tar -xvzf file.tar.gz
    2. Copying to the AWS master node:

      1.  After you have a cluster running, be sure to get the master node's public dns name (Either from the GUI or ruby command line using the job id). Let's call this <master.public-dns-name.amazonaws.com>. Now to copy the 'local_file' over to '<dest_dir>' (<dest_dir> can be /tmp) use:
         $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
        Or if you have the SSH_OPTS var set then:
         $ scp $SSH_OPTS local_file hadoop@<master.public-dns-name.amazonaws.com>:<dest_dir>
      2. Now on the AWS master node, cd into <dest_dir> and you should see your file there.
    3. Copying from the AWS master node to the local computer:

      1. Once your job has completed, or when you want to save an updated version of your script, copy the file over ON the local machine using (the complete file path might be, for example, /tmp/my_file.tar.gz):
        $ scp -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>:<complete_file_path> .
        Or if you have the SSH_OPTS var set then:
         $ scp $SSH_OPTS hadoop@<master.public-dns-name.amazonaws.com>:<complete_file_path> .
      2. The file should be copied into your current directory ('.') on your local computer.

    Setting up the proxy to view job tracker 

    Follow one of the two choices below.

    Set up Firefox to use foxy-proxy

      1. Follow the instructions here to set up FoxyProxy on Firefox.
      2. Alternatively, if you do not want to go through the manual FoxyProxy setup above, you can install FoxyProxy on Firefox and copy foxyproxy.xml (accessible at /cse/courses/cse444/aws/foxyproxy/foxyproxy.xml) into your Firefox profile directory. For example, on the lab computers, my Firefox profile directory is at ~/.mozilla/firefox/axtblayw.default. If you have multiple rand_str.default dirs under ~/.mozilla/firefox, you can find your current profile by looking at the profiles.ini file:
        $ cd ~/.mozilla/firefox
        $ cd <profile.default>
        $ cp foxyproxy.xml foxyproxy_bak.xml
        $ cp /cse/courses/cse444/aws/foxyproxy/foxyproxy.xml foxyproxy.xml

        Command line way

      1. No additional steps are needed at this point.

    Tunneling the job tracker UI via [SOCKS] proxy (accessing the proxy)

            Depending on how you set up the proxy, follow one of the two choices below.

        Firefox foxy proxy way

      1. The proxy is not going to work until you have an AWS EC2 or Elastic MapReduce instance running, so make sure you follow the instructions for running a cluster and have the master.public-dns-name of the master node handy.
      2. Once you have the master node running and know its DNS name, open a new terminal window and create the SSH SOCKS tunnel using:
        ssh -o "ServerAliveInterval 10" -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
        Keep this window running in the background (minimize it).
      3. Now enable FoxyProxy in Firefox and access the job tracker UI in Firefox using (directly copied from here):
        http://[master_dns_name]:9100/ # web UI for MapReduce job tracker(s)
        http://[master_dns_name]:9101/ # web UI for HDFS name node(s) [OPTIONAL]

        Command line way of accessing the Job tracker UI

      1. If you don't want to go through the pain of using Firefox to access the job tracker UI and are okay with accessing web pages the 'lynx' (command line) way, then once you have ssh-ed into the master node, use the following command to access the job tracker UI (the greeting message shown after ssh-ing in also includes this command).
        % lynx http://localhost:9100/ # web UI for MapReduce job tracker(s)

    Problem 1: Getting started with PIG

    After you have completed the Preliminaries, run the first example script in the tutorial on the cluster (script1-hadoop.pig), and then answer the following questions (also see hints below):

    1.1 How many MapReduce jobs are generated by script1?

    1.2 How many map tasks are within the first MapReduce job? How many maps are within later MapReduce jobs? (Note that this number will be small because the dataset is small.)

    1.3 How long does each job take? How long does the entire script take?

    1.4 What do tuples look like after command clean2 = ... in script1? What do tuples look like after command uniq_frequency1 = ... in script1? What do tuples look like after command same = ... in script2?

    Hint 1: Use the job-tracker at http://[master_dns_name]:9100 to see the number of map and reduce tasks for your MapReduce jobs.

    Hint 2: To see the schema for intermediate results, you can use PIG's interactive command line client Grunt, which you can launch with the following command
    $ java -cp pig.jar org.apache.pig.Main -x local
    When using grunt, two commands that you may want to know about are dump and describe.
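    For example, a quick local Grunt session for inspecting script1's intermediate relations might look like this (the file and field names are taken from the tutorial's script1-local.pig, which works on the small local dataset):

    grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
    grunt> DESCRIBE raw;
    grunt> DUMP raw;

    DESCRIBE prints the schema of a relation without running anything, while DUMP actually executes the plan and prints the tuples, so only DUMP the small local data. Paste in the statements from script1 up to the alias you care about (clean2, uniq_frequency1, and so on) and then DESCRIBE or DUMP that alias.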

    What you need to turn in:
    Required: Submit your answers to problems 1.1 - 1.4 in a file named problem1-answers.txt.

    Problem 2: Simple tweak to tutorial

    Write a pig script that creates a histogram showing the distribution of user activity levels.

    Use gnuplot to plot the results.

    So, for a point (x,y) that we plot, we mean to say that y users each performed a total of x queries. You can run the org.apache.pig.tutorial.NonURLDetector(query) filter on the data to remove queries that are empty or only contain URLs.
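    To give a rough idea of the shape such a script can take, here is a sketch only (it is not necessarily the intended or complete solution; the input file, output name, and relation names follow the tutorial's local scripts and are otherwise placeholders):

    REGISTER ./tutorial.jar;
    raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
    clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- first grouping: how many queries each user issued
    by_user = GROUP clean BY user;
    user_counts = FOREACH by_user GENERATE group AS user, COUNT(clean) AS n;
    -- second grouping: how many users issued exactly n queries
    by_n = GROUP user_counts BY n;
    histogram = FOREACH by_n GENERATE group AS n, COUNT(user_counts) AS num_users;
    STORE histogram INTO 'problem2-results' USING PigStorage(',');

    The two levels of grouping are the key idea: the first produces a per-user query count, and the second turns those counts into the (x, y) pairs you plot with gnuplot.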

    A few comments to help you get started:

    What you need to turn in:
    Required: Submit your pig program in problem2.pig.
    Run your program on excite.log.bz2, and submit your computed result file (problem2-results.txt), and your PNG plot (problem2-results.png).

    Problem 3: Run the script on the much larger astronomy data

    The astro dataset contains snapshots of a cosmological simulation that follows the evolution of structure within a volume of 150 million light years shortly after the Big Bang until the present day. The simulation had 512 timesteps, and snapshots were taken every even timestep. Each snapshot consists of the state of each particle in the simulation.

    The following is an example of what the data looks like:

    snapshot, #Time , partId, #mass, position0, position1, position2, velocity0, velocity1, velocity2, eps, phi, rho, temp, hsmooth, metals
    2, 0.290315, 0, 2.09808e-08, -0.489263, -0.47879, -0.477001, 0.1433, 0.0721365, 0.265767, , -0.0263865, 0.38737, 48417.3, 9.6e-06, 0
    2, 0.290315, 1, 2.09808e-08, -0.48225, -0.481107, -0.480114, 0.0703595, 0.142529, 0.0118989, , -0.0269804, 1.79008, 662495, 9.6e-06, 0
    Relevant to us are snapshot, partId, and velocity0-2. Each particle has a unique id (partId) and the data tracks the state of particles over time.

    We created three files of such data with different sizes. tiny.txt has only a couple of rows and might be useful when writing the script. medium.txt has about 1.3 MB and might be useful when testing your script locally.
    Finally, we have files containing data from 11 snapshots (2, 12, 22, ..., 102) called dbtest128g.00<snap_shot_num>.txt.f that have the actual data (~530 MB each). tiny.txt and medium.txt are contained in project4.tar.gz. The names of the different files are:

    dbtest128g.00002.txt.f
    dbtest128g.00012.txt.f
    dbtest128g.00022.txt.f
    dbtest128g.00032.txt.f
    dbtest128g.00042.txt.f
    dbtest128g.00052.txt.f
    dbtest128g.00062.txt.f
    dbtest128g.00072.txt.f
    dbtest128g.00082.txt.f
    dbtest128g.00092.txt.f
    dbtest128g.00102.txt.f

    All these files together can be loaded in PIG scripts via:

    raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00*' USING PigStorage...;

    Or two (or more) at a time using the UNION operator (say we want the pair (02, 12)):

    raw02 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage....;
    raw12 = LOAD 's3n://uw-cse444-proj4/dbtest128g.00012.txt.f' USING PigStorage....;
    raw = UNION raw02, raw12;
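    If you want PIG to know the column names and types up front, you can also attach a schema to the LOAD. The following is only a sketch: the comma delimiter and the types are assumptions based on the sample rows shown above, so verify them against the actual files (for example, watch out for spaces after the commas or for empty fields) before relying on them:

    raw = LOAD 's3n://uw-cse444-proj4/dbtest128g.00002.txt.f' USING PigStorage(',')
          AS (snapshot:int, time:float, partId:int, mass:float,
              position0:float, position1:float, position2:float,
              velocity0:float, velocity1:float, velocity2:float,
              eps:float, phi:float, rho:float, temp:float, hsmooth:float, metals:float);

    Declaring velocity0-velocity2 with numeric types saves explicit casts later when you compare particle velocities across snapshots.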

    3.1 Write a script that counts the number of particles per snapshot. Test it on medium.txt.

    For each of the following datasets (available on the cluster), what is the level of parallelism when running your script (Use the web interface to determine the number of concurrently executing map tasks for each dataset; this is the level of parallelism)? What can you say about the scalability of the system?

    Run the script on the snapshots at

    1. timestep 2
    2. timestep 2 and 12
    3. timestep 2, 12, and 22
    4. timestep 2, 12, 22, and 32

    For each of the 4 cases above, launch your script on the cluster and check in the web interface how many map tasks are created. You can then immediately cancel your script again.

    What you need to turn in:
    Required: 1. Submit your pig program in problem3.1.pig.

    2. Submit your answers to the above questions in problem3.1-answers.txt.

    3.2 For each pair of subsequent snapshots (2, 12), (12, 22), (22, 32), ..., (92, 102), compute the number of particles which increased their velocity, and the number of particles which reduced their velocity. If a particle neither increased nor decreased its velocity, count it as decelerate.

    Your output should have the following format
    2.0, accelerate, 2
    2.0, decelerate, 3
    12.0, decelerate, 16
    22.0, accelerate, 2
    ...
    The first column denotes the start snapshot (int or float), the second column accelerate or decelerate, and the third column a count of the number of particles. It's ok to leave out rows with 0 counts. The results you turn in should be based on the large dbtest128g datasets stored on S3 (e.g., the dbtest128g.00002.txt.f and dbtest128g.00012.txt.f files in the case of (2, 12)).

    Hints:

    What you need to turn in:
    Required: 1. Submit your pig program in problem3.2.pig.
    2. Submit your computed result file (problem3.2-results.txt).