Introduction to PIG

To use PIG, we recommend working in a UNIX shell. You can use one of the department's UNIX machines, install Cygwin on your Windows machine, or use a terminal window on your Mac. Then follow these instructions to set up PIG.
  1. Download the latest version of PIG from the svn repository.

    svn checkout http://svn.apache.org/repos/asf/hadoop/pig/trunk

    A new folder named pig/trunk is created in your current directory.

  2. Build PIG from the source using Ant.

    cd pig/trunk
    ant

    A new file pig.jar is created.

  3. We also need to build the tutorial files.

    cd tutorial
    ant

    All the files needed to run the tutorial scripts are now in pig/trunk/tutorial/build/output/pigtmp. You can copy this folder to a different location on your drive to make it more easily accessible.

  4. After you have set up your folder, you can run a PIG example script from inside that folder by running

    java -cp "pig.jar" org.apache.pig.Main -x local script1-local.pig

    After running this command, a new file script1-local-results.txt is created.
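
Steps 1 to 4 above rely on svn, ant, and java being available in your shell. Before starting, it can help to confirm that all three are on your PATH. The following is a minimal sketch of such a check; the function name check_tools is a hypothetical helper introduced here for illustration and is not part of PIG or the tutorial.

```shell
# Hypothetical helper (not part of PIG): report any of the given
# commands that cannot be found on PATH.
# Returns 0 if every command is found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        # "command -v" is the portable way to test whether a command exists
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

After defining the function in your shell, you could call it as

    check_tools svn ant java && echo "ready to build PIG"

Any tool it reports as missing needs to be installed before the corresponding step will work.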

Although PIG can be run in local mode (as the example above shows), it is typically run on top of Hadoop. You will need to do this to answer the questions in this problem assignment. We will run PIG on the IBM/Google cluster. Follow the instructions here to copy your pigtmp directory to the cluster.

You may run into an error when running script2-hadoop.pig on the cluster. If you get the message

    ERROR org.apache.pig.tools.grunt.Grunt - java.io.IOException: Invalid alias: hour_frequency2::hour00::group::ngram in same ...

then change the line same1 = ... to

    same1 = FOREACH same GENERATE hour00::group::group::ngram as ngram, $2 as count00, $5 as count12;
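
If you prefer not to edit the script by hand, the change to same1 described above could be applied with sed. This is only a sketch: the function name fix_same1 is hypothetical, and it assumes that the original same1 statement sits on a single line beginning with "same1 = ". If the statement in your copy spans several lines, edit the file manually instead.

```shell
# Hypothetical helper: replace the statement that starts with "same1 ="
# in the given Pig script with the corrected FOREACH statement.
# Assumes the original statement is on ONE line.
# The unmodified file is kept next to it as <script>.bak.
fix_same1() {
    script="$1"
    # Single quotes keep the shell from expanding $2 and $5,
    # which are positional field references in Pig Latin.
    sed -i.bak 's/^same1 = .*$/same1 = FOREACH same GENERATE hour00::group::group::ngram as ngram, $2 as count00, $5 as count12;/' "$script"
}
```

For example, from inside your pigtmp directory:

    fix_same1 script2-hadoop.pig

If the result is not what you expected, the original is still available as script2-hadoop.pig.bak.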