Hadoop: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Pig Latin: Pig Latin is a language for the analysis of extremely large datasets.
Its designers claim that
many people who analyze [extremely large datasets]
are entrenched procedural programmers, who find the declarative, SQL style
to be unnatural. The MapReduce paradigm was a success with these
programmers, but it
is too low-level and rigid, and leads to a great deal
of custom user code that is hard to maintain, and reuse.
So Pig Latin aims for the "sweet spot" between the declarative style of
SQL and the procedural style of map-reduce.
Pig Latin: A Not-So-Foreign Language for Data Processing
Pig Latin is implemented in Pig, a system which compiles Pig Latin into physical plans that are then executed over Hadoop.
AWS: Amazon Web Services (AWS) is a collection of computing services offered over the Internet by Amazon.com (taken from the AWS Wikipedia page). One subset of these, the cloud services, lets any person run a cluster of designated machines and execute any scripts or programs that are supported by the cloud services.
For this project, you'll use the AWS Elastic Map Reduce service, which enables anyone to easily run Hadoop programs (including Pig) with a click of a button. Elastic Map Reduce relies on Elastic Compute Cloud (EC2) to set up a cluster of machines as well as Hadoop and Pig. Moreover, programs or data can be accessed or stored in a centralized location via Simple Storage Service (S3). We will be primarily dealing with Elastic Map Reduce and only touching the base of S3 as we will be accessing the larger data (~ 40 MB and ~ 6 GB) from a S3 bucket (location/folder). The details of doing this are provided in the problems.
Amazon provides documentation on the various AWS computing services; in particular, documentation for the specific AWS products we use is linked below, for reference only (not required reading, because the project instructions should provide enough information or links thereto):
NOTE: As previously stated, you should minimize your use of AWS in order to save money. After finishing the Pig tutorial, you should avoid using a cluster until you have tested your code locally against the smaller datasets.
Note: Some of the sample commands on this page are quite long. When you enter them they should be on a single line even though they may be broken into more than one line for convenience when you view or print the web page.
The following instructions assume you are working on a Linux computer in the undergrad labs using a bash terminal.
In the commands shown below, a '%' sign before a command means it is supposed to be run on the AWS master node (after ssh-ing into it) while '$' means your local computer (attu or a lab Linux machine). That is:
# cmd to be run on the AWS master node after ssh-ing into it. % ls # cmd to be run on the local computer $ ls
$ cd project4 $ tar -xzf pigtutorial.tar.gz $ tar -xzf hadoop-0.18.3.tar.gz
hadoopprogram is executable:
$ chmod u+x ~/project4/hadoop-0.18.3/bin/hadoop
$ export PIGDIR=~/project4/pigtmp $ export HADOOP=~/project4/hadoop-0.18.3 $ export HADOOPSITEPATH=~/project4/hadoop-0.18.3/conf/ $ export PATH=$HADOOP/bin/:$PATHThe variable
JAVA_HOMEshould be set to point to your system's Java directory. On the undergrad lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
excite.log.bz2; we have already uploaded it to S3 for your convenience), using the instructions in section "Copying files to or from the master node". If you have any problems at this point, please contact a TA or talk to your classmates.
In the Pig scripts, the input and output files are named using relative paths into the Hadoop Distributed File System (HDFS). Before running them on the AWS cluster, you need to modify these paths to make them absolute paths in HDFS and S3, then copy the new scripts again to the master node:
/user/hadoopdirectory on the AWS cluster's HDFS file system, if it is not already there. From the master node, run:
% hadoop dfs -mkdir /user/hadoopCheck that the directory was created by listing it with this command:
% hadoop dfs -ls /user/hadoopYou may see some output from either command, but you should not see any errors.
excite.log.bz2from S3. To do this, change:
raw = LOAD 'excite.log.bz2' USING PigStorage....to
raw = LOAD 's3n://uw-cse444-proj4/excite.log.bz2' USING PigStorage....
script1-hadoop.pig, the line:
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...becomes
STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING ...Make the analogous change in
In all your Pig scripts than run on the cluster, you will have to use absolute S3 or HDFS file paths, just like with the tutorial scripts above.
% java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pigyou can and must type:
% pig -l . script1-hadoop.pig
Once you have successfully run a script, the output will be stored in a
directory (not a file) in the AWS cluster's HDFS file system, located
at the absolute path you specify in the Pig Latin STORE command.
That is, for
script1-hadoop.pig, the output will be stored at
HDFS is separate from the master node's file system, so
before you can copy this to your local machine, you must copy the directory
from HDFS to the master node's Linux file system:
% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-resultsThis will create a directory
part-*files in it, which you can copy to your local machine with
scp. You can then concatenate all the
part-*files to get a single results file, perhaps sorting the results if you like.
hadoop dfs -helpor see the
hadoop dfsguide to learn how to manipulate HDFS. (Note that
hadoop fsis the same as
The above Pig tutorial is an example of how we are going to run our finished/developed scripts on the cluster. One way to do this project is to complete all the problems locally (or the parts which can be done locally) and then run a cluster and finish all the Hadoop/Pig parts to be run on the cluster. For example, you can complete the whole Hadoop part using a cluster of 15 nodes in 3 hours (this includes problem 1-3).
Also note that the AWS billing for nodes is on an hourly basis and rounds up, so if you start up a cluster, we would suggest killing it about 10 minutes before the end of an hour (to give adequate time for shutdown). Moreover, if you start a cluster, and kill it within few minutes of starting it, you will still be billed for a complete hour.