Hadoop: Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Pig Latin: Pig Latin is a language for the analysis of extremely large datasets.
Its designers claim that many people who analyze such datasets are entrenched procedural programmers who find the declarative SQL style unnatural. The MapReduce paradigm was a success with these programmers, but it is too low-level and rigid, and it leads to a great deal of custom user code that is hard to maintain and reuse. Pig Latin therefore aims for the "sweet spot" between the declarative style of SQL and the procedural style of MapReduce (see the paper "Pig Latin: A Not-So-Foreign Language for Data Processing").
Pig Latin is implemented in Pig,
a system which compiles Pig Latin into physical plans that are
then executed over Hadoop.
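To give a feel for the language, here is a small, hypothetical Pig Latin fragment (the file names and fields are invented for illustration and are not part of the project): each statement names an intermediate result, much like steps in a procedural program, while the individual operations remain declarative.
-- hypothetical example: count queries in a tab-separated (user, query) log
raw     = LOAD 'queries.txt' USING PigStorage('\t') AS (user, query);
grouped = GROUP raw BY query;                        -- group records by query string
counts  = FOREACH grouped GENERATE group, COUNT(raw) AS n;
top     = ORDER counts BY n DESC;                    -- most frequent queries first
STORE top INTO 'query-counts' USING PigStorage();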
AWS: Amazon Web Services (AWS) is a collection of computing services offered over the Internet by Amazon.com (taken from the AWS Wikipedia page). One subset of these, the cloud services, lets any person run a cluster of designated machines and execute any scripts or programs that are supported by the cloud services.
For this project, you'll use the AWS Elastic MapReduce service, which enables anyone to easily run Hadoop programs (including Pig) at the click of a button. Elastic MapReduce relies on Elastic Compute Cloud (EC2) to set up a cluster of machines along with Hadoop and Pig. Moreover, programs and data can be stored and accessed in a centralized location via the Simple Storage Service (S3). We will be dealing primarily with Elastic MapReduce and only touching on S3, since we will be accessing the larger datasets (~40 MB and ~6 GB) from an S3 bucket (a location/folder in S3). The details of doing this are provided in the problems.
Amazon provides documentation on the various AWS computing services; in particular, documentation for the specific AWS products we use is linked below, for reference only (not required reading, because the project instructions should provide enough information or links thereto):
NOTE: As previously stated, you should minimize your use of AWS in order to save money. After finishing the Pig tutorial, you should avoid using a cluster until you have tested your code locally against the smaller datasets.
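For reference, the Pig tutorial's local-mode run (no cluster involved) looks roughly like the following; this is a sketch that assumes the PIGDIR variable from the setup steps below, and script1-local.pig is the tutorial's local variant of the first script:
$ cd $PIGDIR
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig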
The following instructions assume you are working on a Linux computer in the undergrad labs using a Bash shell window.
In the commands referenced below, a '%' sign before a command means it is supposed to be run on the AWS master node (after ssh-ing into it) while '$' means your local computer (attu or a lab linux machine). That is:
# cmd to be run on the AWS master node after ssh-ing into it
% ls
# cmd to be run on the local computer
$ ls
$ cd hw7
$ tar -xzf pigtutorial.tar.gz
$ tar -xzf hadoop-0.18.3.tar.gz
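After extraction, hw7 should contain the pigtmp and hadoop-0.18.3 directories used in the rest of these instructions; a quick listing confirms this:
$ ls    # expect to see pigtmp and hadoop-0.18.3, among other files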
Ensure that the hadoop program is executable:
$ chmod u+x ~/hw7/hadoop-0.18.3/bin/hadoop
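As a quick sanity check (assuming JAVA_HOME is set, as discussed below), the wrapper script should now run and print its version:
$ ~/hw7/hadoop-0.18.3/bin/hadoop version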
Set the following environment variables:
$ export PIGDIR=~/hw7/pigtmp
$ export HADOOP=~/hw7/hadoop-0.18.3
$ export HADOOPSITEPATH=~/hw7/hadoop-0.18.3/conf/
$ export PATH=$HADOOP/bin/:$PATH
The variable JAVA_HOME should be set to point to your system's Java directory. On the undergrad lab machines, this should already be done for you. If you are on your own machine, you may have to set it.
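If you do need to set it yourself, the command looks like the following; the path shown is only an example and depends on where Java is installed on your system:
$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk    # example path; adjust for your machine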
Copy the Pig tutorial scripts to the master node (you do not need to copy the data file excite.log.bz2; we have already uploaded it to S3 for your convenience), using the instructions in section "Copying scripts/data to/from the master node". If you have any problems at this point, please contact a TA or talk to your classmates.
In the Pig scripts, the input and output files are named using relative paths into the Hadoop Distributed File System (HDFS). Before running them on the AWS cluster, you need to modify these paths to make them absolute paths in HDFS and S3, then copy the new scripts again to the master node:
First, create the /user/hadoop directory on the AWS cluster's HDFS file system, if it is not already there. From the master node, run:
% hadoop dfs -mkdir /user/hadoop
Check that the directory was created by listing it with this command:
% hadoop dfs -ls /user/hadoop
You may see some output from either command, but you should not see any errors except for one like this:
10/07/31 19:53:05 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
This warning message is an artifact of the AWS configuration of Hadoop and can be safely ignored. It will appear whenever you run any Hadoop or Pig commands on the master node.
Next, modify the scripts to load the data file excite.log.bz2 from S3. To do this, change:
raw = LOAD 'excite.log.bz2' USING PigStorage...
to
%declare S3BIN 's3n://uw-csep544-pl';
raw = LOAD '$S3BIN/excite.log.bz2' USING PigStorage...
Note: To manage load this quarter, there are two S3 bins. The variable S3BIN should be set to 's3n://uw-csep544-pl' if your last name starts with the letter A-K; if your last name starts with L-Z, it should be set to 's3n://uw-cse444-proj4/'.
Finally, in script1-hadoop.pig, the line:
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING ...
becomes
STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING ...
Make the analogous change in script2-hadoop.pig. In all your Pig scripts that run on the cluster, you will have to use absolute S3 or HDFS file paths, just like with the tutorial scripts above.
To run the scripts on the cluster, instead of typing:
% java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
you can and must type:
% pig -l . script1-hadoop.pig
Once you have successfully run a script, the output will be stored in a
directory (not a file) in the AWS cluster's HDFS file system, located
at the absolute path you specify in the Pig Latin STORE command.
That is, for script1-hadoop.pig, the output will be stored at /user/hadoop/script1-hadoop-results.
HDFS is separate from the master node's file system, so
before you can copy this to your local machine, you must copy the directory
from HDFS to the master node's Linux file system:
% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results
This will create a directory script1-hadoop-results with part-* files in it, which you can copy to your local machine with scp.
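As a rough sketch of that last step, run from your local machine (the key file, user name, and master node hostname below are placeholders; use the values for your own cluster from the "Copying scripts/data to/from the master node" section):
$ scp -r -i ~/your-ec2-keypair.pem hadoop@<master-node-DNS>:~/script1-hadoop-results .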
You can run hadoop dfs -help or see the hadoop dfs guide to learn how to manipulate HDFS. (Note that hadoop fs is the same as hadoop dfs.)
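For example, a few HDFS operations you are likely to need look like this (the paths and file names are illustrative):
% hadoop dfs -ls /user/hadoop                           # list a directory
% hadoop dfs -cat /user/hadoop/some-results/part-00000  # print a file to the terminal
% hadoop dfs -rmr /user/hadoop/old-results              # recursively remove a directory
% hadoop dfs -copyFromLocal myfile.txt /user/hadoop/    # upload a local file into HDFS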
The above Pig tutorial is an example of how we are going to run our finished scripts on the cluster. One way to do this project is to complete all the problems locally (or the parts that can be done locally), then start a cluster and finish all the Hadoop/Pig parts that must be run on the cluster. For example, you can complete the whole Hadoop part using a cluster of 15 nodes in 3 hours (this includes problems 1-3).
Also note that AWS bills for nodes by the hour and rounds up, so if you start up a cluster, we suggest killing it about 10 minutes before the end of an hour (to allow adequate time for shutdown). Moreover, if you start a cluster and kill it within a few minutes of starting it, you will still be billed for a complete hour.
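As a rough worked example of that rounding: a 15-node cluster that runs for 3 hours and 10 minutes is billed for 15 × 4 = 60 node-hours, the same as if it had run for the full 4 hours.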