As a student, you should be able to create an Amazon Web Services (AWS) account with credits that allow you to use it free of charge for your assignment in this class (though see the warnings below about shutting down your clusters when you're not using them). Follow these steps to set up your account:
If you create a new account using your @uw.edu email address, you should be able to apply for credits that allow you to complete this assignment without charge. (Even with credits, be sure to properly shut down your cluster when you are not using it. If you do not, it is likely that you will end up having to pay Amazon.)

Warning: If you exceed the credit amount, Amazon will charge your credit card without warning. You should not exceed this amount provided that you remember to terminate your AWS clusters when you are not using them.
While working on your assignment, you should monitor your billing usage by going to the billing page, clicking on "Bill Details" (upper left), and then clicking on "Expand All".
We will use SSH to connect to the machines running in Amazon's cloud. To do so, we first need to create an SSH key pair. Follow these steps:
$ chmod 400 </path/to/saved/keypair/file.pem>

Windows users: This step will NOT work on Windows 7 with Cygwin. Windows 7 does not allow file permissions to be changed through this mechanism, and they must be changed for ssh to work. Instead of Cygwin, you can use PuTTY as your ssh client. Furthermore, you will have to convert this key file into PuTTY format. For more information, go to Amazon's instructions (look for "Converting Your Private Key Using PuTTYgen" on that page) and follow the directions for converting your .pem file into a .ppk file that you can use with PuTTY.
Next, we will set up a Spark cluster using the Elastic MapReduce (EMR) Management Console. To do so, follow these steps:
In the "Software Configuration" section, set "Vendor" to be "Amazon", and set "Release" as "emr-4.2.0". Check "Zeppelin-Sandbox 0.5.5" and "Spark 1.5.2". Then, click "Next" at the bottom of the page.
In the software settings section, copy and paste the following configuration that allows Spark to use all of the memory of the cluster:
[{ "Classification": "spark", "Properties": { "maximizeResourceAllocation": "true" } }]
Once you've done that, the page should look like this:
Click "Next" at the bottom of the page to continue.
In the "Hardware Configuration" section, don't change the default "Network" and "EC2 Subnet", and don't create a VPC. Change the count of core instances to be 5 for the homework. (For experimenting, you can set the count to 1. If you find queries are too slow, you can resize the cluster and increase the count of core instances.) Keep the count of task instances as 0. You can change the instance type of the instances if you want, but the larger the instance, the more expensive it is. To start out, keep this set to m3.xlarge. (For more information, see the Amazon documentation on instance types and pricing. For this assignment, you do not need to check the "Request spot" option, but if you want to experiment with bidding for a machine rather than getting the set price, see the Amazon documentation on spot pricing.)
Once you've done those things, the page should look like this:
In the "General Options" section, find the "Cluster name" field and type a name such as "Spark Cluster" or "Spark HW6". Uncheck "Logging" and "Termination protection", but keep the rest unchanged.
The page should now look like this:
Click "Next" at the bottom of the page to continue.
In the "Security Options" section, select the SSH key pair you created above. IMPORTANT: make sure to select a key pair, otherwise you won't be able to ssh to the cluster. Leave the rest of the settings unchanged.
The page should now look like this:
Finally, click "Create cluster" at the bottom of the page.
If your cluster takes an extraordinarily long time to start or fails (it may fail, for example, by telling you that "instance type m3.xlarge is not supported in the requested availability zone"), Amazon may be near capacity. Try again later. If it still doesn't work, ask for help.
Once the cluster is running, the console will show the public DNS address of the master node. We will refer to this DNS address below as <master.public-dns-name.amazonaws.com>.
Do not forget to shut down your cluster when you are not using it. (See below for instructions.) You can easily run up a large bill with Amazon by leaving it running.
For the homework, we are going to connect to one of Amazon's public data sets. The following steps will work with any dataset stored on Amazon EBS, but we are only going to use the Freebase Quad Dump for the homework. (For experimenting, you may want to try using the Freebase Simple Topic Dump because it is smaller.)
To prepare the data, perform the following steps:
To find the master node, you can either find the instance that has a Public DNS matching your <master.public-dns-name.amazonaws.com> from earlier OR you can look for the instance whose security groups include "ElasticMapReduce-master" as shown here:
In order to access your machine, you will need to allow SSH connections to your instance by opening up port 22 on the master node. (For more information on this, you can read Amazon's documentation on authorizing instance access.)
To allow SSH connections, perform the following steps:
You are now ready to connect to the Spark cluster and run some queries using the data you attached.
To do so, perform the following steps:
Mac users: Run the following command in the Terminal:
ssh -i </path/to/saved/keypair/file.pem> -N -L 8157:<master.public-dns-name.amazonaws.com>:8890 hadoop@<master.public-dns-name.amazonaws.com>
This command will look as if it is hanging, and that's fine. Just leave the terminal running; it will forward connections you make in your browser to local port 8157 over to port 8890 on the master node.
Windows users: You can configure PuTTY to do port forwarding as well. When you start your session, click on "Tunnels" under "SSH" on the Category pane on the left. You should see a form like this:
Fill in the "Source port" with 8157 and the Destination with "127.0.0.1:8890". When you click Open, it should start an SSH session, which you can just leave running in the background.
cd /

This will take you to the root directory, where we will mount the device that holds the data. (For more information, see Amazon's documentation on using volumes.)
lsblk

This will show you the disk devices available. You should see xvdf and xvdf1, with xvdf1 having the same size as the public dataset we are using.
sudo mkdir /data

This will make a folder, called /data, where we will mount the public dataset. (So far, the data is only available through a device but is not accessible anywhere in the file system.)
sudo mount /dev/xvdf1 /data/

This mounts the device at /data/. Navigate to /data/; you should find a file of the public dataset, ready for use!
sudo chmod 664 /data/freebase-datadump-quadruples.tsv

This will make the file readable for the next step. (This command assumes you are using the Freebase Quad Dump. If you are using a different dataset, you will need to change the file name.)
hadoop fs -mkdir /data/

This will make a folder on HDFS for Spark to use. It may take a few seconds to complete.
hadoop fs -put /data/freebase-datadump-quadruples.tsv /data/spark_data.tsv

(As above, this assumes you are using the Freebase Quad Dump. Change the first file name if you are using another dataset.)
This command may take a while (up to 90 minutes) to transfer the data. You can monitor the progress by using SSH to connect to the master in another terminal and running this command:
hadoop fs -ls -h /data

This will show you the current size of the file. The final size will be around 30 GB for the Freebase Quad Dump. (If you want to log out and come back in an hour, you can instead run the first command above using screen. That will let you log out but keep the command running until you get back. See the screen documentation for full details.)
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// open the file in HDFS
val FBText = sc.textFile("/data/spark_data.tsv")

// create the schema of the table for the Freebase data
case class RDFRow(subject: String, predicate: String, obj: String, context: String)

// loop through each row of data, split it on a tab, and make an RDFRow object;
// toDF makes it a DataFrame, which is equivalent to a relational table
val fbRow = FBText.map(s => s.split("\t")).map(s => RDFRow(
  if (s.length >= 1) s(0) else "",
  if (s.length >= 2) s(1) else "",
  if (s.length >= 3) s(2) else "",
  if (s.length >= 4) s(3) else "")).toDF()

// make a table of the data called fbFacts
fbRow.registerTempTable("fbFacts")
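Before querying, it can be worth a quick sanity check that the load worked. This is optional and just a sketch using the fbRow DataFrame defined above; note that count() scans the entire file, so it can take several minutes on the full dump:

// print the schema built from the RDFRow case class: subject, predicate, obj, context
fbRow.printSchema()
// count the rows that were loaded; this scans the whole ~30 GB file, so expect it to be slow
fbRow.count()

Once that looks reasonable, you can run the SQL query below.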
%sql SELECT * FROM fbFacts LIMIT 1
This sets the query language to SQL and, as you can see, lets you run SQL on your fbFacts table.
This sample query just returns a single row of data. But you can now run more complex queries as well, which you will do as you work on HW6.
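As one illustration of what a larger query can look like (this is only a sketch, not something the assignment asks for), the following Scala paragraph uses sqlContext, which Zeppelin provides alongside sc, to find the ten most common predicates in the data. You could equivalently paste the same SELECT statement into a %sql paragraph.

// group the facts by predicate and show the ten predicates with the most rows
sqlContext.sql("""
  SELECT predicate, COUNT(*) AS num_facts
  FROM fbFacts
  GROUP BY predicate
  ORDER BY num_facts DESC
  LIMIT 10
""").show()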
Important: You should always shut down your cluster when you are done. Just closing the browser will not work. Amazon charges you per instance hour, which means you could spend a lot of money if you forget to shut down your cluster!
Note: The cluster will not shut itself down. If you do not see the cluster in the console, you are most likely looking at the wrong data center. (AWS defaults to the Oregon region rather than N. Virginia, which we are using.) You can switch to the correct region using the drop-down in the top-right corner of the page.
To shut down your cluster, perform the following steps: