Setting up an AWS account

As a student, you should be able to create an Amazon Web Services (AWS) account with credits that allow you to use it free of charge for your assignment in this class (though see the warnings below about shutting down your clusters when you're not using them). Follow these steps to set up your account:

If you create a new account using your @uw.edu email address, you should be able to apply for credits that allow you to complete this assignment without charge. (Even with credits, be sure to properly shut down your cluster when you are not using it. If you do not, it is likely that you will end up having to pay Amazon.)

  1. Go to http://aws.amazon.com/ and sign up. Note that Amazon will ask you for your credit card information during the setup process. This is expected.
  2. To get free access, apply for credits from Amazon. Click the options for "Students". Then fill in the form (in Step 2) with your AWS Account ID (from the previous step) and your @uw.edu email address. (It will not work if you use another email address.)
  3. After applying, you will have to wait to be approved. You should get an email when your application has been approved, containing a credit code. Once you have it, go to this page to apply the credits to your account.

Warning: If you exceed the credit amount, Amazon will charge your credit card without warning. You should not exceed this amount provided that you remember to terminate your AWS clusters when you are not using them.

While working on your assignment, you should monitor your billing usage by going to the billing page, clicking on "Bill Details" (upper left), and then clicking on "Expand All".

Creating an SSH key pair

We will use SSH to connect to the machines running in Amazon's cloud. In order to do so, we first need to create an SSH key pair. To do so, follow these steps:

  1. Go to the AWS security credentials page and make sure that you see a key under Access Keys. If not, click "Create a new Access Key" to create one.
  2. Go to the EC2 Management Console. On the top panel, towards the right, choose the region to be N. Virginia. We will create our cluster in N. Virginia since the input data is present in that region. (If you want to use another, potentially less busy, region, follow the instructions on this page to copy the data to the region you want to use.) Next, click "Key Pairs" on the navigation panel. Then, click the "Create Key Pair" button. Enter a key pair name (anything is fine) and click "Yes". (Don't do this in Internet Explorer, though, or you might not be able to download the private key file.)
  3. Download and save the private key file — its name will end with .pem — to disk. We will refer to this file below as </path/to/saved/keypair/file.pem>.
  4. Make sure that only you can access the .pem file by changing its permissions with this command in the terminal:
    $ chmod 400 </path/to/saved/keypair/file.pem>
    Windows users: This step will NOT work on Windows 7 with cygwin. Windows 7 does not allow file permissions to be changed through this mechanism, and they must be changed for ssh to work. Instead of cygwin, you can use PuTTY as your ssh client. You will also have to convert this key file into PuTTY's format: see Amazon's instructions (look for "Converting Your Private Key Using PuTTYgen" on that page) for how to convert your .pem file into a .ppk file that you can use with PuTTY.

Starting Up Your Spark Cluster

Next, we will set up a Spark cluster using the Elastic MapReduce (EMR) Management Console. To do so, follow these steps:

  1. Go to the EMR home page and sign in. Make sure that the N. Virginia region is selected on the panel at the top right.
  2. Click the "Create Cluster" button.
  3. Click "Go to advanced options". You will then have four steps of options to complete:
    Step 1: Software and Steps

    In the "Software Configuration" section, set "Vendor" to be "Amazon", and set "Release" as "emr-4.2.0". Check "Zeppelin-Sandbox 0.5.5" and "Spark 1.5.2". Then, click "Next" at the bottom of the page.

    In the software settings section, copy and paste the following configuration that allows Spark to use all of the memory of the cluster:

    [{
      "Classification": "spark",
      "Properties": {
        "maximizeResourceAllocation": "true"
      }
    }]

    Once you've done that, the page should look like this:

    Click "Next" at the bottom of the page to continue.

    Step 2: Hardware

    In the "Hardware Configuration" section, don't change the default "Network" and "EC2 Subnet", and don't create a VPC. Change the count of core instances to be 5 for the homework. (For experimenting, you can set the count to 1. If you find queries are too slow, you can resize the cluster and increase the count of core instances.) Keep the count of task instances as 0. You can change the instance type of the instances if you want, but the larger the instance, the more expensive it is. To start out, keep this set to m3.xlarge. (For more information, see the Amazon documentation on instance types and pricing. For this assignment, you do not need to check the "Request spot" option, but if you want to experiment with bidding for a machine rather than getting the set price, see the Amazon documentation on spot pricing.)

    Once you've done those things, the page should look like this:

    Step 3: General Cluster Settings

    In the "General Options" section, find the "Cluster name" field and type a name such as "Spark Cluster" or "Spark HW8". Uncheck "Logging" and "Termination protection", but keep the rest unchanged.

    The page should now look like this:

    Click "Next" at the bottom of the page to continue.

    Step 4: Security

    In the "Security Options" section, select the SSH key pair you created above. IMPORTANT: make sure to select a key pair, otherwise you won't be able to ssh to the cluster. Leave the rest of the settings unchanged.

    The page should now look like this:

    Finally, click "Create cluster" at the bottom of the page.

  4. Go back to the cluster list, and you should see the cluster you just created. It may take a while for the cluster to launch (up to 45 min).

    If your cluster takes an extraordinarily long time or fails (it may fail, for example, by telling you that "instance type m3.xlarge is not supported in the requested availability zone"), Amazon may be near capacity. Try again later. If it still doesn't work, ask for help.

  5. On the cluster details page for your newly created cluster, make note of the Master Public DNS, listed on the top of the page:

    We will refer to this DNS address below as <master.public-dns-name.amazonaws.com>.

Do not forget to shut down your cluster when you are not using it. (See below for instructions.) You can easily run up a large bill with Amazon by leaving it running.

Prepare to Use Amazon's Public Dataset

For the homework, we are going to connect to one of Amazon's public data sets. The following steps will work with any dataset available as an Amazon EBS snapshot, but we are only going to use the Freebase Quad Dump for the homework. (For experimenting, you may want to try the Freebase Simple Topic Dump because it is smaller.)

To prepare the data, perform the following steps:

  1. Find the snapshot ID of the dataset you want to use. This is listed on the public dataset's page. We will refer to it as <snapshotID>. For the Freebase Quad Dump, the ID is snap-b2ca9bdc.
  2. Go to the Amazon EC2 Console, and click on "Instances" under "Instances". You should see the cluster instances you just created in the prior step.
  3. Find and take note of the instance ID and availability zone of the master node. (The ID will be something like i-b0ead669, and the availability zone will be something like us-east-1b.) You will need both in the next steps.

    To find the master node, you can either find the instance that has a Public DNS matching your <master.public-dns-name.amazonaws.com> from earlier OR you can look for the instance whose security groups include "ElasticMapReduce-master", as shown here:

  4. Click on "Volume" under "Elastic Block Storage" on the left.
  5. Click on "Create Volume".
  6. Keep the volume type unchanged, and make the size large enough to fit the data (100 GiB should be fine). Set the availability zone to be the same as that of the master node you found above. Under snapshot ID, enter <snapshotID> and select the dataset from the drop-down that appears. Then, click "Create Volume".
  7. Once the volume has been created, check the box next to it, and under "Actions", select "Attach Volume". In the instance field, select the instance ID of the master node. Keep the device field as it is. The form should now look like this:
    Note: a warning may come up about newer Linux kernels renaming the device. That is nothing to worry about. Just click "Attach".
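
If you prefer working from the command line, the same volume can also be created and attached with the AWS CLI. The sketch below is illustrative only: it assumes you have the AWS CLI installed and configured with your credentials, reuses the example zone and instance ID from step 3 (substitute your own), and assumes the console's usual default device name /dev/sdf.

    # Sketch only (optional alternative to the console steps above).
    # Assumes the AWS CLI is installed and configured; the zone below is the
    # example from step 3 -- use the zone of your own master node.
    aws ec2 create-volume --snapshot-id snap-b2ca9bdc --size 100 \
        --availability-zone us-east-1b

    # Attach the new volume to the master node, using the VolumeId printed by
    # the previous command (shown here as vol-xxxxxxxx) and your own master
    # instance ID. The device name /dev/sdf is an assumption matching the
    # console's usual default.
    aws ec2 attach-volume --volume-id vol-xxxxxxxx \
        --instance-id i-b0ead669 --device /dev/sdf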

Allowing SSH Connections

In order to access your machine, you will need to allow SSH connections to your instance by opening up port 22 on the master node. (For more information on this, you can read Amazon's documentation on authorizing instance access.)

To allow SSH connections, perform the following steps:

  1. Go to the Amazon EC2 Console, and click on "Security Groups" under "Network & Security".
  2. Click on the security group with the name "ElasticMapReduce-master". You should see a panel appear on the bottom.
  3. In the panel, select the "Inbound" tab, and click the Edit button.
  4. A box labeled "Edit inbound rules" should appear. Click on the Add Rule button on the bottom of the box.
  5. Change "Custom TCP Rule" to "SSH". The Protocol should become TCP and the Port Range should become 22. Leave those unchanged. Change "Custom IP" to "Anywhere". When you are done, the form should look like this:
    (For a production system, allowing connections from anywhere is unsafe, but for this homework, it should be fine.)
  6. Click the Save button.
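
If you prefer the command line, the same inbound rule can be added with the AWS CLI. This is only a sketch: it assumes the AWS CLI is configured and that your cluster is in the default VPC (the default "Network" setting from earlier); otherwise you would need to pass the security group's ID rather than its name.

    # Sketch only (optional alternative to the console steps above).
    # Opens port 22 (SSH) to connections from anywhere, as in the console steps.
    # Assumes the cluster uses the default VPC, so the group name can be used.
    aws ec2 authorize-security-group-ingress \
        --group-name ElasticMapReduce-master \
        --protocol tcp --port 22 --cidr 0.0.0.0/0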

Running Your First Query

You are now ready to connect to the Spark cluster and run some queries using the data you attached.

To do so, perform the following steps:

  1. Go back to the EMR Portal and select your newly created cluster.
  2. We will use SSH to forward port 8157 on your local machine to a web server port on the master node.

    Mac users: Run the following command in the Terminal:

    ssh -i </path/to/saved/keypair/file.pem> -N -L 8157:<master.public-dns-name.amazonaws.com>:8890 hadoop@<master.public-dns-name.amazonaws.com>

    This command will look as if it is hanging, and that's fine. Just leave the terminal running; it will forward connections you make from your browser to local port 8157 over to port 8890 on the master node.

    Windows users: You can configure PuTTY to do port forwarding as well. When you start your session, click on "Tunnels" under "SSH" in the Category pane on the left. You should see a form like this:

    Fill in the "Source port" with 8157 and the Destination with "127.0.0.1:8890". When you click Open, it should start an SSH session, which you can just leave running in the background.

  3. Back on the cluster details page, next to the Master Public DNS name, click on "SSH" and follow those instructions in a new terminal window. (Keep the other terminal running.) That will connect you to the master node, so that you can run commands at the command-line. When you connect, you should see something like this:
  4. Once connected, run the command
    cd /
    This will take you to the root directory, under which we will mount the device holding the data. (For more information, see Amazon's documentation on using volumes.)
  5. To make sure things are correct, run the command
    lsblk
    This will show you the disk devices available. You should see xvdf and xvdf1, with xvdf1 having the same size as the public data we are using.
  6. Next, run the command
    sudo mkdir /data
    This will make a folder, called /data, where we will mount the public dataset. (So far, the data is only available through a device, but is not accessible anywhere in the file system.)
  7. Then, run the command
    sudo mount /dev/xvdf1 /data/
    Navigate to /data/. You should find the public dataset file there, ready for use!
  8. Now, run the command
    sudo chmod 664 /data/freebase-datadump-quadruples.tsv
    This will make the file readable for the next step. (This command assumes you are using the Freebase Quad Dump. If you are using a different dataset, you will need to change the file name accordingly.)
  9. Next, run the command
    hadoop fs -mkdir /data/
    This will make a folder on HDFS for Spark to use. It may take a few seconds to complete.
  10. Finally, put the public dataset file into HDFS by running the command
    hadoop fs -put /data/freebase-datadump-quadruples.tsv /data/spark_data.tsv
    (As above, this is assuming you are using the Freebase Quad Dump. Change the first file name if you are using another dataset.)

    This command may take a while (up to 90 minutes) to transfer the data. You can monitor the progress by using SSH to connect to the master node in another new terminal and running this command:

    hadoop fs -ls -h /data
    This will show you the current size of the file. The final size will be around 30 GB for the Freebase Quad Dump. (If you want to log out and come back in an hour, you can instead run the first command above inside screen, as sketched below. That will let you log out but keep the command running until you get back. See the screen documentation for full details.)
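
    If you do use screen, a minimal session might look like the following sketch (the session name hdfs-copy is just an example):

    # Start a named screen session (the name "hdfs-copy" is just an example).
    screen -S hdfs-copy
    # Inside the session, start the long-running copy from step 10:
    hadoop fs -put /data/freebase-datadump-quadruples.tsv /data/spark_data.tsv
    # Detach with Ctrl-a followed by d; it is now safe to log out.
    # When you come back, reconnect to the master node and reattach with:
    screen -r hdfs-copy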

  11. Open a new window in your browser, and navigate to http://localhost:8157/. (This connection will be forwarded by SSH to the master node.) This should open up the Zeppelin web UI.
  12. Click on "Create new note", and give your new notebook a name like "HW8 Spark". Then, click on that notebook. This will open up an interpreter to let you run Scala code to load the data into Spark and run SQL queries.
  13. Copy and paste the following Scala code:
    import org.apache.commons.io.IOUtils
    import java.net.URL
    import java.nio.charset.Charset
    
    //open file in hdfs
    val FBText = sc.textFile("/data/spark_data.tsv")
    //create the schema of the table for Freebase data
    case class RDFRow(subject: String, predicate: String, obj: String, context: String)
    //loop through each row of data, split it on a tab, and make a RDFRow object
    //toDF makes it a DataFrame, which is equivalent to a relational table
    val fbRow = FBText.map(s => s.split("\t")).map(s => 
      RDFRow(
        if (s.length >= 1) s(0) else "",
        if (s.length >= 2) s(1) else "",
        if (s.length >= 3) s(2) else "",
        if (s.length >= 4) s(3) else "")).toDF()
    //makes a table of the data called fbFacts
    fbRow.registerTempTable("fbFacts")
  14. Click "Run" on the top right of the white paragraph box. This is going to run the code on your cluster without you having to do it yourself. Pretty nice!
  15. Finally, in a new paragraph, enter the following (without a semi-colon):
    %sql
    SELECT *
    FROM fbFacts
    LIMIT 1

    This sets the query language to SQL and, as you can see, lets you run SQL on your fbFacts table.

    This sample query just returns a single row of data. But you can now run more complex queries as well, which you will do as you work on HW8.
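
    For example, a slightly more involved query (shown here only as an illustration, not as one of the required homework queries) counts the facts for each predicate and shows the ten most common ones:

    %sql
    SELECT predicate, COUNT(*) AS num_facts
    FROM fbFacts
    GROUP BY predicate
    ORDER BY num_facts DESC
    LIMIT 10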

Shutting Down Your Spark Cluster

Important: You should always shut down your cluster when you are done. Just closing the browser will not work. Amazon charges you per instance hour, which means you could spend a lot of money if you forget to shut down your cluster!

Note: The cluster will not shut itself down. If you do not see the cluster in the console, you are most likely looking at the wrong region. (AWS defaults to the Oregon region rather than N. Virginia, which we are using.) You can switch to the correct region using the drop-down in the top-right corner of the page.

To shut down your cluster, perform the following steps:

  1. Go back to the EMR portal.
  2. You should see a list of your clusters. Click on the name of your cluster.
  3. At the top, now click "Terminate". (You may have to turn off termination protection first.) It may take a few minutes for everything to shut down.
  4. If you are totally done with the homework, delete your Volumes. As Volumes are cheap, it's okay to keep them while you are working on your homework. To delete them, go back to the EC2 dashboard, click on "Volumes" under "Elastic Block Store" on the left-hand side, and delete your Volumes.