CSE logo University of Washington Computer Science & Engineering
 CSE333 12sp -- Systems Programming
  CSE Home   About Us    Search    Contact Info 

Course Home
 Home
Administration
 Overview
 Course email
 Anonymous feedback
Assignments
 Home Virtual Machines
 Homework Dropbox
 Class GoPost Forum
 Class Gradebook
Schedule
 Schedule
   

Homework #2

out: Monday April 16th, 2012
due: Thursday May 3rd, 2012 by 9:00pm.

[ summary | part a | part b | part c | how to submit | grading ]

Summary.

Homework #2 has several parts to it. In part A, you will read through our (somewhat complex) code for JSON parsing and complete it. In part B, you will complete our code for interacting with Twitter through their search API. In part C, you will write a simple search program that retrieves search results and creates a JavaScript-based word cloud from them for display in a browser.

Here is the code you should download that you'll modify for this assignment. As a very first step, once you've untar'ed the code, take a look in the libhw1/ directory. We've provided you with our hw1 library (libhw1.a) in case yours has bugs in it. But, feel free to replace this file with your own libhw1.a if you want to build on top of your solution rather than ours.

Part A -- JSON parsing

Background on JSON.

When programs need to communicate with each other over networks, they have to agree on a format for any data that they exchange with each other. JavaScript Object Notation (JSON) is a very simple data exchange format used by most public Web APIs, including Twitter. JSON is able to represent rich, nested structures that comprise strings, numbers, booleans, arrays, and maps.

Here is a simple example of some data encoded in JSON:

{
  "query": "lance armstrong",
  "results": [
    {
      "from_user": "Steve",
      "from_user_id": 14423,
      "text": "lance armstrong will be racing the Kona Ironman in 2012!"
    },
    {
      "from_user": "Lance",
      "from_user_id": 94342,
      "text": "steve gribble will be teaching cse333 in 2012!"
    }
  ]
}
Alternatively, here's the same data, but formatted in a way that is less verbose but also much less easy to read:
{"query":"lance armstrong","results":[{"from_user":"Steve","from_user_id":14423,
"text":"lance armstrong will be racing the Kona Ironman in 2012!"},{"from_user":"Lance",
"from_user_id":94342,"text":"steve gribble will be teaching cse333 in 2012!"}]}
This data is a subset of the kind of data that Twitter provides when you programmatically search it. At the top-level, the data contains a map with two key/value pairs. The first key is the string "query", and its value is the string "lance armstrong". The second key is the string "results", and its value is a two-element-long array. Each element of the array is another map.

Diagrammatically, this JSON structure looks like this:


As you can see, a piece of JSON is described by a tree, where each tree element is one of several base JSON values. These base JSON values are:

  • json object: essentially a key/value map, mapping strings to values
  • json array: an array of values
  • json string: a quoted string
  • json number: a floating-point number
  • json boolean: either the literal true or the literal false
  • json null: the literal null

Our JSON parsing implementation

We have implemented most of a JSON parser for you. The goal of the JSON parser is three-fold. First, it takes a piece of text containing JSON-formatted data and parses though it, producing a JSON data structure similar to the tree-like diagram shown above. Second, it provides you with some functions that allow you to manipulate this JSON data structure. Third, it provides a function that generates JSON text as output, given a JSON data structure as input.

Let's walk through the code. Take a look at JSONParser.h; this header file declares the public interface to the parser. The interface consists of a set of C++ class definitions. The first, JSONValue, is a purely abstract class that is never actually instantiated; it is a base class that is subclasses for each type of JSON value listed above (e.g., a json object, a json string, and so on.).

This base class does declare a class method, called JSONParse; the method accepts a std::string as input and returns a pointer to a JSONValue class (i.e., to one of the subclasses, cast to the abstract base class) as a return value. You use this method to parse a piece of JSON.

Next, you can see a virtual function called "dump"; each of the JSON subclasses implement this function. It's role is to produce a text representation of itself, and write that text representation to the provided stream. By traversing through a JSON data structure recursively, you can use this method to produce the text representation of an entire JSON data structure.

Below this, you will see a set of subclasses of JSONValue, one for each JSON data type. For the more complex subclasses (Object and Array), the subclass defines some methods that allow you to access and manipulate the value, such as by appending a value to the end of a JSON array. All of the classes, including the simple ones, provide a Dump method for converting to a text representation. As well, all of the classes expose their state through an appropriately named instance variable, such as tree_ for JSONObject or array_ for JSONArray.

Next, take a look at JSONParser.cc. This contains the partial implementation of our Parser. We have implemented most of the finicky, ugly parts of JSON parsing, but we want you to complete two pieces of our implementation. First, you should complete the implementation of ParseArray(). (Step 1). This is the function that takes a string, consumes the array from the front of the string, and produces/returns a JSONArray structure that represents it. We suggest you base your implementation off of a very similar implementation we provided you for ParseObject().

Second, you should complete the implementation of JSONArray::Dump(), which is the method that produces a string representation of a JSONArray object and writes it out to the provided stream. (Step 2). Again, we suggest that you base your implementation off of the similar implementation of JSONObject::Dump().

Third, you should complete the implementation of JSONArray::operator[](), which is the method that overrides C++'s array access operators, and allows customers to ask for particular items within the array. You will need to do a few things:

  • verify that the argument (index) is within the bounds of the array, and if not, throw a runtime_error exception with a helpful error message.

  • create a list iterator, and use it to iterate to the requested element of the array.

  • use LLIteratorGetPayload to get the payload (which is a (JSONValue *) pointer, and return that to the caller.

  • make sure you don't leak memory, and in case of errors, throw a helpful exception.

One of the things you'll notice is that our code makes use of some routines in Utils.cc/.h, particularly routines called JSONEscapeString and JSONUnescapeString. JSONEscapeString takes a piece of JSON text and adds the "\" escape character in the right places to make it safe for embedding in HTML. For example, it will take any newline characters and replace them with \n, and it will take any quote characters and replace them with \". JSONUnescapeString does the opposite. When you parse JSON text, it will be escaped, so you need to unescape it while doing the parsing. Similarly, you will need to escape any text that you generate with Dump(). You don't need to implement these escape and unescape routines; we've provided them for you in Utils.cc/.h.

What to do.

You should follow these steps to do Part A of this assignment:

  1. Finish the implementation of JSONParser.cc. First, read through the code. Next, find each comment that says "STEP X", and replace that comment with working code. Once you're ready to test your code, try running the unit test driver (test_suite) to see if you make it past the Test_JSONParser tests successfully. As a reminder, you can try our solution binaries to compare the output of your test suite with ours.

  2. As a reminder we'll also be testing whether your program has any memory leaks. We'll be using Valgrind to do this. To try out Valgrind for yourself, do this:

    • cd into the solution_binaries subdirectory, and run the following command:
      valgrind --leak-check=full ./test_suite
      and note that Valgrind indicates that no memory leaks were found in our sample solution.

Part B -- interacting with Twitter

I'm guessing you have used Twitter: it's a web site where you can publish short (140 characters or less) status updates, subscribe to others' status updates, and search for updates based on their content, sender, and receivers. In addition to providing a user interface through the Web, Twitter has also provided the world with an application programming interface, or API. We will be taking advantage of the search API that allows our programs to communicate with Twitter over the Internet, posing queries and receiving back JSON-formatted search responses.

The Twitter search API is REST-based; this is just a fancy way of saying that in order to request data, you just issue a Web request to a carefully formatted URL. So, for example, try visiting the following URL with your browser; it has been constructed to ask Twitter for any recent tweets mentioning the phrase "lance armstrong":

http://search.twitter.com/search.json?q=%22lance%20armstrong%22

Instead of a human-viewable web page, you'll see JSON data that describes a set of search results. The JSON data isn't particularly friendly to read; try using the clipboard to copy all of that text and paste it into this JSON pretty-printing site: it should make it much more readable.

Looking through the pretty-printed results, you can see that the results roughly have the structure shown in the diagram at the very top of this web page, with the exception of having more fields than we showed. The results are a single JSONObject (i.e., a single map); within the map, there is a key called "results" that contains an array of result JSONObjects. Each one of the result JSONObjects contains a set of interesting name/value pairs, including the name of the user that sent the tweet, the text of the tweet, and so on.

In this part of the assignment, you will write some code that opens up a network connection to Twitter, issues a search request, and receives and parses the reply JSON data, using the JSONParse code that you finished in part A. We've split part B up into three major steps:

  1. You will write some code to open up a socket to a hostname and port number provided to you by the customer. (For twitter, the hostname that you connect to is "search.twitter.com" and the port number is 80.) As part of this, you will write a routine that translates a DNS name into an IP address.

  2. You will write some code that, given a socket, writes a valid HTTP request on the socket, receives back the HTTP response, and separates the HTTP response into HTTP headers (which are thrown away) and the JSON response body (which is kept for the next step).

  3. You will write some code that, given a twitter search string, uses steps 1 and 2 to issue a search to twitter, receives the JSON response, and parses it into a valid JSONValue structure.

What do do.

Let's go over each step in more detail.

Step 1: opening a socket

Look in your code directory, and you'll notice files called SocketClient.h and SocketClient.cc. Take a look at SocketClient.h; this is a very simple header file that declares two functions, one called DNSLookup and one called ConnectToServer. Next, take a look at SocketClient.cc: notice that the implementation of these functions is missing.

Your job in Step 1 is to finish the implementation. Note that the implementation is quite similar to some of the code that we have covered in the sections and in the lecture! We have also provided a simple unit test that exercises these functions, so when you think you've finished, try running test_suite and see if you make it past the Test_SocketClient unit test; if so, you're successfully opening up a socket to a remote server!

Step 2: issuing an HTTP transaction

Now that you're able to open up a socket to a remote server, you need to implement the functionality to issue a web request against a web server and receive the reply. The protocol that you must use to do this is called HTTP (the hypertext transfer protocol). HTTP is quite complex, but fortunately, we'll be using a very simple subset of it in this assignment.

Take a look at HTTPClient.h; in it, we've declared a single function called HTTPGet. A customer invokes this function to connect to a web server, issue a request for a URL to that server, and then read the response data. Now, take a look at HTTPClient.cc; notice that we have not given you the implementation. Your job is this step of Part B is to implement HTTPGet.

Here are the rough steps you will need to follow:

  1. Given the name of a document on a server, such as:
    /foo/bar/baz
    on the server "www.uw.edu", the request you generate will look like this (where the \r\n at the end of the lines are carriage-return and newline characters, not the literal sequence '\r' followed '\n'):
    GET /foo/bar/baz HTTP/1.1\r\n
    Host: www.uw.edu\r\n
    Connection: close\r\n
    \r\n
    Note that the first line is where you request a particular page. In that line you see us asking for the "/foo/bar/baz" page. Also, note that the second line mentions the name of the server. This is necessary because a single web server can host pages from several different domain names, and it needs to know which one you are asking for.

  2. Read back the response from the socket, and split it into the HTTP header block and the response data. The HTTP header block appears first, and provides the client (your program) with some metadata about the HTTP transaction. You just want to skip this; the HTTP header block ends with the four-byte character sequence "\r\n\r\n". After this, all of the remaining data on the socket is response data, i.e., the document you asked for So, we will use fdopen() to convert the file descriptor for the socket into a FILE* object, read the entire response into memory, and then go through the response looking for the first instance of "\r\n\r\n" that indicates the split between the header and the response data.

    For example, the HTTP response could look something like this:

    HTTP/1.1 200 OK\r\n
    Content-Type: application/json;charset=utf-8\r\n
    Date: Wed, 11 Apr 2012 20:52:08 GMT\r\n
    Content-Length: 77393\r\n
    \r\n
    {"completed_in":0.114, ...etc. }
    Where the bold text is the response data, and the non-bold text is the HTTP header.

When you think you've finished, try running test_suite and see if you make it past the Test_HttpClient unit test.

Step 3. Issue a Twitter Search and Parse the JSON response data

In Step 3, you'll complete the implementation of TwitterAPI.cc by using HTTPClient.h to issue an HTTP request against the Twitter search API, receive back the JSON response data, parse the JSON data, and then construct a class with the query response fields nicely parsed out.

To issue a query against Twitter, all you need to do is use HTTPGet() from Step 2 to connect to hostname "search.twitter.com" and port 80, and pass in an appropriate document path. The document path you will ask for has the following URL:

/search.json?rpp=100&q=encoded search query
Given a search query passed to you by the customer, you must encode it using the URIEncode() routine from Utils.h to transform the query into text that is safe to embed within a URL. For example, if the customer wants to search for the phrase "lance armstrong", then the quotation marks and the space character need to be converted into safe versions of them using URIEncode(), and the resulting document path to ask for is:
/search.json?rpp=100&q=%22lance%20armstrong%22
So, using HTPGet(), you issue the request for this document, and you get back JSON-encoded search results.

Your next step is to use JSONParse from part A to parse these results into a nested JSONValue structure. Finally, you have to iterate through this structure, pulling out key fields and building up a std::vector of Results. (A Result is a nested class within the TwitterSearch class, as you can see from the TwitterAPI.h header file.)

The JSON that you receive back will be a more complex version of what was shown in the diagram near the top of this page in Part A. In particular, try clicking on this search URL to see that encoded JSON data in your browser:

http://search.twitter.com/search.json?q=%22lance%20armstrong%22

It's hard to read, but if you use the tool to make it more readable provided near the top of this page, you'll notice that there is a top level object with a results field whose value is an array of result objects. Each object contains some string values. You will need to build up a Result structure for each result object, extracting out the fields. You should use the Get JSONObject::Get() method to fetch the "results" array from the top-level results, and then you can use the JSONArray's array access operator [ ] to access the individual results. Within each result, you use the JSONObject::Get() method to access the string fields, such as the "from_user" field.

Now, take a look at TwitterAPI.cc; note that we have provided some of the code, but your job is to complete the implementation. Once you think you have finished, try running the unit tests and see if you make it past the TwitterAPI unit test.

Part C -- a simple Twitter search shell

Now that you have code that is able to interact with Twitter, we are going to implement a very simple user interface. You will complete a program that prompts the user for a query, reads the query from the user, then issues the query against twitter. Instead of printing the results out to the screen, the program will write a temporary HTML file with the results within it, then invoke the Firefox browser to display the results graphically.

First, try running our solution binary (solution_binaries/twitter_shell). Note how it works: the user types a query and presses return, and the shell then works its magic. Your job is to replicate this behavior.

Take a look at twitter_shell.cc; note that the implementation is partially missing. Fill it in the usual way: look for the STEP X comments, and follow the directions. Here's some detail on a few of steps:

  • Once you have successfully retrieved a query response from Twitter, you need to extract and process some data from the response. In particular, we want you to extract each word from each "text" field of each tweet, santize each word, and insert each word into a vector. Our code will then generate a frequency histogram counting how often each word appears in the result set. Our code will produce a vector of std::pair instances; a pair is just a pair of data types, and in our example, the first component of the pair is word frequency (an int) and the second component is the word itself (a string).

    As a hint for how to split individual words out of a large string of words, take a look at this page from stackoverflow, and scroll down to look at the response involving boost::split.

  • Given this frequency data, you need to generate a web page that renders it. We are using an open source JavaScript library that generates a fancy word cloud from this frequency data. What you need to do is figure out how to generate the right HTML file that prepares the data for the JavaScript library. Here's how: take a look at this example page, and mimic everything inside it precisely. The one part of the page you need to modify is the "word_list" array -- instead of using our word list, you need to dynamically generate the word list from your twitter data.

  • Once you've generated the web page, our code will take over and write it out to a temporary file and invoke Firefox to render it.

What to turn in

When you're ready to turn in your assignment, do the following:

  1. In the hw2 directory, run "make clean" to clean out any object files and emacs detritus; what should be left are your source files.

  2. Create a TURNIN.TXT file in hw2 that contains your name, student number, and UW email address.

  3. cd up a directory so that hw2 is a subdirectory of your working directory, and run the following command to create your submission tarball, but replacing "UWEMAIL" with your uw.edu email account name.
    tar -cvzf hw2_submission_UWEMAIL.tar.gz hw2
    For example, since my uw.edu email account is "gribble", I would run the command:
    tar -cvzf hw2_submission_gribble.tar.gz hw1

  4. Use the course dropbox (there is a link on the course homepage) to submit that tarball.

  5. Fill out the following survey to give us feedback on the assignment:
    https://catalyst.uw.edu/webq/survey/gribble/165045

Grading

We will be basing your grade on several elements:

  • The degree to which your code passes the unit tests. If your code fails a test, we won't attempt to understand why: we're planning on just including the number of points that the test drivers print out.

  • We have some additional unit tests that test a few additional cases that aren't in the supplied test driver. We'll be checking to see if your code passes these as well.

  • The quality of your code. We'll be judging this on several qualitative aspects, including whether you've sufficiently factored your code and whether there is any redundancy in your code that could be eliminated.

  • The readability of your code. For this assignment, we don't have formal coding style guidelines that you must follow; instead, attempt to mimic the style of code that we've provided you. Aspects you should mimic are conventions you see for capitalization and naming of variables, functions, and arguments, the use of comments to document aspects of the code, and how code is indented.


CSE logo Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[ comments to gribble ]