Is California Happier than New York? Twitter Sentiment Analysis

Due: at 11pm on Thursday, July 19. Submit via Catalyst CollectIt (a.k.a. Dropbox).

Twitter represents a fundamentally new instrument to make social measurements. Millions of people voluntarily express opinions across any topic imaginable --- this data source is incredibly valuable for both research and business.

For example, researchers have shown that the "mood" of communication on Twitter reflects biological rhythms and can even be used to predict the stock market. A student here at UW used geocoded tweets to plot a map of locations where "thunder" was mentioned in the context of the recent storms.

Researchers from Northeastern University and Harvard University who study the characteristics and dynamics of Twitter have produced an excellent resource for learning more about this area.

In this assignment, you will query the Twitter API from Python, use a word-based sentiment metric to score tweets, and compare the overall sentiment of tweets from California with that of tweets from New York.

First, download and unzip the file homework5.zip.

The Twitter Application Programming Interface

Twitter provides a very rich Application Programming Interface (API).

This interface is entirely web-based, meaning that programs connect to resources over the Internet to query the system, access data, and even control their account.

You can read more about the Twitter API.

Unicode strings

You may notice that strings in the twitter data are prefixed with the letter "u", like this:

u"This is a string"

This indicates that the string is "unicode", which is a standard for representing a much larger variety of characters beyond the Roman alphabet (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).

In most circumstances, you will be able to use a unicode object just like a string.

If you encounter an error involving unicode, feel free to ask for assistance. Alternatively, you can use the encode method to just replace the international characters, like this:

>>> a = u"aaaàçççñññ"
>>> a.encode('ascii', 'replace')
'aaa???????'

Problem 1: Query Twitter with Python

Open the file "twitterquery.py" and read the comments.

This program is like a very simple web browser: It accesses a url and downloads the response to the local computer. However, there is an important difference: we are accessing data that is designed to be processed with a program rather than to be rendered on the screen and displayed to a user. You can see this for yourself by copying the following url into your web browser:

http://search.twitter.com/search.json?q=microsoft

Now run the program twitterquery.py from the command line.

The result displayed in your browser is the same data that the program prints to the screen. But with the browser, all we can do is display it. With Python, we can process it further.

The format of this data is called JSON, which stands for JavaScript Object Notation. It is a simple format for representing nested structures of data --- lists of lists of dictionaries of lists of .... you get the idea.

As you might imagine, it is fairly straightforward to convert JSON data into a Python data structure. Indeed, there is a convenient library to do so, called json, which we will use.
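As a quick illustration of what the json library does, here is a sketch run on a small made-up string shaped like a (heavily trimmed) search response, rather than on a live query:

```python
import json

# A tiny made-up JSON string shaped like a trimmed-down Twitter search
# response; the real response has many more keys and tweets.
raw = '{"page": 1, "results": [{"text": "hello world"}]}'

response = json.loads(raw)  # parse the JSON text into Python objects

# JSON objects become Python dictionaries...
assert type(response) == dict
# ...and JSON arrays become Python lists.
assert type(response["results"]) == list
```

The same json.loads call works on the full response string your program downloads.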

Back to the result returned from our query: It looks very complicated, but you can deduce that the outermost structure is a dictionary by noticing the curly braces, or just by using the type function. Twitter provides only partial documentation for understanding this data format. But we will use Python to inspect this data structure and figure out which parts are important for our purposes.

Where do we start? We know that the outer-most object is a dictionary. What are the keys of this dictionary? Use print response.keys() to inspect the keys.

You will see several keys that contain the word "page". This outer-most dictionary describes a single page of results. Where are the actual tweets? One of the keys is results --- that looks promising. Perhaps these are the actual search results returned by our query.

Print the results with print response["results"].

Now perhaps the tweets are more recognizable. Indeed, each element in this list is a dictionary representing a tweet. To access just the text of the tweet, use the key text. Write a short program to print out the text of every tweet.
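The loop over tweets can be sketched as follows; a canned two-tweet response stands in for the live query so the snippet runs on its own (your program would use the response you downloaded):

```python
import json

# A made-up, trimmed-down search response standing in for the live one.
raw = '{"results": [{"text": "first tweet"}, {"text": "second tweet"}]}'
response = json.loads(raw)

# Each element of response["results"] is a dictionary describing one
# tweet; the key "text" holds just the text of that tweet.
texts = []
for tweet in response["results"]:
    texts.append(tweet["text"])
    print(tweet["text"])
```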

What to turn in: In answers.txt, give the total number of tweets returned by this query.

Problem 2: Abstract the Twitter search into a function

Write a function called search that accepts one parameter, a twitter query term, and returns a list of strings. Each string in the list should just contain the text of a single tweet and no other metadata. Remember, each tweet is a dictionary, but you are currently only interested in the text of the tweet.
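The text-extraction half of this function can be sketched against a canned response dictionary, so the sketch runs without a network connection; in your solution, the response would come from downloading and parsing the search URL as in twitterquery.py:

```python
import json

def extract_texts(response):
    """Return a list holding just the text of each tweet in a
    parsed search response (no other metadata)."""
    return [tweet["text"] for tweet in response["results"]]

# In your real search(query), response would be built by fetching the
# search URL for query and passing the result to json.loads. Here we
# use a canned response so the sketch stands alone.
canned = json.loads('{"results": [{"text": "a"}, {"text": "bb"}]}')
assert extract_texts(canned) == ["a", "bb"]
```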

To test your function, check the maximum length of any tweet in the response. It will typically (but not necessarily) be close to 140, the maximum length of a tweet.

tweet_lengths = [len(tweet) for tweet in search("microsoft")]
print max(tweet_lengths) 

Use your function to return results for a few different search terms to make sure it works properly.

What to turn in: In answers.txt, write the maximum length of the tweets in your response.

Problem 3: Read the sentiment scores

The file AFINN-111.txt contains a list of sentiment scores. Each line in the file contains a word or phrase followed by a score. See the file AFINN-README.txt for more information.

Write a function called read_scores that takes a single parameter: a file name representing a list of sentiment scores in the format described above. The function should return a dictionary mapping terms to scores.

Write a test for your function. Using the returned dictionary, look up the scores for the following three words and sum their scores: foreclosure, masterpiece, questionable. Compute the sum manually by looking in the source file AFINN-111.txt, and compare your answers. This kind of exercise is always required in testing -- find a simple example for which you can compute the answer manually, then make sure your program produces the same answer.

Hint: The term and score are separated by a tab character, which can be represented by the string "\t". You can use this to parse the file, or you can use the csv reader library you used in the last assignment.

Hint that you may or may not relate to your solution: If you want to concatenate a whole list of strings by a delimiter, you can use the join method of a string. Note that you call join as a method of the delimiter you wish to use, not as a method of the list you wish to concatenate.

>>> ",".join(["separated", "by", "commas"])
'separated,by,commas'
>>> "".join(["a","b","c"])
'abc'
>>> " ".join(["a","b","c"])
'a b c'

What to turn in: In answers.txt, write the sum of scores for the three words foreclosure, masterpiece, and questionable.

Problem 4: Compute the sentiment for a given string

Write a function called sentiment that takes a single argument: a string (think of it as a tweet). The function should return a number representing the sentiment by adding up the sentiment scores of the words in the tweet. You will need to lookup each word in the dictionary returned by read_scores. Ignore words that you do not find in the dictionary (One way to do this is to assume any word not in the dictionary has a sentiment of 0.)
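One possible shape for this function, sketched with a tiny made-up score dictionary standing in for the one returned by read_scores (your solution may also want to handle punctuation and capitalization):

```python
# A tiny made-up score dictionary; your solution would use the
# dictionary returned by read_scores("AFINN-111.txt").
scores = {"useful": 2, "disappointing": -2}

def sentiment(text):
    """Sum the scores of the words in text; unknown words score 0."""
    total = 0
    for word in text.split():
        total += scores.get(word, 0)  # words not in the dictionary add 0
    return total

assert sentiment("a useful announcement") == 2
assert sentiment("useful but disappointing") == 0
```

The get method with a default of 0 is what lets the function ignore words that are not in the dictionary.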

(Optional) Write the program so that it does not re-read the sentiment file on each and every call. (You may have already done so.)

Test your function by evaluating this line of code:

print sentiment("This significant announcement was short-sighted and disappointing, but ultimately useful.")

What to turn in: In answers.txt, write the output of the above call.

Problem 5: Compute the total sentiment for a list of strings

Write a one-line function called total_sentiment that takes a single argument: a list of strings. The function should return the total sentiment of all words in all strings in the list.
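One way to keep the body to a single line is a generator expression inside sum(); sketched here with a stub sentiment function so the snippet stands alone (your version would call your real sentiment function from Problem 4):

```python
# Stub standing in for the sentiment() you wrote in Problem 4,
# with made-up scores.
def sentiment(text):
    scores = {"fun": 4, "disaster": -2}
    return sum(scores.get(w, 0) for w in text.split())

def total_sentiment(texts):
    return sum(sentiment(t) for t in texts)

assert total_sentiment(["good fun", "a disaster", ""]) == 2
```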

Test your function by evaluating the following code and making sure the answer you compute is correct.

tot = total_sentiment(["excellent fun", "terrible disaster", "foolish mistake", ""])
print "Total sentiment test: ", tot
assert tot == -2

What to turn in: Report the result of the above test in answers.txt. Remember that your function body should be only one line long.

Problem 6: Extend to include multiple pages of twitter results

If you look closely, you'll find that your function search is only returning a single page of results.

This kind of realization happens a lot --- your initial design turns out to be not quite general enough. So we want to refactor your design to a) meet the new requirements and b) not break the existing code.

For example, put the following url into any browser:

http://search.twitter.com/search.json?q=microsoft&page=2

The argument at the end of the url instructs twitter to return the 2nd page of results.

Modify your function search to read 10 pages of twitter search results. To access another page of results, you must make a separate call to the twitter API and provide an optional page argument.

The search function should keep the same interface --- it will still accept one parameter, and still return a list of strings. Except now, the list of strings will be longer.

Your function will make multiple calls to the twitter API, replacing page=2 with page=3 and so on. The results from all pages should be assembled into a single list.

Hint: It is recommended, but not required, that you write an auxiliary function called search_onepage that takes two arguments: the search term and the page number. Once you have this function, your implementation of the search function can call this auxiliary function multiple times.
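Since the per-page calls differ only in the URL, the sketch below shows just how the page number might be spliced into the query URL; the downloading and JSON parsing would be the same as in your existing search:

```python
def build_url(query, page):
    """Return the search URL for one page of results for query."""
    return ("http://search.twitter.com/search.json?q=%s&page=%d"
            % (query, page))

# A search_onepage(query, page) function would fetch build_url(query,
# page), parse the JSON, and return that page's tweet texts; search(
# query) would then concatenate the lists for pages 1 through 10.
assert build_url("microsoft", 2).endswith("q=microsoft&page=2")
```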

Test your function with the following call:

print len(search("university of washington"))

What to turn in: Use the functions you have written so far to compute the total sentiment for the term oil and compare it to the total sentiment for the term geothermal. Report the result in answers.txt.

Problem 7: Determine which tweets are from California

The file saturday_afternoon.json contains a 5-minute sample of about 1% of the complete twitter stream. You can access this stream directly from the twitter site, but it requires authentication with twitter. The authentication process is complicated, so we will not require it for this assignment. Instead, we will use this sample.

This file has a slightly different format from the response to the query you have been using. In this file, each line is a Tweet object, as described in the twitter documentation.

For some of the tweets in the file saturday_afternoon.json, you can access the profile of the user that submitted the tweet by checking for the key user. This profile is a dictionary. One of the keys in that dictionary is location, which is a self-reported description of their location.

Write a function is_from_california that takes a tweet as an argument and attempts to determine whether the tweet came from a user from California. One way to do this is to see if the string California appears anywhere in the location field. You can do this with the in operator. For example:

>>> "ice" in "ice cream"
True
>>> "c" in "abcdefg"
True
>>> "moi" in "Des Moines"
False

This data is dirty, so you probably won't get a perfect answer. Some people might refer to California with "CA" or even "Calif". Try to implement your function to be as effective as possible.

Also, note that not every tweet will have a user key, and not every user will have a location key. In fact, not every tweet will have a text key. Your program should be robust to these possibilities.
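One such function can be sketched as follows, using dict.get with a default value so that tweets missing the user or location keys are handled gracefully; the location markers checked here are only a starting point, and you may find better ones:

```python
def is_from_california(tweet):
    """Guess whether a tweet's self-reported location is in California."""
    user = tweet.get("user", {})         # {} if there is no user key
    location = user.get("location", "")  # "" if there is no location key
    for marker in ["California", "CA", "Calif"]:
        if marker in location:
            return True
    return False

assert is_from_california({"user": {"location": "Los Angeles, CA"}})
assert not is_from_california({})  # no user key at all
```

Because the data is dirty, a substring check like "CA" will produce some false positives; that is an accepted limitation here.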

To test your function, use the following:

tweetfile = open("saturday_afternoon.json", "r")
tweets = [json.loads(line) for line in tweetfile]
calitweets = [tweet for tweet in tweets if is_from_california(tweet)]

Next, write a program to extract the text of the tweets from California and compute the average sentiment of tweets from California.
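The averaging step itself is just the total sentiment divided by the number of tweets; sketched here with a stub sentiment function and made-up tweets, and with float() guarding against integer division in Python 2:

```python
# Stub for the Problem 4 sentiment function, with made-up scores.
def sentiment(text):
    scores = {"great": 3, "awful": -3}
    return sum(scores.get(w, 0) for w in text.split())

texts = ["great day", "awful traffic", "nothing much"]  # made-up tweets
average = sum(sentiment(t) for t in texts) / float(len(texts))
assert average == 0.0
```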

What to turn in: Compute the average sentiment of tweets from California in saturday_afternoon.json and report the answer in answers.txt.

Problem 8: Are people from California happier than people from New York?

Write a function is_from_new_york in the same way you wrote is_from_california.

Note that there will be some overlap between the two functions --- resist the temptation to copy and paste. Figure out a way to reuse the overlapping parts, perhaps by writing a new function that these two functions can call. Do not be afraid to implement a one-line function if you think it is useful.
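One way to factor out the overlap is a shared helper that both state functions call, sketched below with made-up marker lists (you may find better markers):

```python
def location_contains(tweet, markers):
    """True if the tweet's self-reported location mentions any marker."""
    location = tweet.get("user", {}).get("location", "")
    return any(marker in location for marker in markers)

def is_from_california(tweet):
    return location_contains(tweet, ["California", "CA", "Calif"])

def is_from_new_york(tweet):
    return location_contains(tweet, ["New York", "NY", "NYC"])

assert is_from_new_york({"user": {"location": "Brooklyn, NY"}})
```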

Use both functions to compare the total sentiment for California users with the total sentiment for New York users. Which state seems "happier"?

Hint: Remember, not every tweet dictionary will have a text key -- real data is dirty. You may want to write a function that encapsulates the two-step process of testing for a key then extracting it if it exists.

What to turn in: In answers.txt, report the total sentiment for tweets from California and the total sentiment for tweets from New York, and state which state appears "happier".

Submit your work

You are almost done!

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

At the bottom of your answers.txt file, in the “Reflection” part, state how many hours you spent on this assignment. Also state what you or the staff could have done to help you with the assignment.

Submit the following files via Catalyst CollectIt (a.k.a. Dropbox):