Social networking and recommendation systems

Due: at 11pm on Thursday, July 12. Submit via Catalyst CollectIt (a.k.a. Dropbox).

When you sign into Facebook, it suggests friends. In this assignment, you will write a program that reads Facebook data and makes friend recommendations.

This assignment looks longer than it actually is.

Part 1 is background material and explanations; there is nothing to turn in for this part.
Part 2 asks you to create a graph, write 7 functions and two loops to call them, and answer three written questions.
Part 3 asks you to create a graph, then redo part of the work from Part 1, which essentially consists of calling functions you already wrote. You can complete this part quickly.

First, download and unzip the file homework4.zip.

Contents:

Part 1: Background
Part 2: Test on a small graph
Part 3: Work with the Facebook graph
Submit your work

Part 1: Background

Recommendation systems

Facebook suggests people you may be (or should be) friends with. Netflix suggests movies you might like. Amazon suggests products to buy. How do they do that? In this assignment, you will learn one simple way to make such suggestions, called “collaborative filtering”. The actual algorithms used by these companies are closely-guarded trade secrets.

A computer system that makes suggestions is called a recommender system. As background, there are two general approaches: collaborative filtering and content-based filtering.

Collaborative filtering says that, if your past behavior/preferences were similar to some other user's, then your future behavior may be as well. As a concrete example, suppose that you like John, Paul, and George, and other people like John, Paul, George, and Ringo. Then it stands to reason that you will like Ringo as well, even if you had never previously heard of him. The recommender system does not have to understand anything about what “John”, “Paul”, “George”, and “Ringo” are — they could even be brands of toilet paper, and the algorithm would work identically.
Content-based filtering considers the characteristics of the things you like, and it recommends similar sorts of things. For instance, if you like “Billie Jean”, “Crazy Train”, and “Don't Stop the Music”, then you might like other songs in the key of F-sharp minor, such as Rachmaninoff's “Piano Concerto No. 1”, even if no one else has ever had that particular set of favorite songs before.

In this assignment, you will implement a collaborative filtering recommendation system for suggesting friends on Facebook.

Representing a social network as a graph

A graph or network represents relationships among things. The things are represented as nodes or vertices, and the relationships are represented as edges.

One common use for a graph is to represent travel possibilities, such as on a road map or airline map. The nodes of the graph are cities, and the edges show which cities are directly connected. Then, you can use the graph to plan travel.

Northeast roads Airline network

Another common use for a graph is to represent friendship among people in a social network. For example, here is the friendship graph for some of the characters of “Romeo and Juliet”:

Romeo and Juliet social network

An edge between person A and person B means that A considers B a friend, and also B considers A a friend.

This graph is unable to represent certain information. For example, Count Paris wishes to wed Juliet, but she does not reciprocate his affection. You do not need to worry about this information, because Facebook does not represent this information either. (Some other social networking sites, such as Twitter and Google+, do permit one-way links.)

In the image above, ignore the gray background and the labels for the families ("houses"); those are there just to help you interpret the graph but are not part of the social network itself.

Recommending friends

You will implement two mechanisms for recommending a new friend in a social network. A simple way to state this question is, “For user X, who is the best person to recommend as a friend?”

You will answer a more comprehensive question: “For user X, list some non-friends in order, starting with the best friend recommendation and ending with the worst.” A non-friend is a user who is not X and is not a friend of X. Depending on the recommendation algorithm, the list may include all non-friends or some of them.

For example, for Mercutio the list might be:

Capulet
Montague
Benvolio
Friar Laurence
Juliet

Further note that the recommendations might not be symmetric: the best friend recommendation for Montague might be Mercutio, but the best friend recommendation for Mercutio might be Capulet.

Your task will be to write code that, given a user U in the social network, produces friend recommendations for U, in order from best to worst. You will do this by assigning each potential friend a number called a score, where higher scores indicate a better match. Then you can sort your list according to the score. Given user X, if two people Y and Z would be equally good as new friends for X (they have the same score), then they should be listed in alphabetical order (for names) or numerical order (for numerical user IDs).

Representing a graph in Python

The NetworkX library represents a graph in Python. Work through the NetworkX tutorial, up to and including the Accessing edges section. In order for you to use this powerful library effectively, it's important that you actually do the exercises: type the inputs and understand the outputs.

To use the networkx library in a program, write import networkx or import networkx as nx near the top of your program. The latter has already been done for you in the supplied social_network.py, so you can use shorter commands like nx.draw(...) instead of the (slightly longer) networkx.draw(...).

Part 2: Test on a small graph

Problem 1: Create a graph

Create a graph named rj corresponding to the Romeo and Juliet graph above (ignoring the shaded family/house information). You will use this graph for testing. It is always a good idea to test your code on a dataset that is small enough for you to manually compute the results.

Use the Graph class (not DiGraph, MultiGraph, or MultiDiGraph). Review its documentation before you begin.

To verify that your graph is correct, the provided code draws the graph both to a window and to a file romeo-and-juliet.pdf. Compare it to the figure above. The nodes will appear in different locations; that's fine so long as the same nodes exist and they are connected in the same way by the edges.

Note that your program will pause until you close the window that contains the the graph drawing.

When you are happy with your graph, comment out the call draw_rj() call (not the definition of the draw_rj function). The reason is so that you don't have to close the window every time you run your program.

Problem 2: Recommend by number of common friends

If non-friend Y is your friend's friend, then maybe Y should be your friend too. If person Y is the friend of many of your friends, then Y is an even better recommendation. The best friend recommendation is the person with whom you have the largest number of mutual friends. You will implement this heuristic.

As a concrete example, consider Mercutio.

Mercutio has two friends in common with Capulet (Escalus and Paris).
Mercutio has two friends in common with Montague (Escalus and Romeo).
Mercutio has one friend in common with Benvolio (Romeo).
Mercutio has one friend in common with Friar Laurence (Romeo).
Mercutio has one friend in common with Juliet (Romeo).
Mercutio has no friends in common with the Nurse.
Mercutio has no friends in common with Tybalt.

Therefore, Capulet and Montague are the best friend recommendations for Mercutio, and the Nurse and Tybalt are the worst friend recommendations. (In fact, the Nurse and Tybalt are such poor friend recommendations that your program will not even suggest them.)

Write the following 5 functions, whose documentation strings appear in the template file social_network.py that you were provided.

friends_of_friends
common_friends
number_of_common_friends_map
number_map_to_sorted_list
recommend_by_number_of_common_friends

The template file defines a helper function, friends(graph, user), that you may find useful.

The template file also contains assert statements to help you test your code. We strongly encourage you to write additional tests as well, in order to verify that your code is correct.

Hint: Remember that when Python tests sets or dictionaries for equality, it ignores the order of elements. A consequence of this is that two sets can print differently but be equal because they represent the same set, and likewise for dictionaries. For example, this Python expression evaluates to True:

{ 'Capulet', 'Escalus', 'Montague' } == { 'Montague', 'Capulet', 'Escalus' }

Hint: Throughout this assignment, you are permitted, but not required, to define additional functions beyond the required ones.

Hints for friends_of_friends and common_friends: When defining these routines, you should not use any data structures other than sets. If you find yourself using even one list or dictionary, then you are doing the problem wrong. The reason to use sets is that the code is much shorter and simpler. In fact, a good solution to friends_of_friends is 4 lines long, and a good solution to common_friends is 1 line long. Longer solutions are possible and can also be good style, but the point is that these are small, simple functions. The other functions you will define are little, if any, longer.

Hints for number_map_to_sorted_list:

You may want to use the key argument to the sort routine. Furthermore, you may find the operator.itemgetter routine useful as the key argument. For details, see the Python Sorting HowTo.
There are multiple ways to sort primarily on criterion c1, but to sort on criterion c2 where c1 does not make a distinction. One way is to first sort with respect to c2, then sort with respect to c1. This works because when Python sorts, it retains the ordering of any items whose sort keys are equal. (Another way is to use, as the sort key, a tuple/list of all the sort criteria. A way to do this is with a key expression like itemgetter(1, 0). This works because Python, when comparing a tuple/list, does so lexicographically (element-by-element, stopping when there is any difference.) For details, see the Python Sorting HowTo.

Problem 3: Recommend by influence

We will now give a different algorithm for computing a friendship score.

Consider the following hypothetical situation.

Two of your friends are J.D. Salinger and Tim Kinsella.
J.D. Salinger has only two friends (you and one other person).
Tim Kinsella has 7 billion friends.
J.D. and Tim have no friends in common (besides you).

Since J.D. is highly selective in terms of friendship, and is a friend of yours, you are likely to have a lot in common with J.D.'s other friend. On the other hand, Tim is indiscriminate and there is little reason to believe that you should be friendly with any particular one of Tim's other friends.

Incorporate the above idea into your friend recommendation algorithm. Here is the concrete way that you will do so. We call the technique “influence scoring”.
Suppose that user1 and user2 have three friends in common: f1, f2, and f3. In Problem 2, the score for user2 as a friend of user1 is 1+1+1: each common friend contributes 1 to the score. In this problem, the score for user2 as a friend of user1 is 1/numfriends(f1) + 1/numfriends(f2) + 1/numfriends(f3), where numfriends(f) is the number of friends that f has. In other words, each friend F of user1 has a total influence score of 1 to contribute, and divides it equally among all of F's friends.

In the example above, J.D. Salinger's other friend would have a score of 1/2, and each of Tim Kinsella's friends would have a score of 1/7000000000.

The social_network.py file gives two functions (influence_map and recommend_by_influence) that you should implement. You may find that their implementations are quite similar to code that you have already written in Problem 2; that is OK. The file also gives one test case for each of the two functions.

Do not change the code that you wrote for Problem 2. However, you can reuse most of it. You can solve the problem with just the two new functions (influence_map and recommend_by_influence), plus re-using some unchanged functions from Problem 2.

Problem 4: Does the recommendation algorithm make a difference?

Does the change of recommendation algorithm make any difference? Maybe not: you can see that Mercutio gets the same friend recommendations with both recommendation approaches. Does everyone get identical results with the two recommendation approaches?

Write code to print a list of people for whom the two approaches make the same recommendations, then print a list of people for whom the two approaches make different recommendations. Paste the output into an answers.txt file, which you will create.

Hint: There are 5 people for whom the recommendations are the same, and 6 people for whom the recommendations are different.

Problem 5: Analysis of the algorithm

In Problem 2, every friend recommendation had a score of 1 or more. In Problem 3, every friend recommendation had a score of less than 1. Will that always be the case?

In your answers.txt file, state whether each influence score will always be less than 1. If your answer is “yes”, explain why. If your answer is “no”, then give the smallest graph that you can that has a friend influence score of 1 or greater. In either case, you do not have to write any code — all your answers will appear in your answers.txt file.

Problem 6: Randomly choosing a friend connection

Suppose that you wanted to randomly choose a friend connection from a social network. Here are two ways to do this:

Obtain a list of all the friend connections, and randomly choose one of those.
Randomly choose a user in the social network, then randomly choose one of their friends.

The former makes each choice of friend connection equally likely. The latter does not necessarily do so.

In your answers.txt file, describe the smallest graph that you can for which the two approaches lead to different results. In other words, the first approach leads to each edge in the social network graph being chosen with equal probability, and the second approach leads to some edges being more likely to be chosen than others. State the likelihood of each edge in your example graph being chosen, when using the latter approach.

There is no code to write for this question. Just write your solution in answers.txt

Hint: If you are having trouble, then start with the simplest possible graph, compute the likelihoods, and make the graph incrementally more complicated until you find the example you seek.

Problem 7: How good are your recommendations?

We have seen that the two recommendation systems give different results. Which one is better?

You will test the two recommendation systems in the following way:

Randomly choose a real friend connection; call the two friends F1 and F2.
Remove their friendship from the graph.
Compute friend recommendations for F1 and F2.
Determine the rank of F1 in F2's list of recommended friends.
Determine the rank of F2 in F1's list of recommended friends.
The "rank" is also known as the "index" or "position".
Don't forget to put their friendship back in the graph.

For a perfect recommendation system, the first recommendation for F1 would be F2, and the first recommendation for F2 would be F1. In general, the closer to the front of the list these recommendations are, the better the recommendation system.

For each recommendation system, perform the above experiment 100 times, using the Romeo and Juliet data. Compute the average rank of the correct recommendation within the list of recommendations. (If the correct recommendation does not appear within the list of recommendations, ignore that trial when computing the average rank.)

To prevent different random choices from skewing your results, use the same random choices for both recommendation systems. Another way of saying this is that each time you make a random choice, you should evaluate both recommendation systems using that choice. Then go on to the next choice.

Ensure that your code is general. Write it as a function that takes a graph as an input, then call that function. You will re-use the function in Problem 12.

Ensure that your code is documented. In your answers.txt file, give the average index for each recommendation system. State which recommendation system is better for the rj graph.

Part 3: Work with the Facebook graph

Problem 8: Create a Facebook graph

Create a graph named facebook from the Facebook data in file facebook-links.txt. As above, use the Graph class.

Print the number of nodes and the number of edges in your facebook graph. The output should be:

Facebook nodes: 63731
Facebook edges: 817090

The facebook-links.txt file in your homework4 directory is courtesy of the Max Planck Institute for Software Systems. Here is a slightly clarified version of the documentation for this file:

File facebook-links.txt contains a list of all of the user-to-user links from the Facebook New Orleans networks. These links are undirected on Facebook.

Format: Each line contains two numeric user identifiers, meaning the second user appeared in the first user's friend list, and the first user appeared in the second user's friend list. Finally, the third column is a UNIX timestamp with the time of link establishment (if it could be determined, otherwise it is '\N').

A Unix timestamp is the number of seconds since January 1, 1970. You may ignore it in this assignment. (Facebook does use the recency of your activity to help it in making recommendations.)

Hint: Don't be alarmed if reading the Facebook data takes a little while. The file is large, and reading it may take up to a minute. However, do not try to draw the Facebook graph. This may cause your computer to hang, and even if it were successful, you would not learn much from a tangled mess of 817,090 edges.

Problem 9: Recommend by number of friends for Facebook

For every Facebook user with an id that is a multiple of 1000, print a list containing the first 10 friend recommendations, as determined by number of common friends. If there are fewer than 10 recommendations, print all the recommendations.

Paste the output into answers.txt file.

Problem 10: Recommend by influence for Facebook

For every Facebook user with an id that is a multiple of 1000, print a list containing the first 10 friend recommendations, as determined by influence score. If there are fewer than 10 recommendations, print all the recommendations.

Paste the output into your answers.txt file.

Problem 11: Does the recommendation algorithm make a difference for Facebook?

Considering only those 63 Facebook users with an id that is a multiple of 1000, compute and print the number of Facebook users who have the same first 10 friend recommendations under both recommendation systems, and the number of Facebook users who have different first 10 friend recommendations under the two recommendation systems.

Paste the output into your answers.txt file.

Problem 12: How good are your recommendations for Facebook?

Repeat the procedure you followed for Problem 7, but for the Facebook data.

In your answers.txt file, give the average index for each recommendation system. State which recommendation system is better for the facebook graph.

Hint: The average indices should be in the 100-400 range. This is because the social network is very large, you are using very impoverished data (just existing connections) and you are looking for one specific “right” answer (even though in real life many of the recommendations are probably useful).

Note: You will evaluate the recommendation systems on 100 randomly-chosen edges out of the 817,090 edges in the Facebook graph. If you were to choose 100 different randomly-chosen edges, you would get different results. The answers are likely to be informative nonetheless. If you choose 100 edges at random, it is unlikely that you will make an unlucky choice, such that technique 1 is better than technique 2 for those 100 edges, but technique 2 is better than technique 1 overall. It's still a possibility, of course. The more edges you choose at random, the less likely you are to be unlucky in that way. We asked you to do the evaluation on 100 randomly-chosen edges because your code would be very slow if you used (say) 1000 choices.

Submit your work

You are almost done!

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

At the bottom of your answers.txt file, in the “Reflection” part, state how many hours you spent on this assignment. Also state what you or the staff could have done to help you with the assignment.

Submit the following files via Catalyst CollectIt (a.k.a. Dropbox):

answers.txt
romeo-and-juliet.pdf
social_network.py