CSE 373, Spring 2019: HW5 Part 2b

Overview

Due Wednesday, May 22 at 11:59pm.

In part 2a of this project, we discovered that while TF-IDF works fine in many cases, it's also exploitable.

Websites that want to maximize search traffic to gain more ad revenue or attract more customers can artificially inflate their TF-IDF scores by keyword stuffing: cramming each webpage full of as many different words and phrases as possible.

We saw this for ourselves by running the search engine against the "wikipedia-with-spam" dataset, which contains a mixture of legitimate webpages and spam webpages. The search results were still useful, but also contained an alarming amount of spam.

To fix this, we need a second metric to penalize that behavior: we'll be using a simplified version of PageRank, an algorithm invented by (and named after) Larry Page, one of the founders of Google. PageRank is not the only algorithm Google uses, but is one of their more widely known ones.

You will use these files from prior assignments:

  • src/main/java/datastructures/concrete/dictionaries/ChainedHashDictionary.java
  • src/main/java/datastructures/concrete/dictionaries/ArrayDictionary.java
  • src/main/java/datastructures/concrete/ArrayHeap.java
  • src/main/java/datastructures/concrete/ChainedHashSet.java
  • src/main/java/datastructures/concrete/DoubleLinkedList.java
  • src/main/java/misc/Sorter.java

You will be modifying the following file:

  • src/main/java/search/analyzers/PageRankAnalyzer.java

Additionally, here are a few more files that you might want to review while completing the assignment (note that this is just a starting point, not necessarily an exhaustive list):

  • src/test/java/search/TestPageRankAnalyzer.java
  • src/main/java/search/models/Webpage.java
  • src/main/java/search/Main.java

High-level description

PageRank ignores the user's query and instead focuses on computing the relative "importance" or "reputability" of a webpage based on which webpages link to it. The intuition is that if many different webpages link to webpage A, then webpage A is most likely "important" or "reputable," and many people are likely to naturally stumble across it just by following links.

Additionally, if a reputable webpage like A links to just one other webpage, webpage B, we can assume that B is also rather reputable. Since webpage A also has many users, these users are likely to click on that single link to B, which means that webpage A will "share" some of its popularity with webpage B. Extrapolating from this logic, we can estimate that a webpage's reputability is correlated with its "popularity," or number of users.

Because all these webpages end up sharing popularity with each other, calculating an actual value for popularity is more mathematically complex than simply counting the number of incoming links. To calculate this efficiently, PageRank uses a sort of simulation that very loosely mimics the behavior of internet users over time.

To put it another way, we are looking at the graph representing the internet and trying to emulate how people will naturally traverse that graph!

The exact algorithm will come later on this page, but here's a video from last quarter providing a more detailed version of this high-level overview. The video is rather lengthy and watching it is not required to understand or implement the algorithm, but it goes over a related example of simulation using graphs that may help you understand the PageRank algorithm.

Part 2b i: Build the web graph

Task: Implement PageRankAnalyzer.makeGraph(...).

PageRank works by analyzing a directed graph representing the internet: each webpage is a vertex, and each link is an edge. So, if we want to implement PageRank, we need to first build this graph!

Some notes:

  1. You should represent this graph using an adjacency list. Discuss with your partner: why is an adjacency matrix likely not the best choice here?

  2. If a webpage has multiple outgoing links to the same destination, you should add only a single edge to your graph. That is, there should be no duplicate edges.

  3. If a webpage links to itself, you should omit that link. That is, there should be no self-loops in your graph.

  4. Your makeGraph(...) method will be given a set of web pages as input. If any of those web pages links to something that is not contained within that set, ignore that link. We want our graph to be self-contained: every edge should point to an existing vertex.

  5. Please make sure to read the other notes and hints section from the part 2a spec—in particular, the bullet points about working with iterators.
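
To make the notes above concrete, here is a minimal sketch of one way to structure makeGraph(...). It is a sketch only: it assumes the data structure interfaces from the earlier parts of the project (IDictionary, ISet, ChainedHashDictionary, ChainedHashSet), assumes Webpage exposes getUri() and getLinks(), and assumes java.net.URI is imported; double-check the actual signatures in your starter code before borrowing anything from it.

    private IDictionary<URI, ISet<URI>> makeGraph(ISet<Webpage> webpages) {
        // Record which URIs are actually in the given set so we can ignore
        // links that point outside of it (note 4).
        ISet<URI> knownUris = new ChainedHashSet<>();
        for (Webpage page : webpages) {
            knownUris.add(page.getUri());
        }

        IDictionary<URI, ISet<URI>> graph = new ChainedHashDictionary<>();
        for (Webpage page : webpages) {
            URI source = page.getUri();
            ISet<URI> edges = new ChainedHashSet<>();
            for (URI target : page.getLinks()) {
                // Skip self-loops (note 3) and links to pages outside the given
                // set (note 4); storing edges in a set removes duplicates (note 2).
                if (!target.equals(source) && knownUris.contains(target)) {
                    edges.add(target);
                }
            }
            graph.put(source, edges);
        }
        return graph;
    }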

Part 2b ii: Implement PageRank

Task: Implement PageRankAnalyzer.makePageRanks(...).

Next, you will implement the makePageRanks(...) method, which will precompute the page rank for every webpage in your graph. This method should implement the core algorithm as described below:

  1. Initialize step:

    Initially, give each webpage a rank of \(\frac{1}{N}\), where \(N\) is the total number of webpages.

    Intuitively, we're saying that at the very start, \(\frac{1}{N}\) of all web surfers will start on each webpage: each webpage initially has an equal percentage of visitors.

  2. Update step:

    Next, we will compute the new page rank by simulating the expected behavior of our web surfers. Here's some pseudocode for the update step:

    void updatePageRanks(graph, oldRanks, newRanks, numPages, decay):
        newRanks.setAll(0)
        foreach page in oldRanks:
            // split this page's rank between the pages it links to
            foreach unique linkDestination in graph[page]:
                newRanks[linkDestination] += oldRanks[page] / numUniqueLinks(page) * decay
            if numUniqueLinks(page) == 0:
                // a page with no out-going links splits its rank between all pages
                newRanks.incrementAll(oldRanks[page] / numPages * decay)
        // re-distribute the "decayed" surfers
        newRanks.incrementAll((1 - decay) / numPages)

    At each iteration, we evenly distribute each page's rank from the previous iteration between all linked pages. We also multiply by decay; this decay factor is a parameter that represents the rate at which your surfers are willing to continue browsing by following links. We usually set decay equal to roughly 0.85: this means that about 85% of your users will be willing to continue visiting new web pages. The other 15% simply stop where they are.

    This decay factor also helps our page ranks converge: it ensures that every time we run the update step, our page ranks change less and less. Note that the closer we set decay to 0, the faster we converge (but the more inaccurate our results are likely to be).

    What about pages that have no out-going links? In this case, we assume those visitors jump to a random webpage, and distribute those pages' ranks evenly across all webpages.

    Finally, we evenly distribute the "decayed" surfers across all pages so that the total page rank across all pages sums to 1 again. Doing this ensures that we don't "lose" web surfers when we apply the decay factor—otherwise, the average page rank would decrease with each iteration, which really isn't very useful behavior. You can think of this added value as representing new surfers who start browsing from a random webpage.

  3. Convergence check step:

    We have just completed one iteration of the page rank algorithm! We now need to decide if we need to keep going or if it's ok to stop.

    Compare the difference between the old page rank and the new page rank for each individual webpage. If the difference between the two for every single webpage is less than some user-specified value \(\epsilon\), we can stop iterating: we say our page ranks have converged and we will return the old page ranks.

    (Note: \(\epsilon\) is the Greek lower-case letter "epsilon", and is traditionally used to represent some sort of very small number.)

    However, if any webpage has a difference of \(\epsilon\) or more, we replace the old page ranks with the new ones we just computed and repeat step 2.
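
Before looking at the examples below, it may help to see the three steps assembled into code. The following is a minimal, deliberately unoptimized sketch, written against the interfaces assumed from the earlier parts of the project (IDictionary iterating over KVPair objects, ISet, ChainedHashDictionary); the parameter names mirror the description above, but follow whatever signature your starter code actually declares.

    private IDictionary<URI, Double> makePageRanks(IDictionary<URI, ISet<URI>> graph,
                                                   double decay, int limit, double epsilon) {
        int numPages = graph.size();

        // Initialize step: every page starts with rank 1/N.
        IDictionary<URI, Double> oldRanks = new ChainedHashDictionary<>();
        for (KVPair<URI, ISet<URI>> pair : graph) {
            oldRanks.put(pair.getKey(), 1.0 / numPages);
        }

        for (int i = 0; i < limit; i++) {
            // Update step, part 1: start every new rank at zero.
            IDictionary<URI, Double> newRanks = new ChainedHashDictionary<>();
            for (KVPair<URI, ISet<URI>> pair : graph) {
                newRanks.put(pair.getKey(), 0.0);
            }

            // Update step, part 2: each page gives away (decay * oldRank), split
            // between its linked pages, or between all pages if it has no
            // out-going links.
            for (KVPair<URI, ISet<URI>> pair : graph) {
                URI page = pair.getKey();
                ISet<URI> links = pair.getValue();
                double given = oldRanks.get(page) * decay;
                if (links.size() == 0) {
                    for (KVPair<URI, ISet<URI>> other : graph) {
                        newRanks.put(other.getKey(), newRanks.get(other.getKey()) + given / numPages);
                    }
                } else {
                    for (URI target : links) {
                        newRanks.put(target, newRanks.get(target) + given / links.size());
                    }
                }
            }

            // Update step, part 3: re-distribute the "decayed" surfers so the
            // ranks sum to 1 again.
            for (KVPair<URI, ISet<URI>> pair : graph) {
                URI page = pair.getKey();
                newRanks.put(page, newRanks.get(page) + (1 - decay) / numPages);
            }

            // Convergence check step: stop once no rank moved by epsilon or more.
            boolean converged = true;
            for (KVPair<URI, Double> pair : oldRanks) {
                if (Math.abs(pair.getValue() - newRanks.get(pair.getKey())) >= epsilon) {
                    converged = false;
                    break;
                }
            }
            if (converged) {
                // The old and new ranks differ by less than epsilon everywhere,
                // so returning the old ranks (as described above) is fine.
                return oldRanks;
            }
            oldRanks = newRanks;
        }
        return oldRanks;
    }

Note that this is a literal translation of the steps above; part 2b iii asks you to improve on it.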

Examples

Example 1: Simple Graph

Initialize step

Let's look at an example of PageRank in action. Suppose we have four web pages arranged as shown below. Each square is a webpage, and the arrows represent out-going links. Here, page A is basically acting like the "main page": it links to all the other pages, and pages B, C, and D all link to it.

After the initialize step runs, we give each page an initial rank of 0.25:

Update step: iteration 1

We then run the update step once to obtain the following new ranks. We update the score of each web page as follows. (Note: we've picked \(d = 0.85\) as our decay factor here.)

  • Page A new rank: \(\underbrace{(d \cdot 0.25) / 1}_{\text{from page B}} + \underbrace{(d \cdot 0.25) / 1}_{\text{from page C}} + \underbrace{(d \cdot 0.25) / 1}_{\text{from page D}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.67500\)

  • Page B new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)

  • Page C new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)

  • Page D new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)

So, here is what the graph looks like after iteration 1 is finished:

Remember, the score represents "the percentage of surfers present on the page". So, after iteration 1, about 67.5% of our web surfers are present on page A. Pages B, C, and D each have about 10.8333...% of the surfers. This seems to be roughly what we want! If more surfers are on page A, then it's likely that it's a "more important page".

Update step: iteration 2

We repeat this process again on the next iteration.

  • Page A new rank: \(\underbrace{(d \cdot 0.10833) / 1}_{\text{from page B}} + \underbrace{(d \cdot 0.10833) / 1}_{\text{from page C}} + \underbrace{(d \cdot 0.10833) / 1}_{\text{from page D}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.31375\)

  • Page B new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)

  • Page C new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)

  • Page D new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)

This produces the following graph. As the web surfers move again, page A now has only about 31.4% of the web surfers; pages B, C, and D each have about 22.88%.

Update step: iteration 3

After repeating this process, we get the following results (we omit the calculations this time to save space). Notice that these scores are pretty similar to iteration 1's, but are starting to settle down somewhere between iteration 1's and iteration 2's scores:

Update step: iteration 4

And if we repeat again, the scores are similar to iteration 2's, but are starting to settle down even more. The intuition here is that the web surfers are "sloshing" back and forth between page A and pages B, C, and D. However, due to our decay factor, the percentage of surfers that moves between the pages decreases slowly over time, so the pages' scores start stabilizing and converging:

Convergence

And finally, we fast-forward to a future iteration where our scores basically stop changing and have settled down to this:

This final result represents our true page rank score.

Example 2: Cycle

Here is another example of a graph after we run one iteration of PageRank. Interestingly, it turns out that no matter how many iterations we run, the result will be exactly the same. Why do you suppose that is?

Example 3: Complex

Here is a more complex example of a graph (which includes a page with no incoming links and a page with no out-going links).

Here is the result after iteration 1:

  • Page A new rank: \( \underbrace{(d \cdot 0.2) / 1}_{\text{from page D}} + \underbrace{(d \cdot 0.2) / 5}_{\text{from page C}} + \underbrace{(1 - d) / 5}_{\text{new surfers}} = 0.23400\)

  • Page B new rank: \( \underbrace{(d \cdot 0.2) / 2}_{\text{from page A}} + \underbrace{(d \cdot 0.2) / 5}_{\text{from page C}} + \underbrace{(1 - d) / 5}_{\text{new surfers}} = 0.14900\)

  • Page C new rank: \( \underbrace{(d \cdot 0.2) / 2}_{\text{from page B}} + \underbrace{(d \cdot 0.2) / 5}_{\text{from page C}} + \underbrace{(1 - d) / 5}_{\text{new surfers}} = 0.14900\)

  • Page D new rank: \( \underbrace{(d \cdot 0.2) / 2}_{\text{from page A}} + \underbrace{(d \cdot 0.2) / 2}_{\text{from page B}} + \underbrace{(d \cdot 0.2) / 1}_{\text{from page E}} + \underbrace{(d \cdot 0.2) / 5}_{\text{from page C}} + \underbrace{(1 - d) / 5}_{\text{new surfers}} = 0.40400\)

  • Page E new rank: \( \underbrace{(d \cdot 0.2) / 5}_{\text{from page C}} + \underbrace{(1 - d) / 5}_{\text{new surfers}} = 0.06400\)

After iteration 2:

After iteration 3:

The final, converged ranks:

We recommend reading through all three examples before you begin writing code.

Also, check out these helpful slides, which walk through two iterations of PageRank.

Part 2b iii: Optimize PageRank

Task: Optimize PageRankAnalyzer.makePageRanks(...).

The version of the PageRank algorithm described above is not the most efficient one. You should try to optimize the algorithm: consider combining or rearranging certain steps and calling alternate methods from your data structures to reduce the number of operations required (but make sure you get the same (or very similar) results in the end).

You should be able to optimize the algorithm significantly: in particular, you should be able to reduce its big-\(\Theta\) runtime from the version described in the pseudocode above. (There are many other optimizations you can make to affect the constant factors as well; these may also affect your run-time significantly, but will be less impactful than the changes that reduce the asymptotic complexity.)

Also note that an inefficient implementation may have issues with floating point errors, as described in part 2a.
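
As one illustration of the kind of rearrangement this is asking for (an example of the general idea, not necessarily the change your own implementation needs): in the pseudocode above, every page with no out-going links triggers an incrementAll over every page, so a single iteration can cost much more than one pass over the graph. Those contributions can instead be accumulated into a single number and distributed in one pass at the end of the iteration, together with the new surfers. The sketch below reuses the hypothetical local variables from the earlier sketch (graph, oldRanks, newRanks, decay, numPages) and assumes newRanks starts each iteration at 0.

    // Inside one iteration of the update step.
    double sinkRank = 0.0;
    for (KVPair<URI, ISet<URI>> pair : graph) {
        URI page = pair.getKey();
        ISet<URI> links = pair.getValue();
        if (links.size() == 0) {
            // Accumulate instead of touching every page right away.
            sinkRank += oldRanks.get(page);
        } else {
            double share = oldRanks.get(page) * decay / links.size();
            for (URI target : links) {
                newRanks.put(target, newRanks.get(target) + share);
            }
        }
    }
    // One final pass distributes both the accumulated "sink" rank and the new surfers.
    double base = (sinkRank * decay + (1 - decay)) / numPages;
    for (KVPair<URI, ISet<URI>> pair : graph) {
        newRanks.put(pair.getKey(), newRanks.get(pair.getKey()) + base);
    }

Whether this particular change is what reduces your big-\(\Theta\) depends on how your first version was written; reason about where your own implementation spends its time.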

Part 2b iv: Re-run your search engine again

Task: Finish implementing PageRank.computePageRank(...) and re-run your search engine.

Finally, finish implementing the computePageRank(...) method. This method should be just one line: we don't need to do any work since we already pre-computed everything.
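
For reference, the body might look something like the following, assuming the ranks computed by makePageRanks(...) were stored in a field; the field name pageRanks here is only a placeholder, so match whatever your constructor actually stores.

    public double computePageRank(URI pageUri) {
        // All of the work happened up front in makePageRanks(...);
        // here we just look up the precomputed answer.
        return this.pageRanks.get(pageUri);
    }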

Now, try running your search engine against the "wikipedia-with-spam" dataset again. (As before, the setup phase may take some time; the engine should still answer queries nearly instantaneously.) The spam pages should now appear several places lower on the page—try typing in "seattle" and "computer science" again.

Ideally, the spam results would have vanished completely; part of the reason they don't is that the set of web pages we're looking at is very small and doesn't contain many other pages about Seattle or computer science. If we had more legitimate pages about either topic, they would have pushed the spam results down even further.

Of course, there are other limitations to PageRank: we were able to suppress spam pages that relied on keyword stuffing, but we have almost no defense against other kinds of search result manipulation. In particular, we're now vulnerable to link manipulation.

For example, what if we try constructing web pages that artificially create a large network of links? Or what if we purchase an ad on a popular website/bribe the owner to link back to our site?

It's an endless game of cat-and-mouse.

What's (mathematically) happening behind the scenes?

So, what exactly is happening behind the scenes, mathematically speaking?

Well, to put it succinctly, it turns out that what PageRank is really doing is using the power method to compute an approximation of the principal eigenvector of a (diagonalizable) Markov transition matrix representing the link structure of the internet. This eigenvector is in fact exactly the set of page ranks we precompute in makePageRanks(...) and look up in computePageRank(...).
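
In that language, one update step of the algorithm above can be written as a single matrix-vector recurrence (the symbols below are ours, chosen to match the description earlier on this page):

\[ \mathbf{r}^{(t+1)} = d \, M \, \mathbf{r}^{(t)} + \frac{1 - d}{N} \, \mathbf{1} \]

where \(\mathbf{r}^{(t)}\) is the vector of page ranks after iteration \(t\), \(d\) is the decay factor, \(N\) is the number of pages, \(\mathbf{1}\) is the all-ones vector, and \(M\) is the transition matrix whose entry \(M_{ij}\) equals \(1 / \text{numUniqueLinks}(j)\) if page \(j\) links to page \(i\), \(1/N\) if page \(j\) has no out-going links, and \(0\) otherwise. Since the ranks always sum to 1, applying this recurrence is the same as multiplying by the matrix \(d M + \frac{1 - d}{N} \mathbf{1}\mathbf{1}^\top\), so iterating until \(\mathbf{r}\) stops changing is exactly the power method, and the vector it converges to is the eigenvector described above.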

If you've never taken a linear algebra class before (e.g. Math 308), that paragraph probably sounded like complete gibberish. That's ok—the punchline is that you can basically go around telling your friends and family that you just wrote an algorithm to "compute the eigenvector of the internet" and sound all fancy and mysterious.

If you're curious and want to learn more, see the following links, which go into a little more detail about the underlying math:

Part 2b v: Complete individual feedback

Task: Submit responses to the Canvas survey.

Each group member should answer and submit these questions independently on this Canvas survey. As a general rule of thumb, a good set of responses should take around 5 minutes to write, but feel free to include as much detail as you want. Points will only be deducted for very short answers.

Deliverables

The following deliverables are due on Wednesday, May 22 at 11:59pm.

Submit by pushing your code to GitLab. If you intend to submit late, fill out this late submission form when you submit.

Before submitting, be sure to double-check and make sure that...

  • Your PageRankAnalyzer class is fully implemented and tested.
  • Your search engine still runs, is fully functional, and will correctly sort and search results as described in part 2a.
  • You ran the checkstyle plugin within IntelliJ and fixed all style errors.

Both partners should turn in their answers to the individual writeup on Canvas if you haven't already. This is due at the same time as the rest of your project.