Overview
    Due Fri, May 18 at 11:59pm
    In part 2a of this project, we discovered that while TF-IDF
    works fine in many cases, it's also exploitable.
    Websites that want to maximize search traffic to gain more ad revenue
    or attract more customers can artificially inflate their score by
    keyword stuffing: cramming each webpage full of as many
    different words and phrases as possible.
    We saw this for ourselves in part 2a by running the search engine against the
    "wikipedia-with-spam" dataset, which contains a mixture of legitimate webpages
    and spam webpages. We get some useful results, but an alarming amount of spam.
    We need a second metric to help penalize that behavior. We will use
    a simplified version of PageRank, an algorithm invented by (and named 
    after) Larry Page, one of the founders of Google. PageRank is not the only
    algorithm Google uses, but it is one of their more widely known ones.
    You will use these files from prior assignments:
    
        - main.java.datastructures.concrete.dictionaries.ChainedHashDictionary.java
        - main.java.datastructures.concrete.dictionaries.ArrayDictionary.java
        - main.java.datastructures.concrete.ArrayHeap.java
        - main.java.datastructures.concrete.ChainedHashSet.java
        - main.java.datastructures.concrete.DoubleLinkedList.java
        - main.java.misc.Searcher.java
You will be modifying the following file:
    
        - main.java.search.analyzers.PageRankAnalyzer.java
Additionally, here are a few more files that you might want to review
    while completing the assignment (note that this is just a starting point,
    not necessarily an exhaustive list):
    
        - main.java.search.models.Webpage.java
        - main.java.search.Main.java
 
        
    Part 2b i: Build the web graph
    
    Task: Implement PageRankAnalyzer.makeGraph(...).
    PageRank works by analyzing a directed graph representing the
    internet: each webpage is a vertex, and each link is an edge.
    So, if we want to implement PageRank, we need to first build this graph!
    Some notes:
    
        - You should represent this graph using an adjacency list.
            Discuss with your partner: why is an adjacency matrix likely not the best
            choice here?
        - If a webpage has multiple outgoing links to the same destination, you should
            add only a single edge to your graph. That is, there should be no duplicate
            edges.
        - If a webpage links to itself, you should omit that link. That is, there should
            be no self-loops in your graph.
        - Your makeGraph(...) method will be given a set of web pages
            as input. If any of those web pages links to something that is not
            contained within that set, ignore that link. We want our graph to be
            self-contained: every edge should point to an existing vertex.
            (A rough sketch of this step appears after these notes.)
        - Please make sure to read the other notes
            and hints section from the part 2a spec – in particular, the
            bullet points about working with iterators.
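    To make the notes above concrete, here is a minimal, hedged sketch of what the
    graph-building step might look like. It assumes the IDictionary/ISet interfaces
    and the ChainedHashDictionary/ChainedHashSet implementations from the earlier
    parts of the project, and assumes Webpage exposes its URI and outgoing links
    via methods named getUri() and getLinks(); check Webpage.java and
    PageRankAnalyzer.java for the actual names and signatures, which may differ.

        // Sketch only: the project-specific types (IDictionary, ISet,
        // ChainedHashDictionary, ChainedHashSet, Webpage) and the methods
        // getUri()/getLinks() are assumptions; URI here is java.net.URI.
        private IDictionary<URI, ISet<URI>> makeGraph(ISet<Webpage> webpages) {
            // Record which URIs are actually in the input set, so that links
            // pointing outside the set can be ignored.
            ISet<URI> knownUris = new ChainedHashSet<>();
            for (Webpage page : webpages) {
                knownUris.add(page.getUri());
            }

            IDictionary<URI, ISet<URI>> graph = new ChainedHashDictionary<>();
            for (Webpage page : webpages) {
                URI source = page.getUri();
                // Using a set for the adjacency "list" removes duplicate edges.
                ISet<URI> edges = new ChainedHashSet<>();
                for (URI target : page.getLinks()) {
                    // Skip self-loops and links to pages not in the input set.
                    if (!target.equals(source) && knownUris.contains(target)) {
                        edges.add(target);
                    }
                }
                graph.put(source, edges);
            }
            return graph;
        }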
 
    Part 2b ii: Implement PageRank
    Task: Implement PageRankAnalyzer.makePageRanks(...).
    Next, you will implement the makePageRanks(...) method,
    which will precompute the page rank for every webpage in your graph.
    This method should implement the core algorithm as described below:
    Core algorithm
    PageRank works by ignoring the user's query and instead
    computing the relative "importance" or "reputability" of a webpage
    based on what webpages link to it. The intuition is that if many
    different webpages link to webpage A, then many people are likely to
    naturally stumble across webpage A just by following links. As a result,
    we can conclude that webpage A is most likely "important" or
    "reputable".
    
    Similarly, if webpage A has that many visitors and
    links to just one other webpage, webpage B, we can assume
    that many visitors of webpage A are likely to click on that single link.
    This means that webpage B will "share" some of the importance of
    webpage A.
    To put it another way, we are looking at the graph representing
    the internet and trying to emulate how people will naturally traverse
    that graph!
    PageRank works by emulating the above process using the below
    algorithm:
    
        - Step 1: Initially, give each webpage a rank of \(\frac{1}{N}\), where
            \(N\) is the total number of webpages. Intuitively, we're saying that at the
            very start, about \(\frac{1}{N}\) of all web surfers will start on that
            webpage: each webpage initially has an equal percentage of visitors.
        - Step 2: Next, we compute the new page rank for every webpage by simulating
            the expected behavior of our web surfers. (A single formula combining the
            sub-steps below appears after this list.)
            - First, give every webpage a new page rank of \(0.0\).
            - Next, take the old page rank of every webpage and share it
                equally with every webpage it links to. That is, loop through each
                vertex (each webpage) and increase the new rank of every adjacent
                vertex by the following amount:
                $$d \times \frac{\text{old page rank}}{\text{number of unique links}}$$
                Here, \(d\) is the "decay" factor: it's a parameter that represents
                the rate at which your surfers are willing to "continue clicking"
                links. We usually set \(d\) to roughly \(0.85\): this means that
                about 85% of your users will be willing to continue visiting new
                webpages; the other 15% simply stop where they are.
            - This decay factor also helps our page ranks converge: it ensures that
                every time we run the update step, our page ranks change less and
                less. Note that the closer we set \(d\) to 0, the faster we converge
                (but the more inaccurate our results are likely to be).
            - What happens if a webpage has zero outgoing links? There are no pages
                to update, so what happens to the 85% of our visitors who would have
                clicked on a link? If the webpage has no links, we assume those
                visitors jump to a random webpage. That is, if the current page has
                no outgoing links, increase the new page rank of every webpage by:
                $$d \times \frac{\text{old page rank}}{N}$$
                In other words, we pretend those surfers "jump" to some random page.
            - Finally, add \(\frac{1 - d}{N}\) to every webpage's new page rank.
                The technical reason we do this is to make sure our page ranks in
                total represent a valid probability distribution: we don't want to
                actually "lose" web surfers when we apply the decay factor.
                Alternatively, you can think of this value as representing the
                probability that some random person starts on the current webpage
                out of the blue.
        - Step 3: We have just completed one iteration of the page rank algorithm!
            We now need to decide whether we need to keep going or whether it's ok
            to stop. Compare the difference between the old page rank and the new
            page rank for each individual webpage. If the difference between the two
            for every single webpage is less than some user-specified value
            \(\epsilon\), we can stop iterating: we say our page ranks have
            converged. (Note: \(\epsilon\) is the Greek lower-case letter "epsilon",
            and is traditionally used to represent some sort of very small number.)
            However, if even a single webpage has a difference greater than
            \(\epsilon\), we replace the old page ranks with the new ones we just
            computed and repeat step 2.
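    Putting the pieces of the update step together (this is just a restatement of
    the bullets above, not an additional requirement), the new rank of a webpage
    \(v\) can be written as:
    $$\text{newRank}(v) = \frac{1 - d}{N}
        + d \sum_{u \text{ links to } v} \frac{\text{oldRank}(u)}{\text{outDegree}(u)}
        + d \sum_{u \text{ with no outgoing links}} \frac{\text{oldRank}(u)}{N}$$
    Here, \(\text{outDegree}(u)\) is the number of unique links out of page \(u\).
    Whether you accumulate contributions page-by-page (as described above) or
    compute this sum directly, the result is the same.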
 
 
 
Final note: you are not required to implement the PageRank
    algorithm exactly as described above. You may combine or rearrange
    certain steps to make your algorithm more efficient if you wish, as
    long as you get the same (or very similar) results in the end.
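    To illustrate one possible structure, here is a rough, hedged sketch of the
    iteration described above. The parameter names (decay, limit, epsilon), the
    KVPair iteration style, and the dictionary/set types are assumptions based on
    the earlier parts of the project; the actual signature in
    PageRankAnalyzer.java may differ, and you are free to organize the loop
    differently.

        // Sketch only: IDictionary, ISet, ChainedHashDictionary, and KVPair are
        // assumed to behave like the interfaces from earlier assignments, and
        // URI is java.net.URI. Verify names and signatures against your own code.
        private IDictionary<URI, Double> makePageRanks(IDictionary<URI, ISet<URI>> graph,
                                                       double decay,
                                                       int limit,
                                                       double epsilon) {
            int n = graph.size();

            // Step 1: every page starts with rank 1/N.
            IDictionary<URI, Double> oldRanks = new ChainedHashDictionary<>();
            for (KVPair<URI, ISet<URI>> pair : graph) {
                oldRanks.put(pair.getKey(), 1.0 / n);
            }

            for (int i = 0; i < limit; i++) {
                // Step 2 (first sub-step): reset every new rank to 0.0.
                IDictionary<URI, Double> newRanks = new ChainedHashDictionary<>();
                for (KVPair<URI, ISet<URI>> pair : graph) {
                    newRanks.put(pair.getKey(), 0.0);
                }

                // Step 2 (sharing): each page gives away d * oldRank, split equally
                // among its unique links; a page with no outgoing links shares with
                // every page instead.
                for (KVPair<URI, ISet<URI>> pair : graph) {
                    URI page = pair.getKey();
                    ISet<URI> links = pair.getValue();
                    double share = decay * oldRanks.get(page);
                    if (links.size() == 0) {
                        for (KVPair<URI, ISet<URI>> other : graph) {
                            URI target = other.getKey();
                            newRanks.put(target, newRanks.get(target) + share / n);
                        }
                    } else {
                        for (URI target : links) {
                            newRanks.put(target, newRanks.get(target) + share / links.size());
                        }
                    }
                }

                // Step 2 (final sub-step) and step 3: add (1 - d) / N to every page,
                // then check whether every page changed by less than epsilon.
                boolean converged = true;
                for (KVPair<URI, ISet<URI>> pair : graph) {
                    URI page = pair.getKey();
                    double updated = newRanks.get(page) + (1 - decay) / n;
                    newRanks.put(page, updated);
                    if (Math.abs(updated - oldRanks.get(page)) >= epsilon) {
                        converged = false;
                    }
                }

                if (converged) {
                    return newRanks;
                }
                oldRanks = newRanks;
            }
            // If we never converge within the iteration limit, return the latest ranks.
            return oldRanks;
        }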
    
    
        
        
            
                
        Initialize step
        Let's look at an example of PageRank in action. Suppose we have four
        web pages arranged as shown below. Each square is a webpage, and the arrows
        represent outgoing links. Here, page A is basically acting as the
        "main page": it links to all the other pages, and pages B, C, and D all link to it.
        After the initialize step runs, we give each page an initial rank of
        0.25:
        
        Update step: iteration 1
        We then run the update step once to obtain the following new ranks.
        We update the score of each web page as follows. (Note: we've
        picked \(d = 0.85\) as our decay factor here.)
        
            - Page A new rank: \(\underbrace{(d \cdot 0.25) / 1}_{\text{from page B}} + \underbrace{(d \cdot 0.25) / 1}_{\text{from page C}} + \underbrace{(d \cdot 0.25) / 1}_{\text{from page D}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.67500\) 
            - Page B new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)
            - Page C new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)
            - Page D new rank: \(\underbrace{(d \cdot 0.25) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.10833\)
So, here is what the graph looks like after iteration 1 is finished:
        
        Remember, the score represents "the percentage of surfers present on the page". So,
        after iteration 1, about 67.5% of our web surfers are present on page A. Pages
        B, C, and D each have about 10.8333...% of the surfers. This seems to be roughly
        what we want! If more surfers are on page A, then it's likely a "more important
        page".
        Update step: iteration 2
        We repeat this process again on the next iteration.
        
            - Page A new rank: \(\underbrace{(d \cdot 0.10833) / 1}_{\text{from page B}} + \underbrace{(d \cdot 0.10833) / 1}_{\text{from page C}} + \underbrace{(d \cdot 0.10833) / 1}_{\text{from page D}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.31375\) 
            - Page B new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)
            - Page C new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)
            - Page D new rank: \(\underbrace{(d \cdot 0.67500) / 3}_{\text{from page A}} + \underbrace{(1 - d) / 4}_{\text{new surfers}} = 0.22875\)
        This produces the following graph. As the web surfers move again, page A now only
        has about 31.4% of the web surfers; pages B, C, and D each have about 22.88%.
        
        Update step: iteration 3
        After repeating this process, we get the following results (we omit the calculations
        this time to save space). Notice that these scores are pretty similar to iteration 1's, but
        are starting to settle down somewhere between iteration 1's and iteration 2's scores:
        
        Update step: iteration 4
        And if we repeat again, the scores are similar to iteration 2's, but are starting
        to settle down even more. The intuition here is that the web surfers are "sloshing" back
        and forth between page A and pages B, C, and D. However, due to our decay factor, the
        percentage of surfers that moves between the pages decreases slowly over time, so the pages'
        scores start stabilizing and converging:
        
        Update step: iteration 100
        And finally, if we fast-forward to iteration 100, our scores basically stop changing
        and have settled down to this:
        
        This final result represents our true page rank score.
    
             
         
     
    
    
        
        
            
                
        Here is another example of a graph after we run one iteration of PageRank. Interestingly,
        it turns out that no matter how many iterations we run, the result will be exactly the same.
        Why do you suppose that is?
        
        
    
             
         
     
    
    
        
        
            
                
        Here is a more complex example of a graph (which includes a page with no incoming
        links).
        Here is the result after iteration 1:
        
        After iteration 2:
        
        After iteration 3:
        
        After iteration 100 (the final, converged ranks):
        
    
             
         
     
 
    Part 2b iii: Report and analyze results
    Task: Finish implementing PageRankAnalyzer.computePageRank(...) and re-run the search engine.
    Finally, finish implementing the computePageRank(...)
    method. This method should be just one line: we don't need to do any
    work since we already pre-computed everything.
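    For example (a sketch only; the field name pageRanks and the parameter name
    pageUri are assumptions, so use whatever names your PageRankAnalyzer.java
    actually has), that one line might look something like:

        // Look up the rank precomputed in makePageRanks(...).
        // `pageRanks` and `pageUri` are assumed names, not required ones.
        public double computePageRank(URI pageUri) {
            return this.pageRanks.get(pageUri);
        }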
    Now, try running your search engine against the "spam" dataset
    again. As before, the setup phase may take some time; the engine should
    still answer queries nearly instantaneously. And if we try loading the
    "wikipedia" and "wikipedia-with-spam" pages, the spam pages should
    now appear several places lower on the page – try typing in "seattle"
    and "computer science" again.
    (Ideally, the spam results would have vanished completely – the reason they
    don't is partly because the datasets we're looking at are very minimal and
    don't contain that many pages about either Seattle or computer science.
    If we had more legitimate pages about either topic, they would have pushed down
    the spam results even further.)
    
    Of course, there are other limitations to PageRank: we were able to suppress spam
    pages that relied on keyword stuffing, but we have almost no
    defense against other kinds of search result manipulation: in particular,
    we're now vulnerable to link manipulation.
   
    For example, what if we try constructing web pages that artificially
    create a large network of links? Or what if we purchase an ad on a popular
    website/bribe the owner to link back to our site?
    It's an endless game of cat-and-mouse.
    
    
        
        
            
                
        So, what exactly is happening behind the scenes, mathematically speaking?
        Well, to put it succinctly, it turns out that what PageRank is really doing is
        computing an approximation of the principal eigenvector of a (diagonalizable)
        Markov transition matrix representing the link structure of the internet, via the
        power method. This eigenvector is in fact exactly the vector we return from the
        computePageRank(...) method.
        If you've never taken a linear algebra class before (e.g. Math 308), that 
        paragraph probably sounded like complete gibberish. That's ok – the punchline
        is that you can basically go around telling your friends and family that
        you just wrote an algorithm to "compute the eigenvector of the internet" and sound 
        all fancy and mysterious.
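        If you have seen some linear algebra, here is a slightly more concrete sketch
        of that connection (the notation below is ours, not something you need for the
        assignment). If we collect all the page ranks into a vector \(\vec{r}\), then
        each update step from part 2b ii computes roughly
        $$\vec{r}_{\text{new}} = d \, M \, \vec{r}_{\text{old}} + \frac{1 - d}{N} \, \vec{1}$$
        where \(M\) is the transition matrix in which each page spreads its rank evenly
        across the pages it links to (with pages that have no outgoing links treated as
        linking to every page), and \(\vec{1}\) is the all-ones vector. Repeatedly
        applying this update is essentially the power method, and the vector it
        converges to is the eigenvector mentioned above.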
        If you're curious and want to learn more, see the following links, which
        go into a little more detail about the underlying math:
        
    
             
         
     
 
    
    Part 2b iv: Complete writeup
    Task: submit answers to the following questions.
    Each group member should answer and submit these questions 
    independently at
    this canvas page.
    You must submit your answers in either .txt or
    .pdf form.
    
        - Questions about the project:
                - What is your name and your partner's name?
                - How was the project? Did you enjoy it?
                - Do you have any suggestions for improving the project?
        - Questions about your partner (skip if you worked solo):
                - Which parts of the project did you work on, and which
                    parts did your partner program? Did you try pair
                    programming?
                - How was your partnership? Do you feel you and your partner
                    contributed equally to this project?
 
 
    
    Deliverables
    The following deliverables are due on Fri, May 18 at 11:59pm.
    A single person from your partnership should
    submit a zip file containing the entire contents of your src folder
    on Canvas.
    Before submitting, be sure to double-check and make sure that...
    
    Both partners should turn in a .txt or
    .pdf file containing their answers to part 2b iv
    on Canvas
    if they haven't already. This is due at the same time as
    the rest of your project.