NeighborSearch

A hypertext search result ranking scheme

Daniel N. Wood

20 March 1998

Motivation

The techniques used by web search engines to determine which pages match a query, and how well they match, leave room for improvement. Queries are expressed as a set of keywords, and the relevance of a page is expressed as a real number. The techniques used by commercial search engines are proprietary, but it appears that the larger engines use methods similar to the simpler ones described in the literature [FO95]. Although integrating semantic knowledge, up to and including full natural-language parsing, offers the potential for an enormous improvement in search engines, further progress can be made without reliance on hard-core AI.

In particular, an important portion of the web is completely ignored by current search engines: the links between pages. I believe that by considering the context of a page (i.e., its neighborhood in the graph formed by web pages and hypertext links) we can better determine the relevance of a web page to a search query. Intuition suggests that a page should be considered better if its neighbors in the web are "good" pages (that is, they match the search query by some standard criterion), and vice versa. Furthermore, for broad queries [K98] there will often be a large number of highly relevant pages. By considering the link structure, the search engine can offer pages that provide good coverage of this space instead of returning a set of tightly interlinked pages. Finally, we can rate highly those pages that link to many other good pages: a set of links created by hand is typically easier to navigate than the results of a search query.

I have implemented NeighborSearch, an attempt to incorporate these ideas into a search engine. NeighborSearch currently exists as a non-real-time version suitable for exploring the space of possible ranking schemes. I believe it could be reimplemented in a much faster form by precomputing and storing more information, much as search engines do now.

Implementation

I begin by generating a graph encompassing a reasonable subspace of the web for a particular query. I send the query keywords to the MetaCrawler [MC] and use the first 10 results as my base set. I then fill out the graph by retrieving the children and grandchildren of these pages. As each page is retrieved, I record its child URLs and compute a simple keyword-presence score for it: the number of occurrences of query keywords in the document divided by the number of words in the document. Although this technique cannot be expected to compete with the methods used by commercial search engines, it serves as a reasonable starting point for examining what benefit the page context can provide. If I can generate high-quality rankings using a combination of these simple scores and the page context, the initial scoring technique can later be replaced with something stronger.

The number of pages generated by expanding an initial set of size 10 for 2 steps is astounding, so I tried to limit the number of pages downloaded. (I also cached the downloaded pages, so that my debugging would not place an undue burden on any servers.) Because many completely unrelated pages turn up in this set, I truncated the search by leaving out, and not examining the children of, pages with a zero score. I then create three sets of refined scores in succession, each building on the one before.
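
As a concrete illustration, the following sketch computes this keyword-presence score. It is not the original code; the function and variable names are hypothetical, and the tokenization rule is my own choice.

    import re

    def simple_score(document_text, keywords):
        """Number of keyword occurrences divided by the number of words."""
        words = re.findall(r"[A-Za-z0-9']+", document_text.lower())
        if not words:
            return 0.0
        keyword_set = {k.lower() for k in keywords}
        hits = sum(1 for w in words if w in keyword_set)
        return hits / len(words)

For example, simple_score(page_text, ["software", "dvd", "encoding"]) returns 0.5 for a two-word page containing exactly one of the keywords.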

First, I create what I call the context score, which uses the context of a page to determine whether the page is really relevant. For each page in the graph, treating links as undirected, I find the average of the simple scores of all pages at distance 1 and at distance 2. The context score of the page is then a weighted average of its own simple score and the distance-1 and distance-2 averages, with weights 1, 0.1, and 0.01 respectively. I use undirected links because, at this general level, links to and links from "good" pages (pages with a high simple score) both suggest that a page may itself be good.
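
A minimal sketch of the context score, assuming the graph is stored as an adjacency mapping from each page to its undirected neighbors. The names, and the decision to treat an empty neighbor ring as contributing zero, are my own assumptions rather than details of the original implementation.

    from collections import deque

    def neighbors_within(graph, start, max_dist):
        """Breadth-first search; returns {page: distance} for every page
        within max_dist links of start (start itself excluded)."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            if dist[page] == max_dist:
                continue
            for nbr in graph.get(page, ()):
                if nbr not in dist:
                    dist[nbr] = dist[page] + 1
                    queue.append(nbr)
        del dist[start]
        return dist

    def context_score(undirected_graph, simple_scores, page):
        """Weighted average of a page's simple score and the mean simple
        scores of its distance-1 and distance-2 neighborhoods."""
        d = neighbors_within(undirected_graph, page, 2)
        ring1 = [simple_scores[p] for p, k in d.items() if k == 1]
        ring2 = [simple_scores[p] for p, k in d.items() if k == 2]
        avg1 = sum(ring1) / len(ring1) if ring1 else 0.0  # assumption: empty ring counts as 0
        avg2 = sum(ring2) / len(ring2) if ring2 else 0.0
        w0, w1, w2 = 1.0, 0.1, 0.01
        return (w0 * simple_scores[page] + w1 * avg1 + w2 * avg2) / (w0 + w1 + w2)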

Next I create the utility score, which is constructed like the context score with three differences. First, it is based on the previously computed context scores rather than the simple scores. Second, the weighting falloff at each level is 0.5 instead of 0.1. Most importantly, when finding the pages 1 and 2 links away, I follow links only in their standard direction. This asymmetry reflects the fact that links to good pages are useful to the reader, while links from good pages cannot be followed during normal web browsing.
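
Under the same assumptions, and reusing the neighbors_within helper from the previous sketch (now applied to a forward-only adjacency mapping), the utility score might look like this; a falloff of 0.5 per level gives weights 1, 0.5, and 0.25.

    def utility_score(forward_graph, context_scores, page):
        """Like context_score, but follows links forward only, starts from
        the context scores, and uses a 0.5 falloff per level."""
        d = neighbors_within(forward_graph, page, 2)
        ring1 = [context_scores[p] for p, k in d.items() if k == 1]
        ring2 = [context_scores[p] for p, k in d.items() if k == 2]
        avg1 = sum(ring1) / len(ring1) if ring1 else 0.0
        avg2 = sum(ring2) / len(ring2) if ring2 else 0.0
        w0, w1, w2 = 1.0, 0.5, 0.25  # falloff of 0.5 per level
        return (w0 * context_scores[page] + w1 * avg1 + w2 * avg2) / (w0 + w1 + w2)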

Finally, I create a final ranking of the pages, using the utility score as a starting point. I choose the page with the highest score and give it first place. I then find all pages within 3 links of it (traversing links forward only) and reduce their scores by a diminishment factor that depends on distance (1 away: 0.01, 2 away: 0.1, 3 away: 0.3). I repeat this process until the rest of the pages have been placed in order. The diminishment serves two related purposes. First, it makes it more likely that the highly ranked pages will represent different regions of the web. Second, it reflects the fact that if a page is easily reachable from one of the highly ranked pages, the reader may not need or want to see it separately in the search results.
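
A sketch of this final ranking pass, again reusing neighbors_within on the forward-only graph. I read the diminishment as a multiplicative penalty (so the nearest pages, multiplied by 0.01, are penalized most heavily); that reading is an assumption, though it is the one consistent with the diversity goal described above.

    def final_ranking(forward_graph, utility_scores):
        """Repeatedly pick the best remaining page, then diminish the scores
        of pages reachable from it within 3 forward links."""
        scores = dict(utility_scores)        # working copy, mutated below
        factor = {1: 0.01, 2: 0.1, 3: 0.3}   # assumption: multiplicative penalty
        ranking = []
        remaining = set(scores)
        while remaining:
            best = max(remaining, key=lambda p: scores[p])
            ranking.append(best)
            remaining.discard(best)
            for page, k in neighbors_within(forward_graph, best, 3).items():
                if page in remaining:
                    scores[page] *= factor[k]
        return ranking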

Results

To determine the quality of each ranking, I enlisted the help of two human test subjects. I asked each of them to provide some search queries, which I ran through NeighborSearch. For each query, the results were sorted by the minimum of three ranks: the original MetaCrawler rank, the rank by simple score, and the final rank. I then took the best 20 and gave them to the subject to examine and rate. (I originally asked the subjects to rank the pages, but this was judged too time-consuming, so they simply rated them.) To score each ranking scheme, I took every pair of URLs and gave the scheme a point whenever its ordering of the pair agreed with the user's ratings. Here are the results of the three tests I was able to conduct:
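
A sketch of this pairwise scoring, assuming that lower rank numbers are better and that pairs tied in the user's ratings earn no point either way; the original's treatment of ties is not specified, so that choice is mine.

    from itertools import combinations

    def pairwise_agreement(scheme_rank, user_rating, urls):
        """Count URL pairs whose ordering under a ranking scheme agrees
        with the user's ratings (lower rank number = better)."""
        points = 0
        for a, b in combinations(urls, 2):
            if user_rating[a] == user_rating[b]:
                continue  # assumption: ties in the ratings score no point
            user_prefers_a = user_rating[a] > user_rating[b]
            scheme_prefers_a = scheme_rank[a] < scheme_rank[b]
            if user_prefers_a == scheme_prefers_a:
                points += 1
        return points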

Query = software dvd encoding

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.mgm.com/dvd/  --  0.500  001  002  1
http://www.scbbs.com/  --  0.500  002  001  1
http://www.unik.no/~robert/hifi/dvd/pro.html  01  0.079  021  010  8
http://www.visiblelight.com/  02  0.044  081  328  7
http://www.dvd-onestop.com/  --  0.333  003  004  1
http://www.zumadigital.com/  --  0.250  005  003  1
http://www.dvcc.com/  03  0.036  127  241  4
http://www.dvdvideogroup.com/  --  0.333  004  005  1
http://newmedia.com/NewMedia/97/15/feature/Develop_DVD_ROM.html  04  0.031  163  291  10
http://www.ividea.com/  05  0.021  267  384  1
http://www.zumadigital.com/zumadigital/contact.html  --  0.250  006  020  1
http://www.nfoweb.com/  --  0.005  751  006  1
http://www.sanyo-verbatim.com/dvdindx.htm  --  0.166  007  015  1
http://www.dolby.com/dvd/  --  0.070  029  007  4
http://www.compaq.com/athome/showroom/tech/dvdrom.html  07  0.028  190  014  2
http://www.vsda.org/  --  0.166  008  233  1
http://www.dvdinsider.com/charts.html  --  0.088  017  008  6
http://www.idg.net/idg_frames/english/content.cgi?vc=story_1259%2ehtml  08  0.071  026  012  1
http://newmedia.com/NewMedia/97/15/feature/dvd4.html  --  0.106  009  011  3
http://WWW.D2PROD.COM/DVD.html  --  0.102  012  009  6

Ranking Scheme  MetaCrawler  Simple  Final
Result          133          124     115

Query = bill gates house

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.billg.org/secretdiary/ads/ad1.htm  --  0.285  001  002  5
http://www.teamgates.com/other/billgame.asp  --  0.134  006  001  5
http://www.zpub.com/un/bill/ecology.html  01  0.052  029  045  10
http://www.billg.org/secretdiary/  --  0.250  002  003  5
http://www4.usnews.com/usnews/nycu/gates.htm  02  0.082  014  092  10
http://www.roadahead.com/  --  0.200  003  072  5
http://www.zpub.com/un/bill/  03  0.051  030  088  5
http://www.quuxuum.org/~evan/awards.html  --  0.142  004  022  5
http://nytsyn.com/live/cndtopics.html  --  0.111  009  004  0
http://www4.usnews.com/usnews/nycu/gatehigh.htm  04  0.050  032  064  7
http://www.jas.com/ms-shame.shtml  --  0.142  005  057  5
http://www.tiac.net/users/billg40/main/index.htm  --  0.048  033  005  5
http://www.iservco.com/~dave/humor/joke85.html  05  0.018  101  100  10
http://www.mckinley.com/search.gw?....  --  0.056  025  006  0
http://www.urz.uni-heidelberg.de/subject/hd/fak7/hist/o1/logs/mt/t49/0021.html  06  0.004  207  103  0
http://nytsyn.com/live/Gates/141_052197_104207_14626.html  --  0.005  189  007  5
http://www.teamgates.com/other/links.asp  --  0.124  007  077  5
http://www.microsoft.com/BillGates/billgates_l/features/pastfeatures.htm  --  0.117  008  023  5
http://www4.usnews.com/usnews/news/michigh.htm  --  0.008  169  008  5
http://nytsyn.com/live/Gates/182_070197_142208_1682.html  --  0.005  196  009  5

Ranking Scheme  MetaCrawler  Simple  Final
Result          106          105     104

Query = northwestern wildcats football

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.enteract.com/~cjc/nu-football/  01  0.230  005  151  10
http://www.cris.com/~Jjack/wildcats.shtml  05  0.750  001  001  10
http://www.ou.edu/athweb/version1/sports/football/  --  0.500  002  002  5
http://users.aol.com/aactchrdeb/cats.htm  02  0.053  056  013  10
http://members.aol.com/jimmyten/index.html  --  0.333  003  005  5
http://www.enteract.com/~cjc/nu-football/previous/previous.html  --  0.104  014  003  7
http://www.enteract.com/~cjc/nu-football/noframes/  03  0.022  166  127  7
http://www.dancris.com/~pedro/SunDevils.html  --  0.333  004  006  0
http://www.enteract.com/~cjc/nu-football/sites.html  --  0.163  009  004  7
http://www.sportingnews.com/cfootball/teams/northwestern/ ...  04  0.025  143  047  10
http://www.nusports.com/audio/index.html  --  0.200  006  111  5
http://sportingnews.com/cfootball/teams/northwestern/archive.html  06  0.167  007  012  10
http://pubweb.nwu.edu/~jle096/NU.htm  07  0.077  023  007  10
http://www.buckeyezone.com/  --  0.166  008  010  5
http://www.nwu.edu/factbook/  --  0.120  012  008  5
http://www.ruf.rice.edu/~ricesid/rsp/97-98/fb/09-15.html  08  0.007  369  582  5
http://www.personal.psu.edu/users/c/w/cwb129/bigten/  --  0.032  105  009  5
http://www.enteract.com/~cjc/nu-football/schedule.html  --  0.133  010  119  10
http://www.nrc.pair.com/big10.html  10  0.042  079  155  5
http://www.nwu.edu/search/  --  0.125  011  177  5

Ranking Scheme  MetaCrawler  Simple  Final
Result          143          119     157

My new ranking scheme fared unspectacularly in these tests. In the third example, where it was most successful, the MetaCrawler's score was probably artificially lowered because all of the pages the MetaCrawler left unranked were treated as tied.

Conclusion

The potential of even the simple scheme implemented in the current NeighborSearch has not been fully determined. A large number of parameters need further tuning before a final evaluation can be performed. This tuning could be done by hand, but an automatic scheme analogous to that described in [BCB94] would be more valuable. Unfortunately, Bayesian inference over general graphs is NP-hard.

The technique developed by Kleinberg [K98] has demonstrated good results while completely ignoring the content of the individual pages (after the initial pruning required to generate the starting subweb). Using only the topology of the graph, Kleinberg is able to tease apart pages that use different meanings of the search terms. My algorithm implicitly assumes that any two pages with a high score receive their high scores for the same reason; this assumption fails when the search terms are ambiguous. An extension of NeighborSearch to a more explicitly topological formulation of web-page properties (a la Hubs and Authorities) might be more successful.

Finally, perhaps the most important task is to reformulate any one of the context-sensitive ranking algorithms in a fashion that can produce real-time responses to search queries. Even if the results aren't astounding, a system fast enough to be usable would allow further research to proceed on making it useful. After all, the fast drives out the slow, even if the fast is wrong.

References

[BCB94] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew, Automatic Combination of Multiple Ranked Retrieval Systems, Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 173-181, 1994.

[FO95] C. Faloutsos and D. W. Oard, A Survey of Information Retrieval and Filtering Methods, Technical Report CS-TR-3514, University of Maryland, 1995.

[FC89] Mark E. Frisse and Steve B. Cousins, Information Retrieval from Hypertext: Update on the Dynamic Medical Handbook Project, Proceedings of Hypertext '89, ACM Press, 1989.

[K98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.

[MC] MetaCrawler, http://www.metacrawler.com