NeighborSearch

A hypertext search result ranking scheme

Daniel N. Wood

20 March 1998

Motivation

The techniques used by web search engines to determine which pages match a query, and how well they match, leave room for improvement. Queries are expressed as a set of keywords, and the relevance of a page is expressed as a real number. The techniques used by commercial search engines are proprietary, but it appears that the larger engines use methods similar to the simpler ones described in the literature [FO95]. Although integrating semantic knowledge, up to and including full natural-language parsing, offers the potential for an enormous improvement in search engines, further progress can be made without reliance on hard-core AI.

In particular, an important portion of the web is completely ignored by current search engines: the links between pages. I believe that by considering the context of a page (i.e., its neighborhood in the graph formed by web pages and hypertext links) we can better determine the relevance of a web page to a search query. Intuition suggests that a page should be considered better if its neighbors in the web are "good" pages (that is, they match the search query by some standard criterion), and vice versa. Furthermore, for broad queries [K98] there will often be a large number of highly relevant pages. By considering the link structure, the search engine can offer pages that provide good coverage of this space instead of returning a set of tightly interlinked pages. Finally, we can rate highly those pages that link to many other good pages: a set of links created by hand is typically easier to navigate than the results of a search query.

I have implemented NeighborSearch, an attempt to incorporate these ideas into a search engine. NeighborSearch currently exists as a non-real-time version suitable for exploring the space of possible ranking schemes. I believe it could be reimplemented in a much faster form by precomputing and storing more information, much as search engines do now.

Implementation

I begin by generating a graph encompassing a reasonable subspace of the web for a particular query. I send the query keywords to the MetaCrawler [MC] and use the first 10 results as my base set. I then fill out the graph by retrieving the children and grandchildren of these pages. As each page is retrieved, I record its child URLs and compute a simple keyword-presence score for it: the number of occurrences of query keywords in the document divided by the number of words in the document. Although this technique cannot be expected to compete with the methods used by commercial search engines, it serves as a reasonable starting point for examining what benefit the page context can provide. If I can generate high-quality rankings using a combination of these simple scores and the page context, the initial scoring technique can later be replaced with something stronger.

The number of pages generated by expanding an initial set of size 10 for 2 steps is astounding, so I tried to limit the number of pages downloaded. (I also cached the downloaded pages, so that my debugging would not place an undue burden on any servers.) Because many completely unrelated pages turn up in this set, I truncated the search by leaving out, and not examining the children of, pages with a zero score. I then create three sets of refined scores in succession, each building on the one before.
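
As a concrete illustration, the following sketch computes this keyword-presence score. It is not the original code; the function and variable names are hypothetical, and the tokenization rule is my own choice.

    import re

    def simple_score(document_text, keywords):
        """Number of keyword occurrences divided by the number of words."""
        words = re.findall(r"[A-Za-z0-9']+", document_text.lower())
        if not words:
            return 0.0
        keyword_set = {k.lower() for k in keywords}
        hits = sum(1 for w in words if w in keyword_set)
        return hits / len(words)

For example, simple_score(page_text, ["software", "dvd", "encoding"]) returns 0.5 for a two-word page containing exactly one of the keywords.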

First, I create what I call the context score, which uses the context of a page to determine whether the page is really relevant. For each page in the graph, treating links as undirected, I find the average of the simple scores of all pages at distance 1 and at distance 2. The context score of the page is then a weighted average of its own simple score and the distance-1 and distance-2 averages, with weights 1, 0.1, and 0.01 respectively. I use undirected links because, at this general level, links to and links from "good" pages (pages with a high simple score) both suggest that a page may itself be good.
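
A minimal sketch of the context score, assuming the graph is stored as an adjacency mapping from each page to its undirected neighbors. The names, and the decision to treat an empty neighbor ring as contributing zero, are my own assumptions rather than details of the original implementation.

    from collections import deque

    def neighbors_within(graph, start, max_dist):
        """Breadth-first search; returns {page: distance} for every page
        within max_dist links of start (start itself excluded)."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            if dist[page] == max_dist:
                continue
            for nbr in graph.get(page, ()):
                if nbr not in dist:
                    dist[nbr] = dist[page] + 1
                    queue.append(nbr)
        del dist[start]
        return dist

    def context_score(undirected_graph, simple_scores, page):
        """Weighted average of a page's simple score and the mean simple
        scores of its distance-1 and distance-2 neighborhoods."""
        d = neighbors_within(undirected_graph, page, 2)
        ring1 = [simple_scores[p] for p, k in d.items() if k == 1]
        ring2 = [simple_scores[p] for p, k in d.items() if k == 2]
        avg1 = sum(ring1) / len(ring1) if ring1 else 0.0  # assumption: empty ring counts as 0
        avg2 = sum(ring2) / len(ring2) if ring2 else 0.0
        w0, w1, w2 = 1.0, 0.1, 0.01
        return (w0 * simple_scores[page] + w1 * avg1 + w2 * avg2) / (w0 + w1 + w2)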

Next I create the utility score, which is constructed like the context score with three differences. First, it is based on the previously computed context scores rather than the simple scores. Second, the weighting falloff at each level is 0.5 instead of 0.1. Most importantly, when finding the pages 1 and 2 links away, I follow links only in their standard direction. This asymmetry reflects the fact that links to good pages are useful to the reader, while links from good pages cannot be followed during normal web browsing.
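
Under the same assumptions, and reusing the neighbors_within helper from the previous sketch (now applied to a forward-only adjacency mapping), the utility score might look like this; a falloff of 0.5 per level gives weights 1, 0.5, and 0.25.

    def utility_score(forward_graph, context_scores, page):
        """Like context_score, but follows links forward only, starts from
        the context scores, and uses a 0.5 falloff per level."""
        d = neighbors_within(forward_graph, page, 2)
        ring1 = [context_scores[p] for p, k in d.items() if k == 1]
        ring2 = [context_scores[p] for p, k in d.items() if k == 2]
        avg1 = sum(ring1) / len(ring1) if ring1 else 0.0
        avg2 = sum(ring2) / len(ring2) if ring2 else 0.0
        w0, w1, w2 = 1.0, 0.5, 0.25  # falloff of 0.5 per level
        return (w0 * context_scores[page] + w1 * avg1 + w2 * avg2) / (w0 + w1 + w2)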

Finally, I create a final ranking of the pages, using the utility score as a starting point. I choose the page with the highest score and give it first place. I then find all pages within 3 links of it (traversing links forward only) and reduce their scores by a diminishment factor that depends on distance (1 away: 0.01, 2 away: 0.1, 3 away: 0.3). I repeat this process until the rest of the pages have been placed in order. The diminishment serves two related purposes. First, it makes it more likely that the highly ranked pages will represent different regions of the web. Second, it reflects the fact that if a page is easily reachable from one of the highly ranked pages, the reader may not need or want to see it separately in the search results.
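
A sketch of this final ranking pass, again reusing neighbors_within on the forward-only graph. I read the diminishment as a multiplicative penalty (so the nearest pages, multiplied by 0.01, are penalized most heavily); that reading is an assumption, though it is the one consistent with the diversity goal described above.

    def final_ranking(forward_graph, utility_scores):
        """Repeatedly pick the best remaining page, then diminish the scores
        of pages reachable from it within 3 forward links."""
        scores = dict(utility_scores)        # working copy, mutated below
        factor = {1: 0.01, 2: 0.1, 3: 0.3}   # assumption: multiplicative penalty
        ranking = []
        remaining = set(scores)
        while remaining:
            best = max(remaining, key=lambda p: scores[p])
            ranking.append(best)
            remaining.discard(best)
            for page, k in neighbors_within(forward_graph, best, 3).items():
                if page in remaining:
                    scores[page] *= factor[k]
        return ranking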

Results

To determine the quality of each ranking, I enlisted the help of two human test subjects. I asked each of them to provide some search queries, which I ran through NeighborSearch. For each query, the results were sorted by the minimum of three ranks: the original MetaCrawler rank, the rank by simple score, and the final rank. I then took the best 20 and gave them to the subject to examine and rate. (I originally asked the subjects to rank the pages, but this was judged too time-consuming, so they simply rated them.) To score each ranking scheme, I took every pair of URLs and gave the scheme a point whenever its ordering of the pair agreed with the user's ratings. Here are the results of the three tests I was able to conduct:
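
A sketch of this pairwise scoring, assuming that lower rank numbers are better and that pairs tied in the user's ratings earn no point either way; the original's treatment of ties is not specified, so that choice is mine.

    from itertools import combinations

    def pairwise_agreement(scheme_rank, user_rating, urls):
        """Count URL pairs whose ordering under a ranking scheme agrees
        with the user's ratings (lower rank number = better)."""
        points = 0
        for a, b in combinations(urls, 2):
            if user_rating[a] == user_rating[b]:
                continue  # assumption: ties in the ratings score no point
            user_prefers_a = user_rating[a] > user_rating[b]
            scheme_prefers_a = scheme_rank[a] < scheme_rank[b]
            if user_prefers_a == scheme_prefers_a:
                points += 1
        return points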

Query = software dvd encoding

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.mgm.com/dvd/  --  0.500  001  002  1
http://www.scbbs.com/  --  0.500  002  001  1
http://www.unik.no/~robert/hifi/dvd/pro.html  01  0.079  021  010  8
http://www.visiblelight.com/  02  0.044  081  328  7
http://www.dvd-onestop.com/  --  0.333  003  004  1
http://www.zumadigital.com/  --  0.250  005  003  1
http://www.dvcc.com/  03  0.036  127  241  4
http://www.dvdvideogroup.com/  --  0.333  004  005  1
http://newmedia.com/NewMedia/97/15/feature/Develop_DVD_ROM.html  04  0.031  163  291  10
http://www.ividea.com/  05  0.021  267  384  1
http://www.zumadigital.com/zumadigital/contact.html  --  0.250  006  020  1
http://www.nfoweb.com/  --  0.005  751  006  1
http://www.sanyo-verbatim.com/dvdindx.htm  --  0.166  007  015  1
http://www.dolby.com/dvd/  --  0.070  029  007  4
http://www.compaq.com/athome/showroom/tech/dvdrom.html  07  0.028  190  014  2
http://www.vsda.org/  --  0.166  008  233  1
http://www.dvdinsider.com/charts.html  --  0.088  017  008  6
http://www.idg.net/idg_frames/english/content.cgi?vc=story_1259%2ehtml  08  0.071  026  012  1
http://newmedia.com/NewMedia/97/15/feature/dvd4.html  --  0.106  009  011  3
http://WWW.D2PROD.COM/DVD.html  --  0.102  012  009  6

Ranking Scheme  MetaCrawler  Simple  Final
Result          133          124     115

Query = bill gates house

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.billg.org/secretdiary/ads/ad1.htm  --  0.285  001  002  5
http://www.teamgates.com/other/billgame.asp  --  0.134  006  001  5
http://www.zpub.com/un/bill/ecology.html  01  0.052  029  045  10
http://www.billg.org/secretdiary/  --  0.250  002  003  5
http://www4.usnews.com/usnews/nycu/gates.htm  02  0.082  014  092  10
http://www.roadahead.com/  --  0.200  003  072  5
http://www.zpub.com/un/bill/  03  0.051  030  088  5
http://www.quuxuum.org/~evan/awards.html  --  0.142  004  022  5
http://nytsyn.com/live/cndtopics.html  --  0.111  009  004  0
http://www4.usnews.com/usnews/nycu/gatehigh.htm  04  0.050  032  064  7
http://www.jas.com/ms-shame.shtml  --  0.142  005  057  5
http://www.tiac.net/users/billg40/main/index.htm  --  0.048  033  005  5
http://www.iservco.com/~dave/humor/joke85.html  05  0.018  101  100  10
http://www.mckinley.com/search.gw?....  --  0.056  025  006  0
http://www.urz.uni-heidelberg.de/subject/hd/fak7/hist/o1/logs/mt/t49/0021.html  06  0.004  207  103  0
http://nytsyn.com/live/Gates/141_052197_104207_14626.html  --  0.005  189  007  5
http://www.teamgates.com/other/links.asp  --  0.124  007  077  5
http://www.microsoft.com/BillGates/billgates_l/features/pastfeatures.htm  --  0.117  008  023  5
http://www4.usnews.com/usnews/news/michigh.htm  --  0.008  169  008  5
http://nytsyn.com/live/Gates/182_070197_142208_1682.html  --  0.005  196  009  5

Ranking Scheme  MetaCrawler  Simple  Final
Result          106          105     104

Query = northwestern wildcats football

URL  MC Rank  Simple Score  Simple Rank  Final Rank  User Score
http://www.enteract.com/~cjc/nu-football/  01  0.230  005  151  10
http://www.cris.com/~Jjack/wildcats.shtml  05  0.750  001  001  10
http://www.ou.edu/athweb/version1/sports/football/  --  0.500  002  002  5
http://users.aol.com/aactchrdeb/cats.htm  02  0.053  056  013  10
http://members.aol.com/jimmyten/index.html  --  0.333  003  005  5
http://www.enteract.com/~cjc/nu-football/previous/previous.html  --  0.104  014  003  7
http://www.enteract.com/~cjc/nu-football/noframes/  03  0.022  166  127  7
http://www.dancris.com/~pedro/SunDevils.html  --  0.333  004  006  0
http://www.enteract.com/~cjc/nu-football/sites.html  --  0.163  009  004  7
http://www.sportingnews.com/cfootball/teams/northwestern/ ...  04  0.025  143  047  10
http://www.nusports.com/audio/index.html  --  0.200  006  111  5
http://sportingnews.com/cfootball/teams/northwestern/archive.html  06  0.167  007  012  10
http://pubweb.nwu.edu/~jle096/NU.htm  07  0.077  023  007  10
http://www.buckeyezone.com/  --  0.166  008  010  5
http://www.nwu.edu/factbook/  --  0.120  012  008  5
http://www.ruf.rice.edu/~ricesid/rsp/97-98/fb/09-15.html  08  0.007  369  582  5
http://www.personal.psu.edu/users/c/w/cwb129/bigten/  --  0.032  105  009  5
http://www.enteract.com/~cjc/nu-football/schedule.html  --  0.133  010  119  10
http://www.nrc.pair.com/big10.html  10  0.042  079  155  5
http://www.nwu.edu/search/  --  0.125  011  177  5

Ranking Scheme  MetaCrawler  Simple  Final
Result          143          119     157

My new ranking scheme fared unspectacularly in these tests. In the third example, where it was most successful, the MetaCrawler's score was probably artificially lowered because all of the pages the MetaCrawler left unranked were treated as tied.

Conclusion

The potential of even the simple scheme implemented in the current NeighborSearch has not been fully determined. A large number of parameters need further tuning before a final evaluation can be performed. This tuning could be done by hand, but an automatic scheme analogous to that described in [BCB94] would be more valuable. Unfortunately, Bayesian inference over general graphs is NP-hard.

The technique developed by Kleinberg [K98] has demonstrated good results while completely ignoring the content of the individual pages (after the initial pruning required to generate the starting subweb). Using only the topology of the graph, Kleinberg is able to tease apart pages that use different meanings of the search terms. My algorithm implicitly assumes that any two pages with a high score receive their high scores for the same reason; this assumption fails when the search terms are ambiguous. An extension of NeighborSearch to a more explicitly topological formulation of web-page properties (a la Hubs and Authorities) might be more successful.

Finally, perhaps the most important task is to reformulate any one of the context-sensitive ranking algorithms in a fashion that can produce real-time responses to search queries. Even if the results aren't astounding, a system fast enough to be usable would allow further research to proceed on making it useful. After all, the fast drives out the slow, even if the fast is wrong.

References

[BCB94] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew, Automatic Combination of Multiple Ranked Retrieval Systems, Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 173-181, 1994.

[FO95] C. Faloutsos and D. W. Oard, A Survey of Information Retrieval and Filtering Methods, Technical Report CS-TR-3514, University of Maryland, 1995.

[FC89] Mark E. Frisse and Steve B. Cousins, Information Retrieval from Hypertext: Update on the Dynamic Medical Handbook Project, Proceedings of Hypertext '89, ACM Press, 1989.

[K98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998.

[MC] MetaCrawler, http://www.metacrawler.com