||Introduction, overview and selection of presenters.
- William Winkler.
The state of record linkage and current
Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1999.
This is a frequently cited survey on record linkage and outlines
the lay of the land, open problems, etc.
- William Cohen, Pradeep Ravikumar and Stephen Fienberg.
A Comparison of String Distance
Metrics for Name-Matching Tasks.
In Workshop on Information Integration on the Web (IIW), at IJCAI
A better-written paper than the one below. It lists string
distance metrics and compares them on a common task.
- William Winkler and Edward Porter.
Approximate String Comparison and its
effect on an Advanced Record Linkage System.
Technical report, Statistical Research Division, U.S. Bureau
of the Census, 1997.
This paper is a light read that compares the basic string matching
||Theory of Record Linkage
- Ivan Fellegi and Alan Sunter.
A theory for record linkage.
Journal of the American Statistical Association, 64:1183--1210, 1969.
This is the seminal paper that formed the basis for most
subsequent work on record linkage.
- Optional Reading: William Winkler.
Using the EM Algorithm for Weight
Computation in the Fellegi-Sunter Model of Record Linkage.
Technical Report RR2000/05, Statistical Research Division, U.S. Bureau of the Census.
This paper describes the use of the EM algorithm to estimate the
various parameters in the FS model. It is a difficult read if you are
new to EM, but will be informative if your curiosity has been
piqued by the various references to this procedure in the earlier papers.
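For orientation before reading the two papers above, here is a minimal
sketch of how the FS parameters are typically used to score a record pair.
It assumes the common simplification that fields agree or disagree
independently of each other; the m and u values and the three fields are
made up for illustration and are not taken from either paper. Here m is the
probability that a field agrees given that the pair is a true match, and u
is the probability that it agrees given a non-match.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log-likelihood-ratio weight contributed by one compared field."""
    # Agreement is evidence for a match when m > u; disagreement is evidence against.
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def match_score(agreement, m_probs, u_probs) -> float:
    """Sum of field weights, valid under the conditional independence assumption."""
    return sum(field_weight(a, m, u) for a, m, u in zip(agreement, m_probs, u_probs))


# Hypothetical parameters for three fields (say surname, birth year, zip code).
M = [0.95, 0.90, 0.85]
U = [0.01, 0.05, 0.10]

print(match_score([True, True, True], M, U))    # all fields agree: large positive score
print(match_score([True, False, False], M, U))  # mixed evidence: much lower score
```

Pairs scoring above an upper threshold are declared links, pairs below a
lower threshold non-links, and pairs in between are left for clerical
review. In practice m and u are not known in advance, which is where the EM
procedure in the optional reading comes in.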
||Record Linkage and Citation Matching work at UW.
||Adaptive String Matching
- Mikhail Bilenko and Raymond Mooney.
Adaptive Duplicate Detection Using
Learnable String Similarity Measures. In Proceedings of the 9th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003.
This paper describes how string similarity functions can be learned
and adapted. There is a technical report here that fills in some of
the missing details about learning an edit distance measure.
||Scaling Record Linkage to Large Databases
Online Record Linkage
||Non-traditional Notions of Record Similarity
- Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh
Eliminating Fuzzy Duplicates in
Data Warehouses. In Proceedings of the 28th International
Conference on Very Large Data Bases (VLDB), 2002.
This paper describes an interesting notion of equivalence: matches
that are not string matches (e.g. UK = Great Britain, though the two
strings are in no way similar). However, such matches can be identified by
observing their participation in hierarchies (e.g. states/cities within
||String Joins in Relational Databases
- Luis Gravano, Panagiotis Ipeirotis, H. V. Jagadish, Nick
Koudas, S. Muthukrishnan, and Divesh Srivastava.
Approximate String Joins in a Database
(Almost) for Free. In Proceedings of the 27th International
Conference on Very Large Data Bases (VLDB), 2001.
This paper describes the application of q-grams for efficient
approximate string matching in a relational database.
- William Cohen. Data
Integration using Similarity Joins and a Word-based Information
In ACM Transactions on Information Systems 18(3): 288-321 (2000)
This paper describes the Whirl system, which also proposed
approximate string joins and approximate query answering using
IR-based string comparison metrics. The conference version of this
paper was discussed in 590db, Winter 2004.
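As a rough illustration of the q-gram idea in the Gravano et al. entry
above, the sketch below splits strings into overlapping q-grams and uses the
number of shared q-grams as a cheap filter before any expensive edit
distance computation. This is a simplified in-memory version, not the
paper's SQL formulation. The function names are hypothetical, and it assumes
the padding characters '#' and '$' do not occur in the data.

```python
from collections import Counter


def qgrams(s: str, q: int = 3) -> Counter:
    """Multiset of q-grams of s, padded with q-1 '#' in front and q-1 '$' behind."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def shared_qgrams(a: str, b: str, q: int = 3) -> int:
    """Number of q-grams the two strings have in common (multiset intersection)."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def may_match(a: str, b: str, k: int, q: int = 3) -> bool:
    """Necessary conditions for edit distance <= k; a False result rules the pair out."""
    # Length filter: k edits can change the length by at most k.
    if abs(len(a) - len(b)) > k:
        return False
    # Count filter: each edit destroys at most q of the len + q - 1 padded q-grams.
    needed = max(len(a), len(b)) - 1 - (k - 1) * q
    return shared_qgrams(a, b, q) >= needed


print(shared_qgrams("smith", "smyth"))     # 4
print(may_match("smith", "smyth", k=1))    # True
print(may_match("smith", "jones", k=1))    # False
```

Pairs that pass the filter are only candidates and still need to be
verified with a real edit distance computation. The appeal of the approach
is that the filter itself can be expressed with ordinary joins and counts
over a table of q-grams.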
||Multi-relational Reference Reconciliation
||Choosing Labeled Training Data