CSE 590DB: Database Seminar, Autumn 2004

Object Matching in Data Management

Dan Suciu
Wednesdays 4:30 - 5:20
CSE 605 Database Lab

Seminar Description

Known variously as Record Linkage, Merge/Purge, Deduplication, or Citation Matching, the problem is to decide whether two data records in fact represent the same object or entity. It is a frequent data management task in practice, needed in data cleaning and in data integration. There is a huge literature on this topic, and in this seminar we have selected papers that are especially relevant to a researcher in data management.

As is the tradition, participants are expected to present one paper during the quarter and to engage in the discussions. Presentations typically run about 25-30 minutes (shorter if two papers are being presented), allowing sufficient time for discussion. The slides are typically posted to this web page (email jayant At cs) either before or just after the presentation.

Tentative Schedule and Reading List

Day	Readings	Presenter
09/29	Introduction, overview and selection of presenters.	Dan, Jayant
10/06	Introduction William Winkler. The state of record linkage and current research problems. Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1999. This is a frequently cited survey on record linkage that outlines the lay of the land, open problems, etc. William Cohen, Pradeep Ravikumar and Stephen Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Workshop on Information Integration on the Web (IIW), at IJCAI 2003. Ashish's slides. A better-written paper than the one below; it lists string distance metrics and compares them on a common task. William Winkler and Edward Porter. Approximate String Comparison and its effect on an Advanced Record Linkage System. Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1997. This paper is a light read that compares the basic string matching techniques.	Nilesh, Ashish
10/13	Theory of Record Linkage Ivan Fellegi and Alan Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969. This is the seminal paper that formed the basis for most subsequent work on record linkage. Optional Reading: William Winkler. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Technical Report RR2000/05, Statistical Research Division, U.S. Bureau of the Census. This paper describes the use of the EM algorithm to estimate the various parameters in the Fellegi-Sunter (FS) model. It is a difficult read if you are new to EM, but will be informative if your curiosity has been piqued by the various references to this procedure in the earlier papers.	Chris
10/20	Record Linkage and Citation Matching work at UW. Parag and Pedro Domingos. Collective Object Identification.	Pedro
10/27	Adaptive String Matching Mikhail Bilenko and Raymond Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003. This paper describes how string similarity functions can be learned and adapted. There is a technical report here that fills in some of the missing details about learning an edit distance measure.	Shobhit
11/03	Scaling Record Linkage to Large Databases Mauricio Hernandez and Salvatore Stolfo. The Merge/Purge Problem for Large Databases. In Proceedings of the ACM SIGMOD Conference, 1995. This seems to be the first paper in the database community that talks about scaling record linkage to very large datasets, using the sorted neighborhood approach, which tries to localize expensive record comparisons. Andrew McCallum, Kamal Nigam and Lyle Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of the ACM SIGKDD, 2000. This paper describes the "canopy" approach, in which a two-stage clustering procedure is used to perform record linkage efficiently. Danny's slides.	Mike, Danny
11/10	Online Record Linkage Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proceedings of the ACM SIGMOD, 2003. This paper describes efficient techniques for the online matching of incoming tuples against a reference set of tuples.	Jihad
11/17	Non-traditional Notions of Record Similarity Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002. This paper describes an interesting notion of equivalence: matches that are not string matches (e.g. UK = Great Britain, though the two strings are in no way similar). Such matches can nevertheless be identified by observing their participation in hierarchies (e.g. states/cities within countries).	Michelle
11/24	String Joins in Relational Databases Luis Gravano, Panagiotis Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate String Joins in a Database (Almost) for Free. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001. This paper describes the application of q-grams for efficient approximate string matching in a relational database. William Cohen. Data Integration using Similarity Joins and a Word-based Information Representation Language. In ACM Transactions on Information Systems 18(3): 288-321 (2000). This paper describes the WHIRL system, which also proposed approximate string joins and approximate query answering using IR-based string comparison metrics. The conference version of this paper was discussed in 590db, Winter 2004.	Yuhan, Jayant
12/01	Multi-relational Reference Reconciliation Bunch of UW people. Multi-relational Reference Reconciliation. (will need authentication outside CSE). This paper is under review.	Luna
12/08	Choosing Labeled Training Data Sunita Sarawagi and Anuradha Bhamidipaty. Interactive Deduplication using Active Learning. In Proceedings of the ACM SIGKDD, 2002. This paper describes the use of active learning to intelligently select match/non-match pairs to train classifiers for record linkage. Doug's slides.	Doug

Other Related Papers

Indrajit Bhattacharya and Lise Getoor. Iterative Record Linkage for Cleaning and Integration. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
Unlike most citation matching papers, this paper addresses the problem of identifying duplicate author references (as opposed to duplicate papers). Author references are iteratively merged based on their attribute similarity and their co-author lists.
William Cohen, David McAllester, and Henry Kautz. Hardening Soft Information Sources. In Proceedings of ACM SIGKDD, 2000, 255-259.
This paper formulates the reference matching problem as an interesting search problem: find the most likely underlying actual database given a current noisy one.
Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. Declarative Data Cleaning: Language, Model, and Algorithms. In Proceedings of the International Conference on Very Large Databases (VLDB), 2001.
Liang Jin, Chen Li, and Sharad Mehrotra. Efficient Record Linkage in Large Data Sets. In Proceedings of the International Conference on Database Systems for Advanced Application (DASFAA), 2003.
Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity Uncertainty and Citation Matching. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS) 15, 2003.
Stuart Russell spoke about this work during his talk in cse590AI this quarter.

Other Resources

William Cohen. Probabilistic Record Linkage. A good presentation about the basics of record linkage.
The SecondString Toolkit. A Java package that has implementations of many standard string comparison algorithms.

Please sign up for the course mailing list here. Send mail to that list at cse590db at cs.

Previous CSE 590DBs:

UW Database Group Web

Questions? Comments?... email jayant At cs

Last modified: Mon Nov 22 11:04:05 PST 2004