CSE 590DB: Database Seminar, Autumn 2004

Object Matching in Data Management

Dan Suciu
Wednesdays 4:30 - 5:20
CSE 605 Database Lab

Seminar Description

Known variously as Record Linkage, Merge/Purge, Deduplication, or Citation Matching, the problem is to decide whether two data records in fact represent the same object or entity. It is a frequent data management task in practice, needed in data cleaning and in data integration. There is a huge literature on this topic, and in this seminar we have selected papers that are especially relevant to a researcher in data management.

As is the tradition, participants are expected to present one paper during the quarter and to engage in the discussions. Presentations are typically about 25-30 minutes (shorter if two papers are being presented), allowing sufficient time for discussion. The slides are typically posted to this web page (email jayant At cs) either before or just after the presentation.

Tentative Schedule and Reading List

Day Readings Presenter
09/29 Introduction, overview and selection of presenters.
  • Dan
  • Jayant
  • 10/06 Introduction
  • Nilesh
  • Ashish
  • 10/13 Theory of Record Linkage
    • Ivan Fellegi and Alan Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
      This is the seminal paper that formed the basis for most subsequent work on record linkage.
    • Optional Reading: William Winkler. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Technical Report RR2000/05, Statistical Research Division, Bureau of the Census.
      This paper describes the use of the EM algorithm to estimate the various parameters in the FS model. It is a difficult read if you are new to EM, but will be informative if your curiosity has been piqued by the various references to this procedure in the earlier papers.
  • Chris
  • 10/20 Record Linkage and Citation Matching work at UW.
  • Pedro
  • 10/27 Adaptive String Matching
    • Mikhail Bilenko and Raymond Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003.
      This paper describes how string similarity functions can be learned and adapted. There is a technical report here that fills in some of the missing details about learning an edit distance measure.
  • Shobhit
  • 11/03 Scaling Record Linkage to Large Databases
  • Mike
  • Danny
  • 11/10 Online Record Linkage
  • Jihad
  • 11/17 Non-traditional Notions of Record Similarity
    • Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002.
      This paper describes an interesting notion of equivalence: matches that are not string matches (e.g. UK = Great Britain, though the two strings are in no way similar). Such matches can nevertheless be identified by observing their participation in hierarchies (e.g. states/cities within countries).
  • Michelle
  • 11/24 String Joins in Relational Databases
    • Luis Gravano, Panagiotis Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate String Joins in a Database (Almost) for Free. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001.
      This paper describes the application of q-grams for efficient approximate string matching in a relational database.
    • William Cohen. Data Integration using Similarity Joins and a Word-based Information Representation Language. In ACM Transactions on Information Systems 18(3): 288-321 (2000).
      This paper describes the Whirl system, which also proposed approximate string joins and approximate query answering using IR-based string comparison metrics. The conference version of this paper was discussed in 590db, Winter 2004.
  • Yuhan
  • Jayant
  • 12/01 Multi-relational Reference Reconciliation
  • Luna
  • 12/08 Choosing Labeled Training Data
  • Doug
  • Other Related Papers

    Other Resources

    Please sign up for the course mailing list here. Send mail to that list at cse590db at cs

    Previous CSE 590DBs:

    UW Database Group Web

    Questions? Comments?... email jayant At cs

    Last modified: Mon Nov 22 11:04:05 PST 2004