================================================================
CSE 344 -- Spring 2011
Lecture 25:   Entity Resolution

READING ASSIGNMENT: Chapter 21.7

================================================================

The problem:
   -- we have two datasets about the "same" entities (people, or
   companies, or...):  S = {x1, x2, ...} and T = {y1, y2, ...}
   -- however, the entities are represented (or spelled) differently
   in the two datasets
   -- Problem: "resolve" the entities, i.e. find matching entities


Examples:

x = "Mr. J. Brown"   vs.   y = "John Brown"
x = "Microsoft Corporation" vs.  y = "The Microsoft Company"

Other names:
   - For DB people: data matching, merge/purge, duplicate detection,
     data cleansing, ETL (extract, transform, load),
     de-duping
   - For AI/ML people: reference matching, database hardening

Main applications:
   - fuzzy join, e.g. dirty table against reference table
   - removing duplicates in a single table


================================================================

Step 1: finding similar items

   -- define a similarity function between two entities, "x~y", return
      all pairs whose similarity exceeds some threshold

   -- normalize the representation, using rules: effective only if
      such a normalization is already standardized,
      e.g. U.S. addresses

   -- often we have to match two records: several attributes are
      similar, and we need to combine their similarity scores into one
      global score.
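
A minimal sketch of one way to combine per-attribute scores into a
global score: a weighted average.  The attribute names, the weights,
and the function name combined_score are illustrative assumptions, not
something prescribed by the lecture; other combination rules (e.g. a
learned classifier) are equally possible.

```python
def combined_score(scores, weights):
    """Weighted average of per-attribute similarity scores.

    scores  -- dict: attribute name -> similarity in [0, 1]
    weights -- dict: attribute name -> non-negative weight (assumed, hand-picked)
    """
    total_weight = sum(weights[attr] for attr in scores)
    if total_weight == 0:
        return 0.0
    return sum(weights[attr] * scores[attr] for attr in scores) / total_weight


# Hypothetical record pair: the names match well, the addresses less so.
scores = {"name": 0.8, "address": 0.4}
weights = {"name": 3, "address": 1}
global_score = combined_score(scores, weights)
```

The pair is then declared a match if global_score exceeds the chosen
threshold.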


Step 2: merging similar items: if x, y are sufficiently similar, then
        merge them into x*y.

assumption: if x~y then x*y is defined

================================================================

We will discuss similarity functions.

The Edit Distance (a.k.a. Levenshtein distance)

Definition.  Given two strings x = x1.x2... and y = y1.y2... their
"edit distance" D(x,y) is the minimum total cost of a sequence of edit
commands that transforms x into y.  An edit command is one of:

   -- delete a character (cost = 1)
   -- insert a character (cost = 1)
   -- substitute one character for another (cost = 1)

*** IN class: compute the edit distance:

   x = "Bill Gates, Jr"
   y = "William Gates, Chair"

Computing D(x,y) in polynomial time.

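D(i,j) = edit distance between x1...xi and y1...yj

Base cases:  D(i,0) = i  and  D(0,j) = j

Then, for i,j >= 1:

D(i,j) = min of the following three values
   1.   D(i-1,j-1)    if xi=yj   /* copy */
     or D(i-1,j-1)+1  if xi!=yj  /* substitute */
   2. D(i-1,j)+1               /* delete xi */
   3. D(i,j-1)+1               /* insert yj */

D(x,y) = D(m,n), where m = |x| and n = |y|.  Filling in the table
takes O(m*n) time.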
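
The recurrence can be sketched as a dynamic program over a table.
This is a minimal sketch assuming unit costs for all three edit
commands; the function name edit_distance is ours.

```python
def edit_distance(x, y):
    """Edit distance between strings x and y with unit-cost edits."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between the prefixes x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all i characters of x[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # cost 0 to copy a matching character, 1 to substitute
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + sub,  # copy / substitute
                D[i - 1][j] + 1,        # delete x[i-1]
                D[i][j - 1] + 1,        # insert y[j-1]
            )
    return D[m][n]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The table has (m+1)*(n+1) entries and each is computed in constant
time.  Only the previous row is needed at any point, so the memory can
be reduced to O(n) if just the distance (not the edit sequence) is
required.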

================================================================

Jaccard Similarity.

First, split a string into "k-grams" or "shingles".

E.g. if k=3 then "Bill Gates, Jr" --> 'Bil', 'ill', 'll ', 'l G', ...

Thus, x, y become two sets.

DEFINITION.  Given two sets x, y, their Jaccard similarity is J(x,y) =
|x*y|/|x+y|, where * denotes intersection and + denotes union.
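
A minimal sketch of the two steps, shingling and Jaccard similarity.
The function names kgrams and jaccard are ours, and returning 0.0 when
both sets are empty (e.g. for strings shorter than k) is an assumed
convention; the definition leaves that case open.

```python
def kgrams(s, k=3):
    """Set of all length-k substrings (shingles) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets do not match
    return len(a & b) / len(union)


x = kgrams("Bill Gates, Jr")          # contains 'Bil', 'ill', 'll ', 'l G', ...
y = kgrams("William Gates, Chair")
similarity = jaccard(x, y)
```

Because x and y are sets, a k-gram that occurs several times in a
string is counted only once.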

***  Example in class.

================================================================

The similarity join problem: given two sets of strings S= {x1, x2,
...}, T = {y1, y2, ...}, and a threshold t, compute all pairs of
matching entities: (xi,yj) s.t. J(xi,yj) > t.
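
As a baseline, the problem statement can be sketched as a brute-force
join that compares every pair, which costs |S|*|T| similarity
computations.  The names similarity_join, kgrams, and jaccard are
ours, and the shingle length k=3 is an assumption carried over from
the example above.

```python
def kgrams(s, k=3):
    """Set of all length-k substrings (shingles) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(S, T, t, k=3):
    """All pairs (x, y) with x in S, y in T and J(x, y) > t."""
    # Shingle each string once, not once per comparison.
    S_grams = [(x, kgrams(x, k)) for x in S]
    T_grams = [(y, kgrams(y, k)) for y in T]
    return [(x, y)
            for x, gx in S_grams
            for y, gy in T_grams
            if jaccard(gx, gy) > t]


pairs = similarity_join(["abcd"], ["bcde", "wxyz"], t=0.3)
# pairs == [("abcd", "bcde")], since J = 1/3 > 0.3
```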

Next time.