================================================================
CSE 344 -- Spring 2011
Lecture 25: Entity Resolution

READING ASSIGNMENT: Chapter 21.7
================================================================

The problem:

-- we have two datasets about the "same" entities (people, or
   companies, or...):  S = {x1, x2, ...} and T = {y1, y2, ...}

-- however, the entities are represented (or spelled) differently in
   the two datasets

-- Problem: "resolve" the entities, i.e. find matching entities

Examples:

  x = "Mr. J. Brown"            vs.  y = "John Brown"
  x = "Microsoft Corporation"   vs.  y = "The Microsoft Company"

Other names:

- For DB people: data matching, merge/purge, duplicate detection, data
  cleansing, ETL (extraction, transformation, and loading), de-duping

- For AI/ML people: reference matching, database hardening

Main applications:

- fuzzy join, e.g. dirty table against reference table
- removing duplicates in a single table

================================================================

Step 1: finding similar items

-- define a similarity function between two entities, "x~y", and
   return all pairs whose similarity exceeds some threshold

-- normalize the representation, using rules: effective only if a
   standard normal form already exists, e.g. U.S. addresses

-- often we have to match two records: several attributes are similar,
   and we need to combine their similarity scores into one global
   score.

Step 2: merging similar items: if x, y are sufficiently similar, then
  merge them into x*y.
  assumption: if x~y then x*y is defined

================================================================

We will discuss similarity functions.

The Edit Distance (a.k.a. Levenshtein distance)

Definition. Given two strings x = x1.x2... and y = y1.y2..., their
"edit distance" D(x,y) is the length of the shortest sequence of edit
commands that transforms x into y.
An edit command is one of:

  -- delete a character (cost = 1)
  -- insert a character (cost = 1)
  -- substitute one character for another (cost = 1)

*** IN class: compute the edit distance:

  x = "Bill Gates, Jr"
  y = "William Gates, Chair"

Computing D(x,y) in polynomial time.

  D(i,j) = edit distance of x1...xi and y1...yj

Base cases: D(i,0) = i (delete all of x1...xi) and D(0,j) = j (insert
all of y1...yj).

Then, for i,j >= 1: D(i,j) = min of the following three values

  1. D(i-1,j-1)       if xi=yj     /* copy */
     or D(i-1,j-1)+1  if xi!=yj    /* substitute */
  2. D(i-1,j)+1                    /* delete xi */
  3. D(i,j-1)+1                    /* insert yj */

If x has length m and y has length n, then D(x,y) = D(m,n), and filling
in the whole table takes O(m*n) time.

================================================================

Jaccard Similarity.

First, split a string into "k-grams" or "shingles".  E.g. k=3 then

  "Bill Gates, Jr" --> 'Bil', 'ill', 'll ', 'l G', ...

Thus, x, y become two sets.

DEFINITION. Given two sets x, y, their Jaccard similarity is
J(x,y) = |x*y|/|x+y|, where * denotes intersection and + denotes union.

*** Example in class.

================================================================

The similarity join problem: given two sets of strings
S = {x1, x2, ...}, T = {y1, y2, ...}, and a threshold t, compute all
pairs of matching entities: (xi,yj) s.t. J(xi,yj) > t.

Next time.
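================================================================

Sketch in code. The two similarity functions above, and the
similarity join problem statement, can be tried out with a short
Python sketch. This is only an illustration under stated assumptions,
not the method from the lecture or the reading: the function names
(edit_distance, shingles, jaccard, similarity_join) are made up here,
J is taken to be 0 when both shingle sets are empty, and the join is a
naive nested loop over all pairs, shown only as a baseline.

```python
from typing import List, Set, Tuple


def edit_distance(x: str, y: str) -> int:
    """Fill the table D(i,j) using the three-case recurrence, unit costs."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of x1...xi
    for j in range(n + 1):
        D[0][j] = j  # insert all of y1...yj
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # copy if the characters agree, otherwise substitute (cost 1)
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            D[i][j] = min(diag,
                          D[i - 1][j] + 1,   # delete xi
                          D[i][j - 1] + 1)   # insert yj
    return D[m][n]


def shingles(s: str, k: int = 3) -> Set[str]:
    """Set of all k-grams of s (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|a*b| / |a+b|; assumption: defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(S: List[str], T: List[str], t: float,
                    k: int = 3) -> List[Tuple[str, str]]:
    """Naive baseline: compare every pair, keep those with J > t."""
    t_sets = [(y, shingles(y, k)) for y in T]
    result = []
    for x in S:
        sx = shingles(x, k)
        for y, sy in t_sets:
            if jaccard(sx, sy) > t:
                result.append((x, y))
    return result


if __name__ == "__main__":
    print(edit_distance("Bill Gates, Jr", "William Gates, Chair"))
    print(sorted(shingles("Bill Gates, Jr")))
    print(similarity_join(["John Brown"], ["John Browne", "Microsoft"], 0.5))
```

The nested loop computes |S|*|T| similarities, so it is only practical
for small inputs.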