The Canopies Algorithm
from "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching"
Andrew McCallum, Kamal Nigam, Lyle H. Ungar
Presented by Danny Wyatt

Clustering vs. Classification
Classification learns a way to assign data to pre-labeled classes
Supervised
Clustering learns the definition of the classes from the data
Unsupervised

Record Linkage
As classification [Fellegi & Sunter]
Data point is a pair of records
Each pair is classified as match or not match
Post-process with transitive closure
As clustering
Data point is an individual record
All records in a cluster are considered a match
No transitive closure needed if clusters do not overlap
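The transitive-closure step can be sketched with a union-find structure (a hypothetical minimal version; the slide does not prescribe an implementation): every pair the classifier marks as a match is merged, and the resulting connected components are the final linked groups.

```python
# Transitive closure over pairwise match decisions via union-find.
# Hypothetical sketch: records are ints 0..n-1, `matches` is the
# classifier's list of matched pairs.

def transitive_closure(n_records, matches):
    """Group records into the components implied by matched pairs."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)          # union the two components

    groups = {}
    for r in range(n_records):
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())

# Pairs (0,1) and (1,2) matched: 0 and 2 end up linked transitively.
print(transitive_closure(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```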

Motivation
Either way, O(n²) such evaluations must be performed
Evaluations can be expensive
Many features to compare
Costly metrics (e.g. string edit distance)
Non-matches far outnumber matches
Can we quickly eliminate obvious non-matches to focus effort?

Canopies
A fast comparison groups the data into overlapping canopies
The expensive comparison for full clustering is only performed for pairs in the same canopy
No loss in accuracy if, for every traditional cluster, there exists a canopy containing all elements of that cluster

Creating Canopies
Define two thresholds
Tight: T1
Loose: T2
Put all records into a set S
While S is not empty
Remove any record r from S and create a canopy centered at r
For each other record ri, compute cheap distance d from r to ri
If d < T2, place ri in r's canopy
If d < T1, remove ri from S
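The loop above can be sketched as follows (a minimal version, assuming a user-supplied cheap_distance and following the slide's convention that T1 is the tight threshold and T2 the loose one, so T1 < T2):

```python
# Canopy creation following the slide's loop. `cheap_distance` is assumed
# given; a 1-D absolute difference stands in for the real cheap metric.

def make_canopies(points, cheap_distance, t_tight, t_loose):
    """Return a list of canopies; a point may appear in several canopies."""
    remaining = list(range(len(points)))   # the set S (ordered for determinism)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # remove a record r from S
        canopy = [center]
        survivors = []
        for i in remaining:
            d = cheap_distance(points[center], points[i])
            if d < t_loose:                # d < T2: place r_i in r's canopy
                canopy.append(i)
            if d >= t_tight:               # d < T1 would drop r_i from S
                survivors.append(i)
        remaining = survivors
        canopies.append(canopy)
    return canopies

canopies = make_canopies([0.0, 0.1, 5.0], lambda a, b: abs(a - b),
                         t_tight=0.5, t_loose=1.0)
print(canopies)  # [[0, 1], [2]]
```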

Creating Canopies
Points can be in more than one canopy
Points within the tight threshold will not start a new canopy
Final number of canopies depends on threshold values and distance metric
Experimental validation suggests that T1 and T2 should be equal

Canopies and GAC
Greedy Agglomerative Clustering
Make fully connected graph with a node for each data point
Edge weights are computed distances
Run Kruskal's MST algorithm, stopping when you have a forest of k trees
Each tree is a cluster
With Canopies
Only create edges between points in the same canopy
Run as before
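A canopy-restricted GAC step might look like this sketch (hypothetical helper names; the expensive metric is stubbed with absolute difference): Kruskal's algorithm runs only over edges whose endpoints share a canopy, and stops once k components remain.

```python
from itertools import combinations

def gac_with_canopies(points, canopies, distance, k):
    """Kruskal's algorithm over canopy-internal edges, stopped at k clusters."""
    # Expensive distances only for pairs sharing a canopy.
    edges = set()
    for canopy in canopies:
        for a, b in combinations(canopy, 2):
            edges.add((distance(points[a], points[b]), a, b))

    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_clusters = len(points)
    for _, a, b in sorted(edges):          # Kruskal: cheapest edges first
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            n_clusters -= 1
            if n_clusters == k:            # stop at a forest of k trees
                break

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

pts = [0.0, 0.2, 9.0, 9.1]
cans = [[0, 1], [2, 3]]
clusters = gac_with_canopies(pts, cans, lambda a, b: abs(a - b), k=2)
print(clusters)  # [[0, 1], [2, 3]]
```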

EM Clustering
Create k cluster prototypes c1 … ck
Until convergence
Compute distance from each record to each prototype (O(kn))
Use that distance to compute probability of each prototype given the data
Move the prototypes to maximize their probabilities

Canopies and EM Clustering
Method 1
Distances from a prototype to data points are only computed within the canopies containing the prototype
Note that prototypes can cross canopies
Method 2
Same as 1, but also use all canopy centers to account for outside data points
Method 3
Same as 1, but dynamically create and destroy prototypes using existing techniques
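Method 1 can be illustrated with a toy E/M step in which a prototype only computes distances to points sharing a canopy with it (a hypothetical Gaussian-weight sketch, not the paper's exact model; all names here are made up):

```python
import math

def em_step_method1(points, prototypes, point_canopies, proto_canopies):
    """One EM step where a prototype only 'sees' points sharing a canopy.

    point_canopies[i] / proto_canopies[j] are sets of canopy ids.
    Hypothetical sketch with unnormalized Gaussian weights.
    """
    responsibilities = []
    for i, x in enumerate(points):
        weights = []
        for j, c in enumerate(prototypes):
            if point_canopies[i] & proto_canopies[j]:    # shared canopy only
                weights.append(math.exp(-(x - c) ** 2))
            else:
                weights.append(0.0)                      # comparison skipped
        total = sum(weights) or 1.0
        responsibilities.append([w / total for w in weights])
    # M-step: move each prototype to the weighted mean of its visible points
    new_protos = []
    for j in range(len(prototypes)):
        num = sum(responsibilities[i][j] * points[i] for i in range(len(points)))
        den = sum(responsibilities[i][j] for i in range(len(points))) or 1.0
        new_protos.append(num / den)
    return new_protos

protos = em_step_method1([0.0, 0.2, 9.0], [0.1, 8.0],
                         [{0}, {0}, {1}], [{0}, {1}])
```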

Complexity
n: number of data points
c: number of canopies
f: average number of canopies covering a data point
Thus, expect fn/c data points per canopy
Total distance comparisons needed becomes roughly c * (fn/c)² = f²n²/c, down from n²
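Plugging illustrative (made-up) numbers into these quantities shows the scale of the saving:

```python
# Comparison count with vs. without canopies, using the slide's quantities:
# n data points, c canopies, f canopies per point => about f*n/c points per canopy.
n, c, f = 1_000_000, 10_000, 10           # illustrative values only
without = n * n                           # all-pairs comparisons
with_canopies = f * f * n * n // c        # c * (f*n/c)**2 = f^2 * n^2 / c
print(without // with_canopies)           # 100-fold reduction with these numbers
```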

Reference Matching Results
Labeled subset of Cora data
1916 citations to 121 distinct papers
Cheap metric
Based on shared words in citations
Inverted index makes finding that fast
Expensive metric
Customized string edit distance between extracted author, title, date, and venue fields
GAC for final clustering
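The cheap shared-word metric and its inverted index can be sketched as follows (a hypothetical minimal version; the slide only says the metric is based on shared words and that an inverted index makes finding them fast):

```python
from collections import defaultdict

def build_inverted_index(citations):
    """Map each word to the ids of the citations containing it."""
    index = defaultdict(set)
    for cid, text in enumerate(citations):
        for word in set(text.lower().split()):
            index[word].add(cid)
    return index

def shared_word_counts(cid, citations, index):
    """Cheap metric: words citation `cid` shares with each other citation.

    The index lets us touch only citations sharing at least one word,
    instead of scanning all n citations.
    """
    counts = defaultdict(int)
    for word in set(citations[cid].lower().split()):
        for other in index[word]:
            if other != cid:
                counts[other] += 1
    return dict(counts)

cits = ["canopies for clustering", "efficient clustering methods", "reference matching"]
idx = build_inverted_index(cits)
overlaps = shared_word_counts(0, cits, idx)
print(overlaps)  # {1: 1} -- only citation 1 shares a word ("clustering")
```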

Discussion
How do cheap and expensive distance metrics interact?
Ensure the canopies property
Maximize number of canopies
Minimize overlap
Probabilistic extraction, probabilistic clustering
How do the two interact?
Canopies and classification-based linkage
Only compare pairs of records in the same canopy

Reference Matching Results (table)

Method         F1      Error   Precision   Recall   Minutes
Canopies       0.838   0.75%   0.735       0.976    7.65
Complete GAC   0.835   0.76%   0.737       0.965    134.09
Author/Year    0.697   1.60%   0.559       0.926    0.03
none           -       1.99%   1.000       0.000    0.00