Saturday, December 6, 2014

Fuzzy Record Matching in SQL, Part 1

Let's say you have some existing customer records like this:

Haystack
idNamePhone
1Neil6045551212
2Neil(null)
3(null)6045551212
4(null)(null)

And you have some new customer records like this (say from another company you bought):

Needles
idNamePhone
1Neil6045551212

And you want to find possible matches. Now obviously Haystack1 and Needles1 match, but Haystack2 and Haystack3 are possible matches too.

Here's a basic way to find relevant matches, ranked:

select
 n.id as needle_id,
 h.id as haystack_id,
 case when n.name = h.name then 1 else 0 end 
 + case when n.phone = h.phone then 1 else 0 end as relevance
from 
 needles n
join 
 haystack h 
on 
 n.name = h.name 
 or n.phone = h.phone
order by 
 relevance desc;


This gives us:

Results
needle_idhaystack_idrelevance
112
121
131

No comments: