Fuzzy in T-SQL - by Rob Farley

Status : 

 


78
0
Sign in
to vote
ID 338664 Comments
Status Active Workarounds
Type Suggestion Repros 10
Opened 4/17/2008 7:43:41 AM
Access Restriction Public

Description

The big use-case is around addresses. I often see people cleansing their data around addresses, looking to match addresses which are the same, but spelled slightly wrong or using abbreviations in a slightly different way. Currently, SSIS is the only way of doing this. It's seen as the cleansing tool. 

SSIS is great for cleaning the data and reloading it in a cleaner format. But often you don't want to do that - you just want a query which can do fuzzy grouping or fuzzy joining. 

Of course in the SSIS fuzzy transforms there are a lot of options that can be configured. Support in the engine would need to be able to have database-level settings to provide defaults, and any functions would need to be able to alter them.

I'm thinking something like:

GROUP BY City, FUZZY(Address1 + ' ' + Address2, <options>)

And predicates something like:

WHERE a.Address1 FUZZYMATCH(<options>) b.Address1

or...  if you have boolean types:

WHERE FUZZYMATCH(a.Address1, b.Address1, <options>)

With this, cleansing could be done using a T-SQL statement. INSERT statements could become MERGE statements which look to find existing _similar_ records before inserting a new one.
Sign in to post a comment.
Posted by Mark Guinness on 3/21/2012 at 9:32 AM
@Grant Dickinson - what happened to the FL3 functions in SQL 2008 R2 CTP3? They still appear in documentation, but I can't find them anywhere in the MDS database. Were the functions deprecated, or replaced with another name?

mdq.FuzzyLookup_Build    http://msdn.microsoft.com/en-us/library/ee633804(SQL.105).aspx
mdq.FuzzyLookup_Lookup    http://msdn.microsoft.com/en-us/library/ee633825(SQL.105).aspx
mdq.FuzzyLookup_Match    http://msdn.microsoft.com/en-us/library/ee633909(SQL.105).aspx
Posted by Luke Jian on 9/30/2011 at 12:38 PM
Adding this in here since my suggestion https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=690567 was marked as duplicate of this (and rightfully so)
Please see especially second part about optimizer.

I see a good case for adding a string function that works like fuzzy lookup in SSIS allowing you to calculate something like Levenshtein distance between two strings. This can be implemented using a CLR function but I think would be very useful to be able to use this in something like

... WHERE
(a.FName=b.FName AND a.LName=b.LName )
OR
(SIMILARITY(a.FName,b.FName)>75 AND SIMILARITY(a.LName,b.LName)>75 )

Also the optimizer should be able to detect this pattern of Exact+Fuzzy Match and optimize the query tree as a union of an inner join query (using indexes) and a function based join that excludes all results that matched with the inner join.
This would dramatically improve the partial match performance and could be used to shift the processing from SSIS to SQL Server.
Posted by grantdi1 on 11/9/2009 at 11:46 AM
Grant Dickinson here from the SQL MDS team. We've done some interesting things in the (v1) MDS product in this regard. I think you will like them. First of all we've exposed Regex as a set of CLR functions in T-SQL. We've also written some similarity() functions that do smarter comparisons than the built-in soundex() does - for instance Levenstein, LCS and a couple of others.

In addition to that we've hosted the new version of FuzzyLookup from MSR, called FL3, in SQL as a set of CLR modules. Meaning you can write fuzzy matching code directly in your own stored procedures and queries. Go have a look at it, the docs are here: http://msdn.microsoft.com/en-us/library/ee633712(SQL.105).aspx
And the bits are part of SQL08R2 CTP3: http://www.microsoft.com/sqlserver/2008/en/us/R2.aspx
There's not much guidance around the FL3 stuff yet but I am busy writing a whitepaper that will explain its use.
Posted by ceibner on 6/26/2008 at 8:12 AM
Although Regex could be used as a work around, I think it is often complex & hard to debug so unlikely to be a widely implemented solution. The complexity of using regex to have any functionality similar to the confidence & similarity levels would be incredibly difficult. Having the ability to use this at a function level would be really handy - but I am guessing it too would only be available at enterprise level if implemented?
Posted by Esrever on 6/9/2008 at 11:29 AM
Would be very very nice!!
I have to search doublets in 6.000.000 and i can't find a fast and easy to use solution for that Problem.
I found SSIS Fuzzy Grouping but it seems that the amount of output data is limited by the amount of input data.
More details here: http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=3462136&SiteID=1&mode=1
Posted by Jamie Thomson on 5/21/2008 at 2:06 AM
I think this is going to be really important. There are many many reasons why knowing that a group of things are similar, as opposed to knowing whether they are the same or not, is important. I have talked about it a little here: http://blogs.conchango.com/jamiethomson/archive/2008/05/21/fuzzyness.aspx

-Jamie
Posted by Mitch Wheat on 4/26/2008 at 6:13 PM
Good call Rob. Believe it or not, I have seen someone's very poor attempt to do just this with hundreds of switch statements and the VBScript COM regex object (via sp_OACreate).
Posted by JeremyC-Carmac on 4/18/2008 at 12:43 PM
I completely agree with the need to have fuzzy functionality it T-SQL. I have a scenario where I am trying to match artist, albums and track titles. Which sometimes works with Regex, but more often then not Regex is just unseable. here are some examples:
"Roling Stone" vs. "The Rolling Stones" - SSIS Fuzzy lookup with find these a very similar match with a high confidence. However Regex won't match these at all. One reason is because of the misspelling in "Roling" and another is because I'd have to use Inflection to add the s onto "Stone", SoundEx wouldn't match it either because of "The".
or
"Notorious B.I.G." vs "Notorious BIG" Once again SSIS Fuzzy lookup works great but Regex wouldn't find a match.
Posted by Microsoft on 4/18/2008 at 10:54 AM
Hi Rob,

Nice suggestion. I wonder whether RegEx support would provide enough matching & correction? I agree your point that building a filter/corrector as a SQLCLR class is a promising route - it also has the advantage that, once built, it's "good to go" - no need to add special-case TSQL syntax to access it - the beauty of procedural-extension, right?

However, we can't fit this feature into the current Katmai release - but I've added to our list of features to consider for next release.

Thanks,

Jim