Welcome to The Fuzzy-String Project!
Fuzzy-string processing using Damerau-Levenshtein distance, optimized for Microsoft Transact-SQL
How do you find information that was saved misspelled, or when your search is misspelled? Fuzzy-string processing!
And if your information is in a database, the best place to do that processing is in the database.
This site is mostly for technical folk, but others may find it useful or instructive.
For instance, if you have a word in your Scrabble® tray, Search
it to find out what extra letter you could link it with to make a new word!
(Of course, other sites can provide that service more quickly, because they have pre-calculated all the information.
My intent here is to prove that you can get adequate performance without caching any information.)
This site uses an optimized Transact-SQL form of the Damerau-Levenshtein distance (DLD) algorithm for calculating the difference between strings.
Technical details and SQL source code can be found at SQLServerCentral
(and will be posted on this site as well, starting 2012-12-17.)
Essentially, DLD is the smallest number of “Changes” required to “Transform” one string into another.
Changes can be Inserts (ac→abc),
Deletes (abc→ac),
Substitutions (abc→adc),
or Transpositions (abcd→acbd
but not ababab→bababa)
of letters.
(Changes may be referred to in the code, or on other sites, as Edits, Operations, Typos, Differences, or Cost.)
DLD does not take into account:
See the Reference page for other algorithms that may use those kinds of string similarity.
However, note that if you use a “Change Limit” (also known as a Threshold) to stop calculation when strings are too different,
DLD often needs to only look at the first few letters to discard a potential match, so it will usually be much faster than other algorithms.
DLD also has the speed advantage of using only integer addition and subtraction in calculation.
What DLD excels at is finding accidental typos, such as those caused by:
DLD can be especially effective if your database has an alphanumeric key field that users have to type,
such as a Vehicle Identification Number (VIN). You can imagine the potential for error when dealing with strings like “4S3BH675017600337”
- especially when they can be hard to read because a vehicle is damaged, worn or dirty!
Of course, no matter how optimized, DLD is still slower than basic equality or LIKE pattern-matching.
So suggested practice is to use your existing search feature first; then when the user starts entering a new record because they didn't find exactly what they were looking for,
launch a DLD search in the background. That way, the user does not need to wait for the DLD search to complete before starting entry,
but they can be notified of a potential match before they finish entry and save a new record (which you would have to de-duplicate later).