Fuzzy Matching logic
The rules in Find duplicates allow you to build up detailed match classification rules based on individual components. Each of these components, such as Forename or Street Name, can be configured so that acceptable differences are classified into one of the four match levels.
The individual comparison functions are listed in the rules documentation. They range from standard string comparisons, such as Levenshtein edit distance (the number of single-character edits needed to turn one string into the other) and Jaro-Winkler similarity, to specific comparison functions for elements like postcodes, where we want to apply additional logic.
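To make the idea of an edit distance concrete, here is a minimal Python sketch of Levenshtein distance and a percentage-style score derived from it. It illustrates the general technique only and is not the implementation used by Find duplicates. The function names `levenshtein` and `similarity`, and the choice to normalise by the longer string, are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Rescale the edit distance to 0..1, where 1.0 means identical.
    Dividing by the longer length is one common convention (assumed here)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("Jon", "John")` is 1 because a single inserted character separates the two names, which is the kind of small difference a rule can be configured to accept.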
These comparisons are combined into overall rules. A pseudo-example is shown below:
- A name is a candidate for manual review if
- The forenames differ by 3 characters
- OR
- The root names of the forenames are the same (e.g. Jon -> Jonathan and John -> Johnathan)
- AND
- The surnames have an edit distance similarity of 90% or higher
Rules like this can be layered up across the four match levels to provide as much control as needed.
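The pseudo-example rule could be sketched in Python as follows. This is an illustration under several assumptions, not the product's rule engine: it treats the 3-character forename condition as an edit distance of at most 3, reads the 90% surname condition as a normalised similarity score, and groups the two forename conditions together before applying AND. The `ROOT_NAMES` table and every function name are hypothetical.

```python
# Hypothetical root-name lookup; a real table would be far larger.
ROOT_NAMES = {
    "jon": "jonathan", "jonathan": "jonathan",
    "john": "johnathan", "johnathan": "johnathan",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        new_row = [i]
        for j, ch_b in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1,
                               row[j - 1] + (ch_a != ch_b)))
        row = new_row
    return row[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1 (assumed meaning of the 90% threshold)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def root_name(forename: str) -> str:
    """Return the root form of a forename, or the name itself if unknown."""
    key = forename.lower()
    return ROOT_NAMES.get(key, key)


def is_manual_review_candidate(forename_a: str, forename_b: str,
                               surname_a: str, surname_b: str) -> bool:
    # Forename condition 1: at most 3 character edits apart (assumption).
    forenames_close = edit_distance(forename_a.lower(), forename_b.lower()) <= 3
    # Forename condition 2: both forenames share the same root name.
    same_root = root_name(forename_a) == root_name(forename_b)
    # Surname condition: similarity of 90% or higher.
    surnames_close = similarity(surname_a.lower(), surname_b.lower()) >= 0.90
    # Assumed grouping: (condition 1 OR condition 2) AND surname condition.
    return (forenames_close or same_root) and surnames_close
```

With these assumptions, "Jon Smith" and "Jonathan Smith" qualify through the shared root name even though the forenames are more than 3 edits apart, while "Jon Thompson" and "John Thomson" do not, because the surname similarity of 87.5% falls below the threshold.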