Fuzzy Matching logic

Clinton Jones · June 2019

How does the fuzzy matching in the Find Duplicates step work?

Akshay Davis · June 2019

The rules in Find Duplicates allow you to build up detailed match classification rules based on individual components. Each of these components, like Forename or Street Name, can be configured so that acceptable differences are classified into one of the four match levels.

These individual comparison functions are listed in the rules documentation. They range from standard comparisons, such as Levenshtein edit distance (essentially the number of character edits needed to turn one string into the other) and Jaro-Winkler similarity, to specific comparison functions for elements like postcodes, where we want to apply additional logic.
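
To make the edit-distance idea concrete, here is a minimal Python sketch of Levenshtein distance and a percentage-style similarity derived from it. This is a generic textbook version, not the Find Duplicates implementation; the function names and the normalisation by the longer string's length are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0.0-1.0, where 1.0 means identical.
    Assumption: normalise by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("Jon", "John")` is 1 (one inserted character), while "Smith" and "Smyth" differ by one substitution out of five characters, giving a similarity of about 0.8.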

These are combined into overall rules, such as the pseudo example below:

A name is a candidate for manual review if
- The forenames differ by up to 3 characters
  - OR
  - The root names of the forenames (e.g. John -> Johnathan and Jon -> Jonathan) are the same
- AND
- The surnames have an edit-distance similarity of 90% or higher

These can be layered up into the top four match levels to provide as much control as needed.
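
As a rough illustration, the pseudo rule above could be sketched in Python as follows. This is not the product's rule syntax. The `ROOT_NAMES` table, the function names, the reading of "3 characters" as "at most 3 edits", and the similarity formula are all hypothetical choices made for this example.

```python
# Hypothetical root-name lookup; a real one would be far larger.
ROOT_NAMES = {
    "jon": "jonathan",
    "jonathan": "jonathan",
    "john": "johnathan",
    "johnathan": "johnathan",
}


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def root_name(forename: str) -> str:
    """Map a forename to its root form; unknown names map to themselves."""
    key = forename.lower()
    return ROOT_NAMES.get(key, key)


def is_manual_review_candidate(forename_a: str, surname_a: str,
                               forename_b: str, surname_b: str) -> bool:
    """(forenames within 3 edits OR same root name) AND surnames >= 90% similar."""
    forenames_close = levenshtein(forename_a.lower(), forename_b.lower()) <= 3
    same_root = root_name(forename_a) == root_name(forename_b)
    surnames_close = similarity(surname_a.lower(), surname_b.lower()) >= 0.9
    return (forenames_close or same_root) and surnames_close
```

With these assumptions, "Jon Smith" and "Jonathan Smith" qualify through the root-name branch even though the forenames are five edits apart, while "Jon Smith" and "John Smyth" do not, because the surnames are only about 80% similar.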
