Tuning Rules (Review Find Duplicates step results, compare records and visualize results)

Sueann See, Administrator
edited September 20 in Resources

We shared some pointers on tuning the Find Duplicates blocking keys in a separate post. In a similar way, you may also want to tune the Find Duplicates ruleset by reviewing the Find Duplicates step results and using the Find Duplicates workbench.

Review Find Duplicates step results

When you preview the Find Duplicates results, you will see the Cluster ID and Match status for each record. The blocking keys influence the Cluster ID because they determine which record comparisons are made, but the Find Duplicates ruleset ultimately determines both the Cluster ID and the Match status.
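To make the division of labour concrete, here is a minimal Python sketch, not the product's actual implementation. It assumes a hypothetical blocking key (surname plus date of birth), a hypothetical stand-in for the ruleset (`rules_match`), and that linked records are grouped into clusters as connected components. All record fields and function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "forename": "John", "surname": "Doe", "dob": "1980-01-01"},
    {"id": 2, "forename": "J", "surname": "Doe", "dob": "1980-01-01"},
    {"id": 3, "forename": "Mary", "surname": "Smith", "dob": "1975-05-05"},
]

def blocking_key(rec):
    # Assumed blocking key: surname plus date of birth.
    return (rec["surname"].lower(), rec["dob"])

def rules_match(a, b):
    # Stand-in for the ruleset: same forename, or an initial vs. a full name.
    fa, fb = a["forename"].lower(), b["forename"].lower()
    return fa == fb or (fa[0] == fb[0] and 1 in (len(fa), len(fb)))

def assign_clusters(recs):
    # Step 1: blocking decides which pairs are compared at all.
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec)

    # Step 2: the rules decide which compared pairs are linked (union-find).
    parent = {rec["id"]: rec["id"] for rec in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for block in blocks.values():
        for a, b in combinations(block, 2):
            if rules_match(a, b):
                parent[find(b["id"])] = find(a["id"])

    # The root of each record's set serves as its cluster ID.
    return {rec["id"]: find(rec["id"]) for rec in recs}

clusters = assign_clusters(records)
print(clusters)
```

In this sketch, two records that never share a blocking key can never be linked, however well they would score against the rules. That is why both the blocking keys and the ruleset are worth checking when a cluster looks wrong.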

If potentially duplicate records have different Cluster IDs, or records that are not duplicates share one, you may need to refine both the blocking keys and the ruleset. The example below shows Janet Doe and John Larry Doe assigned the same Cluster ID and evaluated as a close match, although they may actually be 2 different people.

You may notice that the Match status is the same for all records within a cluster. This is because the Match status is always assigned based on the lowest-confidence match among the records in that cluster. For example, even though the 2 records for John Larry Doe are exactly the same, the Match status for the cluster is still 1 (Close Match), because other records in the same cluster, such as John L Doe, J Doe, John Doe or Janet Doe, have a lower-confidence match.
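The "lowest confidence wins" behaviour can be sketched in a few lines of Python. This is an illustration only: it assumes match levels are numbered 0 (Exact) through 3 (Possible), so the lowest-confidence level is the highest number, and the pairwise levels shown are made up.

```python
# Assumed level numbering: 0 = Exact, 1 = Close, 2 = Probable, 3 = Possible.
LEVEL_NAMES = {0: "Exact", 1: "Close", 2: "Probable", 3: "Possible"}

def cluster_match_status(pair_levels):
    """Return the lowest-confidence (highest-numbered) level in a cluster."""
    return max(pair_levels)

# Hypothetical pairwise results within one cluster:
# one exact pair plus two close pairs.
pair_levels = [0, 1, 1]
status = cluster_match_status(pair_levels)
print(status, LEVEL_NAMES[status])
```

A single close pair is enough to mark the whole cluster as Close, even when other pairs in it match exactly.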


Using Find Duplicates Workbench

Within the Find Duplicates workbench, you will find the following utilities to help with tuning the Find Duplicates rules:

  • Compare records and Visualize Rules to show how 2 records are evaluated against the Find Duplicates rules.
  • Ruleset editor that allows you to easily view and edit the Find Duplicates rules in a tabular format.
  • Tune rules section that suggests improvements to your current ruleset by training a machine learning algorithm based on your input on the matching outcomes.


Compare records and visualize rules

In order to know how any 2 records end up in the same cluster, we can use the Find Duplicates workbench to compare the records and visualize the rules.

First search for the records with a common value.

Then select 2 records to analyze.


The standardization results show that these records have been blocked successfully as indicated by the green highlight.

Go to rules to visualize how the Find Duplicates rules have been evaluated.

@default.country=GBR


/*
* Aliases
*/
define Exact as L0
define Close as L1
define Probable as L2
define Possible as L3

/* Match Rule */
Match.Exact={Hash.Exact}
Match.Close={Name.Close & DOB.Exact}

/*Theme Rule*/
Hash.Exact={[ExactMatch]}
Name.Close = {Forenames.Close & Surname.Exact}

/* Element Rule */
Forenames.Close = {ForenameCompare[InitialVsFullName] | ForenameCompare[InvertedNameMatch]}
Surname.Exact = {Surname[ExactMatch]}
DOB.Exact = {Date[ExactMatch]}

Based on the visualization, the 2 records have been evaluated as an exact match. A close match is inferred because a higher-confidence match has already been found. Rules are evaluated in order from L0 (Exact Match) through the lower-confidence levels (L1, L2, L3), stopping at the first level that passes. Drilling down further into the sub-conditions for a close match confirms that the lower-level expressions of the close match rule are not actually evaluated.
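The stop-at-first-pass ordering can be illustrated with a short Python sketch. It is not the product's rule engine: the `exact` and `close` predicates are hypothetical simplifications of the Match.Exact and Match.Close rules, and the record fields are invented for the example.

```python
def evaluate(levels, pair):
    """Try each (name, predicate) in order; stop at the first that passes."""
    evaluated = []
    for name, predicate in levels:
        evaluated.append(name)
        if predicate(pair):
            return name, evaluated
    return None, evaluated

def exact(pair):
    # Hypothetical stand-in for Match.Exact: the records are identical.
    a, b = pair
    return a == b

def close(pair):
    # Hypothetical stand-in for Match.Close: same surname and date of
    # birth, with forenames sharing an initial.
    a, b = pair
    return (a["surname"] == b["surname"]
            and a["dob"] == b["dob"]
            and a["forename"][0] == b["forename"][0])

# Ordered from highest confidence (L0) downwards.
levels = [("L0 Exact", exact), ("L1 Close", close)]

john = {"forename": "John", "surname": "Doe", "dob": "1980-01-01"}
initial = {"forename": "J", "surname": "Doe", "dob": "1980-01-01"}

same_result, same_trace = evaluate(levels, (john, dict(john)))
init_result, init_trace = evaluate(levels, (john, initial))
print(same_result, same_trace)
print(init_result, init_trace)
```

For the identical pair only the L0 predicate runs, which mirrors the visualization: the close match rule is reported as passing without its sub-conditions being evaluated.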


We'll cover more on the Ruleset editor and Tune Rules features in a separate article.
