Tune Rules with Find Duplicates Workbench
The Find Duplicates Workbench lets you tune your Find Duplicates rules using machine learning. Once you have established your duplicate store with your initial settings, you can start tuning your rules further.
Example
We are comparing some product information as generic strings. Here is an extract of the current Find Duplicates results.
We run the workflow to establish the duplicates store, open it using the Find Duplicates Workbench, then go to Tune Rules.
To begin, press the Start button. We are presented with two records from the duplicate store. Using the Yes, Maybe, and No buttons, we select whether the two records should be considered a duplicate match. We are then presented with new pairs of records until we choose to stop providing input or there are no more records in the duplicate store that can be used.
Rules tuning begins only after you have provided enough expected match outcomes. Once it does, the rules tuning status appears at the bottom of the page, where two tables are displayed.
The first table (left) shows how well the default ruleset performs, and the second table (right) shows how well the tuned ruleset performs. To understand how well each ruleset is performing, compare the inferred match result (the result returned by the ruleset) against the actual match result (the result you provided). A higher overlap between the inferred and actual match results means a better performing ruleset.
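The comparison of inferred and actual match results can be illustrated with a small Python sketch. This is not how the Workbench computes its tables internally; it only shows the idea of counting how often the two results agree. The function names, the lowercase labels, and the six sample answers are all made up for illustration.

```python
from collections import Counter

# Assumed labels, mirroring the Yes, Maybe, and No buttons.
LABELS = ("yes", "maybe", "no")


def confusion_table(inferred, actual):
    """Count record pairs for every (actual, inferred) combination."""
    if len(inferred) != len(actual):
        raise ValueError("inferred and actual must have the same length")
    counts = Counter(zip(actual, inferred))
    # Outer key: actual result. Inner key: inferred result.
    return {a: {i: counts[(a, i)] for i in LABELS} for a in LABELS}


def agreement(table):
    """Fraction of record pairs where inferred and actual results match."""
    total = sum(sum(row.values()) for row in table.values())
    matched = sum(table[label][label] for label in LABELS)
    return matched / total if total else 0.0


# Hypothetical answers for six record pairs (not real Workbench output).
actual = ["yes", "yes", "yes", "no", "no", "maybe"]
default_inferred = ["yes", "no", "maybe", "no", "yes", "maybe"]
tuned_inferred = ["yes", "yes", "maybe", "no", "no", "maybe"]

default_table = confusion_table(default_inferred, actual)
tuned_table = confusion_table(tuned_inferred, actual)

print("default agreement:", agreement(default_table))
print("tuned agreement:", agreement(tuned_table))
```

In this made-up data the tuned results agree with the provided answers more often than the default results do, which is the kind of difference to look for when comparing the two tables.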
In this product data example, note that the current best result has a higher number of inferred Yes results matching actual Yes results. We can click Download Ruleset at this point to take a look at the tuned ruleset. When we evaluate the differences, we can see that some of the rule criteria have been changed. Depending on your data, it could be a matter of changing the comparators used or tuning the percentages.
Original (default ruleset)
Tuned (current best ruleset)
To test the tuned rules, first clone your existing Find Duplicates settings and then replace the rules in the clone with the tuned rules. To confirm whether the tuned ruleset works better for the entire dataset, re-run your workflow with the new Find Duplicates settings. If you are not satisfied with the tuned rules, you can always revert to the previous settings.