Manual review of Find duplicates clusters

Akshay Davis · September 2021

Using an approach similar to the one covered in an earlier post, we demonstrate a method for manually reviewing clusters, either to split one cluster into two or more or to join two or more clusters together.

In this example we will use the standard sample Find duplicates test file with UK addresses.

GBR Find Duplicates Sample.csv

Problem overview

Using the standard GBR Individual rules the sample dataset returns the following clusters

This screenshot has examples of two scenarios we want to handle.

Example 1: Cluster to split

The records on rows 11 and 12 (record IDs 123466 and 123467) have the same name and address, but the DOB indicates they could be father and son living at the same address. For argument's sake, we assume that DOB is not included in the match rules, but that a manual review has identified these as two separate individuals.

Example 2: Cluster to join

The records on rows 4 and 5 (record IDs 123459 and 123460) have the same name and DOB (which is not included in the default Individual rules), but they have not matched because their addresses differ. In this example we assume that a manual review identifies these as the same individual following a change of address, so the two records should be joined.

Add a discriminant to the rules

As discussed in the previous post, a discriminant can be used to break or join clusters. We will simply add one called "Manual Review". To do this we will add a column called "Manual Review" to our source data and a rule to match on it.

The first step is to modify our match rules so that two records can only match where both have the same Manual Review ID (ExactMatch), only one has a Manual Review ID (OnePopulated), or neither has a Manual Review ID (NonePopulated). We will map the Manual Review ID as a Generic_String and use the following new ManualReview rule:

ManualReview.Exact={Generic_String[ExactMatch]}
ManualReview.Close={Generic_String[OnePopulated,NonePopulated]}

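This allows us to distinguish three cases: records that have been manually identified as the same (Exact), records where at least one has not been manually reviewed (Close), and records that have been manually identified as different. In the last case both records have a Manual Review ID but the IDs differ, so neither level is satisfied and the result is a NoMatch.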
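The three outcomes can be sketched in Python as a way of checking the logic. This is only a model of the rule as described above, not Aperture's own evaluation; the function name manual_review_level is hypothetical, and treating blank or whitespace-only values as "not populated" is an assumption.

```python
from typing import Optional


def manual_review_level(a: Optional[str], b: Optional[str]) -> str:
    """Model the ManualReview rule for two Manual Review IDs.

    Returns "Exact", "Close" or "NoMatch".
    Assumption: None, empty and whitespace-only values count as unpopulated.
    """
    a = (a or "").strip()
    b = (b or "").strip()
    if a and b:
        # Both records were reviewed: equal IDs join, different IDs break.
        return "Exact" if a == b else "NoMatch"
    # OnePopulated or NonePopulated: no manual decision applies to this pair.
    return "Close"


print(manual_review_level("AKSHAY-2021-09-14-JOIN1", "AKSHAY-2021-09-14-JOIN1"))
print(manual_review_level("AKSHAY-2021-09-14-BREAK1", "AKSHAY-2021-09-14-BREAK2"))
print(manual_review_level("AKSHAY-2021-09-14-JOIN1", ""))
```

The three calls correspond to a joined pair, a split pair, and a pair where only one record has been reviewed.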

We then update the existing top-level rules to test for ManualReview.Exact and ManualReview.Close, as shown below:

Match.Exact={ManualReview.Exact | (ManualReview.Close & (...))}
Match.Close={ManualReview.Close & (...)}
Match.Probable={ManualReview.Close & (...)}
Match.Possible={ManualReview.Close & (...)}

Where (...) represents the existing match rules.

If two records have been manually reviewed and given the same ID, ManualReview.Exact will evaluate to true and will result in an overall exact match.

If two records have been manually identified as different, the result of the Generic_String match will be NoMatch, so neither ManualReview.Exact nor ManualReview.Close is satisfied and the records will not match overall.
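The way the manual review result combines with the existing rules can be modelled with a small Python sketch. It assumes the top-level levels are tested from strongest to weakest and that the existing rules (the (...) part) have already produced a single level for the pair; overall_match is a hypothetical name used only for illustration.

```python
def overall_match(review_level: str, existing_level: str) -> str:
    """Combine the ManualReview result with the existing rules' result.

    review_level:   "Exact", "Close" or "NoMatch" from the ManualReview rule.
    existing_level: "Exact", "Close", "Probable", "Possible" or "NoMatch"
                    from the existing match rules, shown as (...) above.
    """
    if review_level == "Exact":
        # Match.Exact = ManualReview.Exact | (...): a manual join always wins.
        return "Exact"
    if review_level == "Close":
        # Every level requires ManualReview.Close & (...), so the
        # existing rules decide the outcome unchanged.
        return existing_level
    # Different Manual Review IDs: no level can be satisfied.
    return "NoMatch"


print(overall_match("Exact", "NoMatch"))    # manual join overrides the address difference
print(overall_match("Close", "Probable"))   # unreviewed pair keeps its existing level
print(overall_match("NoMatch", "Exact"))    # manual split overrides an otherwise exact match
```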

A note on blocking keys

Two records which have been manually reviewed and identified as the same must have matching blocking keys, otherwise the rules will not be evaluated for that pair. In this example we are only using Generic_String, which is included in the default blocking keys. If multiple Generic_Strings are being used, the blocking keys would need to be modified to take the element group into account.

Processing the initial file

Adding an empty Manual Review column and processing the file will result in the same clusters as before.

Performing the manual review and re-processing

When manually reviewing we will use an identifier that can help trace the source of the decision.

For example {user id}-{review date}-{decision id}
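A small helper can keep these identifiers consistent. The sketch below is an illustration only: make_review_id is a hypothetical name, and writing the date in ISO format and upper-casing the user and decision parts are assumptions based on the example IDs used in this post.

```python
from datetime import date


def make_review_id(user_id: str, review_date: date, decision_id: str) -> str:
    """Build a Manual Review ID as {user id}-{review date}-{decision id}."""
    # Assumption: ISO dates (YYYY-MM-DD) and upper-case text parts.
    return f"{user_id.upper()}-{review_date.isoformat()}-{decision_id.upper()}"


print(make_review_id("akshay", date(2021, 9, 14), "join1"))
# AKSHAY-2021-09-14-JOIN1
```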

To join the two records from Example 2, we will give both the same Manual Review ID: AKSHAY-2021-09-14-JOIN1

To split the two records from Example 1, we will give each a different ID:

AKSHAY-2021-09-14-BREAK1
AKSHAY-2021-09-14-BREAK2
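If the review decisions are held outside the file, they can be written into the Manual Review column with a short script before re-processing. The following is a minimal sketch using Python's csv module. The record IDs and review IDs come from the examples above, but the column names "Record ID" and "Manual Review" and the function name apply_decisions are assumptions, so adjust them to the headers in your own file.

```python
import csv
import io

# Decisions from the manual review in this post, keyed by record ID.
DECISIONS = {
    "123459": "AKSHAY-2021-09-14-JOIN1",
    "123460": "AKSHAY-2021-09-14-JOIN1",
    "123466": "AKSHAY-2021-09-14-BREAK1",
    "123467": "AKSHAY-2021-09-14-BREAK2",
}


def apply_decisions(src, dst, decisions,
                    id_column="Record ID", review_column="Manual Review"):
    """Copy CSV rows from src to dst, filling in the Manual Review column.

    src and dst are text file objects. The column names are hypothetical.
    The review column is added if the source file does not have it yet.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if review_column not in fieldnames:
        fieldnames.append(review_column)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep any existing value unless a decision exists for this record.
        existing = row.get(review_column) or ""
        row[review_column] = decisions.get(row[id_column], existing)
        writer.writerow(row)


# Usage with in-memory data; real files would be opened with newline="".
source = io.StringIO("Record ID,Manual Review\n123459,\n123461,\n")
target = io.StringIO()
apply_decisions(source, target, DECISIONS)
print(target.getvalue())
```

Records without a decision keep an empty Manual Review value, so they continue to be matched by the existing rules.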

After processing this file, the two records we joined appear in the same cluster, and the other two are split into separate clusters.

The data, workflow and Find duplicates rules used in this example are below. The export was generated in Aperture 2.4.6, so it may not be compatible with earlier versions.

Data Studio Export - Manual Review Example.dmxd