Quick introduction to Blocking Keys and Rules for Find Duplicates
The concept of Blocking Keys and Rules may be foreign to you if you haven’t already used the Find Duplicates step in Aperture Data Studio.
- Blocking keys identify records that are similar, creating blocks (groups of potential matches).
- Rules compare every pair of records within the resulting blocks and return a cluster ID and a match status.
- Cluster ID is a unique identifier for each cluster. A cluster is a collection of records that have been confirmed as representing the same entity based on the Ruleset.
- Match Status represents the level of confidence that the records in a cluster represent the same entity.
Why do we need blocking keys?
Blocking keys make it easier and faster to locate duplicate records.
Let’s say you have a dataset with only four records.
To determine whether any of the records match, the Find Duplicates engine would need to compare every record pair, which means 6 comparisons:
- Row 2 & 3
- Row 3 & 4
- Row 4 & 5
- Row 2 & 4
- Row 2 & 5
- Row 3 & 5
Now suppose you set up a blocking key based on Date of Birth, because only records with the same Date of Birth could potentially match. Records with the same Date of Birth are then blocked together, and only the record pairs within each block need to be compared.
Date of Birth: 20-Jun-1976
- Row 2 & 5
Date of Birth: 1-Jan-1970
- Row 3 & 4
So the number of comparisons required has dropped from 6 to only 2.
The more records you have in a dataset, the more comparisons need to be performed: without blocking, n records require n(n-1)/2 comparisons, so the count grows roughly with the square of the dataset size. For a large dataset, the right blocking keys can make a big difference to performance.
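The four-record example above can be sketched in a few lines of Python. This is only an illustration of the counting argument, not how the Find Duplicates engine is implemented; the `records` dictionary and the function names `all_pairs` and `blocked_pairs` are made up for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Row number -> Date of Birth, mirroring the four-record example above.
records = {
    2: "20-Jun-1976",
    3: "1-Jan-1970",
    4: "1-Jan-1970",
    5: "20-Jun-1976",
}

def all_pairs(rows):
    """Every unordered pair of rows: n * (n - 1) / 2 comparisons."""
    return list(combinations(sorted(rows), 2))

def blocked_pairs(rows_to_key):
    """Group rows by blocking key value, then pair rows only within each block."""
    blocks = defaultdict(list)
    for row, key in rows_to_key.items():
        blocks[key].append(row)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return sorted(pairs)

print(len(all_pairs(records)))  # 6 comparisons without blocking
print(blocked_pairs(records))   # [(2, 5), (3, 4)] with Date of Birth as the key
```

With only four rows the saving is small, but the same grouping applied to a large dataset removes the vast majority of pairwise comparisons, provided the blocking key does not split true duplicates into different blocks.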
Why do we need rules in addition to blocking keys?
- Blocking keys determine which records need to be compared. Rules determine how to compare these records.
- Blocking keys identify potential matches. Rules determine whether the records really match and how closely they match.
- Rules allow you to define different levels of match, representing the match status or level of confidence. For example:
  - If First Name, Last Name and Date of Birth all match for Row 2 & 5, you may define it as an Exact Match.
  - If First Name does not match but Last Name and Date of Birth do match for Row 2 & 5, you may define it as a Possible Match.
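The two example rules above could be written as a small Python function. This is a rough sketch under stated assumptions: the field names, the sample names in `row_2` and `row_5`, the plain equality checks and the "No Match" fallback are all invented for illustration, and none of it reflects the actual rule syntax in Aperture Data Studio.

```python
def match_status(a, b):
    """Apply the two example rules to one record pair from the same block."""
    same_first = a["first_name"] == b["first_name"]
    same_last = a["last_name"] == b["last_name"]
    same_dob = a["date_of_birth"] == b["date_of_birth"]

    if same_first and same_last and same_dob:
        return "Exact Match"
    if same_last and same_dob:
        # First Name differs, but Last Name and Date of Birth agree.
        return "Possible Match"
    return "No Match"  # assumed fallback label, not taken from the product

# Hypothetical values for Row 2 and Row 5 (the names are made up).
row_2 = {"first_name": "Jon", "last_name": "Smith", "date_of_birth": "20-Jun-1976"}
row_5 = {"first_name": "John", "last_name": "Smith", "date_of_birth": "20-Jun-1976"}

print(match_status(row_2, row_5))  # Possible Match
```

The rules are checked from strictest to loosest, so a pair receives the highest level of confidence it qualifies for.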
Look out for the next post on how to identify blocking keys.