Tuning Blocking Keys
Review Find Duplicates step results
Reviewing the Find Duplicates results is not the most conclusive way to confirm the effectiveness of the blocking keys, but it does help to reveal obvious issues that may warrant further investigation.
Once a set of Blocking Keys and Rules has been established in the Find Duplicates settings, we can create a test workflow connecting the source dataset to the Find Duplicates step. For testing purposes, it is best to start with a small dataset, or a small sample of the intended source dataset, so that we do not have to wait too long for the results. Once we gain some confidence in the settings, we can re-run the test with larger samples.
Preview the Find Duplicates results by clicking on Show step results.
- Sort the results by the Cluster ID so that you get a better view of the clusters or groups that have been formed. Check the records that have the same Cluster ID. Do they look like legitimate duplicates? If so, then we are on the right track.
- Note that duplicate records having different Cluster IDs does not necessarily indicate a problem with the blocking keys, as the Cluster ID is ultimately determined by the Find Duplicates rules as well. As a reminder, the rules determine both the Cluster ID and the match status once the records have been grouped based on the blocking keys.
Using the Find Duplicates Workbench
The Find Duplicates Workbench is a utility you can use to make improvements to your Find Duplicates step configuration.
At the Find Duplicates step, ensure that Retain the duplicate store is checked and that a name is entered for the duplicate store. This will enable you to inspect the Find Duplicates results with the Find Duplicates Workbench.
When you preview the step results, the duplicate store is created and can be found at the duplicate store location. By default, this is the experianmatch folder within the installation's data directory, for example, C:\ApertureDataStudio\data\experianmatch\DuplicateStoreExample
Make a copy of the duplicate store so that the actual store is not affected while you perform the analysis. Open the Find Duplicates Workbench, enter the path to the duplicate store copy, then click OK to get started.
Search and View Standardization to determine if records are in the same block
The most conclusive way to determine whether two records have been successfully blocked or grouped as potential duplicates is to compare the blocking keys generated for them.
In the Find Duplicates Workbench, go to Search records, then enter a portion of your record, for example just the Last Name, to retrieve related records.
Select two records to be compared and you will be presented with the Standardization screen.
The Blocking keys section shows the generated keys for the records. Based on the example in the screenshot above:
- The AllColumns blocking key values did not match, so these records are not blocked using this key.
- The LastNameBirthDate blocking key values matched (highlighted in green), so these records have been successfully blocked or grouped as potential duplicates using this key (a sketch of how such a key might be defined follows below).
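For reference, a key like LastNameBirthDate could be defined along the following lines. This is only an illustrative sketch that reuses the SURNAME and DATE element types and the GBR country code from the example at the end of this article; the AllColumns key is not shown, and your own definition will depend on how your columns are mapped.

[
  {
    "description": "LastNameBirthDate",
    "countryCode": "GBR",
    "elementSpecifications": [
      { "elementType": "SURNAME" },
      { "elementType": "DATE" }
    ]
  }
]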
Analyze Blocking Keys
To further improve the efficiency and effectiveness of the blocking keys, use the Analyze blocking keys feature available in the Find Duplicates Workbench.
Click on the Analyze Blocking Keys tab and click Analyze.
The first table displayed shows the efficiency and effectiveness of the blocking keys.
The higher the cost, the more a blocking key slows down performance. However, this does not mean that we should eliminate the most costly blocking key right away.
Looking at this example:
- Both blocking keys have overlaps, and the number of overlaps is the same. This indicates that we could potentially eliminate one of the blocking keys to improve performance without affecting the current results.
- Based on Matches not found by other keys, the LastNameBirthDate blocking key finds more matches that no other key finds. This indicates that we need to keep this blocking key to have a higher chance of matching.
- The Matches not found by other keys for AllColumns is 0, meaning this blocking key has not caught any additional matches that have not been found by other blocking keys. This indicates that we can safely eliminate this blocking key without affecting the current match results.
- If we eliminate the AllColumns blocking key, we would improve performance by saving the cost of this blocking key. In this case, the cost is 1, i.e. it makes very little difference in terms of performance. Would we be better off keeping the AllColumns blocking key?
Analyzing blocking keys on different datasets will produce different results; it all depends on the data. We may need to run Analyze blocking keys several times with larger samples to confirm our findings before making any changes to the blocking keys. In summary, we should always consider the cost/benefit of keeping each blocking key. We should consider eliminating a blocking key, or making it more specific, if:
- Cost is high and the number of matches not found by other keys is low.
- Cost is high and the number of overlaps with other keys is high (a sketch of making a key more specific follows below).
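To give a hypothetical illustration of the second case (this is only a sketch reusing the element types from the example at the end of this article, not output from the product, and LastNameOnly is an invented name): a blocking key built on the surname alone, such as the one below, would typically produce large blocks and therefore a high cost and heavy overlap with other keys. Making it more specific could be as simple as adding another element such as DATE, turning it into a key like the LastNameBirthDate sketch shown earlier, so that only records sharing both the surname and the date of birth fall into the same block.

[
  {
    "description": "LastNameOnly",
    "countryCode": "GBR",
    "elementSpecifications": [
      { "elementType": "SURNAME" }
    ]
  }
]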
You may see a few more tables below, depending on whether the system is able to provide any suggestions for improvement. These are suggested variations of the existing blocking keys. They are not necessarily better and need to be evaluated based on the provided statistics. For example, if the number of Matches Found for a suggested blocking key is much lower than what your original blocking key can find, then there is no point in changing your original blocking key.
Making a blocking key more specific
We may have started off creating basic blocking keys based on individual elements or combinations of elements in a dataset. To refine the blocking keys and make them more efficient, we can use some advanced configuration, such as element modifiers, algorithms, and include/truncate character settings, so that alternative values are considered when blocking records instead of the exact values in the data.
As an example, consider the following First Name variants with the same Last Name and Date of Birth.
- If we were only using the All Columns blocking key, then all of these records would be treated as unique since they are not an exact match.
- If we were using Last Name and Date of Birth as a blocking key, then all of these records would be considered potential duplicates, including the one that could be a totally different name, i.e. Janet Doe.
- In reality, we may want only John Larry Doe, John L Doe, John Doe, and Jon Doe to be in the same block in order to improve the matching efficiency.
- We could apply the START_SUBSTRING algorithm to use the first 4 characters of the First Name, combined with the Last Name and Date of Birth, as the blocking key. Any records with the same first 4 characters would then be blocked and considered potential duplicates, i.e. John Larry Doe, John L Doe, John Doe. Then, how can we include Jon Doe?
- We could additionally apply the RootName element modifier to the First Name so that a more standardized form is used to determine whether two records should be blocked. The system can identify the root name for Jon as John. As a result, John Larry Doe, John L Doe, John Doe, and Jon Doe can all be blocked and considered potential duplicates.
The resulting blocking key would look something like this:
[ { "description": "FirstNameLastNameBirthDate", "countryCode": "GBR", "elementSpecifications": [ { "elementType": "FORENAMES", "elementModifiers":[ "ROOTNAME" ], "algorithm": { "name": "START_SUBSTRING", "properties": { "length": 4 } }, "includeFromNChars":1 }, { "elementType": "SURNAME" }, { "elementType": "DATE" } ] } ]
Notice that I've included an additional includeFromNChars setting. This is an optional element you can specify within the element specifications to prevent an error with null blocking keys. When you are using a blocking key that includes the First Name and you are not sure whether the First Name column will always be populated, it is wise to use this setting so that the First Name is only included in the blocking key when it is N or more characters in length (a brief sketch follows below).
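For example, based on the description above, raising this value to 2 should mean that a First Name consisting of a single initial (for example "J") is left out of the generated key, while the other elements are still used. A sketch of just that element specification, assuming this behaviour, might look like:

{
  "elementType": "FORENAMES",
  "elementModifiers": [ "ROOTNAME" ],
  "algorithm": {
    "name": "START_SUBSTRING",
    "properties": { "length": 4 }
  },
  "includeFromNChars": 2
}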
Hopefully this article has helped ease your Find Duplicates journey. If not, we are open to hearing how we can help you further.