The importance of profiling when using Find duplicates

Davis, Akshay · June 2019

When implementing Find duplicates in Aperture Data Studio we've seen many examples of the importance of profiling prior to configuring and running Find duplicates. This has the potential to benefit both the performance of the Find duplicates step as well as the quality of potential duplicates found.

Find duplicates works by trying to reduce the number of records which are scored against each other, and the default rules which ship with Aperture Data Studio make some assumptions about the distribution of the data in an average database. For example, it assumes that addresses should be relatively unique and a good identifier for a potential duplicate record.

In the UK a good way of selecting addresses which are likely to be the same is by just looking at the postcode, and so the default blocking keys use the full postcode as one of the keys to group potential duplicates for scoring, as shown below.

{
 "description": "FullPostcode",
 "countryCode": "GBR",
 "elementSpecifications": [
  {
   "elementType": "POSTCODE",
   "elementModifiers": [
    "STANDARDSPELLING"
   ],
   "includeFromNChars": 5,
   "truncateToNChars": 7
  }
 ]
}

What this simply says is that, where a postcode is detected and is between 5 and 7 characters (valid lengths for UK postcodes), use that to group potential duplicates for scoring.

In most scenarios this is fine, as you are unlikely to have many records at the same address which do not represent the same entity. However, it is possible that this may not be the case depending on the nature of the data being processed or the manner in which it is captured. Examples include:

Dummy/default addresses used in call centers if one is not supplied
Many businesses or accounts registered at the same legal address
Common delivery addresses for different individuals

When this is the case, the blocking key shown above will result in hundreds or thousands of records being scored against each other, which will slow down the duplicate detection process and may result in false positive matches. Although this may not be a problem if it occurs a handful of times, hundreds of these large groupings can significantly slow down the process.

What works well is to start by profiling the data and determining which columns are well populated and relatively unique.

The profile example above (from the sample Customer V1 file) shows that although City is well populated it is not very unique, so would not make a good blocking key on it's own. We would look to combine it with another field to create one more unique.

Post Code on the other hand looks relatively unique, so should be safe to use, but we should confirm the distribution of values within this by looking at the values within.

We can see that these are US ZIP Codes, and there are some reasonable sized clusters of them, as one may expect for US addresses.

This should then be safe to use as a blocking key but is one that should be revisited if during tuning you want to optimize performance.

Roden, Danny · August 2019

Another thing to consider during profiling is, if the data is incomplete, can it be infilled using any additional data sources?

Some examples might be:

Address data - using the Validate Address step you can infill missing or incomplete postcodes/zipcodes and other address information. And for some regions you can even append independent Address Keys to act as a persistent ID for each property (e.g. the Unique Property Reference Number aka UPRN through the UK AddressBase Premium dataset)
Using additional first party data (e.g. a system that contains loyalty information to retrieve membership ID as another match field)
Through trusted third parties (e.g. Business registration numbers or DOB information from Experian)

Roden, Danny · August 2019

In addition to this, it's also a good idea to take a look at some of the most frequent values in the columns you're using the match on. For example, I've worked on a number of projects before where the client wanted to conduct matching using Email address and DOB as a couple of the fields, and when we dived into the data we found a number of default values (e.g. 01/01/1900 as the DOB) and invalid values (e.g. no@email) present which would've skewed the match results had we not dealt with them in advance.

Luke Westlake · August 2019

Aside from performance, as Danny mentions the impact on the matching result can be massive. Here's an article I wrote with a few more examples:

https://www.linkedin.com/pulse/importance-data-quality-matching-luke-westlake/

Roden, Danny · August 2019

In addition to this, it's also a good idea to take a look at some of the most frequent values in the columns you're using the match on. For example, I've worked on a number of projects before where the client wanted to conduct matching using Email address and DOB as a couple of the fields, and when we dived into the data we found a number of default values (e.g. 01/01/1900 as the DOB) and invalid values (e.g. no@email.com) present which would've skewed the match results had we not dealt with them in advance.

The importance of profiling when using Find duplicates

Comments

Categories