Maximum cluster size

Sueann See · January 2022

Find Duplicates comes with a default maximum cluster size of 500. This setting prevents excessive processing time and memory usage that would affect the performance of Aperture Data Studio.

If a cluster is too big, i.e. it exceeds the maximum cluster size, you will notice that all of your records have a match status of 4 (None - records do not match), so it appears that no matches could be found.

The maximum cluster size is used to determine the block record limit and matching score pair limit:

Block record limit - If any individual blocking key produces a number of records that is more than 2x the maximum cluster size setting, that particular key will be ignored. This is a way to prevent wasting time and resources when processing the blocking keys.
Matching record pair limit - The maximum number of matching record pairs allowed is set to (maximum cluster size squared) / 2. This is a very rough way of limiting the in-memory cache size when processing the rules.
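
As a rough illustration, the two limits can be computed from the maximum cluster size as follows. This is only a sketch of the arithmetic described above, not Aperture Data Studio's actual code; the function name cluster_limits is hypothetical, and how the product rounds the result for odd sizes is not stated here, so the division is left unrounded.

```python
def cluster_limits(max_cluster_size: int) -> tuple[int, float]:
    """Return (block record limit, matching record pair limit).

    Hypothetical helper that mirrors the formulas in this article.
    """
    # A blocking key producing more records than this is ignored.
    block_record_limit = 2 * max_cluster_size
    # (maximum cluster size squared) / 2, left unrounded on purpose.
    pair_limit = max_cluster_size ** 2 / 2
    return block_record_limit, pair_limit


# With the default maximum cluster size of 500:
print(cluster_limits(500))  # (1000, 125000.0)
```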

For example, assume your dataset has only 1 column and 4 records, you use Column 1 as the blocking key, and all the record values match.

Block records = 4.
Matching record pairs = 6. The number of matching record pairs is calculated by matching each record with every other record within the block. So in this case the record pairs compare the records in rows 1&2, 2&3, 3&4, 1&3, 1&4 and 2&4.
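
The pair count for a block of n records is n * (n - 1) / 2. A minimal sketch that enumerates the pairs for the 4 rows in this example (the variable names are illustrative only):

```python
from itertools import combinations

rows = [1, 2, 3, 4]

# Every record is paired once with every other record in the block.
pairs = list(combinations(rows, 2))

print(pairs)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(pairs))  # 6, the same as 4 * 3 / 2
```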

If you set the maximum cluster size to 4, the Find Duplicates process will proceed without any issues and will return Match status 0 (Exact match) because:

Block record limit of 4*2 = 8 is greater than number of block records.
Matching score pair limit of (4*4)/2 = 8 is greater than number of matching record pairs.

However, if you set the maximum cluster size to 2, the records will not match and will return with Match status 4 (None - records do not match).

Block record limit of 2*2 = 4 is the same as number of block records. Still no issues here as it is within the limit.
Matching score pair limit of (2*2)/2 = 2 is lower than the number of matching record pairs, indicating that the limit has been exceeded.
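
The two scenarios above can be checked with a short sketch. It assumes a single block in which every record is paired with every other record, and it assumes that a count equal to a limit is still within it (the article confirms this for the block record limit only). The function within_limits is hypothetical and is not part of Aperture Data Studio.

```python
def within_limits(block_records: int, max_cluster_size: int) -> tuple[bool, bool]:
    """Return (block records within limit, record pairs within limit)."""
    # Each record is compared with every other record in the block.
    record_pairs = block_records * (block_records - 1) // 2
    # The key is ignored only when it produces MORE than 2x the setting.
    block_ok = block_records <= 2 * max_cluster_size
    # Assumption: the pair limit is treated as inclusive as well.
    pairs_ok = record_pairs <= max_cluster_size ** 2 / 2
    return block_ok, pairs_ok


print(within_limits(4, 4))  # (True, True): matching proceeds
print(within_limits(4, 2))  # (True, False): pair limit exceeded
```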

What can you do if you suspect that the maximum cluster size has been exceeded?

Tune your blocking keys. Evaluate if you can further refine your blocking keys. Are there modifiers or algorithms you can use to limit the number of records blocked for matching?
Tune your rules. Evaluate if you should refine your rules. Are there additional logical expressions, comparators and filters you could be using to improve matching?
Adjust the maximum cluster size settings. Note: Do this with caution in consultation with our support team or with help from our Professional Services. There will be performance implications if you simply adjust the size to a larger number.
