Matching Results - Find Duplicates

M.Lambert · November 2023

Hi there,

Any idea why the big clusters in the matching results are coming out with different cluster IDs? When I look in the workbench they match perfectly, but the results show different cluster IDs.

TIA

Mariana

henry_ · November 2023

This isn't something I've seen before. When you say "big" clusters, how many records do you mean? Very large clusters may hit a maximum cluster size limit, but that's usually just to guard against matching rules that are too loose and generate a lot of false-positive matches.

M.Lambert · November 2023

I am trying to get a match on the email domain when the domain belongs to a company and not a generic provider like gmail or hotmail. The whole dataset is 8 million records, but the cluster is about 5000 records.

Henry Simms · November 2023

I agree with Josh that, while you could use Find Duplicates here, it doesn't sound like the right tool for the job if you don't have multiple elements to match on, and do not need fuzzy matching.

To just use the grouping approach I would first get the domain from the email (using an "After" function with "@" as the suffix).

Then I'd group on the email domain values, sort by the count, and investigate the large groups (clusters). The domain value itself acts as the cluster ID.
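For anyone who wants to sanity-check the idea outside the tool, here is a rough Python sketch of the same steps: take the text after the "@", group on it, and sort by count. It is not how the Group step is implemented. The function names, the sample emails, the choice to split on the last "@" and the GENERIC_DOMAINS set are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical list of generic providers to leave out; extend as needed.
GENERIC_DOMAINS = {"gmail.com", "hotmail.com"}


def email_domain(email):
    """Return the text after the last "@", or None if there is no usable domain."""
    _, sep, domain = email.rpartition("@")
    return domain if sep and domain else None


def count_domains(emails, skip=GENERIC_DOMAINS):
    """Count emails per domain, largest group first, skipping generic providers."""
    counts = Counter()
    for email in emails:
        domain = email_domain(email)
        if domain is not None and domain not in skip:
            counts[domain] += 1
    # The domain value itself plays the role of the cluster ID.
    return counts.most_common()


# Made-up sample data, only to show the shape of the result.
sample = ["a@acme.com", "b@acme.com", "c@gmail.com", "d@initech.io", "bad"]
for domain, count in count_domains(sample):
    print(domain, count)
```

Values without an "@" are simply dropped here; in practice you may prefer to route them to a separate review list.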

If you wanted to do a slightly more "fuzzy" match on the domains, I would do some pre-processing on the values before grouping, e.g. remove noise and trim leading, trailing, or invalid characters.
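As a rough illustration of that kind of pre-processing, the Python sketch below lower-cases each domain and trims whitespace and stray punctuation before counting. Which characters count as "noise" is an assumption here, as are the function names and sample values; the right cleanup depends on what the data actually contains.

```python
from collections import Counter


def normalize_domain(raw):
    """Lower-case a domain and trim surrounding whitespace and stray punctuation."""
    # The set of characters to trim is an assumption; adjust it to your data.
    return raw.strip().lower().strip(".,;<>\"' ")


def group_normalized(domains):
    """Group domains on their normalized form, largest group first."""
    return Counter(normalize_domain(d) for d in domains).most_common()


# Made-up values that differ only by case or surrounding noise.
sample = ["Acme.com", "acme.com ", "acme.com>", "initech.io"]
for domain, count in group_normalized(sample):
    print(domain, count)
```

This only merges values that differ by case or surrounding noise; it does not attempt typo correction such as "acme.con".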

Josh Boxer · November 2023

Hi

Group step to create a View would look like this:

If the View is set to Interactive then users will be able to drilldown into rows in each group, in your case emails per domain.

Josh Boxer · November 2023

You are matching on ~5k customers who share the same email domain and no other information? Since you know this is not the same individual duplicated, it seems odd to use the Find Duplicates feature in this way. I would probably suggest using the Group step to count domains that appear numerous times.

It is possible to increase the maximum cluster size, but note that this significantly impacts processing time:

https://community.experianaperture.io/discussion/comment/1801#Comment_1801

M.Lambert · November 2023

Hi guys, this is very helpful.

I have some thinking to do now, and try a new approach.

Thanks a lot for your reply

M.Lambert · November 2023

Hey Josh,

That's very helpful.

Thanks a lot.
