Matching Results - Find Duplicates
Best Answers
-
This isn't something I've seen before. When you say "big" clusters, how many records do you mean? Very large clusters may hit a maximum cluster size limit, but that's usually just to guard against matching rules that are too loose and generate a lot of false-positive matches.
-
I am trying to get a match on the email domain when the domain belongs to a company rather than a generic provider like gmail or hotmail. The whole dataset is 8 million records, but the cluster is about 5,000 records.
-
I agree with Josh that, while you could use Find Duplicates here, it doesn't sound like the right tool for the job if you don't have multiple elements to match on, and do not need fuzzy matching.
To just use the grouping approach, I would first get the domain from the email (using an "After" function with "@" as the suffix).
Then I'd group on the email domain values, sort by the count, and investigate the large groups (clusters). The domain value itself acts as the cluster ID.
If you wanted to do a slightly more "fuzzy" match on the domains, I would do some pre-processing on the values before grouping, e.g. remove noise and trim leading and trailing characters or invalid characters.
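For anyone who wants to prototype the same idea outside the tool, here is a rough Python sketch of the steps above: take the text after "@", tidy it up a little, then group and sort by count. The function names and the `GENERIC_DOMAINS` list are made up for illustration, and the clean-up rules are only an assumption about what "noise" might mean in your data.

```python
from collections import Counter

# Hypothetical list of generic providers to skip; extend it for your own data.
GENERIC_DOMAINS = {"gmail.com", "hotmail.com"}


def email_domain(email):
    """Return the cleaned-up domain of an email, or None if there is no '@'."""
    # Take everything after the last '@'.
    _, sep, domain = email.strip().rpartition("@")
    if not sep:
        return None
    # Light pre-processing: trim whitespace and stray dots, ignore case.
    domain = domain.strip().strip(".").lower()
    return domain or None


def domain_counts(emails, exclude=GENERIC_DOMAINS):
    """Count records per domain, largest group first, skipping excluded domains."""
    counts = Counter()
    for email in emails:
        domain = email_domain(email)
        if domain and domain not in exclude:
            counts[domain] += 1
    # The domain value itself serves as the cluster ID.
    return counts.most_common()
```

Calling `domain_counts(emails)` on a list of email strings returns (domain, count) pairs with the largest groups first, which is the list you would then investigate.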
-
Hi
Group step to create a View would look like this:
If the View is set to Interactive, users will be able to drill down into the rows in each group, in your case the emails per domain.
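As a loose analogy for the drill-down (this is not how the View is implemented, just a minimal Python sketch with a hypothetical `group_by_domain` helper), you can keep the rows of each group alongside the group key instead of only a count:

```python
from collections import defaultdict


def group_by_domain(emails):
    """Map each domain to the emails in that group, largest group first."""
    groups = defaultdict(list)
    for email in emails:
        # Split on the last '@'; skip values with no '@' or an empty domain.
        _, sep, domain = email.rpartition("@")
        if sep and domain:
            groups[domain.lower()].append(email)
    # Sort groups by size so the biggest clusters come first.
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
```

Each item in the result is a (domain, list of emails) pair, so looking inside a group is just a matter of reading its list.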
Answers
-
You are matching on ~5k customers who share the same email domain and no other information? Since you know this is not the same individual duplicated, it seems odd to use the Find Duplicates feature in this way. I would probably suggest using the Group step to count domains that appear numerous times.
It is possible to increase the maximum cluster size, but note that this impacts processing time significantly.
-
Hi guys, this is very helpful.
I have some thinking to do now, and will try a new approach.
Thanks a lot for your replies.
-
Hey Josh,
That's very helpful.
Thanks a lot.