Matching Results - Find Duplicates

M.Lambert Member
edited December 2023 in General

Hi there,

Any idea why, in the matching results, the big clusters come back with different cluster IDs? When I look in the workbench the records match perfectly, but the results show different cluster IDs.

TIA

Mariana

Best Answers

  • henry_
    Answer ✓
    This isn't something I've seen before. When you say "big" clusters, how many records do you mean? Very large clusters may hit a maximum cluster size limit, but that's usually just to guard against matching rules that are too loose and generate a lot of false-positive matches.
  • M.Lambert
    M.Lambert Member
    Answer ✓

    I am trying to get a match on the email domain when the domain belongs to a company and is not a generic one like gmail or hotmail. The whole dataset is 8 million records, but the cluster is about 5,000 records.

  • Henry Simms
    Henry Simms Administrator
    Answer ✓

    I agree with Josh that, while you could use Find Duplicates here, it doesn't sound like the right tool for the job if you don't have multiple elements to match on, and do not need fuzzy matching.

    To just use the grouping approach I would first get the domain from the email (using an "After" function with "@" as the suffix):

    Then I'd group on the email domain values, sort by the count, and investigate the large groups (clusters). The domain value itself acts as the cluster ID.

    If you wanted to do a slightly more "fuzzy" match on the domains, I would do some pre-processing on the values before grouping, e.g. remove noise, trim leading and trailing characters, strip invalid characters, etc.
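
The grouping approach described above can also be sketched outside the product, for example in Python. This is only a minimal sketch under assumptions: `GENERIC_DOMAINS`, `email_domain` and `domain_counts` are hypothetical names made up for illustration, the list of generic domains is just the two mentioned in this thread, and the clean-up rules are examples of the kind of pre-processing suggested rather than a recommended set.

```python
from collections import Counter

# Hypothetical list of generic (free-mail) domains to exclude; extend as needed.
GENERIC_DOMAINS = {"gmail.com", "hotmail.com"}


def email_domain(email: str) -> str:
    """Return the normalised text after the last "@", or "" if there is no "@"."""
    _, sep, domain = email.strip().rpartition("@")
    if not sep:
        return ""
    # Light pre-processing: trim stray punctuation/whitespace and lower-case.
    return domain.strip(" .,;<>\"'").lower()


def domain_counts(emails):
    """Count records per non-generic domain, largest groups first."""
    counts = Counter(
        d for d in map(email_domain, emails) if d and d not in GENERIC_DOMAINS
    )
    # The domain value itself acts as the cluster ID.
    return counts.most_common()


sample = ["a@Acme.com", "b@acme.com ", "c@gmail.com", "d@example.org", "no-at-sign"]
print(domain_counts(sample))
```

Sorting by count puts the largest groups first, so the big company domains are the first ones to investigate.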

  • Josh Boxer
    Josh Boxer Administrator
    Answer ✓

    Hi

    Group step to create a View would look like this:

    If the View is set to Interactive, users will be able to drill down into the rows in each group; in your case, the emails per domain.

Answers

  • Josh Boxer
    Josh Boxer Administrator
    edited November 2023

    You are matching on ~5k customers who share the same email domain and no other information? Since you know these are not the same individual duplicated, it seems odd to use the Find Duplicates feature in this way. I would suggest using the Group step to count domains that appear numerous times.

    It is possible to increase the maximum cluster size, but note how this impacts processing time significantly:
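
As a rough illustration of why cluster size matters, here is a small sketch that counts the record pairs inside a single cluster. It assumes a matcher that compares every pair of records in a cluster, which is an assumption made for illustration and not a description of how Find Duplicates works internally; `pairwise_comparisons` is a hypothetical helper name.

```python
from math import comb


def pairwise_comparisons(cluster_size: int) -> int:
    """Number of distinct record pairs in a cluster of the given size."""
    return comb(cluster_size, 2)


# Pair counts grow quadratically with cluster size.
for size in (500, 5_000, 50_000):
    print(f"{size:>6} records -> {pairwise_comparisons(size):>13,} pairs")
```

Under that assumption, a single 5,000-record cluster already means about 12.5 million pairs, and making the cluster ten times larger makes the pair count roughly a hundred times larger.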

  • M.Lambert

    Hi guys, this is very helpful.

    I have some thinking to do now, and will try a new approach.

    Thanks a lot for your reply

  • M.Lambert

    Hey Josh,

    That's very helpful.

    Thanks a lot.