Find Duplicate - Based on different criteria

Mahulima · September 2021

Hello,

Requirement is find the duplicates based on 3 different criteria -

Name and Phone Match
Name and Email Match
Name , Address Match

Now how to achieve this, should I use 3 Different Find Duplicates in Parallel and then check Match Status any 1 of them 0? Is it possible to check in one Find Duplicate?

I think we can include scoring facilities in Aperture so that even if Name, and Phone Match it gives a Scoring value based on the Algorithm. Scoring will be less for very common names in USA ex. John, Harry whereas scoring will be more if the Name is uncommon ex. Mahulima in USA

Also Robert and Bob will match Mathew and Matt will match this scenarios may be covered.

Also household cases will be covered with low weight age in Address

And in this method for my use case I will be giving weight-age to Name, Phone, Email and Address and will give a combine score for 2 records which are suspected a match. And may be I can define threshold scoring coming 8-10 is suspected, below 8 is not match, over 10 is MATCH

Mahulima · September 2021

@Sueann See @Josh Boxer Any thoughts on this?

Sueann See · September 2021

@Mahulima

What will you be using the de-duplicated list for?

This will help determine if you really need individual, address, location level matches. For example,

If you want to ensure you only send 1 mail to each individual (even if they stay at the same address), then you will need an individual match.
If you want to ensure you only send 1 mail per house, then probably you just need an address match.
If you want to pick samples from each postal code to send 1 mail per location, then probably you just need a location level match.

How many Find Duplicates step to use?

It really depends on your use case.
If you use 3 separate steps (one for Phone, Email, Address), eventually it may be difficult to put them into a single view to have your "single view of customers". However, if different departments have their preferred matching criteria then you may have to separate them so you can give the appropriate results to each department.
If you started with a single step encompassing all the criteria, you may still end up creating multiple views of the results to analyze how the records have match (whether its using phone ,email or address). However, for performance reasons - running once vs running 3 times, this may be a better choice.

Example: Single step encompassing all matching criteria for individual match

Imagine something like this - all of this may end up in the same cluster if you defined the blocking keys and match rules correctly.

You could refer to the GBR individual default blocking keys and rules as an example, then add on additional criteria for email and phone.
At a high level for rules:

Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Phone.Exact) | (Name.Exact & Email.Exact)}

Match.Close={(Name.Close & Address.Close) | (Name.Close & Phone.Close) | (Name.Close & Email.Close)}

Match.Probable={(Name.Probable & Address.Probable) | ForenamesAndAddress.Probable} -- add similar phone and email criteria here

Match.Possible={(Name.Possible & Address.Possible) | ForenamesAndAddress.Possible} -- add similar phone and email criteria here

You will have to inspect the lower level rules definition to see if it fits what you need. GBR Individual rules uses Gender; you may not need this, so you can remove it.
I suggest you start with the exact match, get a basic understanding on the keys and rules structure and syntax, test with a small sample and adjust as you go. Have you gone through the articles here on Find Duplicates?

Robert and Bob will match Mathew and Matt

This can be achieved using the rootname modifier. You can also see this example within the GBR Individual Default Rules.

Weightage

Not sure if I fully understand what you are after. Do you mean something like this?

Name, Phone, Email, Address Match, then assign score 10
Name and Phone Match only, then assign score 5
Name and Email Match only, then assign score 5
Name and Address Match only, then assign score 5
Name Match only, then assign score 1

We don't provide any score column by default, but you can probably create that through some transformation functions if required.

We only allow you to define up to 4 match levels, and the match level for each cluster would be the lowest confidence within that cluster. Refer to this article for an example.

Luke · February 2022

Hi

I used your idea on the high level rule change to GBR Individual

Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Email.Exact)}

I also found I needed to add a new blocking key to get it to take effect, does this look correct?

{

"description": "Email",

"countryCode": "GBR",

"elementSpecifications": [

{

"elementType": "EMAIL"

},

{

"elementType": "FORENAMES"

},

{

"elementType": "SURNAME"

}

]

},

Thanks

Luke

Sueann See · March 2022

@Luke Based on the rules you have set up, it looks like you are trying to get exact match based on one of the following combinations.

1) Name, Address, Email exact match

2) Name and Address exact match

3) Name and Email exact match

Your blocking keys may be alright, but this would mean that you do not provide any allowance in terms of spelling errors for email, forenames and surnames. Is this your intention?

If not, you can perhaps try to refer to our default GBR_Individual_Default blocking keys for some idea on how to handle names. You may want to:

Add a rootname modifier, so that you can potentially match based on aliases/nicknames, eg. Abby vs Abigail
The double metaphone algorithm to allow some spelling differences
Apply a limit on the number of characters to use for blocking i.e. treat them as potential duplicates as long as first n characters match. Note that this does not affect the actual value used for matching based on the rules later on. The purpose of blocking is to place the records into candidate groups for further matching.

{
    "elementType": "FORENAMES",
    "elementModifiers": [
     "ROOTNAME"
    ],
    "algorithm": {
     "name": "DOUBLE_METAPHONE_FIRST_WORD"
    },
    "includeFromNChars": 1,
    "truncateToNChars": 10
   },
   {
    "elementType": "SURNAME",
    "algorithm": {
     "name": "DOUBLE_METAPHONE"
    },
    "includeFromNChars": 1,
    "truncateToNChars": 10
   }

You can also think of other possible combinations eg. Perhaps combining name and address

{
  "description": "ForenameSurnameMinorStreetNumber",
  "countryCode": "GBR",
  "elementSpecifications": [
   {
    "elementType": "FORENAMES",
    "elementModifiers": [
     "ROOTNAME"
    ],
    "algorithm": {
     "name": "DOUBLE_METAPHONE_FIRST_WORD"
    },
    "includeFromNChars": 1,
    "truncateToNChars": 10
   },
   {
    "elementType": "SURNAME",
    "algorithm": {
     "name": "DOUBLE_METAPHONE"
    },
    "includeFromNChars": 1,
    "truncateToNChars": 10
   },
   {
    "elementType": "MINORSTREET_NUMBER",
    "includeFromNChars": 1,
    "truncateToNChars": 5
   }
  ]
 }

Similarly for email, you may want to block only based on the domain part of the email, or have a name and email domain combination.

It may take a bit of exploration to get to the results you want. Hope this helps.

Luke · March 2022

@Sueann See thanks for your response.

I should have included the other rules sorry.

Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Email.Exact)}

Match.Close={(Name.Close & Address.Close) | (Name.Close & Email.Exact)}

Match.Probable={(Name.Probable & Address.Probable) | (Name.Probable & Email.Exact)}

Match.Possible={(Name.Possible & Address.Possible) | (Name.Possible & Email.Exact)}

As I have formatted Forename to just an initial so I want an exact match, and an exact match for email address. I've modified references to forenames to always be exact lower down in the script, in Name Custom Groups for example

ForenamesSurname.Probable={Forenames.Probable & Surname.Close}

becomes

ForenamesSurname.Probable={Forenames.Exact & Surname.Close}

I would like to use the GBR_Individual_Default rules for surname, and I was thinking they would still be applied but now I've read your response I understand I should incorporate the GBR_Individual_Default surname rules within my new blocking key.

Luke · March 2022

hi @Sueann See

When looking at the matches I have noticed that when I map the email column to "Email" so that I can use it in my additional OR conditions (Name.Exact & Company.Exact & Email.Exact) in the top level match rules, that the email address is also considered in the standard address rules, so very similar addresses with different email wouldn't get matched.

My way around this was to have two Find Duplicate steps, the first doesn't map the Email column so that email is not considered in the address match, and the second only maps the columns needed for the extra rule | (Name.Exact & Company.Exact & Email.Exact), similar for probable etc.

I added company name to the new blocking key, company name is not used in GBR_Individual_Default so I wondered if I had a suitable configuration here? I also used "includeFromNChars": 1 to make sure there was an email present.

{

"description": "EmailNameOrg",

"countryCode": "GBR",

"elementSpecifications": [

{

"elementType": "EMAIL",

"includeFromNChars": 1

},

{

"elementType": "COMPANY",

"algorithm": {

"name": "DOUBLE_METAPHONE"

}

},

{

"elementType": "FORENAMES"

},

{

"elementType": "SURNAME",

"algorithm": {

"name": "DOUBLE_METAPHONE"

},

"includeFromNChars": 1,

"truncateToNChars": 10

}

]

},

Sueann See · March 2022

@Luke Email is not included in any of our standard address rules. Which standard rules are you referring to? Perhaps you can share

your full set of blocking keys and rules (you can attach a file here, or send it to me via email)
sample records that did not match as expected (you do not need to post any sensitive data here, just mock-up data will do)

With regards to your blocking keys above:

For email, you are still blocking based on the entire email address, so johndoe@gmail.com will not match with johndoe@gmil.com (missing a in gmail). If you need to allow looser matches, then perhaps you can just block based on the email domain part or include the email local part with a fuzzy algorithm.
For the company name, are there also chances that there could be null values? If so, perhaps you can use includeFromNChars here again. If your company name has a common suffix or prefix like Pte Ltd at the end, then you may want to exclude this from the blocking keys i.e. use the substring algorithm or use truncatetoNchars. Double Metaphone is a common fuzzy match algorithm to use but without looking at your data, its hard to tell whether this is good enough. Perhaps you can try and let me know if there are specific cases that you are not able to handle.

Find Duplicate - Based on different criteria

Answers

Categories