Find Duplicate - Based on different criteria
Hello,
The requirement is to find duplicates based on three different criteria:
- Name and Phone Match
- Name and Email Match
- Name and Address Match
How can I achieve this? Should I use three separate Find Duplicates steps in parallel and then check whether the Match Status of any one of them is 0? Or is it possible to check all three in a single Find Duplicates step?
I think we could include a scoring facility in Aperture, so that even when Name and Phone match, it gives a score based on an algorithm. The score would be lower for names that are very common in the USA (e.g. John, Harry) and higher for uncommon names (e.g. Mahulima).
Scenarios such as Robert matching Bob, and Mathew matching Matt, could also be covered.
Household cases would also be covered by giving Address a low weightage.
With this method, for my use case I would give a weightage to each of Name, Phone, Email and Address, and produce a combined score for any two records suspected of being a match. I could then define thresholds: a score of 8-10 is a suspected match, below 8 is not a match, and over 10 is a MATCH.
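To make the weightage idea concrete, here is a minimal Python sketch of the scoring I have in mind. It is not an Aperture feature: the weights, the `combined_score` and `classify` functions, and the 0.0 to 1.0 agreement values are all assumptions for illustration. Only the thresholds (below 8, 8-10, over 10) come from the description above.

```python
# Hypothetical per-field weights; these values are illustrative only.
WEIGHTS = {"name": 5.0, "phone": 3.0, "email": 3.0, "address": 2.0}


def combined_score(agreements):
    """Sum weight * agreement over all fields.

    `agreements` maps a field name to a value between 0.0 (no agreement)
    and 1.0 (exact agreement). Missing fields count as 0.0.
    """
    return sum(WEIGHTS[field] * agreements.get(field, 0.0) for field in WEIGHTS)


def classify(score):
    """Apply the thresholds: over 10 is a match, 8-10 is suspected."""
    if score > 10:
        return "MATCH"
    if score >= 8:
        return "SUSPECTED"
    return "NO_MATCH"


# Name, phone and email all agree exactly: 5 + 3 + 3 = 11.
strong = combined_score({"name": 1.0, "phone": 1.0, "email": 1.0})
# Name and phone agree exactly: 5 + 3 = 8.
medium = combined_score({"name": 1.0, "phone": 1.0})
# Name and address agree exactly (household case): 5 + 2 = 7.
weak = combined_score({"name": 1.0, "address": 1.0})
```

Down-weighting common names could be added by scaling the name agreement before it is passed in, but that is left out of this sketch.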
Answers
@Sueann See @Josh Boxer Any thoughts on this?
@Mahulima
What will you be using the de-duplicated list for?
This will help determine if you really need individual, address, location level matches. For example,
How many Find Duplicates steps to use?
Example: Single step encompassing all matching criteria for individual match
Imagine something like this - all of this may end up in the same cluster if you defined the blocking keys and match rules correctly.
Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Phone.Exact) | (Name.Exact & Email.Exact)}
Match.Close={(Name.Close & Address.Close) | (Name.Close & Phone.Close) | (Name.Close & Email.Close)}
Match.Probable={(Name.Probable & Address.Probable) | ForenamesAndAddress.Probable} -- add similar phone and email criteria here
Match.Possible={(Name.Possible & Address.Possible) | ForenamesAndAddress.Possible} -- add similar phone and email criteria here
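As a rough illustration of how the OR-of-AND structure in the rules above resolves to a single match level, here is a simplified Python sketch. It is not how Aperture evaluates rules internally. The `match_level` function is hypothetical, it ignores Hash and ForenamesAndAddress, and it assumes that a stronger component level also satisfies a weaker rule (for example, an Exact email satisfies Email.Close).

```python
# Match levels ordered from strongest to weakest.
LEVELS = ["Exact", "Close", "Probable", "Possible"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def at_least(actual, required):
    """True if `actual` is as strong as, or stronger than, `required`."""
    return actual is not None and RANK[actual] <= RANK[required]


def match_level(name, address=None, phone=None, email=None):
    """Return the strongest level L for which
    Name.L & (Address.L | Phone.L | Email.L) holds, or None.
    """
    for level in LEVELS:
        name_ok = at_least(name, level)
        other_ok = any(at_least(x, level) for x in (address, phone, email))
        if name_ok and other_ok:
            return level
    return None


name_and_phone = match_level("Exact", phone="Exact")
close_name_exact_email = match_level("Close", email="Exact")
name_only = match_level("Exact")
```

Here a name-only agreement gives no match at all, because every rule requires the name plus at least one of address, phone or email.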
Robert and Bob will match, Mathew and Matt will match
This can be achieved using the rootname modifier. You can also see this example within the GBR Individual Default Rules.
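Conceptually, root-name matching reduces each forename to a canonical form before comparing. The Python sketch below only illustrates that idea. The lookup table and function names are hypothetical and have nothing to do with the actual data or implementation behind the rootname modifier.

```python
# Tiny hypothetical nickname table; a real one would be far larger.
ROOT_NAMES = {
    "bob": "robert",
    "rob": "robert",
    "bobby": "robert",
    "matt": "matthew",
    "mathew": "matthew",
}


def root_name(forename):
    """Map a forename to its root form, or return it normalised if unknown."""
    key = forename.strip().lower()
    return ROOT_NAMES.get(key, key)


def forenames_match(a, b):
    """Two forenames match when they share the same root form."""
    return root_name(a) == root_name(b)
```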
Weightage
Not sure if I fully understand what you are after. Do you mean something like this?
We don't provide any score column by default, but you can probably create that through some transformation functions if required.
We only allow you to define up to 4 match levels, and the match level for each cluster would be the lowest confidence within that cluster. Refer to this article for an example.
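To illustrate the "lowest confidence within that cluster" point, here is a small Python sketch. The `cluster_match_level` function and the list of pairwise levels are assumptions for illustration, not part of the product.

```python
# Match levels ordered from highest to lowest confidence.
LEVELS = ["Exact", "Close", "Probable", "Possible"]


def cluster_match_level(pair_levels):
    """Return the weakest level among the pairwise matches in a cluster.

    `pair_levels` must be a non-empty list of level names from LEVELS.
    """
    return max(pair_levels, key=LEVELS.index)


# A cluster whose links are Exact, Probable and Close reports Probable.
level = cluster_match_level(["Exact", "Probable", "Close"])
```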
Hi
I used your idea for the high-level rule change to GBR Individual:
Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Email.Exact)}
I also found I needed to add a new blocking key for it to take effect. Does this look correct?
{
  "description": "Email",
  "countryCode": "GBR",
  "elementSpecifications": [
    {
      "elementType": "EMAIL"
    },
    {
      "elementType": "FORENAMES"
    },
    {
      "elementType": "SURNAME"
    }
  ]
},
Thanks
Luke
@Luke Based on the rules you have set up, it looks like you are trying to get an exact match based on one of the following combinations:
1) Name, Address, Email exact match
2) Name and Address exact match
3) Name and Email exact match
Your blocking keys may be alright, but this would mean that you do not provide any allowance in terms of spelling errors for email, forenames and surnames. Is this your intention?
If not, you could refer to our default GBR_Individual_Default blocking keys for ideas on how to handle names. You may want to:
You can also think of other possible combinations, e.g. combining name and address.
Similarly for email, you may want to block only based on the domain part of the email, or have a name and email domain combination.
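As a conceptual illustration of blocking on the email domain combined with a name element, here is a short Python sketch. It is not Aperture blocking key configuration. The record layout, the function names, and the choice of the first four surname characters are all assumptions.

```python
from collections import defaultdict


def email_domain(email):
    """Return the lower-cased part after the last '@', or '' if there is none."""
    _, sep, domain = email.strip().lower().rpartition("@")
    return domain if sep else ""


def blocking_key(record):
    """Hypothetical key: email domain plus the first 4 letters of the surname."""
    surname = record["surname"].strip().lower()
    return (email_domain(record["email"]), surname[:4])


def block(records):
    """Group records by blocking key; only records in the same group get compared."""
    groups = defaultdict(list)
    for record in records:
        groups[blocking_key(record)].append(record)
    return groups


records = [
    {"surname": "Smith", "email": "j.smith@example.com"},
    {"surname": "Smithe", "email": "john.smith@EXAMPLE.com"},
    {"surname": "Smith", "email": "j.smith@other.org"},
]
groups = block(records)
```

The first two records share a block despite different local parts and a surname spelling variation, while the third lands in its own block because the domain differs.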
It may take a bit of exploration to get to the results you want. Hope this helps.
@Sueann See thanks for your response.
Sorry, I should have included the other rules.
Match.Exact={Hash.Exact | (Name.Exact & Address.Exact) | (Name.Exact & Email.Exact)}
Match.Close={(Name.Close & Address.Close) | (Name.Close & Email.Exact)}
Match.Probable={(Name.Probable & Address.Probable) | (Name.Probable & Email.Exact)}
Match.Possible={(Name.Possible & Address.Possible) | (Name.Possible & Email.Exact)}
Because I have formatted Forename down to just an initial, I want an exact match on it, as well as an exact match on the email address. I've modified the references to forenames lower down in the script to always be exact, in the Name Custom Groups for example:
ForenamesSurname.Probable={Forenames.Probable & Surname.Close}
becomes
ForenamesSurname.Probable={Forenames.Exact & Surname.Close}
I would like to use the GBR_Individual_Default rules for surname, and I had assumed they would still be applied. Having read your response, I now understand that I should incorporate the GBR_Individual_Default surname rules within my new blocking key.
Hi @Sueann See
Looking at the matches, I noticed that when I map the email column to "Email" (so that I can use it in my additional OR condition (Name.Exact & Company.Exact & Email.Exact) in the top-level match rules), the email address is also considered in the standard address rules, so very similar addresses with different emails don't get matched.
My workaround was to use two Find Duplicates steps. The first doesn't map the Email column, so email is not considered in the address match. The second maps only the columns needed for the extra rule | (Name.Exact & Company.Exact & Email.Exact), and similarly for Probable etc.
I added company name to the new blocking key. Company name is not used in GBR_Individual_Default, so I wondered whether I have a suitable configuration here. I also used "includeFromNChars": 1 to make sure an email was present.
{
  "description": "EmailNameOrg",
  "countryCode": "GBR",
  "elementSpecifications": [
    {
      "elementType": "EMAIL",
      "includeFromNChars": 1
    },
    {
      "elementType": "COMPANY",
      "algorithm": {
        "name": "DOUBLE_METAPHONE"
      }
    },
    {
      "elementType": "FORENAMES"
    },
    {
      "elementType": "SURNAME",
      "algorithm": {
        "name": "DOUBLE_METAPHONE"
      },
      "includeFromNChars": 1,
      "truncateToNChars": 10
    }
  ]
},
@Luke Email is not included in any of our standard address rules. Which standard rules are you referring to? Perhaps you can share
With regard to your blocking keys above: