Custom date of birth - Matching
Hi,
As part of our matching process we break date of birth into day / month and year and then have a custom rule which matches on 2 out of the 3 variables being correct. When combined with other data aspects such as surname and address we find this gives us better results.
DateCustom.Close={(DateDay.Exact & DateMonth.Exact) | (DateMonth.Exact & DateYear.Exact) | (DateDay.Exact & DateYear.Exact)}
I would however like to eliminate the possibility of father and son have being born on the same day | month and living in the same address by adding in a condition where the system checks to ascertain if the year of birth that has been given is + / - x number of years, possibly 5 years for arguments sake. This new variable would become DateYear.Close and would be included as
DateCustom.Close={(DateDay.Exact & DateMonth.Exact & DateYear.Close) | (DateMonth.Exact & DateYear.Exact) | (DateDay.Exact & DateYear.Exact)}
As such ...
John Smith 01/01/1980 would still match with
John Smith 01/01/1981 but would not match with
John Smith 01/01/1986
This is where I am trying to get to but I don't know if the system can have the + / - x number of years variable and if it can I certainly don't know how to code that in. Its not a Levenstein type 77% match and so I am a bit lost.
Anyone able to help or tell me it cant be done / point me to a better way of doing this. ?
Thanks again.
S
Best Answer
-
Danny Roden Administrator
A couple of other thoughts on this @stevenmckinnon...
All combinations?
In your example you state you'd like the records with DOB in years 1980 and 1981 to match, but not 1986. My question is would you also want to know that 1986 and 1981 are within 5 years? If so, I think the approach you need to go down will be to do an inner join on the dataset to uncover all of the pairs of potential matches, so you can then discover the scores yourself, e.g. see below (note I've drawn a black line through records that match to themselves for clarity, but these could also be filtered out after the join):
In this scenario, I've simply used a hash match (based on the first 6 characters of the DOB and the name). However an alternative approach could also be to use a loose matching rule in the Find Duplicates step (so as to benefit from nickname matching etc).
But then the resulting match key was then used as the join key (to itself) and the result gives me all the combination of records so I can compare the 2x DOBs to calculate the difference:
Valid / Invalid Clusters
Alternatively, you might instead want to simply flag clusters where the dobs are all within a 5 year period so you can easily spot those which are within your criteria. To do this you could use a couple of lookup aggregate functions to extract the minimum and maximum years from the DOB values where a common ClusterID is shared, something like this:
However, the challenge here is that if you have a subset of the cluster which are within the 5 year range, then they will be lost as any outliers will cause this to fail (e.g. 1980, 1981 and 1986 would result in a FALSE but if it were just 1980 and 1981 then it's evaluate TRUE)
I hope this helps and provides an alternative view/approach.
7
Answers
Hi Steven
Unfortunately we do not currently provide any data or number distance comparators in Find Duplicates. You can find lists of all our blocking algos and comparators here:
For date we currently have:
We do plan to add the following to Find Duplicates in the near future:
The only suggestion i have is to build a second match job that matches First name, last name, month of DoB and Day of DOB AND Cluster ID and then manually check the results. I can't imagine there will be many parent/child duplicates where the names and DOB are exactly the same!
Just to add to this, of course there is a native function that supports a bunch of transformation options that would take a customer V1 sample data and extract the day month and year from the first order date and compare the first order date with today's date
By using the rounded years option on compare date you could see if father and son with the same names and address have an age difference
@stevenmckinnon did this answer your question?
Thanks for this Danny ... this is going to take me a bit of time to read through and walk through the steps. Will try both options and see what brings back the highest results.