Fuzzy Matching Logic
Hi,
I have a requirement to match records from two separate systems on the basis of FirstName, LastName and DOB from System1 and FisrtName, MiddleName, LastName and DOB from System2 to return the match Id from System2 to System1.
Below are the challenges
- The names are not consistent in both systems . For eg in system1 for most records FirstName is a mix of FirstName and MiddleName and in rare occurence for few records LastName is mix of MiddleName and LastName
- There could be a possibility of having shortened names/nicknames being used in either systems
Can somebody suggest how to do a fuzzy matching logic to match the records in these two systems and get a result that shows us matches and match score so that we can make an informed decision??
Answers
-
Hi Shreya
You haven't mentioned the volume of data, but if it is small then should be simple to match the bulk then manually review the remainder. If it is a large volume of data then Name+DOB will not be enough to find unique matches, e.g. there are ~30k people named John Smith just in UK, assuming they are distributed evenly (which they arent) that is 300 every year over past century. You only need 75 people with the same name for a 99.9% chance that 2 share a birthday.
===============================================================
Ignoring that, it sounds like you need to start by adding some consistency to the data. Identify/Split the 'rare occurrences' of Middle+Last then create Functions that populate these columns individually, something like this Function:
Once that is done, concatenating DOB+Initials+LastName to make a unique ID that could be used to Join both sources.
Dealing with nicknames is a topic in itself, but depending on the global nature of these names you might be able to find a reference table to lookup against. Quick search returned this for example: https://github.com/carltonnorthern/nicknames Do you have any other columns like Gender or Salutation as this will be helpful dealing with nicknames.
Perhaps someone else will have some better ideas, but if not and if the above does not work then reach out to your Experian contact to discuss if Find duplicates and Standardization would help with your data.
0