Identifying two names in a single string with the DelimitedField filter
Sometimes, the names data you have collected may not be in the best form to be parsed or standardized into the individual names components (Title, Surname, Forename) properly. You may have cases where your name field potentially contains multiple names, for example, Mark and Jane Spencer or Mr & Mrs Smith.
In this case, you may want to map the column as a Generic String instead of trying to standardize these strings as names and getting poor results. Then, when building the Find Duplicates rules, you can leverage on the Generic String Filters and Comparators. Specifically, the DelimitedField filter is useful to identify the tokens from the name string that are important for matching.
A working example below shows a variety of name strings that have been matched. Based on the cluster ID, you can see that we have effectively identified some potential matches despite the fact that the Name string contains multiple names.
For this example, at the Find Duplicates step, we have mapped the Key and Name column as Generic strings.
Key is a dummy column we use as a blocking key to simplify this example. In reality, this could be your other contact data such as an identity number, phone, email or any address components.
Blocking Keys
[ { "description": "Key", "countryCode": "GBR", "elementSpecifications": [ { "elementGroups":["KEY"], "elementType": "GENERIC_STRING" } ] } ]
The rules is set up as follows.
@default.country=GBR /* * Aliases */ define Exact as L0 define Close as L1 /* Match Rule */ Match.Exact={Hash.Exact} Match.Close = {String.Close} /*Theme Rule*/ Hash.Exact={[ExactMatch]} String.Close={FirstLast.Exact | SecondLast.Exact} FirstLast.Exact = {#STRING.Generic_String.DelimitedField[" |\\/",0].[ExactMatch] & #STRING.Generic_String.DelimitedField[" |\\/",-1].[ExactMatch]} SecondLast.Exact = {#STRING.Generic_String.DelimitedField[" |\\/",-2].[ExactMatch] & #STRING.Generic_String.DelimitedField[" |\\/",-1].[ExactMatch]}
In plain language, what we are really doing is to identify exact string matches as well as close matches. Close matches are defined as strings that have:
- First and last token exactly the same e.g. Mark and Jane Spencer would match to Mark Spencer
- Second last and last token exactly the same e.g. Mark and Jane Spencer would match to Jane Spencer
Tokens from the string are identified with the DelimitedField filter where the delimiter is either a space " " or a forward slash "/". If your data contains other delimiters, you can extend the rules to include other delimiters.
DelimitedField[" |\\/",0] means
- select the first token from the string (indicated by the number 0),
- delimited by space or forward slash (indicated by " |\\/", which is a java regex notation with double escape characters for the forward flash).
- For example, "Mark and Jane Spencer" would have 4 tokens "Mark", "and", "Jane", "Spencer". The first token would be "Mark".
- Another example, "M/M Smith" would have 3 tokens "M","M","Smith". The first token would be "M".
DelimitedField[" |\\/",-1] means
- select the last token i.e. first token from the back of the string (indicated by the number -1),
- delimited by space or forward slash (indicated by " |\\/", which is a java regex notation with double escape characters for the forward flash).
- For example, "Mark and Jane Spencer" would have 4 tokens "Mark", "and", "Jane", "Spencer". The last token would be "Spencer".
- Another example, "M/M Smith" would have 3 tokens "M","M","Smith". The last token would be "Smith".
DelimitedField[" |\\/",-2] means
- select the second last token i.e. second token from the back of the string (indicated by the number -2),
- delimited by space or forward slash (indicated by " |\\/", which is a java regex notation with double escaping for the forward flash).
- For example, "Mark and Jane Spencer" would have 4 tokens "Mark", "and", "Jane", "Spencer". The second last token would be "Jane".
- Another example, "M/M Smith" would have 3 tokens "M","M","Smith". The second last token would be "M".
It may take some time to understand, test and refine the rules to cater for your specific use case. Hopefully this gives you an idea of the possibilities with the DelimitedField filter.
Comments
-
We received feedback that matching names as generic strings means we lose out on some benefits of standardization. That could be true but it all depends on the variety of data you have as names.
If we rely on the names standardization, can the common knowledgebase cover all the unique variations? For example, What does M/M represent? Is this an abbreviation for Mr and Mrs or is does this present some name initials?
if we do not use the names standardization, do we have workarounds for cases like being able to match by root names? There are options to create a separate rootname lookup, or create looser match levels that relies on other keys other than forenames.
Can we still use DelimitedField filter if we choose to map the column as a Name (and perform names standardization)? Yes! This could possibly be the best of both worlds. For example:
- Map the name string as Name instead of Generic String
- Use the following Blocking Keys and Rules
Blocking Keys
[ { "description": "Key", "countryCode": "GBR", "elementSpecifications": [ { "elementGroups":["KEY"], "elementType": "GENERIC_STRING" } ] }, { "description": "Forename", "countryCode": "GBR", "elementSpecifications": [ { "elementType": "FORENAMES", "elementModifiers": [ "ROOTNAME" ], "algorithm": { "name": "DOUBLE_METAPHONE_FIRST_WORD" } } ] } ]
Note that we use the Rootname modifier here so we can match forenames based on a rootname e.g. Bob vs Robert.
Rules
@default.country=GBR /* * Aliases */ define Exact as L0 define Close as L1 /* Match Rule */ Match.Exact={Hash.Exact} Match.Close = {(Title.Exact | Forenames.Close) & Surname.Exact} /*Theme Rule*/ Hash.Exact={[ExactMatch]} Surname.Exact={Surname[ExactMatch]} Title.Exact={Title[ExactMatch]} Forenames.CLose={ForenameCompare[FirstNameMatch] | DelimitedField[" |\\/",-1].[ExactMatch] | RootName[ExactMatch]}
Note that we use the DelimitedField filter as one of the possible criteria for Forenames match so that we can extract any forename tokens we are interested to use for matching.
Output
Note:
- Able to identify matches like Bob and Robert
- Could not identify matches like Mdm Smith (could possibly mean we need to update our Standardize knowledgebase to include Mdm as an acceptable title)
0