Extracting titles and honorifics from names (advanced)

Clinton Jones · September 2020

In this earlier post I described how you might want to extract titles or honorifics from a name string, this might be important for a marketing campaign where to you wish to personalize your communication by saying dear sir or madam.

Unfortunately the data isn't always as we want it and sometimes we land up with names of customers for which there are some titles and some without titles in the name field.

We need to cope with this and the only way without using a programmatic standardization approach is with conditional logic.

We can see here that the Substring function disnt quite work as we would like it to for Row 4

However, if we have a list of titles or honorifics we can ensure that we don't pull out the forename or in fact any other invalid honorific. The trick here is to use the MAKE VARIABLE action to assign the substring action to a variable which in this case I called honorific and then use the honorific to evaluate using an EQUALS function

Sami Laine · October 2020

Could you describe the main differences between using directly lookup "extract matches as a list"-function instead of those above position, substring and equals? I have been using extract lookups to do similar tasks.

According to my experience, a lookup is a really useful since you can update external reference data lists to keep your processes uptodate and improve results without modifying individual step functions. The honoric or other keyword can exist in any position rather than just in the beginning of the column. However, there can be performance issues if data sets get massive although I have not yet found such limits.

The position-function starts at the beginning of the colomn until the first space. It probably consumes much less computation resources and time but it finds only honorics in the beginning. It is much more awkward to add keywords to the step-function but you can improve the usability by making it a shared function and reuse it across multiple workflows and steps.

Is there other major considerations between matching- and position-functions that are relevant to consider in these situations? Performance and keyword maintenance issues might be the most obvious ones especially in large data sets but is there any other?

Henry Simms · October 2020

Hi @"Sami Laine" , I think you've pretty much hit on the main considerations already here.

Using a Substring function to pull out the first word from a value, and then an Equals to compare against valid values that are "hard-coded" into the function, is simple and quick to execute.

Using functions like Extract Matches As List (we call these domain functions because they are typically used to look up to a defined domain) is more powerful and more maintainable (because the valid values domain is stored as a separate Dataset or View) but more computationally expensive to run.

Where the domain being looked up to is small (hundreds or a few thousand values), the lookup functions are actually very efficient.

Overall, I'd say if you're building out a solution where you already have a domain list of valid values in a file, and this may need to be updated over time, the domain lookup functions are the way to go.

Extracting titles and honorifics from names (advanced)

Comments

Categories