Challenge 2 Solutions

Danny Roden · September 2020

So did you figure it out? What approach did you take?

Your results should look something like this:

Post your solution below in the comments (remember to include a screenshot 📷 or a video 🎦 walk-through of your solution so others can see as well as the .dmx file containing the function/workflow containing your solution.)

Danny Roden · September 2020

To access the sample solution provided for this challenge, please toggle the below:

The approach I’ve taken in my solution is to break up the requirement into sub-functions and then bring those sub-functions together in a master-function. Also, rather than building these against a dataset in a workflow, I have made the decision to hard code the test values as parameters in my functions so as I can make this available to all the other users on my Data Studio server (and so they can easily see the results of the transformation function).

As you’ll see from the overview screenshot, my function is ultimately an ‘If then else’ query where I’m assessing:

1) IF the input needs to be standardized to have Characters 1 & 3 upper cased (assessed via a sub-function called ‘Condition 1…’)

2) THEN run appropriate standardization (executed using a sub-function called Title case (chars 1 & 3)

3) ELSE IF the input needs to be standardized to have Characters 1 & 4 upper cased (assessed via a sub-function called ‘Condition 2…’)

4) THEN run appropriate standardization (executed using a sub-function called Title case (chars 1 & 4)

5) ELSE IF the input contains a hyphen (e.g. name is double-barreled) then also upper case the first character after the hyphen (using a sub-function called ‘Double-barrel case correction’)

6) ELSE just perform ‘standard’ title casing (using the native function which capitalizes every distinct word, separated by a space)

This ultimately produces these results.

The logic defined in each of the sub-functions is summarized below, but has been built in a way such that it can be tweaked and adjusted overtime (e.g. as new exceptions / rules are identified):

· Condition 1 (chars 1 & 3): identifies 2 char prefixes where the 3rd character should be capitalized (notably: Irish names starting with “O’___” and Scottish names starting with “Mc___”). Further to this it identifies names where the user has made a decision to capitalize characters 1&3 (e.g. the DeFazio example), so that this can be retained.

- The cleansing logic for records meeting this rule, are then addressed with the ‘Title Case (chars 1 & 3) [Condition 1]’ function.

· Condition 2 (chars 1 & 4): identifies 3-char prefixes where the 3rd character should be capitalized (specifically Scottish names starting with “Mac___”, however this can be added to).

- The cleansing logic for records meeting this rule, are then addressed with the ‘Title Case (chars 1 & 4) [Condition 2]’ function.

· Title Case (double-barrel): separates the string before and after the hyphen character, title cases both segments, and then combines these using a concatenate function.

· Title Case Surnames: as shown above, this function brings them all together.

Challenge 2 - Surname Case Correction.dmx

Sueann See · October 2020

I created a 2 step workflow to demonstrate my approach.

All of the formatting work is done in the transform step with a custom function created on a new column "Formatted Name".

First, format Surname into Title Case so that the result is predictable i.e. begins with a capital letter.

Since we know that this is not sufficient to cater for all scenarios, we add a Partition First function for every known deviation.

For the purpose of demonstration, i created a variable "temp" to store the results after the Partition functions so you can see how it looks in the Data preview. Basically, you will get a delimited list of the components of the names that contains the deviations.

With this list, you can then do some clean up by replacing commas with space and replacing quotes. For the purpose of demonstration, I've created a variable "temp2" so you can preview the results. Basically, you will get a cleaner list of the components of the names that contains the deviations.

With this, you can do the final formatting steps where you will apply title case again and then remove the Whitespace since you have managed to identify the special components and apply the necessary formatting.

The results of the "Formatted Name" column is as expected for this dataset:

The dmx file is as per attached:

Data Studio Export - Challenges - 26 Oct 2020 23-53-12-330.dmx

Note: This is not a perfect solution that would cater to just any datasets as we have not built an exhaustive list of deviations here. There are also issues as described in this article. For example, in our case, DeFazio will be treated the same as Dean so the results can either be "DeFazio" and "DeAn" (or "Defazio" and "Dean" if we were to remove the partition function that looks for "De"). So perhaps we will have to do extra work to identify the exact surname that would be classified as exceptions.

Danny Roden · October 2020

Interesting approach @Sueann See, thanks for sharing!

I hadn't thought about using the partition function. On the most part that works pretty well, however I'd advise a word of caution around the 'DeFazio' example, because whilst your function works for the test data. If it were applied to a wider dataset it would convert values like "Dexter" or "Derbyshire" to "DeXter" and "DeRbyshire" respectively.

This is always the challenge with these 'exceptions' I guess. Equally, my solution isn't perfect, because it would convert values that happened to have 1st and 3rd letters cased to a similar final output e.g. "DeXtEr" would become "DeXter".

@Henry Simms what are your thoughts on how best to handle these irregular values (e.g. DeFazio) where a blanket rule cannot be applied (unlike for instance all names that begin with "Mc")? Exception lists perhaps (e.g. if the value exists in this exception list, then take the 'standard form' from the exception list)?

Sean Leoszko · October 2020

For my solution to this problem, I leveraged the Standardize Name and Address step to do some of the work and then built custom transformations for remaining names standardize missed. The aim was to minimize the number of custom transformations and let standardize do some of the work.

Workflow: