I am pleased to confirm that our next UK Data Management User Group will take place on Thursday 15 September 2022, 10.30am until 3pm (UK time). If you are an Experian client based in the UK, you can sign up today.
We’re also excited to confirm that this will be our first hybrid session. The event will be hosted at The National Gallery in London, but you can also attend virtually. Please indicate your preference for attending when signing up.
Following on from our previous successful events, we will continue to provide further updates and showcase key features, as well as spotlight some of the newer product capabilities in Aperture Data Studio.
We're also delighted to confirm we will have presentations from two of our clients, St. James's Place and Speedy Services, on their use cases.
You can register for the event by clicking here. If you have any queries regarding the event, please do not hesitate to contact me.
Every now and then a scenario crops up where it'd be handy to know how often a given word occurs within a given dataset. For example, you're profiling a reasonably standardised list of values (e.g. job titles) and you want to identify unusually common terms (like 'manager' or 'executive') or infrequent ones (like 'test123'). You may also want to perform this type of processing to generate lookup tables to be used for standardisation/cleansing/validation later on (e.g. company terms from an organisation name field, product groups from a product name column, etc).
Alternatively you may just want to do this to perform some analysis to achieve something like a word cloud:
Either way, I've recently had a stab at building this and wanted to share the results with you, not only because I believe it can be used in a variety of different situations, but also because it highlights a bunch of features in the product and some alternative approaches, and may be an interesting article for self-learning purposes too.
I've tried to summarise it all below. Please do pop me a comment with your thoughts and suggestions, and let me know if you found this useful and would like more content like this.
1) Source data
In this example I'm starting with a single column of data containing a list of company names, and I want to find the most common words in this list.
2) Tidy up
First I tidy up the data a little using a combination of Remove Noise (to strip special characters) and then Upper Case (to remove case variation between the remaining words). At this point you can also easily remove numbers too if they're not of interest to you.
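Outside the product, the tidy-up step can be sketched in plain Python. The regex below is my assumption about roughly what 'remove noise' strips; the actual workflow step may behave differently:

```python
import re

def tidy(value: str) -> str:
    """Strip special characters and normalise case (a sketch of the tidy-up step)."""
    # Replace anything that isn't a letter, digit or space
    # (assumption: this approximates the 'remove noise' behaviour)
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", value)
    # Collapse the repeated spaces introduced by the removal above
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned.upper()

print(tidy("Smith & Sons Ltd."))  # SMITH SONS LTD
```

To also drop numbers, you could extend the first pattern to `[^A-Za-z ]+`.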
3) Separate out words
Next I split out the words into individual fields (up to 20), which I did with a Text to Columns step (note this is a non-native workflow step but is available free of charge; to learn more, reach out to your Experian contact):
Note that this can also be achieved using a native Transform step and the 'Split' function to explicitly extract each term:
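For reference, the same split can be sketched in pandas (hypothetical sample data; `expand=True` pads shorter names with None, much like splitting into a fixed set of columns):

```python
import pandas as pd

df = pd.DataFrame({"name": ["SMITH SONS LTD", "ACME TRADING"]})
# Split each name on spaces into separate word columns, mirroring
# the Text to Columns / Split-function approaches described above
words = df["name"].str.split(" ", expand=True)
print(words)
```

Names with fewer words than the widest row end up with None in the trailing columns, which is what creates the empty rows dealt with later.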
4) Columns to Rows
Next I use the new Columns to Rows step to take each of these new columns and essentially stack them on top of each other to create a giant list (note that as I've split the data into 20 columns, there are now a lot of empty rows):
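The stacking behaviour can be sketched with pandas `melt`, which turns a set of word columns into one long column (hypothetical data; this is an illustration, not the product's implementation):

```python
import pandas as pd

wide = pd.DataFrame({
    0: ["SMITH", "ACME"],
    1: ["SONS", "TRADING"],
    2: ["LTD", None],
})
# Stack all the word columns on top of each other into a single
# 'word' column -- the equivalent of the Columns to Rows step
long = wide.melt(value_name="word")[["word"]]
print(long)
```

Note the None values carried over from the padded columns; these are the empty rows filtered out in the next step.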
5) Finishing touches
Then I use a simple Filter step to remove the empty rows, group on the word, append a count, and finally sort to get the desired output:
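These finishing touches map neatly onto a few pandas operations (again a sketch of the equivalent logic, using hypothetical data):

```python
import pandas as pd

words = pd.Series(["SMITH", "SONS", "LTD", "ACME", "LTD", None])
counts = (
    words.dropna()        # filter out the empty rows
         .value_counts()  # group on the word and append a count
         .sort_values(ascending=False)  # sort, most frequent first
)
print(counts)
```
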
The magic bit...
Once I'd got this working for my initial dataset, I wanted to make it reusable so it could easily be used in other workflows. This involved three minor tweaks:
Part 1: permit the source to be provided at run-time (this allows the workflow to be dynamic and simply have a different source configured at run time, which is pretty handy on its own)
Part 2: adjusting the workflow details to check the 'Can be used in other workflows' box
Part 3: put an 'output' step on the end (so that the results can be passed to further steps and surfaced within another workflow)
A final test
Lastly, I tested using it as a step within another workflow, which worked a treat:
Like a copy?
For those of you with the 'text to columns' step already set up, you can simply import this into your space using the below .dmx file; alternatively, I'm hoping the above steps will be enough to help you build something similar yourself.
If you've not already had a go at using 'reusable workflows', I'd strongly encourage exploring them, as they're a powerful feature which can really help scale a more standardised approach to processing your data.
I hope you found this post interesting/useful, let me know your thoughts below in the comments!
All the best,
This post simply acts as an index for all the functions shared in this library. If you would like to receive a notification whenever a new function is added, please bookmark this post by clicking on the star icon to the right. Current functions available:
- 👪 Parse Full Name - A handy set of functions that can help standardise the sequence full names are presented in (e.g. "surname, forename" vs "title forename surname") and can extract individual name elements (title, forename, middle name and surname) using logic.
- Next ⏭️📅 & Previous ⏮️📅 Working Day - As per the name, these functions detect the next working day after (and previous working day before) a given input date.
- Last day / working date of month 📅 - These functions identify the last day (and last non-weekend day) of the month for a given input.
- Parse Date 🧽📅 - This configurable function allows the user to specify a date format string and then it auto-parses in line with that specified format.
- Standardise Country 🌎 - The function uses reference data (contained within the .dmxd file) to standardise known country aliases to the standard form (e.g. England = United Kingdom).
- Offensive Words 🤬 - The functions contained within this package all relate to flagging and dealing with data containing offensive language. All of these functions use a domain of offensive words (contained within the .dmxd package) which contains a list of known offensive terms.
- Mask Out Email/Phone 🛡️ - This package contains 2x functions to help with anonymising certain input values, whilst leaving an output that can still be used for non-sensitive analysis.
- Proper Case Surname 📛 - This package contains 2x functions which help with contact data: Proper Case Surname and Validate Surname Casing.
- Get Word ✂️ - Extracts the 'nth' word from a string. Where n = 2 the second word is retrieved, where n = -1 the last word is retrieved.
- Reasonable Date of Birth (DOB) 🔞 - Checks the input is a value which seems reasonable as a valid date of birth using user-defined min/max age parameters (i.e. a date, not in the future, relating to an individual within an age bracket of 16-100 years old).
- Standardise Gmail addresses 📧 - Standardise Gmail addresses for matching purposes (e.g. googlemail/gmail as well as email addresses associated with the same account via the use of '+' and '.' in the account part of the email)
- Job Title Match Key 👨💼👩💼 - Generates a key that can be used to group job titles together (despite presentation differences)
- Invalid Character for Names ☹️ - Finds records where the field contains characters which are invalid for names. Records which contain digits, commas, and other special characters will yield a "true" result. Apostrophes, dashes and periods are not considered "special characters" for this function. This function is not suitable for Validation Rules; use "Contains Only Valid Characters for Names" instead.
- Compare Dates (Verbose) 📅📅 - Provides a summary of how 2x input dates compare (includes 'convert to date' logic). Output options are: Exact match, Incomplete - Date 1 is missing (or not a date), Incomplete - Date 2 is missing (or not a date), Incomplete - Both dates are missing (or not a date), Close - Day & Month match, Close - Month & Year match, Close - Day & Year match, Other difference
- Convert Boolean ✅❌- Converts binary or Boolean values (i.e. true/false, pass/fail, 1/0) to a graphical emoji icon to aid visual presentation in the UI.
- Contains Non-Latin Characters 🈯 - Identifies the presence of any characters not in the 'basic Latin' unicode block.
- Reverse String ⏪ - Reverses the input (e.g. "Danny R" becomes "R ynnaD")
- Repeating Characters 🔁 - Uses a regular expression to identify records where the entire value of a cell is made up of the same character repeated (e.g. "aaa" or "0000000").
- PCI Detection 💳 (Payment Card Information) - Checks whether the input contains 16-character card numbers (either as a single string or separated with hyphens/spaces every 4 characters) [AmEx format also accounted for]
- SIC Conversion 🏷️ - Takes a 2007 format SIC code as an input and returns the high level label of the 'category' of businesses which it falls into.
- Future Date Check 📅 (Dynamic) - Checks that the input date is a date in the future (relative to the time of execution).
- Extract First Word 🥇📝 - Retrieves the first word in a string
- Extract Last Word 💬 🥉 - Retrieves the last word in a string
- Replace the word 'NULL' or space-only values with null 🔄👻 - Converts the literal string 'NULL' (or values containing only spaces) to a true null
- Calculate the distance between two sets of co-ordinates 🌍️ - Returns the approximate distance in kilometres between a pair of geographical co-ordinates.
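As an aside, the Gmail standardisation rules described in the list above can be sketched in Python. This is an illustration of the logic, not the packaged function itself:

```python
def standardise_gmail(email: str) -> str:
    """Standardise a Gmail address for matching purposes (a sketch)."""
    local, _, domain = email.lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]   # drop '+tag' suffixes
        local = local.replace(".", "")   # Gmail ignores dots in the account part
        domain = "gmail.com"             # googlemail and gmail are the same account
    return f"{local}@{domain}"

print(standardise_gmail("Jane.Doe+news@googlemail.com"))  # janedoe@gmail.com
```

Non-Gmail addresses pass through unchanged, since dots and plus signs can be significant on other providers.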
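Similarly, an approximate distance between two co-ordinates is commonly computed with the haversine formula; here's a sketch under the assumption that's the approach used (the packaged function may differ):

```python
from math import asin, cos, radians, sin, sqrt

def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate great-circle distance in km between two co-ordinates
    (haversine formula, assuming a spherical Earth)."""
    r = 6371.0  # mean Earth radius in kilometres
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * r * asin(sqrt(a))

# London to Paris is roughly 344 km
print(round(distance_km(51.5074, -0.1278, 48.8566, 2.3522)))
```
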