Join us online for our next Data Management User Group | 15 September 2022!
I am pleased to confirm that our next UK Data Management User Group will take place on Thursday 15 September 2022, 10.30am until 3pm London UK time, and if you are an Experian client based in the UK you can sign up today.
We’re also excited to confirm that this will be our first hybrid session. The event will be hosted at The National Gallery in London, but you can also attend virtually. Please indicate your preference for attending when signing up.
Following on from our previous successful events, we will continue to provide further updates and showcase key features, as well as spotlight some of the newer product capabilities in Aperture Data Studio.
We're also delighted to confirm we will have presentations from two of our clients, St. James's Place and Speedy Services, on their use cases.
You can register for the event by clicking here, but if you have any queries regarding the event, please do not hesitate to contact me.
Every now and then a scenario crops up where it'd be handy to know how often a given word occurs within a given dataset. For example, you're profiling a reasonably standardised list of values (e.g. job titles) and you want to identify unusually common terms (like 'manager', 'executive', etc.) or infrequent ones (like 'test123'). You may also want to perform this type of processing to generate lookup tables for use in standardisation/cleansing/validation later on (e.g. company terms from an organisation name field, product groups from a product name column, etc.).
Alternatively you may just want to do this to perform some analysis to achieve something like a word cloud:
Either way, I've recently had a stab at creating this and wanted to share the results with you, not only because I believe it can be used in a variety of different situations, but also because it highlights a bunch of features in the product, demonstrates some alternative approaches, and may be an interesting article for self-learning purposes too.
I've tried to summarise it all below. Please do pop me a comment with your thoughts and suggestions, and let me know if you found this useful and would like more content like this.
1) The source data
In this example I'm starting with a single column of data containing a list of company names, and I want to find the most common words in this list.
2) Tidy up the data
First I tidy up the data a little using a combination of Remove Noise (to strip special characters) and then Upper Case (to remove case variation between the remaining words). At this point you can also easily remove numbers too if they're not of interest to you.
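For anyone curious what that tidy-up amounts to under the hood, here's a rough Python sketch of the same idea (illustrative only, not the product's implementation):

```python
import re

def tidy(value: str) -> str:
    """Rough stand-in for 'remove noise' + 'upper case': strip anything
    that isn't a letter, digit or space, then upper-case the result."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", value)  # remove special characters
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse extra whitespace
    return cleaned.upper()

print(tidy("Smith & Sons Ltd."))  # → "SMITH SONS LTD"
```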
3) Separate out words
Next I split the words out into individual fields (up to 20), which I did with a Text to Columns step (note this is a non-native workflow step, but it is available free of charge; to learn more, reach out to your Experian contact):
Note that this can also be achieved using a native Transform step and the 'Split' function to explicitly extract each term:
4) Columns to Rows
Next I use the new Columns to Rows step to take each of these new columns and essentially stack them on top of each other to create one giant list (note that because I've split the data into 20 columns, there are now a lot of empty rows):
5) Finishing touches
Then I use a simple Filter step to remove the empty rows, group on the word, and append a count, before finally sorting to get the desired output:
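To see the whole pipeline (split, stack, filter, group, count, sort) in one place, here's a minimal Python sketch of the equivalent logic (an illustration using made-up sample names, not the workflow itself):

```python
from collections import Counter

# Hypothetical sample of already-tidied company names
company_names = [
    "SMITH SONS LTD",
    "ACME TRADING LTD",
    "SMITH CONSULTING SERVICES",
]

# Split each name into words (the 'Text to Columns' step) and stack every
# word into one big list (the 'Columns to Rows' step), dropping empties...
words = [w for name in company_names for w in name.split() if w]

# ...then group, count and sort (the filter/group/sort steps).
for word, count in Counter(words).most_common():
    print(word, count)
```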
The magic bit...
Once I'd got this working for my initial dataset, I wanted to make it reusable so it could easily be dropped into other workflows. This involved three minor tweaks:
Part 1: permit the source to be provided at run time (this makes the workflow dynamic, so a different source can simply be configured at run time, which is pretty handy on its own)
Part 2: adjusting the workflow details to tick the 'can be used in other workflows' box
Part 3: put an 'output' step on the end (so that the results can be passed to further steps and surfaced within another workflow)
A final test
Lastly, I tested using it as a step within another workflow, and it worked a treat:
Like a copy?
For those of you with the Text to Columns step already set up, you can simply import this into your Space using the below .dmx file. Alternatively, I hope the above steps will help you build something similar yourself.
If you've not already had a go at using reusable workflows, I'd strongly encourage exploring them, as they're a powerful feature that can help you scale a more standardised approach to processing your data.
I hope you found this post interesting and useful. Let me know your thoughts below in the comments!
All the best,
📋 Index of reusable Functions! 📇
This post simply acts as an index for all the functions shared in this library. If you would like to receive a notification whenever a new function is added, please bookmark this post by clicking on the star icon to the right. Current functions available:
- 👪 Parse Full Name - A handy set of functions that can help standardise the sequence in which full names are presented (e.g. "surname, forename" vs "title forename surname") and can extract the individual name elements (title, forename, middle name and surname) using logic.
- Next ⏭️📅 & Previous ⏮️📅 Working Day - As per the name, these functions detect the next working day after (and previous working day before) a given input date.
- Last day / working date of month 📅 - These functions identify the last day (and last non-weekend day) of the month for a given input.
- Parse Date 🧽📅 - This configurable function allows the user to specify a date format string and then it auto-parses in line with that specified format.
- Standardise Country 🌎 - The function uses reference data (contained within the .dmxd file) to standardise known country aliases to the standard form (e.g. England = United Kingdom).
- Offensive Words 🤬 - The functions contained within this package all relate to flagging and dealing with data containing offensive language. All of these functions use a domain of offensive words (contained within the .dmxd package) which contains a list of known offensive terms.
- Mask Out Email/Phone 🛡️ - This package contains two functions to help with anonymising certain input values, whilst leaving an output that can still be used for non-sensitive analysis.
- Proper Case Surname 📛 - This package contains two functions which help with contact data: Proper Case Surname and Validate Surname Casing.
- Get Word ✂️ - Extracts the 'nth' word from a string. Where n = 2 the second word is retrieved, where n = -1 the last word is retrieved.
- Reasonable Date of Birth (DOB) 🔞 - Checks the input is a value which seems reasonable as a valid date of birth, using user-defined min/max age parameters (i.e. a date, not in the future, relating to an individual within an age bracket of 16-100 years old)
- Standardise Gmail addresses 📧 - Standardise Gmail addresses for matching purposes (e.g. googlemail/gmail as well as email addresses associated with the same account via the use of '+' and '.' in the account part of the email)
- Job Title Match Key 👨💼👩💼 - Generates a key that can be used to group job titles together (despite presentation differences)
- Invalid Character for Names ☹️ - Finds records where the field contains characters which are invalid for names. Records which contain digits, commas, and other special characters will yield a "true" result. Apostrophes, dashes and periods are not considered "special characters" for this function. This function is not suitable for Validation Rules -- use "Contains Only Valid Characters for Names" instead.
- Compare Dates (Verbose) 📅📅 - Provides a summary of how 2x input dates compare (includes 'convert to date' logic). Output options are: Exact match, Incomplete - Date 1 is missing (or not a date), Incomplete - Date 2 is missing (or not a date), Incomplete - Both dates are missing (or not a date), Close - Day & Month match, Close - Month & Year match, Close - Day & Year match, Other difference
- Convert Boolean ✅❌- Converts binary or Boolean values (i.e. true/false, pass/fail, 1/0) to a graphical emoji icon to aid visual presentation in the UI.
- Contains Non-Latin Characters 🈯 - Identifies the presence of any characters not in the 'Basic Latin' Unicode block.
- Reverse String ⏪ - Reverses the input (e.g. "Danny R" becomes "R ynnaD")
- Repeating Characters 🔁 - Uses a regular expression to identify records where the entire value of a cell is made up of the same character repeated (e.g. "aaa" or "0000000").
- PCI Detection 💳 (Payment Card Information) - Checks that the input contains 16 char numbers (either as a single string or separated with hyphens/spaces every 4 characters) [AmEx format also accounted for]
- SIC Conversion 🏷️ - Takes a 2007 format SIC code as an input and returns the high level label of the 'category' of businesses which it falls into.
- Future Date Check 📅 (Dynamic) - Checks that the input date is a date in the future (relative to the time of execution).
- Extract First Word 🥇📝 - Retrieves the first word in a string
- Extract Last Word 💬 🥉 - Retrieves the last word in a string
- Replace the word 'NULL' or whitespace-only values with null 🔄👻 - Produces a true null value
- Calculate the distance between two sets of co-ordinates 🌍️ - Returns the approximate distance in kilometres between a pair of geographical co-ordinates
- ⌚ Convert Epoch Date/Time to Standard Date/Time - converts a timestamp from epoch format to 'standard' format
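To give a flavour of the logic behind a few of these functions, here are rough Python equivalents (illustrative sketches only; the actual functions are built inside Data Studio, not Python):

```python
import re
from datetime import datetime, timezone

def get_word(text: str, n: int) -> str:
    """Get Word: extract the nth word (n=2 -> second word, n=-1 -> last word)."""
    words = text.split()
    return words[n - 1] if n > 0 else words[n]

def is_repeating(value: str) -> bool:
    """Repeating Characters: true when the whole value is one character repeated."""
    return bool(re.fullmatch(r"(.)\1+", value))

def from_epoch(seconds: int) -> str:
    """Convert Epoch Date/Time: epoch seconds to a 'standard' timestamp (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(get_word("Experian Aperture Data Studio", -1))  # → "Studio"
print(is_repeating("0000000"))                        # → True
```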
Re: Date formatting
Hello @Nigel Light , the Date format string must be dd/MM/yyyy where "MM" (all capitals) is the month. Hope that sorts the date format. Regards, Pavan
Re: Setup Secure SSL using self-signed certificates
Hi @Sean Leoszko , can you say a bit more about the type of certificate you want to generate, and how you want to use it?
To explain the differences it's useful to understand the purpose of SSL certificates:
- To encrypt sensitive information sent across the network between the client (e.g. browser) and the server.
- To provide authentication. Specifically, to identify your server and protect your clients from things like man-in-the-middle attacks.
There are three types of certificate you might use:
Data Studio comes with its own self-signed certificate, which is what's used when you select "use SSL" in the server settings. You'll usually change the port to 443 as well. A self-signed certificate is not signed by any certificate authority (CA), so certificates like this provide encryption but not authentication (or rather, they place the burden of verifying trust explicitly on the user). You'll see a "your connection is not private" warning in your browser telling you that the connection is not safe. If you generate your own self-signed certificate, it will behave in the same way.
You could also request and apply a certificate signed by a trusted certificate authority (CA) to both enable encryption of client-server communications and identify your server to clients. If you want your Data Studio instance to be available to the general public via the internet, you'd want to use a certificate from a public CA to initiate secure and “trusted” communication.
You may actually want something in between: a certificate issued by a private (internal) CA. A private CA is typically enterprise-specific and used for internal servers on a private network. These certificates are not publicly trusted. In this scenario, in order to achieve trust (authentication) an organization would generate a root certificate and then configure all clients (e.g. Chrome) to trust certificates signed by that root by installing it as a trusted root certificate authority.
The 3rd option, using a private CA, is the most common way Experian's clients secure Data Studio, and also the Find duplicates server. In this case, the client's IT team would typically provide the certificate, either as separate public certificate and private key files, or in a combined format like PKCS12.
Re: Inferring Gender from Title or Honorific and Gender Reversal
From a real-world usage perspective, whilst this is a useful calculation, its use also needs to be carefully considered. Here are a few consulting recommendations:
- Use honorifics only to derive a possible gender, not to make assumptions.
- Any column should be named as "possible/derived" with metadata/data dictionary entries indicating the derivation method.
- Suitable use cases include:
- In a data quality rule to compare an honorific with a gender flag field to look for "possible" but not "certain" mismatches
- Analytics with large enough sample sizes to minimise errors (assuming no systematic errors)
- Light touch personalisation with marketing, but still without assuming gender (caution still required)
- Unsuitable use cases include:
- Creating a new gender flag (or changing an existing one) from a derived calculation
- Updating source systems with new gender flags
- Indicating gender in marketing or service communications
A good example of a suitable use case might be using this calculation to "blank out" potentially inaccurate gender flags before loading a dataset to a marketing platform ahead of a personalised campaign.
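To make the "possible mismatch" data quality rule above concrete, here's a hypothetical Python sketch (the title-to-gender mapping is invented for illustration; honorifics only ever suggest a possible gender, so the rule flags records for review rather than correcting them):

```python
# Hypothetical mapping: an honorific suggests a *possible* gender at most.
POSSIBLE_GENDER_BY_TITLE = {"MR": "M", "MRS": "F", "MISS": "F", "MS": "F"}

def possible_mismatch(title: str, gender_flag: str) -> bool:
    """Flag a possible (never certain) mismatch between honorific and gender flag."""
    derived = POSSIBLE_GENDER_BY_TITLE.get(title.strip().upper())
    if derived is None:  # ambiguous titles (Dr, Prof, Mx, ...) never flag
        return False
    return derived != gender_flag.strip().upper()

print(possible_mismatch("Mr", "F"))  # → True  (worth a review, not a correction)
print(possible_mismatch("Dr", "F"))  # → False (no inference possible)
```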
Re: Where could I find more information on how to use "Query duplicates store"?
Hi @MichaelJlam you raise an interesting question, namely: what is the purpose and intent of the different parts of the Find Duplicates functionality?
You'll find v2 documentation located here : https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/create-a-single-customer-view-scv/deduplicate-data/
and v1 documentation here https://docs.experianaperture.io/data-quality/aperture-data-studio-v1/improve-data-quality/find-duplicates/3
Essentially there are three key elements to the Find Duplicates functionality:
- Evaluate potential relationships between records in a panel of data and assign a transient 'cluster ID' and match level confidence value
- Establish a transient store of searchable records
- Support processing of delta loads of records that may have relationships with the transient store
For the first element, you build a workflow with one or more sources feeding into the Find Duplicates step. When you execute that workflow, the additional columns are appended. Using the Harmonise step, a grouping step, or in fact a custom step, you can nominate one surviving record per cluster and establish your 'golden nominal' deduplicated record set.
You may have to run data through that workflow several times before you arrive at a fully deduplicated set of records. The Harmonise step is the important part here to eliminate subordinate matches.
Every time you execute that workflow on the full dataset (or a subset of it), the 'match store', effectively a transient database of records, is rebuilt as a searchable store with potentially different cluster IDs. It is therefore important to understand that the cluster ID is probably not a good key to use for persistence. Each run's database becomes the latest version, and unless you are retaining old data stores, what you had before is effectively discarded. This is the simplistic understanding you need of what is happening under the covers.
So, assuming you have executed the workflow with the Find Duplicates step at least once, you can then leverage the REST APIs associated with the Data Studio Find Duplicates engine to perform three tasks under Advanced Usage.
- Extract the Find Duplicates store records
- Search the Find Duplicates store (to avoid creating a new record unnecessarily)
- Add or make changes to the store (process newly arrived small batches or individual records to determine additional new cluster IDs or delete records)
Returning to your intent, it sounds like you have a couple of purposes, or are trying a couple of different activities. It seems you're using the Find Duplicates functionality to match business data, which means your matching criteria need to be different from those you might use to match people and families.
Data Studio Find Duplicates is optimised to 'match' and cluster people records based on names and addresses, and will support the use of additional matching criteria like email address, phone number, date of birth and other data attributes, but for business data you may need some specialised blocking keys and matching rules.
What you may want to consider is using the Experian pH Business Search custom step for matching business data (in the UK), which specialises in matching businesses. You can find details of that custom step in the Marketplace as a Data Studio extension. Before using that step, I would recommend that you run the business data through the Address Validation step too, to ensure that you have the best possible addresses for those businesses.
Re: Where could I find more information on how to use "Query duplicates store"?
The Query duplicates store step is used to query the results of duplicate matching when the UniqueID and ClusterID are not known but the name and contact details are. This is useful where new records are coming into the SCV from an external source that has no direct links with the SCV: you can query the SCV to identify records that already exist within it. Results from this query may link all records, none, or somewhere in between; it will depend on the data you are comparing. It is possible that only 1% of your data can be found in your match store.
Some things to note on the configuration:
Make sure that the selected columns in the query match the number and order of those used in the Find Duplicates step, including the Unique ID column (it is OK to send nothing for this).
For example, my Find Duplicates setup contains 29 columns.
I would then configure the query with all 29 columns, even if I don't have any data for them in my input file. You can create the columns you don't have using a Transform step with the Constant function to apply a null value, or you could make use of the Map to Target step to ensure you have the correct columns.
Map to Target
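To picture the column-padding idea, here's a small Python sketch (using a hypothetical 5-column layout rather than the full 29): missing columns, including Unique ID, are filled with nulls so the positions line up with the Find Duplicates configuration:

```python
# Hypothetical example: the match store expects 5 columns in a fixed order,
# but the incoming record only supplies a name and an email address.
EXPECTED_COLUMNS = ["UniqueID", "Name", "Email", "Phone", "Postcode"]

incoming = {"Name": "Jane Doe", "Email": "jane@example.com"}

# Pad every missing column (including UniqueID) with a null, preserving order.
query_record = [incoming.get(col) for col in EXPECTED_COLUMNS]
print(query_record)  # → [None, 'Jane Doe', 'jane@example.com', None, None]
```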
Make sure you take a snapshot directly after the Query step; the Show Data button only runs 20 rows by default for performance reasons, so using a snapshot and running the workflow directly ensures all rows are run.
Failing that, I would recommend finding easy records from your external system that you know exist in the SCV (by joining attributes such as name and email, or name and phone), making sure these are returned from the Query step, and verifying that the configuration is correct.
Attached is the setup guide for the step, which also includes some basic usage examples.
Re: Identification of an individual flat in a building using Address Validate
Thanks @Nigel Light . I am going to work with the team to get this working in V1 too. I will keep you posted on the progress!