Best Of
Index of reusable functions!
This post simply acts as an index for all the functions shared in this library. If you would like to receive a notification whenever a new function is added, please bookmark this post by clicking on the star icon to the right. Current functions available:
- Standardise Country - Uses reference data (contained within the .dmxd file) to standardise known country aliases to a standard form (e.g. England = United Kingdom).
- Offensive Words - The functions in this package all relate to flagging and dealing with data containing offensive language. All of these functions use a domain of offensive words (contained within the .dmxd package) which lists known offensive terms.
- Mask Out Email/Phone - This package contains two functions to help with anonymising certain input values, whilst leaving an output that can still be used for non-sensitive analysis.
- Proper Case Surname - This package contains two functions which help with contact data: Proper Case Surname and Validate Surname Casing.
- Get Word - Extracts the nth word from a string: where n = 2 the second word is retrieved; where n = -1 the last word is retrieved.
- Reasonable Date of Birth (DOB) - Checks that the input is a value which seems reasonable as a valid date of birth, using user-defined min/max age parameters (i.e. a date, not in the future, relating to an individual within an age bracket of 16-100 years old).
- Standardise Gmail addresses - Standardises Gmail addresses for matching purposes (e.g. googlemail/gmail variants, as well as addresses that map to the same account via '+' suffixes and '.' characters in the account part of the email). A sketch of the underlying idea is shown after this list.
- Job Title Match Key - Generates a key that can be used to group job titles together (despite presentation differences).
- Invalid Character for Names - Finds records where the field contains characters which are invalid for names. Records which contain digits, commas, and other special characters will yield a "true" result. Apostrophes, dashes and periods are not considered "special characters" for this function. This function is not suitable for Validation Rules; use "Contains Only Valid Characters for Names" instead.
- Compare Dates (Verbose) - Provides a summary of how two input dates compare (includes 'convert to date' logic). Output options are: Exact match, Incomplete - Date 1 is missing (or not a date), Incomplete - Date 2 is missing (or not a date), Incomplete - Both dates are missing (or not a date), Close - Day & Month match, Close - Month & Year match, Close - Day & Year match, Other difference.
- Convert Boolean - Converts binary or Boolean values (i.e. true/false, pass/fail, 1/0) to a graphical emoji icon to aid visual presentation in the UI.
- Contains Non-Latin Characters - Identifies the presence of any characters not in the 'Basic Latin' Unicode block.
- Reverse String - Reverses the input (e.g. "Danny R" becomes "R ynnaD").
- Repeating Characters - Uses a regular expression to identify records where the entire value of a cell is made up of the same character repeated (e.g. "aaa" or "0000000").
- PCI Detection (Payment Card Information) - Checks whether the input contains 16-digit card numbers (either as a single string or separated with hyphens/spaces every 4 digits); the AmEx format is also accounted for.
- SIC Conversion - Takes a 2007-format SIC code as input and returns the high-level label of the category of businesses it falls into.
- Future Date Check (Dynamic) - Checks that the input date is in the future (relative to the time of execution).
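For anyone curious how the Gmail standardisation works in principle, here is a rough Python sketch of the same idea. It is not the Data Studio function itself, just an illustration of the normalisation rules described in that item.

```python
def standardise_gmail(email: str) -> str:
    """Rough illustration of Gmail normalisation for matching purposes.

    Not the Data Studio function itself: it lowercases the address,
    treats googlemail.com as gmail.com, strips '+tag' suffixes and
    removes '.' characters from the account part, since Gmail ignores them.
    """
    local, _, domain = email.strip().lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]   # drop anything after a '+'
        local = local.replace(".", "")   # dots are ignored by Gmail
        domain = "gmail.com"             # treat googlemail as gmail
    return f"{local}@{domain}"


# Example: all of these normalise to the same key for matching
for e in ["Jane.Doe+news@googlemail.com", "janedoe@gmail.com", "jane.doe@gmail.com"]:
    print(standardise_gmail(e))  # -> janedoe@gmail.com
```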
Re: Date formatting
Hello @Nigel Light, the date format string must be dd/MM/yyyy, where "MM" (both capitals) is the month. Hope that sorts the date format. Regards, Pavan
Re: Setup Secure SSL using self-signed certificates
Hi @Sean Leoszko, can you say a bit more about the type of certificate you want to generate, and how you want to use it?
To explain the differences, it's useful to understand the purposes of SSL certificates:
- To encrypt sensitive information sent across the network between the client (e.g. browser) and the server.
- To provide authentication. Specifically, to identify your server and protect your clients from things like man-in-the-middle attacks.
There are three types of certificate you might use:
Data Studio comes with its own self-signed certificate, which is what's used when you select "use SSL" in the server settings (you'll usually change the port to 443 as well). A self-signed certificate is not signed by any certificate authority (CA); certificates like this provide encryption but not authentication (or rather, they place the burden of verifying trust explicitly on the user). You'll see a "your connection is not private" warning in your browser telling you that the connection is not trusted. If you generate your own self-signed certificate (see the sketch below) it will behave in the same way.
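If you do want to generate your own self-signed certificate for testing, here is a minimal Python sketch using the cryptography library. The hostname is a made-up example, and an openssl one-liner achieves the same thing; you'd then apply the key and certificate to Data Studio as the product documentation describes.

```python
import datetime

from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Hypothetical hostname -- use the name clients will actually use to reach Data Studio.
hostname = "datastudio.example.internal"

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, hostname)])

cert = (
    x509.CertificateBuilder()
    .subject_name(name)          # self-signed: subject and issuer are the same
    .issuer_name(name)
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
    .add_extension(x509.SubjectAlternativeName([x509.DNSName(hostname)]), critical=False)
    .sign(key, hashes.SHA256())
)

# Write the certificate and (unencrypted) private key as PEM files
with open("cert.pem", "wb") as f:
    f.write(cert.public_bytes(serialization.Encoding.PEM))
with open("key.pem", "wb") as f:
    f.write(key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),
    ))
```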
You could also request and apply a certificate signed by a trusted certificate authority (CA) to both enable encryption of client-server communications and identify your server to clients. If you want your Data Studio instance to be available to the general public via the internet, you'd want to use a certificate from a public CA to initiate secure and “trusted” communication.
You may actually want something in between: a certificate issued by a private (internal) CA. A private CA is typically enterprise-specific and used for internal servers on a private network. These certificates are not publicly trusted. In this scenario, in order to achieve trust (authentication) an organization would generate a root certificate and then configure all clients (e.g. Chrome) to trust certificates signed by that root by installing it as a trusted root certificate authority.
The third option, a private CA, is the most common way Experian's clients secure Data Studio, and also the Find Duplicates server. In this case, the client's IT team would typically provide the certificate, either as separate public certificate and private key files, or in a combined format like PKCS12.
Re: Inferring Gender from Title or Honorific and Gender Reversal
From a real-world usage perspective, whilst this is a useful calculation, its use also needs to be carefully considered. Here are a few consulting recommendations:
- Use honorifics only to derive a possible gender, not to make assumptions.
- Any column should be named as "possible/derived" with metadata/data dictionary entries indicating the derivation method.
- Suitable use cases include:
- In a data quality rule to compare an honorific with a gender flag field to look for "possible" but not "certain" mismatches
- Analytics with large enough sample sizes to minimise errors (assuming no systematic errors)
- Light touch personalisation with marketing, but still without assuming gender (caution still required)
- Unsuitable use cases include:
- Creating a new gender flag (or changing an existing one) from a derived calculation
- Updating source systems with new gender flags
- Indicating gender in marketing or service communications
A good example of a suitable use case might be using this calculation to "blank out" potentially inaccurate gender flags before loading a dataset to a marketing platform ahead of a personalised campaign; a rough sketch of that idea follows below.
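As an illustration of that last use case, here is a rough Python sketch. The honorific mapping is only an example; the point is that the derived value is used to blank a suspect gender flag, never to set one.

```python
# Example honorific -> "possible gender" mapping; deliberately conservative.
# Titles that carry no gender signal (Dr, Prof, Mx, ...) are simply absent.
POSSIBLE_GENDER = {
    "mr": "M",
    "mrs": "F",
    "miss": "F",
    "ms": "F",
}

def possible_gender_from_title(title: str):
    """Return a *possible* gender derived from an honorific, or None."""
    return POSSIBLE_GENDER.get(title.strip().lower().rstrip("."))

def blank_out_suspect_gender(title: str, gender_flag: str) -> str:
    """Blank a gender flag that disagrees with the honorific-derived value.

    We never *set* a gender from the title; we only clear values that look
    potentially inaccurate before loading to the marketing platform.
    """
    derived = possible_gender_from_title(title)
    if derived is not None and gender_flag and gender_flag != derived:
        return ""        # possible mismatch -> blank rather than guess
    return gender_flag

print(blank_out_suspect_gender("Mrs", "M"))  # -> "" (possible mismatch)
print(blank_out_suspect_gender("Dr", "F"))   # -> "F" (no signal, left alone)
```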
Re: Where could I find more information on how to use "Query duplicates store"?
Hi @MichaelJlam, you raise an interesting question, namely: what is the purpose and intent of the different parts of the Find Duplicates functionality?
You'll find the v2 documentation here: https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/create-a-single-customer-view-scv/deduplicate-data/
and the v1 documentation here: https://docs.experianaperture.io/data-quality/aperture-data-studio-v1/improve-data-quality/find-duplicates/3
Essentially there are three key elements to the Find Duplicates functionality:
- Evaluate potential relationships between records in a panel of data and assign a transient 'cluster ID' and match level confidence value
- Establish a transient store of searchable records
- Support processing of delta loads of records that may have relationships with the transient store
For the first piece, you build a workflow that may have one or more sources feeding into the Find Duplicates step. When you execute that workflow, the additional columns are appended. Using the Harmonise step, a grouping step, or in fact a custom step, you can nominate one record per cluster to be the surviving record and establish your 'golden nominal' deduplicated record set.
You may have to run data through that workflow several times before you arrive at a fully deduplicated set of records. The Harmonise step is the important part here for eliminating subordinate matches.
Every time you execute that workflow on the full dataset or a subset of it, the 'match store' (effectively a transient database of records) creates a searchable store with potentially different cluster IDs, so it is important to understand that the cluster ID is probably not a good key to use for persistence. Every run's database becomes the latest version of that store, and unless you are retaining old data stores, what you had before is effectively discarded. This is the simplistic understanding you need of what is happening under the covers.
So, assuming you have executed the workflow with the Find Duplicates step at least once, you can then leverage the REST APIs associated with the Data Studio Find Duplicates engine to perform three tasks under Advanced Usage (a rough illustration of calling such an API follows the list below).
- Extract the Find Duplicates store records
- Search the Find Duplicates store (to avoid creating a new record unnecessarily)
- Add or make changes to the store (process newly arrived small batches or individual records to determine additional new cluster IDs or delete records)
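To make the search case concrete, here is a very rough Python sketch. The endpoint path, payload shape and authentication header are placeholders only; the real paths and schemas are in the Find Duplicates REST API documentation linked above, so treat this purely as an illustration of the pattern.

```python
import requests

BASE_URL = "https://your-datastudio-host"   # placeholder host
API_KEY = "your-api-key"                    # issued from Data Studio

def search_duplicate_store(record: dict) -> dict:
    """Search the Find Duplicates store for candidate matches to one record.

    The path and payload below are illustrative only -- consult the Find
    Duplicates REST API documentation for the actual endpoint and field names.
    """
    response = requests.post(
        f"{BASE_URL}/duplicate-store/search",   # illustrative path only
        json=record,
        headers={"Authorization": API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: check whether an incoming record already exists before adding it
candidates = search_duplicate_store({"Name": "A N Example Ltd", "Postcode": "SW1A 1AA"})
print(candidates)
```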
Returning to your intent, it sounds like you have a couple of purposes, or are trying a couple of different activities. It seems you are using the Find Duplicates functionality to match business data, which means your matching criteria need to be different from those you might use to match people and families.
Data Studio Find Duplicates is optimised to match and cluster people records based on names and addresses, and it supports additional matching criteria like email address, phone number, date of birth and other data attributes; for business data, however, you may need some specialised blocking keys and matching rules.
What you may want to consider is using the Experian pH Business Search custom step, which specialises in matching businesses (in the UK). You can find details of that custom step in the Marketplace as a Data Studio extension. Before using it, I would recommend that you run the business data through the Address Validation step too, to ensure that you have the best possible addresses for those businesses.
Re: Where could I find more information on how to use "Query duplicates store"?
Hi Michael
Query duplicates store is used to query the results of duplicate matching where the UniqueID and ClusterID are not known but the name and contact details are. This is useful where new records are coming into the SCV from an external source that has no direct links with it: you can query the SCV to identify records that already exist within it. Results from this query may link all of the records, none of them, or somewhere in between; it will depend on the data you are comparing. It is entirely possible that only 1% of your data is found in your match store.
Some things to note on the configuration:
Make sure that the columns selected in the query match the number and order of those used in the Find Duplicates step, including the Unique ID column (it is OK to send in nothing for this).
For example, my Find Duplicates setup contains 29 columns.
I would then configure the query with all 29 columns, even if I don't have any data for them in my input file. You can create the columns you don't have using the Transform step with the Constant function to apply a null value, or you could make use of the Map to target step to ensure you have the correct columns.
(Screenshots in the original post show the Transform and Map to target configurations.)
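Outside Data Studio, the same idea (padding the input out to the full 29-column layout, in the right order, with nulls for anything you don't have) looks like this in pandas. The column names here are made up; the point is simply that the query input must mirror the Find Duplicates layout exactly.

```python
import pandas as pd

# The columns used in the Find Duplicates step, in their exact order
# (illustrative names only -- use your real 29-column layout).
store_columns = ["Unique ID", "Name", "Address 1", "Address 2", "Postcode",
                 "Email", "Phone"]  # ...and so on up to 29 columns

incoming = pd.DataFrame([{"Name": "Jane Doe", "Email": "jane@example.com"}])

# Reindex pads missing columns with nulls and enforces the store's order,
# which is what the Transform/Constant or Map to target steps achieve in Data Studio.
query_input = incoming.reindex(columns=store_columns)
print(query_input)
```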
Make sure you take a snapshot directly after the Query step; the Show data button will only run 20 rows by default for performance reasons, so using a snapshot and running the workflow directly ensures all rows are run.
Failing that, I would recommend finding some easy records from your external system that you know exist in the SCV (by joining on attributes such as name and email, or name and phone), making sure these are returned from the Query step, and verifying that the configuration is correct.
Attached is the setup guide for the step; it includes some basic usage guidance too.
Re: Identification of an individual flat in a building using Address Validate
Thanks @Nigel Light. I am going to work with the team to get this working in V1 too. I will keep you posted on the progress!
Re: Delta Data Loads - Data Studio version 2.0
@Carolyn congratulations on going live!
Re: Edge
Great question @Nigel Light. We test, support and use Chrome, Edge and Chromium, so you should be good.
Re: Address Rationalisation
@Keith Alexander I think what you're after is the split function.
This takes an input string, a character to split on (comma in your case) and then the item to return.
The example above shows that for Address Line 1. You would create a workflow with a Transform step which creates five columns (Address Line 1-5), and for each one it's just the Split function with the original column as the input and the relevant line number to return.
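As a plain-Python illustration of the same logic (not the Data Studio Split function itself, just the idea of one call per target column):

```python
def split_part(value: str, delimiter: str, n: int) -> str:
    """Return the nth delimited item (1-based), or an empty string if absent."""
    parts = [p.strip() for p in value.split(delimiter)]
    return parts[n - 1] if 0 < n <= len(parts) else ""

address = "10 Downing Street, Westminster, London, SW1A 2AA"

# One call per target column, mirroring five Split transforms in the workflow
address_lines = {f"Address Line {i}": split_part(address, ",", i) for i in range(1, 6)}
print(address_lines)
# {'Address Line 1': '10 Downing Street', 'Address Line 2': 'Westminster',
#  'Address Line 3': 'London', 'Address Line 4': 'SW1A 2AA', 'Address Line 5': ''}
```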