Best Of
Re: Aperture REST API calls as Datasets within Aperture
Hi @Dom Gittins , really glad that you asked this question because I think there's a lot of potential here for solutions that use the native Autonomous REST JDBC driver with Data Studio's REST API to create interesting and useful management / admin dashboards.
It can be a bit fiddly to set up the AutoREST connector so that the REST API's JSON responses are correctly parsed to tables - particularly when the response contains a nested array of values. In these cases, the driver normalizes the nested array to a child table which can be joined via a key from the parent table. An example of this is the /spaces endpoint, which returns an array of users and groups that have access to each space:
I'll show later how this is handled by Data Studio - to start with, here's how I've set up the External System to connect to Data Studio's API:
- Create an API key - Both the API key and the REST API calls you make are environment-specific, so if you have multiple environments you'll need to manage them with separate keys and tables. The API key only needs Read access to the endpoint you are interested in.
- Create a new External System using the Autonomous REST connector. We're going to use a .rest file to list the endpoints and the schemas for the resulting data. The settings you need for this are:
- Servername = The hostname of the Data Studio server. So if your API documentation is at https://local.datastudio/api/docs/index.html, the servername setting should be "https://local.datastudio"
- REST config file = The location on disk (on the Data Studio server) where we'll drop the attached .rest file
- Authentication = None (the auth is handled via the API call header rather than the driver auth)
- Then create two user connection properties:
- SecurityToken = "<environment_label><space><API_Key>". This is the same way you would authenticate with the REST API directly (see the sketch below the endpoint list for a quick way to test this)
- AuthenticationMethod = HTTPHeader
- You'll need to create a Credential for the External System to use, but this will just be a dummy value because the External System itself holds the auth credentials
- Now drop the attached .rest file into the location you defined in the External System's "REST config file" setting. Remove the ".txt" extension that I had to add to allow the file to be uploaded.
This .rest file defines the endpoints that allow you to load:
- Event information (/events)
- User session information (/sessions)
- Info on all Spaces (/spaces)
- Job execution information (/jobs)
- Dataset information (/datasets)
- ExportableObjects (/exportableobjects) * added Nov 2023
- Metrics (/metrics) * added Nov 2023
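If you want to sanity-check your API key outside Data Studio before going further, here's a minimal Python sketch that calls one of these endpoints directly. The "Authorization" header name and base path are my assumptions - check the exact details on your /api/docs page:

```python
# Minimal sketch: call the /spaces endpoint directly with the same
# "<environment_label> <API_Key>" token used for the SecurityToken property.
# The header name and base path are assumptions - verify against your API docs.
import requests

BASE_URL = "https://local.datastudio/api"   # adjust to your server and API version
TOKEN = "default 0123456789abcdef"          # dummy "<environment_label> <API_Key>" value

resp = requests.get(f"{BASE_URL}/spaces", headers={"Authorization": TOKEN})
resp.raise_for_status()
print(resp.json())
```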
To test the connection, go to Datasets -> Add Dataset and select the External System you have created. You should see the following list of tables generated by the .rest file's schema and available to load:
Here we also see how nested arrays are normalized. The SPACES table has a child SPACES_PERMISSIONS table. The "ID" column in the SPACES table is the join key into SPACES_PERMISSIONS (where the column is called SPACES_ID). By joining these in a workflow, you'll get a view of all spaces and associated user & group permissions:
JOBS also has a child table (key is the executionID), and DATASETS has a more complicated structure because a Dataset can include more than one Table, and a Table can have multiple batches of data, as well as multiple columns and tags.
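If it helps to picture what the driver is doing with these nested arrays, here's a rough pandas sketch of the same parent/child split for the SPACES example above. The field names are made up for illustration - the real /spaces response will differ:

```python
import pandas as pd

# Illustrative data only - the field names are assumptions, not the real /spaces payload
spaces = [
    {"id": 1, "name": "Marketing", "permissions": [
        {"name": "alice", "type": "USER", "role": "Consumer"},
        {"name": "DQ Team", "type": "GROUP", "role": "Contributor"},
    ]},
    {"id": 2, "name": "Finance", "permissions": []},
]

# Parent table (SPACES): one row per space, with the nested array dropped
spaces_df = pd.json_normalize(spaces).drop(columns=["permissions"])

# Child table (SPACES_PERMISSIONS): one row per permission, keyed back to the parent
permissions_df = pd.json_normalize(
    spaces, record_path="permissions", meta=["id"], meta_prefix="spaces_"
)

# Joining on ID = SPACES_ID reproduces the flat "space plus permissions" view
joined = spaces_df.merge(permissions_df, left_on="id", right_on="spaces_id")
print(joined)
```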
Join us Online for our next Data Management User Group | 15 September 22!
I am pleased to confirm that our next UK Data Management User Group will take place on Thursday 15 September 2022, 10.30am until 3pm London UK time, and if you are an Experian client based in the UK you can sign up today.
We’re also excited to confirm that this will be our first hybrid session. The event will be hosted at The National Gallery in London, but you can also attend virtually. Please indicate your preference for attending when signing up.
Following on from our previous successful events, we will continue to provide further updates and showcase key features, as well as spotlight some of the newer product capabilities in Aperture Data Studio.
We're also delighted to confirm we will have presentations from two of our clients, St. James's Place and Speedy Services, on their use cases.
You can register for the event by clicking here but if you have any queries regarding the event please do not hesitate to contact me.
Many thanks,
Vania Sigalas
Word Frequency
Every now and then a scenario crops up where it'd be handy to know how often a given word occurs within a given dataset. For example, you're profiling a reasonably standardised list of values (e.g. job titles) and you want to identify unusually common terms (like 'manager', 'executive', etc.) or infrequent ones (like 'test123'). You may also want to perform this type of processing to generate lookup tables to be used for standardisation/cleansing/validation later on (e.g. company terms from an organisation name field, product groups from a product name column, etc.).
Alternatively you may just want to do this to perform some analysis to achieve something like a word cloud:
Either way, I've recently had a stab at creating this and wanted to share the results with you, not only because this is something which I believe can be used in a variety of different situations, but also because it highlights a bunch of features in the product and some alternative approaches, and may be an interesting article for self-learning purposes too.
I've tried to summarise it all below. Please do pop me a comment with your thoughts and suggestions, and let me know if you found this useful and would like more content like this.
1) Start
In this example I'm starting with a single column of data containing a list of company names (and I want to find the most common words in this list)
2) Standardisation
First I tidy up the data a little using a combination of remove noise (to strip special characters) and then upper case (to remove case variation between the remaining words). At this point you can also easily remove numbers if they're not of interest to you.
3) Separate out words
Next I split out the words into individual fields (up to 20x) and I did this with a Text to Columns step (note this is a non-native workflow step but is available free of charge; to learn more, reach out to your Experian contact):
Note that this can also be achieved using a native Transform step and the 'Split' function to explicitly extract each term:
4) Columns to Rows
Next I use the new Columns to Rows step to take each of these new columns and essentially stack them on top of each other to create a giant list (note that as I've split the data into 20x columns, there are now a lot of empty rows):
5) Finishing touches
Then I use a simple Filter step to remove the empty rows, group on the 'word' and append a count, before finally doing a sort to get the desired output:
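For anyone who wants to prototype the same logic outside Data Studio, here's a rough Python equivalent of steps 2-5 (the company names are just example values):

```python
# Rough equivalent of steps 2-5: standardise, split into words, stack and count
import re
from collections import Counter

company_names = ["Smith & Sons Ltd.", "ACME Manufacturing Ltd", "test123 ltd"]  # example data

counts = Counter()
for name in company_names:
    cleaned = re.sub(r"[^A-Za-z0-9 ]", " ", name).upper()  # step 2: remove noise + upper case
    counts.update(cleaned.split())                          # steps 3-5: split, stack, drop empties, count

for word, count in counts.most_common():                    # step 5: sort by frequency
    print(word, count)
```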
The magic bit...
Once I'd got this working for my initial dataset I wanted to make this reusable to be easily used in other workflows. This involved 3x minor tweaks:
Part 1: permit the source to be provided at run-time (this allows this workflow to be dynamic and simply have a different source configured at run time which is pretty handy on its own)
Part 2: adjust the workflow details to check the 'can be used in other workflows' box
Part 3: put an 'output' step on the end (so that the results can be passed to further steps and surfaced within another workflow)
A final test
Lastly, I tested using it as a step within another workflow, which worked a treat:
Like a copy?
For those of you with the 'text to columns' step already set up, you can simply import this into your space using the below .dmx file; alternatively, I'm hoping the above steps will help you build something similar yourself.
If you've not already had a go at using 'reusable workflows', I'd strongly encourage exploring this as it's a really powerful feature which can help scale a more standardised approach to processing your data.
Feedback
I hope you found this post interesting/useful. Let me know your thoughts below in the comments!
All the best,
Danny
📋 Index of reusable Functions! 📇
This post simply acts as an index for all the functions shared in this library. If you would like to receive a notification whenever a new function is added, please bookmark this post by clicking on the star icon to the right. Current functions available:
- 👪 Parse Full Name - A handy set of functions that can help standardise the sequence full names are presented in (e.g. "surname, forename" vs "title forename surname") and can extract individual name elements (title, forename, middlename and surname) using logic.
- Next ⏭️📅 & Previous ⏮️📅 Working Day - As per the name, these functions detect the next working day after (and previous working day before) a given input date.
- Last day / working date of month 📅 - These functions identify the last day (and last non-weekend day) of the month for a given input.
- Parse Date 🧽📅 - This configurable function allows the user to specify a date format string and then it auto-parses in line with that specified format.
- Standardise Country 🌎 - The function uses reference data (contained within the .dmxd file) to standardise known country aliases to the standard form (e.g. England = United Kingdom).
- Offensive Words 🤬 - The functions contained within this package all relate to flagging and dealing with data containing offensive language. All of these functions use a domain of offensive words (contained within the .dmxd package) which contains a list of known offensive terms.
- Mask Out Email/Phone 🛡️ - This package contains 2x functions to help with anonymising certain input values, whilst leaving an output that can still be used for non-sensitive analysis.
- Proper Case Surname 📛 - This package contains 2x functions which help with contact data: Proper Case Surname and Validate Surname Casing.
- Get Word ✂️ - Extracts the 'nth' word from a string. Where n = 2 the second word is retrieved, where n = -1 the last word is retrieved.
- Reasonable Date of Birth (DOB) 🔞 - Checks the input is a value which seems reasonable as a valid date of birth using user-defined min/max age parameters (i.e. a date, not in the future, relating to an individual within an age bracket of 16-100 years old)
- Standardise Gmail addresses 📧 - Standardise Gmail addresses for matching purposes (e.g. googlemail/gmail as well as email addresses associated with the same account via the use of '+' and '.' in the account part of the email) - a rough sketch of this logic appears after this list
- Job Title Match Key 👨💼👩💼 - Generates a key that can be used to group job titles together (despite presentation differences)
- Invalid Character for Names ☹️ - Finds records where the field contains characters which are invalid for names. Records which contain digits, commas, and other special characters will yield a "true" result. Apostrophes, dashes and periods are not considered "special characters" for this function. This function is not suitable for Validation Rules -- use "Contains Only Valid Characters for Names" instead.
- Compare Dates (Verbose) 📅📅 - Provides a summary of how 2x input dates compare (includes 'convert to date' logic). Output options are: Exact match, Incomplete - Date 1 is missing (or not a date), Incomplete - Date 2 is missing (or not a date), Incomplete - Both dates are missing (or not a date), Close - Day & Month match, Close - Month & Year match, Close - Day & Year match, Other difference
- Convert Boolean ✅❌- Converts binary or Boolean values (i.e. true/false, pass/fail, 1/0) to a graphical emoji icon to aid visual presentation in the UI.
- Contains Non-Latin Characters 🈯 - Identifies the presence of any characters not in the 'basic Latin' unicode block.
- Reverse String ⏪ - Reverses the input (e.g. "Danny R" becomes "R ynnaD")
- Repeating Characters 🔁 - Uses a regular expression to identify records where the entire value of a cell is made up of the same character repeated (e.g. "aaa" or "0000000").
- PCI Detection 💳 (Payment Card Information) - Checks that the input contains 16-digit card numbers (either as a single string or separated with hyphens/spaces every 4 digits) [AmEx format also accounted for]
- SIC Conversion 🏷️ - Takes a 2007 format SIC code as an input and returns the high level label of the 'category' of businesses which it falls into.
- Future Date Check 📅 (Dynamic) - Checks that the input date is a date in the future (relative to the time of execution).
- Extract First Word 🥇📝 - Retrieves the 1st word in a string
- Extract Last Word 💬 🥉 - Retrieves the last word in a string
- Replace the word ‘NULL’ or any whitespace-only values with null 🔄👻 - produces a true 'null'
- Calculate the distance between two sets of co-ordinates 🌍️ - Returns the approximate distance in kilometres between a pair of geographical co-ordinates
- ⌚ Convert Epoch Date/Time to Standard Date/Time - converts a timestamp from epoch format to 'standard' format
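As promised above, here's a rough Python sketch of the logic behind the Standardise Gmail addresses function (the shared .dmxd function may differ in detail):

```python
# Sketch only - the packaged function may differ. Gmail treats googlemail.com as an
# alias of gmail.com, ignores dots in the local part, and ignores anything after a "+".
def standardise_gmail(email: str) -> str:
    email = email.strip().lower()
    if "@" not in email:
        return email
    local, domain = email.rsplit("@", 1)
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0].replace(".", "")
        domain = "gmail.com"
    return f"{local}@{domain}"

print(standardise_gmail("Jane.Doe+offers@googlemail.com"))  # -> janedoe@gmail.com
```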
Re: Date formatting
Hello @Nigel Light , the Date format string must be dd/MM/yyyy where "MM" (all capitals) is the month. Hope that sorts the date format. Regards, Pavan
Re: Setup Secure SSL using self-signed certificates
Hi @Sean Leoszko , can you say a bit more about the type of certificate you want to generate, and how you want to use it?
To explain the differences it's useful to understand the purpose of SSL certificates:
- To encrypt sensitive information sent across the network between the client (e.g. browser) and the server.
- To provide authentication. Specifically, to identify your server and protect your clients from things like man-in-the-middle attacks.
There are three types of certificate you might use:
Data Studio comes with its own self-signed certificate, which is what's used when you select "use SSL" from the server settings. You'll usually change the port to 443 as well. A self-signed certificate is not signed by any certificate authority (CA); certificates like this provide encryption but not authentication (or rather, they place the burden of verifying trust explicitly on the user). You'll see a "your connection is not private" warning in your browser telling you that the connection is not safe. If you generate your own self-signed certificate it will behave in the same way.
You could also request and apply a certificate signed by a trusted certificate authority (CA) to both enable encryption of client-server communications and identify your server to clients. If you want your Data Studio instance to be available to the general public via the internet, you'd want to use a certificate from a public CA to initiate secure and “trusted” communication.
You may actually want something in between: a certificate issued by a private (internal) CA. A private CA is typically enterprise-specific and used for internal servers on a private network. These certificates are not publicly trusted. In this scenario, in order to achieve trust (authentication) an organization would generate a root certificate and then configure all clients (e.g. Chrome) to trust certificates signed by that root by installing it as a trusted root certificate authority.
The 3rd option, using a private CA, is the most common way Experian's clients secure Data Studio, and also the Find duplicates server. In this case, the client's IT team would typically provide the certificate, either as separate public certificate and private key files, or in a combined format like PKCS12.
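To make the trust point concrete, here's a small Python sketch (not from the original post; the hostname and file name are placeholders) showing why a self-signed or private-CA certificate fails verification until its signing root is explicitly trusted:

```python
import socket
import ssl

HOST, PORT = "local.datastudio", 443   # placeholder hostname

ctx = ssl.create_default_context()
# For a certificate issued by an internal/private CA, add that CA's root certificate:
# ctx.load_verify_locations("internal-root-ca.pem")

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            issuer = dict(rdn[0] for rdn in tls.getpeercert()["issuer"])
            print("Certificate trusted, issued by:", issuer.get("organizationName"))
except ssl.SSLCertVerificationError as err:
    # This is what you'll hit with the bundled self-signed certificate
    print("Certificate not trusted:", err)
```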
Re: Inferring Gender from Title or Honorific and Gender Reversal
From a real-world usage perspective, whilst this is a useful calculation, its use also needs to be carefully considered. Here are a few consulting recommendations:
- Use honorifics only to derive a possible gender, not to make assumptions.
- Any column should be named as "possible/derived" with metadata/data dictionary entries indicating the derivation method.
- Suitable use cases include:
- In a data quality rule to compare an honorific with a gender flag field to look for "possible" but not "certain" mismatches
- Analytics with large enough sample sizes to minimise errors (assuming no systematic errors)
- Light touch personalisation with marketing, but still without assuming gender (caution still required)
- Unsuitable use cases include:
- Creating a new gender flag (or changing an existing one) from a derived calculation
- Updating source systems with new gender flags
- Indicating gender in marketing or service communications
A good example of a suitable use case might be using this calculation to "blank out" potentially inaccurate gender flags before loading a dataset to a marketing platform ahead of a personalised campaign.