-
Checking data like Social Security numbers (SSNs) for correctness
Social Security numbers, national identity numbers and the like are, at face value, simply strings of data made up of letters and numbers and segmented with hyphens. Many of these identifiers, however, carry inherent intelligence and even check digits. You'll find regular expressions to validate these kinds of…
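As a flavour of what such a validation looks like outside Data Studio, here is a minimal Python sketch for the US SSN shape (AAA-GG-SSSS) plus a few well-known exclusions. It is illustrative only; real national-ID rules vary by country and often include check digits this pattern does not cover.

```python
import re

# Shape check for a US SSN (AAA-GG-SSSS) with a few known exclusions:
# area 000, 666 and 9xx are never issued, group 00 and serial 0000 are invalid.
SSN_RE = re.compile(r"^(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")

def is_valid_ssn(value: str) -> bool:
    """Return True if the value looks like an issuable SSN."""
    return SSN_RE.fullmatch(value) is not None

print(is_valid_ssn("219-09-9999"))  # True: correct shape, no exclusions hit
print(is_valid_ssn("666-12-3456"))  # False: area 666 is never issued
```

A format check like this catches malformed values early, but it cannot confirm that a number was actually issued.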
-
Sharing database connections across teams
What's the recommended approach to sharing tables/views and connections to databases across a wider team? Should we create new connections per user, share tables with specific users or create credentials for other users on the same connection? I'm interested in the security aspect but also the performance impact of each of…
-
Getting started with Regular Expressions in Data Studio
A regular expression (from here on we’ll call them “regex” for short) is a well-known and powerful syntax for describing a search pattern in a simple string of text. You can think of regex as a way of defining advanced wildcards: In Windows file explorer you might search for *.txt to return all txt files, or in SQL you…
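The wildcard analogy can be made concrete with a short Python sketch (outside Data Studio): the file-explorer pattern *.txt becomes a regex in which the dot is escaped and "any characters" is written explicitly.

```python
import re

# The *.txt wildcard expressed as a regex: "." matches any character,
# "*" means "zero or more of the preceding", so the literal dot in ".txt"
# must be escaped.
pattern = re.compile(r".*\.txt")

files = ["notes.txt", "report.pdf", "readme.txt", "txt"]
matches = [f for f in files if pattern.fullmatch(f)]
print(matches)  # ['notes.txt', 'readme.txt']
```

Note that "txt" on its own does not match, because the regex requires a literal dot before the extension, just as the file-explorer wildcard does.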
-
Tips to improve workflow performance
If you ever find that a workflow is taking a long time to execute, you might want to think of ways to improve it. Below are some practical tips and examples that will help you to achieve the best performance out of your workflows. Workflow Structure First of all, it is advisable to break down large workflows into smaller…
-
Using a discriminant for Find Duplicates clustering
Find Duplicates uses fuzzy matching to link records; however, in some cases you may have a discriminant field you wish to use to break clusters. In this post we cover how to make use of these within a workflow. The simple scenario In this scenario, we are processing transaction records and matching on name, mailing…
-
Data Studio v1.6.0 is now available!
The latest version of Data Studio introduces new features, enhancements to existing functionality, performance improvements and a number of bug fixes. Try it out. Some of the new features include: A number of enhancements to the Validate addresses step: Ability to select predefined custom layouts for output columns, giving…
-
Creating a new data connection to Salesforce from Data Studio via JDBC
It is likely that quite soon after you start to use Data Studio you will want to connect Data Studio to databases and systems of record. This is quickly and relatively easily done using the JDBC connector component delivered as a part of the application. Details of how to add a custom JDBC driver are explained in the…
-
Changing source file schemas and data definitions
Legacy system files often come without a schema or data definition embedded in the file, and if there is one, it might not be one that you are particularly happy with. One of the capabilities of Data Studio is support for using a Data Definition Language (DDL) file or COB file to accompany your file upload. DDL and COB…
-
JDBC Performance issues can simply be about security
In Data Studio you have a number of different ways of accessing data: file-based methods; data connection-based methods (JDBC or a special connector); the API or programmatic access; and leveraging snapshot data. JDBC is a high-performance way of getting data into Data Studio. Getting JDBC up and running is straightforward and…
-
Refresh your sources
Occasionally you may find that the staged sources fall out of sync with how Data Studio renders them in the UI. If you suspect this is the case, you can explicitly refresh the sources from the UI by right-clicking the ☰ under the My Files area and selecting Refresh.
-
How to resolve an unmapped source in a workflow
In a previous tip I suggested you might want to export and import your workflow between systems. I also mentioned that one of the prerequisites for this to work seamlessly was for you to have the same or similar configuration and sources defined in both systems. Source differences will happen, though. So how do…
-
Data Wrangling - is it so bad?
An interesting perspective from Pete Aven of MarkLogic popped up in my feed this week: a piece on Medium.com entitled "Data Wrangling is Bad". Aven describes how we're potentially all data wranglers, and argues that this is not a good thing and should not be embraced or accepted. In reality, though, do we have a choice?
-
Why might my Find Duplicates results look different?
If I use the Find Duplicates step on its own, in some instances I get more clusters of records (clusterIDs) than if I use Data Studio with the Address matching step. I attach an example that illustrates this using the test data delivered with the application. Why would that be?
-
Do businesses run on premium data? New study assesses variables in data quality tools
Lisa Ehrlinger of Johannes Kepler Universität Linz, Austria, and her team have identified 667 data quality tools on the market, and they have narrowed that number down to 13 for detailed testing and analysis based on their domain independence, non-specificity, and free or trial availability. While the…
-
Jobs with large lookup files
At St. James's Place, prior to highlighting whether client details need to be quality checked, we need to establish whether the client has a current fund holding (as this is where we realise maximum business benefit in correcting the data). To do this we are faced with loading very large tables, i.e. >100 million rows, which…
-
Dealing with PII
I was talking to a prospect today about PII: identifying it in data loaded into the system and then processing it, e.g. reporting its presence. Comments appreciated. Steve
-
When should I use a Match Lookup rather than a join?
I have a large amount of data (hundreds of millions of transactional records) that I need to match up against a list of master data records by name (less than 100,000) - which would be better, a match lookup or a join? Some of these records might constitute exact matches. Some of these records might have case differences…
-
What's better: invalid data or missing data?
A discussion that seems to come up from time to time is whether it is better to have gaps in your data or to have it complete but containing poor-quality/invalid data. Keen to get everyone's thoughts on this topic. What's more important to you: accuracy or completeness?
-
Extract a number (6-char SIC code) from a string and perform a lookup (2-char SIC to get category)
Using some of the Companies House open data, I recently built a simple workflow that extracts the SIC code from a string and then uses a lookup table to retrieve the SIC category (using a 2-char SIC classification). I thought I'd share it here in case it's of use to anyone (see video attached). Also I'm keen to…
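The same extract-then-lookup idea can be sketched in a few lines of Python outside Data Studio. The two-digit category table below is purely hypothetical (real SIC classifications differ); the point is the pattern of pulling the code out of free text and keying a lookup on its leading characters.

```python
import re

# Hypothetical 2-digit SIC category table, for illustration only.
SIC_CATEGORY = {
    "01": "Agriculture",
    "62": "Computer programming",
    "86": "Human health",
}

def sic_category(text: str):
    """Extract the first run of digits from a free-text field and look up
    a category by its leading two digits."""
    m = re.search(r"\d+", text)
    if not m:
        return None
    return SIC_CATEGORY.get(m.group()[:2], "Unknown")

print(sic_category("62012 - Business and domestic software development"))
# 'Computer programming'
```

In a workflow the dictionary would be the lookup table and the regex the extraction transform; the shape of the logic is the same.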
-
Floating Point numbers with Microsoft Excel Open XML Spreadsheet (.XLSX) files
When using .XLSX files in Aperture, be aware that floating point numbers will be treated differently with this file type compared to other file types (.xls and .csv). XLSX stores values internally in scientific notation. If you open the attached file in Excel you’ll see the expected values. However, if you save the XLSX file as an XML file and…
-
The importance of profiling when using Find duplicates
When implementing Find duplicates in Aperture Data Studio we've seen many examples of the importance of profiling prior to configuring and running Find duplicates. This has the potential to benefit both the performance of the Find duplicates step and the quality of potential duplicates found. Find duplicates works…
-
Great article on building loyalty with a Single Customer View!
Just sharing a great article written by @aysha_aktemur on how Aperture Data Studio empowers loyalty data practitioners to cleanse and consolidate their data, taking advantage of the best information on their best customers and helping to build more personalised marketing campaigns for their clients.
-
How to request a license key
-
Removing an unwanted intermittent string from the start of an alphanumeric field
At St. James’s Place we needed to remove a string that occurred intermittently from the start of an alphanumeric dataset. The “Trim” function didn’t quite meet our needs, so we achieved this by creating a new column in a “Transform” step as follows: “Replace” function, replacing the string that you are…
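The equivalent logic can be sketched in Python: remove a known marker (the prefix "REF:" here is a hypothetical example, not the actual string from the post) only when it appears at the start of the value, leaving any later occurrences untouched.

```python
import re

def strip_prefix(value: str, prefix: str = "REF:") -> str:
    """Remove a known prefix only when it starts the value, then trim
    any leftover whitespace. "REF:" is a placeholder marker."""
    return re.sub(r"^" + re.escape(prefix), "", value).strip()

print(strip_prefix("REF:AB123"))     # 'AB123'
print(strip_prefix("AB123 REF:XY"))  # 'AB123 REF:XY' (untouched mid-string)
```

Anchoring the pattern with ^ is what distinguishes this from a plain find-and-replace, which would also remove the string mid-field.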
-
What happened to the Python step in Data Studio?
In Data Studio v1.1 there was a Python Step - what has happened to that and what are the options for using Python scripts with Data Studio now?
-
I cannot access any files I previously loaded through Data Explorer
Sometimes you will find that the view of files in the data explorer is out of sync with what is stored in the import folder. This can happen for a variety of reasons. You may have moved the location of the import folder or may have manually added or removed files from the import folder using a file system method rather…
-
Working with Dates
I have a set of data that contains some dates. What I would like to do is split the day, month and year into their own columns. This is easily done with the TRANSFORM step using the EXTRACT TIMESTAMP ELEMENT date functions; there are many options available here. If you want to determine the elapsed time between today…
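Outside Data Studio, the same split and elapsed-time calculation can be done with Python's datetime module (the sample date is arbitrary):

```python
from datetime import date

# Split a date into day, month and year components.
d = date.fromisoformat("2019-03-28")
day, month, year = d.day, d.month, d.year
print(day, month, year)  # 28 3 2019

# Elapsed time between today and the given date, as a number of days.
elapsed = date.today() - d
print(elapsed.days)
```

Subtracting two dates yields a timedelta, so elapsed days, weeks or any other unit fall out directly.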
-
How do you work out where the row comes from ?
When you're combining data from multiple data sources that potentially have duplicated data, you may want to work out which row should win. This is particularly relevant when you use harmonization. You can base decisions on the source, or on which records to prioritize in grouping. You could consider this as a…
-
Solving for literal null values
Depending on the source that you are working with, you may find that you actually have literal nulls in your data; this is a common result of SQL Server sqlcmd exports to CSV. You'll be able to determine easily whether you have literal nulls by running a profile on your data; you may not easily detect this in the preview…
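To illustrate what "literal nulls" means, here is a small Python sketch over a made-up sqlcmd-style export, converting the literal string "NULL" back into real missing values:

```python
import csv
import io

# Hypothetical CSV export where missing values came through as the
# literal string "NULL" rather than empty fields.
raw = "id,name,city\n1,Alice,NULL\n2,NULL,Leeds\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Map literal "NULL" strings back to real missing values.
    rows.append({k: (None if v == "NULL" else v) for k, v in row.items()})

print(rows)
# [{'id': '1', 'name': 'Alice', 'city': None},
#  {'id': '2', 'name': None, 'city': 'Leeds'}]
```

A profile would surface "NULL" as a frequent distinct value, which is the giveaway described above; the fix is a conditional replace like the one in the loop.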
-
Dealing with character sets
Sometimes the file that you want to use with Data Studio will contain characters that come into Data Studio incorrectly. Consider this file, countries.csv. Note in the Preview and configure screen the letter Å (å in lower case): an A with an overring, common in Swedish, Danish, Norwegian, Finnish, North Frisian, Walloon,…
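A quick Python sketch shows the kind of corruption that happens when the wrong character set is assumed on import: a UTF-8 encoded Å decoded as Windows-1252 turns into two unrelated characters.

```python
# "Å" encoded as UTF-8 occupies two bytes; decoding those bytes with the
# wrong character set (Windows-1252 here) produces mojibake.
text = "Åland"
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)  # 'Ã…land'

# Reversing the mistaken decode recovers the original text.
print(garbled.encode("cp1252").decode("utf-8"))  # 'Åland'
```

If you see pairs like "Ã…" where a single accented letter belongs, the file is almost certainly UTF-8 being read with a legacy single-byte encoding, and picking the right character set on import fixes it.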
-
Fuzzy Matching logic
How does the fuzzy matching in the Find Duplicates step work?
-
Find Duplicates Step configuration
What is the difference between the Name, Household and Address/Location choices in the default Find Duplicates step?