-
Data Wrangling - is it so bad?
An interesting perspective from Pete Aven of MarkLogic popped up in my feed this week. Published on Medium.com and entitled "Data Wrangling is Bad", the article describes how potentially we're all Data Wranglers, and argues that this is not a good thing and should not be embraced or accepted. In reality though, do we have a choice?
-
Why might my Find Duplicates results look different?
If I use the Find Duplicates step on its own, in some instances I get more clusters of records (clusterIDs) than if I use Data Studio with the Address matching step. I attach an example that illustrates this using the test data delivered with the application. Why would that be?
-
Do businesses run on premium data? New study assesses variables in data quality tools
Lisa Ehrlinger from Johannes Kepler Universität Linz in Linz, Austria, and her team have identified 667 data quality tools on the market, and they have narrowed that number down to 13 for detailed testing and analysis based on their domain independence, non-specificity, and availability free or on a trial basis. While the…
-
Jobs with large lookup files
At St. James's Place, prior to highlighting whether client details need to be quality checked, we need to establish whether the client has a current fund holding (as this is where we realise maximum business benefit in correcting the data). To do this we are faced with loading very large table(s), i.e. >100 million rows, which…
-
Dealing with PII
Talking to a prospect today about PII: identifying it in data loaded into the system and then processing it, e.g. reporting its presence. Comments appreciated. Steve
-
When should I use a Match Lookup rather than a join?
I have a large amount of data (hundreds of millions of transactional records) that I need to match up against a list of master data records by name (less than 100,000) - which would be better, a match lookup or a join? Some of these records might constitute exact matches. Some of these records might have case differences…
-
What's better: invalid data or missing data?
A discussion that seems to come up from time to time is whether it is better to have gaps in your data or to have it complete but containing poor quality/invalid data. Keen to get everyone's thoughts on this topic. What's more important to you: accuracy or completeness?
-
Extract a number (6-char SIC code) from a string and perform a lookup (2-char SIC to get category)
Using some of the Companies House open data, I recently built a simple workflow that extracts the SIC code from a string and then uses a lookup table to retrieve the SIC Category (using a 2-char SIC classification). I thought I'd share it on here in case it's of use to anyone (see video attached). Also I'm keen to…
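For readers who want to see the idea outside of a workflow, here is a minimal Python sketch of the same two steps: pull the numeric code out of a string, then look up a category by its first two characters. The function names, the sample string format and the lookup table contents are all assumptions made for illustration; they are not taken from the workflow in the post.

```python
import re

# Hypothetical 2-character lookup table; the labels are illustrative only.
SIC_CATEGORY = {
    "62": "Computer programming and consultancy",
    "47": "Retail trade",
}

def extract_sic(text):
    """Return the first run of 4 to 6 digits found in text, or None."""
    match = re.search(r"\d{4,6}", text)
    return match.group(0) if match else None

def sic_category(text):
    """Look up a category using the first two characters of the extracted code."""
    code = extract_sic(text)
    if code is None:
        return None
    # Unknown 2-character prefixes also return None.
    return SIC_CATEGORY.get(code[:2])
```

Calling `sic_category` on a value with no digits returns `None`, which makes unmatched rows easy to filter out afterwards.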
-
Floating Point numbers with Microsoft Excel Open XML Spreadsheet (.XLSX) files
When using .XLSX files in Aperture, be aware that floating point numbers will be treated differently with this file type compared to other Excel files (.xls & .csv). XLSX stores values internally in scientific notation. If you open the attached file in Excel you’ll see: However, if you save the XLSX file as an XML file and…
-
The importance of profiling when using Find duplicates
When implementing Find duplicates in Aperture Data Studio we've seen many examples of the importance of profiling prior to configuring and running Find duplicates. This has the potential to benefit both the performance of the Find duplicates step and the quality of potential duplicates found. Find duplicates works…
-
Great article on building loyalty with a Single Customer View!
Just sharing a great article written by @aysha_aktemur on how Aperture Data Studio empowers loyalty data practitioners to cleanse and consolidate their data to take advantage of the best information on their best customers and can help to build more personalised marketing campaigns for their clients.
-
How to request a license key
-
Removing an unwanted intermittent string from the start of an alphanumeric field
At St. James’s Place we needed to remove a string that occurred intermittently from the start of an alphanumeric dataset. The “Trim” function didn’t quite meet our needs, so we managed to achieve this by creating a new column in a “Transform” step as follows: “Replace” function, replacing the string that you are…
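As a point of comparison only, the same goal (drop a known string when, and only when, it appears at the start of a value) can be sketched in plain Python. This is not the Transform step configuration described in the post; the function name and sample values are hypothetical.

```python
def strip_prefix(value, prefix):
    """Remove prefix from the start of value, leaving other values untouched."""
    if value.startswith(prefix):
        return value[len(prefix):]
    # The prefix is absent, or appears somewhere other than the start.
    return value
```

Checking the start of the value explicitly avoids removing the same string when it legitimately occurs in the middle of a value.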
-
What happened to the Python step in Data Studio?
In Data Studio v1.1 there was a Python Step - what has happened to that and what are the options for using Python scripts with Data Studio now?
-
I cannot access any files I previously loaded through Data Explorer
Sometimes you will find that the view of files in the Data Explorer is out of sync with what is stored in the import folder. This can happen for a variety of reasons. You may have moved the location of the import folder or may have manually added or removed files from the import folder using a file system method rather…
-
Working with Dates
I have a set of data that contains some dates. What I would like to do is split the day, month and year into their own columns. This is easily done with the TRANSFORM step and the EXTRACT TIMESTAMP ELEMENT date functions; there are many options available here. If you want to determine the elapsed time between today…
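The two operations mentioned here, splitting a date into its parts and measuring elapsed time up to today, can be sketched in Python with the standard library. This is an illustration of the logic only, not of the Data Studio functions; the function names are assumptions.

```python
from datetime import date

def split_date(d):
    """Return the day, month and year of a date as separate values."""
    return d.day, d.month, d.year

def days_elapsed(d, today=None):
    """Return the number of whole days between d and today."""
    # today can be passed in explicitly so results are repeatable.
    today = today or date.today()
    return (today - d).days
```

Passing `today` explicitly keeps the calculation repeatable, which is useful when the same job is rerun on a later date.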
-
How do you work out where the row comes from?
When you're combining data from multiple data sources that potentially have duplicated data, you may want to work out which row you want to win. This is particularly relevant when you use harmonization. You can base decisions on the source, or which records to prioritize against in grouping. You could consider this as a…
-
Solving for literal null values
Depending on the source that you are working with, you may find that you actually have literal nulls in your data. This is a common result of SQL Server sqlcmd exports to CSV. You'll be able to easily determine whether you have literal nulls by running a profile on your data; you may not easily detect this in the preview…
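To make the problem concrete, here is a small Python sketch that reads CSV text and turns cells containing the literal text NULL into real missing values. The set of tokens treated as null and the sample data are assumptions for illustration; adjust them to whatever your profile reveals.

```python
import csv
import io

# Assumed spellings of the literal null token; extend as needed.
NULL_TOKENS = {"NULL", "null"}

def clean_literal_nulls(rows):
    """Replace cells that hold a literal null token with None."""
    return [
        [None if cell.strip() in NULL_TOKENS else cell for cell in row]
        for row in rows
    ]

# Made-up sample standing in for an exported CSV file.
sample = "id,name\n1,NULL\n2,Alice\n"
rows = list(csv.reader(io.StringIO(sample)))
cleaned = clean_literal_nulls(rows)
```

Only whole-cell matches are replaced, so a genuine value that merely contains the letters "NULL" is left alone.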
-
Dealing with character sets
Sometimes, the file that you want to use with Data Studio will contain characters that would come into Data Studio incorrectly. Consider this file, countries.csv. Note that in the Preview and configure the letter Å (å in lower case), an overring A, is common in Swedish, Danish, Norwegian, Finnish, North Frisian, Walloon,…
-
Fuzzy Matching logic
How does the fuzzy matching in the Find Duplicates step work?
-
Find Duplicates Step configuration
What is the difference between the Name, Household and Address/Location choices in the default Find Duplicates step?