Best Of
Re: Setting up custom Find Duplicates rules
Hi @Carolyn
Find Duplicates works by standardising data and mapping the fields you supply to standardised fields - so you are correct in surmising that the fields get individually mapped. We have a tool called the Find Duplicates Workbench that can help you see the workings of this process. Your local support team will be able to provide you with this tool and walk you through a few examples.
The good news is that all your requests are possible in Data Studio.
If you want to match on two fields combined, you could use the "Generic Field" option. Use a concatenate function to combine the First and Middle Name into a single field in your dataset, then use the Generic Field mapping in Find Duplicates to match on that field.
Once you have mapped the field, you'll need to modify your underlying rule set to include the new name rule. How you do that will depend on exactly what business rule you want to implement. Do you want to replace the existing individual field name match or add to it?
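As a purely illustrative sketch of what such a rule could look like, assuming the combined field is mapped to a Generic String element with the group name "FullName" (the theme, group, element and result names here are assumptions - the Workbench rules editor will show you the real ones for your configuration):
NAME.Exact={ #FullName.GenericString[ExactMatch] }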
For your second question, I would again use the Find Duplicates Workbench tool to evaluate exactly why you are not getting good matches and to tweak the underlying individual rule set. Using this tool, it is quite straightforward to include an AND/OR rule for date of birth. It is possible to set up the rules so that an empty date of birth is treated as an exact match, which would then revert the rules to using the name and address fields.
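To sketch that idea in rule syntax (again purely illustrative - in particular the "OneMissing" result name is an assumption; the Workbench will show you the real result values for the DateCompare comparator):
DOB.Exact={ #DOB.Date.DateCompare[ExactMatch, OneMissing] }
When one record's date of birth is empty, a rule like this evaluates the pair as Exact on the DOB theme, so the overall decision effectively falls back to the name and address rules.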
It is a bit complex to include detailed steps in this forum. It would be best if one of your local consultants can furnish you with a copy of the Find Duplicates Workbench and show you how to evaluate and change the rules. I will contact your local team and try to arrange this, copying in @Ian Buckle @Sean Edmunds as an FYI.
To give you an idea of what the Workbench can show you:
You can view the Standardised fields for any two records in the Find Duplicates store:
You can then view how the rules evaluate those two records and why they do/don't match:
In this example you can see that the rule does not evaluate as Exact because the Forename rules do not evaluate as Exact.
You can then use the rules editor to tweak the rules:
You can also find detailed information on how Find Duplicates uses blocking keys and rules in our online docs:
https://www.edq.com/documentation/aperture-data-studio/find-duplicates-step/advanced-config/
Re: What is the status of data catalog integrations, for example Collibra?
Hi @Sami Laine, in our conversations with the folks at Collibra we have determined that co-implementing Mulesoft together with Collibra and a platform like Data Studio introduces a lot of implementation friction.
Accordingly, one of the more recent implementations of Data Studio with Collibra connects Data Studio to the Collibra platform directly and bypasses Mulesoft altogether. @Ivan Ng can provide you with more details; you'll also find details of that integration on the Collibra marketplace, accompanied by some overview slides.
Re: De-Duplication DateCompare
The general structure of a rule is:
{Theme Name}.{Level}={ {Optional Group}.{Element or theme}.{Comparator}[{results}] }
(The optional group is referenced with a leading #, as in the examples below.)
So, assuming you've configured your date column to have the element group name "DOB", what you would be looking for is:
DOB.Exact={ #DOB.Date[ExactMatch] }
DOB.Close={ #DOB.Date.DateCompare[MonthYearMatch] }
The exact rule doesn't have a comparator because the ExactString comparator is the default if one isn't supplied.
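In other words, the Exact rule above is shorthand for the fully specified form:
DOB.Exact={ #DOB.Date.ExactString[ExactMatch] }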
Re: Updating multiple keys after de-dup
OK - the approach I think you want is one where you derive a winning record with the final harmonize step, and then join the data back to the original data, replacing the ID based on the cluster ID from the harmonized record. This requires a branch after the Find Duplicates step but before the harmonize step. Does that make sense?
This multiview illustrates the results in the top panel, the winning record in the middle, and the original data in the bottom.
Re: De_Dup failing
@stevenmckinnon look for Standardize or GDQStandardize, though it is possible that it may be on a different server from the one that Data Studio is installed on if you are using a remote match store configuration.
https://www.edq.com/documentation/aperture-data-studio/help/#gdq-standardize-server-failed-to-start
Re: APIs in Aperture Data Studio v2
Hi @Clinton Jones ,
We do have that for v2. It's not yet as extensive as v1's, but the API to trigger workflows is available:
You can access the Swagger UI at http://<server>/api/docs/index.html (default localhost installation: http://localhost:7701/api/docs/index.html)
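As a minimal sketch of calling the workflow API from Python (the endpoint path and payload below are hypothetical - use the Swagger UI above to look up the exact ones for your version):
import requests

# Hypothetical endpoint and payload - confirm both in the Swagger UI.
resp = requests.post(
    "http://localhost:7701/api/v1/workflows/execute",
    headers={"Authorization": "Bearer <your-api-key>"},
    json={"workflowName": "MyWorkflow"},
)
resp.raise_for_status()
print(resp.json())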

Re: Data Studio V2.0 - tagging
The easiest way of doing this currently would be to create a template workflow which contains the map to target and transform, then clone that for each new workflow you want to create.
Go to the workflow list, select Options, then Clone.
Re: Find Duplicates Blocking Keys and Rules in v2
@Clinton Jones These ship with the product. If you go into the installation folder (usually c:\program files\experian\aperture data studio x.x.x), you will find the defaults in the matchDefaults folder. The files are named {Country}_{Scenario}_Default.{JSON/EXPR}
Scenarios are one of:
- Location: Only considers physical address information
- Household: Looks for people with the same last name at the same address (uses address and surname information)
- Individual: Looks for individuals, using Name and Address
JSON files are the blocking keys; they follow the JSON specification.
EXPR files are the custom Find Duplicates match rules. If you want to make these .expr files easier to read, you can use the language add-on for Notepad++.
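For example (the exact country token may differ in your installation), the Individual scenario for the United Kingdom would ship as a pair of files along the lines of:
UnitedKingdom_Individual_Default.json
UnitedKingdom_Individual_Default.expr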
Approach for Matching Product data (product description)
Yesterday @MiteshKhatri @Akshay Davis @Katya Jermolina and I were having a discussion about approaches for matching product name information using Data Studio, so I've put together the below summary of the approach in case it helps anyone else.
Before I go through the steps, I want to call out that there are a variety of approaches you can take depending on the dataset you're looking at, including:
- using the rules-based fuzzy-matching engine in the Find Duplicates step
- building your own 'blocking' keys and then doing a self-join on records that 'block' together, from which scoring (e.g. edit-distance calculations) can be done on candidate pairs
- using a Transform step to build a 'match key' which can be used to group records together
Whichever approach you take, I'd encourage a focused round of discovery/profiling of the data before diving into any matching tasks.
Product Matching Example
In the above example we're focusing on a field which holds some consistent information, but the order and presentation of those values alter from one record to another. The same approach is also useful if your data appears in different fields in one source (e.g. product, size, quantity) but you want to match it to a dataset which stores that information within a single string.
For this article, I'm using an illustrative example featuring the below fizzy drink products:
Define match elements
The approach we discussed yesterday was to identify common features within the data which I would want to match on. In this case that means product name, pack size & volume per unit (but in your data it might be: colour, dimensions, material etc.).
Extract match elements
Once these 'matching' characteristics are defined, we then want to build transformation logic which extracts the relevant information from the input string into a standardised form:
In this example a combination of techniques is used for each element, based on what I learnt about the variance of the data from profiling. Some of these rules include: standardisation of special characters (e.g. 7-up and 7 up), lookup tables to extract values known to be products (you might do the same for 'colours'), and regular expressions to pull out values that conform to specific patterns (e.g. pack size and volume per unit).
Note: this may take a bit of iterating, so profiling the output of this is also a good idea to get to a point where you're comfortable with your definitions.
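To make those techniques concrete, here is a rough Python equivalent of the extraction logic (the product list and patterns are illustrative, not what Data Studio does internally):
import re

# Illustrative lookup of known product names - in practice built from profiling.
KNOWN_PRODUCTS = ["7up", "pepsi max", "pepsi", "cola"]

def extract_elements(description):
    """Pull product, pack size and volume per unit out of a raw description."""
    text = description.lower()
    text = re.sub(r"[-_/]", "", text)  # standardise special characters: "7-up" -> "7up"
    pack = re.search(r"(\d+)\s*(?:pack|pk)\b", text)  # pack size, e.g. "6 pack" / "6pk"
    vol = re.search(r"(\d+(?:\.\d+)?)\s*(ml|l)\b", text)  # volume, e.g. "330ml" / "1.5 l"
    compact = text.replace(" ", "")
    product = next((p.replace(" ", "") for p in KNOWN_PRODUCTS if p.replace(" ", "") in compact), "")
    return {
        "product": product,
        "pack": pack.group(1) if pack else "",
        "volume": (vol.group(1) + vol.group(2)) if vol else "",
    }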
Create a 'matchkey'
Next, simply concatenate the elements together to form a match key (in my case I also applied a 'remove noise' function to standardise the key a little further):
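Continuing the Python sketch, the concatenation and 'remove noise' steps can be approximated like this (stripping anything that isn't a letter, digit or separator is a rough stand-in for the real function):
def match_key(elements):
    # Concatenate the standardised elements, then strip remaining noise.
    raw = "|".join([elements["product"], elements["pack"], elements["volume"]])
    return re.sub(r"[^a-z0-9|]", "", raw)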
Review results
Next I created another field to count the number of matches in each cluster (i.e. records sharing the same matchkey), then sorted my table on the count and matchkey fields:
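In the same Python sketch, the counting and sorting step looks like this (assuming descriptions is your list of raw product strings):
from collections import Counter

keys = [match_key(extract_elements(d)) for d in descriptions]
counts = Counter(keys)
# Sort by cluster size (descending), then by key, to bring matching records together.
for key, desc in sorted(zip(keys, descriptions), key=lambda kv: (-counts[kv[0]], kv[0])):
    print(counts[key], key, desc)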
As you can see, the above approach has helped me cluster together records which share similar characteristics. I can then take this list through further steps to 'Harmonise' the records to a common value (or define a new one using a Transform step), extract unmatched information for further profiling, or output the list (and the workflow report) for review before committing any changes.
Disclaimer
Note: as with all matching, the best approach will vary depending on the data you're reviewing and the unique set of challenges presented by the way that data varies in format, structure or consistency.
Re: Creating a Blocking Key from more than one Generic String
@MiteshKhatri The documentation you're looking for is here: https://www.edq.com/documentation/aperture-data-studio/find-duplicates-step/advanced-config/#groups
What you're looking to do is:
[ { "description": "Width" "elementSpecifications": [ { "elementType": "GENERIC_STRING", "elementGroups": ["Width"], "includeFromNChars": 1 } } ]