From a real-world usage perspective, whilst this is a useful calculation, its use also needs to be carefully considered. Here are a few consulting recommendations:
A good example of a suitable use case might be using this calculation to "blank out" potentially inaccurate gender flags before loading a dataset to a marketing platform ahead of a personalised campaign.
Hi @MichaelJlam you raise an interesting question, namely: what is the purpose and intent of the different parts of the Find Duplicates functionality?
You'll find the v2 documentation here: https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/create-a-single-customer-view-scv/deduplicate-data/
and the v1 documentation here: https://docs.experianaperture.io/data-quality/aperture-data-studio-v1/improve-data-quality/find-duplicates/3
Essentially there are three key elements to the Find Duplicates functionality:
For the first piece you're building a workflow that may have one or more sources feeding into the Find Duplicates step. When you execute that workflow the additional columns are appended. Using the Harmonise step, a grouping step, or in fact a custom step, you can nominate one record per cluster to be the surviving record and establish your 'golden nominal' deduplicated record set.
You may have to run data through that workflow several times before you arrive at a fully deduplicated set of records. The Harmonise step is the important part here to eliminate subordinate matches.
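To make the survivorship idea concrete, here's a rough sketch of what "nominate one surviving record per cluster" means, done outside Data Studio. The column names and the "most complete record wins" rule are my assumptions for illustration only; the Harmonise step lets you define your own survivorship rules.

```python
# Illustrative only: picks one 'golden nominal' record per cluster.
# Field names and the completeness-based survivorship rule are assumed.
from collections import defaultdict

records = [
    {"cluster_id": 1, "name": "A. Smith", "email": ""},
    {"cluster_id": 1, "name": "Alice Smith", "email": "a.smith@example.com"},
    {"cluster_id": 2, "name": "Bob Jones", "email": "bob@example.com"},
]

def completeness(record):
    """Simple survivorship score: count of non-empty fields."""
    return sum(1 for v in record.values() if v not in ("", None))

clusters = defaultdict(list)
for rec in records:
    clusters[rec["cluster_id"]].append(rec)

# One surviving record per cluster; the rest are subordinate matches.
golden = [max(members, key=completeness) for members in clusters.values()]
```

In Data Studio you'd express the same rule through the Harmonise step's configuration rather than code, but the logic per cluster is the same.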
Every time you execute that workflow on the full dataset or a subset of it, the 'match store' (effectively a transient database of records) is rebuilt as a searchable store, potentially with different cluster IDs, so it is important to understand that the cluster ID is probably not a good key to use for persistence. Each run's database becomes the latest version of that database, and unless you are retaining old data stores, what you had before is effectively discarded. This is the simplistic understanding you need of what is happening under the covers.
So, assuming you have executed the workflow with the Find Duplicates step at least once, you can then leverage the REST APIs associated with the Data Studio Find Duplicates engine to perform three tasks under Advanced Usage.
Returning to your intent, it sounds like you have a couple of purposes, or are trying a couple of different activities. It seems you're using the Find Duplicates functionality to match business data, which means your matching criteria need to be different from those you might use to match people and families.
Data Studio Find Duplicates is optimised to 'match' and cluster people records based on names and addresses, and will support additional matching criteria like email address, phone number, date of birth and other data attributes, but for business data you may need some specialised blocking keys and matching rules.
What you may want to consider is the Experian pH Business Search custom step, which specialises in matching business data (in the UK). You can find details of that custom step in the Marketplace as a Data Studio extension. Before using that step, I would recommend running the business data through the Address Validation step too, to ensure that you have the best possible addresses for those businesses.
Query duplicate store is used to query the results of duplicate matching where the Unique ID and cluster ID are not known but the name and contact details are. This is useful where new records come into the SCV from an external source that has no direct links with the SCV: you can query the SCV to identify records that already exist within it. Results from this query may link all records, none, or somewhere in between; it will depend on the data you are comparing. It is possible that just 1% of your data can be found in your match store.
Some things to note on the configuration:
Make sure that the selected columns in the query match the number and order of those used in the Find Duplicates step, including the Unique ID column (it is OK to send in nothing for this).
For example, my Find Duplicates setup contains 29 columns.
I would then configure the query with all 29 columns, even if I don't have any data for them in my input file. You can create the columns you don't have using the Transform step with the Constant function to apply a null value, or you could use the Map to Target step to ensure you have the correct columns.
Make sure you take a snapshot directly after the query step: the Show Data button will only run 20 rows by default for performance reasons, so using a snapshot and running the workflow directly ensures all rows are run.
Failing that, I would recommend finding easy records from your external system that exist in the SCV by joining on attributes such as Name and Email, or Name and Phone, then making sure these are returned from the query step and verifying that the configuration is correct.
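That sanity check can be done outside Data Studio too. The sketch below joins external records to the SCV on normalised Name + Email; the field names and normalisation are my assumptions, purely to show the idea:

```python
# Quick verification join: records matched here on Name + Email are the
# 'easy' matches that should also come back from the query step.
scv = [
    {"name": "alice smith", "email": "a.smith@example.com"},
    {"name": "bob jones", "email": "bob@example.com"},
]
external = [
    {"name": "Alice Smith", "email": "A.Smith@example.com"},
    {"name": "Carol White", "email": "carol@example.com"},
]

def key(rec):
    # Normalise case and whitespace so trivial differences don't block the join.
    return (rec["name"].strip().lower(), rec["email"].strip().lower())

scv_keys = {key(r) for r in scv}
easy_matches = [r for r in external if key(r) in scv_keys]
```

If a record shows up in `easy_matches` but not in the query step's results, that points at a configuration problem (column order, selection, etc.) rather than a matching one.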
Attached is the setup guide for the step, which also covers some basic usage.
Thanks @Nigel Light . I am going to work with the team to get this working in V1 too. I will keep you posted on the progress!
@Carolyn congratulations on going live!
Great question @Nigel Light. We test, support and use Chrome, Edge and Chromium, so you should be good.
@Keith Alexander I think what you're after is the split function.
This takes an input string, a character to split on (comma in your case) and then the item to return.
The example above shows that for Address Line 1. You would create a workflow with a Transform step that creates five columns (Address Line 1 to 5) and, for each, it's just the Split function with the original column as the input and the relevant line number to return.
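A rough Python equivalent of that Split function, in case it helps to see the behaviour (the 1-based item index and trimming of surrounding spaces are my reading of how it's used here, not a guaranteed match for Data Studio's internals):

```python
# Split an input string on a delimiter and return the nth item (1-based),
# or an empty string when the item doesn't exist.
def split_item(value, delimiter, index):
    parts = value.split(delimiter)
    return parts[index - 1].strip() if index <= len(parts) else ""

address = "10 Downing Street, Westminster, London, SW1A 2AA"
line1 = split_item(address, ",", 1)  # "10 Downing Street"
line2 = split_item(address, ",", 2)  # "Westminster"
line5 = split_item(address, ",", 5)  # "" - fewer than 5 items
```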
Hi @stevenmckinnon - have you opened a support ticket for this to be investigated?
Just to close this one off, we discovered that there was a Data Studio issue preventing a scheduled workflow from exporting back to a table in a SQL Server system that used NTLM authentication.
We were able to work around the problem by using a setting in Data Studio to enable some as-yet-unreleased functionality. This functionality will be enabled by default from v2.1.
I'm not going to pretend to understand the maths, but the prefix value gives higher ratings to strings that share a common prefix, up to a prefix length of 4.
The prefix weighting is applied over the prefix length you supply and weights the results in favour of strings with common prefixes.
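For anyone curious about the maths, this looks like the standard Jaro-Winkler prefix adjustment: the base Jaro score is boosted by `prefix_len * weight * (1 - jaro)`, with the shared prefix capped at 4 characters and a typical weight of 0.1. Whether Data Studio implements exactly this internally is an assumption on my part; the sketch below just shows the textbook formula:

```python
# Standard Jaro-Winkler prefix boost (assumed, not confirmed, to be what
# Data Studio applies). Takes a precomputed base Jaro similarity.
def winkler_boost(jaro_sim, s1, s2, prefix_weight=0.1, max_prefix=4):
    """Boost a Jaro score for strings sharing a common prefix (capped at 4)."""
    prefix_len = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix_len += 1
    return jaro_sim + prefix_len * prefix_weight * (1.0 - jaro_sim)

# "MARTHA" vs "MARHTA" share the 3-character prefix "MAR", so a base
# Jaro score of 0.944 becomes 0.944 + 3 * 0.1 * (1 - 0.944) = 0.9608.
boosted = winkler_boost(0.944, "MARTHA", "MARHTA")
```

Note the boost only ever raises the score, and it raises it more when the base score is already low, which is why near-misses with matching prefixes rank noticeably higher.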