Where could I find more information on how to use "Query duplicates store"?
Hi, I am new to Aperture Data Studio, and one of my recent requirements is to use a supplied contact list to find and extract the corresponding Cluster ID Key value from a single customer view table.
It seems the "Query duplicates store" step can achieve that. However, after building the workflow, I found that the number of records returned by "Query duplicates store" is only 1% of my input record list.
So I am wondering whether I am using this step incorrectly. I have not been able to find any information online that discusses the "Query duplicates store" step.
Could anyone point me to the right place?
Clinton Jones Experian Elite
Hi @MichaelJlam, you raise an interesting question, namely: what is the purpose and intent of the different parts of the Find Duplicates functionality?
You'll find v2 documentation located here : https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/create-a-single-customer-view-scv/deduplicate-data/
and v1 documentation here https://docs.experianaperture.io/data-quality/aperture-data-studio-v1/improve-data-quality/find-duplicates/3
Essentially there are three key elements to the Find Duplicates functionality:
- Evaluate potential relationships between records in a panel of data and assign a transient 'cluster ID' and match level confidence value
- Establish a transient store of searchable records
- Support processing of delta loads of records that may have relationships with the transient store
For the first piece, you're building a workflow that may have one or more sources feeding into the Find Duplicates step. When you execute that workflow, the additional columns (cluster ID and match level) are appended. Using the Harmonise step, a grouping step, or in fact a custom step, you can nominate one record to be the surviving record per cluster and establish your 'golden nominal' deduplicated record set.
You may have to run data through that workflow several times before you arrive at a fully deduplicated set of records. The Harmonise step is the important part here, as it eliminates subordinate matches.
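To make the idea of a surviving record concrete, here is a minimal sketch in Python. It is not how the Harmonise step is implemented; the column names and the "most complete record wins" rule are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def pick_survivors(records, cluster_key="Cluster ID"):
    """Return one record per cluster: the one with the most non-empty fields.

    The survivorship rule (most complete record wins) is an assumption for
    illustration only; pick whatever rule suits your data.
    """
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[cluster_key]].append(rec)

    def completeness(rec):
        # Count fields that actually carry a value.
        return sum(1 for value in rec.values() if value not in (None, ""))

    return {cid: max(members, key=completeness) for cid, members in clusters.items()}


# Hypothetical records as they might look after the Find Duplicates step.
records = [
    {"Cluster ID": 1, "Name": "Ann Lee", "Email": "ann@example.com", "Phone": ""},
    {"Cluster ID": 1, "Name": "Ann Lee", "Email": "ann@example.com", "Phone": "021 555 0100"},
    {"Cluster ID": 2, "Name": "Bob Ray", "Email": "", "Phone": ""},
]
survivors = pick_survivors(records)
```

Here `survivors` holds one record for each of the two clusters, and for cluster 1 it is the record that includes a phone number.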
Every time you execute that workflow, on the full dataset or a subset of it, the 'match store' (effectively the transient database of records) is rebuilt as a searchable store with potentially different cluster IDs, so it is important to understand that the cluster ID is probably not a good key to use for persistence. Each run's database becomes the latest version of that database, and unless you are retaining old data stores, what you had before is effectively discarded. This is a simplified picture of what is happening under the covers.
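A small sketch of why the cluster ID makes a poor persistent key: two runs can group the same records together while handing out different cluster IDs. The Unique ID values and cluster numbers below are made up for illustration.

```python
def partition(assignments):
    """Turn {unique_id: cluster_id} into a set of groups of unique IDs."""
    groups = {}
    for unique_id, cluster_id in assignments.items():
        groups.setdefault(cluster_id, set()).add(unique_id)
    return {frozenset(members) for members in groups.values()}


# Hypothetical output of two runs over the same records.
run_1 = {"A1": 101, "A2": 101, "B1": 102}
run_2 = {"A1": 7, "A2": 7, "B1": 3}

same_grouping = partition(run_1) == partition(run_2)  # records cluster the same way
ids_changed = run_1 != run_2                          # but the cluster IDs differ
```

Comparing runs by your own Unique ID, rather than by cluster ID, is what lets you tell whether the grouping itself has changed.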
So, assuming you have executed the workflow with the Find Duplicates step at least once, you can now leverage the REST APIs associated with the Data Studio Find Duplicates engine to perform three tasks under Advanced Usage:
- Extract the Find Duplicates store records
- Search the Find Duplicates store (to avoid creating a new record unnecessarily)
- Add or make changes to the store (process newly arrived small batches or individual records to determine additional new cluster IDs or delete records)
Returning to your intent, it sounds like you have a couple of purposes, or are trying a couple of different activities. You're using the Find Duplicates functionality to match business data it seems. This means your matching criteria need to be different from those you might use to match people and families.
Data Studio Find Duplicates is optimized to 'match' and cluster people records based on names and addresses, and it supports additional matching criteria such as email address, phone number, date of birth and other data attributes. For business data, however, you may need some specialized Blocking Keys and matching Rules.
What you may want to consider is using the Experian pH Business Search custom step for matching business data (in the UK), which specializes in matching businesses. You can find details of that custom step in the Marketplace as a Data Studio extension. Before using that step, I would recommend that you also run the business data through the Address Validation step, to ensure that you have the best possible addresses for those businesses.
Ian Buckle Experian Super Contributor
Query duplicates store is used to query the results of duplicate matching where the UniqueID and ClusterID are not known but the name and contact details are. This is useful for situations where new records are coming into the SCV from an external source that has no direct links with the SCV. You can query the SCV to identify records that already exist within it. Results from this query may link all records, none, or somewhere in between; it depends on the data you are comparing. It is possible that only 1% of your data can be found in your match store.
Some things to note on the configuration:
Make sure that the selected columns in the query match the number and order of those used in the Find Duplicates step, including the Unique ID column (it is OK to send in nothing for this).
For example, my Find Duplicates setup contains 29 columns.
I would then configure the query with all 29 columns, even if I don't have any data for them in my input file. You can create the columns you don't have using the Transform step, with the Constant function applying a null value, or you could use the Map to Target step to ensure you have the correct columns.
Map to Target
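Outside Data Studio, the same idea (same columns, same order, nulls where you have no data) can be sketched in a few lines of Python. The column list below is a shortened, hypothetical stand-in for the real Find Duplicates configuration, and `align_to_schema` is an illustrative helper, not part of any Data Studio API.

```python
def align_to_schema(row, schema):
    """Return a new row with exactly the schema's columns, in schema order.

    Columns missing from the input are filled with None (null).
    """
    return {column: row.get(column) for column in schema}


# Hypothetical column list; the real one must mirror your Find Duplicates step.
FIND_DUPLICATES_COLUMNS = ["Unique ID", "Name", "Email", "Phone", "Address Line 1"]

# An input contact that only carries two of the expected columns.
contact = {"Email": "ann@example.com", "Name": "Ann Lee"}
aligned = align_to_schema(contact, FIND_DUPLICATES_COLUMNS)
```

After alignment, `aligned` has all five columns in the configured order, with `None` for Unique ID, Phone and Address Line 1.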
Make sure you take a snapshot directly after the Query step. The Show Data button only runs 20 rows by default for performance reasons; using a snapshot and running the workflow directly ensures all rows are run.
Failing that, I would recommend finding easy records from your external system that you know exist in the SCV, by joining on attributes such as Name and Email or Name and Phone, then checking that these are returned from the Query step to verify that the configuration is correct.
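If you can export both lists, a rough sketch of that sanity check might look like the following. It performs an exact join on normalised Name and Email, which is far stricter than the fuzzy matching the Query step does, so it only serves to produce a handful of records you would expect the step to return. All field names and sample values are assumptions.

```python
def normalise(value):
    """Lower-case and collapse whitespace so trivial differences don't block a join."""
    return " ".join(str(value or "").lower().split())


def known_matches(contacts, scv_records, keys=("Name", "Email")):
    """Pair each contact with the SCV records that share the same normalised keys."""
    index = {}
    for rec in scv_records:
        key = tuple(normalise(rec.get(column)) for column in keys)
        if all(key):  # skip records with an empty join attribute
            index.setdefault(key, []).append(rec)

    matches = []
    for contact in contacts:
        key = tuple(normalise(contact.get(column)) for column in keys)
        if all(key) and key in index:
            matches.append((contact, index[key]))
    return matches


# Hypothetical sample data.
scv = [
    {"Unique ID": "S1", "Name": "Ann Lee", "Email": "ann@example.com"},
    {"Unique ID": "S2", "Name": "Bob Ray", "Email": ""},
]
contacts = [
    {"Name": "ann  LEE", "Email": "Ann@Example.com"},
    {"Name": "Bob Ray", "Email": ""},
]
matches = known_matches(contacts, scv)
```

Any contact that appears in `matches` but does not come back from the Query step points to a configuration problem rather than a genuine absence from the match store.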
Attached is the setup guide for the step, which also covers some basic usage.
Hi @Clinton Jones and @Ian Buckle,
Thank you so much for sharing your knowledge and resources provided.
Sorry @Clinton Jones, unfortunately I am from New Zealand, so the Experian pH Business Search step may not be of use, but thanks for the information.
@MichaelJlam there is probably an opportunity to consider a custom step, as an extension to Data Studio, that uses the New Zealand Companies Office API https://www.companiesoffice.govt.nz/about-us/using-our-data-through-apis/
By using an API like that in Data Studio, you could converge on known identifiers that work like persistent IDs.
Thanks @Clinton Jones, I will spend some time investigating it as a slow-burn side project :)
@Ivan Ng is going to follow up on this and see what we can determine from our side