-
Error attempting to delete match store
Hi, in my project we are using Experian Aperture Data Studio (2.10.10) to connect to a remote Find Duplicate match store (3.8.15) over HTTP. The Find Duplicate step is configured with 'Clear and re-establish store' checked as part of our requirements. The majority of the time, the process runs end to end successfully,…
-
Upgrade Issue, Aperture Data Studio, Find Duplicates latest versions
Hi Ian, it looks like after the recent upgrade the experianmatch path was changed to the C drive automatically, meaning it did not follow our existing configuration, which placed it on the D drive. Even after changing the path manually in the Find Duplicates ini file, it still does not update…
-
What does this column duplicate:Action mean?
The results of my duplicate step contain some empty gaps, and when I checked, all the gaps had the Action "Affected" in this column. Can someone explain what this actually means? Am I missing any data, or is this just an extra field so that I can remove all the Affected rows? There are 2 things in this column…
-
How long can/should the "Analyze blocking keys" function take?
Morning all, I'm getting up and running with the find duplicates feature and have created my own blocking keys and matching rules now which are producing good results for the clustering, however when I try to analyse it in "Find Duplicates Workbench" the "Analyze blocking keys" tool is taking an extremely long time to…
-
Error while connecting to Embedded Duplicate Store
Hi, I am getting the error below when trying to connect to the Embedded Duplicate store on my Data Studio server (2.10.10). Please let me know if you have faced a similar issue and help me with the steps to resolve it. PS: We do have a remote duplicate store (3.8.15) which is working fine and we have secured the Data…
-
Aperture Data Studio and Find Duplicates Upgrade
@Ian Hayden, we have an issue with our upgrade on DEV: Find Duplicates is not working. We need your response; I have sent you an email as well. First we ran into an LDAP issue, which I resolved according to the steps you mentioned. Then, with FD, we are unable to test.
-
Aperture Data Studio - Find Duplicate Servers
Hello Community, our organization has recently decided to use ADS as its Data Quality tool and we are trying to install ADS in a High Availability setup in an AWS environment. We are trying to install Data Studio on an EC2 instance as per the documentation and several Find Duplicate instances in 3 Availability Zones. Help in any…
-
Find Duplicate is showing empty records along with input records
We are using a remote duplicate store. Whenever we run the workflow, the Find Duplicate step's result produces the cluster ID and Match status as expected for the input data set. However, there are many blank records in its output; all such records show their "Duplicates: Action" as "AFFECTED"; and for…
-
Find Duplicate new feature compatibility
We are using ADS version 2.9.6 on our prod server and we have WFs to replace the existing duplicate store data. We have now upgraded to the latest ADS version, 2.10.2, in our DEV environment and we see that Find Duplicate has a lot of new functionality with some new components. For our existing WF we still want to go with the same…
-
Find Duplicate Step's result is not showing up
In the Find Duplicate step, after configuring the input data set, the duplicate store as Temporary Store, and the step settings, the link for "Show Step result" is disabled and the exclamation mark at the top right corner of the step shows "Couldn't reach Find Duplicate server".
-
Java pid hprof crash files.
We are facing an issue on our dev server for Aperture Data Studio and Find Duplicates. In the log I found this at the same timestamp as the crash file in the installation dir. 2022-12-14 05:20:12,255 ERROR c.e.m.a.MatchInstance [pool-157-thread-1] An unexpected error occurred when running Find Duplicates step against…
-
Java Update - Vulnerability detected on our Experian Dev/Prod Server
This version has been detected as vulnerable by OCPU-2022-JUL: Oracle Java Critical Patch Update Advisory - July 2022. When we try to update it, I receive this notification. Do we need to switch to OpenJDK, or can we keep using the version below and proceed with the update? I am confused about how all of this affects the Jar files we have…
-
How to identify blocking keys for Find Duplicates
Blocking keys identify records that are similar, creating blocks or potential groups of matches. Let’s look at an example where you have a list of names and dates of birth that may contain duplicates. The rule of thumb is to be able to identify any possible matches. Which elements would you use to say that any…
-
How can I configure TLS for the Find Duplicates Workbench
What are the steps to configure an SSL certificate for the Find Dupes Workbench site? I already have the Workbench installed as part of Data Studio and running on http://localhost:26312/
-
Harmonize Duplicates - Select best record based on score with multiple criteria
A customer posted this question to me: "I want to pick a survivor using multiple score-based criteria in the Harmonize step. So when all records in a cluster have the same survivor code (a number I create) I need it then to pick the record with the newest date. For example two records in a cluster might have survivor…
-
Low Disk Space Notifications and others.
Hi team, I would like to ask: if I create a low disk space notification in the Aperture Data Studio system settings, will it notify me about the disk space of the Experian match store drive or the installation drive? My second question is: in case of an abrupt stop or restart of my Experian Data Studio and Find Duplicates…
-
Find Duplicates Training Data
Attached is data to be used with Find Duplicates Training.
-
How to check if an input value contains numbers
In Aperture Data Studio, to check if an input value contains numbers, you can use the Matches Expression function. The Matches Expression function uses a regular expression to specify a pattern to be matched against. If you are not familiar with regular expressions, it may take some exploration to get to the…
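As a rough illustration of the idea outside Data Studio, the same check can be sketched in Python. The pattern `[0-9]` and the helper name `contains_number` are assumptions made for this example; the excerpt does not show the exact expression used with Matches Expression.

```python
import re

# Hypothetical helper: True when the value holds at least one ASCII digit.
# The pattern is an illustrative assumption, not the expression from the post.
DIGIT_PATTERN = re.compile(r"[0-9]")


def contains_number(value: str) -> bool:
    """Return True if any character in value is a digit 0-9."""
    return DIGIT_PATTERN.search(value) is not None


if __name__ == "__main__":
    for sample in ["Flat 12B", "High Street", ""]:
        print(repr(sample), contains_number(sample))
```

`search` is used rather than `match` so that a digit anywhere in the value counts, not only one at the start.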
-
Find Duplicates Service Installer FAQ
Here are some FAQs on the Find Duplicates Service Installer available since Aperture Data Studio v2.4.8. What is the purpose of the new Find Duplicates Service Installer? The Find Duplicates Service Installer is created with the intention to simplify the installation of a separate instance of Find Duplicates on Windows.…
-
Find Duplicate - Based on different criteria
Hello, the requirement is to find duplicates based on 3 different criteria: Name and Phone match; Name and Email match; Name and Address match. How can I achieve this? Should I use 3 different Find Duplicates steps in parallel and then check whether the Match Status of any one of them is 0? Is it possible to check in one Find Duplicate step? I think we can…
-
Tips and Tricks to make Find Duplicates Blocking Keys and Ruleset more readable
What is your first impression of the blocking keys and ruleset definitions required for Find Duplicates? We have observed that it may take a bit of learning to understand the syntax and structure to correctly update the keys and rules. Here are a few tips and tricks to help ease your experience with reading and updating…
-
Installation of Find Duplicate Different Instance
Can you please help me with the instruction link for installing Find Duplicate Workbench as a separate instance (Aperture Data Studio v2)?
-
Find Duplicates with Phonetic Comparators
We now have the new phonetic comparators (Soundex, NYSIIS and Double Metaphone) for Find Duplicates that you can use to supplement the edit distance comparators (Levenshtein, JaroWinkler) for better match results. Why the need to supplement the edit distance comparators? Edit distance algorithms count the number of steps…
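To make "counting the number of steps" concrete, here is a minimal Python sketch of Levenshtein distance. It is a textbook implementation for illustration only, not the comparator code used by Find Duplicates, and the example names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from a
                    current[j - 1] + 1,                   # insert into a
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two names that sound alike can still be more than one edit apart.
    print(levenshtein("Stephen", "Steven"))
```

Names that are pronounced the same can be several edits apart, which is the kind of gap a phonetic comparator is meant to cover.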
-
We need your feedback on Find Duplicates
Hi all, We are evaluating the paths to a better Find Duplicates rule building experience. We understand that there are many different perspectives on this topic, but ultimately, we really want to build something with the customer's journey in mind. We think customers would appreciate the improvements specifically in the…
-
The chaining effect
What is the chaining effect? Find Duplicates in Aperture Data Studio works by identifying blocks or groups of duplicates, then comparing every possible record pair within a block based on rules to determine if they represent a single entity, represented by a cluster ID. Depending on how you have configured the rules, you…
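A small Python sketch may help picture chaining. It assumes that pairwise matches are merged transitively into clusters, and it uses a made-up rule (same phone or same email); the records, field names and rule are hypothetical and are not taken from Find Duplicates itself.

```python
from itertools import combinations


def is_match(left: dict, right: dict) -> bool:
    """Hypothetical rule: two records match on phone or on email."""
    return left["phone"] == right["phone"] or left["email"] == right["email"]


def cluster_ids(records: list) -> list:
    """Merge every matching pair into one cluster (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(records))]


records = [
    {"name": "A", "phone": "111", "email": "a@example.com"},
    {"name": "B", "phone": "111", "email": "b@example.com"},
    {"name": "C", "phone": "222", "email": "b@example.com"},
]

if __name__ == "__main__":
    # A matches B (phone) and B matches C (email), so all three share a
    # cluster even though A and C do not match each other directly.
    print(cluster_ids(records), is_match(records[0], records[2]))
```

Records A and C end up together only because B links them, which is the chain.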
-
Manual review of Find duplicates clusters
Using a similar approach to that covered in an earlier post we demonstrate a method for manually reviewing clusters to split one into two or more, or join two or more together. In this example we will use the standard sample Find duplicates test file with UK addresses. Problem overview Using the standard GBR Individual…
-
Reviewing Find Duplicates Results
When you preview the Find Duplicates results, you will see the Cluster ID and Match status. Records with the same Cluster ID are identified as duplicates. The Cluster ID is influenced by the blocking keys that determine the record comparisons to be made. However, ultimately the duplicate records and match status is…
-
Smart Harmonization FAQ
As of release 2.4.5, you are able to turn on a preview of a new feature that involves smart models utilizing machine learning for harmonization. The smart models can be applied as Column specific rules at the additional options for Harmonize duplicates. What is harmonization? Harmonization is used to merge, blend or reduce…
-
Tuning Blocking Keys
Review Find Duplicates step results Reviewing the Find Duplicates results may not be the best way to confirm the effectiveness of the blocking keys. However, it does help to reveal obvious issues that may trigger further investigation. Once a set of Blocking Keys and Rules has been established at the Find Duplicates…
-
Find duplicates functionality, duplicate matching and duplicate record resolution.
Hi community members We're looking for customers and external evaluators of Aperture Data Studio that would be prepared to participate in some field research on the topic of duplicate record matching, identification and resolution. This is an opportunity for you to influence some of the work that we wish to pick up in the…
-
Building Rules
In order to start testing the Find Duplicates step, the Find Duplicates settings will also need to have a ruleset defined in addition to the blocking keys. When building rules, we will have to think about the following: How the Ruleset relates to Blocking Keys: Blocking keys identify potential matches. Rules determine if…
-
Exact match and fuzzy match with Find Duplicates Step
Aperture Data Studio offers a Find Duplicates step that runs on a powerful standardization and matching engine. When you connect the Find Duplicates step to your Source dataset, you will see that some configuration is needed before you can see results. The results of the Find Duplicates step provide the Cluster ID and…
-
Fuzzy Match with Regular Expressions
Imagine you have an address field containing variations of city names, for example: You need to determine the country based on the city name. You have an official list of countries and their cities like this: With any exact match technique, some of the values will not be matched since they do not appear in the official…
-
Selecting Best Record with Harmonize Duplicates
There are times when you just want a quick way to deduplicate your data without necessarily knowing how many duplicates are found. For example, you have a list of course completion status for a course OL100. There are multiple statuses recorded on different dates for each user. However, you are only interested in the…
-
Exact Match with List functions
There are a number of List functions that may be useful when you are trying to de-duplicate a list of values within a single column. Let's take a look at this dataset. How can we de-duplicate the list of fruits for each day? Connecting the dataset to a Transform step with List Frequency and List De-duplicate Functions will…
-
Exact Match with the Group Step
The Group step can be used as an easy way to identify and de-duplicate data that matches exactly. Let's take a look at an example, where you have a list of vehicles that may be duplicated. When you connect this dataset to a group step in a workflow, the results provide a count of each vehicle along with the unique list of…
-
Why worry about duplicated data?
What is duplicated data? Duplicated data refers to data representing the same entity. An entity can be anything such as a person, a company or a product. In the most obvious form, duplicated data refers to an exact copy of a record. However, this is not often as straightforward due to possible variations in the data that…
-
Exact match, fuzzy match and de-duplication with Find Duplicates
Hi everyone, I'm starting a series of articles all about matching and linking records to find duplicates in Aperture Data Studio, intended to encourage some learning and interaction. Start here: Why worry about duplicated data? Simple ways to identify and resolve duplicated data in Aperture Data Studio: Exact…
-
Quick introduction to Blocking Keys and Rules for Find Duplicates
The concept of Blocking Keys and Rules may be foreign to you if you haven’t already used the Find Duplicates step in Aperture Data Studio. Blocking keys identify records that are similar, creating blocks or potential groups of matches. Rules compare every set of records in the resulting blocks, returning…
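The division of labour between blocking keys and rules can be sketched in a few lines of Python. The key used here (surname initial plus year of birth), the field names and the sample records are all assumptions for illustration; real blocking keys are defined in the Find Duplicates settings, not in Python.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    """Hypothetical key: first letter of the surname plus year of birth."""
    return (record["surname"][:1].upper(), record["dob"][:4])


def candidate_pairs(records: list) -> list:
    """Group records into blocks by key, then list every pair inside each
    block. Only these pairs would be handed to the matching rules."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"surname": "Smith", "dob": "1980-01-02"},
    {"surname": "Smyth", "dob": "1980-01-02"},
    {"surname": "Jones", "dob": "1975-05-05"},
    {"surname": "Smith", "dob": "1990-07-07"},
]

if __name__ == "__main__":
    print(candidate_pairs(records))
```

With four records there are six possible pairs, but this key leaves only the pair of the first two records to be compared, which is why a well-chosen key matters for both speed and recall.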
-
🎞 A Short Feature Demo on data unions and find duplicates in Data Studio
-
ℹ️ Removing Exact Duplicates / Exact Matches
A common thing users want to do with Data Studio is remove rows with (exact) duplicate values in selected columns, similar to the functionality in Excel 'Remove Duplicates' that will compare values in the selected columns, keeping the first of any exact matches, but also keep the values from any unselected columns in that…
-
How do we expand on the provided base blocking keys and rule sets on Aperture Data Studio?
Hi team, I am asking this question on behalf of one of our Credit Services team members. They are looking to create Blocking keys and Rules that accommodate a text string (Drivers License) and date of birth, but are running into issues. The rules that they are currently using are part of the attachments. Is there an easy…
-
Find Duplicates Language for Notepad++
If you're using Notepad++ as your text editor, it can be helpful to have an interpreter to highlight key words for matching rules for Find Duplicates and have auto completion suggestions. Attached are two XML files which allow for this. Adding Language To add the new language option, open Notepad++ and select Language…
-
Job Title Match Key 👨💼👩💼
Job Title Match Key The function in this post has been designed to help illustrate an approach (and act as a template for you to build on) to help handle inconsistencies with Job Titles found in B2B databases. In short, the function generates a key that can be used to group job titles together (despite presentation…
-
Ensuring you always return the same duplicate ID with Find Duplicates
Find Duplicates returns a consistent Cluster ID after each run. Consistency here depends on the sequence order of the input file: if the same file is submitted to Find Duplicates in the same order, the Cluster ID will be the same on each run. In some cases you may not be able to guarantee the order of the records…
-
Handling Gmail email addresses for duplicate identification
The problem: Gmail allows multiple e-mail addresses for a single account With over 1 billion active users since 2016, Gmail is likely to be a large part of any consumer dataset. Being able to accurately resolve these individuals becomes more problematic when Gmail allows for variants of an e-mail address to be used. An…
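One way to picture the problem is a small normalization function. This sketch assumes the commonly documented Gmail behaviours (dots in the local part are ignored, anything after a `+` is ignored, and `googlemail.com` is an alias of `gmail.com`); the excerpt is truncated before its own example, so the function name and sample addresses are illustrative only.

```python
GMAIL_DOMAINS = ("gmail.com", "googlemail.com")


def normalize_gmail(address: str) -> str:
    """Reduce Gmail variants of an address to one canonical form.
    Addresses on other domains are only trimmed and lower-cased."""
    cleaned = address.strip().lower()
    local, separator, domain = cleaned.rpartition("@")
    if not separator:
        return cleaned  # no "@" found, leave the value alone
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0]   # drop any "+tag" suffix
        local = local.replace(".", "")   # dots are not significant
        domain = "gmail.com"
    return f"{local}@{domain}"


if __name__ == "__main__":
    for sample in ["John.Smith+news@Gmail.com", "j.smith@example.com"]:
        print(sample, "->", normalize_gmail(sample))
```

A normalized value like this could be used as an extra column for matching, while the original address is kept for contact purposes.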
-
Using a discriminant for Find Duplicates clustering
Find Duplicates will use fuzzy matching to link records, however, in some cases you may have a discriminant field you wish to use to break clusters. In this post we cover how to make use of these within a workflow. The simple scenario In this scenario, we are processing transaction records and matching on name, mailing…
-
The importance of profiling when using Find duplicates
When implementing Find duplicates in Aperture Data Studio we've seen many examples of the importance of profiling prior to configuring and running Find duplicates. This has the potential to benefit both the performance of the Find duplicates step as well as the quality of potential duplicates found. Find duplicates works…