ℹ️ Removing Exact Duplicates / Exact Matches

Josh Boxer · June 2021

A common thing users want to do with Data Studio is remove rows with (exact) duplicate values in selected columns, similar to the functionality in Excel 'Remove Duplicates' that will compare values in the selected columns, keeping the first of any exact matches, but also keep the values from any unselected columns in that row. (Excel function):

(Note that if you are including all columns in the Dataset or do not wish to hold onto any 'unselected' columns then you can instead do a Group step in Data Studio, which can optionally include an aggregate like Count/Max/Min etc.)

You can do this in Data Studio as part of your Workflow using the Harmonize duplicates step. https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/harmonize-duplicates-step/overview/

If there is a single column to de-duplicate against then select this column as the Cluster ID ('ID' is flexible, it can be a date, text etc.).

If you want more logic around which matched row is kept, such as the one with the latest Timestamp or latest OrderID then you would set Record priority to 'Score based' and select your Timestamp or OrderID column as the 'Score'.

If there are multiple columns you want to de-duplicate against then you need to do some pre-processing:

1. Add Transform step > Add New Column > Create your own function

2. Use Concatenate function (green + if there are more columns)

3. Name the new column and use this column as the Cluster ID value in your Workflow:

What about values that cannot be matched exactly?

If the values in these columns cannot be matched exactly then you have a few options and different approaches written about in more detail in other discussions but essentially:

pre-process the data with additional Transform steps removing noise, changing casing etc. until the values match. An example of this here (with some further thoughts on the topic of 'matching') https://community.experianaperture.io/discussion/248/approach-for-matching-product-data-product-description
the Find Duplicates step can use rules based logic (columnA&columnB or columnC) and/or fuzzy matching algorithms such as Levenshtein and JaroWinkler, soundex matching, etc. to generate a Cluster ID column for similar values. Note however that writing the rules and blocking keys, especially for non-contact data, does require some research https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/find-duplicates-step/advanced-configuration/

ℹ️

﻿ ℹ️ ﻿ Removing Exact Duplicates / Exact Matches

Categories

ℹ️ Removing Exact Duplicates / Exact Matches