Why worry about duplicated data?

Sueann See · July 2021

What is duplicated data?

Duplicated data refers to multiple records that represent the same entity. An entity can be anything, such as a person, a company or a product. In its most obvious form, duplicated data is an exact copy of a record. However, it is often not that straightforward, because records that differ in some of their values can still refer to the same entity.
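To make the difference concrete, here is a minimal sketch in Python. The records and the normalize helper are made up for illustration; the point is only that an exact comparison misses a duplicate that a simple clean-up catches.

```python
def normalize(record):
    """Build a comparison key: trim, collapse whitespace, lowercase."""
    name = " ".join(record["name"].split()).lower()
    email = record["email"].strip().lower()
    return (name, email)

# Hypothetical sample data, not taken from any real system.
records = [
    {"name": "Jane Doe", "email": "jane.doe@example.com"},
    {"name": "Jane Doe", "email": "jane.doe@example.com"},     # exact copy
    {"name": " jane  DOE ", "email": "Jane.Doe@Example.com"},  # same person, different formatting
    {"name": "John Smith", "email": "john@example.com"},
]

# Exact comparison only removes the byte-for-byte copy.
exact_unique = {(r["name"], r["email"]) for r in records}

# Comparing normalized keys also catches the formatting variant.
normalized_unique = {normalize(r) for r in records}

print(len(exact_unique))       # 3
print(len(normalized_unique))  # 2
```

Real data needs far more than trimming and lowercasing, but even this small step changes the answer to "how many customers do we have?".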

Why is it difficult to resolve duplicated data?

“Data is unique if it appears only once in a data set. A record can be a duplicate even if it has some fields that are different.” (UK Government Data Quality Hub)

Some questions to get you thinking:

How many names or nicknames might you be identified by? FamilySearch suggests there are 20 ways to find an elusive name.
How many email addresses do you own? I personally moved from Hotmail to Yahoo to Gmail. In fact, I own multiple Gmail accounts for different purposes. In addition, I have an Outlook account and numerous work email accounts for the companies I’ve worked for.
How many phone numbers do you own, and how many times have they changed? Think landline and cellphone, each for personal and work use.
How many times have you moved house and changed your address? What about the times you decided to use your parents’ address or your company address as your postal address?
When you register for an account, which name, email, phone number and address do you use? Think bank accounts, online shopping, hotel bookings, Netflix, Spotify, or any website that requires you to register for more content. Do you always provide your real name and contact details? Research reported by Sky News indicates that half of web users fake their data due to security fears.
Have you ever made a typo when providing your delivery details? Ever had something lost in the mail, or received a previous tenant’s parcel?
How much data does your organization collect about its customers, and across how many systems? What if the number of records runs into the tens or hundreds of millions, well beyond what Excel’s specifications and limits can handle? How would you scour all that data for duplicates?
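Typos are a good example of why exact comparison falls short. The sketch below uses Python's standard difflib module to score how alike two strings are; the addresses and the similarity helper are invented for illustration, and any cut-off for "close enough" would be an assumption you tune for your own data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a score between 0 and 1, ignoring case and extra whitespace."""
    def clean(s):
        return " ".join(s.lower().split())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()

# Hypothetical delivery addresses.
typed = "12 Baker Stret, London"     # typo: missing "e"
on_file = "12 Baker Street, London"
other = "48 Elm Avenue, Leeds"

print(similarity(typed, on_file))  # close to 1: very likely the same address
print(similarity(typed, other))    # much lower: a different address
```

Comparing every record against every other record this way quickly becomes expensive as the data grows, which is one reason dedicated matching tools exist.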

What are the consequences of duplicated data?

Depending on the kind of services your organization offers, the consequences of duplicated data can range from a wasted marketing budget, to legal and reputational risk, to, at worst, a matter of life and death. For example:

In a Forrester report on Why Marketers can’t ignore data quality, it was estimated that 21 cents of every media dollar spent in the last year was wasted due to poor data quality. This is equivalent to an average annual loss of $1.2 million for the midsize organizations and $16.5 million for the enterprise organizations in the study.
A Royal Mail Data Insight report also calls out the risk of duplicate customer records leading to data being used in breach of the General Data Protection Regulation (GDPR) in force in the UK.
In the United States, a Universal Patient Identifier is being considered to allow more accurate identification of patients and safer health care interactions.
In Pune, India, duplicated entries have caused delays in vaccination confirmation.

How can Aperture Data Studio help?

Data Studio is capable of processing millions or billions of rows (subject to your license limits and the memory and resources available on your machine).
There are multiple ways to de-duplicate data in Data Studio, including a full-fledged standardization and matching engine behind the Find Duplicates step, which offers a lot of flexibility in setting up the conditions used to match or link your records.
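To give a feel for what "standardize, then match" means in general, here is a small Python sketch. It is not how Data Studio or the Find Duplicates step works internally; the standardize and find_clusters functions, the field names and the sample records are all assumptions made for illustration. Records are cleaned first, then linked into one cluster whenever they share an email or a phone number.

```python
import re
from collections import defaultdict

def standardize(record):
    """Return a cleaned copy: lowercase email, digits-only phone."""
    return {
        "id": record["id"],
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"]),
    }

def find_clusters(records, match_fields=("email", "phone")):
    """Link records that share a non-empty value in any match field."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for field in match_fields:
        first_seen = {}
        for r in records:
            value = r[field]
            if not value:
                continue  # a blank value is never treated as a match
            if value in first_seen:
                parent[find(r["id"])] = find(first_seen[value])
            else:
                first_seen[value] = r["id"]

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r["id"])
    return sorted(sorted(ids) for ids in clusters.values())

# Hypothetical customer records from different systems.
raw = [
    {"id": 1, "email": "Jane.Doe@Example.com", "phone": "+44 20 7946 0000"},
    {"id": 2, "email": "jane.doe@example.com ", "phone": ""},
    {"id": 3, "email": "jdoe@work.example", "phone": "+44 (20) 7946-0000"},
    {"id": 4, "email": "john@example.com", "phone": "+44 20 7946 0999"},
]

clusters = find_clusters([standardize(r) for r in raw])
print(clusters)  # [[1, 2, 3], [4]]
```

Records 2 and 3 have nothing in common with each other, yet both end up linked through record 1. Deciding which fields are allowed to link records, and how strictly, is exactly the kind of condition a matching engine lets you configure.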

Are you trying to overcome challenges with duplicated data? Do share your story with us.
