Handling Gmail email addresses for duplicate identification
The problem: Gmail allows multiple e-mail addresses for a single account
With over 1 billion active users since 2016, Gmail is likely to be a large part of any consumer dataset. Being able to accurately resolve these individuals becomes more problematic when Gmail allows for variants of an e-mail address to be used.
An email address can be split into two parts, domain and local part, with an '@' symbol splitting the two. The local part of the e-mail normally allows for unique identification of a user at that domain. However with Gmail, this is not always the case.
Gmail allows users to employ two methods to create multiple email addresses for a single account. The first is that dots don't matter in a gmail address. This means that all the following addresses are all for the same account
The second method is that anything following a plus "+" sign in the local part of the e-mail address is ignored. This allows users to filter emails in their inbox based on which e-mail address was used. So for example the following e-mail addresses will all be for the same account
To compound this, Google owns two domains (gmail and googlemail) which both direct mail to gmail. So [email protected] and [email protected] are equivalent.
All of these e-mail addresses are valid, but the problem is how do we ensure these are all linked in a Find Duplicates step.
In order to tidy this up for matching purposes only, we will create a new column in a Transform step which will process @gmail or @googlemail addresses.
First we need to isolate the local part of the email address, by splitting the email string on the '@' symbol, and retaining the first part. We then split the local part on a '+' symbol and keep the first part again. This is because anything after this symbol is ignored by Gmail.
We next handle '.' by using a regex to replace these.
And then append '@gmail.com' to re-form the e-mail address.
To ensure this only applies to these domains, we will only use this if the domain is gmail or googlemail.
Test file and workflow are below.
some cool tricks there !
One of the fun things about using Data Studio is that there are usually several different ways to achieve the same result. Akshay asked me to see if I could find a slightly shorter set of steps to tidy up the gmail addresses and I came up with this:
I've used the handy After function to extract the domain part of the email (after the
In the Compare function I've checked the "Ignore Case" box, which saves me from needing to explicitly lower case the emails
I've replaced the two Split functions with a single Parse By Regular Expression, using the regex
^[^@+]*to extract from the email any text that appears before the first instance of either a
For anyone using Data Studio v2, the function above has been updated to be a reusable function. Import the metadata exchange file attached into your space, and you should be able to use this wherever you need to standardise a gmail address.