Handling Gmail email addresses for duplicate identification
The problem: Gmail allows multiple e-mail addresses for a single account
With over 1 billion active users since 2016, Gmail is likely to be a large part of any consumer dataset. Being able to accurately resolve these individuals becomes more problematic when Gmail allows for variants of an e-mail address to be used.
An email address can be split into two parts, domain and local part, with an '@' symbol splitting the two. The local part of the e-mail normally allows for unique identification of a user at that domain. However with Gmail, this is not always the case.
Gmail allows users to employ two methods to create multiple email addresses for a single account. The first is that dots don't matter in a gmail address. This means that all the following addresses are all for the same account
- john.smith@gmail.com
- jo.hn.sm.ith@gmail.com
- johnsmith@gmail.com
- j.o.h.n.s.m.i.t.h@gmail.com
The second method is that anything following a plus "+" sign in the local part of the e-mail address is ignored. This allows users to filter emails in their inbox based on which e-mail address was used. So for example the following e-mail addresses will all be for the same account
- johnsmith@gmail.com
- johnsmith+mybank@gmail.com
- johnsmith+marketing@gmail.com
To compound this, Google owns two domains (gmail and googlemail) which both direct mail to gmail. So johnsmith@gmail.com and johnsmith@googlemail.com are equivalent.
All of these e-mail addresses are valid, but the problem is how do we ensure these are all linked in a Find Duplicates step.
The solution
In order to tidy this up for matching purposes only, we will create a new column in a Transform step which will process @gmail or @googlemail addresses.
First we need to isolate the local part of the email address, by splitting the email string on the '@' symbol, and retaining the first part. We then split the local part on a '+' symbol and keep the first part again. This is because anything after this symbol is ignored by Gmail.
We next handle '.' by using a regex to replace these.
And then append '@gmail.com' to re-form the e-mail address.
To ensure this only applies to these domains, we will only use this if the domain is gmail or googlemail.
Test file and workflow are below.
Comments
-
some cool tricks there !
0 -
One of the fun things about using Data Studio is that there are usually several different ways to achieve the same result. Akshay asked me to see if I could find a slightly shorter set of steps to tidy up the gmail addresses and I came up with this:
I've used the handy After function to extract the domain part of the email (after the
@
sign).In the Compare function I've checked the "Ignore Case" box, which saves me from needing to explicitly lower case the emails
I've replaced the two Split functions with a single Parse By Regular Expression, using the regex
^[^@+]*
to extract from the email any text that appears before the first instance of either a@
or+
symbol.4 -
For anyone using Data Studio v2, the function above has been updated to be a reusable function. Import the metadata exchange file attached into your space, and you should be able to use this wherever you need to standardise a gmail address.
2