Handling Gmail email addresses for duplicate identification
The problem: Gmail allows multiple e-mail addresses for a single account
With over 1 billion active users since 2016, Gmail is likely to be a large part of any consumer dataset. Being able to accurately resolve these individuals becomes more problematic when Gmail allows for variants of an e-mail address to be used.
An email address can be split into two parts, domain and local part, with an '@' symbol splitting the two. The local part of the e-mail normally allows for unique identification of a user at that domain. However with Gmail, this is not always the case.
Gmail allows users to employ two methods to create multiple email addresses for a single account. The first is that dots don't matter in a gmail address. This means that all the following addresses are all for the same account
The second method is that anything following a plus "+" sign in the local part of the e-mail address is ignored. This allows users to filter emails in their inbox based on which e-mail address was used. So for example the following e-mail addresses will all be for the same account
All of these e-mail addresses are valid, but the problem is how do we ensure these are all linked in a Find Duplicates step.
In order to tidy this up for matching purposes only, we will create a new column in a Transform step which will process @gmail or @googlemail addresses.
First we need to isolate the local part of the email address, by splitting the email string on the '@' symbol, and retaining the first part. We then split the local part on a '+' symbol and keep the first part again. This is because anything after this symbol is ignored by Gmail.
We next handle '.' by using a regex to replace these.
And then append '@gmail.com' to re-form the e-mail address.
To ensure this only applies to these domains, we will only use this if the domain is gmail or googlemail.
Test file and workflow are below.