Using 'Matches' functions to identify personal information (and automate the detection of this)
Following on from @Steve's discussion (here), I wanted to share some further thoughts on this subject, as it's a topic that crops up quite regularly. If I don't have permission to hold or process PII (personally identifiable information), I want an automated way of flagging whether it is present in my data source(s) before I load them into production processes/environments (e.g. data feeds from a supplier before they are loaded into my analytics platform).
Types of PII
Given that PII comes in lots of different shapes and forms, it's important to build a workflow that fits your own definition of it. Most PII can be identified in two ways:
- PII Values - where the presence of certain values indicates that a given cell/row contains PII. Examples of the values you would want to look out for include: country names, locations, salutations/titles, common forenames/surnames and job titles.
- PII Formats - where the format of a specific set of characters can be used to determine the presence of PII. Examples include: postcodes/zip codes, telephone numbers, passport numbers, credit card details, NHS numbers, NI numbers, cookies, IP addresses etc.
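To make the distinction concrete outside of Data Studio, here is a minimal Python sketch of the two checks. It is not Aperture function syntax, and everything in it is an assumption for illustration: the names `PII_VALUES`, `PII_FORMATS`, `find_pii_values` and `find_pii_formats` are hypothetical, and the regular expressions are deliberately simplified rather than full validators.

```python
import re

# Hypothetical PII values lookup (illustrative entries only).
PII_VALUES = {"mr", "mrs", "dr", "london", "france"}

# Hypothetical PII formats lookup; simplified patterns, not strict validators.
PII_FORMATS = {
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    "ipv4_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ni_number": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b", re.IGNORECASE),
}


def find_pii_values(text):
    """Return the lookup values that appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & PII_VALUES)


def find_pii_formats(text):
    """Return the names of the formats whose pattern occurs in the text."""
    return sorted(name for name, pattern in PII_FORMATS.items() if pattern.search(text))
```

For example, `find_pii_values("Dr Smith, London")` picks up the title and the location, while `find_pii_formats("SW1A 1AA 192.168.0.1")` picks up the postcode and the IP address. A values lookup will need far more entries than this in practice, and a format pattern this loose will produce some false positives.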
Building this in Aperture Data Studio
With a relatively easy-to-build workflow, you can detect both of these types of PII using a couple of 'matches' functions, and set up automation so that email alerts are triggered to inform data stewards when PII is detected. Alternatively, you could adapt the workflow so that it automatically strips any PII values/formats from the records.
My approach for building this on a sample data set uses two lookup tables (one for PII values and another for PII formats/regular expressions).
Within the transform step, four new columns are created:
- Conc List (a concatenated list of all the values in each of the columns I want to screen for PII data)
- PII Values Detected (the values detected using my PII values lookup)
- PII Format Detected (the formats detected using my PII formats lookup)
- Row Doesn't Contain PII (a true/false flag to indicate whether the record is suitable or not)
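The four columns above can be sketched in Python as follows. This is only an approximation of the logic under stated assumptions, not the actual transform step: `screen_row`, the two small lookups and the `" | "` separator are all hypothetical choices made for the example.

```python
import re

# Hypothetical lookups standing in for the two lookup tables.
PII_VALUES = {"mr", "mrs", "dr"}
PII_FORMATS = {
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
    "ipv4_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def screen_row(row, columns):
    """Build the four screening columns for one record (a dict)."""
    # Conc List: every screened column joined into a single string.
    conc_list = " | ".join(str(row.get(col, "")) for col in columns)

    # PII Values Detected: lookup values present as whole words.
    words = set(re.findall(r"[a-z]+", conc_list.lower()))
    values = sorted(words & PII_VALUES)

    # PII Format Detected: names of the patterns found in the string.
    formats = sorted(n for n, p in PII_FORMATS.items() if p.search(conc_list))

    return {
        "Conc List": conc_list,
        "PII Values Detected": ", ".join(values),
        "PII Format Detected": ", ".join(formats),
        # True only when neither check found anything.
        "Row Doesn't Contain PII": not values and not formats,
    }
```

Calling `screen_row({"name": "Mr Jones", "note": "logged in from 10.0.0.5"}, ["name", "note"])` gives a row flagged as containing PII, with `mr` as the detected value and `ipv4_address` as the detected format. Concatenating first means each lookup only has to be applied once per row rather than once per column.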
Finally, the 'Fire Event' step can be used to set up email alerts (along with scheduling or triggering), so that when this data source is reprocessed on an automated basis I am informed automatically of the presence of any PII.
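The decision behind such an alert is simple enough to sketch in Python. This is not how the 'Fire Event' step is configured; `build_alert` is a hypothetical helper that only shows the rule "raise an alert if any row is flagged", taking the list of "Row Doesn't Contain PII" flags as input.

```python
def build_alert(flags, source_name):
    """Return an alert message if any row is flagged as PII, else None.

    `flags` holds the "Row Doesn't Contain PII" value for each row,
    so a False entry means PII was detected in that row.
    """
    flagged = [i for i, clean in enumerate(flags, start=1) if not clean]
    if not flagged:
        return None  # nothing detected, so no alert is needed
    rows = ", ".join(str(i) for i in flagged)
    return (
        f"PII detected in {source_name}: "
        f"{len(flagged)} of {len(flags)} rows flagged (rows {rows})"
    )
```

With flags `[True, False, False]` for a source called `supplier_feed`, this returns a message reporting that 2 of 3 rows were flagged; with all flags `True` it returns `None` and no alert would be sent.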