Has anybody used Aperture to anonymise/create synthetic data?
I have a requirement to create anonymised/synthetic data.
My initial thought, for the name, would be to swap forenames using a simple lookup table and create surnames by a simple letter replacement (this would stay consistent where a client name appears several times).
Similarly, create false emails (using the forename/surname created above and the existing provider) and simply swap the digits in phone numbers.
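As a rough illustration of the approach described above (outside Aperture, in plain Python), the sketch below uses a forename lookup, a fixed letter substitution for surnames, an email rebuilt on the existing provider, and a digit swap for phone numbers. All names, tables and mappings here are hypothetical examples, not Aperture functions.

```python
import string

# Hypothetical lookup table: real forename -> replacement forename.
FORENAME_LOOKUP = {"Nigel": "Martin", "Sueann": "Helen", "Clinton": "Robert"}
FALLBACK_FORENAME = "Alex"  # assumed placeholder for forenames not in the table

# Hypothetical fixed letter substitution for surnames (alphabet rotated by 7).
_LOWER = string.ascii_lowercase
_SHIFTED = _LOWER[7:] + _LOWER[:7]
SURNAME_TABLE = str.maketrans(_LOWER + _LOWER.upper(), _SHIFTED + _SHIFTED.upper())

# Hypothetical digit swap for phone numbers (a permutation of 0-9).
DIGIT_TABLE = str.maketrans("0123456789", "3710925846")


def pseudo_forename(forename: str) -> str:
    """Swap a forename using the lookup table."""
    return FORENAME_LOOKUP.get(forename, FALLBACK_FORENAME)


def pseudo_surname(surname: str) -> str:
    """Replace each letter with its substitute; the same input always gives the same output."""
    return surname.translate(SURNAME_TABLE)


def pseudo_email(real_email: str, new_forename: str, new_surname: str) -> str:
    """Build a false email from the replacement name, keeping the existing provider."""
    provider = real_email.rsplit("@", 1)[-1]
    return f"{new_forename}.{new_surname}@{provider}".lower()


def pseudo_phone(phone: str) -> str:
    """Swap digits one for one; spaces and other characters are left untouched."""
    return phone.translate(DIGIT_TABLE)


forename = pseudo_forename("Nigel")
surname = pseudo_surname("Light")
print(forename, surname)
print(pseudo_email("nigel.light@example.com", forename, surname))
print(pseudo_phone("07700 900123"))
```

Because the mappings are fixed, the same client always gets the same replacement values, which is what keeps repeated names consistent. The flip side is that a fixed substitution can be reversed by anyone who works out the mapping, so the tables would need to be kept private.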
I was wondering if it would be possible to look up addresses for a postcode and, where valid, pick one at random for the same postcode (allowing locations to remain as-is as this would be needed for location analysis)?
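A minimal Python sketch of that address idea, assuming a reference table of valid addresses keyed by postcode is available (the `ADDRESSES_BY_POSTCODE` data below is made up for illustration and would in practice come from an address reference file):

```python
import random
from typing import Optional

# Hypothetical reference data: postcode -> valid addresses at that postcode.
ADDRESSES_BY_POSTCODE = {
    "AB1 2CD": ["1 Example Street", "2 Example Street", "3 Example Street"],
    "EF3 4GH": ["10 Sample Road", "12 Sample Road"],
}


def normalise_postcode(postcode: str) -> str:
    """Upper-case the postcode and collapse repeated whitespace."""
    return " ".join(postcode.upper().split())


def random_address_for_postcode(postcode: str, rng: random.Random) -> Optional[str]:
    """Pick a random valid address for the postcode, or None if it is not in the table."""
    candidates = ADDRESSES_BY_POSTCODE.get(normalise_postcode(postcode))
    if not candidates:
        return None
    return rng.choice(candidates)


rng = random.Random(42)  # seeded only so that a run can be repeated
print(random_address_for_postcode("ab1  2cd", rng))
print(random_address_for_postcode("ZZ9 9ZZ", rng))
```

The postcode itself is never changed, so location analysis at postcode level is unaffected; only the address line within it is replaced.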
NB: This would be a great feature for a future Aperture product release (or is it planned already?).
Regards Nige
Answers
Hi @Nigel Light, I am not aware of any customer explicitly doing this, though I am aware that development have used Data Studio to create data for testing. @Henry Simms and perhaps @Matthew Berry can comment on that.
@Nigel Light, if I may ask, what is your main purpose in creating anonymised/synthetic data? Do you need to be able to reverse-engineer/re-identify the actual values later on?
Depending on what you are trying to achieve, there are several data anonymisation techniques, as I understand from some of the articles on the internet like this one here.
I think what you have suggested, i.e. using lookup/replace, is one of the ways you can do it.
Transformations with functions like Replace or Format Mask may help too. You can also take a look at functions to mask out email/phone numbers in our functions library.
If you need to preserve the statistical properties of the data, then there may be a little more work involved in profiling your sample dataset to understand the statistics, values and formats involved.
If you are composing a sample dataset from scratch, again it will depend on what you are trying to achieve. In-house within Experian, if we are doing so for the purpose of testing, we want to make sure the attributes cover a variety of scenarios, with a good mix of positive and negative cases and ratios that mimic real-life usage.
Hi @Clinton Jones, @Sueann See,
Thanks for the responses. What I wish to do is to retain our existing client details, e.g. birth and residence location, but anonymise them: make all birthdates 01/01/yy (where yy = the client's actual year of birth), select a random address for the existing client's postcode, and anonymise the forename and surname, removing any client identifiers.
I am not really sure what we are looking for at this stage, e.g. are all clients named Nigel substantial fund holders? (I can assure you that this one isn't!) But consistently replacing the name would allow us to find out. Ditto location, age and other client markers.
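For the birthdate rule, a small Python sketch of what is meant (the function names are illustrative only): the year of birth is kept and the day and month are both set to 01.

```python
from datetime import date


def generalise_birthdate(dob: date) -> date:
    """Keep the year of birth but move the date to 1 January."""
    return dob.replace(month=1, day=1)


def format_dd_mm_yy(d: date) -> str:
    """Render a date as dd/mm/yy, e.g. 01/01/70."""
    return d.strftime("%d/%m/%y")


print(format_dd_mm_yy(generalise_birthdate(date(1970, 6, 15))))  # 01/01/70
```

Note that a two-digit year cannot tell 1925 from 2025, so keeping the full four-digit year in the stored value (as `generalise_birthdate` does) is the safer choice for age analysis.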
We could then use this for client segmentation/data mining and share easily with an external 3rd party without any risk of identifying the individual.
Interesting task
Nigel