Cross-system and cross-table relationship discovery automation
In our customer cases data quality issues are often cross-system inconsistencies. Technically, customer rows might not have been brought from CRM to MDM or DW for some technical or business reason. Additionally, data might have been removed or changed in one system leading to inconsistencies between systems.
Automated relationship discovery in Pandora is really useful in these cases and helps an analyst to get an understanding about data landscape. It's easy list to check which columns might be foreign keys or just unexpected matches. I can find easily should I join the tables with PIC, CustomerID or maybe external userID. I can find if notes fields contain PICs or emails exist in name fields. I can find fast which customers have not been loaded to target systems and so on.
Is Aperture going to have relationship discovery feature that exists in Pandora?
I have liked the ability to select two or more tables that would be then auto-discovered for relationship qualities in unknown customer data. The results are easily listed with drill-downs to see the actual data. The Venn-Diagram shows the row and value counts to show what is the magnitude of findings.
This feature is quite useful especially in MDM and multi-system integration development and prototyping cases when we have several sources and want to cross-check multiple sources and targets before and after MDM.
If you are not going to do this with Aperture what are the reasons to not implement this one?
For example, do you think that PII should be discovered with more complex pattern and content discoveries rather than just exact matches. Relationships are suggested in join-function so no need to generate all stats like this? I know this is mostly useless computation. At the same time, there is a lot of important insight which you do not necessarily even know before you see the unexpected anomalies.
At the same time, this kind of automatic discoveries should not be automatic for all tables and between all data sets. In Finland it is illegal to combine data sets from two person registries. Also GDPR states that its illegal to combine data from several data controllers without a valid legal reason which often does not exist. This makes it difficult or actually impossible to use Pandora for multiple organizations as Pandora automatically combines data values from data sets to a single value list and links them between registries. It automatically discovers that persons SSN is a patient in hospital 1 EPR as well as hospital 2 EPR potentially leading to violation of individual patient registries and their legally binding policies and laws.
Discoveries and links should be done only between data sets of a single data controllers single person data registry - unless you have explicit permission to link the two registries for some reason. Therefore there is a need to have controls for individual data registries/projects/schemas/workspaces or something that keeps the data locked into that context and does not do anything beyond. The legal requirement is that all the person data can be stored to a single physical data base or data lake but they can never be logically or physically linked to each other across data controllers. You would need some kind of a data controller or data space tag that could be used to control the scope of auto-discoveries and logical operations.
In general, I'm interested in your thoughts and roadmap to relationship discovery features.
