Cross-system and cross-table relationship discovery automation
In our customer cases data quality issues are often cross-system inconsistencies. Technically, customer rows might not have been brought from CRM to MDM or DW for some technical or business reason. Additionally, data might have been removed or changed in one system leading to inconsistencies between systems.
Automated relationship discovery in Pandora is really useful in these cases and helps an analyst to get an understanding about data landscape. It's easy list to check which columns might be foreign keys or just unexpected matches. I can find easily should I join the tables with PIC, CustomerID or maybe external userID. I can find if notes fields contain PICs or emails exist in name fields. I can find fast which customers have not been loaded to target systems and so on.
Is Aperture going to have relationship discovery feature that exists in Pandora?
I have liked the ability to select two or more tables that would be then auto-discovered for relationship qualities in unknown customer data. The results are easily listed with drill-downs to see the actual data. The Venn-Diagram shows the row and value counts to show what is the magnitude of findings.
This feature is quite useful especially in MDM and multi-system integration development and prototyping cases when we have several sources and want to cross-check multiple sources and targets before and after MDM.
If you are not going to do this with Aperture what are the reasons to not implement this one?
For example, do you think that PII should be discovered with more complex pattern and content discoveries rather than just exact matches. Relationships are suggested in join-function so no need to generate all stats like this? I know this is mostly useless computation. At the same time, there is a lot of important insight which you do not necessarily even know before you see the unexpected anomalies.
At the same time, this kind of automatic discoveries should not be automatic for all tables and between all data sets. In Finland it is illegal to combine data sets from two person registries. Also GDPR states that its illegal to combine data from several data controllers without a valid legal reason which often does not exist. This makes it difficult or actually impossible to use Pandora for multiple organizations as Pandora automatically combines data values from data sets to a single value list and links them between registries. It automatically discovers that persons SSN is a patient in hospital 1 EPR as well as hospital 2 EPR potentially leading to violation of individual patient registries and their legally binding policies and laws.
Discoveries and links should be done only between data sets of a single data controllers single person data registry - unless you have explicit permission to link the two registries for some reason. Therefore there is a need to have controls for individual data registries/projects/schemas/workspaces or something that keeps the data locked into that context and does not do anything beyond. The legal requirement is that all the person data can be stored to a single physical data base or data lake but they can never be logically or physically linked to each other across data controllers. You would need some kind of a data controller or data space tag that could be used to control the scope of auto-discoveries and logical operations.
In general, I'm interested in your thoughts and roadmap to relationship discovery features.
Best Answer
-
A great question Sami.
At this point in time relationship discovery is not considered as a feature for Aperture Data Studio.
There are a couple of reasons for this. One of them is the fact that we have actually found this capability to be rarely used. Secondly, once relationship discovery has been done once, how relevant is it for execution a second time if the source system schemas haven't changed?
From a technical standpoint, one of the reasons that Pandora is able to do this, is because of the way the data that is cached is handled as one large repository with cross-object searching. In fact in Data Studio v1.x this is more compartmentalised and in v2.0 even more so. This is to avoid a number of issues including, but not limited to : defragmentation of the data store, privacy, compliance and expectations around data and workspace segregation.
When you're a single user on a single system connecting to many sources, these concerns are less prevalent but when you're one of many users on a server based product this becomes critical.
In order to reintroduce a cross object search would require a fair amount of new development work and this would be resource intensive when processing with value potentially limited to just that point in time.
The more appropriate approach we feel, is leveraging the auto-tagging to determine the contents of columns and then at some point making ML based proposals for reshaping the columns so that they can be combined with similar columns from other data sets so that they can all feed into the fuzzy matching engine.
When you combine disparate sources, run find duplicates and then harmonize, you move to a model where there is actually great potential to create a new best possible nominal record with lineage back to those sources that may ultimately prove more useful.
2