Documenting Aperture Data Products with Metadata
In complex Aperture environments, as in any data environment, you need to understand what data you have, where it comes from, what has been done to it, and where it has been delivered. These details are handled by functions like metadata management, data lineage, and so on. Aperture already has many features that help you browse data spaces, data sets, workflows, and even some of their dependencies, but keeping it all in order is still a real challenge.
I was wondering whether Aperture could benefit from "Data Product" thinking and the recently published Data Cards work by Google Research. The AI community has had Model Cards, also developed by Google Research, for some time; now they have published an article about Data Cards that can be used to document data.
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
https://arxiv.org/pdf/2204.01075v1.pdf
This strikes me as quite a nice list of things that data stewards, data analysts, and everyone else who works with Aperture could benefit from. Maybe in the future you could export or publish an Aperture data set to an external file/API and dynamically generate a Data Card for it. The Google article lists many important items that could be populated from Aperture metadata; some would still need manual documentation as well.
The idea would be a concept where Aperture treats the result of every function as a Data Product, and then provides a Data Card as a dynamic metadata API alongside the actual output API and result data. Some current Aperture features could probably be added to a first version of the Data Card quite easily; the more problematic metadata management issues could be delivered later.
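To make the concept concrete, here is a minimal sketch of what such a dynamically generated Data Card might look like as a plain data structure. The field names loosely follow the Google Data Cards paper; none of this is a real Aperture API, and the example values are made up.

```python
# Sketch of a minimal Data Card assembled from metadata a platform like
# Aperture already tracks. Field names loosely follow the Google Data
# Cards paper; nothing here is a real Aperture API.
import json

def build_data_card(name, owners, sources, workflows, description=""):
    """Assemble a Data Card dict that a metadata API could serve as JSON."""
    return {
        "name": name,
        "description": description,
        "provenance": {
            "sources": sources,      # upstream data sets / files
            "workflows": workflows,  # transformations applied on the way
        },
        "governance": {
            "owners": owners,        # who is responsible for the product
        },
    }

card = build_data_card(
    name="monthly_sales_report",
    owners=["data-team@example.com"],
    sources=["crm_export", "erp_orders"],
    workflows=["clean_orders", "join_crm", "aggregate_monthly"],
    description="Aggregated monthly sales, joined with CRM accounts.",
)
print(json.dumps(card, indent=2))
```

A first version could fill the provenance section automatically from existing lineage metadata and leave the description for manual documentation.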
Comments
Hi Sami
Thanks for sharing this. The broader topic of data governance, including data documentation, is something we have been discussing with customers for some time. It would be good to hear, at a high level, what challenges you are trying to solve with Data Cards or just in general.
Hi @Sami Laine, just to add that we have a couple of initiatives underway related to working with Data Catalogs. Drop me a note if you would like us to line up a call to discuss.
Hey,
Well, what has become an evident problem is tracing data lineage backwards, or doing impact assessment, across workflows and even data spaces. Sometimes it is just the network of all workflows, from the end product (export/report/OData) backwards to all the sources used to produce the final data product. Sometimes you want to drill in and see all the steps across those workflows. Sometimes you want to see the individual attributes that flow through, or might affect, the results.
The needs depend on your use case. When some data seems broken and does not correspond to the source data, so the workflow must be wrong somewhere, you may need to trace every step and attribute backwards across all workflows. If you just want to check whether a data source or workflow is still used at all, a high-level network view is enough.
The Google Data Cards are just a good example of a long-term strategic goal, and of how some global players approach this data productization issue. Something to consider, in case it helps you in the long run.
I have a hunch that, with reasonable effort, you could stitch together some sort of Data Card dashboard by making it technically easy and automated to collect things from data sources (tags, structures), lineage (workflows, steps, functions, etc.), and administration (security, owners, versions, etc.) in one place. Right now many of these are scattered around the UI. The dashboard might start as just a collection of quick links to the relevant data sources, workflows, functions, and so on.
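Even the quick-links starting point could be generated mechanically. A tiny sketch, assuming a hypothetical base URL and path scheme (the link targets are made up; a real version would use the platform's own UI routes):

```python
# Sketch: a first-cut "Data Card dashboard" as nothing more than grouped
# quick links for one data set. The URL scheme below is invented for
# illustration, not a real Aperture route structure.

def quick_links(dataset, base_url="https://aperture.example.com"):
    """Group the scattered UI locations for one data set into one view."""
    return {
        "sources": f"{base_url}/datasets/{dataset}/sources",       # tags, structures
        "lineage": f"{base_url}/datasets/{dataset}/lineage",       # workflows, steps
        "administration": f"{base_url}/datasets/{dataset}/admin",  # owners, versions
    }

links = quick_links("monthly_sales_report")
for section, url in links.items():
    print(f"{section}: {url}")
```

From there, each link target could gradually be replaced by inlined metadata until the page becomes a proper Data Card.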