Profiling all tables of a source
Hi team,
A partner is using Aperture Data Studio to profile sources on a client site, so that he can recommend object mappings to the client and confirm what the client is attempting to map. They want to quickly check the validity of the fields and their contents against what the client thinks they need. The initial profiling will help them with their source-to-target mapping.
• 2 Source databases each having 1,382 tables
• 1 Source database having 1,668 tables
• 1 Source database having 2,184 tables
• 2 Source databases each having 138 tables
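For a sense of scale, the counts in the list above can be tallied with a few lines of Python. This is only a quick arithmetic sketch; the variable names are illustrative and nothing here touches Data Studio itself.

```python
# Table counts per source database, taken from the list above.
# Each entry is (number of databases, tables per database).
sources = [(2, 1382), (1, 1668), (1, 2184), (2, 138)]

total_databases = sum(count for count, _ in sources)
total_tables = sum(count * tables for count, tables in sources)

print(f"{total_databases} databases, {total_tables} tables to profile")
# prints "6 databases, 6892 tables to profile"
```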
Some of the above have a considerable number of tables. Is there an easy way to set up profiling for all tables per connection?
I understand this is not a recommended approach to, or use of, profiling, but I wanted to check before getting back to them.
Thank you!
Shamma
Best Answer
Clinton Jones Experian Elite
This is a tough problem to resolve. One of the things that Data Studio does not do, as of v1.5, is auto-profile. This behaviour has been explicitly suppressed in Data Studio because profiling is not always required and is, in fact, an expensive processing activity. The option instead is to profile manually, and only when explicitly requested. The alternative is to set up a discrete workflow that performs profiling but does not support discovery. All of this describes current behaviour and current handling. There is a request under consideration to have auto-profiling as a configurable option. As soon as this is likely to be made available, we'll provide details via the release notes.
With respect to the specifics here, we're talking about staging multiple tables and then kicking off multiple profiling jobs. The Data Explorer presently does not support multi-object selection and there is no API method for executing a profiling job.
After the release of v2.0 we plan to release APIs that could potentially support synchronizing selections of tables from lists or from other systems such as Collibra, but this is not an option today.
Answers
Thanks Clinton!
Hi @Clinton Jones. Is there any update to your response above that applies to the future roadmap for 2.0?
I can certainly see a use case here.
I was wondering whether it is possible to create data-driven workflows that could, for example, profile specific tables if sufficient metadata were provided to determine the data source, entities and attributes that we wish to profile, and then pass these as parameters to execute a "profiling" workflow via a REST API call. Does that make sense?
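To make the idea concrete, here is a minimal sketch of the metadata-driven part only. Everything in it is hypothetical: the metadata rows, the `build_profiling_requests` helper and the parameter key names are invented for illustration and are not a Data Studio schema. As noted earlier in the thread, there is currently no API method for executing a profiling job, so the sketch stops at building one parameter set per table and makes no REST call.

```python
from collections import defaultdict

# Hypothetical metadata rows: (data source, entity/table, attribute/column).
metadata = [
    ("crm_db", "customers", "email"),
    ("crm_db", "customers", "postcode"),
    ("crm_db", "orders", "order_date"),
    ("billing_db", "invoices", "amount"),
]


def build_profiling_requests(rows):
    """Group metadata rows into one parameter set per (source, entity)."""
    grouped = defaultdict(list)
    for source, entity, attribute in rows:
        grouped[(source, entity)].append(attribute)
    # One dict per table; the key names are illustrative assumptions,
    # not the parameters of any real Data Studio workflow or endpoint.
    return [
        {"source": source, "entity": entity, "attributes": attributes}
        for (source, entity), attributes in grouped.items()
    ]


profiling_requests = build_profiling_requests(metadata)
for params in profiling_requests:
    print(params)
```

Each resulting dictionary would correspond to one run of a parameterised "profiling" workflow, if such a call were available.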
In addition, is there any plan, or any technical constraint that prevents it, to add the ability to perform multi-column value or frequency analysis as there was in Pandora?
Many thanks
Daniel
@DTAconsulting What we're looking to do is instrument Data Studio to support cataloging solutions. This would facilitate adding a lot of sources in a single pass.
With the new design of v2 sources you can set up a view that is a profile of a dataset. Views are generated dynamically, however, so we would have to experiment with the practicality of combining a "drop zone" with a profiling view, where the profile would be generated on the latest view of the data. It does raise the question of whether you shouldn't in fact have a specific intent, with a workflow based on a schedule.
Mass profiling has been observed to be an uncommon requirement and more of an edge case; it is time-consuming and resource-intensive.
We may need to look at a trigger on certain kinds of views tied to drop zone events.