Questions about Data Tagging and Profiling in Aperture Data Studio
Hi team,
- Is it possible to tag and profile multiple tables at the same time in Aperture Data Studio?
- Could we get a summary of the column and row counts for all tables in Aperture Data Studio? I realise the home page shows a short summary of the counts, but only for some of the tables.
- Could we run Dependency Analysis in Aperture Data Studio? I found Dependency Analysis in the official documentation but have not been able to use it.
Thanks for your time.
Answers
Hi @Nathaniel
It is not yet possible to multi-select tables for auto-tagging and profiling, except via a workflow for the latter.
At the moment the teams are working on ways to perform this programmatically using APIs.
We'd be interested in learning more about the specifics of your use case, where you need to perform bulk statistical analysis of objects, and whether this is a one-off or a recurrent activity on ever-changing data.
Can you PM me?
Hi @Clinton Jones, the scenario is that I have thousands of tables loaded into Aperture Data Studio as Datasets, and I have to tag and profile all of them. It is time-consuming, even impossible, to do it manually table by table, so I am looking for a way to process them in bulk. Is it possible to tag and profile data in bulk by creating a workflow? Thanks for your time!
@Nathaniel this sounds like something we would support through the API rather than the UI. Do you have a list of all those tables that you could feed into an API?
Are you looking to tag this data for:
There might be slightly different approaches suggested depending on the use case.
Clinton
Dependency Analysis is a v1 feature (which is why you cannot see it when profiling your data). However, it will fairly soon be re-introduced as "Analyze Relationships" in v2. Let me know what you are trying to achieve with it and I can tell you whether that will be covered. I could even set up a quick demo if that would be of interest.
The main goal we have is to identify the golden records that will be used to source a data migration. It's a fairly big migration with lots of tables and we have limited access to SMEs. We're looking for a way to rapidly tag as much source data as possible to help us target our investigations for further analysis.
We're very interested in API processing too, as our team is pretty technical :-). Even so, it would be good to help business analysts self-serve through the UI.
Hi @Clinton Jones, my lead Phil has explained the goal of tagging above. Could you please explain how the API feeding would work? We have a list of tables in the source database, but I am not clear on how to feed them into an API.
Hi @Josh Boxer, the main goal of Dependency Analysis is to find the relationships between tables in the source database. We'd like to understand the data flow (and, if possible, the attributes of the tables) to support the target model.
@Phil Watt and @Nathaniel the API-based methods are in the throes of development; I have already seen some of the work done, and we are targeting this for release in v2.4, due out next month, June.
Assuming you have defined a number of new tags for framing your golden records and trained the system with plenty of good training data to correctly identify said records, the idea is that you use a programmatic method to feed metadata into Data Studio via APIs to establish data sources.
The follow-on activities would, again, use a programmatic method to invoke certain actions on those defined data sources, so that you end up with tagged and profiled data.
This would be largely a lights-out activity that happens behind the scenes for the end user, but it would let you stage many sources and perform many actions without going through the manual process of source curation.
You can learn more about the current API implementations in the product documentation here: Data Quality user documentation. You should also be aware that Data Studio provides Swagger documentation. Unfortunately, the new APIs and methods are only available once you get your hands on the installed version of Data Studio.
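To make "feeding a list of tables into an API" concrete, here is a minimal Python sketch of what such a script could look like. The base URL, endpoint paths, payload fields, and auth header are illustrative assumptions only; the real routes and schemas are in your installation's Swagger documentation.

```python
# Hypothetical sketch: register each source table with Data Studio via REST,
# then request auto-tagging and profiling. Endpoint paths and payload shapes
# are assumptions for illustration -- check your installation's Swagger docs.

BASE_URL = "https://datastudio.example.com/api/v2"  # assumed base URL
API_KEY = "your-api-key"                            # assumed auth scheme


def build_requests(table_names):
    """Turn a list of table names into (method, url, payload) tuples."""
    calls = []
    for name in table_names:
        # 1. Register the table as a data source (hypothetical endpoint)
        calls.append(("POST", f"{BASE_URL}/datasets",
                      {"name": name, "source": "jdbc", "table": name}))
        # 2. Ask the server to auto-tag and profile it (hypothetical endpoint)
        calls.append(("POST", f"{BASE_URL}/datasets/{name}/profile",
                      {"autoTag": True}))
    return calls


tables = ["CUSTOMER", "ORDERS", "INVOICE"]  # your list of source tables
for method, url, payload in build_requests(tables):
    print(method, url, payload)
    # In practice you would send each call, e.g. with the requests library:
    # requests.post(url, json=payload, headers={"Authorization": API_KEY})
```

The point of the sketch is simply that a flat list of table names is enough to drive the whole bulk operation, which is why Clinton asks whether such a list exists.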
If you need anything specific then please let us know and we will try to get you preview content.
Clinton
Hi @Clinton Jones, thanks for your suggestion! There is a Profile step in Workflows in Aperture Data Studio, but I haven't found anything like a Tag step. What I know is that we can tag our data when we load it into Data Studio, or after loading we can click "Annotate Column" to edit further. Could you please explain how we could tag automatically in a workflow? Thanks!
@Nathaniel you can use the Transform step to apply pre-defined tags to specific columns, but this is not auto-tagging, as it requires you to pre-define the mapping between a specific column and a tag.
Example:
When I auto-tag my source dataset after loading, or use Annotate Column, the Customer ID column is not tagged as Unique ID.
In a workflow, I can connect a Transform step to the source.
At the Transform step, select Customer ID and apply the tag.
@Nathaniel @Phil Watt back to your scenario of "I have thousands of tables loaded into Aperture Data Studio Dataset, and I have to tag and profile all the tables"
Auto-Tag
There are only 2 ways you can auto-tag datasets.
Once you have auto-tagged your datasets, there are several ways you can determine which tags have been applied.
Profiling
To profile thousands of datasets easily, the only practical way is probably to create a reusable workflow, and you will have to use a workaround to cater for the dynamic schema. This assumes you know the maximum number of columns any dataset would have, and that you are able to use generic column headings for all your datasets.
Note: if you do it this way, you may want to insert an additional Transform step before the Export to exclude rows with Completeness = 0%.
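To illustrate the "generic column headings" workaround, here is a small Python sketch of what normalising every dataset to a fixed-width schema could look like. The `COL_N` naming and the assumed maximum of five columns are made up for the example; substitute whatever generic headings and maximum width fit your data.

```python
# Sketch of the generic-schema workaround: rename every dataset's columns
# to COL_1..COL_N and pad short rows, so a single reusable profiling
# workflow can accept any of them. MAX_COLS is the assumed maximum column
# count across all your datasets.

MAX_COLS = 5  # assumption: no dataset has more than 5 columns


def to_generic_schema(header, rows, max_cols=MAX_COLS):
    """Replace headers with COL_1..COL_max_cols and pad short rows with ''."""
    generic_header = [f"COL_{i + 1}" for i in range(max_cols)]
    padded = [row + [""] * (max_cols - len(row)) for row in rows]
    return generic_header, padded


header, rows = ["id", "name"], [["1", "Ann"], ["2", "Bob"]]
g_header, g_rows = to_generic_schema(header, rows)
print(g_header)  # ['COL_1', 'COL_2', 'COL_3', 'COL_4', 'COL_5']
print(g_rows)    # [['1', 'Ann', '', '', ''], ['2', 'Bob', '', '', '']]
```

Note that the padding columns beyond a dataset's real width are entirely empty, which is exactly why the Transform step filtering out rows with Completeness = 0% is suggested above.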