Questions about Data Tagging and Profiling in Aperture Data Studio

Hi team,

  • Is it possible to tag and profile multiple tables at the same time in Aperture Data Studio? 
  • Could we get a summary of the column and row counts for all tables in Aperture Data Studio? I realise that the home page shows a short summary of the counts, but only for some tables. 
  • Could we implement Dependency Analysis in Aperture Data Studio? I found Dependency Analysis in the official documentation but failed to implement it.

Thanks for your time.

Answers

  • Clinton Jones Experian Elite

    Hi @Nathaniel

    It is not yet possible to multi-select tables for auto-tagging and profiling, except by way of a workflow for the latter.

    At the moment the teams are working on ways to perform this programmatically using APIs.

    We'd be interested in learning more about the specifics of the use case you have in mind where you need to perform bulk statistical analysis of objects, and whether this is a one-off or a recurrent activity on ever-changing data.

    Can you PM me?

  • Hi @Clinton Jones, the scenario is that I have thousands of tables loaded into an Aperture Data Studio Dataset, and I have to tag and profile all of them. It is time-consuming, even impossible, to do this manually table by table, so I am looking for a way to process them in bulk. Is it possible to tag and profile data in bulk by creating a workflow? Thanks for your time!

  • Clinton Jones Experian Elite

    @Nathaniel this sounds like something we would support through the API rather than the UI. Do you have a list of all those tables that you could feed into an API?

    Are you looking to tag this data for:

    • preprocessing ahead of individual table inspection,
    • automated workflow execution based on data tags, or
    • integration with a third-party data catalog like Erwin or Collibra?

    We might suggest slightly different approaches according to the use case.

    Clinton

  • Josh Boxer Administrator

    Dependency Analysis is a v1 feature (which is why you cannot see it when profiling your data). However, it will soon be re-introduced as "Analyze Relationships" in v2. Let me know what you are trying to achieve with it and I can let you know whether that will be covered. I could even set up a quick demo if that would be of interest.

  • The main goal we have is to identify the golden records that will be used to source a data migration. It's a fairly big migration with lots of tables and we have limited access to SMEs. We're looking for a way to rapidly tag as much source data as possible to help us target our investigations for further analysis.

    We're very interested in API processing too, as our team is pretty technical :-). Even so, it would be good to help business analysts self-serve through the UI.

  • Hi @Clinton Jones, my lead Phil has described the goal of the tagging above. Could you please explain how the API feeding works? We have a list of tables in the source database, but I am not clear on how to feed them into an API.

  • Hi @Josh Boxer, the main goal of Dependency Analysis is to find the relationships between tables in the source database. We'd like to understand the data flow (and, if possible, the attributes of the tables) to support the target model.

  • Clinton Jones Experian Elite

    @Phil Watt and @Nathaniel the API-based methods are in active development. I have already seen some of the work, and we are targeting release in v2.4, due out next month (June).

    Assuming you have defined a number of new tags for framing your golden records and trained the system with plenty of good training data to correctly identify said records, the idea would be that you use a programmatic method to feed metadata into Data Studio via APIs to establish data sources.

    The follow-on activities would be, again, using a programmatic method to invoke certain actions on those defined data sources, such that you end up with tagged and profiled data.

    This would be largely a lights-out activity that happens behind the scenes for the end user, but it would support you in staging many sources and performing many actions without having to go through the manual process of source curation.

    You can learn more about the current API implementations from the product documentation (Data Quality user documentation), and you should also be aware that Data Studio provides Swagger documentation. Unfortunately, the new APIs and methods only become available once you have the installed version of Data Studio.
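    To make the two-step flow above concrete, here is a minimal sketch in Python. Everything in it is an assumption: the base URL, the `/datasets` path, and the payload fields are placeholders, since the real routes and request schemas live in your installation's Swagger documentation, not here.

    ```python
    import json

    # Hypothetical endpoint and payload shape -- check your installation's
    # Swagger docs for the real Create/Register routes and fields.
    BASE_URL = "https://datastudio.example.com/api/v1"

    def dataset_registration_payload(table_name, source_uri):
        """Build the metadata body used to establish one data source."""
        return {"name": table_name, "source": source_uri, "autotag": True}

    def build_bulk_requests(tables):
        """One registration request per table, ready to POST in a loop."""
        return [
            (f"{BASE_URL}/datasets", dataset_registration_payload(name, uri))
            for name, uri in tables
        ]

    # Feeding a list of tables (name, location) into the API, as discussed:
    tables = [("customers", "s3://bucket/customers.csv"),
              ("orders", "s3://bucket/orders.csv")]
    requests_to_send = build_bulk_requests(tables)
    print(json.dumps(requests_to_send[0][1]))
    ```

    The point of the sketch is the shape of the loop: your existing list of source tables becomes the input, and each entry turns into one metadata call, with tagging and profiling invoked as follow-on calls per registered source.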

    If you need anything specific then please let us know and we will try to get you preview content.


    Clinton

  • Hi @Clinton Jones, thanks for your suggestion! There is a Profile step in Workflows in Aperture Data Studio, but I haven't found anything like a Tag step there. What I know is that we can tag our data when we load them into Data Studio, or after loading we can click "Annotate Column" to edit further. Could you please explain how we could tag automatically in a workflow? Thanks!

  • Sueann See Administrator

    @Nathaniel you can use the Transform step to apply pre-defined tags to specific columns, but this is not auto-tagging, as it requires you to pre-define the mapping between a specific column and a tag.

    Example:

    When I auto-tag my Source dataset after loading or using Annotate Column, Customer id is not tagged as Unique id.


    In a workflow, I can connect a Transform step to the source.


    At the transform step, select Customer ID and apply a tag.


  • Sueann See Administrator
    edited July 1

    @Nathaniel @Phil Watt back to your scenario of "I have thousands of tables loaded into Aperture Data Studio Dataset, and I have to tag and profile all the tables"

    Auto-Tag

    There are only two ways to auto-tag datasets.

    • Annotate columns via the Datasets UI, either when you Add Dataset or after the Dataset has been created.
    • Use the Create Dataset REST API to load your datasets and auto-tag them. This assumes you can load all your data as .csv files into Amazon S3, because that is the only source we support at the moment.
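    As a rough sketch of the second option, the call might be assembled like this with Python's standard library. The URL, header names, and every field in the body are illustrative assumptions; the Create Dataset request schema in your Data Studio Swagger docs is authoritative.

    ```python
    import json
    import urllib.request

    # All field names and paths below are placeholders, not the real
    # Create Dataset schema -- consult your installation's Swagger docs.
    def create_dataset_request(base_url, api_key, name, s3_key):
        """Build a POST request that loads one S3 .csv and asks for auto-tagging."""
        body = json.dumps({
            "name": name,
            "source": {"type": "s3", "key": s3_key, "format": "csv"},
            "autotag": True,  # hypothetical flag: auto-tag on load
        }).encode("utf-8")
        return urllib.request.Request(
            f"{base_url}/datasets",
            data=body,
            headers={"Content-Type": "application/json",
                     "Authorization": f"Bearer {api_key}"},
            method="POST",
        )

    # Build (but do not send) one request per table in your list:
    req = create_dataset_request("https://datastudio.example.com/api",
                                 "MY_KEY", "customers", "customers.csv")
    ```

    Sending is then just `urllib.request.urlopen(req)` inside a loop over your thousands of tables, which is what makes this path practical at that scale.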


    Once you have auto-tagged your datasets, there are several ways you can determine which tags have been applied.

    • Via the UI: go to System > Data Tags > Show Tagged Data.
    • Use the List Datasets REST API, which will show you which tags have been applied to which columns within each Dataset.
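    For the second option, checking thousands of datasets by eye isn't realistic, so you would flatten the response into a report. The JSON layout below is an invented stand-in for whatever the List Datasets API actually returns; only the idea (walk datasets, walk columns, keep the tagged ones) carries over.

    ```python
    # Invented response shape -- the real List Datasets JSON will differ.
    sample_response = {
        "datasets": [
            {"name": "customers",
             "columns": [{"name": "Customer ID", "tags": ["Unique id"]},
                         {"name": "Email", "tags": ["Email address"]}]},
            {"name": "orders",
             "columns": [{"name": "Order Date", "tags": []}]},
        ]
    }

    def tagged_columns(response):
        """Flatten the response to (dataset, column) -> tags, keeping only tagged columns."""
        return {
            (ds["name"], col["name"]): col["tags"]
            for ds in response["datasets"]
            for col in ds["columns"]
            if col["tags"]
        }

    report = tagged_columns(sample_response)
    ```

    A report like this also makes it easy to spot the opposite case: columns you expected to be tagged (like Customer id above) that came back with no tags at all.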


    Profiling

    To profile thousands of datasets easily, the only practical way is probably to create a reusable workflow, with a workaround to cater for the dynamic schema. This assumes you know the maximum number of columns a dataset would have, and that you can use generic column headings for all your datasets.

    • Assuming your widest dataset has 3 columns, create a GenericSchema dataset that would represent this schema with generic column headings:


    • Create a workflow to do the profiling, assigning GenericSchema as the source but allowing the source to be replaced.
    • Execute the workflow via the REST API, which gives you the option to replace the source with another dataset.
    • Example 1: If you execute the workflow with a dataset that only has col1, the profile export will only contain col1.
    • Example 2: If you execute the workflow with a dataset that has col1 populated but null values for col2 and col3, the profile export will contain all three columns, with Completeness at 0% for col2 and col3.


    Note: If you do it this way, you may want to insert an additional Transform step before the Export to exclude the rows with Completeness = 0%.

    • Example 3: If you execute the workflow with a dataset that has col1, col2, col3 and col4, then col4 will be excluded since it does not match any column in the GenericSchema.


