Delta Data Loads - Data Studio version 2.0

Carolyn Super Learner

Hi

I'm new to this forum, so I hope this makes sense.

We have just gone live with Aperture Data Studio Version 2.0.

I have a scenario where we do a full customer data extraction from source (a one-off at the beginning), load it as the dataset and build a workflow.

Then every load of data after this will be a delta file: any new customers and any changes to existing data.

Fields won't be supplied unless there is a change for an existing active customer.

For example - full extraction load:

ID 1, Sue Smith, 23 Smith Street, Camberwell 3030 is part of the full file

ID 2, Brian Smith, 25 Smith Street, Camberwell 3030 is also part of the full file

Next load

Sue's data changes due to getting married and changing address

Brian remains as is

John is a new customer

The file sends deltas:

ID 1, Sue Johnson, 23 Betty Street, Mont Albert 3031 is part of the delta file

ID 3, John Black, 27 Smith Street, Camberwell 3030 (new customer)

I can send the full line or just the fields that changed.

I don't want Sue added again; I want it to override the original data loaded in the full file, and to add John.

I tried multi-batch, but this just keeps appending each load to the last one, which we don't want. We want to keep a full, active database matching the source each day, with changes overriding the existing data, without sending full files.


Can anyone offer suggestions, please?

Thanks

Carolyn

Answers

  • Akshay Davis Administrator

    @Carolyn Welcome to the forum!

    It looks like what you'd want to do is use a multi batch dataset along with a snapshot to update the latest view of the data.

    Create a multi batch dataset, but when you do, also select "allow automatic batch deletion". This option will allow the workflow to clear out data once it's processed.


    You will then create two workflows. The first does a simple initial movement of the data: it takes the data loaded initially and creates a snapshot in a new dataset, which can be updated in the future. You will also need to select "Delete batches on completion" on the source; this deletes all the rows which have been successfully processed. Also, create a new dataset for the output; this new dataset will be single batch only.

    At the end of running this initial workflow, the "Maintained Dataset" will contain all the original records, and "Original Input" will be empty.

    Next, you will need to create a workflow to process deltas, assuming a delta record is always either an amendment to an existing record or a new record.

    First, you will use two joins on the ID field between the Original Input (now containing deltas) and the Maintained Dataset. The first is an inner join to determine which records are updates; the second is an outer join to determine which records are new and which are unchanged.
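
    To make the join step concrete, here is a rough sketch of the same split in Python (pandas). The column names and the partially filled delta row are assumptions based on the example at the top of the thread, not your actual schema, and in Data Studio this is done with Join steps rather than code:

        import pandas as pd

        # The maintained (full) view and an incoming delta batch, using the IDs
        # from the example above. Column names are illustrative only.
        maintained = pd.DataFrame({
            "ID": [1, 2],
            "givenName": ["Sue", "Brian"],
            "surname": ["Smith", "Smith"],
            "address": ["23 Smith Street", "25 Smith Street"],
            "suburb": ["Camberwell", "Camberwell"],
            "postcode": ["3030", "3030"],
        })
        delta = pd.DataFrame({
            "ID": [1, 3],
            "givenName": [None, "John"],   # only the changed fields are supplied
            "surname": ["Johnson", "Black"],
            "address": ["23 Betty Street", "27 Smith Street"],
            "suburb": ["Mont Albert", "Camberwell"],
            "postcode": ["3031", "3030"],
        })

        # Inner join on ID: rows present on both sides are updates to existing records.
        updates = maintained.merge(delta, on="ID", how="inner", suffixes=("", " 1"))

        # IDs only in the delta are new customers; IDs only in the maintained view
        # are unchanged and are carried forward as they are.
        new_records = delta[~delta["ID"].isin(maintained["ID"])]
        unchanged = maintained[~maintained["ID"].isin(delta["ID"])]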


    The output from the Updated Records join would go into a transformation. Each row will now contain the original record and the delta record side by side. You can create a single reusable function which checks whether the delta column is populated and, if it is, updates the original.

    You can then apply this to all the columns from the "Maintained Dataset", where Column 1 is the maintained dataset column and Column 2 is the delta column.

    The output from this transform is then unioned with the output of the other join, and the extra columns are removed prior to outputting to the same "Maintained Dataset". This will overwrite the original data with the updated information.
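
    Continuing the Python sketch, the update-and-union step would look roughly like this. It assumes blank delta fields arrive as nulls, and the final frame is what would be written back over the "Maintained Dataset":

        value_columns = ["givenName", "surname", "address", "suburb", "postcode"]

        # For each column pair, keep the delta value where it is populated and
        # fall back to the maintained value where it is blank.
        for col in value_columns:
            updates[col] = updates[f"{col} 1"].combine_first(updates[col])

        # Remove the extra "<column> 1" join columns so the shape matches the
        # Maintained Dataset, then union updated, unchanged and new records.
        updates = updates[["ID"] + value_columns]
        maintained = pd.concat(
            [updates, unchanged, new_records[["ID"] + value_columns]],
            ignore_index=True,
        ).sort_values("ID")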


  • Clinton Jones Experian Super Contributor

    @Carolyn congratulations on going live!


  • Carolyn Super Learner

    Hi Akshay

    Appreciate the quick response. I see what you are doing with all these steps.

    I am struggling with the transform step, though.

    Update columns: if the delta field is null, you take the original field; if the delta field is not null, you replace the original with the delta field.

    Makes sense; however, how do I do this step for all columns? Do I have to do a transform per column? Are you able to show me how I would do it for, say, 5 columns and what the transform step would or should look like? I just keep getting "invalid" if I add another is null > if/else step to the same transform for each column comparison.

    Thank you for making the time. This has been very helpful

    Thanks

    Carolyn

  • Akshay Davis Administrator

    You are correct; you will need to add a transform per column. Unfortunately, as this function isn't a simple one with only one input, you can't apply it to all columns in one go.

    So, for each column in the original file, you will add a transform with a single function that looks like this:

    givenName is from the dataset to maintain, and givenName 1 is from the delta file as it appears on the right side of the join. This produces an output as shown below, and you can see the result changes where givenName 1 is populated.


    You then need to do the same for all the columns from the Maintained Dataset.

    displayName is my unique identifier for this file, so it does not have a transform applied.
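
    If it helps to see the per-column rule outside the tool, the function being repeated is essentially the following (plain Python with made-up values; the real thing is the single Data Studio function applied once per column pair):

        def apply_delta(original, delta):
            # Single reusable rule: keep the original unless the delta is populated.
            return original if delta in (None, "") else delta

        # Applied once per column pair, e.g. givenName / givenName 1, surname / surname 1, ...
        row = {"givenName": "Sue", "givenName 1": None,
               "surname": "Smith", "surname 1": "Johnson"}
        merged = {col: apply_delta(row[col], row[f"{col} 1"])
                  for col in ["givenName", "surname"]}
        # merged == {"givenName": "Sue", "surname": "Johnson"}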

    I've attached the Data Studio Metadata Exchange (dmx) file, with the function, datasets and workflow for this scenario so you can see how this works.

    Import these into a test space to review how I've set it up.

    Hopefully that works for you, but let me know if not.

  • Carolyn Super Learner

    Thank you. Sorry, I have a simple question. I have downloaded it; how do I import this into the space?

  • Carolyn Super Learner

    It's okay, I found it in the help. Thank you, I will give this a go.

  • Akshay Davis Administrator

    Navigate to 'System' in the left-hand menu, then select 'Spaces'.

    From the list, find the space you want and, from the menu option on the right, select 'Import'.

    The wizard will then ask you for the path to the .dmx file, and then which workflows, datasets and functions you'd like to import. Import them all for this example.
