Incrementally Loading Datasets

George Brown · September 2022

I've thought through a few different ways that I can use to incrementally load a dataset using batches in the dataset and views on the source side. However, I'm wondering if anyone else has done some work around that and would be willing to share their thoughts / ideas.

Essentially we have some really large tables which we want to use in workflows, potentially as a whole, but maybe only for a subset or sample. Rather than wait the full time for these tables to load, we want to just append the latest data.

For example, we might have a table with 1.2 billion records that covers the span of 12 months. Averages 100 million a month or somewhere over 3 million a day. Reloading this entire dataset each day may not even be possible; might take more than 24 hours. But appending 3 million records each day might just take 15 minutes or so and a much more reasonable time to wait for a dataset refresh before running workflows against it.

Anyone?

Thanks in advance,

George

Josh Boxer · September 2022

Hi George

The way to do this is using SQL https://community.experianaperture.io/discussion/760/aperture-data-studio-2-5-6

https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/get-started/configure-external-systems/#loaddatausingasqlquery

Create a static Dataset that uses a SQL query limited to everything before today or an ID less than XYZ etc. (could even do this as a one time file transfer which will always be faster for that volume of records)
Create another Dataset with a SQL query today onwards or ID greater than XYZ
Use a Workflow to Union the two Datasets together with only the second Dataset Source step set to Auto refresh
At some point in future when the second Dataset is getting too large, either update both queries or add a third Dataset starting from that point and Union it in the same Workflow

If you only want a subset or sample then ignore the above steps and use SQL to select only a subset or sample to load into Data Studio

Josh Boxer · September 2022

Out of interest, what kind of data is this and broadly what are you doing with it once it is in Data Studio?

We have some ideas for additional functionality that could possibly be applicable

Henry Simms · September 2022

@George Brown I actually set something like this up myself recently, using an approach similar to the one Josh describes (fetching rows from your DB table using a custom SQL query).

I've described the basic I used process below, but note:

The SQL statements are illustrative only - I'm not an expert and there will certainly be better ways to create the queries!
The below would work if new records are only being created in your DB table. If records are also being updated in your database, you'll want to load the data with something like "last modified date" = today into a separate "changes" Dataset and then use a workflow to update the "master" Dataset accordingly.
The approach below assumes the database is always available and no refreshes are missed. In reality you may want to create something a little more robust that easily allows missed refreshes to be completed later, while to avoid duplication of records or missing records in your master table.

1) Load all the data up to (but not including) today's date

My dummy database table looks like this:

My actual data has 7 rows, 2 of which were added today (15/9/2022):

So to load all records with a created_date < today's date, I select to load from a Dataset Query with:

I also configure the Dataset to be multi batch, so that new batches are added incrementally.

My Dataset has now loaded a single batch with 5 rows:

2) Modify the query so that only today's data is loaded

I can now right click -> Edit details on the Dataset and modify the SQL query used to fetch rows for the load to be:

Now when I refresh the Dataset in Data Studio, it will load today's 2 rows as a second batch:

3) Automate the daily refresh

To automate the refreshing of data, which I might want to run just before midnight every night, I need to do a couple of things:

Set the Dataset to Allow auto-refresh.
Create and publish a workflow that refreshes the source. This may be as simple as a single source step:

Create a Schedule that runs the workflow:

Incrementally Loading Datasets

Answers

1) Load all the data up to (but not including) today's date

2) Modify the query so that only today's data is loaded

3) Automate the daily refresh

Categories