Incrementally Loading Datasets
I've thought through a few different ways that I can use to incrementally load a dataset using batches in the dataset and views on the source side. However, I'm wondering if anyone else has done some work around that and would be willing to share their thoughts / ideas.
Essentially we have some really large tables which we want to use in workflows, potentially as a whole, but maybe only for a subset or sample. Rather than wait the full time for these tables to load, we want to just append the latest data.
For example, we might have a table with 1.2 billion records that covers the span of 12 months. Averages 100 million a month or somewhere over 3 million a day. Reloading this entire dataset each day may not even be possible; might take more than 24 hours. But appending 3 million records each day might just take 15 minutes or so and a much more reasonable time to wait for a dataset refresh before running workflows against it.
Thanks in advance,
The way to do this is using SQL https://community.experianaperture.io/discussion/760/aperture-data-studio-2-5-6
If you only want a subset or sample then ignore the above steps and use SQL to select only a subset or sample to load into Data Studio
Out of interest, what kind of data is this and broadly what are you doing with it once it is in Data Studio?
We have some ideas for additional functionality that could possibly be applicable
@George Brown I actually set something like this up myself recently, using an approach similar to the one Josh describes (fetching rows from your DB table using a custom SQL query).
I've described the basic I used process below, but note:
"last modified date" = todayinto a separate "changes" Dataset and then use a workflow to update the "master" Dataset accordingly.
1) Load all the data up to (but not including) today's date
My dummy database table looks like this:
My actual data has 7 rows, 2 of which were added today (15/9/2022):
So to load all records with a created_date < today's date, I select to load from a Dataset Query with:
I also configure the Dataset to be multi batch, so that new batches are added incrementally.
My Dataset has now loaded a single batch with 5 rows:
2) Modify the query so that only today's data is loaded
I can now right click -> Edit details on the Dataset and modify the SQL query used to fetch rows for the load to be:
Now when I refresh the Dataset in Data Studio, it will load today's 2 rows as a second batch:
3) Automate the daily refresh
To automate the refreshing of data, which I might want to run just before midnight every night, I need to do a couple of things: