Best practice for checking that Datasets have been refreshed

Henry Simms (Administrator)

A user asked this question:

We have hundreds of datasets we need to refresh before a migration. What is the best practice in ADS to make sure they have all been updated? Without adding a source in a workflow for every dataset, can you iterate through the desired datasets and look at the metadata to determine when each was updated?

In this case, data is being refreshed from an External System.

Answers

  • Henry Simms (Administrator)

    Before discussing best practice, let's look at the various options.

    Refreshing data

    There are three ways to refresh a Dataset from an External System:

    1. Manual: Right-click the Dataset → Refresh. Must be done one Dataset at a time.
    2. Refresh on workflow execution: When running a workflow that includes one or more Datasets, by selecting the "Refresh sources" option. Workflows can be executed manually, from the REST API (see the sketch after this list), or via a Schedule. All Datasets used in a given workflow can be refreshed at once if desired.
    3. Refresh on a Schedule: Configure a Schedule with one or more "Data refresh" tasks. The Schedule can be configured to refresh multiple Datasets at once. This last option is new in v2.15, and there's a demo video on the release notification page.
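
    As an aside on option 2: triggering a workflow execution from the REST API is just an authenticated HTTP request. Here's a minimal Python sketch of the idea; the base URL, endpoint path and payload shape are placeholders for illustration only, so check the API documentation on your own Data Studio instance for the exact details:

    ```python
    import requests

    # Assumptions: the base URL, endpoint path and payload below are illustrative
    # placeholders, not the documented Data Studio API. Check the API docs
    # (e.g. the Swagger page) on your own instance for the real paths.
    BASE_URL = "https://datastudio.example.com/api/v2"
    HEADERS = {"Authorization": "Bearer your-api-key"}  # key generated in the Data Studio UI

    def execute_workflow(workflow_id: int) -> dict:
        """Trigger a workflow execution. The workflow's source steps should have
        'Refresh sources' enabled so its Datasets are refreshed before it runs."""
        resp = requests.post(
            f"{BASE_URL}/workflows/{workflow_id}/execute",  # hypothetical path
            headers=HEADERS,
            json={},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(execute_workflow(workflow_id=123))
    ```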

    Detecting refresh failures

    A Dataset refresh from an External System could fail for a number of reasons, including:

    • Source system is temporarily unavailable because of a network outage
    • The credentials used to authenticate with the source system are no longer valid
    • The table you're fetching data from has been modified (renamed, or schema changed so that mandatory columns have been removed) or permissions have changed (no access)

    Let's look at how such failures could be detected for each of the three refresh options above:

    1. Manual: Here, you'll see the error on-screen if refresh fails.
    2. Refresh on workflow execution: Use the "Stop execution when refresh failed" option in the workflow source steps, as described in the answers and comments on this Community answer.
    3. Refresh on a Schedule: There are a couple of approaches:
      1. Use an Automation to detect a "Dataset automatic load failed" event (with or without a filter for specific Datasets), and carry out a defined Action, for example sending a notification.
      2. When using the Schedule option, "Dataset refresh" and "Workflow run published" task types can be combined. Tasks run in sequence, and subsequent tasks only run if previous tasks have completed successfully. As a result, if you configure your "Dataset refresh" tasks first in the Schedule, you can be confident that follow-on workflow executions won't happen unless all Dataset refreshes succeed.

    Note on failures and warnings

    So far we've discussed failures, but there is also a lower-severity category of "Dataset refresh" warnings, where the data is still refreshed but the user is warned.

    Error example: the Dataset is not refreshed.

    Warning example: the Dataset is refreshed.

    Warning events are distinguished from Failure events, which means you can handle them in different ways.

    Best practice for large numbers of Datasets

    Discarding the manual refresh option as not viable for complex or automated solutions, I consider both the "2: Refresh on workflow execution" and "3: Refresh on a Schedule" approaches to be good options.

    However, since the recent release of the "3: Refresh on a Schedule" option, I have typically used it for refreshing large numbers of Datasets, for these reasons:

    1. Schedules are easy to create and update (if new Datasets are being added), and tasks can be re-ordered through simple drag-and-drop
    2. Schedules are easy to execute manually, on a regular configured cadence, or via an API call from an external orchestration system.
    3. Schedule tasks are sequential, and so later tasks won't run if an earlier one fails.
    4. Combining the refresh Schedule with Automations to notify of failure, or with "try again" logic (see the sketch after this list), provides lots of options for reporting on or handling any errors that do occur.
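
    To illustrate points 2 and 4, here's a minimal Python sketch of how an external orchestration system might run a Schedule and retry on failure. The endpoint paths, response fields and status values are assumptions for illustration only; the exact calls depend on your Data Studio version, so check the API documentation on your instance:

    ```python
    import time
    import requests

    # Assumptions: the endpoint paths, the "executionId"/"status" fields and the
    # status values are illustrative placeholders, not the documented API.
    BASE_URL = "https://datastudio.example.com/api/v2"
    HEADERS = {"Authorization": "Bearer your-api-key"}

    def run_schedule_once(schedule_id: int, poll_secs: int = 30) -> str:
        """Start a Schedule execution and poll until it finishes, returning its final status."""
        start = requests.post(f"{BASE_URL}/schedules/{schedule_id}/execute",  # hypothetical path
                              headers=HEADERS, timeout=60)
        start.raise_for_status()
        execution_id = start.json()["executionId"]  # hypothetical field

        while True:
            result = requests.get(f"{BASE_URL}/executions/{execution_id}",  # hypothetical path
                                  headers=HEADERS, timeout=60)
            result.raise_for_status()
            status = result.json()["status"]
            if status in ("FINISHED", "FAILED"):  # hypothetical status values
                return status
            time.sleep(poll_secs)

    def run_schedule_with_retry(schedule_id: int, attempts: int = 3) -> bool:
        """Simple "try again" logic: re-run the Schedule if an execution fails."""
        for attempt in range(1, attempts + 1):
            if run_schedule_once(schedule_id) == "FINISHED":
                return True
            print(f"Schedule run attempt {attempt} failed")
        return False
    ```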

  • Henry Simms (Administrator), edited October 24

    There's an alternative approach to those described above, but it doesn't use "out-of-the-box" functionality so I'm including it as a separate comment.

    It is possible to bring in responses from the Data Studio REST API as datasets by using the Autonomous REST connector. This is somewhat tricky to set up, but the linked article should get you most of the way.

    By querying the /datasets (get all datasets) endpoint we can bring in the DATASETS, DATASETS_TABLES and DATASETS_TABLES_BATCHES data, and combine and transform it to pull out the last refresh time for a given group of Datasets.

    You could then perform validation in Data Studio on the time loaded (refresh time of the latest batch) to detect any refresh issues.

    The workflow I've used to generate this view combines and transforms those three API-sourced Datasets.

    This approach most closely matches the original question's ask of being able to "look at the metadata to determine when it was updated".

    I wouldn't want to describe this approach as best practice, but it does provide a way of efficiently bringing Data Studio's metadata into Data Studio itself, and allowing it to be queried or used for decision-making.
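
    If you'd prefer to run the same check outside Data Studio rather than via the Autonomous REST connector, the idea translates to a small script against the /datasets endpoint. Here's a minimal Python sketch, assuming a placeholder base URL, API key and response field names; inspect the actual payload from your instance for the real keys, and note that the latest batch load time may sit under DATASETS_TABLES_BATCHES rather than a top-level field:

    ```python
    from datetime import datetime, timedelta, timezone
    import requests

    # Assumptions: the base URL, auth header and the "name"/"lastUpdated" fields
    # are placeholders for illustration; inspect your instance's /datasets response.
    BASE_URL = "https://datastudio.example.com/api/v2"
    HEADERS = {"Authorization": "Bearer your-api-key"}

    def stale_datasets(names: set, max_age_hours: int = 24) -> list:
        """Return the Datasets in the given group whose last refresh is older than the threshold."""
        resp = requests.get(f"{BASE_URL}/datasets", headers=HEADERS, timeout=60)
        resp.raise_for_status()
        cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)

        stale = []
        for ds in resp.json():
            if ds.get("name") not in names:
                continue
            # "lastUpdated" is an assumed ISO-8601 timestamp field; your payload may
            # instead expose the refresh time of the latest batch.
            loaded = datetime.fromisoformat(ds["lastUpdated"].replace("Z", "+00:00"))
            if loaded < cutoff:
                stale.append(ds["name"])
        return stale

    if __name__ == "__main__":
        # Example: flag any of these Datasets not refreshed in the last 12 hours
        print(stale_datasets({"Customers", "Addresses"}, max_age_hours=12))
    ```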