How to achieve dynamic delta dataset refresh?
Hi All,
We would like to setup a daily table refresh from an SAP database into Aperture, and we want to only pull the delta (incremental) records.
To achieve this, we need the extraction to be dynamic. The system should somehow determine and add to the query for example, the date of the last record (for example, by calculating the MAX date from the existing data, or get the last change nr.) and use that date to filter the incoming records during the new dataset refresh.
Could you please advise on how to achieve this? How can we pass a dynamically calculated value into the dataset refresh step? What is the recommended best practice or is there a workaround?
Now we are using Talend, but the best would to have a native solution, so we do not have to use anything external.
Thank you for your guidance!
Kind regards,
Vera
Answers
-
Hello
There are a few ways to achieve this. You could create a View in the Database to include only the data that should be refreshed and select this dbView as the source of the Dataset. Alternatively you could use a SQL query to pull in only the desired records.
- Create a Dataset using a SQL query to Select everything but today's date from the table and set the Dataset to Add (rather than Overwrite). Save the Dataset and let it load all records.
- Update the SQL query to select only records with today's date
SELECT * FROM table_name WHERE date_column = CURRENT_DATE;
[note SQL needs to match the flavour of SAP you are connecting to] - Create a Schedule to refresh the Dataset daily
Data Quality user documentation | Dataset batches1 -
Hi Josh,
Okay, yes we have date in this dataset source, so we can use the date, but what if a table does not contain a date but only a record IDs? It is trickier to query only the new records in this case.
Anyway (after a join) there is a date now, so we will do what you suggested
Thanks!
Kind regards,
Vera
0 -
If you dont have a date to filter on then really would need to know the reason you want to do a delta before suggesting, but assuming the reason is it is a large table and you dont want to refresh the same data then
- create a dataset with a query ID less than 50M (do not refresh this dataset once imported)
- create dataset2 with a query ID greater than 50M (refresh daily)
- Union datasets together
- once dataset2 gets large that refresh is taking longer than desired update 50M filter in both datasets
Another option if you are dealing with large database tables is to use Pushdown processing and not import any data
0
