How to optimise Validation, Filter, and Take Snapshot Steps
Can anyone help me with optimising the performance of the Validation, Filter, and Take Snapshot steps, and ideally also provide some benchmarks for the performance we should expect?
We have several datasets that we are cleansing and then separating into passing and failing records, based on whether each record contains enough valid data points for our matching process (we then Union the passing records into a single dataset for matching).
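For clarity, the logic of this branch is roughly equivalent to the minimal pandas sketch below (the column names, dataset contents, and output paths are placeholders for illustration only, not our actual configuration; the real workflow uses the built-in Validation, Take Snapshot, and Union steps):

```python
import pandas as pd

# Hypothetical stand-in for two of the cleansed input datasets; the schema
# and values here are invented purely for illustration.
datasets = {
    "dataset_a": pd.DataFrame({"record_id": [1, 2, 3], "is_valid": [True, False, True]}),
    "dataset_b": pd.DataFrame({"record_id": [4, 5], "is_valid": [True, True]}),
}

passing_parts = []
for name, df in datasets.items():
    passing = df[df["is_valid"]]    # records with enough valid data points
    failing = df[~df["is_valid"]]   # records that fail validation
    failing.to_csv(f"{name}_failing.csv", index=False)  # per-dataset snapshot of failures
    passing_parts.append(passing)

# Union of all passing records into a single dataset for the matching process
all_passing = pd.concat(passing_parts, ignore_index=True)
all_passing.to_csv("passing_union.csv", index=False)
```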
We are running an installation on a virtual machine using a Linux server with 8 cores and 62 GB of memory, which I have been monitoring during the processing of the validation steps. After an initial spike, the core usage trails off significantly and continues falling until there is barely any usage occurring.
Running a dataset of 26.4k records took 36 seconds to pass through the validation function, and 44 seconds in total to save the passing and failing datasets.
Running a dataset of 694.2k records took 11 minutes 39 seconds to pass through the validation function, and 12 minutes 08 seconds in total to save the passing and failing datasets.
Running a dataset of 1.1m records has so far reached 10% (as indicated by the "Show Jobs" function) after 50 minutes on the validation function, and it has not reached the snapshot stage yet.
We have been doing a lot of testing and are experiencing these problems whenever datasets start approaching 1m records, with both the validation function and the filter functions. For comparison, the first two runs work out at roughly 730 and 990 records per second through validation, but the 1.1m run is managing only around 35-40 records per second if the progress indicator is roughly linear.
The validation checks a single True/False column to determine which rows pass and which fail.
Please can you let me know what the performance benchmarks are for these functions, and whether there are any known issues that could cause this level of slow performance?