How to optimise Validation, Filter, and Take Snapshot Steps

BAmos Member
edited December 2023 in General

Hey folks,

Can anyone help me with optimising performance on Validation, Filter, and Take Snapshot Steps, and hopefully also provide some benchmarks for expected performance?

We have several datasets that we are cleansing and then separating into passing and failing records, based on whether they contain enough valid data points for our matching process (we then Union the passing records into a single dataset for this).

We are running an installation on a virtual machine, a Linux server with 8 cores and 62 GB of memory, which I have been monitoring while the validation steps are processing. After an initial spike, the core usage trails off significantly and continues falling until there is barely any usage occurring.

Running a data set of 26.4k records took 36 seconds to pass through the validating function, and 44 seconds in total to save the passing and failing datasets.
Running a data set of 694.2k records took 11 minutes 39 seconds to pass through the validating function, and 12 minutes 08 seconds in total to save the passing and failing datasets.
Running a data set of 1.1m records has reached 10% (as indicated by the "Show Jobs" function) and has currently been running for 50 minutes on the validating function; it has not reached the snapshot stage yet.

We have been doing a lot of testing and are experiencing these problems whenever data sets start approaching 1m records, both when using the validation function and the filter functions.

The validation checks a single True/False column to determine passing and failing rows.

Please can you let me know what the performance benchmarks are for these functions, and if there are any known issues that can cause this level of slow performance.

Many thanks,

Ben

Answers

  • Josh Boxer Administrator
    edited June 2023

    Hi Ben

    Thanks for your post. There are a number of factors that can impact the processing time, including the complexity of the Workflow, compression, other processes running simultaneously, anti-virus/scanning software, etc. As a rough benchmark, we have a Workflow used for testing on a medium-spec machine that can Validate 200 rules on 100M rows in ~10 minutes.

    Here are some specs/times for a complex 1M-row Workflow that is mostly Validate: https://docs.experianaperture.io/solutions/uk-credit-processing/getting-started/system-sizing-recommendations/#1-million-accounts-submited-10-million-accounts-on-master-file

    It is definitely something we would like to help you investigate and optimise. Can you please post a screenshot of the Job details page for the 700k records in 12 mins? Also, if you can export the Workflow and share the .dmx file, we can take a look:
    https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/monitor-data-and-troubleshoot/migrate-metadata/#exportfromaspace

  • BAmos Member

    Hi Josh, thanks for coming back to us on this so quickly.

    As requested, I've taken a screenshot of the Job Details for the 700k dataset (workflow "07.1 - Data Cleaning & Evaluation Process_JG_Donations"). I have also now seen that our 1m record dataset finished processing overnight, so I have taken that screenshot too (workflow "01.1 - Data Cleaning & Evaluation Process_EPWS_Donations"); this one took 3hr 56min to run the Validation step.

    I have exported the .dmx files for these too; can you please let me know the best way to send all of this across to you? Email or file transfer would work for us.

    Thanks,
    Ben

  • Josh Boxer Administrator

    We have upgraded you! You should now be able to post images and upload files here.

  • BAmos Member

    Wonderful!

    Here are the screenshots, and the .dmx is attached.

  • Josh Boxer Administrator

    Thanks for sharing. The Workflow is complicated by the fact that it calls other Workflows that cleanse phone numbers, emails and addresses.

    Address validation is one of the lengthiest steps, as it runs quite a few processes behind the scenes, including standardization to improve the number of matches.

    Validate is one of the most performant steps (in your case you could even replace the Validate step completely with a Split step to simplify the Workflow).
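
    Just to illustrate why that step itself should be cheap (this is only a rough Python sketch with made-up column names, not how Data Studio implements it), a pass/fail split on one True/False column is a single-pass partition of the rows:

        import pandas as pd

        # Hypothetical data: "is_valid" stands in for your True/False validation column
        df = pd.DataFrame({
            "record_id": range(6),
            "is_valid": [True, False, True, True, False, True],
        })

        # Validate/Split on a single boolean column is just a one-pass partition,
        # so it should scale roughly linearly with row count
        passing = df[df["is_valid"]]
        failing = df[~df["is_valid"]]

    The time you are seeing is therefore dominated by the upstream cleansing Workflows, not the validation itself.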

    If you separate your Workflows by writing the results of Cleanse to a Snapshot, then use this Snapshot as the source of this Workflow, I predict the 1M rows will complete in < 5 mins (let us know if not); there is a rough sketch of the idea below.
    Note that an Automation can make one Workflow trigger another.
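
    As a sketch of that staging idea (the file names, column name and library here are assumptions for illustration, not Data Studio specifics): persist the expensive cleanse output once, then run the quick validation pass from that materialised output instead of recomputing the cleansing on every run.

        import pandas as pd

        def cleanse(raw: pd.DataFrame) -> pd.DataFrame:
            # Placeholder for the expensive phone/email/address cleansing stage
            return raw

        # Workflow 1: run the cleanse once and persist the result (the "Snapshot")
        raw = pd.read_csv("donations_raw.csv")  # hypothetical source file
        cleanse(raw).to_csv("cleansed_donations.csv", index=False)

        # Workflow 2: the Validate/Split reads the materialised snapshot, so the
        # cheap pass/fail partition no longer re-triggers the cleansing each run
        cleansed = pd.read_csv("cleansed_donations.csv")
        passing = cleansed[cleansed["is_valid"]]
        failing = cleansed[~cleansed["is_valid"]]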

    Once that change is made, I think you might have a new question about how long it is taking to cleanse your data, which will be an interesting conversation: improving the quality and consistency of the data before validation can significantly reduce the processing time.