Suggest Validation rules - Validate step

Josh Boxer
Josh Boxer Administrator
edited December 2023 in Tips and tricks

Aperture Data Studio has a Validate workflow step that allows you to define rules of how data in specific columns should be populated. This information can then be used to be alerted to any data quality problems or to track data quality improvements over time.


These rules can be simple:

  • should not be null
  • should be a number
  • should be a string in a specific format
  • should match from a list of value
  • etc.

They can also be more complex using Functions to say transform values before being validated or use either/or options where multiple different values are acceptable. As long at the Function returns a True/False value at the end it can be used for Validation.

The easiest way to get started with Validation is to use Suggest validation rules


Steps

  1. Open a Dataset or View and click Explore
  2. Click Profile, select the columns you want to profile and create validation rules for then Apply
  3. Click the Suggest validation rules button
  1. Specify the Pass at and Fail below percentages (this can also be edited later in the Validate step)
  2. Review the rules that have been suggested for each column
  3. Unselect any rules that are not useful and click Apply


  1. On the Validate step that has been created, click Rules to see what has been created
  2. Note that the rules are grouped by column (making it easy to disable or delete any rules no longer required)
  3. Add rule or edit rule will open the Function editor allowing any changes to be made

Finally, your complex Functions written specifically for your data can be added to improve future Suggest validation rules . Any Function returning a True/False value can be updated on the details page to be suggested when relevant:

Feel free to ask any questions or give any feedback on this functionality

Comments

  • If I only wanted to have a rule applied if there was a value, how was i do this? (In Pandora I had to use a table filter)

  • Josh Boxer
    Josh Boxer Administrator

    Hi James

    You might need to provide some further detail on what you are trying to achieve, but take a look at the Function If then else, which 'Checks the condition field for the value of true. If it's true, returns the condition met field, otherwise else.'

    Feel free to reply if you have further questions

    -Josh

  • Hi. For non-mandatory fields, I only want the validation performed on a field that has a value. Otherwise, it gives false/positive results. Seems like a check-box would be simplest. "Only run validation on values entered" or something.

  • Josh Boxer
    Josh Boxer Administrator
    edited October 2021

    Hi James

    I still do not have much to work with here. You have a column with some values in it, some of these values are null. You could add a Filter step to remove these rows before the data is validated.

    Alternatively, If then else I mentioned earlier can handle this scenario as it allows multiple conditions


    If you are using the validate step specifically then it requires the output to be either True or False. So you need to decide if you want to treat NULLs as either True or False. Lets assume true

    If there are only a few values you could list them in an Equals Function and have NULL as one of these, so that these return true


    Another alternative is to use an Or:

    Hope one of these options helps, if not then please share more information about the data being validated and what you want to acheive when validating a NULL.

  • Hi, I tried the 'if then else' function, but it did not work (I changed from null to blank, as i don't have nulls, I have blanks).

    Either I am missing something obvious, or there is something obvious missing. Given that creating validation rules is now more efficient and easier than in Pandora, there should be some option at the group or rule level that allows "only validate on values".

    For example, I have 100 rows, and one of the columns is "email address", it is non-mandatory, so I have only collected 10 values. If I run a validation on that column (to check email format), I am getting 10 rows passed, 90 rows failed, a 10% row pass rate. However, rows with no values should not be evaluated. It should be 10 rows passed, 0 rows failed for a 100% pass rate.

    hope this makes sense.

  • Sorry, the short and easy example is in Pandora - in the Quality Rule window, there was a checkbox for "Ignore Nulls?", and that ignored nulls (blanks) accordingly, and the rule results were not skewed.

  • Josh Boxer
    Josh Boxer Administrator

    I will go and look at what was in Pandora, but in your example I would use the Or Function, if email address is non mandatory then I would treat nulls as True only flagging invalid values as False.

  • The check-box in Pandora was very quick, I would expect the same in Aperture. I don't want to manually add a bunch of OR functions into what could be hundreds of validation checks.

  • Josh Boxer
    Josh Boxer Administrator

    I found an existing request for this functionality that I have added your 'vote' to, so we will try to prioritize the request further. For now you would need to create multiple rules or a more complex rule as described above.

  • Thank you Josh.

  • Hi Josh - any update from the team on when to expect the above functionality?


    thanks

  • Josh Boxer
    Josh Boxer Administrator

    Hi James, the teams are currently engaged with some higher priorities right now, but it is still on my list and I will be sure to raise it again in a few weeks to see if if can get some traction. So, no ETA at this time. If you do not hear any updates then feel free to comment again.

    Any users reading this, please comment how you would utilise this functionality or how it would improve your results. Thanks

  • Josh Boxer
    Josh Boxer Administrator

    Hi James, the latest release version 2.7.6 includes this functionality, so hopefully it makes your rule building easier from now on.

  • thanks for the update

  • Suggest rules does not seem to suggest rules for min-max thresholds or at least my data does not seem to trigger such suggestions.

    Would it be possible to add min-max suggestions for content, at least to those that are numeric or datetime etc?

    In the long run, if there is a lot of suggestions, people might get overwhelmed by them. Therefore, if you add a lot of suggestion types and logics, there might be a need for disable/enable rule categories. In practice I mean something like this:
    - generate basic rules, such as mandatory values and data types. Yes/No.
    - generate content rules, such as patterns, reference lists.
    - generate thresholds, such as valid characters, min-max thresholds etc.
    - generate cross-column dependency rules, column A (create date) must be smaller than Column B (closed date).
    - Etc.

    The more complex rules you go, the trickier they are to recognize and validate to be relevant. However, I think at least the 3 first should be mostly easy and straight forward to generate and adjust manually. Its often easy to spot them from profiling min-max results. Purchases far in the future or people born centuries ago or vice versa.

    Logical dependencies between columns and their content gets fast very tricky. I'm not sure how efficient it would be generate those but they might help to catch some anomalies or deliberate workarounds in manually inserted customer services, orders, purchases etc.