Jobs with large lookup files

Nigel Light (Contributor)
edited December 2023 in General

At St. James's Place, before highlighting whether client details need to be quality checked, we need to establish whether the client has a current fund holding (as this is where we realise the maximum business benefit in correcting the data).

To do this we are faced with loading very large table(s), i.e. >100 million rows containing fund holding details, and then discarding the majority of the data, as we are only interested in the current position, i.e. the rows for the maximum snapshot date (<1 million rows).

We then use these only as lookup tables and do not otherwise use any of the data held.

We may also face licensing issues regarding large dataset sizes if there is no pre-load filter option.

Reading large table(s) into Aperture and then discarding most of the rows is quite an overhead, so we are considering whether a better solution is to run a SQL view to establish the required set of clients, save this as a file and load it into Aperture (or read the view directly in Aperture). This does introduce an overhead by running it outside of Aperture but, faced with the above issues, it is potentially the best solution.
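For illustration, a minimal sketch of the kind of SQL view described above, assuming a FUND_HOLDING table datestamped with a SNAPSHOT_DATE column (all table and column names are purely illustrative):

    -- Keep only the rows for the latest snapshot date, so Aperture loads
    -- the <1 million current rows rather than the full >100 million row history.
    CREATE VIEW V_CURRENT_FUND_HOLDING AS
    SELECT CLIENT_ID,
           FUND_CODE,
           SNAPSHOT_DATE
    FROM   FUND_HOLDING
    WHERE  SNAPSHOT_DATE = (SELECT MAX(SNAPSHOT_DATE) FROM FUND_HOLDING);

The view (or a file extracted from it) would then become the lookup source, rather than the full table.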

Is this need common amongst the Aperture community and could this type of functionality be considered for future releases? 

Comments

  • Clinton Jones (Experian Elite)
    edited August 2019

    This is a great question Nigel, and one that periodically comes up, particularly when the lookup set is a derived set, say from transactional data.

    I guess there are a couple of additional framing questions I would put to you:

    • How frequently does this lookup get refreshed?
    • Would it make sense to connect this source to a snapshot which has a transform ahead of it to sort and suppress columns/rows?

    My recommendation is that if you are pulling this data over a JDBC connection rather than from a file, you establish a view of that data on the database where it resides, and have that view set up as an indexed view so that it delivers the data to Data Studio pre-sorted.

    One of the performance challenges with lookups is that the data needs to be sorted on the lookup columns in order to perform well.
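    As a minimal illustration of this idea (SQL Server syntax, with purely illustrative names; an indexed view must be schema-bound and carry a unique clustered index, so this assumes one row per client/fund/snapshot date):

        -- Schema-bound view over the holdings table
        CREATE VIEW dbo.V_FUND_HOLDING_LOOKUP
        WITH SCHEMABINDING
        AS
        SELECT CLIENT_ID, FUND_CODE, SNAPSHOT_DATE
        FROM   dbo.FUND_HOLDING;
        GO

        -- The unique clustered index materialises the view and stores its rows
        -- sorted on the lookup columns, which is what the lookup benefits from.
        CREATE UNIQUE CLUSTERED INDEX IX_V_FUND_HOLDING_LOOKUP
            ON dbo.V_FUND_HOLDING_LOOKUP (CLIENT_ID, FUND_CODE, SNAPSHOT_DATE);
        GO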

  • Nigel Light (Contributor)

    Thanks Clinton

    Re: your questions

    i) The lookup table itself gets refreshed daily (though this might not be the frequency at which we will run this job).

    ii) Snapshot... are you suggesting the lookup table is loaded, filtered and saved as an Aperture snapshot, which is then used as the input to a lookup? We would then hold versions of this, which would soon grow in size to an unacceptable level?

    Thanks for the tip re: pre-sorting lookup tables.

  • Clinton Jones (Experian Elite)

    OK, if the lookup is to be refreshed regularly then you are quite correct: you could end up with a lot more data in Data Studio than you want. Additionally, it may open up a PII issue for you, which you likely want to avoid. I think the option to consider at this stage is the database-instantiated view.

    With v2.0 there will be some important data-handling improvements (views) that will make this easier and more flexible, but they will likely not entirely remove the concerns that you might have.

  • Nigel Light (Contributor)

    Looking forward to V2.x :-)

  • Danny Roden (Administrator)

    @Clinton Jones could another option here be to utilise a 'custom step' that queries the database directly?

  • Clinton Jones (Experian Elite)

    @Danny Roden for sure, we have seen some instances where custom steps are used to push and pull data from various sources without using Data Studio's native JDBC connector or parser. In v1.x we require a source to start a workflow, but that could just be a placeholder or an empty schema.

  • Henry Simms (Administrator)

    Hi @Nigel Light, just picking up on the problem you raised here around taking a snapshot of the large lookup table and old versions of that snapshot proliferating. This is something we've addressed in the upcoming v1.6.0 release, due out in mid-September, where we've introduced a 'maximum number of versions to keep' option, configurable per snapshot on the Take Snapshot step.

    For example, with the maximum set to three versions: when the workflow is run for the fourth time, creating v4 of the snapshot, a process is automatically triggered that deletes v1. This not only helps with disk space management and general snapshot proliferation, but can also be important for compliance reasons, where older versions of data should not be retained.

  • Nigel Light (Contributor)

    Hi Henry. That certainly sounds a great solution (and one that we will find most useful) but it is not really the issue that I am describing here. What I am highlighting is the need to read in a whole table of data, accumulated and datestamped over a period of time, and then apply a filter to retain only the most recent case. This could be retained as a snapshot (as described) and the 'number of versions to keep' solution will be useful. However, the main issue I am highlighting is the lack of a pre-filter, meaning the overhead of loading the whole file must be accommodated, which in the case of large files can take a lot of time (though definitely less than Pandora!), when it is only the most recent date/current row that is needed. Hence the suggestion to run the extract as SQL, reducing the volume to be loaded.
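    To make the suggestion concrete, a sketch of the kind of pre-filtering SQL extract meant here, keeping only the most recent row per client and fund (names are purely illustrative; standard window-function SQL):

        -- Rank each client/fund's rows by snapshot date, newest first, and keep
        -- only the latest, so only the current position needs to be loaded.
        SELECT CLIENT_ID,
               FUND_CODE,
               SNAPSHOT_DATE
        FROM  (SELECT CLIENT_ID,
                      FUND_CODE,
                      SNAPSHOT_DATE,
                      ROW_NUMBER() OVER (PARTITION BY CLIENT_ID, FUND_CODE
                                         ORDER BY SNAPSHOT_DATE DESC) AS RN
               FROM   FUND_HOLDING) LATEST
        WHERE  RN = 1;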

  • Henry Simms (Administrator)

    Hi @Nigel Light, thanks for the positive feedback. Yes, we totally understand that the snapshot version management doesn't solve your main problem, and we're working on a way to push filtering down to the DBMS (i.e. a pre-filter) so that only the records you want are pulled out and loaded. Watch this space!