Directory / batch input based on keyword
Good morning,
I wonder how you would handle this use case. I have a directory containing thousands of objects/datasets. I'd like to be able to load a selection of these datasets into an Aperture workflow based on a keyword contained in the file name. I will then join them up and process them together.
Any suggestions how to achieve this?
thank you!
Answers
Hi Emilia,
For clarity, when you say thousands of objects/datasets, do you mean files?
If so, are these files of the same type and structure, e.g. delimited files with the same columns?
If that is the case, then you could potentially do the following.
Load one of these files and specify that the Dataset should be multi-batch and have a dropzone; this will then create the dropzone for you.
As far as I am aware, you don't have any control over where the dropzone is created (it is created in the Aperture repository).
With a dropzone, any file that is copied there will be loaded automatically, as long as:
a) it is the same type as the file originally loaded (e.g. CSV), and
b) it has at least one column name in common with the original Dataset.
If you can copy just the files you need to the dropzone (e.g. using a script with a file-name filter, as sketched below), then in theory the files should automatically be loaded into the Dataset as multiple batches. Essentially, this means they will be unioned.
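For illustration, here is a minimal sketch of such a copy script. There is nothing Aperture-specific in it; the source directory, dropzone path, and keyword are placeholder assumptions, and the dropzone path would need to be wherever Aperture actually created it in the repository:

```python
# Minimal sketch: copy only the files whose names contain a keyword
# into the Dataset's dropzone. Paths and keyword are placeholders.
import shutil
from pathlib import Path

SOURCE_DIR = Path("/data/incoming")         # hypothetical staging directory
DROPZONE = Path("/aperture/repo/dropzone")  # hypothetical dropzone location
KEYWORD = "sales"                           # keyword to match in file names

for f in SOURCE_DIR.glob("*.csv"):
    if KEYWORD.lower() in f.name.lower():
        # Each copied file should be picked up as a new batch, provided it
        # matches the original Dataset's type and shares a column name.
        shutil.copy2(f, DROPZONE / f.name)
```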
What I don't know is whether this will work if you copy hundreds or thousands of files into the dropzone at the same time; I am not sure the dropzone logic was designed to handle that sort of volume.
However, this is the only approach I can think of.
If you can set up the file staging location as an External System, you can also use the dropzone with a "starts with" file pattern, so that it only watches for files meeting that naming criterion: https://docs.experianaperture.io/data-quality/aperture-data-studio-v2/get-started/configure-external-systems/#configureexternalsystemdropzone~cloud-storage-and-sftp
Thank you both. I'll give that a go.
Follow-up on the above, in case someone comes across a similar case: since we're storing the files in S3, I ended up creating an external table in Athena. I connected to the table from Aperture using a JDBC driver, which allowed me to load the data with a SQL query. A rough sketch of the Athena side is below.
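For anyone replicating this, here is a hedged sketch of the Athena setup, assuming boto3 and AWS credentials are available. The region, bucket, database, table, and column names are all hypothetical placeholders, and the DDL would need to match the actual file layout:

```python
# Hedged sketch: register the S3 files as an Athena external table.
# All names below (region, bucket, database, table, columns) are
# placeholders, not the actual setup from this thread.
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Hive-style DDL over the S3 prefix that holds the delimited files.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.keyword_files (
    id string,
    value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/datasets/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

Once the table exists, the keyword selection can live in the SQL that Aperture issues over JDBC: Athena exposes each row's source file through the "$path" pseudo-column, so a filter like WHERE "$path" LIKE '%keyword%' restricts the load to matching files.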