Aperture REST API calls as Datasets within Aperture

The REST API exposes data about Aperture that is not available as datasets within Aperture itself: full details of Spaces, User Groups, Users, Jobs, Datasets, etc.

We would like to use Aperture to manage Aperture.

For example: an automated workflow that checks for long-running jobs and notifies us of issues; a regular review of user activity that allows us to remove access from inactive users; and data quality stats that are already enriched with Space metadata and update automatically when a new Space is created.

We can see that there is already a JDBC driver which ought to be able to do this (Autonomous REST 6.0), but the configuration is highly complex. Ideally we would have a config file that sets up all the API's GET calls and makes them available as datasets, which we could then use in our Data Quality Management space (and activities).

Has anyone configured the Autonomous REST driver to access the Aperture API?

If not, could Experian provide a documented example setup, including a config file we can complete with the details of our own server?

Best Answer

  • Henry Simms Experian Super Contributor
    edited January 25 Answer ✓

    Hi @Dom Gittins, really glad you asked this question, because I think there's a lot of potential here for interesting solutions that use the native Autonomous REST JDBC driver with Data Studio's REST API to create useful management / admin dashboards.

    It can be a bit fiddly to set up the AutoREST connector so that the REST API's JSON responses are correctly parsed to tables - particularly when the response contains a nested array of values. In these cases, the driver normalizes the nested array to a child table which can be joined via a key from the parent table. An example of this is the /spaces endpoint, which returns an array of users and groups that have access to each space:
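To make the normalization concrete, here's a small Python sketch of what the driver is doing conceptually with a nested array like the one /spaces returns. The field names (`id`, `name`, `permissions`) are illustrative, not the driver's actual output.

```python
# Sketch: normalize a nested JSON array into a parent table and a child table
# joined by a key, the way the AutoREST driver handles /spaces.
# Field names here are illustrative assumptions.

def normalize(spaces):
    """Split space records into parent rows and child permission rows."""
    parents, children = [], []
    for space in spaces:
        parents.append({"ID": space["id"], "NAME": space["name"]})
        for perm in space.get("permissions", []):
            # Each child row carries the parent's key (SPACES_ID) for joining back.
            children.append({"SPACES_ID": space["id"], "USER_OR_GROUP": perm})
    return parents, children

spaces_json = [
    {"id": 1, "name": "Data Quality Management", "permissions": ["alice", "dq-admins"]},
    {"id": 2, "name": "Sandbox", "permissions": ["bob"]},
]
parents, children = normalize(spaces_json)
```

The same pattern applies to any endpoint whose response contains arrays-within-objects: one parent table, one child table per nested array, linked by the parent's key.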


    I'll show later how this is handled by Data Studio - to start with, here's how I've set up the External System to connect to Data Studio's API:

    1. Create an API key - Both the API key and the REST API calls you make are environment-specific, so if you have multiple environments you'll need to manage them with separate keys and tables. The API key only needs Read access to the endpoint you are interested in.
    2. Create a new External System using the Autonomous REST connector. We're going to use a .rest file to list the endpoints and the schemas for the resulting data. The settings you need for this are:
      1. Servername = the hostname of the Data Studio server. So if your API documentation is at https://local.datastudio/api/docs/index.html, the servername setting should be "https://local.datastudio"
      2. Rest config file: This will be the location on disk (on the Data Studio server) where we'll drop the attached .rest file
      3. Authentication = None (the auth is handled via the API call header rather than by the driver)
      4. Then create two user connection properties:
        1. SecurityToken = "<environment_label><space><API_Key>". This is the same way you would authenticate with the REST API directly
        2. AuthenticationMethod = HTTPHeader

    3. You'll need to create a Credential for the External System to use, but this can just be a dummy value because the External System itself holds the auth credentials.

    4. Now drop the attached .rest file into the location you defined in the External System's "REST config file" setting. Remove the ".txt" extension that I had to add to allow the file to be uploaded.
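Before wiring up the driver, you can sanity-check the header-based auth from step 2 directly, since the SecurityToken value is exactly what you'd send to the REST API yourself. This is only a sketch: the `Authorization` header name and the `/api/v1/spaces` path are assumptions here, so check your own instance's API docs for the exact values.

```python
# Sanity-check the "<environment_label> <API_Key>" token outside the driver.
# Header name and endpoint path are assumptions; verify against your API docs.
import urllib.request

def build_request(server, environment_label, api_key, path="/api/v1/spaces"):
    """Build a GET request carrying the environment-specific API key."""
    req = urllib.request.Request(server + path)
    req.add_header("Authorization", f"{environment_label} {api_key}")
    return req

if __name__ == "__main__":
    req = build_request("https://local.datastudio", "default", "my-api-key")
    # Uncomment to actually call your server:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read())
```

If this call returns JSON, the same environment label and key should work as the External System's SecurityToken property.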

    This .rest file defines the endpoints, and the schemas of the data they return, that you'll be able to load:

    To test the connection, go to Datasets -> Add Dataset and select the External System you have created. You should see the following list of tables generated by the .rest file's schema and available to load:

    Here we also see how nested arrays are normalized. The SPACES table has a child SPACES_PERMISSIONS table. The "ID" column in the SPACES table is the join key into SPACES_PERMISSIONS (where the column is called SPACES_ID). By joining these in a workflow, you'll get a view of all spaces and associated user & group permissions:

    JOBS also has a child table (key is the executionID), and DATASETS has a more complicated structure because a Dataset can include more than one Table, and a Table can have multiple batches of data, as well as multiple columns and tags.
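The SPACES-to-SPACES_PERMISSIONS join described above can be sketched in plain Python to show the shape of the result. In practice you'd use a Join step in a Data Studio workflow; the key columns (ID, SPACES_ID) are from the post, while the other column names are illustrative.

```python
# Sketch of joining the parent SPACES table to its child SPACES_PERMISSIONS
# table on ID == SPACES_ID. Non-key column names are illustrative assumptions.

spaces = [
    {"ID": 1, "NAME": "Data Quality Management"},
    {"ID": 2, "NAME": "Sandbox"},
]
spaces_permissions = [
    {"SPACES_ID": 1, "NAME": "alice", "TYPE": "user"},
    {"SPACES_ID": 1, "NAME": "dq-admins", "TYPE": "group"},
    {"SPACES_ID": 2, "NAME": "bob", "TYPE": "user"},
]

def join_spaces(spaces, permissions):
    """Inner-join permissions onto their parent space via the SPACES_ID key."""
    by_id = {s["ID"]: s for s in spaces}
    return [
        {"SPACE": by_id[p["SPACES_ID"]]["NAME"], "WHO": p["NAME"], "TYPE": p["TYPE"]}
        for p in permissions
        if p["SPACES_ID"] in by_id
    ]
```

The result is one row per (space, user-or-group) pair: the "view of all spaces and associated permissions" described above.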

Answers

  • Henry Simms Experian Super Contributor

    Some ideas for things you could do with this Data Studio metadata:

    1. Monitoring and email alerting on unusual login activity via the Events table
    2. Create dashboards showing all users' admin access, as a way of checking that permissions aren't too broad
    3. Analysis of Job execution times - checking and alerting if certain jobs are running in parallel.
    4. Finding where your large Datasets are (based on number of rows & columns) that may be hogging disk space
    5. Add in the Metrics operation endpoint and create a dashboard to monitor things like CPU usage, or page responsiveness
  • Danny Roden Administrator

    Great question @Dom Gittins (and nice response @Henry Simms).

    I've just given this a go myself and it worked a treat.

    Just as an FYI for anyone else giving this a go I found that the file you attached downloaded as a .txt (so it needed renaming to get rid of this extension).

    I'm currently using this approach to help @Sharat Kutty generate a list of all datasets within a 'master data' space and expose it as a list for other users to see what 'Enterprise Data' is available (i.e. the catalogue of data they can request access to, via a shared View). What's nice about this is that you can pull through all of the metadata, e.g. source type, column name, datatype, dataset rowcount, summary/description fields and more. It's great for encouraging consistent use of the software, in line with any defined best practices.

  • Henry Simms Experian Super Contributor
    edited January 27

    A couple of people have hit a problem where, after setting up the REST API connection, tables loaded from the new AutoREST system have 0 rows. This is accompanied by an error in the log file along the lines of:

    java.sql.SQLException: [ApertureDataStudio][AutoREST JDBC Driver][AutoREST]I/O Error com.experian.aperturedatastudio.autorest.util.aj: SSL handshake failed: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

    This is a site certificate error, and essentially indicates that Data Studio doesn't trust the API calls it's making to itself.

    With this set-up, Data Studio is acting as a client to its own API, but doesn't have its own certificate in its truststore, only in its keystore. That's quite likely to happen if you're using a certificate from a private CA.

    The fix is to add the public certificate from Data Studio into the cacerts truststore it uses. Steps:

    1.     Get the public certificate. You can grab this from the browser and download to a .cer file or similar

    2.     Run the following command from the command line to import it into the Aperture Data Studio truststore:

    "<JRE path>\bin\keytool" -import -trustcacerts -alias <alias> -file "<Your own Certificate>.crt" -keystore "<database path>\certificates\cacerts"

    Where:

    • <JRE path> is C:\Program Files\Experian\Aperture Data Studio 2.9.6\java64\jre; modify for your own version number and drive letter, if not C:\
    • <alias> is anything you want
    • <Your own Certificate> is the full path to your public certificate file
    • <database path> is e.g. D:\aperturedatastudio; again, you may be using a different drive letter
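Since the quoting and flag order in that command are easy to get wrong, here's a small helper that assembles the argument list from the placeholder values above. The paths in the example are just the sample values from this post; substitute your own install and database locations.

```python
# Assemble the keytool import command from the placeholders described above.
# Paths are example values; run the result with subprocess.run(cmd, check=True).

def keytool_import_cmd(jre_path, alias, cert_file, database_path):
    """Build the argument list for importing a certificate into cacerts."""
    return [
        rf"{jre_path}\bin\keytool",
        "-import", "-trustcacerts",
        "-alias", alias,
        "-file", cert_file,
        "-keystore", rf"{database_path}\certificates\cacerts",
    ]

cmd = keytool_import_cmd(
    r"C:\Program Files\Experian\Aperture Data Studio 2.9.6\java64\jre",
    "datastudio",
    r"C:\certs\datastudio.crt",
    r"D:\aperturedatastudio",
)
```

Note that keytool will prompt for the truststore password when the command runs, and Data Studio will need a restart to pick up the new trusted certificate.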

