What are the limits of data loading in Data Studio?

Clinton Jones (Experian Elite) · edited December 2023 in General

Back in June, I read an internal post by @Henry Simms suggesting that, in a test using v1.4.0, we had loaded 162 billion values into Data Studio, setting a new record for the largest single table loaded into Data Studio.

The test was run following a request for performance information on a very large, very wide file. We generated test data with over 81 million rows and 2,000 columns, with a value profile of 60% integers, 20% strings and 20% dates. The input CSV file took up 1.23 TB on disk, comprising over 162 billion values (where values = rows x columns).

Data Studio loaded this table in 19 hours on a fairly high-performance machine. That equates to around 2.2 million values per second, which meets our guideline expectation for load rate: typically from 1.5 million values/second in slower environments up to 3 million values/second on a fast machine with fast disks.
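
As a back-of-the-envelope check, the guideline rate can be turned into a rough load-time estimate. The Python sketch below is purely illustrative: it uses the row and column counts from the test above and the guideline throughput range, and actual timings will of course depend on hardware, data shape and Data Studio version.

    # Rough load-time estimate from the guideline throughput figures above.
    # Illustrative only: plug in your own row/column counts and rates.

    def estimated_load_hours(rows: int, columns: int, values_per_second: float) -> float:
        """Estimated load time in hours, where values = rows x columns."""
        total_values = rows * columns
        return total_values / values_per_second / 3600

    rows, columns = 81_000_000, 2_000  # the wide test file described above
    for rate in (1_500_000, 2_200_000, 3_000_000):  # slow box, observed, fast box
        hours = estimated_load_hours(rows, columns, rate)
        print(f"{rate:>9,} values/second -> ~{hours:.1f} hours")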

Henry indicated that this provided further evidence that Data Studio can handle very large volumes of data and that the load process scales linearly and predictably. The real limitation is the size of your disk.

Has more testing been done on later versions and how does this compare when the Find Duplicates or Address Validation steps are part of the mix?

Answers

  • I get the following error message when connecting to an SAP table (CDPOS) with 5.7 billion rows.

    [129]: transaction rolled back by an internal error: Search result size limit exceeded: 5729469259

    Does this mean there is a record limit defined by Aperture or is this limit prescribed by SAP?

  • Henry Simms (Administrator)

    Hi @George_Stephen, are you able to find the full error message in the server log file (datastudio.log)?

    This should indicate where the error is originating. You'll find the log file in the Data Studio repository's /data/log folder, by default C:\aperturedatastudio\data\log, although this may be located on a different drive letter.

  • Josh Boxer (Administrator)

    Hi George, I believe the limit is on the SAP side: "There is a limit per table/partition of 2 billion rows by the design of HANA. Determines the number of records in the partitions of column-store tables. A table partition cannot contain more than 2,147,483,648 (2 billion) rows." So you would need to import from SAP one partition at a time.

  • Thanks for the confirmation, Josh! While we are not able to filter in our current version, we will be upgrading to 2.7, and then we can use SQL to filter the data down to a subset of the table.
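
    For example, one way to stay under the limit is to run several range-filtered queries and load each result separately. The Python sketch below only builds the SQL strings; CDPOS and CHANGENR are used purely as illustrations, so substitute whatever column splits your own table evenly (a date or document-number range).

        # Illustrative only: split a very large SAP table into range-filtered
        # extracts so each query stays well under the row/result-size limits.
        # CHANGENR is a hypothetical choice of split key for CDPOS; pick a
        # column that divides your own data evenly.

        def range_queries(table, key, bounds):
            """Build one SELECT per consecutive [lower, upper) key range."""
            return [
                f"SELECT * FROM {table} WHERE {key} >= '{lo}' AND {key} < '{hi}'"
                for lo, hi in zip(bounds, bounds[1:])
            ]

        for query in range_queries("CDPOS", "CHANGENR",
                                   ["0000000000", "0500000000", "1000000000"]):
            print(query)  # run each query as a separate filtered import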