What are the limits of data loading in Data Studio?
Back in June, I read an internal post by @Henry Simms suggesting that, in a test, we had loaded 162 billion values into Data Studio using v1.4.0, setting a new record for the largest single table loaded into the product.
The test was run following a request for performance information on a very large, very wide file: we generated test data with over 81 million rows and 2,000 columns, with a value profile of 60% integers, 20% strings and 20% dates. The input CSV file took up 1.23 TB on disk, comprising over 162 billion values (where values = rows × columns).
Data Studio loaded this table in 19 hours on a fairly high-performance machine. That equates to 2.2 million values per second, which meets our guideline expectation for load rate: typically between 1.5 million values/second for slower environments and up to 3 million values/second for a fast box with fast disks.
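For anyone wanting to sanity-check a similar load, the "values = rows x columns" guideline is easy to turn into a rough estimate. A minimal sketch in Python; the two rates are just the guideline range quoted above, not a promise for any particular environment:

```python
# Rough load-time estimator based on the "values = rows x columns" guideline.
# The two rates are the guideline range quoted above (1.5M-3M values/second);
# real throughput depends heavily on the hardware, especially disk speed.

def estimate_load_hours(rows: int, columns: int, values_per_second: float) -> float:
    """Estimated load time in hours for a table of the given shape."""
    total_values = rows * columns
    return total_values / values_per_second / 3600

rows, columns = 81_000_000, 2_000   # shape of the test table described above
for rate in (1_500_000, 3_000_000):
    print(f"{rate / 1e6:.1f}M values/sec -> ~{estimate_load_hours(rows, columns, rate):.0f} hours")
```

At those rates the estimate brackets roughly 15 to 30 hours for the 162 billion value table, which is consistent with the 19 hours observed.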
Henry indicated that this provided further proof that Data Studio can handle very large volumes of data and that the load process scales linearly and predictably. The limitation is really the size of your disk.
Has more testing been done on later versions, and how does this compare when the Find Duplicates or Address Validation steps are part of the mix?
Best Answer
-
Throughput for the Validate Addresses step can vary significantly based on a number of variables, the most important of which is usually the quality of input addresses. When performance testing the Validate Addresses step in Data Studio, the team typically assembles input test data with around 80% good-quality addresses (either exact matches or full matches requiring only minor changes) and 20% lower-quality addresses, including partial, tentative and unmatched addresses. All searches are unique.
Given this, we typically see Address Validation throughput of between 2 million and 4 million records per hour against GBR and USA reference data with a single Validate Addresses step and optimised caching settings. When splitting your input and using multiple steps to parallelize the address cleaning process, GBR address cleans can be processed at well over 10 million records per hour. When cleaning against USA reference data we don't see the same gains from parallelization, but we are planning to resolve this in the near future.
There are a number of ways to optimise your workflow and set-up to gain performance improvements for Validate Addresses, including filtering out very poor addresses (for example, those with blank city, state and zip fields) before the clean, or splitting the input and parallelizing the clean through multiple Validate Addresses steps. There are also several server settings that can help to optimise performance in specific cases.
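The pre-filtering idea is simple enough to sketch outside Data Studio if you want to prepare the file before it reaches the workflow. A minimal example assuming a CSV input with city, state and zip columns; the file and column names are placeholders to adapt to your own data:

```python
import pandas as pd

# Hypothetical file and column names - adjust to match your own input.
df = pd.read_csv("input_addresses.csv", dtype=str)

# Treat a record as "very poor" when city, state and zip are all blank; these
# rows are unlikely to clean successfully and only slow the step down.
key_fields = ["city", "state", "zip"]
blank = df[key_fields].apply(lambda col: col.isna() | col.str.strip().eq("")).all(axis=1)

df[~blank].to_csv("addresses_to_clean.csv", index=False)   # feed to Validate Addresses
df[blank].to_csv("addresses_skipped.csv", index=False)     # review or handle separately
```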
Note that the Validate Addresses step builds a global persistent cache, which means that if you search on the same input address a second time, the result is almost instant. For example, cleaning 1 million GBR addresses through a Data Studio workflow took 16 minutes, but re-running the workflow with almost the same input (a 2% variation in the input addresses) took 3 minutes.
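To illustrate why re-runs are so much faster: the cache behaves like any persistent memoised lookup, so only previously unseen addresses pay the full search cost. This is not Data Studio's internal implementation, just a small sketch of the idea using Python's shelve module:

```python
import shelve

def clean_address(raw: str) -> str:
    """Stand-in for the expensive address search; a real run would hit the reference data."""
    return raw.strip().title()

def clean_with_cache(addresses, cache_path="address_cache"):
    """Only previously unseen addresses pay the full search cost; the cache survives re-runs."""
    results = []
    with shelve.open(cache_path) as cache:
        for addr in addresses:
            key = addr.strip().upper()          # simple normalisation for the cache key
            if key not in cache:
                cache[key] = clean_address(addr)
            results.append(cache[key])
    return results
```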
Answers
-
I get the following error message when connecting to an SAP table (CDPOS) with 5.7 billion rows.
[129]: transaction rolled back by an internal error: Search result size limit exceeded: 5729469259
Does this mean there is a record limit defined by Aperture, or is this limit prescribed by SAP?
-
Hi @George_Stephen, are you able to find the full error message in the server log file (datastudio.log)?
This should indicate where the error is originating. You'll find the log file in the Data Studio repository's /data/log folder, by default C:\aperturedatastudio\data\log, although this may be located on a different drive letter.
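If the log file is large, you don't need to trawl it by hand; a few lines like the following (with the path adjusted to your repository location) will pull out the relevant entries:

```python
from pathlib import Path

# Default repository location mentioned above - change the drive/path if yours differs.
log_file = Path(r"C:\aperturedatastudio\data\log\datastudio.log")

with log_file.open(encoding="utf-8", errors="replace") as f:
    for line_no, line in enumerate(f, start=1):
        if "Search result size limit exceeded" in line:
            print(f"{line_no}: {line.rstrip()}")
```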
-
Hi George, I believe the limit is on the SAP side: "There is a limit per table/partition of 2 billion rows by the design of HANA. Determines the number of records in the partitions of column-store tables. A table partition cannot contain more than 2,147,483,648 (2 billion) rows." So you would need to import from SAP one partition at a time.
-
Thanks Josh! While we are not able to filter in the current version, we will be upgrading to 2.7 and then we can use SQL to filter the data to get the subset of the table. Thanks for the confirmation!
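For anyone hitting the same limit before they can upgrade, the workaround is simply to pull the table in slices small enough to stay under the HANA ceiling, whether you run the SQL through Data Studio's connection or extract it outside first. A rough sketch assuming the SAP HANA Python client (hdbcli); the connection details and the OBJECTCLAS chunking column are illustrative placeholders only:

```python
import csv
from hdbcli import dbapi   # SAP HANA Python client; assumed installed and reachable

# Placeholder connection details - replace with your own HANA host and credentials.
conn = dbapi.connect(address="hana-host", port=30015, user="EXTRACT_USER", password="changeme")

# Pull CDPOS in slices so no single result set approaches the 2 billion row ceiling.
# OBJECTCLAS is used purely as an example chunking column; any selective column
# (or reading one partition at a time) follows the same pattern.
object_classes = ["MATERIAL", "EINKBELEG", "VERKBELEG"]   # illustrative values only

with open("cdpos_subset.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    cur = conn.cursor()
    for objectclas in object_classes:
        cur.execute("SELECT * FROM CDPOS WHERE OBJECTCLAS = ?", (objectclas,))
        while True:
            rows = cur.fetchmany(50_000)   # stream in batches to keep memory flat
            if not rows:
                break
            writer.writerows(rows)
    cur.close()
conn.close()
```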