What are the limits of data loading in Data Studio?

Back in June, I read an internal post by @Henry Simms reporting that, in testing with v1.4.0, we had loaded 162 billion values into Data Studio, setting a new record for the largest single table loaded into the product.

The test followed a request for performance information on a very large and very wide file. We generated test data with over 81 million rows and 2,000 columns, with a value profile of 60% integers, 20% strings and 20% dates. The input CSV file took up 1.23 TB on disk and comprised over 162 billion values (where values = rows x columns).

Data Studio loaded this table in 19 hours on a fairly high-performance machine. That equates to roughly 2.2 million values per second, which meets our guideline expectation for load rate: typically from 1.5 million values/second in slower environments up to 3 million values/second for a fast box with fast disks.
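
For anyone who wants to redo the arithmetic, here is a rough Python sketch using only the rounded figures quoted in this post (81 million rows, 2,000 columns, 19 hours). The exact row count and load time are not given, so the result lands in the same range as the quoted 2.2 million values per second rather than matching it exactly. The function name is just something I made up for illustration.

```python
# Rounded figures taken from the post; the exact values are not known.
ROWS = 81_000_000
COLUMNS = 2_000
LOAD_HOURS = 19


def values_per_second(rows: int, columns: int, hours: float) -> float:
    """Load rate, where values = rows x columns."""
    return rows * columns / (hours * 3600)


total_values = ROWS * COLUMNS
rate = values_per_second(ROWS, COLUMNS, LOAD_HOURS)

print(f"{total_values:,} values loaded")
print(f"{rate / 1e6:.2f} million values/second")
```

Either way, the figure sits inside the 1.5 to 3 million values/second guideline band.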

Henry indicated that this was further proof that Data Studio can handle very large volumes of data and that the load process scales linearly and predictably. The real limitation is the size of your disk.
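
If the load really does scale linearly, a back-of-envelope estimate for any table follows from the guideline rates above. This is only a sketch under that assumption: it covers the load alone, ignores disk capacity, and `estimate_load_hours` is a hypothetical helper, not anything in Data Studio.

```python
# Guideline load rates quoted in the post (values per second).
SLOW_RATE = 1_500_000
FAST_RATE = 3_000_000


def estimate_load_hours(rows: int, columns: int, rate: float) -> float:
    """Estimated load time in hours, assuming time grows linearly with rows x columns."""
    return rows * columns / rate / 3600


# The record table from the post: 81 million rows x 2,000 columns.
slow_hours = estimate_load_hours(81_000_000, 2_000, SLOW_RATE)
fast_hours = estimate_load_hours(81_000_000, 2_000, FAST_RATE)

print(f"Estimated load time: {fast_hours:.0f} to {slow_hours:.0f} hours")
```

For the record table this gives a window of 15 to 30 hours, and the reported 19 hours falls inside it.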

Has more testing been done on later versions, and how do the results compare when the Find Duplicates or Address Validation steps are part of the mix?

