Slow data load
Morning. Last night I created a sample data set to test our de-duplication matching rules and blocking keys: 100,000 rows of random sample data plus 150 rows that I was confident should be matched by the system. I loaded the data into a SQL Server database and then tried to load it into Aperture, but found it was loading really slowly. By the time I left at 17:00 it had loaded approx. 40,000 rows. Coming back into work this morning, it is still running and is now at 930,000 rows. It is not a complex data set, with only a few columns. Has anyone seen this when loading data before, and do you have any suggestions on how to improve the performance?
P.S. There is no other activity on the system at present, and I set the priority to maximum after I found it was running slowly.
Poor load performance from a JDBC connection could be attributed to a great many things.
The most significant of these is the location of the source data: if it is a remote database, network latency will have a major impact on load times, as the network becomes the bottleneck. We have also found security permissions to sometimes be a root cause of poor performance when loading over JDBC.
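To see why latency dominates, a back-of-envelope sketch helps: a JDBC client fetches rows in batches, and each batch costs one network round trip. The fetch size and latency figures below are illustrative assumptions, not measured values from this system.

```python
def estimated_latency_cost_seconds(rows: int, fetch_size: int, rtt_ms: float) -> float:
    """Rough lower bound on time spent purely waiting on the network:
    each fetch of `fetch_size` rows costs one round trip of `rtt_ms`."""
    round_trips = -(-rows // fetch_size)  # ceiling division
    return round_trips * rtt_ms / 1000.0

# Illustrative: 100,000 rows, a small fetch size of 10 rows, and 5 ms
# round-trip latency -> 10,000 round trips, ~50 s of pure waiting.
print(estimated_latency_cost_seconds(100_000, 10, 5.0))   # 50.0
# The same rows with a fetch size of 1,000 would cost only ~0.5 s.
print(estimated_latency_cost_seconds(100_000, 1000, 5.0))  # 0.5
```

This is why the same table can load in seconds from a local tool but crawl over a chatty remote connection: the row count matters far less than the number of round trips.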
Is this new degraded behaviour? In other words, have you connected to this source previously and loaded large amounts of data in a timely manner and only now started to see a performance degradation?
Have you tested a similar pull of data using a different tool, such as DBeaver, MS SQL Server Management Studio (SSMS), or similar?
The SQL Server database is where we store all of the data that we load into Aperture, and as such I have previously loaded several million rows of data in much less time than this. Using SSMS the data loads almost instantaneously: 4 seconds for the full table.
Just as a check, I loaded a separate new table from the same source: 359,493 rows in 9 seconds. The other table is still loading, now at 998,800 rows. I will profile it when it finishes.
Actually... now that I have had coffee, I realise that I am more confused than ever. The table only has 100,150 rows. How is it at row 998,800?!
@stevenmckinnon Based on what you're saying, it seems most probable that you're seeing network latency as the problem. You could try a ping with a payload from the Data Studio server to the SQL Server to give you some indication of this.
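If ICMP is blocked between the two servers, timing a TCP connect to the SQL Server port gives a similar indication. This is a minimal sketch; the host name and port in the commented example are placeholders, not your actual server.

```python
import socket
import time

def tcp_connect_rtt_ms(host: str, port: int, attempts: int = 5) -> float:
    """Average TCP connection setup time in milliseconds -- a crude
    stand-in for ping when ICMP is blocked between two servers."""
    total = 0.0
    for _ in range(attempts):
        start = time.perf_counter()
        # Open and immediately close a connection; only the setup time matters.
        with socket.create_connection((host, port), timeout=5):
            pass
        total += time.perf_counter() - start
    return total / attempts * 1000.0

# Hypothetical address -- substitute your SQL Server host and port (1433 is
# the SQL Server default):
# print(tcp_connect_rtt_ms("sql-server.example.com", 1433))
```

A result in the tens of milliseconds per round trip would support the latency theory, given how many round trips a row-by-row fetch makes.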
If you check the Activity Monitor for that extraction session on the SQL Server, are you seeing any predominant wait types? What is the workload like on the SQL Server? You suggested that the Aperture Data Studio server is not seeing any high load...
Since your sample data is 100,000 rows and the load is reporting close to 1 million, it could be that the JDBC connection broke and the state is confused, potentially just looping.
I'd stop the download and start it again; with a small number of records (100k or so), you're better off restarting...
@stevenmckinnon I understand from @Karan Hunjan that once it went over a million rows loaded, you stopped the process, deleted the file from the server, and reloaded it, after which it loaded in a couple of seconds.
Thinking more on this, it is probably worth calling out that in the past we have seen issues when loading data from sources containing orphaned quotes. This has pretty much only occurred when loading files, not when using JDBC connections. A number of improvements were made in various releases to resolve this automatically, including some major modifications to the file parser.
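To illustrate what an orphaned quote does to a parser, here is a small sketch using Python's standard csv module (the data is made up): the unclosed quote swallows the line break, so two source lines collapse into one record and the parser's row counting goes wrong.

```python
import csv
import io

good = 'id,name\n1,"Smith, John"\n2,"Jones, Amy"\n'
bad = 'id,name\n1,"Smith, John\n2,"Jones, Amy"\n'  # orphaned quote on row 1

good_rows = list(csv.reader(io.StringIO(good)))
bad_rows = list(csv.reader(io.StringIO(bad)))

print(len(good_rows))  # 3 -- header plus two data rows
print(len(bad_rows))   # 2 -- the unclosed quote absorbed the line break
```

The same class of problem in a large file can make a loader's progress figures diverge wildly from the real row count.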
It is always useful to remember that the settings in the preview and configure step can be helpful here.
Glad it is resolved now!
We have also experienced similar problems loading large files. Sometimes restarting Aperture before starting a large load helps (perhaps it clears a cache?). Nige
@Ian Hayden I wonder if Nigel's experience ties into the GC discoveries you have recently made?
@Clinton Jones @Nigel Light Yes, it is possible that Data Studio (under the right conditions, due to memory leaks and/or other processing) could start to run out of memory, which will cause it to run much more slowly for all operations until it is restarted.
It definitely sounds like this might be the case. Version 1.6.1 should help, and there are further fixes in the pipeline that should fully resolve the problem.