Improve performance of qml.data.load() when partially loading a dataset #4674
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #4674 +/- ##
==========================================
- Coverage 99.64% 99.63% -0.01%
==========================================
Files 377 377
Lines 33999 33735 -264
==========================================
- Hits 33878 33613 -265
- Misses 121 122 +1
☔ View full report in Codecov by Sentry.
nice speed-up! to be clear, we used to only use (disk) caches when users requested a specific path, but now we always use one (in-memory), right? is there any loss of functionality by removing the option of specifying a cache path?
Thanks @brownj85, looks good overall! Just wondering how was the speedup verified?
Just by my own testing. I'm assuming @timmysilv and @obliviateandsurrender tested it as well.
Pretty much - I used |
sounds good. I didn't do any testing personally so I'm a bit curious as to what your own testing entailed. that said, I trust your judgement here and I'm happy with this change!
Tested with H2 and H2O. The performance improved significantly when downloading individual attributes 💪
Context:
Link to shortcut story
Loading individual attributes of datasets took much longer than loading a whole dataset. This is because the fsspec library was mapping the HDF5 reads directly to HTTP requests, which only loaded a few KB each.
Description of the Change:
open_hdf5_s3() now opens the remote dataset in read-buffered mode, which reads data in 8MB chunks into a memory-mapped cache. This results in far fewer requests and faster loading.
Benefits:
Acceptable performance for partial loading of large datasets. The download throughput for partial loading is now comparable to downloading the whole dataset (about 15-20% lower in MB/s).
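To illustrate why read-buffering cuts the request count, here is a minimal standalone sketch. It is not the PennyLane or fsspec code from this PR: CountingRemote, BlockCachedReader, read_sequentially and compare_request_counts are hypothetical names, the cache is a plain in-memory dict rather than a memory-mapped file, and the block size is scaled down from 8MB to 256 KiB so the example runs instantly.

```python
class CountingRemote:
    """Hypothetical stand-in for a remote file: each fetch() counts as one HTTP request."""

    def __init__(self, data: bytes):
        self._data = data
        self.requests = 0

    def fetch(self, start: int, end: int) -> bytes:
        self.requests += 1
        return self._data[start:end]


class BlockCachedReader:
    """Serves reads from fixed-size blocks, fetching each block at most once."""

    def __init__(self, remote: CountingRemote, block_size: int):
        self._remote = remote
        self._block_size = block_size
        self._blocks = {}  # block index -> bytes (assumption: a dict, not an mmap cache)

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        size = self._block_size
        first = offset // size
        last = (offset + length - 1) // size
        parts = []
        for index in range(first, last + 1):
            if index not in self._blocks:
                # One request per missing block, regardless of how small the read is.
                self._blocks[index] = self._remote.fetch(index * size, (index + 1) * size)
            parts.append(self._blocks[index])
        skip = offset - first * size
        return b"".join(parts)[skip:skip + length]


def read_sequentially(read, total: int, read_size: int) -> bytes:
    """Issue many small reads, similar in spirit to HDF5 reading a few KB at a time."""
    out = bytearray()
    for offset in range(0, total, read_size):
        out += read(offset, min(read_size, total - offset))
    return bytes(out)


def compare_request_counts():
    """Return (unbuffered_requests, buffered_requests) for the same read pattern."""
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data

    direct = CountingRemote(payload)
    read_sequentially(lambda off, n: direct.fetch(off, off + n), len(payload), 4096)

    buffered = CountingRemote(payload)
    reader = BlockCachedReader(buffered, block_size=256 * 1024)
    read_sequentially(reader.read, len(payload), 4096)

    return direct.requests, buffered.requests
```

With 4 KiB reads over a 1 MiB payload, the unbuffered path issues one request per read, while the block-cached path issues one request per 256 KiB block. The same effect, at 8MB granularity, is what the change relies on.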
Possible Drawbacks:
None
Related GitHub Issues: