Large JLSO files #75

rofinn · 2020-10-24T19:03:44Z

Closing #21

Pros:

We can still use an existing file format like BSON for metadata
We should be able to support petabyte scale files
New format isn't really any less readable (objects were already just opaque serialized bytes)
Changes are limited the on-disk format and the read/write functions. (e.g., v3 and v4 input/output is the same)

Cons:

Logic for reading and writing is slightly uglier
Might be some safety concerns with just having a continuous block of bytes that folks need to read after reading the BSON header?

- Replaces "objects" dict with "object-names" and "object-nbytes" vectors which tell us how to read raw object bytes after the BSON doc. - Write all object bytes as one contiguous block after writing the BSON doc - Updated tests that previously through an InexactError when writing large objects/docs.

codecov · 2020-10-24T20:03:56Z

Codecov Report

Merging #75 into master will increase coverage by 2.66%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   94.90%   97.57%   +2.66%     
==========================================
  Files           6        6              
  Lines         157      206      +49     
==========================================
+ Hits          149      201      +52     
+ Misses          8        5       -3

Impacted Files	Coverage Δ
src/JLSO.jl	`100.00% <ø> (ø)`
src/JLSOFile.jl	`90.47% <ø> (-3.28%)`	⬇️
src/file_io.jl	`94.33% <100.00%> (+5.45%)`	⬆️
src/upgrade.jl	`100.00% <100.00%> (+1.85%)`	⬆️
src/metadata.jl	`100.00% <0.00%> (ø)`
src/serialization.jl	`100.00% <0.00%> (+9.09%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03c30e5...be58e4c. Read the comment docs.

oxinabox · 2020-10-24T21:58:08Z

Not reviewed properly, but this seems like a sensible way to store the data.

oxinabox

Don't for get to add files to the specimens.
https://github.com/invenia/JLSO.jl/tree/master/test/specimens

My main concern would be how does a generic BSON reader know when to stop reading the BSON header?
Do we have a generic BSON reader to test what it does?
Maybe a python one?
The sensible thing to do is to stop when every opening { has been closed.
But i worry that some might just read the whole thing and then freakout about extra struff.

src/file_io.jl

oxinabox · 2020-10-26T14:55:48Z

src/JLSO.jl

-const READABLE_VERSIONS = semver_spec("1, 2, 3")
-const WRITEABLE_VERSIONS = semver_spec("3")
+const READABLE_VERSIONS = semver_spec("1, 2, 3, 4")
+const WRITEABLE_VERSIONS = semver_spec("3, 4")


👍 having thought about this for a bit, I think it makes sense to support wring v3 for a bit since it could be useful to allow porting data from v4 back to v3 for if for some reason you are constrained to reading it on a system that doesn't yet support readign v4 (i.e .that can't use this version of JLSO)

src/file_io.jl

oxinabox · 2020-10-26T15:07:52Z

src/upgrade.jl

@@ -124,6 +124,7 @@ function _upgrade_env(pkgs::Dict{String, VersionNumber})
    try
        mktempdir() do tmp
            Pkg.activate(tmp)
+            isempty(pkgs) && return _env()


why is this change here?

Necessary to get tests to pass on julia 1.5 / nightly. I can move it to a separate pull request though.

i have no idea what this is doing.

The require_not_empty was added to Pkg.jl for 1.5. This creates an early exit condition to work around that.

JuliaLang/Pkg.jl@87aab4a#diff-a43f827629f8ae12bbcbb218a293d325afcdda3a7f3c7016e2292840bf29ed1cR58

This otherwise errors on the Pkg.add call in the catch below.

test/file_io.jl

src/file_io.jl

oxinabox

This idea is all sound.
I can't think of any additional feedback I might need to give later.
Approved

rofinn · 2020-10-26T15:29:07Z

Cool, thanks for reviewing @oxinabox I'll make those changes and merge.

rofinn · 2020-10-26T15:58:21Z

My main concern would be how does a generic BSON reader know when to stop reading the BSON header?

The BSON header has it's own EOF byte that should tell any reader to stop reading the doc at the end of the header. That assumes the reader is following the BSON spec though. I'll add some tests using the python bson package since that seems like the easiest options for julia interop.

oxinabox · 2020-10-26T16:54:47Z

I'll add some tests using the python bson package since that seems like the easiest options for julia interop.

They don't need to be CI tests.
Just doing a manual check once seems fine to me.
Adding CI tests about the interoperability of the BSON metadata seems beyond the scope of this PR, but also reasonable to do one day.
(Low priority though)

…likely hit OOM errors.

… to extract object bytes in python-bson

rofinn · 2020-10-26T21:26:24Z

Just doing a manual check once seems fine to me.

Okay, yeah. I even have a little python snippet for extract raw object data from the file in python using the bson library. I've made a separate issue for that in #76

rofinn added 2 commits October 22, 2020 21:28

Add tests for large BSON objects and docs.

1706dcb

rofinn mentioned this pull request Oct 24, 2020

Crashes for very large dataframes #21

Open

rofinn changed the title ~~Add tests for large BSON objects and docs~~ WIP: Larger JLSO files Oct 24, 2020

rofinn changed the title ~~WIP: Larger JLSO files~~ WIP: Large JLSO files Oct 24, 2020

oxinabox reviewed Oct 26, 2020

View reviewed changes

oxinabox approved these changes Oct 26, 2020

View reviewed changes

rofinn added 2 commits October 26, 2020 14:34

Only run large object/files tests on 64-bit systems otherwise you'll …

bb344da

…likely hit OOM errors.

Include object positions in file to improve safety and make it easier…

b102438

… to extract object bytes in python-bson

rofinn added 2 commits October 27, 2020 11:48

Address review comments.

9f538ec

Add v4 specimens and add case for "upgrading" v4 files.

be58e4c

rofinn force-pushed the rf/raw-objects branch from 47cd4f7 to be58e4c Compare October 27, 2020 19:58

rofinn changed the title ~~WIP: Large JLSO files~~ Large JLSO files Oct 27, 2020

rofinn merged commit 191475c into master Oct 28, 2020

rofinn deleted the rf/raw-objects branch October 29, 2020 18:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large JLSO files #75

Large JLSO files #75

rofinn commented Oct 24, 2020 •

edited

Loading

codecov bot commented Oct 24, 2020 •

edited

Loading

oxinabox commented Oct 24, 2020

oxinabox left a comment

oxinabox Oct 26, 2020

oxinabox Oct 26, 2020

rofinn Oct 26, 2020

oxinabox Oct 26, 2020

rofinn Oct 27, 2020

rofinn Oct 27, 2020

rofinn Oct 27, 2020

oxinabox left a comment

rofinn commented Oct 26, 2020

rofinn commented Oct 26, 2020

oxinabox commented Oct 26, 2020

rofinn commented Oct 26, 2020

Large JLSO files #75

Large JLSO files #75

Conversation

rofinn commented Oct 24, 2020 • edited Loading

codecov bot commented Oct 24, 2020 • edited Loading

Codecov Report

oxinabox commented Oct 24, 2020

oxinabox left a comment

Choose a reason for hiding this comment

oxinabox Oct 26, 2020

Choose a reason for hiding this comment

oxinabox Oct 26, 2020

Choose a reason for hiding this comment

rofinn Oct 26, 2020

Choose a reason for hiding this comment

oxinabox Oct 26, 2020

Choose a reason for hiding this comment

rofinn Oct 27, 2020

Choose a reason for hiding this comment

rofinn Oct 27, 2020

Choose a reason for hiding this comment

rofinn Oct 27, 2020

Choose a reason for hiding this comment

oxinabox left a comment

Choose a reason for hiding this comment

rofinn commented Oct 26, 2020

rofinn commented Oct 26, 2020

oxinabox commented Oct 26, 2020

rofinn commented Oct 26, 2020

rofinn commented Oct 24, 2020 •

edited

Loading

codecov bot commented Oct 24, 2020 •

edited

Loading