Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large JLSO files #75

Merged
merged 6 commits into from
Oct 28, 2020
Merged

Large JLSO files #75

merged 6 commits into from
Oct 28, 2020

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented Oct 24, 2020

Closing #21

Pros:

  • We can still use an existing file format like BSON for metadata
  • We should be able to support petabyte scale files
  • New format isn't really any less readable (objects were already just opaque serialized bytes)
  • Changes are limited the on-disk format and the read/write functions. (e.g., v3 and v4 input/output is the same)

Cons:

  • Logic for reading and writing is slightly uglier
  • Might be some safety concerns with just having a continuous block of bytes that folks need to read after reading the BSON header?

- Replaces "objects" dict with "object-names" and "object-nbytes" vectors which tell us how to read raw object bytes after the BSON doc.
- Write all object bytes as one contiguous block after writing the BSON doc
- Updated tests that previously through an InexactError when writing large objects/docs.
@rofinn rofinn changed the title Add tests for large BSON objects and docs WIP: Larger JLSO files Oct 24, 2020
@rofinn rofinn changed the title WIP: Larger JLSO files WIP: Large JLSO files Oct 24, 2020
@codecov
Copy link

codecov bot commented Oct 24, 2020

Codecov Report

Merging #75 into master will increase coverage by 2.66%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   94.90%   97.57%   +2.66%     
==========================================
  Files           6        6              
  Lines         157      206      +49     
==========================================
+ Hits          149      201      +52     
+ Misses          8        5       -3     
Impacted Files Coverage Δ
src/JLSO.jl 100.00% <ø> (ø)
src/JLSOFile.jl 90.47% <ø> (-3.28%) ⬇️
src/file_io.jl 94.33% <100.00%> (+5.45%) ⬆️
src/upgrade.jl 100.00% <100.00%> (+1.85%) ⬆️
src/metadata.jl 100.00% <0.00%> (ø)
src/serialization.jl 100.00% <0.00%> (+9.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03c30e5...be58e4c. Read the comment docs.

@oxinabox
Copy link
Member

Not reviewed properly, but this seems like a sensible way to store the data.

Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't for get to add files to the specimens.
https://github.com/invenia/JLSO.jl/tree/master/test/specimens

My main concern would be how does a generic BSON reader know when to stop reading the BSON header?
Do we have a generic BSON reader to test what it does?
Maybe a python one?
The sensible thing to do is to stop when every opening { has been closed.
But i worry that some might just read the whole thing and then freakout about extra struff.

src/file_io.jl Outdated Show resolved Hide resolved
src/file_io.jl Outdated Show resolved Hide resolved
src/file_io.jl Outdated Show resolved Hide resolved
const READABLE_VERSIONS = semver_spec("1, 2, 3")
const WRITEABLE_VERSIONS = semver_spec("3")
const READABLE_VERSIONS = semver_spec("1, 2, 3, 4")
const WRITEABLE_VERSIONS = semver_spec("3, 4")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 having thought about this for a bit, I think it makes sense to support wring v3 for a bit since it could be useful to allow porting data from v4 back to v3 for if for some reason you are constrained to reading it on a system that doesn't yet support readign v4 (i.e .that can't use this version of JLSO)

src/file_io.jl Outdated Show resolved Hide resolved
src/file_io.jl Show resolved Hide resolved
@@ -124,6 +124,7 @@ function _upgrade_env(pkgs::Dict{String, VersionNumber})
try
mktempdir() do tmp
Pkg.activate(tmp)
isempty(pkgs) && return _env()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary to get tests to pass on julia 1.5 / nightly. I can move it to a separate pull request though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have no idea what this is doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The require_not_empty was added to Pkg.jl for 1.5. This creates an early exit condition to work around that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This otherwise errors on the Pkg.add call in the catch below.

test/file_io.jl Outdated Show resolved Hide resolved
test/file_io.jl Outdated Show resolved Hide resolved
src/file_io.jl Outdated Show resolved Hide resolved
Copy link
Member

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This idea is all sound.
I can't think of any additional feedback I might need to give later.
Approved

@rofinn
Copy link
Member Author

rofinn commented Oct 26, 2020

Cool, thanks for reviewing @oxinabox I'll make those changes and merge.

@rofinn
Copy link
Member Author

rofinn commented Oct 26, 2020

My main concern would be how does a generic BSON reader know when to stop reading the BSON header?

The BSON header has it's own EOF byte that should tell any reader to stop reading the doc at the end of the header. That assumes the reader is following the BSON spec though. I'll add some tests using the python bson package since that seems like the easiest options for julia interop.

@oxinabox
Copy link
Member

I'll add some tests using the python bson package since that seems like the easiest options for julia interop.

They don't need to be CI tests.
Just doing a manual check once seems fine to me.
Adding CI tests about the interoperability of the BSON metadata seems beyond the scope of this PR, but also reasonable to do one day.
(Low priority though)

@rofinn
Copy link
Member Author

rofinn commented Oct 26, 2020

Just doing a manual check once seems fine to me.

Okay, yeah. I even have a little python snippet for extract raw object data from the file in python using the bson library. I've made a separate issue for that in #76

@rofinn rofinn changed the title WIP: Large JLSO files Large JLSO files Oct 27, 2020
@rofinn rofinn merged commit 191475c into master Oct 28, 2020
@rofinn rofinn deleted the rf/raw-objects branch October 29, 2020 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants