Introduce compression #11

Merged
merged 11 commits into from
May 31, 2019

oxinabox
Member

This PR will close #7

Right now all it does is rename :serialize to :julia_native
and make sure we are set up to handle version changes.

We should probably add a sample file to the repo (it can be very small)
to test that we can load files serialized by old versions.

@codecov

codecov bot commented May 24, 2019

Codecov Report

Merging #11 into master will decrease coverage by 7.31%.
The diff coverage is 86.04%.

@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
- Coverage     100%   92.68%   -7.32%     
==========================================
  Files           5        5              
  Lines          54       82      +28     
==========================================
+ Hits           54       76      +22     
- Misses          0        6       +6
Impacted Files          Coverage Δ
src/JLSO.jl             100% <ø> (ø) ⬆️
src/file_io.jl          100% <100%> (ø) ⬆️
src/metadata.jl         100% <100%> (ø) ⬆️
src/JLSOFile.jl         93.75% <75%> (-6.25%) ⬇️
src/serialization.jl    83.87% <82.14%> (-16.13%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@iamed2
Member

iamed2 commented May 24, 2019

Why was it renamed? It's not clear that :julia_native means "uses serialize".

@rofinn
Member

rofinn commented May 24, 2019

I'd suggest :julia_serialize if we're going to rename it.

@oxinabox
Member Author

oxinabox commented May 24, 2019

The problem with serialize is that it doesn't actually specify which serialisation algorithm is used.
It is ambiguous with the general concept of serialization.
All of these formats are serializers.

So I figured that since we are changing the file version, this is a chance to correct it.
Other options could be julia_base_serialize, base_serialize, etc.
It needs to convey that it is a specific form of serialisation from the Julia base library.

@oxinabox
Member Author

Anyway, as of now this has everything needed to work with TranscodingStreams compressors.
Which do we want to support, and with which compression levels?
I also think we should make compressed the default,
because disk time (especially if pushing to S3) is much more expensive than the time taken for lightweight compression. (Also, later we may end up doing this in another thread. This kind of thing is ideal for fork-based parallelism, but that is not an option, sadly.)

to :julia_serialize
@oxinabox
Member Author

oxinabox commented May 24, 2019

Right, here are some stats.
This is running on an existing financial_data.jlso file.
Timing is for the second of 2 calls, and includes the time to write to disk on my laptop.
Which should be the default compression?
I am leaning towards the default gzip level, which is still faster than :none because the smaller file size means fewer slow writes to disk.

┌ Info: Original
└   size_kb = 626
┌ Info: Whole Compressed
│   compression = :gzip
│   time = 0.025815
└   size_kb = 45

┌ Info: 
│   format = :julia_serialize
│   compression = :none
│   time = 0.055351557
└   size_kb = 606
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip
│   time = 0.046644187
└   size_kb = 51
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip_fastest
│   time = 0.035092568
└   size_kb = 54
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip_smallest
│   time = 0.073955492
└   size_kb = 50
┌ Info: 
│   format = :bson
│   compression = :none
│   time = 0.702540057
└   size_kb = 4840
┌ Info: 
│   format = :bson
│   compression = :gzip
│   time = 0.726010517
└   size_kb = 182
┌ Info: 
│   format = :bson
│   compression = :gzip_fastest
│   time = 0.667815146
└   size_kb = 218
┌ Info: 
│   format = :bson
│   compression = :gzip_smallest
│   time = 0.871860139
└   size_kb = 183
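
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of measuring size and time at different gzip levels. It assumes the CodecZlib package (built on TranscodingStreams) is installed. The payload, the `gzip_bytes` helper, and the mapping of levels 1/6/9 to "fastest"/default/"smallest" are illustrative assumptions, not the script that produced the numbers above.

```julia
using CodecZlib  # assumed dependency; gzip codecs built on TranscodingStreams

# Hypothetical stand-in payload: repetitive bytes, so it compresses well.
payload = repeat(Vector{UInt8}("date,price,volume\n2019-05-24,101.25,3000\n"), 2_000)

# Hypothetical helper: gzip `data` at the given zlib level.
# Wrapping a readable stream in a compressor stream means that
# reading from it yields the compressed bytes.
function gzip_bytes(data::Vector{UInt8}; level::Integer)
    stream = GzipCompressorStream(IOBuffer(data); level=level)
    try
        return read(stream)
    finally
        close(stream)
    end
end

# Levels are an assumption: 1 (fast), 6 (a common default), 9 (small).
for level in (1, 6, 9)
    elapsed = @elapsed compressed = gzip_bytes(payload; level=level)
    @info "gzip" level elapsed size_kb = length(compressed) ÷ 1024
end

# Round trip to confirm nothing was lost.
restored = transcode(GzipDecompressor, gzip_bytes(payload; level=6))
```

This only times in-memory compression; the figures in the comment above also include disk writes, so the two are not directly comparable.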

@oxinabox oxinabox changed the title from "[WIP] Introduce compression" to "Introduce compression" May 25, 2019
jlso.objects[name] = take!(io)
# need to close buffer so any compression can write end of body stuffs.
close(compressing_buffer)
jlso.objects[name] = buffer.data # can't use take! as stream is now closed
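
The reason the stream has to be finished before the bytes are read is that a gzip compressor only writes its trailer when it is told the data has ended. As a sketch of one alternative to closing and then reading `buffer.data` (not what this PR does), TranscodingStreams lets you write `TOKEN_END` and flush, which leaves the underlying buffer open so `take!` still works. The variable names mirror the diff above; the payload is a made-up example, and CodecZlib is an assumed dependency.

```julia
using CodecZlib, TranscodingStreams  # assumed dependencies

buffer = IOBuffer()
compressing_buffer = GzipCompressorStream(buffer)
write(compressing_buffer, "some object bytes"^100)

# TOKEN_END marks the end of the data so the codec finishes the gzip stream;
# flush then pushes the remaining compressed bytes into `buffer`.
write(compressing_buffer, TranscodingStreams.TOKEN_END)
flush(compressing_buffer)

compressed = take!(buffer)   # `buffer` is still open, so take! is usable
close(compressing_buffer)    # releases the codec and closes `buffer`

restored = String(transcode(GzipDecompressor, compressed))
```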
@oxinabox
Member Author

Default is now :gzip.
The JLSO file format is now v2.0; we can still load v1.0 files but can no longer write them.
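
A rough sketch of what "read v1.0 and v2.0, write only v2.0" can look like. Every name here (`READABLE_VERSIONS`, `WRITEABLE_VERSION`, `check_readable`, `check_writeable`, `upgrade_format`) is hypothetical and not taken from the package. The assumption that v1.0 files carry the old `:serialize` format name comes from the rename discussed earlier in this thread.

```julia
# Hypothetical constants and helpers; illustrative only, not JLSO's API.
const READABLE_VERSIONS = (v"1.0", v"2.0")
const WRITEABLE_VERSION = v"2.0"

function check_readable(version::VersionNumber)
    version in READABLE_VERSIONS ||
        throw(ArgumentError("unsupported file version $version"))
    return version
end

function check_writeable(version::VersionNumber)
    version == WRITEABLE_VERSION ||
        throw(ArgumentError("can only write $WRITEABLE_VERSION files, got $version"))
    return version
end

# Assumption: v1.0 files predate the rename, so map the old name on load.
upgrade_format(version::VersionNumber, format::Symbol) =
    (version < v"2.0" && format === :serialize) ? :julia_serialize : format
```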

src/serialization.jl
#)
bson = (
deserialize! = first ∘ values ∘ BSON.load,
serialize! = (io, value) -> bson(io, Dict("object" => value))
Member

Why "object"?

Member Author

@oxinabox oxinabox May 28, 2019

Because it is an element of JLSOFile.objects,
and we only put it into a Dict here because that is how the BSON API likes it.
This Dict is never visible to the user, except when they access the BSON directly.

Member

That sounds fine, but what changed to make this only necessary now?

Member Author

I made bson and julia_serialize work with the same interface.
BSON used to do Dict(name => value),
which was redundant, as the name is already stored as the key in the parent of this.

Member

Ah, does bson only accept a Dict?

Member Author

Correct. There is probably some internal function we could call instead.
But it would also make it much harder to keep loading v1 JLSO files so...
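
To make the thread above concrete, here is a minimal sketch of the single-entry Dict wrapper, assuming the BSON.jl package is installed. `bson_serialize` and `bson_deserialize` are hypothetical helper names that loosely mirror the lambdas in the diff; they are not the package's functions.

```julia
using BSON  # assumed dependency

# Hypothetical helper: BSON writes documents (key/value maps), so a bare
# value is wrapped in a one-entry Dict under the fixed key "object".
function bson_serialize(value)
    io = IOBuffer()
    BSON.bson(io, Dict("object" => value))
    return take!(io)
end

# Hypothetical helper: take the only value back out, whatever its key is.
# This is also why a v1-style Dict(name => value) would load the same way.
bson_deserialize(bytes::Vector{UInt8}) =
    first(values(BSON.load(IOBuffer(bytes))))

restored = bson_deserialize(bson_serialize([1.0, 2.0, 3.0]))
```

Reading back with `first ∘ values` rather than looking up a key is what keeps the loader indifferent to which key was used when the file was written.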

Co-Authored-By: Eric Davies <[email protected]>
Member

@rofinn rofinn left a comment

These changes seem reasonable, but I do have a suggestion for code readability.

@rofinn
Member

rofinn commented May 28, 2019

Looks like the tests are failing because of the sample legacy files being loaded. I'd recommend just hard-coding the legacy metadata in the Julia test code rather than saving files that will be Julia-version and architecture dependent.

@oxinabox
Member Author

I have tests that directly check the legacy metadata, but those would not catch changes in how it is interpreted (e.g. changing :bson to not pre-encode objects as 1-element Dicts).

Not testing that we can still load files feels bad.
I would rather make a list of allowed_failures that depends on architecture and Julia version.

@rofinn
Member

rofinn commented May 29, 2019

You can still test all of those things without needing to save binary files in the git repo. Just manually serialize the old structure to an IOBuffer and try loading it. I feel like the only thing that the binary files might catch are changes to the backend serialization format (e.g., saved using an old version of the bson library).
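
As a sketch of the in-memory approach suggested here, using only the Serialization and Test standard libraries. The `legacy` Dict is a made-up stand-in for an old file layout, not the real v1 JLSO structure, and the round trip goes through Base serialization rather than the package's own loader.

```julia
using Serialization, Test

# Hypothetical legacy layout: a plain Dict standing in for a v1-style file.
legacy = Dict(
    "metadata" => Dict("version" => v"1.0", "format" => :serialize),
    "objects" => Dict("data" => [1, 2, 3]),
)

# Write the old structure to an in-memory buffer instead of a file on disk.
io = IOBuffer()
serialize(io, legacy)
seekstart(io)

# Load it back and check that it is still interpreted the same way.
loaded = deserialize(io)
@test loaded["metadata"]["version"] == v"1.0"
@test loaded["objects"]["data"] == [1, 2, 3]
```

Because the bytes are produced by the running Julia version, a test like this cannot catch changes in the on-disk encoding between versions or architectures.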

@oxinabox
Member Author

oxinabox commented May 29, 2019

Generating the old structure would be super easy to screw up though.
It would share a lot of the code with the package itself.

An end-to-end integration test gives me much more confidence that it is correct.

@oxinabox
Member Author

oxinabox commented May 29, 2019

Also, I would really like to know which things, when serialized on one platform, can't be loaded on another.
So I am strongly inclined to keep those real tests, so I know what breaks.
I am going to push a branch that turns some into warnings, and then we can reassess.

@oxinabox
Member Author

Ok, I am pretty sure this is actually a bug in the JLSO format.
#12

So for now I have allowed failures on x86.
(I could do this in the script rather than in the config, if preferred.)

But in any case this convinces me that it is completely worth having these kinds of tests.

@oxinabox
Member Author

On testing BSON on data from 32-bit and 64-bit systems, it handles it fine.
So this is not #12.

@rofinn
Member

rofinn commented May 29, 2019

> An end-to-end integration test gives me much more confidence that it is correct.

In that case, I'd recommend saving all the different permutations in a datadeps just for testing rather than storing binary files in the repo.

@oxinabox
Member Author

> In that case, I'd recommend saving all the different permutations in a datadeps just for testing rather than storing binary files in the repo.

I would agree, if they were a bit larger. But they are pretty small really.
200kb ish.
GitHub warns at 50MB and blocks files over 100MB.
For simplicity's sake it might be better just to have them in the repo.

I am still chasing down what is breaking on 32bit.

@rofinn
Member

rofinn commented May 29, 2019

> I would agree, if they were a bit larger. But they are pretty small really.

I'm inclined to do it out of principle :) It's easy enough for someone to accidentally slip in a large binary file (especially if we want to have automated benchmarks), so using datadeps would make it more explicit about what the best practice is regardless of file size.

@oxinabox
Member Author

Ok, I am now satisfied that this is a weird BSON.jl error.

@oxinabox
Member Author

The BSON error now has an open PR to fix it.

@rofinn how keen are you on having that test data in a DataDep?
I'd rather not, because it is one more thing to do.
Also, if you insist, where should I store those?

@rofinn
Member

rofinn commented May 30, 2019

If you're willing to make an issue to add DataDeps then I think I'm fine to approve. We should probably host these test files in a public S3 bucket.

@oxinabox
Member Author

oxinabox commented May 30, 2019

Done #16

Member

@rofinn rofinn left a comment

This seems fine to me for now

@oxinabox oxinabox merged commit 9b42fca into master May 31, 2019
@ararslan ararslan deleted the ox/compress2 branch June 6, 2019 20:02

Successfully merging this pull request may close these issues.

Introduce compressed format: .JLSO.gz
5 participants