
Add h5read_compound to read compound structures efficiently #559

Closed
wants to merge 3 commits

Conversation

@tamasgal (Contributor) commented Jul 19, 2019

The h5read_compound function with its three methods is already in use in all of my projects where I have to read HDF5Compound data; it is roughly two orders of magnitude faster than using read or h5read and then parsing the data manually.

I keep copying these functions from project to project, and I would really like them to be included in HDF5.jl.

Credits go to @damiendr who provided the base code snippet back in 2017: #408

For demonstration purposes I created a file compound.h5 using Python+PyTables with a compound structure using this Python script:

import tables as tb


class Foo(tb.IsDescription):
    a = tb.Int32Col()
    b = tb.Float64Col()


h5file = tb.open_file("compound.h5", mode="w")
table = h5file.create_table("/", 'foo', Foo)

foo = table.row

for i in range(100_000):
    foo['a'] = i
    foo['b'] = i * 1.1
    foo.append()

h5file.close()  # flush the table and close the file

Here are the results of a simple benchmark:

using HDF5
using BenchmarkTools

struct Foo
    a::Int32
    b::Float64
end

h5f = h5open("compound.h5", "r")

@btime read(h5f, "/foo");
@btime h5read_compound(h5f["/foo"], Foo);
@btime h5read("compound.h5", "/foo");
@btime h5read_compound("compound.h5", "/foo", Foo);
 97.445 ms (799562 allocations: 31.37 MiB)
  1.052 ms (25 allocations: 1.53 MiB)
  99.236 ms (799568 allocations: 31.37 MiB)
  3.061 ms (31 allocations: 1.53 MiB)
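For context on why the compound reader is so much faster: the idea (visible in the diff quoted later in this thread) is to let libhdf5 fill one preallocated buffer in the native padded memory layout and then reinterpret it as a vector of isbits structs, with no per-element parsing. A minimal stand-alone sketch of the layout trick, faking the buffer instead of calling libhdf5 (x86-64 padding and little-endian byte order assumed):

```julia
# Same fields as the Foo struct in the benchmark above.
struct FooDemo
    a::Int32
    b::Float64
end

# Hand-build what a native (padded) buffer for one record looks like.
buf = IOBuffer()
write(buf, Int32(1))   # field a
write(buf, Int32(0))   # 4 padding bytes (Float64 is 8-byte aligned)
write(buf, 1.1)        # field b
bytes = take!(buf)

# One reinterpret turns the raw buffer into a Vector of structs;
# this is the step that replaces per-element parsing.
foos = collect(reinterpret(FooDemo, bytes))
foos[1].a, foos[1].b   # (1, 1.1) on a little-endian machine
```

In the real reader, libhdf5 writes the whole dataset into such a buffer in a single call, so the Julia side does no per-record work at all.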

@tamasgal (Contributor, Author) commented Jul 19, 2019

I'd like to add a test, but I haven't yet managed to use HDF5.jl to write a compound structure.

AppVeyor complains about BinaryProvider.

Edit: I just realised there is a compound.h5 in test/test_files

@tamasgal (Contributor, Author)

I am a bit puzzled by this error on Travis:

ERROR: LoadError: LoadError: error compiling top-level scope: type definition not allowed inside a local scope

Why is it running in a local scope? The test suite works fine on my desktop...

@tamasgal (Contributor, Author)

OK, got it: the struct definition has to be outside of the @testset block.
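For anyone hitting the same Travis error: @testset introduces a local scope, and Julia only allows type definitions at top level. A minimal sketch of the working arrangement (names are illustrative, not from the actual test suite):

```julia
using Test

# Type definitions must live at top level, outside the @testset block.
struct CompoundFixture
    a::Int32
    b::Float64
end

@testset "compound read" begin
    # Inside the block we can freely *use* the type.
    x = CompoundFixture(1, 1.1)
    @test x.a == 1
    @test x.b == 1.1
end
```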

@tknopp (Contributor) commented Aug 21, 2019

Can this parse arrays of compounds?

In particular, the challenge is parsing the following file format:
https://netix.dl.sourceforge.net/project/ismrmrd/data/simple_gre.h5

I made a custom parser, but of course it would be great if HDF5.jl supported this directly.

@tamasgal (Contributor, Author)

> Can this parse arrays of compounds?
>
> In particular, the challenge is parsing the following file format:
> https://netix.dl.sourceforge.net/project/ismrmrd/data/simple_gre.h5
>
> I made a custom parser, but of course it would be great if HDF5.jl supported this directly.

I'm not sure I understand what you mean; in the example above it does read an array of compounds. I was not able to read the HDF5 file you provided. Can you provide a struct with the matching compound structure? If so, you should be able to read it simply with

h5read_compound("simple_gre.h5", "/dataset/data", TheStruct);

@tamasgal (Contributor, Author) commented Aug 21, 2019

Hmm, I think I got it: each entry in your file is itself an array of compounds... I'm not sure how to deal with that yet, though.

At least I successfully used StaticArrays as fields of the compound structures, so it might work.
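A hedged sketch of what a compound with a fixed-size array member could look like on the Julia side. NTuple is used here as a dependency-free stand-in for a StaticArrays vector, since both are flat isbits fields; the field names are illustrative, not taken from the ISMRMRD format:

```julia
# A fixed-size array member of an HDF5 compound maps onto a flat isbits
# field; an NTuple has the same inline layout as a StaticArrays SVector.
struct Sample
    coords::NTuple{3,Float64}  # e.g. an array member of 3 doubles
    id::Int32
end

# The whole struct is still isbits, so a Vector{Sample} can be filled
# by a single bulk read, exactly as for scalar-only compounds.
isbitstype(Sample)  # true
```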

@tknopp (Contributor) commented Aug 21, 2019

I don't want to hold up this PR, but just to mention that in the end it would also be cool to read such files. h5py is able to do that: it constructs the structs in a dynamic fashion. Your approach has the advantage that everything is type stable.

@kleinhenz (Contributor)

@tknopp with #592 it looks like I am able to read your file automatically.

@tknopp (Contributor) commented Nov 1, 2019

That sounds very interesting. Note that I wrote a custom parser for the data:

https://github.com/MagneticResonanceImaging/MRIReco.jl/blob/master/src/IO/ISMRMRD.jl#L14

Of course I am relying on the current implementation.
One issue I faced is that the struct offsets were sometimes pretty strange in files I found in the wild. I therefore needed to do something like this:

https://github.com/MagneticResonanceImaging/MRIReco.jl/blob/master/src/IO/ISMRMRD.jl#L70

Does your implementation handle that well?

Other question: in my library I have already defined the struct that corresponds to the stored compound, see

https://github.com/MagneticResonanceImaging/MRIReco.jl/blob/master/src/Datatypes/RawAcqData.jl#L6

Ideally I would be able to convert the named tuple into that.
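One generic way that named-tuple-to-struct conversion could be sketched, assuming the tuple's field names match the target struct. Both `Header` and `from_namedtuple` are hypothetical names for illustration, not part of either PR or of MRIReco.jl:

```julia
# Stand-in for an ISMRMRD-style header struct defined in user code.
struct Header
    version::Int16
    flags::Int64
end

# Build a struct of type T from a NamedTuple by pulling out the fields
# of T in order; works whenever the names (and convertible types) match.
function from_namedtuple(::Type{T}, nt::NamedTuple) where {T}
    T((getfield(nt, f) for f in fieldnames(T))...)
end

h = from_namedtuple(Header, (version = Int16(1), flags = Int64(0)))
```

Extra fields in the tuple are simply ignored, which may or may not be the desired behavior for strict formats.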

@tknopp (Contributor) commented Nov 1, 2019

Initial testing looks very good! Nice job.

@tamasgal (Contributor, Author)

I am not against deriving the types dynamically; it's actually a nice feature to have. But there are also other use cases where you want to map an existing struct (as in my/our case): we have a defined structure which we also use in other languages.

I made some preliminary checks and it also seems that the static approach is much faster than the dynamic one.

On the other hand, the two approaches are orthogonal to each other, so I think they could both be merged.

What's the status of the PR? Any chance of accepting it or should I close it?

@musm (Member) commented Nov 27, 2019

Any chance we can reach some convergence on this PR and #592 ?

Some sort of unanimous decision would be great, so we can make progress.

@kleinhenz (Contributor)

I agree that the two approaches are mostly orthogonal, so it wouldn't hurt to merge both. I think my approach is more in line with how things currently work in HDF5.jl, where all the type translations are handled automatically, but it doesn't cover the use case this PR does: reading into a custom type you have defined yourself.

@tamasgal I'm curious where my approach was significantly slower? The actual HDF5 I/O routines called in both cases should be identical; on this benchmark I get essentially identical performance. In the case where there are strings or other vlen members, I do have some overhead from copying the data into the native Julia type, but as far as I can tell that is not really avoidable.

@tamasgal (Contributor, Author)

@kleinhenz I have to check carefully, but I haven't had time yet. I'll try out your example and compare!

filetype = datatype(dset) # packed layout on disk
memtype_id = h5t_get_native_type(filetype.id) # padded layout in memory
@assert sizeof(T) == h5t_get_size(memtype_id) "Type sizes mismatch!"
out = Vector{T}(undef, length(dset))
Inline review comment (Contributor):

why not out = Array{T}(undef, size(dset)) to preserve all the size information?

Reply (Contributor, Author):

Sure!

@kleinhenz
Copy link
Contributor

I think it would be nice to integrate this a little more with the existing api infrastructure before it is exported. For example currently this doesn't allow reading hyperslabs.

It's interesting to note that the current read function

# Read array of scalars
function read(obj::DatasetOrAttribute, ::Type{Array{T}}) where {T<:Union{HDF5Scalar, Complex{<:HDF5Scalar}}}
    if isnull(obj)
        return T[]
    end
    dims = size(obj)
    data = Array{T}(undef,dims)
    dtype = datatype(data)
    readarray(obj, dtype.id, data)
    close(dtype)
    data
end

would already almost work for this if it weren't for the HDF5Scalar type restriction. I think it is probably possible to do something very similar to this that integrates with the rest of the existing api by replacing the places where HDF5Scalar is used for dispatch with some trait based system that the user can opt into. I'd be willing to spend some time trying to work this out if people are interested/think this is a good approach.
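A rough sketch of what such an opt-in trait could look like. All names here are hypothetical illustrations of the proposal, not an existing HDF5.jl API:

```julia
# Instead of restricting dispatch to T<:HDF5Scalar, a trait declares
# which types may be read bit-for-bit, and users opt their structs in.
abstract type ReadStyle end
struct BitsCompatible <: ReadStyle end
struct NeedsConversion <: ReadStyle end

readstyle(::Type) = NeedsConversion()  # default: nothing is opted in
readstyle(::Type{T}) where {T<:Union{Int32,Int64,Float32,Float64}} =
    BitsCompatible()

# Same fields as the Foo struct from the benchmark; the user opts it in.
struct Foo
    a::Int32
    b::Float64
end
readstyle(::Type{Foo}) = BitsCompatible()

# Read paths would then dispatch on readstyle(T) rather than on T itself.
can_bulk_read(::Type{T}) where {T} = readstyle(T) isa BitsCompatible
```

The existing `read` method quoted above would then only need its `where` clause relaxed to types whose trait is `BitsCompatible`.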

@tamasgal (Contributor, Author)

Yes, good catch and good idea! I'd appreciate it if you could take over; I am quite overcommitted these days (well, who isn't in this community 🙈).

@tknopp (Contributor) commented Nov 28, 2019

@kleinhenz: I reiterate my request that this API be able to handle the ISMRMRD data format. #592 will break my code in MRIReco.jl, and I hope the present PR will rescue me.

@musm (Member) commented Aug 25, 2020

What's the status here? Should we proceed or do something else?

@kleinhenz (Contributor)

#652 is my idea of how to do this.

@musm closed this Sep 2, 2020