Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve performance/ergonomics of reading compound datatypes #592

Merged
merged 9 commits into from
Mar 2, 2020

Conversation

kleinhenz
Copy link
Contributor

When reading compound datatypes this replaces HDF5Compound with a named tuple with the correct data layout. The data is directly read into the output buffer instead of going through an IOBuffer. This results in very large performance gains.

I wrote a small benchmark to test the performance of reading a one million element dataset of a small compound datatype with three float fields.

Before

BenchmarkTools.Trial:
  memory estimate:  329.44 MiB
  allocs estimate:  9000084
  --------------
  minimum time:     1.689 s (19.72% GC)
  median time:      1.878 s (24.72% GC)
  mean time:        1.860 s (25.70% GC)
  maximum time:     2.013 s (31.63% GC)
  --------------
  samples:          3
  evals/sample:     1⏎

After:

BenchmarkTools.Trial:
  memory estimate:  22.89 MiB
  allocs estimate:  64
  --------------
  minimum time:     4.305 ms (20.43% GC)
  median time:      4.584 ms (20.69% GC)
  mean time:        4.717 ms (23.21% GC)
  maximum time:     53.451 ms (91.03% GC)
  --------------
  samples:          1059
  evals/sample:     1⏎

Additionally I fixed support for variable length strings as fields of compound datatypes which seemed to be broken before.

In the case where the compound type contains strings/arrays/vlen types a copy is made to convert from the hdf5 compatible type (e.g. Cstring, FixedArray, etc.) to native julia types. This has some performance cost but I think is almost always what you want to do.

Besides being much much faster I think that getting an array of NamedTuples is much easier to work with than the HDF5Compound struct there was before.

This should close #408. I prefer this solution to #559 since it doesn't change the interface and doesn't require manual specification of the struct.

@tknopp
Copy link
Contributor

tknopp commented Nov 2, 2019

So, from my perspective this looks very good. What I request is:

  • a major version bumb, when this lands so that I can pin my package to the previous version.

@kleinhenz: While I think you approach is better than #559, it would help a lot when we there would be possibility to convert the named tuple into the equivalent struct representation. One issue I am currently seeing is that you represent FixedSize Vectors as Vectors, while the "in-memory" representation is a NTuple.

@musm
Copy link
Member

musm commented Nov 3, 2019

We can certainly make a major version bump.

One issue I am currently seeing is that you represent FixedSize Vectors as Vectors, while the "in-memory" representation is a NTuple.

@tknopp Can you double check here your statement. The code here is using a fixed size NTuple not Vectors for FixedArray?

@kleinhenz
Copy link
Contributor Author

In the normalize_types function I translate from HDF5.FixedArray which is just a wrapper around an NTuple with the shape information saved in the type into an actual Array. This is definitely a design decision that I could see going either way. My thought was that in most cases you will want to work with the data as an array so it makes sense to just translate. If we depended on StaticArrays we could just use that, but that is probably overkill for this package.

The reason I decided to do the type translation here is because it is basically necessary for string types. You don't want to return named tuples with Cstring members which point to memory allocated by hdf5. So if in order to get sensible julia types we have to do some type translations for strings I thought we might as well also do it for arrays. This could easily be changed if having compatibility with existing structs is more important.

In the case where there are no string or array members then I think using reinterpret in order to get to an equivalent struct representation should just work.

@tknopp
Copy link
Contributor

tknopp commented Nov 4, 2019

Can we make the type normalization optional? Maybe with the current behavior being the default.

@kleinhenz
Copy link
Contributor Author

So thinking about this a little more I think that in the case where you have a memory compatible struct already defined it makes a lot more sense to use it directly like in #559 than to use this and then reinterpret. The nice thing about this approach is that it gives you an ergonomic representation without you having to specify a struct manually. But if the struct is already defined then I don't think there's really any advantage to reading as named tuples first. The main function in #559 is <10 lines of code so I don't think using an approach like that is too onerous. I think that makes more sense than complicating the default interface by making the type normalization configurable.

@tknopp
Copy link
Contributor

tknopp commented Nov 4, 2019

Is it really complicating the interface? I mainly asking for a backdoor, where the raw things are returned. I am actually wondering how you plan to realize the write support. In that case you will need to denormalize the types.

@kleinhenz
Copy link
Contributor Author

In order to make it optional it seems like you need to either add a new global flag which is a bit unappealing or else add a new mechanism for passing keyword arguments through all of the machinery which seems like sort of a big change just for this. I can't just add it to the relevant method because usually this will be invoked through h5read or read(dset).

What would be the advantage of using this and then reinterpreting the data into your custom struct over just reading directly into the custom struct?

IMHO write support is separate and maybe out of scope. I guess JLD already solves the problem of how to map julia types to compound hdf5 types. It's probably best to leave general serialization logic there.

@tknopp
Copy link
Contributor

tknopp commented Nov 4, 2019

IMHO write support is separate and maybe out of scope. I guess JLD already solves the problem of how to map julia types to compound hdf5 types. It's probably best to leave general serialization logic there.

I disagree with that. You seem to think about compounds as some serialization format where the concrete way things are stored don't matter too much. But there are file formats (e.g. https://ismrmrd.github.io) that use a fixed compound type definition. I therefore consider this to be a contract where the user needs full control to fulfill the contract.

But anyway, I can live with the normalization and will work around it in my code.

@kleinhenz
Copy link
Contributor Author

That's a good point. Still I think that is an issue that should to be addressed separately. As noted in #341 making the set of natively supported types dynamically modifiable is currently pretty non-trivial because HDF5Scalar is used in dispatch everywhere. It would be nice to be able to define datatype(::Type{mytype}) and have all of the high level machinery just work. It would also be nice to provide an extension point to allow user specification of the type to be read into to override the default namedtuple scheme proposed here. However I think both of these things would require fairly big changes to the internal library code.

The change proposed here is much more minimal since it doesn't do anything about user defined types but I think it's worth making since it greatly improves performance and named tuples are much more convenient than the HDF5Compound type.

end
iscomplex = (membernames == COMPLEX_FIELD_NAMES[]) && (membertypes[1] == membertypes[2]) && (membertypes[1] <: HDF5.HDF5Scalar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part was refactored because now membernames/membertypes are used outside of of the COMPLEX_SUPPORT path.

@kleinhenz
Copy link
Contributor Author

@musm what do you think about this approach?

@tknopp
Copy link
Contributor

tknopp commented Nov 28, 2019

@musm, @kleinhenz so I support merging this although it will break my package hardly. Maybe #559 can be then worked on in a separate step. As @kleinhenz outlined they are orthogonal. In this PR one has no knowledge about the compound. In #559 one has the compound at hand. So it may even dispatch nicely.

@musm
Copy link
Member

musm commented Dec 4, 2019

Yeah I think the approach here is a good one. I need to play around with it a bit more.

What about wrapping the NamedTuple in an HDF5Compound container? This might help users more easily upgrade or transition.

@kleinhenz
Copy link
Contributor Author

How would you want to wrap it? To maintain the old interface or to just provide an outer type?

My instinct would be to just return the named tuple directly since

  1. the old interface was inconvenient enough to justify a breaking change
  2. people already know how named tuples work so adapting code to use the named tuples will have less friction than trying to figure out the semantics of a new HDF5Compound container.

@musm
Copy link
Member

musm commented Dec 12, 2019

Fair enough. I think we will need an upcoming breaking release containing this PR, changes to the array interface to align them with Base julia, and the update to the JLL artifacts.

@codecov
Copy link

codecov bot commented Feb 29, 2020

Codecov Report

Merging #592 into master will decrease coverage by 0.46%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #592      +/-   ##
=========================================
- Coverage   78.57%   78.1%   -0.47%     
=========================================
  Files           3       3              
  Lines        1134    1128       -6     
=========================================
- Hits          891     881      -10     
- Misses        243     247       +4
Impacted Files Coverage Δ
src/HDF5.jl 77.84% <96%> (-0.5%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9163402...6c8bc40. Read the comment docs.

@kleinhenz
Copy link
Contributor Author

@musm does anything else have to happen with this before it can be merged?

test/compound.jl Outdated Show resolved Hide resolved
test/compound.jl Show resolved Hide resolved
remove unnecessary convert calls

Co-Authored-By: Mustafa M. <[email protected]>
src/HDF5.jl Outdated Show resolved Hide resolved
whitespace

Co-Authored-By: Mustafa M. <[email protected]>
src/HDF5.jl Outdated
return HDF5Compound(data[1], membername, membertype)
# get a vector of all the leaf types in a (possibly nested) named tuple
function get_all_types(::Type{NamedTuple{T, U}}) where T where U
types = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be

Suggested change
types = []
types = DataType[]

src/HDF5.jl Outdated
if T === HDF5Compound{N}
return HDF5Compound(data[1], membername, membertype)
# get a vector of all the leaf types in a (possibly nested) named tuple
function get_all_types(::Type{NamedTuple{T, U}}) where T where U
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing there's a better way to do this that I can't currently think of, but seems fine for now.

Copy link
Member

@musm musm Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems faster / better

function get_all_types1(::Type{NamedTuple})
    types = DataType[]
    for Ui in U.types
        if Ui <: NamedTuple
          append!(types, get_all_types(Ui))
        else
          push!(types, Ui)
        end
    end
    return types
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this would be better/more julian?

do_normalize(::Type{T}) where T = false
do_normalize(::Type{NamedTuple{T, U}}) where T where U = any(i -> do_normalize(fieldtype(U,i)), 1:fieldcount(U))
do_normalize(::Type{T}) where T <: Union{Cstring, FixedString, FixedArray, VariableArray} = true

do_reclaim(::Type{T}) where T = false
do_reclaim(::Type{NamedTuple{T, U}}) where T where U =  any(i -> do_reclaim(fieldtype(U,i)), 1:fieldcount(U))
do_reclaim(::Type{T}) where T <: Union{Cstring, VariableArray} = true

the only reason I collect all the types is to see if I need to call reclaim or do an extra normalization step to turn the data into a native julia type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that might be simpler. It's certainly in our best favor to make the code run as fast as possible and be as simple as possible to read for maintenance. You can also benchmark performance.

size(::Type{FixedArray{T,D}}) where {T,D} = D
eltype(::Type{FixedArray{T,D}}) where {T,D} = T
struct FixedArray{T,D,L}
data::NTuple{L, T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how / where's this field being used ?

Copy link
Contributor Author

@kleinhenz kleinhenz Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used so that the julia type you construct for an H5T_ARRAY is memory compatible and can be directly written into which is necessary for compound datatypes with H5T_ARRAY members.

src/HDF5.jl Outdated
push!(row, read!(io, Vector{UInt8}(undef,dsize)))
end
# convert Cstring/FixedString to String
function normalize_types(x::NamedTuple{T}) where T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where was this normalization obtained from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was taken from the current handling of strings/vlen types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except that for vlen arrays I use a non-owning unsafe wrap and then make a copy. I think this is the right thing to do since hdf5 wants to manage its own memory.

src/HDF5.jl Outdated Show resolved Hide resolved
return Cstring
else
n = h5t_get_size(dtype.id)
return FixedString{Int(n)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'm guessing it's fine not do Int here as well. Unless you foresee negative consequences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is actually necessary since NTuple requires the first type parameter to be Int64.

@musm
Copy link
Member

musm commented Mar 2, 2020

Thanks, I think we are almost there, just a few more tweaks and we can merge and tag a new major.

@musm
Copy link
Member

musm commented Mar 2, 2020

Any chance we could also get this to work with a variant of #341 in the future?

* use dispatch instead of manually collecting all leaf types
  for type normalization

* remove unnecessary Int conversion
@kleinhenz
Copy link
Contributor Author

yeah I think that should work fine. I already checked that this integrates with the complex support so float16 should work the same way. One thing that doesn't work currently is hyperslab support since we have an annoying check for T <: HDF5Scalar in the getindex code. I think a lot of that code should probably be reworked to be more generic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance issue reading HDF5Compound data
3 participants