Re-haul and implement a more generic read #652

kleinhenz · 2020-08-26T17:42:04Z

This adds a new generic (not type constrained) read method

read(obj::DatasetOrAttribute, ::Type{T}, I...) where T

which reads a hyperslab of a dataset or an attribute into an array of type T.

It also adds convenience methods

read(parent::Union{HDF5File, HDF5Group}, name_type_pair::Pair{String, DataType}; pv...)
h5read(filename, name_type_pair::Pair{String, DataType}; pv...)

which can be used like v_read = read(h5f, "data"=>Simple) or v_read = h5read(fn, "data"=>Simple) to read
data into an array with element type Simple (see test/custom.jl). This provides the functionality of #559 in a way that is more integrated with the rest of the library than adding a new separate h5read_compound method and closes #169.

Although the function does not have a type constraint I do check that the type is the correct size so this should still prevent you from shooting yourself in the foot too easily. The behavior of the generic read can be customized by defining do_reclaim, do_normalize and normalize_types on your custom type. If we decide to do this, I will write up some documentation for customization.

Internally, I was able to replace a bunch of separate read methods for different types with this one generic method. This works by separating reading from hdf5 into a memory compatible datatype from mapping that data into a native julia type. The first part can be handled generically (HDF5Scalar types are not special) and the second part can be handled by the normalize_types methods which we have been using with compound types and can be extended. I think this is a win in terms of simplification and reducing the number of lines of code. The only type that requires special handling is now HDF5Opaque.

As a side-effect of this hyperslab indexing works generically for everything besides HDF5Opaque. Closes #625.
As another side-effect, vlen strings are now handled uniformly via unsafe_string so this should hopefully close #627.

The rules for mapping from hdf5 to julia types now uniformly follow the conventions we have been using for reading compound types which shouldn't be too breaking but does cause a few changes in behavior. In particular the data

HDF5 "output.h5" {
DATASET "/salut_vlen" {
   DATATYPE  H5T_VLEN { H5T_STRING {
      STRSIZE 1;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }}
   DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
   (0): ("H", "i"), ("t", "h", "e", "r", "e")
   }
}
}

is now read as salut_vlenr = [["H", "i"], ["t", "h", "e", "r", "e"]] instead of ["Hi", "There"]. However, I think the new behavior is actually a more correct interpretation of the datatype since STRSIZE=1. Also now hdf5 strings are read into julia strings even if they are length 1 whereas previously these would have been read in character types. This could be changed if people think that it is important. Other than the salut_vlenr example all other tests pass without modification.

* I actually think the current behavior is more correct that what was done previously

kleinhenz · 2020-08-26T17:44:48Z

@tknopp, @tamasgal would this work for your usecases?

tamasgal · 2020-08-26T18:15:40Z

Looks good to me!

musm · 2020-08-28T15:00:26Z

As a side-effect of this hyperslab indexing works generically for everything besides HDF5Opaque. Closes #625.
As another side-effect, vlen strings are now handled uniformly via unsafe_string so this should hopefully close #627.

👍

is now read as salut_vlenr = [["H", "i"], ["t", "h", "e", "r", "e"]] instead of ["Hi", "There"]. However, I think the new behavior is actually a more correct interpretation of the datatype since STRSIZE=1. Also now hdf5 strings are read into julia strings even if they are length 1 whereas previously these would have been read in character types. This could be changed if people think that it is important.

I don't have a strong opinion whether it should a string or character, but I tend to agree with you. Perhaps there is some efficiency lost by it not being a character, when STRSIZE=1. But wouldn't that mean we would need to special case STRSIZE=1, which seems cumbersome/inelegant.

musm · 2020-08-28T15:29:52Z

src/HDF5.jl

@@ -1952,6 +1820,15 @@ function hdf5_to_julia_eltype(objtype)
    return T
 end

+function get_jl_type(objtype)


Not sure I understand the point of this method.

The hdf5_to_julia_eltype method
already has defined class_id == H5T_OPAQUE

HDF5.jl/src/HDF5.jl

Line 1808 in ece0dcd

elseif class_id == H5T_OPAQUE

So it seems that without this method things should just work already.

Edit: NVM that's for hdf5_to_julia_eltype

Yeah this mostly duplicates hdf5_to_julia_eltype but changes the string handling part so there are no special cases for the length=1 case. I wrote a new function instead of changing the original because I wanted to only touch the read part of the library. There is probably some consolidation/cleanup possible.

musm · 2020-08-28T15:32:22Z

src/HDF5.jl

@@ -1952,6 +1820,15 @@ function hdf5_to_julia_eltype(objtype)
    return T
 end

+function get_jl_type(objtype)


maybe add a comment that this is temporarily special cased for H5T_OPAQUE

src/HDF5.jl

musm · 2020-08-28T16:08:49Z

massive improvement in simplicity and readability.

musm · 2020-08-28T16:53:12Z

feel free to merge whenever

src/HDF5.jl

musm · 2020-08-28T21:02:19Z

What should we do with the UTF8Char and ASCIIChar types

kleinhenz · 2020-08-29T00:52:46Z

Cool, if you are overall happy with this approach I can clean things up a bit and write some documentation. I left some stuff that can probably be removed like the hdf5_to_julia_eltype function which mostly but not quite duplicates get_jl_type because I was trying to only touch the read part of the library and some of that is used in the write part. However there is probably at least some cleanup/consolidation possible.

I was actually wrong about the string reading behavior. Fixed length strings with length 1 are already read as strings not characters. The only place where the character types are actually used to change behavior is for reading H5T_VLEN datatypes with fixed string element types. In that case, if STRSIZE=1 then the data is read as an array of strings instead of an array of arrays of characters. This is what is going on in the salut_vlenr example. We could replicate the current behavior by adding a method for normalize_types(x::VariableArray{FixedString}). This is a fairly weird example though since the natural thing to do would be to just use a variable length string type. I need to check if the character types are important for any of the write methods but otherwise I think they can be removed.

musm · 2020-08-29T02:55:39Z

Cool, if you are overall happy with this approach I can clean things up a bit and write some documentation. I left some stuff that can probably be removed like the hdf5_to_julia_eltype function which mostly but not quite duplicates get_jl_type because I was trying to only touch the read part of the library and some of that is used in the write part. However there is probably at least some cleanup/consolidation possible.

Right now I think is the appropriate time to make such changes in the upcoming breaking release. It would certainly be good to clean up as much of the old cruft and any ugly parts of the code that we can, so I'm in support.

tknopp · 2020-08-30T14:52:29Z

@kleinhenz: I have switched to the regular compound reading with named tuples: MagneticResonanceImaging/MRIReco.jl@9b1a219

works fine although there was some drop in performance.

* requires Compat for julia 1.3

musm · 2020-09-02T14:50:33Z

@kleinhenz LMK if this is g2g, I'm happy with the changes here.

kleinhenz · 2020-09-02T16:42:06Z

yeah I think this is good to go. There is some more cleanup of internals possible but it is only sort of related to the changes here and I don't want to bloat this pull request too much.

* Add deprecated bindings for backwards compatibility with ecosystem packages Namely, get the tests of JLD.jl and MAT.jl to pass again. * Deprecations for #632 - keywords instead of property lists * Deprecations for #652 - generic read * Deprecate {d,a}_{create,write} methods with property lists The calls should use keyword properties instead.

HDF5 has the concept of NULL arrays which are distinct from 0-size arrays. As shown in JuliaIO#705, writing any 0-size array was being translated to the NULL object during writing, but having 0-length axes is also perfectly valid. This PR preserves 0-sizes when writing arrays. To represent the HDF5 NULL object, there was already the `HDF5.EmptyArray` type defined, but prior to JuliaIO#652 it appears to have been used only for internal dispatch on the generic read (which returned a 0-length vector for NULL objects). Since the generic read overhaul, it has been unused. This PR reuses `HDF5.EmptyArray` and elevates it to a full subtype of `AbstractArray`, which then becomes the returned object for any NULL datasets/attributes which are read in. This array type carries an element type but has no contents. This is distinct from Julia's `Array{T,0}` 0-dimensional arrays because the latter _does_ have a single element. (I tried just doing duck typing, but you end up playing whack-a-mole with having to define enough new method definitions that you might as well just implement the `AbstractArray` interface and take advantage of its fallbacks once the few methods are defined.) (One oddity is that Julia's 0-dimensional `Arrays` are turned into HDF5 scalars and will not be read back in as a 0-dimensional array, but (a) that is the existing behavior, and (b) I haven't thought of any other way to handle the case.)

musm · 2020-11-17T18:48:07Z

src/HDF5.jl

@@ -1991,19 +1874,22 @@ function get_mem_compatible_jl_type(objtype)
        finally
            h5t_close(super_type)
        end
+    elseif class_id == H5T_REFERENCE
+        # TODO update to use version 1.12 reference functions/types


kleinhenz added 13 commits August 25, 2020 15:26

add separate tests for HDF5ReferenceObj

e9fb1c2

support H5T_REFERENCE in get_mem_compatible_jl_type

2a9a664

wip on generic read

1be00e4

use generic read

3253219

comment out failing test

cbf489b

* I actually think the current behavior is more correct that what was done previously

remove unused

a96f4f2

add test for reading custom struct

96a3691

check that type we are reading into is concrete

a6a1678

add convenience methods for typed reads

976dc45

wrap h5s_get_select_npoints

79d71d2

allow selecting hyperslab of dataspace

08e1f54

support hyperslabs in generic read function

6ba3e83

replace filter of tuple with _dropint for compatibility

15f4e96

kleinhenz mentioned this pull request Aug 26, 2020

Add h5read_compound to read compound structures efficiently #559

Closed

add tests for hyperslab indexing of non-scalar types

ece0dcd

musm reviewed Aug 28, 2020

View reviewed changes

src/HDF5.jl Outdated Show resolved Hide resolved

musm reviewed Aug 28, 2020

View reviewed changes

src/HDF5.jl Show resolved Hide resolved

musm reviewed Aug 28, 2020

View reviewed changes

src/HDF5.jl Show resolved Hide resolved

kleinhenz added 2 commits August 28, 2020 22:24

whitespace

5f0f3af

use datatype show method when printing error

2b0b02b

type constrain signature of get_(mem_compatible)_jl_type

85a2e59

replace _dropint by filter(f, tuple)

94e33a6

* requires Compat for julia 1.3

musm changed the title ~~RFC: generic read~~ Re-haul and implement a more generic read Sep 2, 2020

musm merged commit 15a521d into JuliaIO:master Sep 2, 2020

jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020

Add deprecations needed from JuliaIO#652

728eefd

jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020

Deprecations for JuliaIO#652 - generic read

e3994fd

jmert mentioned this pull request Sep 9, 2020

Add deprecations to aid in next major revision transition #657

Merged

jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020

Deprecations for JuliaIO#652 - generic read

7bde4b9

jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 11, 2020

Deprecations for JuliaIO#652 - generic read

8e08e44

kleinhenz mentioned this pull request Sep 17, 2020

Regression in typed read #669

Closed

jmert mentioned this pull request Sep 23, 2020

Adapt to changes in HDF5 v0.14 release JuliaIO/MAT.jl#145

Merged

jmert mentioned this pull request Sep 30, 2020

Re-add read(dset, String) form #686

Merged

jmert mentioned this pull request Oct 13, 2020

Fix #705, distinguishing 0-size and null arrays #706

Merged

kleinhenz mentioned this pull request Nov 4, 2020

remove hdf5_to_julia #732

Merged

kleinhenz mentioned this pull request Nov 14, 2020

Reading a part of an HDF5Compound dataset #427

Closed

musm reviewed Nov 17, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-haul and implement a more generic read #652

Re-haul and implement a more generic read #652

kleinhenz commented Aug 26, 2020

kleinhenz commented Aug 26, 2020

tamasgal commented Aug 26, 2020

musm commented Aug 28, 2020 •

edited

Loading

musm Aug 28, 2020 •

edited

Loading

kleinhenz Aug 28, 2020

musm Aug 28, 2020

musm commented Aug 28, 2020

musm commented Aug 28, 2020

musm commented Aug 28, 2020

kleinhenz commented Aug 29, 2020

musm commented Aug 29, 2020

tknopp commented Aug 30, 2020

musm commented Sep 2, 2020

kleinhenz commented Sep 2, 2020

musm Nov 17, 2020

Re-haul and implement a more generic read #652

Re-haul and implement a more generic read #652

Conversation

kleinhenz commented Aug 26, 2020

kleinhenz commented Aug 26, 2020

tamasgal commented Aug 26, 2020

musm commented Aug 28, 2020 • edited Loading

musm Aug 28, 2020 • edited Loading

Choose a reason for hiding this comment

kleinhenz Aug 28, 2020

Choose a reason for hiding this comment

musm Aug 28, 2020

Choose a reason for hiding this comment

musm commented Aug 28, 2020

musm commented Aug 28, 2020

musm commented Aug 28, 2020

kleinhenz commented Aug 29, 2020

musm commented Aug 29, 2020

tknopp commented Aug 30, 2020

musm commented Sep 2, 2020

kleinhenz commented Sep 2, 2020

musm Nov 17, 2020

Choose a reason for hiding this comment

musm commented Aug 28, 2020 •

edited

Loading

musm Aug 28, 2020 •

edited

Loading