Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.14 release tracker #666

Closed
5 of 6 tasks
musm opened this issue Sep 16, 2020 · 13 comments · Fixed by #735
Closed
5 of 6 tasks

v0.14 release tracker #666

musm opened this issue Sep 16, 2020 · 13 comments · Fixed by #735

Comments

@musm
Copy link
Member

musm commented Sep 16, 2020

I propose tagging a major release by the end of the week sans objections?

Checklist before release:

Save for next major release (note this can come soon after if we are ready)?

@musm
Copy link
Member Author

musm commented Sep 17, 2020

After tagging, we can then fix JLD and MAT?

@kleinhenz
Copy link
Contributor

Do JLD and MAT have compat bounds so that our release won't break them?

@musm
Copy link
Member Author

musm commented Sep 17, 2020

Yes they do

@jmert
Copy link
Contributor

jmert commented Sep 17, 2020

As the new issue #669 I just added hints to, I think there are still a few corner cases in the new read interface that need to be sorted out. In addition to that issue which I think is a true regression, the behavior surrounding variable-length string arrays is maybe now a bit confused —

writing on v0.13.6 — while not truly round-trippable — were at least similar in form for read and write:

julia> h5open("test.h5", "w") do hfile
           attrs(hfile)["attr"] = HDF5.HDF5Vlen(["string", "another"])
       end
HDF5Vlen{HDF5.ASCIIChar}(["string", "another"])

julia> h5open("test.h5", "r") do hfile
           read(attrs(hfile)["attr"])
       end
2-element Vector{String}:
 "string"
 "another"

but now on master, the form being read back is very different from that being written:

julia> h5open("test.h5", "w") do hfile
           attrs(hfile)["attr"] = HDF5.HDF5Vlen(["string", "another"])
       end
HDF5Vlen{HDF5.ASCIIChar}(["string", "another"])

julia> h5open("test.h5", "r") do hfile
           read(attrs(hfile)["attr"])
       end
2-element Vector{Vector{String}}:
 ["s", "t", "r", "i", "n", "g"]
 ["a", "n", "o", "t", "h", "e", "r"]

I haven't dug far enough yet to know the answer for sure, but I think this is due to special cases for HDF5Vlen strings that construct the "strange" length-1 string arrays that are plaguing MAT.jl right now on HDF5 master.

Maybe all that is just to say, with the large overhaul of the read interface, there's probably need to look at the write interface as well.

Edit: changed a wrong "fixed" to "variable" in the intro which I'd gotten backwards

@kleinhenz
Copy link
Contributor

kleinhenz commented Sep 17, 2020

Yeah I agree the write side could be cleaned up somewhat also. Just to note, your example roundtrips fine if you remove the HDF5Vlen from the write. As far as I can tell the only purpose of HDF5Vlen is to write these strange string types. It is mostly possible to tell from the type itself whether it should be written as a vlen type so HDF5Vlen doesn't seem to serve a clear purpose and I think it should probably be deprecated/removed.

@jmert
Copy link
Contributor

jmert commented Sep 17, 2020

Just to note, your example roundtrips fine if you remove the HDF5Vlen from the write.

Yeah, I purpose-found this case in trying to track down how MAT.jl currently writes the DATATYPE H5T_VLEN { H5T_STRING { STRSIZE 1; ...}} HDF5 type attributes.

...so HDF5Vlen doesn't seem to serve a clear purpose and I think it should probably be deprecated/removed.

If this type were removed, what would be the call that a user like MAT.jl would have to use to force writing out a variable-length string rather than a fixed-length string? Would it be to create the corresponding HDF5 data type manually instead of just using a pre-existing datatype() return?

@kleinhenz
Copy link
Contributor

HDF5Vlen currently doesn't write variable length strings but rather variable length arrays of length 1 fixed strings. This seems to be a particular requirement of the MAT format so I would move the code for writing this format into MAT.jl which would involve defining the datatype and using some pieces of the lower level interface. I think this is probably fine since this shouldn't be visible to MAT users.

@kleinhenz
Copy link
Contributor

I can try to make a pull request to fix MAT to see how it would work.

@jmert
Copy link
Contributor

jmert commented Sep 17, 2020

HDF5Vlen currently doesn't write variable length strings but rather variable length arrays of length 1 fixed strings. This seems to be a particular requirement of the MAT format ...

OK, I think I'm starting to follow this more clearly. Yes, empirically, it's what MATLAB does with its data format, but it's also (maybe unfortunately) the "natural" result of naively constructing a variable-length array of HDF5 strings:

julia> dtype1 = HDF5.HDF5Datatype(HDF5.h5t_vlen_create(HDF5.h5t_copy(HDF5.H5T_C_S1)))
HDF5 datatype: H5T_VLEN {
      H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   }

Instead, it looks like what looks more "sane" from an outside perspective is to first manipulate the string type to be variable length and then construct the variable array around them:

julia> dvstr = HDF5.h5t_copy(HDF5.H5T_C_S1);
julia> HDF5.h5t_set_size(dvstr, HDF5.H5T_VARIABLE);
julia> dtypeN = HDF5.HDF5Datatype(HDF5.h5t_vlen_create(dvstr))
HDF5 datatype: H5T_VLEN {
      H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
   }

It seems like there might be some motivation to keep around the HDF5Vlen data type to encapsulate the H5T_VLEN type, but I also agree that the current behavior where HDF5Vlen of strings is translated into the former HDF5 datatype is probably something that should be done manually by MAT.jl so that the default behavior here is something more like the latter.

(For reference, I eventually found some hints to his at https://support.hdfgroup.org/HDF5/doc/RM/RM_H5T.html#CreateVLString)

@musm
Copy link
Member Author

musm commented Oct 21, 2020

Should we plan for a new release soon? Anything in particular we should get in beforehand? We have made quite a few changes since 0.13, so it feels like a good time. I know there's still lots left to be done, but I wonder if it would be better to make incremental major releases or one large major release that would require many user changes to update to. I'm kind of leaning towards the former, but don't have strong opinions on it.

@jmert
Copy link
Contributor

jmert commented Oct 28, 2020

I'd like to see the type-specific overloads of h5a_* and h5d_* turned into specializations of a_* and d_* before the actual release, since those are arguably part of the middle-level layer and shouldn't be mixed in with the true low-level interface functions.

I'll try to work on that over the next day or two. I think doing just the mechanical translation should be relatively fast and straightfoward. There's probably more reorganizing and cleanup that could be done once the renames happen, but that can happen later in point releases.

@musm musm changed the title Proposal new major release Checklist for new major release Nov 1, 2020
@musm musm changed the title Checklist for new major release v0.14 release tracker Nov 1, 2020
@musm
Copy link
Member Author

musm commented Nov 12, 2020

I think we are at the stage where it makes sense to release 0.14. There was some follow up work in upgrading the underlying hdf5 library to 0.12 that I am working on, but it's quite tricky since HDF5 doesn't allow for cross compilation.

@musm
Copy link
Member Author

musm commented Nov 12, 2020

Interesting issue number

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants