API less restrictive, more friendly to hierarchical data #119

sem-geologist · 2023-05-17T13:10:38Z

I should rant a bit first (that's me, how to start a day without ranting? 😆). I am developing HussariX and I struggle with the restrictions imposed by the HyperSpy structure, and I thought things would get a bit better after the io code was split out. To me it looks like not much has changed, and I am still maintaining duplicate code with different data access interfaces.

Currently each format API is expected to provide .file_reader and/or .file_writer. This limits the library to flat-structured physical files, where signals are present singly or as a group originating from a single measurement/ROI. Contrary to that assumption, the old folder-based Jeol asw projects organize data by project and sample (a two-level hierarchy). Bruker Esprit projects allow deep, custom hierarchical structures with no artificial limit on the number of levels (something like a file system with folders and files inside). As #118 came up and I started to think again about adding Cameca EPMA formats to this library, I again see a "wall" standing in front of me: the data inside these files is hierarchical and translates poorly to flat lists of dictionaries when reading whole files.

Thus I need to maintain a shared codebase with several different interfaces to access data. Life would be much easier if this project 1) did not impose artificial restrictions, and, even better, 2) if we could come up with a unified extension offering more advanced data access interfaces.

Who got the idea to hide everything from the user (I often find myself wondering whether this is philosophically a Java project written in Python syntax...) and expose only these functions (file_reader, file_writer)? If I do import rsciio.bruker as api, I get only the .file_reader function exposed. If I want the full functionality I need to do the ugly import rsciio.bruker._api as api.

Idea 1. Let's leave the Java-esque restricted thinking behind and move to the openness of Python. I suggest we drop the underscore from the _api.py filename of every format. Why can't it simply be api.py and expose all those internal functions to the user? Is it not enough that functions and methods which are not meant to be used directly are already prefixed with an underscore inside the code of the particular format parser? If we dropped the underscore in _api.py, then for example by doing import rsciio.bruker as api we would also get .api accessible alongside .file_reader. file_reader and file_writer are the guaranteed unified interfaces, while api would provide more specific functions for dealing with the files and workflows of that particular vendor's formats. Early adopters (potentially me) could use some of that functionality directly until (or if) unified functionality is developed (see further ideas).
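As a rough illustration of Idea 1, the sketch below builds a throwaway package in a temporary directory and checks which names become discoverable once the module is called api.py. The package name demo_vendor and the helper parse_header are made up for the example; this is not the actual rsciio layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of a toy format module named api.py (no leading underscore).
API_SOURCE = '''
def file_reader(filename):
    # Unified entry point: returns a list of signal dictionaries.
    return [{"data": None, "metadata": {"source": str(filename)}}]

def parse_header(raw):
    # Hypothetical format-specific helper a power user may want to call.
    return {"size": len(raw)}

def _internal_step(raw):
    # Underscore-prefixed: still private by convention.
    return raw
'''

# The package exposes the unified function directly and the module as .api.
INIT_SOURCE = "from . import api\nfrom .api import file_reader\n"

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_vendor"  # hypothetical package name
    pkg.mkdir()
    (pkg / "api.py").write_text(API_SOURCE)
    (pkg / "__init__.py").write_text(INIT_SOURCE)

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_vendor = importlib.import_module("demo_vendor")
    finally:
        sys.path.remove(tmp)

# Names a completion tool would typically offer (no leading underscore).
public_names = sorted(n for n in dir(demo_vendor) if not n.startswith("_"))
print(public_names)
print(demo_vendor.api.parse_header(b"abc"))
```

With this layout the unified file_reader and the format-specific helpers are both reachable without importing an underscore-prefixed module.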

Idea 2. Let's expand the unified interface functions. Currently file_reader and file_writer clearly expect a file name as argument, and thus a physical file. But what if we want to get a single piece of data and metadata directly from a hierarchical (tree-structured) storage entity, i.e. a node in a huge XML, or some files inside a zip? BTW, testing of some tiff functionality currently uses zipped tiffs (the cumbersomeness and many disadvantages of compressing test data in a git repository are a bit off-topic here, but worth coming back to in another discussion), and the files need to be extracted to a temporary location to load them using hyperspy. Ironically, rsciio without hyperspy (on its own, by invoking rsciio.tiff.file_reader) is able to read those directly from within the zip! Why would it not? tifffile's TiffFile accepts str | os.PathLike | BytesIO | or anything similar! But HyperSpy needs physical files, because it uses the file extension to recognize which reader to use, and thus the test files need to be physically extracted. rsciio still has no such unified functionality (I mean something like hyperspy has; are there plans to add a unified rsciio.read or rsciio.load, rsciio.write or rsciio.save?). Thus my proposal would be to extend the read/write capability matrix and add corresponding functions such as bytesio_reader, which could read from objects that share a common interface with BytesIO, that is the .read, .seek, .tell, .peek... methods. Others would be etree_reader and etree_writer, for readers and writers (some planned to be made) which deal with XML through the ElementTree interface.
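A minimal sketch of what a bytesio_reader could look like, assuming a made-up toy format (a 4-byte marker followed by little-endian 16-bit values). The function name comes from the proposal above; the format, the return layout and the file name sample.demo are invented for illustration and do not correspond to any existing rsciio reader.

```python
import io
import struct
import zipfile

MAGIC = b"DEMO"  # made-up format marker, for illustration only


def bytesio_reader(stream):
    """Hypothetical reader: parse the toy format from any seekable binary stream."""
    stream.seek(0)
    if stream.read(4) != MAGIC:
        raise ValueError("not a DEMO stream")
    (count,) = struct.unpack("<H", stream.read(2))
    values = list(struct.unpack(f"<{count}H", stream.read(2 * count)))
    return [{"data": values, "metadata": {"n_values": count}}]


def file_reader(filename):
    """File-based entry point, implemented on top of the stream reader."""
    with open(filename, "rb") as f:
        return bytesio_reader(f)


# Build a zip archive in memory that holds one toy file.
payload = MAGIC + struct.pack("<H", 3) + struct.pack("<3H", 10, 20, 30)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("sample.demo", payload)

# Read the member straight from the archive, without extracting to disk.
with zipfile.ZipFile(archive) as zf:
    member = io.BytesIO(zf.read("sample.demo"))
    result = bytesio_reader(member)

print(result)
```

Keeping file_reader as a thin wrapper around the stream-based function means the two entry points cannot drift apart.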

Idea 3. It would be nice if hierarchically stored data systems could expose another unified method, get_tree (naming is up for discussion), for formats which have a dataset/sample/project/etc. hierarchy, i.e. Bruker projects and Jeol asw projects (there are probably more that I currently don't know of). Such a tree (an ordered dictionary) could then be used to retrieve data and metadata on demand from particular nodes.
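One possible shape for such a tree, sketched under assumptions: nested dictionaries whose leaves are zero-argument loaders, so nothing is read until a node is requested. Apart from get_tree, every name here (make_leaf, list_paths, load_node, the project/sample names) is hypothetical.

```python
def make_leaf(name):
    """Return a lazy loader; a real reader would parse the node on demand."""
    def load():
        return {"data": None, "metadata": {"title": name}}
    return load


def get_tree():
    """Hypothetical: describe a project file as nested dicts with lazy leaves."""
    return {
        "project": {
            "sample_1": {
                "spectrum_a": make_leaf("spectrum_a"),
                "map_b": make_leaf("map_b"),
            },
            "sample_2": {
                "spectrum_c": make_leaf("spectrum_c"),
            },
        }
    }


def list_paths(tree, prefix=()):
    """Yield slash-separated paths to every leaf, in insertion order."""
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from list_paths(value, path)
        else:
            yield "/".join(path)


def load_node(tree, path):
    """Walk the tree along the path and call the loader found at the leaf."""
    node = tree
    for part in path.split("/"):
        node = node[part]
    return node()


tree = get_tree()
paths = list(list_paths(tree))
print(paths)
print(load_node(tree, "project/sample_2/spectrum_c"))
```

Plain dictionaries preserve insertion order in current Python versions, so an explicit OrderedDict is not needed for this sketch.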

ericpre · 2023-05-17T17:35:25Z

Maintainer

Thanks @sem-geologist for starting this discussion: what you say makes a lot of sense to me and I will try to give some context to the discussion.

> Who got the idea to hide everything from the user (I often find myself wondering whether this is philosophically a Java project written in Python syntax...) and expose only these functions (file_reader, file_writer)? If I do import rsciio.bruker as api, I get only the .file_reader function exposed. If I want the full functionality I need to do the ugly import rsciio.bruker._api as api.

In short: start from a clean, very minimal public API and grow a consistent API over time.

One issue that I encounter as a user and developer is to know, or to make clear, what is expected to be public stable API. Some will say everything that is explicitly documented as public API, others will say everything which can be imported without an underscore... By using an ugly (I would say dirty) underscore, it is clear that this is something not very clean and potentially asking for trouble, i.e. it may break in the future. The approach being used is pragmatic, to keep things moving: splitting the io code from hyperspy without spending too much time on defining a complete stable API, so that ideally people can start to use it and contribute to the design of the API. We should also pay attention that arguments are part of the API, something which is easy to forget... Off the top of my head, I actually don't know how good/bad the situation is in rosettasciio.

Designing a good interface will not be trivial, as it would need to be done for a variety of formats, but this is not particularly new either, and it needs to be done correctly and carefully, which will take time. Maybe it would be worth having an experimental API, which would not be stable for a period of time and would be explicitly presented as experimental or as a prototype.
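One way an "explicitly experimental" API could be flagged is with a decorator that warns on every call. This is only a sketch: the decorator name experimental, the warning class and the decorated get_tree function are assumptions, not something that exists in rosettasciio.

```python
import functools
import warnings


class ExperimentalAPIWarning(FutureWarning):
    """Hypothetical warning category for not-yet-stable functions."""


def experimental(func):
    """Mark a function as experimental: calling it emits a warning."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is experimental and may change without notice.",
            ExperimentalAPIWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)
    return wrapper


@experimental
def get_tree(filename):
    # Placeholder body; a real implementation would parse the file.
    return {"root": {}}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tree = get_tree("project.file")

print(tree, [str(w.message) for w in caught])
```

Using a dedicated warning category lets downstream packages filter or escalate these warnings independently of ordinary deprecations.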

jlaehne · 2023-05-17T21:14:29Z

Collaborator

Also, we wanted to expose file_reader and file_writer more directly through e.g. rsciio.bruker instead of rsciio.bruker.api.

I would expect that the standard user will not expect or need direct access to all the underlying functions, which is more something for some power users that can take the detour through ._api. But I don't object to making the whole thing public either. Alternatively, we can simply add more functions to the exposed api where applicable, like in the cases you outline - there surely is no need to restrict us to the reader/writer functions, but these should be available consistently.

> are there plans to add a unified rsciio.read or rsciio.load, rsciio.write or rsciio.save?

I currently don't see the need to duplicate efforts here for something available through HyperSpy. But that is my view, from the perspective of reading files through hyperspy anyway. The idea was to make the plugins available to other packages and programs through a unified api, with less load in terms of dependencies. If it helps, such unified functions can be added later, but for some packages it might be more practical to implement that on their end, like hyperspy does.

Concerning hierarchical data formats (like hdf5, where h5py provides such functionality), your ideas make perfect sense. But like @ericpre said, we should get a first version out soon and can then further develop the api.

1 reply

sem-geologist May 18, 2023
Author

> Also, we wanted to expose file_reader and file_writer more directly through e.g. rsciio.bruker instead of rsciio.bruker.api.

But it already works like that: you get file_reader and file_writer directly under the vendor name, not under vendor.api. If we took the underscore away from the _api.py filenames, all public internal functions would appear under .api of the selected format, alongside the directly exposed file_reader and file_writer (e.g. rsciio.bruker would expose file_reader, file_writer (a planned writer for the .spx format) and api, where api would hold the internal public functions).

> I would expect that the standard user will not expect or need direct access to all the underlying functions, which is more something for some power users that can take the detour through ._api. But I don't object to making the whole thing public either. Alternatively, we can simply add more functions to the exposed api where applicable, like in the cases you outline - there surely is no need to restrict us to the reader/writer functions, but these should be available consistently.

And that is what the underscore prefix on private functions and methods inside the particular parser code is for: to discourage using them directly, because 1) they are useful only within a particular chain of functions and using them outside it can give unexpected results, 2) they are poorly documented, 3) their API is not stable. In most formats this does not apply to the non-prefixed functions, which are the most stable code of everything there, well documented and clearly designed for direct use as well.

Furthermore, it is not only standard users but also other developers who build software on this. To be honest, I think aiming at standard users misses the point badly; I doubt that any standard users are even interested in this kind of library, as it just returns data and metadata. HyperSpy and its extensions are more of a standard-user target. Hiding functions which were designed to be public (well documented, independent, stable API) by prefixing the file with an underscore is like telling people: "don't dare to use this code as you like, only use it as we tell you to", which is like all that proprietary software which locks every possible bit of unforeseen potential usability away from everyone. I find it really sad that one of the "anti"-pattern pillars (so much practiced within proprietary packages and software, and one of the main annoyances that motivated me to invest so much time in liberating data formats so that anyone can use them however they like) is creeping into a libre open source project. I kind of understand the reason behind such restrictions in proprietary software, where they try to monetize every additional feature (there it is even considered a good pattern), but I can't see any benefit in raising such artificial restrictions here. By restricting the discoverability of the useful internal APIs of formats (that is what the underscore prefix in the filename does) we just push away potential users. Who can know that to access the more powerful functions you need to import ._api of the vendor? Only those who develop this package, or have contributed to it and seen how the source code is organised, as most code completion or introspection tools (e.g. Jupyter) will ignore it. For people who install libraries through pip/mamba/conda, without inspecting the source code, that will be very hard to discover.

Rapid development based on code reuse, to get the job done without waiting for requested functionality to be implemented: that is one of the selling points of open source.

sem-geologist · 2023-05-18T12:05:26Z

Author

I was thinking and thinking and thinking... maybe the most correct idea would be to expose the public functions by defining __all__ inside _api.py and exposing it under vendor.api, alongside file_reader and file_writer. That way it rests on the shoulders of the developer (of the particular format, which they are familiar with) to ensure that only good, stable, well-thought-out functions and classes, with no plans of breaking, are exposed inside vendor.api. I will open a new PR to show a proof of concept with bruker.

ericpre · 2023-05-18T12:49:49Z

Maintainer

Yes, an alternative is to define __all__ in vendor/__init__, so it can be imported with from rsciio.vendor import ReaderClass, function. However, if we want to design a consistent API, any new API needs to be considered with respect to the other formats.

The current _api convention is a generic name and comes from the fact that in hyperspy each vendor IO plugin was contained in a single file; now, in rosettasciio, each has its own folder, and in that folder anything is possible!

It would be good to start writing down, or implementing, a prototype of what the structure of the API would be, and to use it for a while as a prototype until we are satisfied with the API design. Then the structure can be implemented for all formats.

sem-geologist · 2023-05-18T13:22:01Z

Author

Let me clarify again what I have in mind with __all__. API has two meanings here. First, the unified API (that is, same-named functions which do the same thing with a similar expected outcome). Second, the API useful when dealing with a specific vendor format. Unified functions should be exposed directly under the vendor (__all__ defined in __init__.py in the vendor folder), with naming conventions to be discussed (which, at least for file_reader and file_writer, look well established). They are exposed as vendor.function (e.g. rsciio.bruker.file_reader). Vendor-specific API, however, should be exposed under vendor.api.function (e.g. rsciio.bruker.api.SFSReader). These would be functions which allow interested parties to implement their own functionality "already now", not in a few years: either they would take a lot of time, discussion and polishing to become unified interfaces (if there are similarities with other vendor formats, e.g. deep hierarchy trees), or they are unique to the particular vendor format and unified functions will never see the light. In particular I want to expose SFSReader and EDXSpectrum, which would allow dealing with project files, .pan files (particle analysis) and all kinds of other unique stuff which reuse the same data structures and technologies that are currently "fully" utilised in the supported bcf and spx Bruker formats. It is reusable code which I think can be rapidly applied to most newly developed Bruker formats. So by mentioning __all__ I mean the functions and classes listed in vendor._api.py.__all__.

The only thing to discuss and agree on here is whether the naming vendor.api.function (the "api" part) is OK for such unique public functions exposed under it.
What could be alternatives to .api? .helper, .vendor_specific, .internal, .public, .etc, .utils, .misc?
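The two-level split described above (unified names directly under the vendor, format-specific names under vendor.api) could be checked mechanically. The sketch below uses SimpleNamespace objects as stand-ins for format modules; the helper missing_unified_names and the choice of treating only file_reader as required are assumptions for illustration.

```python
from types import SimpleNamespace

# Assumption for this sketch: only file_reader is mandatory, since some
# formats are read-only and have no writer.
UNIFIED_NAMES = ("file_reader",)


def missing_unified_names(format_module, required=UNIFIED_NAMES):
    """Return the unified function names a format module fails to expose."""
    return [
        name
        for name in required
        if not callable(getattr(format_module, name, None))
    ]


# Stand-ins for vendor modules: one follows the two-level layout,
# the other only offers format-specific helpers under .api.
conforming = SimpleNamespace(
    file_reader=lambda filename: [],
    api=SimpleNamespace(ContainerReader=object),  # hypothetical class name
)
incomplete = SimpleNamespace(
    api=SimpleNamespace(ContainerReader=object),
)

print(missing_unified_names(conforming))
print(missing_unified_names(incomplete))
```

A check like this could run in a test suite over all format modules, so the unified layer stays consistent while each vendor.api remains free to differ.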

sem-geologist · 2023-05-18T14:13:37Z

Author

I just realized where this confusion originates. It looks like the project's README.md is outdated! (Now I understand where @jlaehne's comment comes from.)
The following example under NOTE in the README won't work:

from rsciio.msa import api  # this won't work

as there is no api under msa (or any other vendor).
To use file_reader one needs to import it like this:

import rsciio.msa as api
api.file_reader(...)

or

from rsciio import msa
msa.file_reader(...)

1 reply

jlaehne May 19, 2023
Collaborator

Thanks for pointing that out, I quickly opened a PR to fix that, also on the first page of the user guide: #120

ericpre · 2023-05-18T16:56:03Z

Maintainer

Thanks for clarifying: what you are after is a stable (with deprecation cycle) and well-documented API specific to each format that ignores consistency with other formats. As said, this would enable users who don't care about the "unified api" to engage with rosettasciio by using it and contributing to the formats they are interested in, and at the same time we could piggyback on that to make it available in a unified manner (more than just file_reader and file_writer) to users who want to access a number of different formats in a consistent manner. This sounds very good to me!
I can't think of anything better than format.api to access the format specific API and define the unified API in format.__init__, as it is done currently.
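Since arguments are part of the API too, a deprecation cycle often has to cover keyword renames. A minimal sketch of such a shim follows; the decorator renamed_argument, the function read_node and its argument names are all hypothetical.

```python
import functools
import warnings


def renamed_argument(old, new):
    """Accept a deprecated keyword for a while, forwarding it to the new name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old in kwargs:
                if new in kwargs:
                    raise TypeError(f"pass either {old!r} or {new!r}, not both")
                warnings.warn(
                    f"{old!r} is deprecated, use {new!r} instead",
                    DeprecationWarning,
                    stacklevel=2,  # attribute the warning to the caller
                )
                kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@renamed_argument(old="index", new="node")
def read_node(filename, node=0):
    # Placeholder body standing in for a format-specific reader function.
    return {"filename": filename, "node": node}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = read_node("project.file", index=2)
    new_style = read_node("project.file", node=2)

print(old_style == new_style, len(caught))
```

The old keyword keeps working during the cycle and only the old spelling triggers a warning, so callers can migrate at their own pace.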

jlaehne · 2023-05-19T07:41:33Z

Collaborator

OK, so easiest would be if we rename _api.py back to api.py for all formats? file_reader/writer are then accessible directly and through .api, but that should not hurt.

1 reply

sem-geologist May 19, 2023
Author

No. I think in the end I got convinced that it is more correct to hide some of the details for most formats (let's leave _api.py intact). Some formats use public notation for methods and functions which are not well documented, or are actually just a "middle cog-wheel" in the parsing process and should be private. Exposing all functions of all formats is asking for trouble. I think creating api.py alongside _api.py, and exposing wisely cherry-picked functions and classes which could be useful, would be a better approach. I will upload an updated bruker format with such treatment to show how to achieve that.
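The layout proposed in this last reply could look roughly like the following throwaway package, built in a temporary directory so the sketch runs on its own. The package name demo_format and the names ContainerReader and unpack_chunk are invented; the actual bruker implementation may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# _api.py keeps everything, including helpers that happen to have public names.
PRIVATE_SOURCE = '''
class ContainerReader:
    # Hypothetical reusable class worth exposing to other developers.
    def __init__(self, filename):
        self.filename = filename

def unpack_chunk(raw):
    # A "middle cog-wheel": public-looking name, but not meant for direct use.
    return raw[::-1]

def file_reader(filename):
    return [{"data": None, "metadata": {"source": str(filename)}}]
'''

# api.py re-exports only the cherry-picked names.
PUBLIC_SOURCE = 'from ._api import ContainerReader\n__all__ = ["ContainerReader"]\n'

# The package root carries the unified function plus the curated api module.
INIT_SOURCE = (
    "from . import api\n"
    "from ._api import file_reader\n"
    '__all__ = ["file_reader", "api"]\n'
)

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_format"  # hypothetical package name
    pkg.mkdir()
    (pkg / "_api.py").write_text(PRIVATE_SOURCE)
    (pkg / "api.py").write_text(PUBLIC_SOURCE)
    (pkg / "__init__.py").write_text(INIT_SOURCE)

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_format = importlib.import_module("demo_format")
    finally:
        sys.path.remove(tmp)

# Only the curated class shows up under .api; the cog-wheel stays in ._api.
exposed = sorted(n for n in dir(demo_format.api) if not n.startswith("_"))
print(exposed)
```

Because api.py imports names one by one, adding something to the public surface is an explicit decision by the format's maintainer rather than a side effect of how a helper was named.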
