API less restrictive, more friendly to hierarchical data #119
-
Thanks @sem-geologist for starting this discussion: what you say makes a lot of sense to me and I will try to give some context to the discussion.
In short: start from a clean, very minimal public API and grow a consistent API over time. One issue that I encounter as a user and developer is knowing, or making clear, what is expected to be public, stable API. Some will say everything that is explicitly documented as public API; others will say everything which can be imported without an underscore... By using an ugly (I would say dirty) underscore, it is clear that this is something not very clean and potentially asking for trouble, i.e. it may break in the future. The approach being used is pragmatic, to get things moving: splitting the io code from hyperspy without spending too much time on defining a complete stable API, so that, ideally, people can start to use it and contribute to the design of the API. We should also pay attention to the fact that arguments are part of the API, something which is easy to forget... Off the top of my head, I actually don't know how good or bad the situation is in rosettasciio. Designing a good interface will not be trivial, as it would need to be done for a variety of formats, but this is not particularly new either, and it needs to be done correctly and carefully, which will take time. Maybe it would be worth having an experimental API, which will not be stable for a period of time, explicitly presented as experimental or as a prototype.
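One possible way to make an "explicitly experimental" status visible to users is a small decorator that warns on every call. This is only a sketch using the standard library; the names `experimental` and `read_subtree` are hypothetical and do not exist in rosettasciio.

```python
import functools
import warnings


def experimental(func):
    """Mark a callable as experimental: usable, but not yet stable API."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is experimental and may change without a "
            "deprecation cycle.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)

    wrapper.__experimental__ = True  # lets docs and tooling detect the status
    return wrapper


@experimental
def read_subtree(filename):  # hypothetical function, for illustration only
    return {"filename": filename, "children": {}}
```

Calling `read_subtree("demo.bcf")` (a placeholder file name) still returns its result, but emits a `FutureWarning`, so nobody can mistake it for a stable interface.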
-
Also, we wanted to expose I would expect that the standard user will not expect or need direct access to all the underlying functions, which is more something for some power users that can take the detour through
Concerning hierarchical data formats (like hdf5, where h5py provides such functionality), your ideas make perfect sense. But like @ericpre said, we should get a first version out soon and can then develop the API further.
-
I was thinking and thinking and thinking... maybe the most correct idea would be to expose the public functions by defining
-
Yes, an alternative is to define The current It would be good to start writing down or implementing a prototype of what the structure of the API would be, and to use it for a while as a prototype until we are satisfied with the API design. Then the structure can be implemented for all formats.
-
Let me clarify again what I have in mind with The only thing to discuss and agree on here is if naming
-
I just realized where this confusion originates. It looks like the project's README.md is outdated! (Now I understand where @jlaehne's comment comes from.)
from rsciio.msa import api # this won't work as there is no
import rsciio.msa as api
api.file_reader(...)
or
from rsciio import msa
msa.file_reader(...)
-
Thanks for clarifying: what you are after is a stable (with a deprecation cycle) and well-documented API specific to each format that ignores consistency with other formats. As said, this would enable users who don't care about the "unified api" to engage with rosettasciio by using it and contributing to the formats they are interested in and, at the same time, we could piggyback on that to make it available in a unified manner (more than just
-
OK, so the easiest would be if we rename
-
I should rant a bit first (that's me, how to start a day without ranting? 😆). I am developing HussariX and I struggle with the restrictions imposed by the HyperSpy structure; I thought things would get a bit better after splitting out the io code. To me it looks like not much has changed, and I am still maintaining duplicate code with different data access interfaces.

Currently it is expected that a given format's API provides `.file_reader` and/or `.file_writer`. This limits usage of the library to flat-structured physical files, where signals are present as a single signal or a group originating within a single measurement/ROI. For example, contrary to that assumption, the old Jeol asw folder-based projects organize data by project and sample (a two-level hierarchy). Bruker Esprit projects allow deep, custom hierarchical structures with no artificially imposed limit on the number of levels (something like a file system with folders and files within). As #118 came up, and I started to think again about adding Cameca EPMA formats to this library, I again see a "wall" standing in front of me, as data is hierarchical inside these files and translates poorly to flat lists of dictionaries when reading whole files. Thus I need to maintain a shared codebase which has different interfaces to access data. Life would be much easier if this project 1) did not impose artificial restrictions and, even better, 2) if we could come up with a unified extension of more advanced data access interfaces.

Who got the idea to hide everything from the user (I often find myself wondering whether this is philosophically a Java project written in Python syntax...) and expose only these functions (`file_reader`, `file_writer`)? If I do `import rsciio.bruker as api`, I get only the `.file_reader` function exposed. If I want the full functionality I need to do the ugly `import rsciio.bruker._api as api`.

Idea 1. Let's leave the Java-esque restricted thinking and move to the openness of Python. I suggest we drop the underscore in the filename `_api.py` of every format. Why can't it simply be `api` and expose all those internal functions to the user? Is it not enough that functions and methods which are not meant to be used directly by the user are already prefixed with an underscore inside the code of the particular format parser? If we dropped the underscore in `_api.py`, then, for example, by doing `import rsciio.bruker as api` we would also get `.api` accessible alongside `.file_reader`. `.file_writer` and `.file_reader` are the guaranteed unified interfaces, whereas `api` would provide more specific functions for dealing with the specific files and workflows feasible for that particular vendor's formats. Early adopters (potentially me) could use some functionality directly until (or if) unified functionality is developed (see the further ideas).

Idea 2. Let's expand the unified interface functions. Currently it is clear that a file name, and thus a physical file, is expected as the argument for `file_reader` and `file_writer`. But what if we want to get a single piece of data and metadata directly from a hierarchical (tree-like) storage entity, e.g. a node in a huge XML, or some files inside a zip? BTW, currently the testing of some tiff functionality uses zipped tifs (the cumbersomeness and many disadvantages of compressing test data in a git repository are a bit off-topic here, but worth coming back to in another discussion) and needs to extract files to a temporary location to load them using hyperspy. Ironically, rsciio without hyperspy (on its own, by invoking `rsciio.tiff.file_reader`) is able to read those directly from within the zip! Why would it not? `tifffile`'s `TiffFile` accepts str | os.PathLike | BytesIO | or anything similar! But HyperSpy needs physical files, because it uses the file extension to recognize which reader to use, and thus the test files need to be physically extracted. `rsciio` still lacks this unified functionality (I mean something like hyperspy has; are there plans to add a unified `rsciio.read` or `rsciio.load`, `rsciio.write` or `rsciio.save`?). Thus my proposal would be to extend the matrix of read/write capabilities and add corresponding functions such as `bytesio_reader`, which could read from objects that share a common interface with BytesIO, that is, the `.read`, `.seek`, `.tell`, `.peek`... methods. Others would be `etree_reader` and `etree_writer`, for readers and writers (some planned to be made) which deal with XML through the ElementTree interface.

Idea 3. It would be nice if hierarchically saved data systems could expose another unified method, `get_tree` (naming is up for discussion), for those formats which have a dataset/sample/project/etc. hierarchy, e.g. Bruker projects and Jeol asw projects (there are probably more that I currently don't know of). Such a tree (ordered dictionary) could then be used to retrieve data and metadata on demand from particular nodes.
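As a starting point for discussing Idea 3, a tree with on-demand loading could be sketched roughly as below. The `Node` class, the path syntax and the hard-coded content are all assumptions made for illustration; a real `get_tree` would build the nodes by parsing the project file.

```python
from collections import OrderedDict


class Node:
    """One entry of a hierarchical file; its payload is loaded on demand."""

    def __init__(self, name, loader=None):
        self.name = name
        self.children = OrderedDict()
        self._loader = loader  # callable returning {"data": ..., "metadata": ...}

    def add(self, child):
        self.children[child.name] = child
        return child

    def __getitem__(self, path):
        # Walk a "project/sample/dataset" style path down the tree.
        node = self
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node

    def load(self):
        if self._loader is None:
            raise ValueError(f"{self.name!r} is a container, not a dataset")
        return self._loader()


def get_tree(filename):
    """Hypothetical: build the tree for a file (hard-coded toy content here)."""
    root = Node(filename)
    sample = root.add(Node("project_A")).add(Node("sample_1"))
    sample.add(Node("map_1", loader=lambda: {
        "data": [[1, 2], [3, 4]], "metadata": {"kind": "map"}}))
    sample.add(Node("spectrum_1", loader=lambda: {
        "data": [5, 6, 7], "metadata": {"kind": "spectrum"}}))
    return root


tree = get_tree("demo_project")
names = list(tree["project_A/sample_1"].children)
spectrum = tree["project_A/sample_1/spectrum_1"].load()
```

Only the node that is asked for gets loaded; browsing `children` never touches the data, which is the point for large projects.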