Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow passing BytesIO objects to xls2xform_convert #698

Closed

Conversation

spwoodcock
Copy link

@spwoodcock spwoodcock commented Mar 4, 2024

Included

  • Added BytesIO support for XLSForm input to xls2xform_convert, with output of the converted XForm as a BytesIO object also.
  • There was already code in most places to handle in memory objects, e.g. xls2json and xls2json_backends, but it was not included in higher level xls2xform_convert.
  • This is not a perfect solution, as it requires passing a dummy path for xlsform_path when `xlsform_object=BytesIO(...). However, this is required to not break existing functionality.
    • Setting xlsform_path=/tmp/form.xls or xlsform_path=/tmp/form.xlsx allows us to determine the type of form input (XLS, XLSX, or CSV formats).
    • If we did not include this, then there is no easy way to tell which type of data is in the BytesIO object.
      We do need to know the file type of the BytesIO object though, so this works
  • Note that validation of the input XLSForm is possible, however validation of the output XForm is not possible in this mode, as no file exists for the Java file-based validation.
  • Includes tests for all functionality.

Why is this the best possible solution? Were any other approaches considered?

  • I considered a larger refactor to better handle the filespec/str and BytesIO distinction, but decided to write things in line with existing code.

What are the regression risks?

  • Hopefully not as the test suite is quite extensive.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code (I ran this and it updated all the files in the repo - I ran it only on the files I edited instead)
  • verified that any code or assets from external sources are properly credited in comments

@spwoodcock
Copy link
Author

I just added two updates:

  • Better error handling and informative messages if the XLS/XLSX files are empty or corrupt.
  • Tests to verify this.

@lindsay-stevens
Copy link
Contributor

Hi @spwoodcock, thanks for time preparing this PR. I know we don't have a contributing guidelines in this project repo, so apologies for that, though the ODK docs guidelines has the gist of it. Ideally before opening a PR, a feature request or bug report should be raised on the ODK forum first, or at least in a GitHub issue. That way, the pyxform team can work with contributors to understand the problem or use case, consider design issues or how an implementation might fit into the project functionality without causing regressions, and decide whether it's a change that can reasonably be supported in future with the project's limited resources. The forum has the benefit of higher exposure, so you can potentially get feedback from other users as well.

Currently it's unclear as to why, when, or what types of users would need to pass in a BytesIO object to xls2xform_convert. This function's main purpose is to handle the incoming XLSForm conversion request from the command line interface by coordinating the typical conversion pipeline. The xls2xform is not part of the implied library API - there is a somewhat random collection of objects available in the top level __init__.py and it probably should use __all__, but anyway the xlsx2xform.py module is not exposed there to indicate that it should be directly imported by other projects. Ironically pyxform-http imports xls2xform, but that project is an example of how to incorporate pyxform into another app with the current release.

Please help us out by opening an ODK forum thread or GitHub issue to address and discuss the above. Thanks!

@spwoodcock
Copy link
Author

spwoodcock commented Mar 8, 2024

Hi @lindsay-stevens, thanks for the detailed feedback!

I agree, it would be a good idea creating a forum post, my apologies.
I had already written this code for our project FMTM, so thought why not make a PR instead of using our own fork.

While the documentation is fantastic for users creating XLSForms, there doesn't seem to be much in terms of developer documentation (correct me if I am wrong).
One of the devs in our team decided to use xls2xform_convert for converting XLS/XLSX to XForm XML.
I think this was based on the usage in pyxform-http.
When I reviewed it looked like the appropriate function to use, and worked as intended.

There is currently support for passing memory objects in parse_file_to_json.
Would your recommended approach be something like:

from pyxform.xls2json import parse_file_to_json
from pyxform.builder import create_survey_element_from_dict
xlsform_data = ... # code to get xlsform bytes
json_data = parse_file_to_json(path="/dummy/path/with/file/ext.xls", file_object=xlsform_data)
xform = create_survey_element_from_dict(json_data)
xform_xml = xform.to_xml()

If so, could I contribute an example like this to the docs?

Funnily enough our use case is exactly the same as pyxform-http, where we have a web API that users can upload XLSForms to.
We need to parse the forms, validate, and output the XForm XML for usage in ODK Central/Collect.

Originally the devs implemented by writing the XLSForm to a temp file, converting and writing the XForm to a temp file, then inserting the XForm bytes into a database, then deleting the temp files.
I thought this was very inefficient and could all be handled in memory.
pyxform-http could probably benefit from this too, processing entirely in memory instead of requiring disk read/writes (also correct me if I am wrong)

I will open a forum post with all of this info 😄

@lognaturel
Copy link
Contributor

lognaturel commented Mar 11, 2024

This gets at some questions @lindsay-stevens and I have asked ourselves over the past few years about what pyxform's interface is/should be. It's true that pyxform-http would benefit from not having to write files, I'm not sure why we haven't discussed that before.

Your 4-liner looks ok to me but it would also need to reimplement some logic to support pretty printing, different validators, etc? And itemsets.csv generation would have to be handled separately?

I do think the question of what the interface should be for a client that wants to convert an in-memory XLSForm into an in-memory XForm would make for a good forum discussion topic if you're up for moving the discussion there so it gets more visibility.

@spwoodcock
Copy link
Author

Closing in favour of #712

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants