Support handling of parquet tables #434
Merged
hagenw changed the title from "Support handling of tables in parquet" to "Support handling of parquet tables" on Jun 26, 2024
ChristianGeng approved these changes on Jul 9, 2024
As this has been reviewed in two iterations, and this is the end of the second, I will proceed without further ado.
I merely reran the tests, and they are all passing.
So, approval without much fuss.
Closes #433
Summary
In audeering/audformat#419 we introduced storing tables as parquet files (parquet tables). This pull request extends `audb` to support parquet tables. Parquet tables are handled slightly differently than csv tables:
- the checksum of a parquet table is stored in a `b"hash"` metadata entry in its header
Besides the obvious changes, there are a few edge cases and non-obvious consequences:
- `archive` is set to `""` in the dependency table, as parquet files are stored directly on the backend.
- `audb.publish()` will convert pickle-only tables to parquet instead of csv before publication.
- If a table is stored as both a csv and a parquet file, `audb.publish()` will ignore the csv file.
- `pandas>=2.1.0` is required, as this is a minimum requirement of `audformat>=1.2.0`.
Docstring
If a user provides a parquet table that was not stored with `audformat`, or uses parquet files as attachments, neither will have the `b"hash"` entry in its metadata. In this case, the code calculates the MD5 sum on the fly. I extended the docstring of `audb.publish()` accordingly with a paragraph discussing this.
I also added a section discussing what happens if a table is stored only as a pickle file, or if it is stored as both a csv and a parquet file.
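The table-format rules described in this pull request can be summarized in a short sketch. It is not the code used by `audb.publish()`; the function name `select_table_file` is hypothetical, the `"pkl"` extension for pickle files is an assumption, and publishing a csv file as-is when no parquet file exists is assumed rather than stated here.

```python
def select_table_file(extensions):
    """Pick the table file format to publish (hypothetical helper).

    ``extensions`` holds the file extensions found for one table,
    e.g. ``{"csv", "pkl"}``.
    Returns a tuple ``(format, needs_conversion)``.
    """
    extensions = set(extensions)
    if "parquet" in extensions:
        # A csv file stored next to the parquet file is ignored
        return "parquet", False
    if "csv" in extensions:
        # Assumption: csv tables are published as csv
        return "csv", False
    if "pkl" in extensions:
        # Pickle-only tables are converted to parquet instead of csv
        return "parquet", True
    raise ValueError("No table file found")
```

For example, `select_table_file({"pkl"})` selects parquet and signals that a conversion is needed, whereas `select_table_file({"csv", "parquet"})` selects the existing parquet file.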
Getting Started Documentation
I updated the "Overview" page in the documentation to state that now only media files and csv files are packed into ZIP files when uploading to a backend.
I also marked the unpacking stage as optional when downloading a database.
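As a rough illustration of why unpacking becomes optional, the following sketch packs everything except parquet files into a ZIP archive and copies parquet files directly. The functions `upload` and `download`, the archive naming, and the use of a local directory in place of a backend are all assumptions made for this example; `audb` handles backends differently.

```python
import os
import shutil
import zipfile


def upload(path, backend_dir):
    """Store a file in a directory standing in for the backend.

    Returns the name under which the file was stored.
    """
    name = os.path.basename(path)
    if name.endswith(".parquet"):
        # Parquet tables are stored directly, without packing
        shutil.copy(path, os.path.join(backend_dir, name))
        return name
    # Media and csv files are packed into a ZIP file
    archive = os.path.splitext(name)[0] + ".zip"
    with zipfile.ZipFile(os.path.join(backend_dir, archive), "w") as zf:
        zf.write(path, arcname=name)
    return archive


def download(stored, backend_dir, target_dir):
    """Fetch a stored file, unpacking it only if it is a ZIP file."""
    src = os.path.join(backend_dir, stored)
    if stored.endswith(".zip"):
        with zipfile.ZipFile(src) as zf:
            zf.extractall(target_dir)
    else:
        # No unpacking stage for directly stored files
        shutil.copy(src, os.path.join(target_dir, stored))
```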
Real-world test
I created an internal merge request that re-publishes all tables from `librispeech` as parquet files: https://gitlab.audeering.com/data/librispeech/-/merge_requests/6
This works as expected for me.
Implementation details
- Added `tests/test_publish_table.py` to store all the tests related to the handling of different table file formats.
- `audb.core.utils.md5(file)`: if `file` is a parquet file, it returns the string found under `b"hash"` in its metadata. If this is not available, or `file` is not a parquet file, it uses `audeer.md5(file)`.
- Changes were needed in both `audb/core/load.py` and `audb/core/load_to.py`, as those do not really share the same code base. As this is not easy to fix, I would not handle it inside this pull request. I created "Share more code between audb.load() and audb.load_to()" (#435) instead.
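The checksum lookup described for `audb.core.utils.md5(file)` could look roughly like the sketch below. It is not the actual implementation: the names `checksum`, `file_md5`, and `parquet_metadata` are hypothetical, `hashlib` stands in for `audeer.md5()`, and reading the metadata with `pyarrow.parquet.read_schema()` is an assumption about how the header is accessed.

```python
import hashlib
import os


def file_md5(path, chunk_size=8192):
    """Return the MD5 hex digest of the file content."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parquet_metadata(path):
    """Return the schema metadata of a parquet file as a dict.

    Keys and values are bytes. Requires pyarrow,
    which is only imported when a parquet file is inspected.
    """
    import pyarrow.parquet as parquet

    return parquet.read_schema(path).metadata or {}


def checksum(path, read_metadata=parquet_metadata):
    """Return the stored hash of a parquet file, or the MD5 sum otherwise."""
    if os.path.splitext(path)[1].lower() == ".parquet":
        metadata = read_metadata(path)
        if b"hash" in metadata:
            # Use the hash stored in the file header
            return metadata[b"hash"].decode()
    # No parquet file, or no stored hash: calculate the MD5 sum on the fly
    return file_md5(path)
```

Passing the metadata reader as an argument keeps the fallback logic testable without real parquet files.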