
Support handling of parquet tables #434

Merged: 29 commits from parquet-tables into main on Jul 9, 2024
Conversation

hagenw
Member

@hagenw hagenw commented Jun 20, 2024

Closes #433

Summary

In audeering/audformat#419 we introduced storing tables as parquet files (parquet tables). This pull request extends audb to support parquet tables. Parquet tables are slightly differently handled than csv tables:

  • their checksum is not calculated on the fly, but read from the b"hash" entry in the parquet file metadata
  • they are stored directly on the backend, not inside a zip file

Besides the obvious changes, there are a few edge cases and non-obvious consequences:

  1. archive is set to "" in the dependency table, as parquet files are stored directly on the backend.
  2. audb.publish() will convert pickle-only tables to parquet instead of csv before publication.
  3. A table can exist as a csv and a parquet file at the same time. audb.publish() will ignore the csv file in this case.
  4. I removed support for Python 3.8 and require pandas>=2.1.0, as these are the minimum requirements of audformat>=1.2.0.
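Points 2 and 3 amount to a precedence rule among table file formats. The helper below is hypothetical (not audb's actual code) and only illustrates that rule: parquet wins over csv, and a pickle-only table ends up as parquet.

```python
def table_file_to_publish(files):
    """Illustrative: pick which stored table file a publish step would use."""
    if any(f.endswith(".parquet") for f in files):
        # csv (and pickle) versions of the same table are ignored
        return next(f for f in files if f.endswith(".parquet"))
    if any(f.endswith(".csv") for f in files):
        return next(f for f in files if f.endswith(".csv"))
    if any(f.endswith(".pkl") for f in files):
        # a pickle-only table is converted to parquet before publication
        pkl = next(f for f in files if f.endswith(".pkl"))
        return pkl.rsplit(".", 1)[0] + ".parquet"
    raise ValueError("no supported table file found")

print(table_file_to_publish(["db.files.csv", "db.files.parquet"]))  # db.files.parquet
print(table_file_to_publish(["db.files.pkl"]))  # db.files.parquet
```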

Docstring

If a user provides a parquet table without storing it with audformat, or uses parquet files as attachments, those files will not have the b"hash" entry in their metadata. In this case, the code calculates the md5 sum on the fly. I extended the docstring of audb.publish() accordingly with a paragraph discussing this:

[Screenshot: updated audb.publish() docstring paragraph on checksum handling of parquet files]

I also added a section discussing what happens if a table is stored only as a pickle file, or if it is stored as a csv and parquet file:

[Screenshot: new docstring section on pickle-only tables and tables stored as both csv and parquet]

Getting Started Documentation

I updated the "Overview" page in the documentation, stating that only media files and csv tables are now packed into ZIP files when uploading to a backend:

[Screenshot: updated "Overview" documentation page, upload stage]

And marked the unpacking stage as optional when downloading a database:

[Screenshot: updated "Overview" documentation page, download stage with optional unpacking]

Real-world test

I created an internal merge request that re-publishes all tables from librispeech as parquet files: https://gitlab.audeering.com/data/librispeech/-/merge_requests/6

This works as expected for me.

Implementation details

  • I added the new file tests/test_publish_table.py to store all the tests related to the handling of different table file formats.
  • I added audb.core.utils.md5(file). If file is a parquet file, it will return the string found under b"hash" in its metadata. If this is not available, or file is not a parquet file, it uses audeer.md5(file).
  • I had to add the same loading code to both audb/core/load.py and audb/core/load_to.py, as the two modules do not share a common code base. Since this is not easy to fix, I will not address it in this pull request; I created Share more code between audb.load() and audb.load_to() #435 instead.
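The checksum logic described for audb.core.utils.md5(file) can be sketched as follows. This is a self-contained approximation, not the actual implementation: audb falls back to audeer.md5(file), while this sketch uses hashlib directly, which computes the same md5 digest for a regular file.

```python
import hashlib


def md5(file):
    """Sketch of audb.core.utils.md5(): prefer the stored parquet hash."""
    if file.endswith(".parquet"):
        try:
            import pyarrow.parquet as pq

            metadata = pq.read_schema(file).metadata or {}
            if b"hash" in metadata:
                # Checksum stored by audformat, no hashing needed
                return metadata[b"hash"].decode()
        except ImportError:
            pass
    # Not a parquet file, or no stored hash:
    # hash the whole file on the fly
    hasher = hashlib.md5()
    with open(file, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

For a csv attachment or a parquet file written without audformat, this simply degrades to a full-file md5.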

@hagenw hagenw marked this pull request as draft June 20, 2024 09:19
@hagenw hagenw marked this pull request as ready for review June 26, 2024 11:09
@hagenw hagenw requested a review from ChristianGeng June 26, 2024 11:09
@hagenw hagenw changed the title Support handling of tables in parquet Support handling of parquet tables Jun 26, 2024
Member

@ChristianGeng ChristianGeng left a comment


As this has been reviewed in two iterations, and this is the end of the second, I will proceed without further ado.

I have merely rerun the tests; they all pass.

So approval without much fuss.

@hagenw hagenw merged commit da28499 into main Jul 9, 2024
8 checks passed
@hagenw hagenw deleted the parquet-tables branch July 9, 2024 12:53
Development

Successfully merging this pull request may close these issues.

Add support for PARQUET file tables
2 participants