
Support handling of parquet tables #434

Merged: 29 commits from parquet-tables into main on Jul 9, 2024
Conversation

hagenw
Member

@hagenw hagenw commented Jun 20, 2024

Closes #433

Summary

In audeering/audformat#419 we introduced storing tables as parquet files (parquet tables). This pull request extends audb to support parquet tables. Parquet tables are slightly differently handled than csv tables:

  • their checksum is not calculated on the fly, but read from the b"hash" entry in the parquet file metadata
  • they are stored directly on the backend, not inside a zip file

Besides the obvious changes, there are a few edge cases and non-obvious consequences:

  1. archive is set to "" in the dependency table, as parquet files are stored directly on the backend.
  2. audb.publish() will convert pickle-only tables to parquet instead of csv before publication.
  3. A table can exist as a csv and a parquet file at the same time. audb.publish() will ignore the csv file in this case.
  4. I removed support for Python 3.8 and require pandas>=2.1.0, as these are the minimum requirements of audformat>=1.2.0.
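Points 2 and 3 amount to a precedence rule among table file formats. The helper below is hypothetical (not audb's actual code) and only illustrates that rule: parquet wins over csv, and a pickle-only table ends up as parquet.

```python
def table_file_to_publish(files):
    """Illustrative: pick which stored table file a publish step would use."""
    if any(f.endswith(".parquet") for f in files):
        # csv (and pickle) versions of the same table are ignored
        return next(f for f in files if f.endswith(".parquet"))
    if any(f.endswith(".csv") for f in files):
        return next(f for f in files if f.endswith(".csv"))
    if any(f.endswith(".pkl") for f in files):
        # a pickle-only table is converted to parquet before publication
        pkl = next(f for f in files if f.endswith(".pkl"))
        return pkl.rsplit(".", 1)[0] + ".parquet"
    raise ValueError("no supported table file found")

print(table_file_to_publish(["db.files.csv", "db.files.parquet"]))  # db.files.parquet
print(table_file_to_publish(["db.files.pkl"]))  # db.files.parquet
```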

Docstring

If a user provides a parquet table without storing it with audformat, or uses parquet files as attachments, those files will not have the b"hash" entry in their metadata. In this case, the code calculates the md5 sum on the fly. I extended the docstring of audb.publish() accordingly with a paragraph discussing this:

[Screenshot: updated audb.publish() docstring paragraph on checksum handling of parquet files]

I also added a section discussing what happens if a table is stored only as a pickle file, or if it is stored as a csv and parquet file:

[Screenshot: new docstring section on pickle-only tables and tables stored as both csv and parquet]

Getting Started Documentation

I updated the "Overview" page in the documentation, stating that only media files and csv tables are now packed into ZIP files when uploading to a backend:

[Screenshot: updated "Overview" documentation page, upload stage]

And marked the unpacking stage as optional when downloading a database:

[Screenshot: updated "Overview" documentation page, download stage with optional unpacking]

Real-world test

I created an internal merge request that re-publishes all tables from librispeech as parquet files: https://gitlab.audeering.com/data/librispeech/-/merge_requests/6

This works as expected for me.

Implementation details

  • I added the new file tests/test_publish_table.py to store all the tests related to the handling of different table file formats.
  • I added audb.core.utils.md5(file). If file is a parquet file, it will return the string found under b"hash" in its metadata. If this is not available, or file is not a parquet file, it uses audeer.md5(file).
  • I had to add the same loading code to both audb/core/load.py and audb/core/load_to.py, as the two modules do not share a common code base. Since this is not easy to fix, I will not address it in this pull request; I created Share more code between audb.load() and audb.load_to() #435 instead.
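The checksum logic described for audb.core.utils.md5(file) can be sketched as follows. This is a self-contained approximation, not the actual implementation: audb falls back to audeer.md5(file), while this sketch uses hashlib directly, which computes the same md5 digest for a regular file.

```python
import hashlib


def md5(file):
    """Sketch of audb.core.utils.md5(): prefer the stored parquet hash."""
    if file.endswith(".parquet"):
        try:
            import pyarrow.parquet as pq

            metadata = pq.read_schema(file).metadata or {}
            if b"hash" in metadata:
                # Checksum stored by audformat, no hashing needed
                return metadata[b"hash"].decode()
        except ImportError:
            pass
    # Not a parquet file, or no stored hash:
    # hash the whole file on the fly
    hasher = hashlib.md5()
    with open(file, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

For a csv attachment or a parquet file written without audformat, this simply degrades to a full-file md5.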

@hagenw hagenw marked this pull request as draft June 20, 2024 09:19
@hagenw hagenw marked this pull request as ready for review June 26, 2024 11:09
@hagenw hagenw requested a review from ChristianGeng June 26, 2024 11:09
@hagenw hagenw changed the title Support handling of tables in parquet Support handling of parquet tables Jun 26, 2024
Member

@ChristianGeng ChristianGeng left a comment


As this has been reviewed in two iterations, and this is the end of the second, I will proceed without further ado.

I have merely rerun the tests; they all pass.

So approval without much fuss.

@hagenw hagenw merged commit da28499 into main Jul 9, 2024
8 checks passed
@hagenw hagenw deleted the parquet-tables branch July 9, 2024 12:53
Development

Successfully merging this pull request may close these issues.

Add support for PARQUET file tables
2 participants