
Verify how large floating-point numbers are exported #187

Open
d33bs opened this issue Apr 12, 2024 · 2 comments
Assignees: d33bs
Labels: question (Further information is requested)

Comments

@d33bs (Member) commented Apr 12, 2024

While working on cytominer-database data comparisons for #30, I noticed variations in how floating-point numbers are treated. As an example, the number 3.5215257120407011 may be found within .tests/data/cytominer-database/data_b/A01-2/Cytoplasm.csv. This number is extracted as 3.521525712040701 (one decimal place fewer) through automatic CSV settings in DuckDB, which interpret it as a DOUBLE. Similarly, cytominer-database appears to change this number to 3.5215257120407006 (altering the final decimal places; I believe this happens through the Pandas CSV reader and/or pd.Series.astype() casts).
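A quick way to see the loss (a minimal standard-library sketch, not the cytominer-database code path): the CSV value is not exactly representable as an IEEE 754 double, so any DOUBLE-typed reader must round it.

```python
from decimal import Decimal

text = "3.5215257120407011"  # value as it appears in the CSV
as_double = float(text)      # what a DOUBLE-typed reader stores

# The binary double closest to the text value is not the value itself,
# so the exact decimal expansion of the stored number differs from the CSV.
assert Decimal(as_double) != Decimal(text)
print(repr(as_double))  # the shortest decimal string that round-trips
```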

I found that DuckDB can correctly interpret this number automatically as a DECIMAL but that this option doesn't appear to be available through the CSV reader yet (planning to open an issue here). As it stands, this number is interpreted as a DOUBLE, which truncates the data.

I did some work to research how this number is interpreted in Python, NumPy, Pandas, DuckDB, and Parquet. Both Python (through Decimal) and NumPy (through longdouble) were able to interpret an extended version of the number but not the precise number itself (similar to how cytominer-database operated above, adding decimal places). Pandas is able to accurately interpret the number through PyArrow types (specifically decimal128). DuckDB seems to follow the same decimal-style formatting as PyArrow and is able to infer the width + scale when reading the number alone. From an Arrow decimal128 type, Parquet is able to write and read the number accurately, I believe through the decimal logical type.

See here for a Google Colab notebook with findings (and a gist backup).

@d33bs d33bs added the question Further information is requested label Apr 12, 2024
@d33bs (Member, Author) commented Apr 12, 2024

Opened duckdb/duckdb#11639 in reference to this issue.

@d33bs (Member, Author) commented Apr 13, 2024

Just a quick update to mention that I tried PyArrow arrays and the built-in CSV reader. The CSV reader turned out to have similar challenges (truncating the number by one decimal place). When attempting to create an array of pa.decimal128(17, 16) type with the number from a Python list, I saw an error (though I may be using it incorrectly). I updated the notebook and gist as a reference point.

@d33bs d33bs self-assigned this Apr 15, 2024