
Verify how large floating-point numbers are exported #187

Open
d33bs opened this issue Apr 12, 2024 · 2 comments
Assignees: d33bs
Labels: question (Further information is requested)

Comments

@d33bs (Member) commented Apr 12, 2024

While working on cytominer-database data comparisons for #30, I noticed variations in how floating-point numbers are treated. As an example, the number 3.5215257120407011 may be found within .tests/data/cytominer-database/data_b/A01-2/Cytoplasm.csv. This number is extracted as 3.521525712040701 (one decimal place fewer) through automatic CSV settings in DuckDB, which interpret it as a DOUBLE. Similarly, cytominer-database appears to change this number to 3.5215257120407006 (altering the final decimal places; I believe this happens through the Pandas CSV reader and/or pd.Series.astype() casts).
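A quick way to see the loss (a minimal standard-library sketch, not the cytominer-database code path): the CSV value is not exactly representable as an IEEE 754 double, so any DOUBLE-typed reader must round it.

```python
from decimal import Decimal

text = "3.5215257120407011"  # value as it appears in the CSV
as_double = float(text)      # what a DOUBLE-typed reader stores

# The binary double closest to the text value is not the value itself,
# so the exact decimal expansion of the stored number differs from the CSV.
assert Decimal(as_double) != Decimal(text)
print(repr(as_double))  # the shortest decimal string that round-trips
```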

I found that DuckDB can correctly interpret this number automatically as a DECIMAL but that this option doesn't appear to be available through the CSV reader yet (planning to open an issue here). As it stands, this number is interpreted as a DOUBLE, which truncates the data.

I did some work to research how this number is interpreted in Python, NumPy, Pandas, DuckDB, and Parquet. Both Python (through Decimal) and NumPy (through longdouble) were able to interpret an extended version of the number but not the precise number itself (similar to how cytominer-database operated above, adding decimal places). Pandas is able to accurately interpret the number through PyArrow types (specifically decimal128). DuckDB seems to follow the same decimal-style formatting as PyArrow and is able to infer the width + scale when reading the number alone. From an Arrow decimal128 type, Parquet is able to write and read the number accurately, I believe through the decimal logical type.

See here for a Google Colab notebook with findings (and a gist backup).

@d33bs d33bs added the question Further information is requested label Apr 12, 2024
@d33bs (Member, Author) commented Apr 12, 2024

Opened duckdb/duckdb#11639 in reference to this issue.

@d33bs (Member, Author) commented Apr 13, 2024

Just a quick update to mention that I tried PyArrow arrays and the built-in CSV reader. The CSV reader turned out to have similar challenges (truncating the number by one decimal place). When attempting to create an array of pa.decimal128(17, 16) type with the number from a Python list, I saw an error (though I may be using it incorrectly). I updated the notebook and gist as a reference point.

@d33bs d33bs self-assigned this Apr 15, 2024