While working on `cytominer-database` data comparisons for #30 I noticed variations in how floating point numbers are treated. As an example, the number `3.5215257120407011` may be found within `.tests/data/cytominer-database/data_b/A01-2/Cytoplasm.csv`. This number is extracted as `3.521525712040701` (one decimal place fewer; through automatic CSV settings, DuckDB interprets it as a DOUBLE). Similarly, `cytominer-database` appears to change this number to `3.5215257120407006` (three additional decimal places; I believe this happens through the Pandas CSV reader and/or `pd.Series.astype()` casts).
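The underlying loss can be reproduced with nothing but the Python standard library: an IEEE 754 double has no exact bit pattern for this decimal value (it reduces to a fraction with a factor of 5^16 in the denominator, which is not a power of two). A minimal sketch:

```python
from decimal import Decimal

text = "3.5215257120407011"  # the value as written in the CSV

# Parsing through float() rounds to the nearest IEEE 754 double.
as_double = float(text)

# Decimal(float) exposes the exact value the double actually stores,
# while Decimal(str) preserves the written digits losslessly.
stored = Decimal(as_double)
exact = Decimal(text)

print(stored)  # the full decimal expansion of the nearest double
print(exact)   # 3.5215257120407011

# The double is close to, but not equal to, the written value.
assert stored != exact
assert str(exact) == text
```

This is why any DOUBLE-typed path (DuckDB's CSV sniffer, pandas' default float64) can only get within one unit in the last place of the written digits.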
I found that DuckDB can correctly interpret this number automatically as a DECIMAL, but that option doesn't appear to be available through the CSV reader yet (I'm planning to open an issue there). As it stands, this number is interpreted as a DOUBLE, which truncates the value.
I did some work to research how this number is interpreted in Python, NumPy, Pandas, DuckDB, and Parquet. Both Python (through `Decimal`) and NumPy (through `longdouble`) were able to interpret an extended version of the number but not the precise number itself (similar to how `cytominer-database` behaved above, adding decimal places). Pandas is able to accurately interpret the number through PyArrow types (specifically `decimal128`). DuckDB seems to follow the same decimal-style formatting as PyArrow and is able to infer the width + scale when reading the number alone. From an Arrow `decimal128` type, Parquet is able to write and read the number accurately (through, I believe, the decimal logical type).
Just a quick update to mention that I tried PyArrow arrays and its built-in CSV reader. The CSV reader turned out to have similar challenges (truncating the number by one decimal place). When attempting to create an array of `pa.decimal128(17, 16)` type with the number from a Python list, I saw an error (though I may be using it incorrectly). I updated the notebook and gist as a reference point.
See here for a Google Colab notebook with findings (and a gist backup).