Switch from using 32 bit to 64 bit numbers for tsid and value in the data table #54

intarga · 2025-01-07T14:55:39Z

I suggested 32-bit numbers to reduce row size in the data table and to speed up queries and transmission. We need to make sure we're OK committing to this before launch, though, as it will be annoying to change if we change our minds.

Particulars to consider:

- Can we be assured we won't lose data on conversion to 32-bit floats? Maybe it's worth round-tripping data from a kvalobs dump to sanity-check this.
- 32-bit integer TSIDs limit us to 2.1 million TSIDs; have a real good think about whether we have any risk of approaching that.
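
A minimal sketch of the round-trip check suggested above, assuming the dumped values are available as Python floats (64-bit) and that "no data loss" means a value is unchanged at the precision it is reported with. The function names and the `decimals` parameter are hypothetical, not part of any existing tooling.

```python
import struct


def to_f32(value: float) -> float:
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def survives_f32(value: float, decimals: int) -> bool:
    """True if the value is unchanged when rounded to `decimals` places."""
    return round(to_f32(value), decimals) == round(value, decimals)


def lossy_values(values, decimals):
    """Return the values that would change if stored as 32-bit floats."""
    return [v for v in values if not survives_f32(v, decimals)]
```

Running `lossy_values` over a dump would list the observations that do not survive the conversion. How many decimals to compare at is a judgment call that depends on the parameter being stored.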

Lun4m · 2025-01-08T12:08:30Z

I looked a bit into this after our discussion on how to store the QC provenance, and I found this.
Apparently, due to alignment padding, we are already wasting 4 bytes per row in the data table, and moving to all 8-byte data types would only add 4 more bytes.
For the provenance, I had suggested using bitstrings, but they have an overhead of 5-8 bytes. Bool columns might be more space-efficient in this case?
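
The padding arithmetic can be sketched as follows. This assumes a database that aligns each fixed-width column to its own size (as PostgreSQL does for these types on typical 64-bit platforms). The column names, their order, and the 8-byte timestamp column are assumptions for illustration, since the thread does not spell out the schema. Row headers and any padding at the end of the row are ignored.

```python
def padded_size(columns):
    """Bytes used by fixed-width columns stored in order.

    Each column is a (name, size, alignment) tuple and must start at an
    offset that is a multiple of its alignment.
    """
    offset = 0
    for _name, size, align in columns:
        offset += (-offset) % align  # padding inserted before this column
        offset += size
    return offset


# Hypothetical layouts: 32-bit tsid and value versus 64-bit tsid and value.
current = [("tsid", 4, 4), ("obstime", 8, 8), ("value", 4, 4)]
proposed = [("tsid", 8, 8), ("obstime", 8, 8), ("value", 8, 8)]

raw_current = sum(size for _name, size, _align in current)
wasted = padded_size(current) - raw_current
extra = padded_size(proposed) - padded_size(current)
```

Under these assumptions the 32-bit layout holds 16 bytes of data in 20 bytes, and the 64-bit layout takes 24 bytes with no padding, which matches the "4 wasted, 4 more" figures above.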

Regarding the number of time series, we are already at 500k, so it's probably worth moving to 64-bit IDs.

intarga · 2025-01-08T12:41:20Z

> Apparently, due to alignment padding, we are already wasting 4 bytes per row in the data table, and moving to all 8-byte data types would only add 4 more bytes.

Nice detective work! This makes a very compelling case for 64-bit on both.

> For the provenance, I had suggested using bitstrings, but they have an overhead of 5-8 bytes. Bool columns might be more space-efficient in this case?

I’m not sure what you mean here; we planned to put provenance in a separate table. Unless you mean the end-user flags, in which case yes, I think bool columns are the way to go.

> Regarding the number of time series, we are already at 500k

Just noticed I made a typo in my original comment: it should be billion, not million. Regardless, it’s a moot point in light of your findings above.

Lun4m · 2025-01-08T13:47:41Z

> I’m not sure what you mean here; we planned to put provenance in a separate table. Unless you mean the end-user flags, in which case yes, I think bool columns are the way to go.

Yeah, sorry, I am mixing stuff up. I was already thinking about the provenance table, but we should probably discuss it in another issue.

intarga added the question label Jan 7, 2025

intarga added this to the Beta Release milestone Jan 7, 2025

intarga assigned Lun4m Jan 7, 2025

intarga changed the title ~~Revisit decision of 32 bit numbers vs 64 bit for TSIDs and values~~ Switch from using 32 bit to 64 bit numbers for tsid and value in the data table Jan 9, 2025

intarga added the enhancement label and removed the question label Jan 9, 2025
