Switch from using 32 bit to 64 bit numbers for tsid and value in the data table #54

Open
intarga opened this issue Jan 7, 2025 · 3 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@intarga
Member

intarga commented Jan 7, 2025

I suggested 32-bit numbers to reduce row size in the data table and to speed up queries and transmission. We need to make sure we're OK committing to this before launch, though, as it will be annoying to change later if we change our minds.

Particulars to consider:

  • Can we be assured we won't lose data on conversion to 32-bit floats? Maybe it's worth round-tripping data from a kvalobs dump to sanity-check this.
  • 32-bit integer TSIDs limit us to 2.1 million TSIDs. Have a real good think about whether we have any risk of approaching that.
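
A round-trip check of the kind suggested in the first bullet could look roughly like the sketch below. It uses only Python's standard `struct` module. The sample values, the `decimals` tolerance, and both function names are hypothetical; a real check would read values from a kvalobs dump, whose format is not described in this thread.

```python
import struct


def roundtrip_f32(x: float) -> float:
    """Pack a 64-bit Python float into 32 bits and unpack it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def survives_f32(x: float, decimals: int = 1) -> bool:
    """True if x is unchanged by the round trip at the given number of decimals.

    Assumption: observations are only meaningful to a fixed number of
    decimals, so exact bit equality is stricter than needed.
    """
    return round(roundtrip_f32(x), decimals) == round(x, decimals)


# Hypothetical sample values, not taken from any real dump.
samples = [-12.3, 0.0, 0.1, 1013.25, 16777217.0]

for value in samples:
    exact = roundtrip_f32(value) == value
    print(f"{value!r}: exact={exact}, ok_at_1_decimal={survives_f32(value)}")
```

Values such as 0.1 change in their last bits but survive at one decimal, while integers above 2^24 (for example 16777217) can no longer be represented exactly in 32 bits.
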
@intarga added the "question" (Further information is requested) label on Jan 7, 2025
@intarga added this to the Beta Release milestone on Jan 7, 2025
@Lun4m
Collaborator

Lun4m commented Jan 8, 2025

I looked a bit into this after our discussion on how to store the QC provenance, and I found this.
Apparently, due to alignment padding, we are already wasting 4 bytes per row in the data table, and moving to all 8-byte data types would only add 4 more bytes.
For the provenance, I had suggested using bitstrings, but they have an overhead of 5-8 bytes. Bool columns might be more space-efficient in this case?
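
The padding effect can be sketched with a small calculation. This is only an illustration under stated assumptions: a PostgreSQL-style rule where each column starts at a multiple of its alignment and the row data is padded to 8 bytes, and a hypothetical column order of (tsid, obstime, value) with an 8-byte timestamp. The actual schema of the data table is not given in this thread, and the fixed row header is ignored.

```python
def layout(columns, max_align=8):
    """Return (data_bytes, padding_bytes, padded_total) for a row.

    columns: list of (name, size, alignment) tuples in storage order.
    data_bytes includes padding between columns; padded_total also
    rounds the row up to a multiple of max_align.
    """
    offset = 0
    padding = 0
    for _name, size, align in columns:
        pad = -offset % align  # bytes needed to reach the next aligned offset
        padding += pad
        offset += pad + size
    tail = -offset % max_align  # trailing padding up to max_align
    return offset, padding, offset + tail


# Hypothetical layouts; column names and order are assumptions.
NARROW = [("tsid", 4, 4), ("obstime", 8, 8), ("value", 4, 4)]
WIDE = [("tsid", 8, 8), ("obstime", 8, 8), ("value", 8, 8)]

print("32-bit tsid/value:", layout(NARROW))
print("64-bit tsid/value:", layout(WIDE))
```

Under these assumptions the 32-bit layout already carries 4 bytes of padding after the tsid, and the 64-bit layout needs 4 more bytes of column data. Once trailing padding is counted, both rows occupy the same 24 bytes.
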

Regarding the number of time series, we are already at 500k, so it's probably worth moving to 64-bit IDs.

@intarga
Member Author

intarga commented Jan 8, 2025

> Apparently, due to alignment padding, we are already wasting 4 bytes per row in the data table, and moving to all 8-byte data types would only add 4 more bytes.

Nice detective work! This makes a very compelling case for 64 bit on both.

> For the provenance, I had suggested using bitstrings, but they have an overhead of 5-8 bytes. Bool columns might be more space-efficient in this case?

I’m not sure what you mean here; we planned to put provenance in a separate table. Unless you mean the end-user flags, in which case yes, I think bool columns are the way to go.
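
For a rough sense of when bool columns beat a bitstring, here is a minimal sketch. It assumes one byte per bool column and a bitstring cost of a fixed overhead plus one byte per group of 8 bits, using the lower end (5 bytes) of the overhead mentioned above. The number of flags is not stated in this thread, alignment padding is ignored, and both function names are hypothetical.

```python
def bool_columns_bytes(n_flags: int) -> int:
    """Storage for n_flags separate bool columns, assuming 1 byte each."""
    return n_flags


def bitstring_bytes(n_flags: int, overhead: int = 5) -> int:
    """Storage for one bitstring holding n_flags bits.

    Assumption: fixed overhead (5 bytes by default) plus one byte
    per started group of 8 bits.
    """
    return overhead + (n_flags + 7) // 8


for n in (2, 4, 6, 8, 16):
    print(n, bool_columns_bytes(n), bitstring_bytes(n))
```

With these assumptions, bool columns are smaller for up to five flags, the two tie at six, and the bitstring wins beyond that.
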

> Regarding the number of time series, we are already at 500k

Just noticed I made a typo in my original comment: it should be 2.1 billion, not million. Regardless, it’s a moot point in light of your findings above.
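
The arithmetic behind the corrected figure can be checked quickly. The sketch assumes signed integer ID columns, which is what a limit of about 2.1 billion implies, and takes the 500k count from the comment above; the names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer
INT64_MAX = 2**63 - 1  # largest signed 64-bit integer

CURRENT_TSIDS = 500_000  # figure quoted in the thread


def headroom(current: int, limit: int) -> float:
    """How many times the current count fits into the limit."""
    return limit / current


print(f"32-bit limit: {INT32_MAX:,} ({headroom(CURRENT_TSIDS, INT32_MAX):,.0f}x current)")
print(f"64-bit limit: {INT64_MAX:,}")
```
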

@Lun4m
Collaborator

Lun4m commented Jan 8, 2025

> I’m not sure what you mean here; we planned to put provenance in a separate table. Unless you mean the end-user flags, in which case yes, I think bool columns are the way to go.

Yeah, sorry, I am mixing stuff up. I was already thinking about the provenance table, but we should probably discuss it in another issue.

@intarga changed the title from "Revisit decision of 32 bit numbers vs 64 bit for TSIDs and values" to "Switch from using 32 bit to 64 bit numbers for tsid and value in the data table" on Jan 9, 2025
@intarga added the "enhancement" (New feature or request) label and removed the "question" (Further information is requested) label on Jan 9, 2025