Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change data feed's log replay should check deletion vectors is enabled before reading them #574

Open
OussamaSaoudi-db opened this issue Dec 6, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@OussamaSaoudi-db
Copy link
Collaborator

Please describe why this is necessary.

Presently, CDF collects deletion vectors from add and remove actions unconditionally. However, the protocol states that deletion vectors may only be read if the deletion vector reader feature is enabled.

Add a check that the deletion vector reader feature is enabled before generating selection vectors for TableChangesScanData

Describe the functionality you are proposing.

These are the desired semantics:

  • if the reader feature is supported but not enabled through the table property, the snapshot could still have DV that were written while the feature was enabled, and it would be incorrect to ignore them.
  • if the reader feature is supported but not enabled through the table property, the CDF changes for that commit must not include any DV for add actions. But they could include DV for remove actions (as in, the commit purged some existing DV). The log replay phase only operates on remove deletion vectors, so this should not be a problem
  • if not even supported, the snapshot (and the CDF changes for the commit) should not contain any traces of DV.

If deletion vectors are found, but not supported, determine if it would be better to ignore them, or to throw an error.

Additional context

No response

@OussamaSaoudi-db OussamaSaoudi-db added the enhancement New feature or request label Dec 6, 2024
@OussamaSaoudi-db
Copy link
Collaborator Author

@scovich I was talking to @zachschuermann, and I'm not convinced we need to check the protocol at each commit in the CDF range before using deletion vectors. One minor reason is that delta-spark doesn't check the protocol when iterating over actions. The main reason I don't think we need to check the protocol at each step of CDF is that Snapshots also fail to check Protocols for the entire snapshot range.

Consider a Snapshot at version n. We perform a P&M query to get the protocol. Let's say the protocol was set at version m where 0 < m <= n. Let's say you are processing an Add action with deletion vectors that was committed at version i where i < m . The snapshot has no way of knowing that deletion vector reads are enabled at version i. After all, the only protocol we know of is at m, m > i. A checkpoint may have compacted the old protocol away.

Despite this, our behaviour when doing snapshot scans is to use deletion vectors if they are present. Hence, the delta protocol has an implicit trust that the writer is doing the right thing: If there are deletion vectors, you may read them. Given that, I think that CDF can also happily read deletion vectors.

I think this also extends to the use of ICT for timestamps.

I may be missing something though, so let me know what you think :)

@scovich
Copy link
Collaborator

scovich commented Dec 10, 2024

I think the missing detail is: If DV are supported in the snapshot's latest protocol, then any commit involved in log replay could have DV even if not currently enabled by the table property. If DV are not supported by the snapshot's latest protocol, then the log replay can safely assume it will not encounter any physical traces of past DV. In particular, DV support cannot be removed as long as the table history contains any DV.

If I remember correctly Delta-spark has two different log replay paths, one that uses DV and one that does not, depending on the protocol check. Even if it doesn't, that would be a bug to fix -- a garbage DV would corrupt the table and we must not tolerate it, neither in regular scans nor in CDF: The end table version decides whether DV is supported, and if not supported then log replay must either ignore DV column or error out if it sees a value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants