You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Presently, CDF collects deletion vectors from add and remove actions unconditionally. However, the protocol states that deletion vectors may only be read if the deletion vector reader feature is enabled.
Add a check that the deletion vector reader feature is enabled before generating selection vectors for TableChangesScanData
Describe the functionality you are proposing.
These are the desired semantics:
if the reader feature is supported but not enabled through the table property, the snapshot could still have DV that were written while the feature was enabled, and it would be incorrect to ignore them.
if the reader feature is supported but not enabled through the table property, the CDF changes for that commit must not include any DV for add actions. But they could include DV for remove actions (as in, the commit purged some existing DV). The log replay phase only operates on remove deletion vectors, so this should not be a problem
if not even supported, the snapshot (and the CDF changes for the commit) should not contain any traces of DV.
If deletion vectors are found, but not supported, determine if it would be better to ignore them, or to throw an error.
Additional context
No response
The text was updated successfully, but these errors were encountered:
@scovich I was talking to @zachschuermann, and I'm not convinced we need to check the protocol at each commit in the CDF range before using deletion vectors. One minor reason is that delta-spark doesn't check the protocol when iterating over actions. The main reason I don't think we need to check the protocol at each step of CDF is that Snapshots also fail to check Protocols for the entire snapshot range.
Consider a Snapshot at version n. We perform a P&M query to get the protocol. Let's say the protocol was set at version m where 0 < m <= n. Let's say you are processing an Add action with deletion vectors that was committed at version i where i < m . The snapshot has no way of knowing that deletion vector reads are enabled at version i. After all, the only protocol we know of is at m, m > i. A checkpoint may have compacted the old protocol away.
Despite this, our behaviour when doing snapshot scans is to use deletion vectors if they are present. Hence, the delta protocol has an implicit trust that the writer is doing the right thing: If there are deletion vectors, you may read them. Given that, I think that CDF can also happily read deletion vectors.
I think this also extends to the use of ICT for timestamps.
I may be missing something though, so let me know what you think :)
I think the missing detail is: If DV are supported in the snapshot's latest protocol, then any commit involved in log replay could have DV even if not currently enabled by the table property. If DV are not supported by the snapshot's latest protocol, then the log replay can safely assume it will not encounter any physical traces of past DV. In particular, DV support cannot be removed as long as the table history contains any DV.
If I remember correctly Delta-spark has two different log replay paths, one that uses DV and one that does not, depending on the protocol check. Even if it doesn't, that would be a bug to fix -- a garbage DV would corrupt the table and we must not tolerate it, neither in regular scans nor in CDF: The end table version decides whether DV is supported, and if not supported then log replay must either ignore DV column or error out if it sees a value.
Please describe why this is necessary.
Presently, CDF collects deletion vectors from add and remove actions unconditionally. However, the protocol states that deletion vectors may only be read if the deletion vector reader feature is enabled.
Add a check that the deletion vector reader feature is enabled before generating selection vectors for
TableChangesScanData
Describe the functionality you are proposing.
These are the desired semantics:
If deletion vectors are found, but not supported, determine if it would be better to ignore them, or to throw an error.
Additional context
No response
The text was updated successfully, but these errors were encountered: