
test: Port cdf tests from delta-spark to kernel #611

Open · wants to merge 10 commits into main
Conversation

@OussamaSaoudi-db (Collaborator) commented Dec 20, 2024

What changes are proposed in this pull request?

This PR adds several CDF tests from delta-spark. We check the following:

  • CDF over various version ranges
  • Update operations are read correctly from cdc files
  • Actions with data_change=false are skipped
  • A range with start > end is an error
  • A start version greater than the latest table version is an error
  • CDF works on partitioned tables
  • CDF works on tables with backticks in column names
  • CDF is correct for deletions: unconditional deletes, conditional deletes that remove all rows, and selective conditional deletes

Table-changes construction is also changed so that the CDF version range is validated before any snapshots are created. This makes the error message clearer when the start version is beyond the end of the table.
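The validation described above can be sketched roughly as follows. This is a minimal illustration, not the kernel's actual API: `check_cdf_range` and `CdfError` are hypothetical names, assuming the only inputs are the requested start/end versions and the latest table version.

```rust
// Hedged sketch: validate the requested CDF version range up front,
// before building any snapshot, so the error names the real problem.
// `CdfError` and `check_cdf_range` are illustrative, not kernel API.
#[derive(Debug, PartialEq)]
enum CdfError {
    StartAfterEnd { start: u64, end: u64 },
    StartBeyondLatest { start: u64, latest: u64 },
}

fn check_cdf_range(start: u64, end: u64, latest_version: u64) -> Result<(), CdfError> {
    if start > end {
        return Err(CdfError::StartAfterEnd { start, end });
    }
    if start > latest_version {
        return Err(CdfError::StartBeyondLatest { start, latest: latest_version });
    }
    Ok(())
}
```

Checking the range first means a caller asking for versions 6..=7 of a 5-version table gets "start beyond latest" instead of a snapshot-construction failure.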

@OussamaSaoudi-db OussamaSaoudi-db changed the title Cdf delta spark tests test: Cdf delta spark tests Dec 20, 2024

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.51%. Comparing base (cbb52a4) to head (c8a01ac).

Files with missing lines        | Patch %  | Lines
kernel/src/table_changes/mod.rs | 85.71%   | 0 missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #611   +/-   ##
=======================================
  Coverage   83.50%   83.51%           
=======================================
  Files          74       74           
  Lines       16919    16920    +1     
  Branches    16919    16920    +1     
=======================================
+ Hits        14128    14130    +2     
  Misses       2133     2133           
+ Partials      658      657    -1     


@OussamaSaoudi-db OussamaSaoudi-db changed the title test: Cdf delta spark tests test: Port cdf tests from delta-spark to kernel Dec 20, 2024
@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review December 20, 2024 00:47
@nicklan (Collaborator) left a comment


looks great, thanks! One small thing: could you recreate the archives with --no-xattrs passed to tar? Otherwise, if you look at these zstd files on a non-macOS machine you get lots of:
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'

tar -c --no-xattrs [etc] should do the trick
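For concreteness, a hypothetical end-to-end re-creation of one archive (the directory name and layout here are illustrative stand-ins, not the PR's actual test data; `--zstd` assumes GNU tar 1.31+ or bsdtar):

```shell
# Build a throwaway table directory, then archive it without macOS
# extended attributes so non-macOS tar doesn't warn about
# LIBARCHIVE.xattr.* headers. Paths are illustrative only.
mkdir -p cdf-table-simple/_delta_log
echo '{}' > cdf-table-simple/_delta_log/00000000000000000000.json
tar -c --no-xattrs --zstd -f cdf-table-simple.tar.zst cdf-table-simple
tar -t --zstd -f cdf-table-simple.tar.zst
```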

#[test]
fn simple_cdf_version_ranges() -> DeltaResult<()> {
let batches = read_cdf_for_table("cdf-table-simple", 0, 0, None)?;
let mut expected = vec![
Collaborator:

I tend to prefer putting the expected output along with the data (like we do with dat) rather than spreading it out like this. I think this is okay for now though, and maybe we can make an issue to port each test by recreating the archive with an expected data parquet.

Collaborator Author:

done: #626

Collaborator:

this reminds me: we should make an issue (or perhaps we already have one?) to fix all the tests and move away from string matching and instead compare underlying (sorted) data/metadata

cc @nicklan
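One way that comparison could look, as a rough sketch: sort the rows on both sides before asserting, so ordering never matters. The helper name is hypothetical, and `Vec<String>` stands in for rows rendered from record batches.

```rust
// Hedged sketch: compare CDF results as sorted row sets instead of
// matching a pretty-printed table string verbatim, so row order
// (and formatting) can't cause spurious failures.
fn assert_rows_equal_unordered(mut actual: Vec<String>, mut expected: Vec<String>) {
    actual.sort();
    expected.sort();
    assert_eq!(actual, expected, "row sets differ");
}
```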

@OussamaSaoudi-db (Collaborator Author):
@nicklan rebuilt with --no-xattrs. It seems to have caused a diff on GitHub, but I'm still seeing com.apple.provenance when calling xattr. Does it still cause the error?

@zachschuermann (Collaborator) left a comment


did all the expected output come from delta-spark? if yes, then LGTM


Comment on lines +332 to +334
// Note: `update_pre` and `update_post` are technically not part of the delta spec, and instead
// should be `update_preimage` and `update_postimage` respectively. However, the tests in
// delta-spark use the post and pre.
Collaborator:

how are we observing update_pre and update_post here then? aren't we reading the CDF and then filling in our own update_preimage etc.?

Collaborator Author:

All update _change_types come directly from a cdc file. We don't insert or modify them. In this test, delta-spark wrote update_pre and update_post directly into the cdc file.
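For reference, a small sketch of the spec names versus the values this test data carries. The constants and helper are illustrative, not kernel API:

```rust
// Per the Delta protocol, update rows carry `update_preimage` /
// `update_postimage`; the delta-spark-generated cdc files in these
// tests carry `update_pre` / `update_post` instead.
const SPEC_UPDATE_TYPES: [&str; 2] = ["update_preimage", "update_postimage"];
const TEST_DATA_UPDATE_TYPES: [&str; 2] = ["update_pre", "update_post"];

fn is_spec_update_type(change_type: &str) -> bool {
    SPEC_UPDATE_TYPES.contains(&change_type)
}
```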

@OussamaSaoudi-db (Collaborator Author):

@zachschuermann regarding creating an issue, I already put one up for CDF here #626, but I can expand its scope to remove all string matching from our testing.

@OussamaSaoudi-db (Collaborator Author):

@zachschuermann My methodology was as follows: I took their expected results, which looked like the sample below, constructed the expected tables by hand, and only then verified them against our implementation.

        checkCDCAnswer(
          log,
          CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'insert'"),
          Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) })
        checkCDCAnswer(
          log,
          CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'delete'"),
          Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", if (i % 2 == 0) 2 else 1) })
        // ...
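A rough Rust rendering of the two expectations above, with plain tuples standing in for Spark Rows. This is illustration only, not the ported test code, and the column meanings (id, value, partition, _change_type, _commit_version) are inferred from the Scala:

```rust
// Mirror of delta-spark's
//   Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) }
fn expected_inserts() -> Vec<(i64, &'static str, i64, &'static str, i64)> {
    (0..6).map(|i| (i, "old", i % 2, "insert", 0)).collect()
}

// Mirror of the delete expectation: even ids were deleted at version 2,
// odd ids at version 1 (per the `if (i % 2 == 0) 2 else 1` above).
fn expected_deletes() -> Vec<(i64, &'static str, i64, &'static str, i64)> {
    [0, 2, 3, 4]
        .into_iter()
        .map(|i| (i, "old", i % 2, "delete", if i % 2 == 0 { 2 } else { 1 }))
        .collect()
}
```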
