test: Port cdf tests from delta-spark to kernel #611
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #611   +/-   ##
=======================================
  Coverage   83.50%   83.51%
=======================================
  Files          74       74
  Lines       16919    16920     +1
  Branches    16919    16920     +1
=======================================
+ Hits        14128    14130     +2
  Misses       2133     2133
+ Partials      658      657     -1
```

☔ View full report in Codecov by Sentry.
looks great, thanks! One small thing: could you recreate the archives with `--no-xattrs` passed to tar? Otherwise, if you look at these zstd files on a non-macOS machine, you get lots of:

`tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.provenance'`

`tar -c --no-xattrs [etc]` should do the trick.
```rust
#[test]
fn simple_cdf_version_ranges() -> DeltaResult<()> {
    let batches = read_cdf_for_table("cdf-table-simple", 0, 0, None)?;
    let mut expected = vec![
```
I tend to prefer putting the expected output along with the data (like we do with `dat`) rather than spreading it out like this. I think this is okay for now, though, and maybe we can make an issue to port each test by recreating the archive with an expected-data parquet.
done: #626
this reminds me: we should make an issue (or perhaps we already have one?) to fix all the tests so they move away from string matching and instead compare the underlying (sorted) data/metadata.

cc @nicklan
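The idea suggested above, comparing sorted data rather than matching on pretty-printed strings, could be sketched roughly as follows. This is a hypothetical helper, not kernel code: the real tests would compare Arrow record batches, but plain string rows are enough to illustrate the approach.

```rust
// Hypothetical sketch: instead of asserting on a formatted table string,
// normalize both sides into rows, sort them, and compare. Row ordering
// then no longer matters, and failures point at actual data differences.
fn assert_rows_eq_unordered(mut actual: Vec<Vec<String>>, mut expected: Vec<Vec<String>>) {
    actual.sort();
    expected.sort();
    assert_eq!(actual, expected, "data mismatch after sorting");
}

fn main() {
    // The same rows in a different order compare equal once sorted.
    let actual = vec![
        vec!["2".to_string(), "insert".to_string()],
        vec!["1".to_string(), "delete".to_string()],
    ];
    let expected = vec![
        vec!["1".to_string(), "delete".to_string()],
        vec!["2".to_string(), "insert".to_string()],
    ];
    assert_rows_eq_unordered(actual, expected);
    println!("rows match");
}
```

The point of the design is that scan ordering (which is often nondeterministic) stops being part of the test contract.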
@nicklan rebuilt with `--no-xattrs`. Seems to have caused a diff on github, but I'm still seeing
did all the expected output come from delta-spark? if yes, then LGTM
```rust
// Note: `update_pre` and `update_post` are technically not part of the delta spec, and instead
// should be `update_preimage` and `update_postimage` respectively. However, the tests in
// delta-spark use the post and pre.
```
how are we observing `update_pre` and `update_post` here then? aren't we reading the CDF and then filling in our own `update_preimage` etc.?
All update `_change_type`s come directly from a cdc file. We don't insert or modify them. In this test, delta-spark wrote `update_pre` and `update_post` directly into the cdc file.
@zachschuermann regarding creating an issue, I already put one up for CDF here: #626, but I can expand its scope to remove all string matching from our testing.
@zachschuermann My methodology is as follows: I used their expected results, which looked like this (sample below), constructed the tables by hand, and only then verified them against our implementation.

```scala
checkCDCAnswer(
  log,
  CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'insert'"),
  Range(0, 6).map { i => Row(i, "old", i % 2, "insert", 0) })
checkCDCAnswer(
  log,
  CDCReader.changesToBatchDF(log, 0, 2, spark).filter("_change_type = 'delete'"),
  Seq(0, 2, 3, 4).map { i => Row(i, "old", i % 2, "delete", if (i % 2 == 0) 2 else 1) })
// ...
```
What changes are proposed in this pull request?
This PR adds several CDF tests from delta-spark. We check the following:
Table-changes construction is also changed so that the CDF version range is validated before snapshots are created. This makes the error message clearer in the case that the start version is beyond the end of the table.
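The reordered check described above could look roughly like this. This is a minimal sketch with a hypothetical function name and a plain `String` error; the actual kernel code uses its own error types and version handling.

```rust
// Hypothetical sketch: validate the requested CDF version range up front,
// before constructing any snapshots, so an out-of-range start version
// fails fast with a clear message instead of a confusing snapshot error.
fn check_cdf_version_range(
    start: u64,
    end: Option<u64>,
    latest_version: u64,
) -> Result<(), String> {
    if start > latest_version {
        return Err(format!(
            "start version {start} is beyond the table's latest version {latest_version}"
        ));
    }
    if let Some(end) = end {
        if end < start {
            return Err(format!("end version {end} precedes start version {start}"));
        }
    }
    Ok(())
}

fn main() {
    // A valid range passes; a start past the table's end fails early.
    assert!(check_cdf_version_range(0, Some(2), 5).is_ok());
    assert!(check_cdf_version_range(7, None, 5).is_err());
    println!("version range checks behave as expected");
}
```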