feat: Add check for schema read compatibility #554

Open
wants to merge 11 commits into base: main
Conversation

@OussamaSaoudi-db (Collaborator) commented Nov 29, 2024

What changes are proposed in this pull request?

This PR introduces a function schema.can_read_as(read_schema) for Schema. It checks whether data written with schema schema can be read using read_schema. This check is useful for implementing schema evolution checks in CDF.

Closes #523

How was this change tested?

Schema compatibility tests are added that check the following (a sketch of one such check appears after this list):

  • can_read_as is reflexive
  • adding a nullable column to both the key and the value of a map succeeds
  • changing a map value from nullable to non-nullable fails
  • the same schema with field names differing only in case fails
  • changing a column type from long to integer fails
  • changing nullability from false to true succeeds
  • changing nullability from true to false fails
  • adding a nullable column succeeds
  • adding a non-nullable column fails
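
As an illustration of the API shape described above, here is a minimal sketch of one such check. It assumes the kernel's usual schema constructors (StructType::new, StructField::new, DataType::LONG) and the DeltaResult-returning signature discussed later in this review; the PR's actual tests may differ.

use crate::schema::{DataType, StructField, StructType};

// Existing schema has a non-nullable column; the read schema makes it nullable.
let existing = StructType::new(vec![StructField::new("id", DataType::LONG, false)]);
let read = StructType::new(vec![StructField::new("id", DataType::LONG, true)]);
// Widening nullability (false -> true) is allowed...
assert!(existing.can_read_as(&read).is_ok());
// ...while tightening it (true -> false) is rejected.
assert!(read.can_read_as(&existing).is_err());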

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 96.80000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 83.66%. Comparing base (cbb52a4) to head (3d852b0).

Files with missing lines Patch % Lines
kernel/src/schema_compat.rs 96.80% 4 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #554      +/-   ##
==========================================
+ Coverage   83.50%   83.66%   +0.16%     
==========================================
  Files          74       75       +1     
  Lines       16919    17169     +250     
  Branches    16919    17169     +250     
==========================================
+ Hits        14128    14365     +237     
- Misses       2133     2141       +8     
- Partials      658      663       +5     


);
name_equal && nullability_equal && data_type_equal
}
None => read_field.is_nullable(),
Collaborator Author

The None case is a point where I differ from the delta implementation; I'm not convinced by the code there. There, if a field in the read schema isn't found in the existing schema, it is simply ignored. I think this should only pass if the new field in the read schema is nullable.

I may be missing something tho 🤔

Collaborator

Nullability is... complicated. But I think what you say makes sense -- technically it could be ok for the read field to not be nullable, if the parent is nullable and the parent is null for all rows where the child is null. But if the parent is hard-wired null then we shouldn't be recursing to its children in the first place.

// == read_nullable || !existing_nullable
read_nullable || !existing_nullable
}
fn is_struct_read_compatible(existing: &StructType, read_type: &StructType) -> bool {
Collaborator Author

I was considering using a DeltaResult instead of a bool so we can return better errors about how a schema differs. Thoughts?

Collaborator

I think that makes sense. Something similar to ValidateColumnMappings in #543, which returns an Err with the offending column name path?
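
For illustration, a tiny sketch of that shape, built from the require! and Error::generic helpers that appear later in this diff (the helper name and message here are hypothetical, not the PR's code):

// Hypothetical helper: fail with the offending column path instead of returning false.
fn incompatible_at(path: &str, compatible: bool) -> DeltaResult<()> {
    require!(
        compatible,
        Error::generic(format!("read schema is incompatible at column '{path}'"))
    );
    Ok(())
}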

use crate::schema::{DataType, Schema, StructField, StructType};

fn is_nullability_compatible(existing_nullable: bool, read_nullable: bool) -> bool {
// The case to avoid is when the read_schema is non-nullable and the existing one is nullable.
Collaborator

"avoid" as in "it's illegal to attempt reading a nullable underlying as non-nullable"? (maybe just say that?)

Collaborator

Also, this method takes two args of the same type, but it is not commutative. Subtly error-prone, and I don't know the best way to make it safe? The arg names are a good start, but rust doesn't allow named args at call sites. And the name of the function does not give any indication of the correct arg order.

Is it worth using a struct just to force named args? Seems clunky. Or maybe we can choose an asymmetric function name of some kind, that indicates which arg comes first?

(whatever solution we choose, we should probably apply it to the is_struct_read_compatible as well)

Collaborator

Another possibility: Add these as methods on StructType itself? Then callers would be encouraged to do things like:

table_schema.can_read_as(read_schema)

... but I don't know a good way to do that for the nullability compat check since it's a plain boolean and doesn't always apply to a struct field (can also be array element or map value).

We could define a helper trait for struct/map/array, but that just pushes the problem to the trait impl (and there is only one call site for each type right now).

Collaborator Author

Yeah I think the nullability check could benefit from using a struct. Pushing the impl to a trait would just have the same check in all of them anyway.

Comment on lines 19 to 32
let existing_names: HashSet<String> = existing
.fields()
.map(|field| field.name().clone())
.collect();
let read_names: HashSet<String> = read_type
.fields()
.map(|field| field.name().clone())
.collect();
if !existing_names.is_subset(&read_names) {
return false;
}
read_type
.fields()
.all(|read_field| match existing_fields.get(read_field.name()) {
Collaborator

It seems like we should only need to materialize a hash set for one side (build), and just stream the other side's fields past it (probe)?

Also: kernel's StructType::fields member is already an IndexMap so you should have O(1) name lookups without building any additional hash sets.
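
For illustration, a sketch of the build/probe idea using only the fields() and name() accessors that appear in the hunk above (this is not the PR's actual code):

use std::collections::HashSet;

use crate::schema::StructType;

// Build a set from one side only, then stream the other side's names past it.
fn existing_names_are_subset(existing: &StructType, read_type: &StructType) -> bool {
    let read_names: HashSet<&str> = read_type.fields().map(|f| f.name().as_str()).collect();
    existing.fields().all(|f| read_names.contains(f.name().as_str()))
}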

Collaborator Author

One thing that was previously missing in this code is that delta-spark checks the fields modulo case sensitivity. AFAICT the reasons it does this are:

  1. It wants to ensure that a schema has no duplicate fields whose names differ only in case.
  2. delta-spark typically ignores fields that are in the read schema but not the current one, as I pointed out above. However, it doesn't want fields that differ only in case to be treated as new struct fields, so it uses a case-insensitive map (a standalone sketch of that idea follows).
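
A standalone sketch of that case-insensitive map (a hypothetical helper, not the PR's code):

use std::collections::HashMap;

// Key the field names by their lowercased form. If two names collapse to the
// same key, the schema contains names that differ only by case, which Delta
// does not allow; otherwise the map gives case-insensitive lookups.
fn case_insensitive_names<'a>(
    names: impl ExactSizeIterator<Item = &'a str>,
) -> Option<HashMap<String, &'a str>> {
    let expected = names.len();
    let map: HashMap<String, &'a str> = names.map(|n| (n.to_lowercase(), n)).collect();
    (map.len() == expected).then_some(map)
}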

Collaborator Author

The two things I'm most unsure about currently are

  1. How the case sensitive cases are handled
  2. The case where the field is in the read schema but not the existing schema

Would appreciate a second pair of eyes from @zachschuermann and @nicklan as well.

@OussamaSaoudi-db changed the title from "Implement schema compatibility check" to "feat: Add check for schema read compatibility" on Jan 7, 2025
@OussamaSaoudi-db marked this pull request as ready for review on Jan 7, 2025 02:27
@OussamaSaoudi-db (Collaborator Author) commented Jan 7, 2025

TODO: Add doc comments. I think I want those uncertainties cleared before spending time on docs

@nicklan (Collaborator) left a comment

flushing first pass comments. will review more soon

use crate::utils::require;
use crate::{DeltaResult, Error};

struct NullabilityCheck {
Collaborator

nit: doc comment

}
}

impl StructField {
Collaborator

I need to think more about doing this as multiple impl blocks. It keeps the code nice and separate, but it does make it more complex to find where various bits of code are. At the very least, can you not import StructField, StructType, or DataType, and instead do impl crate::schema::StructField so it's clear that it's on that struct?

Collaborator Author

I can do that. In terms of the motivation for separate impl blocks, it goes back to Ryan's suggestion to make it clear what order you're passing in arguments to the compatibility check.

See this

Collaborator Author

Also I removed the import and switched to using Self. lmk if that looks good.

nullable: bool,
read_nullable: bool,
}

Collaborator

Is there a benefit to making this a struct and then having an impl? It seems like it'd be easier/cleaner as just a two-arg method, from the usage I've seen so far.

@OussamaSaoudi-db (Collaborator Author) commented Jan 8, 2025

This is related to what @scovich was saying here. A nullability check function would not be commutative, and this could easily cause subtle errors with a swap. We can't write impls for a primitive type like bool, so bool.can_read_as(other) is off the table (and perhaps isn't a good idea anyway).

I could alternatively do something like this:

// Option A 
struct NullabilityFlag(bool);
struct ReadNullabilityFlag(bool);
fn is_compatible(a: NullabilityFlag, b: ReadNullabilityFlag) {
    let NullabilityFlag(nullable) = a;
    let ReadNullabilityFlag(read_nullable) = b;
    ...
}
is_compatible(NullabilityFlag(self.nullable), ReadNullabilityFlag(read_field.nullable));
// Option B: 
NullabilityFlag(self.nullable).can_read_as(NullabilityFlag(read_field.nullable)) // no need for read version

Looking at these laid out, I think Option B is the winner. What do you think?

Collaborator Author

Update: switched to option B for now
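
For reference, a minimal sketch of what option B might look like (names are taken from the snippet above; the actual PR code may differ):

// Wrapping the bare bool means the existing and read sides can't be swapped
// silently at a call site.
struct NullabilityFlag(bool);

impl NullabilityFlag {
    /// Data written with self's nullability can be read with read's
    /// nullability unless self is nullable and read is not.
    fn can_read_as(&self, read: NullabilityFlag) -> bool {
        // same predicate as earlier in the diff: read_nullable || !existing_nullable
        read.0 || !self.0
    }
}

// usage: NullabilityFlag(self.nullable).can_read_as(NullabilityFlag(read_field.nullable))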

@nicklan (Collaborator) left a comment

few more comments.

it's a shame we can't use EnsureDataTypes for this. That expects to compare between kernel and arrow types which doesn't work here. We might want to consider if it's worth converting the arrow types into kernel types before doing that check and then re-writing that check in terms of only kernel types. It'd be a little less efficient, but we wouldn't have two somewhat tricky ways of checking compatibility.

//! compatibility is [`can_read_as`].
//!
//! # Examples
//! ```rust, ignore
Collaborator

why ignore?

Collaborator Author

This is pub(crate) and doc tests only work with pub functions.

Collaborator

Doc tests compile with the doctest feature enabled, so we could potentially make the function pub in that case.

We could also invoke doctests with developer-visibility feature enabled, to pick up that set of functions as well.

Collaborator Author

Looks like dev-visibility worked!
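
For context, one way such a feature flag can widen visibility so the doc example compiles (an illustrative sketch only; the mechanism the crate actually uses may differ):

// In lib.rs: expose the module publicly when the developer-visibility feature
// is enabled, and keep it crate-private otherwise.
#[cfg(feature = "developer-visibility")]
pub mod schema_compat;
#[cfg(not(feature = "developer-visibility"))]
pub(crate) mod schema_compat;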

.collect();
require!(
field_map.len() == self.fields.len(),
Error::generic("Delta tables don't allow field names that only differ by case")
Collaborator

This feels like it would be a bug in the schema of the table. Should we not catch this case higher up when trying to construct the schema?

Collaborator Author

Agreed that this is a weird place to put it. I was keeping parity with delta-spark. Perhaps instead we should bake this into a StructType::try_from_string. What do you think?

Collaborator

I like clearer and earlier checks, when possible. The fewer places we have to worry about unvalidated objects the better. Spark is definitely not a model to emulate in that regard.

Collaborator Author

Made an issue to track this here.

);

// Check that the field names are a subset of the read fields.
if !field_map.keys().all(|name| read_field_names.contains(name)) {
Collaborator

In the scala code they have allowMissingColumns, which does allow dropping of columns. I'm not quite clear when you'd want to set that. In the case of CDF, though, why isn't it okay for us to have dropped a column? Assuming we're reading as the final schema (which I think is the case), then if there were extra columns when we started, we just... don't return those?

Collaborator Author

In the scala code they have allowMissingColumns, which does allow dropping of columns.

Regarding the flags, I chatted with @zachschuermann and we decided to do the simple implementation in this round without all the configurations that the scala code provides. For CDF specifically, it seems that it only ever uses the forbidTightenNullability flag and none of the others. I also think we may not need the nullability flag because we only use the final schema.

Assuming we're reading as the final schema (which I think is the case),

That's correct

the case of CDF though, why isn't it okay for us to have dropped a column? [...] if there were extra columns when we started, we just... don't return those?

I think the reason is the output CDF of such a table may not make sense. Consider a table with columns a and b. The final schema only has column a. You could imagine all of the changes in the change data feed are made to b, but if you read with a as the only column you get:

{(_change_type: update_preimage,  a: 1),
 (_change_type: update_postimage, a: 1)}

Seems that nothing's changed at all.

@OussamaSaoudi-db (Collaborator Author) commented Jan 8, 2025

More generally: I think that if data is written with schema A and we read it with schema B, there should not be any data loss. Hence A's columns should be a subset of B's columns. Dropping a column is essentially a projection, and projections should be explicitly requested by the user/query.

Collaborator

AFAIK, clients must anyway be prepared to deal with spurious row changes? They can also happen if we don't distinguish copied rows from updated ones, or if the reader only cares about a subset of columns that didn't happen to change.

Collaborator Author

@scovich so are you saying that it actually might be acceptable to drop the columns that aren't in the end schema?

@scovich (Collaborator) commented Jan 8, 2025

Seems like it? In my experience most users would prefer spurious changes if the alternative is pipeline failure.

It would be nice to get confirmation from some CDF workload experts tho. If delta-spark drops columns in this case that's probably an indirect confirmation?

Collaborator Author

So to my understanding, they don't allow columns to be dropped in CDF. Their schema utils function lets you specify that you tolerate dropped columns, but CDF never uses it.

Here are the only call sites for schema compat in CDF 1 & 2. They change forbidTightenNullability, but all other flags are default.

SchemaUtils.isReadCompatible(
  existingSchema = metadata.schema,
  readSchema = readSchemaSnapshot.schema,
  forbidTightenNullability = true)

Note that allowMissingColumns is false by default.

def isReadCompatible(
    existingSchema: StructType,
    readSchema: StructType,
    forbidTightenNullability: Boolean = false,
    allowMissingColumns: Boolean = false,
    allowTypeWidening: Boolean = false,
    newPartitionColumns: Seq[String] = Seq.empty,
    oldPartitionColumns: Seq[String] = Seq.empty): Boolean = {

Here we see that if we do not allow missing columns, then the existing schema's fields must be a subset of the read schema (i.e., no dropped columns).

if (!allowMissingColumns &&
  !(existingFieldNames.subsetOf(newFields) &&
    isPartitionCompatible(newPartitionColumns, oldPartitionColumns))) {
  // Dropped a column that was present in the DataFrame schema
  return false
}

Note also that CDF doesn't use the partitionColumns parts of the schema comparison.

I'm planning on talking to some of the folks who worked on this in the past, but I believe what I have currently matches the CDF behaviour of delta-spark.

self_map.value_type().can_read_as(read_map.value_type())?;
}
(a, b) => {
// TODO: In the future, we will change this to support type widening.
Collaborator

check_cast_compat basically does this but for two arrow types. just fyi

Collaborator Author

Mentioned in this issue: #623

@OussamaSaoudi-db (Collaborator Author)

@nicklan Ooh good callout that EnsureDataTypes exists. I'll take a closer look at it and see if it matches our needs.

@OussamaSaoudi-db (Collaborator Author)

@nicklan I made a new issue to handle the duplication with EnsureDataTypes. #629

Successfully merging this pull request may close these issues: Allow CDF scans with schema evolution

3 participants