Basic Transaction Support for DeltaCAT Durable Storage Model w/o Isolation #442

pdames · 2025-01-09T21:46:58Z

Summary

Adds support for multi-namespace/multi-table atomic transactions that create, update, rename, and/or replace catalog artifacts. Strong transaction isolation (e.g., via serializable and/or snapshot-level isolation) is not yet supported, and will come in a subsequent PR. The catalog transaction log currently provides a minimal listing of successfully committed transaction IDs, with no O(1) read access to corresponding transaction operation details (also to be added in a subsequent PR).

Rationale

This is part of the core change set required to support https://github.com/ray-project/deltacat/milestone/4 via (1) compaction on an open, durable DeltaCAT catalog metadata format, (2) synchronization of DeltaCAT datasets to Iceberg/Hudi/Delta formats via lightweight metadata translation.

Changes

Changes are focused on DeltaCAT's internal storage model (i.e. deltacat/storage/model), esp. the DeltaCAT 2.0 durable storage bindings introduced via metafile.py.

Impact

These changes attempt to preserve backwards compatibility with the existing DeltaCAT 1.X compactor.

Testing

Unit tests (make test).

Regression Risk

If this is a bugfix, assess the risk of regression caused by this fix and steps taken to mitigate it.

Checklist

Unit tests covering the changes have been added
- If this is a bugfix, regression tests have been added
E2E testing has been performed

…unittest.

…update vs. replace transaction.

…stamp to transaction ID.

mcember · 2025-01-13T16:20:49Z

deltacat/storage/model/delta.py

@@ -244,6 +254,13 @@ def stream_id(self) -> Optional[str]:
            return delta_locator.stream_id
        return None

+    @property
+    def stream_format(self) -> Optional[str]:


Why not have this return Optional[StreamFormat] given you declare the enum in types.py?

I need to revisit this again, but I think this was just a concession on backwards compatibility with DeltaCAT 1.X. However, we may be able to update this in a way that still preserves backwards compatibility - it may just expand the corresponding scope of changes a bit.

mcember · 2025-01-13T16:27:57Z

deltacat/storage/model/delta.py

@@ -265,6 +282,65 @@ def stream_position(self) -> Optional[int]:
            return delta_locator.stream_position
        return None

+    def to_serializable(self) -> Delta:


This feels non-standard to me. A pattern I've seen more is that the serialize method will ignore mutable/internal fields, e.g. jackson JsonIgnore annotation. Just looking at this class - to_serializable returns a serializable object, but I'm not sure how I'm expected to perform the serialization (pickle? json). It might help to add pydoc here.

To_serializable and from_serializable are also asymmetric in an unexpected way in that you can't round trip using them. Maybe from_serializable could be renamed to deserialize, or you could explicitly add a serialized function

Yeah, I had been thinking about potential new names for these methods, as to_serializable() has expanded to be a more generic prepare_for_write()/before_write() method, which may include validating the object before writing it, removing in-memory references that shouldn't be persisted to disk, changing object formats to ensure they're serializable via msgpack, etc.), etc. Likewise, from_serializable() has also expanded to be more of a generic after_read() method which can restore in-memory object formats (e.g., deserialize native Arrow schemas from bytes) but also perform any post-read validations.

Regarding symmetry, the test_metafile_io() tests include asserts that all write() and read() invokes against serialized/deserialized metafiles remain lossless. The two known exceptions to this using msgpack are seen at the bottom of test_python_type_serde(), which shows that (1) tuples becomes lists, and (2) bytearray objects become bytes.

So, if your class requires a tuple to remain a tuple or bytearray to remain bytearray post-read, then this is something you'd need to put in your from_serializable() method today.

Regardless, the above seems like useful behavior to document outside of just the test cases as the expected behavior of these methods matures.

+1, naming/functionality was odd/non-standard here and consider changing it (doesn't have to be this CR). That said, I asked GPT about it and it seemed to think it was fine even when prodded, so if it's good enough for our AI overlords it's probably good enough to ship ;)

mcember · 2025-01-13T16:31:48Z

deltacat/storage/model/locator.py

+DEFAULT_PATH_SEPARATOR = "/"
+
+
+class LocatorName:


Having not read the full CR - did you consider some system in which all objects have an immutable guid and an optional user defined name? This might help to deal with conflicts like moving a table to a namespace which contains a conflicting name, or act as a primitive to do renames.

If I could go back in time and re-design Andes, I would give all tables guids so that renaming is less invasive

Yep - that's exactly what we have here. You can see some of the test_metafile_io.py rename tests to see how this works in practice but, if an object has a mutable name (e.g., Table and Namespace), then it will have a separate file written to map it's mutable locator name to the underlying metafile's immutable id (which is a UUID). If you rename the object, then this UUID will stay the same, the mapping from the old name to that UUID will be marked as deleted, and a mapping from the new name back to that UUID will be created.

One other thing I'll add is that, especially with the recent introduction of Locator Aliases and corresponding alias names in my last commit, the relationship between all of these different mutable/immutable locators, IDs, and names is becoming even more confusing and deserves some more formal documentation both in code and in a corresponding specification.

I think the order of operations here will roughly be (1) implement the concepts in code and write a bunch of tests to ensure they all work as intended, (2) document the code thoroughly and discuss potential renames/code doc updates to assist existing DeltaCAT internal developers, (3) write an external specification doc to solidify these concepts for external users and developers.

mcember · 2025-01-13T16:36:16Z

deltacat/storage/model/locator.py


 class Locator:
+    """
+    Creates a globally unique reference to any named catalog object. Locators


So - this is basically a URN, but maybe without elements like scheme?

You could just call these URNs and make them compatible with the URN spec by making strings like: `urn:deltacat:delta-{deltaId}

Even if you aren't using the URN spec, it would be helpful to describe the specification of the locator strings. Or, if that specification is truly specific to each object type and you impose no constraints, that is relevant to document too

I suppose it could be thought of like a scheme-less URN. A Locator basically just says that, if I have some filesystem that knows how to read a given catalog root path, then you should be able to create a reference to any object in the catalog using its unique hex digest directly off of that root. It's kind of like tiny URL for the catalog, such that every cataloged object (from Namespace to Delta) remains discoverable in O(1) directly from a reference stored at {catalog_root}/{object_digest}.

mcember · 2025-01-13T16:38:38Z

deltacat/storage/model/locator.py

+    @property
+    def name(self) -> LocatorName:
+        """
+        Returns the name of this locator.


The name of the locator is the type of object, like delta or stream? IMO name is a bit confusing since I might think that a locator name is the canonical string

The name for a Delta would be a single-part name just containing its stream position, while the name of a Partition is a tuple of its partition values and partition ID (since identical partition values may exist in multiple different partition schemes, just matching partition values alone may result in incorrect partition pruning).

But your initial thought is pretty close, since the canonical string representation is just {parent_object_digest}/{this_object_name} (i.e., name provides a relative identifier that can be used to locate the object within all siblings of the same parent, while canonical_string provides a catalog-global identifier that can be used to find the object among all objects registered with the catalog).

Open for naming suggestions to help clarify these relationships though.

mcember · 2025-01-13T16:48:50Z