refine: refine interface of ManifestWriter #738

ZENOTME · 2024-11-28T14:51:19Z

This PR refines the write interface of ManifestWriter, following the ManifestWriter in pyiceberg. It adds three methods, add, delete, and existing, each of which rewrites some metadata of the manifest entry, e.g. snapshot id, sequence number, and file sequence number.

These refined interfaces are useful for MergeAppend (I'm working on it now).

ZENOTME · 2024-11-28T16:09:22Z

cc @liurenjie1024 @Xuanwo @Fokko @c-thiel

liurenjie1024

Thanks @ZENOTME for this pr!

liurenjie1024 · 2024-12-04T03:38:59Z

crates/iceberg/src/spec/manifest.rs

-    /// Write a manifest.
-    pub async fn write(mut self, manifest: Manifest) -> Result<ManifestFile> {
+    /// Add a new manifest entry.
+    pub fn add(&mut self, mut entry: ManifestEntry) -> Result<()> {


It's kind of weird to manipulate the arguments. How about making the argument a DataFile?

Applies to other apis.

I think the reason for using ManifestEntry here is that in some cases we add entries from another manifest, and then we need some information from the original ManifestEntry. E.g. when we add a delete manifest entry, we change the snapshot id and keep the original sequence number.

/// Add a delete manifest entry.
pub fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> {
    entry.status = ManifestStatus::Deleted;
    entry.snapshot_id = Some(self.snapshot_id);
    self.add_entry(entry)?;
    Ok(())
}

I'm not convinced. If we ask the user to provide a ManifestEntry, it will be unclear to them which parts are used and which are not. I think the Java style is clearer from a user's point of view. If we are to use the ManifestEntry approach, we must clearly document the behavior of each part, e.g. which is ignored and which is preserved.

If we are to use the ManifestEntry approach, we must clearly document the behavior of each part, e.g. which is ignored and which is preserved.

I agree with this. Then I think these functions can be pub(crate) so that public users cannot call them. I don't think users need this API for now.🤔
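To make the "which part is overwritten, which is preserved" point concrete, here is a minimal sketch of the entry-based methods discussed above. The types are simplified stand-ins, not the real iceberg-rust definitions; `EntryWriterSketch`, `file_path`, and the in-memory `entries` vector are assumptions made for illustration, and the methods are `pub` here only so the sketch can be exercised from outside.

```rust
// Simplified stand-ins for the iceberg-rust types; the field set is reduced for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestStatus {
    Existing,
    Added,
    Deleted,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestEntry {
    pub status: ManifestStatus,
    pub snapshot_id: Option<i64>,
    pub sequence_number: Option<i64>,
    pub file_sequence_number: Option<i64>,
    pub file_path: String,
}

/// Hypothetical writer that only collects entries in memory (no Avro, no I/O).
pub struct EntryWriterSketch {
    snapshot_id: i64,
    entries: Vec<ManifestEntry>,
}

impl EntryWriterSketch {
    pub fn new(snapshot_id: i64) -> Self {
        Self { snapshot_id, entries: Vec::new() }
    }

    /// Add a delete manifest entry.
    /// Overwritten: `status`, `snapshot_id`.
    /// Preserved: `sequence_number`, `file_sequence_number`, `file_path`.
    pub fn delete(&mut self, mut entry: ManifestEntry) {
        entry.status = ManifestStatus::Deleted;
        entry.snapshot_id = Some(self.snapshot_id);
        self.entries.push(entry);
    }

    /// Add an existing manifest entry.
    /// Overwritten: `status` only. Everything else is preserved as given.
    pub fn existing(&mut self, mut entry: ManifestEntry) {
        entry.status = ManifestStatus::Existing;
        self.entries.push(entry);
    }

    /// Entries collected so far, in insertion order.
    pub fn entries(&self) -> &[ManifestEntry] {
        &self.entries
    }
}
```

Spelling out the overwritten and preserved fields in the doc comments is the kind of documentation the review asks for if the entry-based methods stay in the API.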

liurenjie1024 · 2024-12-19T06:51:01Z

crates/iceberg/src/spec/manifest.rs

+    }
+
+    /// Write manifest file and return it.
+    pub async fn to_manifest_file(mut self, metadata: ManifestMetadata) -> Result<ManifestFile> {


I have concerns about this API, since it's error prone. According to Iceberg's spec, each manifest file should contain one type of file: data or deletes. It's quite possible that the user passes different kinds of entries to the previous methods, so that the metadata no longer matches. My suggestion is to follow the Java/Python approach:

A factory method like

pub fn new_v1_writer(...) {}
pub fn new_v2_writer(...) {}
pub fn new_v2_delete_writer(...) {}

We could use things like trait or enum to abstract out common parts of different writers.

We could use things like trait or enum to abstract out common parts of different writers.

Difference between v1, v2, delete is:

the metadata of avro file

avro schema

content type

check in add_entry to make sure entry.content_type == writer.content_type

serialize the ManifestEntry

I think both differences, apart from serializing the ManifestEntry, can be implemented by storing different data in the writer when we create it through the factory method. So do we really need to abstract out the common parts of the different writers now?🤔

I'm fine without a trait/enum; the point is to use factory methods to ensure API safety.
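The factory-method idea, with the content-type check in the add path, could look roughly like the sketch below. It follows the suggestion of storing the differing data in the writer rather than introducing a trait. Everything except the three factory method names is an assumption for illustration: `ManifestWriterSketch`, `add_file`, the enums, and the `String` error type are not the real iceberg-rust API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FormatVersion {
    V1,
    V2,
}

/// What kind of files a manifest may hold.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestContentType {
    Data,
    Deletes,
}

/// What kind of file a single entry describes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataFile {
    pub path: String,
    pub content: DataContentType,
}

/// Hypothetical writer: version and content type are fixed at construction time.
pub struct ManifestWriterSketch {
    version: FormatVersion,
    content: ManifestContentType,
    files: Vec<DataFile>,
}

impl ManifestWriterSketch {
    fn with(version: FormatVersion, content: ManifestContentType) -> Self {
        Self { version, content, files: Vec::new() }
    }

    pub fn new_v1_writer() -> Self {
        Self::with(FormatVersion::V1, ManifestContentType::Data)
    }

    pub fn new_v2_writer() -> Self {
        Self::with(FormatVersion::V2, ManifestContentType::Data)
    }

    pub fn new_v2_delete_writer() -> Self {
        Self::with(FormatVersion::V2, ManifestContentType::Deletes)
    }

    /// Reject files whose content does not match the manifest's content type.
    pub fn add_file(&mut self, file: DataFile) -> Result<(), String> {
        let matches_writer = match self.content {
            ManifestContentType::Data => file.content == DataContentType::Data,
            ManifestContentType::Deletes => file.content != DataContentType::Data,
        };
        if !matches_writer {
            return Err(format!(
                "{:?} file {} cannot be written to a {:?} manifest",
                file.content, file.path, self.content
            ));
        }
        self.files.push(file);
        Ok(())
    }

    pub fn version(&self) -> FormatVersion {
        self.version
    }

    pub fn file_count(&self) -> usize {
        self.files.len()
    }
}
```

Because the content type is chosen by the factory method, the writer never needs a separate metadata argument at write time, which removes the mismatch the review is worried about.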

liurenjie1024

Thanks @ZENOTME for this pr, generally LGTM! I still have concerns about the add/existing/delete API and prefer the approach used in the Java API, org.apache.iceberg.ManifestWriter#add(F), which provides better API safety. For the issues I mentioned in the comments, it's possible to add some checks, but an API that throws errors at runtime is not a good API for users.

liurenjie1024 · 2025-01-15T09:59:06Z

crates/iceberg/src/spec/manifest.rs

+    /// Create a new builder.
+    pub fn new(
+        output: OutputFile,
+        snapshot_id: i64,


This should be optional.

Which value should we assign in https://github.com/apache/iceberg-rust/blob/b39d7db8e30400e9bd77a82ecc85a497327f47b8/crates/iceberg/src/spec/manifest.rs#L484C13-L484C26 if it's None? It's a required field: https://iceberg.apache.org/spec/#manifest-lists:~:text=503%20added_snapshot_id

I think we should use a special value like -1 here. It's optional because the snapshot id is unknown when we append a data file here. The actual snapshot id is determined at commit, which may fail and be retried.
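A minimal sketch of the placeholder idea, assuming the writer holds an `Option<i64>` snapshot id and the required manifest-list field is a plain `i64`. The constant name matches the one that appears later in this review; `ManifestFileSketch`, `added_snapshot_id_for_write`, and `assign_snapshot_id` are hypothetical names used only for illustration.

```rust
/// Placeholder for a snapshot id that is not known yet.
pub const UNASSIGNED_SNAPSHOT_ID: i64 = -1;

/// Reduced stand-in for a manifest-list entry; only the required field is modeled.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestFileSketch {
    pub added_snapshot_id: i64,
}

/// Map the writer's optional snapshot id to the value stored in the required field.
pub fn added_snapshot_id_for_write(snapshot_id: Option<i64>) -> i64 {
    snapshot_id.unwrap_or(UNASSIGNED_SNAPSHOT_ID)
}

/// At commit time, fill in the real id, but only where the placeholder is still present.
pub fn assign_snapshot_id(file: &mut ManifestFileSketch, committed_snapshot_id: i64) {
    if file.added_snapshot_id == UNASSIGNED_SNAPSHOT_ID {
        file.added_snapshot_id = committed_snapshot_id;
    }
}
```

Keeping the replacement step separate from writing means a failed commit can be retried with a new snapshot id without rewriting the manifest itself, under the assumptions above.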

liurenjie1024 · 2025-01-16T08:44:19Z

crates/iceberg/src/spec/manifest.rs

+
+    /// Add an existing manifest entry. This method will update following status of the entry:
+    /// - Update the entry status to `Existing`
+    pub fn existing(&mut self, mut entry: ManifestEntry) -> Result<()> {


This is incorrect: an existing entry requires the user to provide the snapshot id and data sequence number, which are all optional in ManifestEntry.

It seems iceberg-java also has an interface like this one: https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L157. And it looks like it also supports the case of an existing entry without a snapshot id.🤔

They are different cases: one is a public API, the other is package private. They have different callers.

Oh, I got you. I think the interface here is package private. I forgot to mark them as pub(crate) to avoid confusion.

liurenjie1024 · 2025-01-16T08:45:04Z

crates/iceberg/src/spec/manifest.rs

+    /// Add a delete manifest entry. This method will update following status of the entry:
+    /// - Update the entry status to `Deleted`
+    /// - Set the snapshot id to the current snapshot id
+    pub fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> {


This is also incorrect. The sequence number must be provided.
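One way to address both remarks at the type level is to take a DataFile plus the values that must be supplied, so a missing snapshot id or sequence number becomes a compile error rather than a runtime one. The sketch below assumes simplified types; `FileWriterSketch`, `add_file`, and `add_delete_file` are hypothetical names, and only `add_existing_file` is a name proposed elsewhere in this review.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataFile {
    pub file_path: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestStatus {
    Existing,
    Added,
    Deleted,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestEntry {
    pub status: ManifestStatus,
    pub snapshot_id: Option<i64>,
    pub sequence_number: Option<i64>,
    pub file_sequence_number: Option<i64>,
    pub data_file: DataFile,
}

/// Hypothetical writer that builds entries itself instead of accepting them.
pub struct FileWriterSketch {
    snapshot_id: Option<i64>,
    entries: Vec<ManifestEntry>,
}

impl FileWriterSketch {
    pub fn new(snapshot_id: Option<i64>) -> Self {
        Self { snapshot_id, entries: Vec::new() }
    }

    /// New file: the snapshot id comes from the writer, sequence numbers are left unset.
    pub fn add_file(&mut self, data_file: DataFile) {
        self.push(ManifestStatus::Added, self.snapshot_id, None, None, data_file);
    }

    /// Existing file: the caller must supply the original snapshot id and sequence number.
    pub fn add_existing_file(
        &mut self,
        data_file: DataFile,
        snapshot_id: i64,
        sequence_number: i64,
        file_sequence_number: Option<i64>,
    ) {
        self.push(
            ManifestStatus::Existing,
            Some(snapshot_id),
            Some(sequence_number),
            file_sequence_number,
            data_file,
        );
    }

    /// Deleted file: snapshot id comes from the writer, the sequence number must be given.
    pub fn add_delete_file(
        &mut self,
        data_file: DataFile,
        sequence_number: i64,
        file_sequence_number: Option<i64>,
    ) {
        self.push(
            ManifestStatus::Deleted,
            self.snapshot_id,
            Some(sequence_number),
            file_sequence_number,
            data_file,
        );
    }

    fn push(
        &mut self,
        status: ManifestStatus,
        snapshot_id: Option<i64>,
        sequence_number: Option<i64>,
        file_sequence_number: Option<i64>,
        data_file: DataFile,
    ) {
        self.entries.push(ManifestEntry {
            status,
            snapshot_id,
            sequence_number,
            file_sequence_number,
            data_file,
        });
    }

    pub fn entries(&self) -> &[ManifestEntry] {
        &self.entries
    }
}
```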

ZENOTME · 2025-01-16T10:36:18Z

Thanks @ZENOTME for this pr, generally LGTM! I still have concerns about the add/existing/delete API and prefer the approach used in the Java API, org.apache.iceberg.ManifestWriter#add(F), which provides better API safety. For the issues I mentioned in the comments, it's possible to add some checks, but an API that throws errors at runtime is not a good API for users.

Hi @liurenjie1024, it seems iceberg-java also provides an entry-based interface in https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L157 and uses it in ManifestMergeManager: https://github.com/apache/iceberg/blob/d96901b843395fe669f6bd4f618f8e5e46c0eed4/core/src/main/java/org/apache/iceberg/ManifestMergeManager.java#L188. It also doesn't check cases such as an existing entry with a null data sequence number. Is that a case iceberg-java missed, or is it acceptable?

1. adopt factory method to build different type manifest writer 2. provide add, exist, delete method

2. mark add entry method as crate private

liurenjie1024

Thanks @ZENOTME for this great pr! I just left some minor suggestions, and we are close!

liurenjie1024 · 2025-01-17T12:53:18Z

crates/iceberg/src/spec/manifest.rs

+        data_file: DataFile,
+        snapshot_id: i64,
+        sequence_number: i64,
+        file_sequence_number: i64,


This should be optional, file_sequence_number could be inherited from snapshot.

liurenjie1024 · 2025-01-17T12:53:27Z

crates/iceberg/src/spec/manifest.rs

+        &mut self,
+        data_file: DataFile,
+        sequence_number: i64,
+        file_sequence_number: i64,


This should be optional, file_sequence_number could be inherited from snapshot.
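The reason an optional value is enough can be sketched as a read-side rule. This assumes the inheritance rule is: an unset sequence number on an entry with status Added takes the sequence number of the enclosing manifest (assigned with the snapshot at commit), while any explicitly stored value is kept. `resolve_sequence_number` and its parameters are hypothetical names, not part of iceberg-rust.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestStatus {
    Existing,
    Added,
    Deleted,
}

/// Resolve a (data or file) sequence number read from a manifest entry.
///
/// - An explicit value is always kept.
/// - A missing value on an `Added` entry is inherited from the manifest.
/// - A missing value on any other entry stays missing; this sketch does not
///   decide whether that should be an error.
pub fn resolve_sequence_number(
    status: ManifestStatus,
    entry_value: Option<i64>,
    manifest_sequence_number: i64,
) -> Option<i64> {
    match (entry_value, status) {
        (Some(value), _) => Some(value),
        (None, ManifestStatus::Added) => Some(manifest_sequence_number),
        (None, _) => None,
    }
}
```

Under this rule a writer can leave the value unset for newly added files and does not need to know the final sequence number before the commit succeeds.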

liurenjie1024 · 2025-01-17T12:58:37Z

crates/iceberg/src/spec/manifest.rs

@@ -41,6 +41,9 @@ use crate::io::OutputFile;
 use crate::spec::PartitionField;
 use crate::{Error, ErrorKind};

+/// Placeholder for snapshot ID. The field with this value must be replaced with the actual snapshot ID before it is committed.
+pub const UNASSIGNED_SNAPSHOT_ID: i64 = -1;


We should move this to snapshot module.

Xuanwo

Thank you @ZENOTME for working on this. The design mostly looks good to me. I only have some questions about the API naming.

Xuanwo · 2025-01-17T14:45:19Z

crates/iceberg/src/spec/manifest.rs

+    /// # TODO
+    /// Remove this allow later
+    #[allow(dead_code)]
+    pub(crate) fn existing(&mut self, mut entry: ManifestEntry) -> Result<()> {


Hi, the API naming seems a bit unclear. If we are "adding an existing manifest entry," how about naming this API add_existing_entry?

Xuanwo · 2025-01-17T14:45:32Z

crates/iceberg/src/spec/manifest.rs

+    /// # TODO
+    /// Remove this allow later
+    #[allow(dead_code)]
+    pub(crate) fn delete(&mut self, mut entry: ManifestEntry) -> Result<()> {


Maybe delete_entry?

Xuanwo · 2025-01-17T14:45:39Z

crates/iceberg/src/spec/manifest.rs

+    /// - Set the snapshot id to the current snapshot id
+    /// - Set the sequence number to `None` if it is invalid(smaller than 0)
+    /// - Set the file sequence number to `None`
+    pub(crate) fn add(&mut self, mut entry: ManifestEntry) -> Result<()> {


Maybe add_entry?

Xuanwo · 2025-01-17T14:46:17Z

crates/iceberg/src/spec/manifest.rs

+
+    /// Add an existing manifest entry. The original data and file sequence numbers, snapshot ID,
+    /// which were assigned at commit, must be preserved when adding an existing entry.
+    pub fn existing_file(


How about add_existing_file?

Xuanwo · 2025-01-17T14:49:24Z

crates/iceberg/src/spec/manifest.rs

+    }
+
+    /// Write manifest file and return it.
+    pub async fn to_manifest_file(mut self) -> Result<ManifestFile> {


to_manifest_file sounds more like a conversion function, but it actually involves heavy I/O operations. How about using the name suggested in the comments: write_manifest_file?

liurenjie1024

Thanks @ZENOTME for this pr, LGTM!

liurenjie1024 · 2025-01-18T06:21:46Z

Let's wait for a moment to see if @Xuanwo has other comments.

Xuanwo

Thank you @ZENOTME for working on this, let's move!

ZENOTME mentioned this pull request Nov 28, 2024

Support for MergeAppend #736

Open

liurenjie1024 reviewed Dec 4, 2024

ZENOTME requested a review from liurenjie1024 December 13, 2024 08:40

liurenjie1024 reviewed Dec 19, 2024

ZENOTME force-pushed the refine_manifest branch from 76ae53d to 491d60f Compare December 19, 2024 15:23

ZENOTME force-pushed the refine_manifest branch from 491d60f to b39d7db Compare January 10, 2025 14:31

ZENOTME requested a review from liurenjie1024 January 10, 2025 14:32

liurenjie1024 reviewed Jan 16, 2025

ZENOTME force-pushed the refine_manifest branch from d4a569c to b701bd9 Compare January 17, 2025 06:03

ZENOTME added 3 commits January 17, 2025 14:53

refine interface of manifest writer:

1a4adfc

1. adopt factory method to build different type manifest writer 2. provide add, exist, delete method

1. add method for data file

19ac4dc

2. mark add entry method as crate private

make snapshot id optional

97c0369

ZENOTME force-pushed the refine_manifest branch from b701bd9 to 97c0369 Compare January 17, 2025 06:54

ZENOTME requested a review from liurenjie1024 January 17, 2025 07:17

liurenjie1024 reviewed Jan 17, 2025

Xuanwo reviewed Jan 17, 2025

refine interface

bd71120

liurenjie1024 approved these changes Jan 18, 2025

Xuanwo approved these changes Jan 18, 2025

Xuanwo merged commit 55cca03 into apache:main Jan 18, 2025
17 checks passed

ZENOTME deleted the refine_manifest branch January 20, 2025 06:21
