Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misc: Add compression format suffix to the written Parquet file #11563

Closed

Conversation

liujiayi771
Copy link
Contributor

Adding a compression format suffix to the written Parquet file, and make it easier to determine the compression format of the Parquet file.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 17, 2024
Copy link

netlify bot commented Nov 17, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 734c71d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6739789b75ef9700088bf8ae

@liujiayi771 liujiayi771 changed the title Add compression format suffix to the written Parquet file misc: Add compression format suffix to the written Parquet file Nov 17, 2024
@liujiayi771
Copy link
Contributor Author

@majetideepak Could you help to review?

@majetideepak
Copy link
Collaborator

@liujiayi771 I have mixed opinions on this. I do see the benefit but then the question becomes why not other file details. I also do not see this for files generated by other tools. Ideally, we could make this pluggable.
@Yuhta, @pedroerp Do you have any thoughts on this?

@liujiayi771
Copy link
Contributor Author

@majetideepak I added this because the Parquet files written by Spark have compressed format suffixes. However, I also tested the Parquet files written by Presto, which do not even have the .parquet suffix.

@majetideepak
Copy link
Collaborator

@liujiayi771 It makes sense that Spark users of Gluten+Velox would like to see Spark filenames and similarly, Presto users would prefer Presto filenames.
Maybe we can add a simple HiveConfig for this? Something like: hive.parquet.write-filename-suffix="presto". "spark" can be another option. Velox already has Spark functions. So using the application name here should be okay imo.

@Yuhta
Copy link
Contributor

Yuhta commented Nov 18, 2024

@liujiayi771 @majetideepak I think we need to make it more general than just Presto or Spark, there are many more engines inside Meta using Velox than these two. We can make this a global HiveDataSinkFileNameGenerator for now that takes bucket ID, connector query config, user specified file name (targetFileName), commit strategy, etc.

@FelixYBW
Copy link

@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?

@liujiayi771
Copy link
Contributor Author

liujiayi771 commented Nov 19, 2024

@liujiayi771 Can we add it from Gluten instead before pass to velox to make Velox more general?

@FelixYBW Currently, the parquet file name written by Gluten is
Gluten_Stage_3_TID_2124_VTID_257_0_3_0946dfb5-f773-42c9-ac8e-d4e70bede02b.parquet
which is generated by

targetFileName = fmt::format(
        "{}_{}_{}_{}",
        connectorQueryCtx_->taskId(),
        connectorQueryCtx_->driverId(),
        connectorQueryCtx_->planNodeId(),
        makeUuid());

I found that Jimmy add a new targetFileName in LocationHandle, so we can specify the targetFileName that contains compression kind suffix from Gluten side. However, we will not be able to retain information such as task ID, driver ID, etc., because this information cannot be obtained in the planning phase. We can only generate file names similar to Spark that only contain UUID, for example, Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet. Do you think it is necessary to retain task ID, stage ID, and driver ID in the file name?

cc @zhztheplayer, @rui-mo, @JkSelf.

@JkSelf
Copy link
Collaborator

JkSelf commented Nov 19, 2024

@liujiayi771
In versions prior to 3.4 of Gluten, the fileName could be guaranteed to contain task ID information. This is because this information is created when the executor actually retrieves the data to write to the file, and then it is offloaded to the Parquet writer.
However, in versions after 3.4, it is difficult to obtain this information when defining the file name during the planning phase. Therefore, I believe it is not feasible to maintain consistency with Spark in this regard. Unless we modify the file name within the actual Parquet writer, which I think is not very meaningful.

@liujiayi771
Copy link
Contributor Author

@JkSelf The Parquet files written by Spark only contain UUID in the format part-uuid.parquet. What I mean is, if the files produced by Gluten also only contain UUID, do you think there is a problem with that? Is it necessary to include task ID and driver ID information like we currently do?

@JkSelf
Copy link
Collaborator

JkSelf commented Nov 19, 2024

@liujiayi771
I understand, thank you for your explanation. It's fine with me to remove the task ID and driver ID.

@FelixYBW
Copy link

Agree, let's keep the exact the same parquet name as Spark. Customer may use it in their tools. who know.

@liujiayi771
Copy link
Contributor Author

Close this PR, as we have handled this on the Gluten side. Thanks all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants