misc: Add compression format suffix to the written Parquet file #11563

liujiayi771 · 2024-11-17T04:57:33Z

Add a compression format suffix to the name of the written Parquet file, making it easier to determine the file's compression format.
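As a rough illustration of the idea (not the code in this PR), the suffix can be derived from the compression kind and placed before the `.parquet` extension. The enum `CompressionKind`, the helper names, and the exact suffix strings below are assumptions made for this sketch.

```cpp
#include <string>

// Hypothetical enum for illustration; not Velox's actual compression type.
enum class CompressionKind { kNone, kSnappy, kZstd, kGzip };

// Returns the infix placed before ".parquet", e.g. ".zstd".
// Uncompressed files get no infix. The strings are illustrative.
inline std::string compressionSuffix(CompressionKind kind) {
  switch (kind) {
    case CompressionKind::kNone:
      return "";
    case CompressionKind::kSnappy:
      return ".snappy";
    case CompressionKind::kZstd:
      return ".zstd";
    case CompressionKind::kGzip:
      return ".gz";
  }
  return "";
}

// Builds "<baseName><suffix>.parquet".
inline std::string parquetFileName(
    const std::string& baseName,
    CompressionKind kind) {
  return baseName + compressionSuffix(kind) + ".parquet";
}
```

With these assumptions, `parquetFileName("part-0", CompressionKind::kZstd)` yields `part-0.zstd.parquet`, so the codec can be read from the name without opening the file footer.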

netlify · 2024-11-17T04:57:50Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`734c71d`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6739789b75ef9700088bf8ae

liujiayi771 · 2024-11-17T05:07:17Z

@majetideepak Could you help to review?

majetideepak · 2024-11-17T12:53:26Z

@liujiayi771 I have mixed opinions on this. I do see the benefit, but then the question becomes why not include other file details as well. I also do not see this suffix in files generated by other tools. Ideally, we could make this pluggable.
@Yuhta, @pedroerp Do you have any thoughts on this?

liujiayi771 · 2024-11-17T13:00:54Z

@majetideepak I added this because the Parquet files written by Spark have compression format suffixes. However, I also tested the Parquet files written by Presto, which do not even have the .parquet suffix.

majetideepak · 2024-11-18T10:07:35Z

@liujiayi771 It makes sense that Spark users of Gluten+Velox would like to see Spark filenames and similarly, Presto users would prefer Presto filenames.
Maybe we can add a simple HiveConfig for this? Something like hive.parquet.write-filename-suffix="presto", with "spark" as another option. Velox already has Spark functions, so using the application name here should be okay, imo.
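A minimal sketch of how such a config value might select a naming style. Only the two option strings come from the suggestion above; `FileNameStyle`, `parseFileNameStyle`, and `styledFileName` are hypothetical names, and the per-engine behavior follows what was observed earlier in this thread (Presto files carry no extension, Spark files carry a compression suffix).

```cpp
#include <optional>
#include <string>

// Hypothetical naming styles keyed by the proposed config value.
enum class FileNameStyle { kPresto, kSpark };

// Maps the config string to a style; unknown values yield std::nullopt.
inline std::optional<FileNameStyle> parseFileNameStyle(
    const std::string& value) {
  if (value == "presto") {
    return FileNameStyle::kPresto;
  }
  if (value == "spark") {
    return FileNameStyle::kSpark;
  }
  return std::nullopt;
}

// Presto style: bare name with no extension.
// Spark style: "<base>.<codec>.parquet", or "<base>.parquet" when
// the codec string is empty.
inline std::string styledFileName(
    FileNameStyle style,
    const std::string& base,
    const std::string& codec) {
  if (style == FileNameStyle::kPresto) {
    return base;
  }
  if (codec.empty()) {
    return base + ".parquet";
  }
  return base + "." + codec + ".parquet";
}
```

Returning `std::nullopt` for unknown values leaves it to the caller to decide whether to fall back to a default or reject the config.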

Yuhta · 2024-11-18T15:54:52Z

@liujiayi771 @majetideepak I think we need to make this more general than just Presto or Spark; there are many more engines inside Meta using Velox than these two. For now, we can make this a global HiveDataSinkFileNameGenerator that takes the bucket ID, connector query config, user-specified file name (targetFileName), commit strategy, etc.
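One way such a global generator could look, sketched under heavy assumptions: the struct `FileNameInputs`, the alias `FileNameGenerator`, and the registration functions are all hypothetical and do not reflect any existing Velox interface. The commit strategy and connector query config are left out to keep the sketch short, and the registry is not thread-safe.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical inputs, loosely modeled on the list in the comment above.
struct FileNameInputs {
  std::optional<uint32_t> bucketId;
  std::string targetFileName; // user-specified name; may be empty
  std::string compressionCodec; // e.g. "zstd"; empty when uncompressed
  std::string uuid;
};

using FileNameGenerator = std::function<std::string(const FileNameInputs&)>;

// Default policy: honor the user-specified name, else use the UUID,
// prefixed with the bucket ID when one is present.
inline std::string defaultFileName(const FileNameInputs& in) {
  if (!in.targetFileName.empty()) {
    return in.targetFileName;
  }
  std::string name = in.uuid;
  if (in.bucketId.has_value()) {
    name = "bucket" + std::to_string(*in.bucketId) + "_" + name;
  }
  return name;
}

// Process-wide generator slot (not thread-safe in this sketch).
inline FileNameGenerator& fileNameGenerator() {
  static FileNameGenerator generator = defaultFileName;
  return generator;
}

// Lets an engine install its own naming policy.
inline void registerFileNameGenerator(FileNameGenerator generator) {
  fileNameGenerator() = std::move(generator);
}
```

An engine that wants Spark-like names could then register a generator returning `uuid + "." + compressionCodec + ".parquet"`, while other engines keep the default.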

FelixYBW · 2024-11-18T23:30:46Z

@liujiayi771 Can we add it on the Gluten side before passing the name to Velox, to keep Velox more general?

liujiayi771 · 2024-11-19T01:24:52Z

> @liujiayi771 Can we add it on the Gluten side before passing the name to Velox, to keep Velox more general?

@FelixYBW Currently, a Parquet file name written by Gluten looks like
Gluten_Stage_3_TID_2124_VTID_257_0_3_0946dfb5-f773-42c9-ac8e-d4e70bede02b.parquet
which is generated by

targetFileName = fmt::format(
    "{}_{}_{}_{}",
    connectorQueryCtx_->taskId(),
    connectorQueryCtx_->driverId(),
    connectorQueryCtx_->planNodeId(),
    makeUuid());

I found that Jimmy added a new targetFileName field to LocationHandle, so we can specify a targetFileName that contains the compression kind suffix from the Gluten side. However, we would not be able to retain information such as the task ID and driver ID, because it is not available in the planning phase. We could only generate Spark-like file names that contain just a UUID, for example, Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet. Do you think it is necessary to retain the task ID, stage ID, and driver ID in the file name?
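To make the trade-off concrete, here is a small sketch contrasting the two naming schemes. Both helper functions and their parameter types are hypothetical; the first mirrors the four-field format string quoted above, and the second builds the UUID-only name that can be fixed at planning time.

```cpp
#include <string>

// Runtime-style name: needs the task ID, driver ID, and plan node ID,
// which are only known while the task is executing.
inline std::string runtimeFileName(
    const std::string& taskId,
    int driverId,
    const std::string& planNodeId,
    const std::string& uuid) {
  return taskId + "_" + std::to_string(driverId) + "_" + planNodeId + "_" +
      uuid;
}

// Planning-time name: only the UUID and the codec are available, so the
// result resembles "Gluten-<uuid>.<codec>.parquet".
inline std::string plannedFileName(
    const std::string& uuid,
    const std::string& codec) {
  std::string name = "Gluten-" + uuid;
  if (!codec.empty()) {
    name += "." + codec;
  }
  return name + ".parquet";
}
```

Under these assumptions, `plannedFileName("0946dfb5-f773-42c9-ac8e-d4e70bede02b", "zstd")` reproduces the example name given above, at the cost of dropping the execution identifiers.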

cc @zhztheplayer, @rui-mo, @JkSelf.

JkSelf · 2024-11-19T02:21:51Z

@liujiayi771
In versions prior to 3.4 of Gluten, the fileName could be guaranteed to contain task ID information. This is because this information is created when the executor actually retrieves the data to write to the file, and then it is offloaded to the Parquet writer.
However, in versions after 3.4, it is difficult to obtain this information when the file name is defined during the planning phase. Therefore, I believe it is not feasible to stay consistent with Spark in this regard unless we modify the file name within the actual Parquet writer, which I think is not very meaningful.

liujiayi771 · 2024-11-19T02:31:35Z

@JkSelf The Parquet files written by Spark only contain UUID in the format part-uuid.parquet. What I mean is, if the files produced by Gluten also only contain UUID, do you think there is a problem with that? Is it necessary to include task ID and driver ID information like we currently do?

JkSelf · 2024-11-19T02:49:28Z

@liujiayi771
I understand, thank you for your explanation. It's fine with me to remove the task ID and driver ID.

FelixYBW · 2024-11-19T21:37:02Z

Agreed, let's keep exactly the same Parquet file name as Spark. Customers may rely on it in their tools; who knows.

liujiayi771 · 2024-11-21T06:50:01Z

Closing this PR, as we have handled this on the Gluten side. Thanks, all.

liujiayi771 requested a review from majetideepak as a code owner November 17, 2024 04:57

facebook-github-bot added the CLA Signed label Nov 17, 2024

misc: Add compression format suffix to the written Parquet file

734c71d

liujiayi771 force-pushed the compression-suffix branch from 3b13b92 to 734c71d Compare November 17, 2024 05:01

liujiayi771 changed the title ~~Add compression format suffix to the written Parquet file~~ misc: Add compression format suffix to the written Parquet file Nov 17, 2024

liujiayi771 closed this Nov 21, 2024
