misc: Add compression format suffix to the written Parquet file #11563
3b13b92 to 734c71d
@majetideepak Could you help review this?
@liujiayi771 I have mixed opinions on this. I do see the benefit, but then the question becomes why not include other file details as well. I also do not see this for files generated by other tools. Ideally, we could make this pluggable.
@majetideepak I added this because the Parquet files written by Spark have compression format suffixes. However, I also tested the Parquet files written by Presto, which do not even have the .parquet suffix.
@liujiayi771 It makes sense that Spark users of Gluten+Velox would like to see Spark filenames and, similarly, that Presto users would prefer Presto filenames.
@liujiayi771 @majetideepak I think we need to make it more general than just Presto or Spark; there are many more engines inside Meta using Velox than these two. We can make this a global
@liujiayi771 Can we add it on the Gluten side before passing the name to Velox, so that Velox stays more general?
@FelixYBW Currently, the Parquet file name written by Gluten is:

    targetFileName = fmt::format(
        "{}_{}_{}_{}",
        connectorQueryCtx_->taskId(),
        connectorQueryCtx_->driverId(),
        connectorQueryCtx_->planNodeId(),
        makeUuid());

I found that Jimmy added a new targetFileName in LocationHandle, so we can specify a targetFileName that contains the compression kind suffix from the Gluten side. However, we will not be able to retain information such as task ID, driver ID, etc., because this information cannot be obtained in the planning phase. We can only generate file names similar to Spark's that contain only a UUID, for example, Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet. Do you think it is necessary to retain task ID, stage ID, and driver ID in the file name? cc @zhztheplayer, @rui-mo, @JkSelf.
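To make the proposed naming concrete, here is a minimal sketch of how a UUID-only name with a compression suffix could be assembled. The `Codec` enum, `codecSuffix`, and `makeTargetFileName` are hypothetical names for illustration only; they are not Velox or Gluten APIs, and the suffix strings are assumed to follow the Spark-style names mentioned in this thread.

```cpp
#include <string>

// Hypothetical codec enum for illustration; not Velox's actual compression type.
enum class Codec { kNone, kSnappy, kZstd, kGzip };

// Returns an assumed Spark-style suffix segment such as ".zstd".
// Uncompressed files get no extra segment.
inline std::string codecSuffix(Codec codec) {
  switch (codec) {
    case Codec::kSnappy:
      return ".snappy";
    case Codec::kZstd:
      return ".zstd";
    case Codec::kGzip:
      return ".gz";
    case Codec::kNone:
      return "";
  }
  return "";
}

// Builds "<prefix>-<uuid><codec suffix>.parquet". The UUID is passed in
// because, as noted above, task and driver IDs are unavailable at planning time.
inline std::string makeTargetFileName(
    const std::string& prefix,
    const std::string& uuid,
    Codec codec) {
  return prefix + "-" + uuid + codecSuffix(codec) + ".parquet";
}
```

With the UUID from the example above, `makeTargetFileName("Gluten", "0946dfb5-f773-42c9-ac8e-d4e70bede02b", Codec::kZstd)` yields `Gluten-0946dfb5-f773-42c9-ac8e-d4e70bede02b.zstd.parquet`.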
@liujiayi771
@JkSelf The Parquet files written by Spark contain only a UUID, in the format part-uuid.parquet. What I mean is, if the files produced by Gluten also contain only a UUID, do you think there is a problem with that? Is it necessary to include task ID and driver ID information like we currently do?
@liujiayi771
Agreed, let's keep exactly the same Parquet file name as Spark. Customers may rely on it in their tools; who knows.
Closing this PR, as we have handled this on the Gluten side. Thanks, all.
Add a compression format suffix to the written Parquet file, making it easier to determine the file's compression format.
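As an illustration of the benefit described here, the following sketch reads the compression format back from a file name without opening the file. `compressionFromFileName` is a hypothetical helper, and the list of recognized suffix segments is an assumption chosen for the example rather than an exhaustive or authoritative set.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the codec segment of a name such as
// "part-<uuid>.zstd.parquet", or std::nullopt when the name has no
// ".parquet" extension or no recognized codec segment before it.
inline std::optional<std::string> compressionFromFileName(
    std::string_view name) {
  constexpr std::string_view kExt = ".parquet";
  // Assumed set of suffix segments; extend as needed.
  constexpr std::string_view kKnown[] = {"snappy", "zstd", "gz"};

  if (name.size() < kExt.size() ||
      name.substr(name.size() - kExt.size()) != kExt) {
    return std::nullopt;
  }
  name.remove_suffix(kExt.size());

  // The codec, if present, is the last dot-separated segment.
  const auto dot = name.rfind('.');
  if (dot == std::string_view::npos) {
    return std::nullopt;
  }
  const std::string_view segment = name.substr(dot + 1);
  for (const auto known : kKnown) {
    if (segment == known) {
      return std::string(segment);
    }
  }
  return std::nullopt;
}
```

A name without a suffix segment, such as `part-uuid.parquet`, yields `std::nullopt`, in which case a reader would still need to consult the Parquet metadata to learn the codec.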