URL-encode partition field names in file locations #1457

Open · wants to merge 13 commits into main
Conversation

@smaheshwar-pltr (Author) commented Dec 20, 2024:

Closes #1458
Closes #175

Comment on lines 726 to 749
# Test that special characters are URL-encoded
(
[PartitionField(source_id=15, field_id=1001, transform=IdentityTransform(), name="special#string#field")],
["special string"],
Record(**{"special#string#field": "special string"}), # type: ignore
"special%23string%23field=special%20string",
f"""CREATE TABLE {identifier} (
`special#string#field` string
)
USING iceberg
PARTITIONED BY (
identity(`special#string#field`)
)
""",
f"""INSERT INTO {identifier}
VALUES
('special string')
""",
),
Author:
Oops. I thought this test would work but it fails, and I'm not sure why yet.

It fails on

    assert expected_hive_partition_path_slice in spark_path_for_justification

with

    E AssertionError: assert 'special%23string%23field=special%20string' in 's3://warehouse/default/test_table/data/special#string#field=special+string/00000-57-0756f620-2b2e-4ffa-97f6-625343525c9b-00001.parquet'

and it fails on

    assert spark_partition_for_justification == expected_partition_record

with

    E AssertionError: assert Record[specia...ecial string'] == Record[specia...ecial string']
    E - Record[special#string#field='special string']
    E + Record[special_x23string_x23field='special string']

Contributor:
special_x23string_x23field is related to #590

Author:
Thanks @kevinjqliu. And the first failure is because the Iceberg version on spark was before apache/iceberg#10329, so it's not URL-encoded (I think).

Given this, I've disabled justification with a message similar to other tests here where behaviour differs.

Author:
Do you think this is sufficient (given we're testing PartitionKey's to_path, it felt natural but I'm unsure)?

If not, happy to be pointed to somewhere where I can add an integration test similar to the one shown in the issue. Thanks!

Contributor:

apache/iceberg@795fea9 should be available starting in 1.6.0.

It looks like pyspark is using an older library version; can you add this change and see if the tests pass?
#1462

Contributor:

> Do you think this is sufficient (given we're testing PartitionKey's to_path, it felt natural but I'm unsure)?

Yeah, we're testing partition_to_path; maybe add a test in tests/table/test_partitioning.py, which is not part of the integration test suite.

The integration test is a nice-to-have, though. Let's see if upgrading the Iceberg library helps.

@smaheshwar-pltr (Author), Dec 21, 2024:

Thanks for this suggestion. Bumping made the path test fail because quote was being used instead of quote_plus for encoding (the Java implementation encodes spaces to +, which quote doesn't do).

For consistency, I've made the change to match the Java behaviour, but I can revert that if consistency isn't so important. What do you think?

A unit test sounds good (and an integration test for the justification checks would be great).
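The quote vs quote_plus distinction being discussed can be seen directly with the standard library's urllib.parse (the sample strings here are illustrative, not taken from the PR):

```python
from urllib.parse import quote, quote_plus

# quote percent-encodes a space as %20; quote_plus maps it to '+',
# matching Java's URLEncoder, which the Iceberg Java implementation uses.
assert quote("special string", safe="") == "special%20string"
assert quote_plus("special string", safe="") == "special+string"

# Because ' ' now maps to '+', a literal '+' must itself be escaped,
# so the encoding stays unambiguous.
assert quote_plus("my+str", safe="") == "my%2Bstr"
```

This is why the expected path slice used %20 while Spark (with a Java version that uses URLEncoder semantics) wrote a +.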

Contributor:

Nice! That's why we like integration tests :)
Let me merge #1462 first, and then we can rebase.

Author:

But the Record comparison still fails because of the non-optional sanitisation transformation described in apache/iceberg#10120. And, as it stands, the provided Record param is used to check key.partition, so it can't be changed, because that should be unsanitised, IIUC.

I think some test rewiring might be required: maybe providing a separate record param, defaulting to the other one, just for this justification check. But then I wonder if we're really just testing Spark behaviour.

@kevinjqliu (Contributor):
BTW #1462 is merged, could you rebase this PR?

spec_id=3,
)

record = Record(**{"my#str%bucket": "my+str", "other str+bucket": "( )", "my!int:bucket": 10}) # type: ignore
@smaheshwar-pltr (Author), Dec 23, 2024:
mypy complains here and elsewhere, but I think it's fine.


# Both partition field names and values should be URL-encoded, with spaces mapping to plus signs, to match the Java
# behaviour: https://github.com/apache/iceberg/blob/ca3db931b0f024f0412084751ac85dd4ef2da7e7/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L198-L204
assert spec.partition_to_path(record, schema) == "my%23str%25bucket=my%2Bstr/other+str%2Bbucket=%28+%29/my%21int%3Abucket=10"
@smaheshwar-pltr (Author), Dec 23, 2024:

Cross-checked with the Java implementation (integration tests will do this eventually), in particular with respect to ' ' and '+' encoding. It is consistent.

@smaheshwar-pltr (Author):

Done, @kevinjqliu. Fails due to #1457 (comment), but I'll think it over.

FYI, I'm away for a little bit now, so I'll pick this back up in the new year! (Feel free to take over if it's very urgent 😄)

@kevinjqliu (Contributor):

Thanks for the PR! I've dug into the test failure a bit. Here's what I found.

There's a subtle difference between PartitionKey.partition and DataFile.partition. In most cases these hold the same value, but for strings with special characters, DataFile.partition is sanitized while PartitionKey.partition is not:

- DataFile.partition is sanitized according to apache/iceberg#10120, to match the column value stored in the underlying parquet file.
- PartitionKey.partition uses the value from the PartitionSpec, which stores the un-sanitized value.

You can verify this by looking up the table partition spec:

    iceberg_table.metadata.spec()
    iceberg_table.metadata.specs()

The integration test assumes that PartitionKey.partition and DataFile.partition hold the same value. One possible solution is to sanitize the given Record before comparison. After spark_path_for_justification:

        # Special characters in partition value are sanitized when written to the data file's partition field
        # Use `make_compatible_name` to match the sanitize behavior
        sanitized_record = Record(**{make_compatible_name(k): v for k, v in vars(expected_partition_record).items()})
        assert spark_partition_for_justification == sanitized_record
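The sanitization above (e.g. special#string#field becoming special_x23string_x23field) can be sketched roughly as follows. This is an illustrative simplification, not PyIceberg's actual make_compatible_name, which also handles edge cases such as leading digits:

```python
def sketch_sanitize(name: str) -> str:
    """Rough sketch: keep ASCII letters, digits and '_'; replace any other
    character with '_x' followed by its hex code point."""
    out = []
    for ch in name:
        if ch == "_" or ("a" <= ch.lower() <= "z") or ch.isdigit():
            out.append(ch)
        else:
            out.append(f"_x{ord(ch):x}")
    return "".join(out)

# '#' is code point 0x23, matching the field name seen in the test failure.
assert sketch_sanitize("special#string#field") == "special_x23string_x23field"
```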

@smaheshwar-pltr (Author):

Thanks a lot for this explanation and suggestion, @kevinjqliu! It sounds good.

I had some time, so I've made this change and the tests pass, using make_compatible_name as a param that can be specified on a per-case basis (instead of porting all this logic into the tests). I also made it Optional so each subtest doesn't have to explicitly pass None for the identity transform.

@@ -234,9 +234,11 @@ def partition_to_path(self, data: Record, schema: Schema) -> str:
         partition_field = self.fields[pos]
         value_str = partition_field.transform.to_human_string(field_types[pos].field_type, value=data[pos])

-        value_str = quote(value_str, safe="")
+        value_str = quote_plus(value_str, safe="")
Contributor:

It defaults to utf-8, so that's good 👍

Comment on lines +240 to +241
field_str = quote_plus(partition_field.name, safe="")
field_strs.append(field_str)
@Fokko (Contributor), Dec 29, 2024:

Nit, I would just collapse these:

Suggested change:

-    field_str = quote_plus(partition_field.name, safe="")
-    field_strs.append(field_str)
+    field_strs.append(quote_plus(partition_field.name, safe=""))

@Fokko (Contributor) left a comment:

Thanks @smaheshwar-pltr for picking this up, and thanks to @kevinjqliu for the review. I'll leave this one open in case Kevin has any further comments.
