Fix read from multiple s3 regions #1453
base: main
Conversation
Hey @kevinjqliu, I hope you are ready for the Christmas time :-) After some investigation, I noticed the
Would be keen to get some quick feedback, and will add more unit tests if this sounds like a fix on the correct track. Thanks!
@jiakai-li Thanks for working on this! And happy holidays :)
looking through the usage for
I think that's one of the problems we need to tackle. The current S3 configuration requires a specific "region" to be set. This assumes that all data and metadata files are in the same region as the one specified. But what if I have some files in one region and some in another? I think a potential solution might be to omit the "region" property and allow the S3FileSystem to determine the proper region using
Another potential issue is the way we cache filesystems: it assumes that there's only one fs per scheme. With the region approach above, we break this assumption.
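As a rough sketch of that idea (the names and the lookup table here are hypothetical; the dictionary stands in for the network call that `pyarrow.fs.resolve_s3_region` would make), the filesystem cache could be keyed on the bucket as well, so each bucket gets a filesystem configured with its own resolved region:

```python
from functools import lru_cache

# Stub for the per-bucket region lookup; the real fix would call
# pyarrow.fs.resolve_s3_region(bucket), which issues a request to S3.
_REGION_BY_BUCKET = {"warehouse-us": "us-east-1", "warehouse-eu": "eu-west-1"}

def resolve_region(bucket: str) -> str:
    return _REGION_BY_BUCKET[bucket]

@lru_cache(maxsize=None)
def fs_for(scheme: str, bucket: str) -> tuple:
    # The real code would construct an S3FileSystem(region=...) here; returning
    # a tuple just shows that the cache key now varies per bucket, not per scheme.
    return (scheme, resolve_region(bucket))
```

With this shape, two buckets in different regions no longer collide on the same cached filesystem.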
BTW there's a similar issue in #1041
Thank you @kevinjqliu, just trying to clear my head a little bit.
Is the change I made in accordance with this option? What I've done essentially is using the
Please correct me if I'm missing something about how the fs cache works. But here is my understanding: I see we use
I think solving the
Can I tackle this issue as well if no one is working on it?
I don't think
iceberg-python/pyiceberg/io/pyarrow.py, lines 434 to 436 in dbcf65b
and running an example S3 URI:
In order to support multiple regions, we might need to call
BTW, a good test scenario could be a table where my metadata files are stored in one bucket while my data files are stored in another. We might be able to construct this test case by modifying the
I don't think anyone's working on it right now, feel free to pick it up.
Thank you @kevinjqliu, can I have some more guidance on this please?
I did some searching, and it seems that for the s3 scheme, the format is
In the below example, I would expect 'a' to be
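For reference, Python's own `urlparse` agrees with that reading of an s3 URI: the netloc component is the bucket name and the path is the object key (the URI below is just an illustration):

```python
from urllib.parse import urlparse

parsed = urlparse("s3://a/b/c.parquet")
print(parsed.netloc)  # 'a' — the bucket name
print(parsed.path)    # '/b/c.parquet' — the object key, with a leading slash
```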
Yep, I tested the change using a similar scenario locally with my own handcrafted s3 files. But will add more proper test cases as I make more progress. Thanks again!
Ah yes, you're right. Sorry for the confusion, I was thinking of something else.
BTW there are 2 FileIO implementations, one for pyarrow and another for fsspec. We might want to do the same for fsspec:
iceberg-python/pyiceberg/io/fsspec.py, lines 133 to 141 in dbcf65b
Sweet, I'll go ahead with this approach then. Thanks very much @kevinjqliu!
Hi @kevinjqliu, for the above concern, I tested it locally and also did some investigation. According to what I found here, it seems fsspec doesn't have the same issue as pyarrow. So I guess we can leave it?
Wow, that's interesting, I didn't know about that. I like that solution :) Hopefully pyarrow fs will have this feature one day.
Some more comments, thanks for working on this!
Co-authored-by: Kevin Liu <[email protected]>
This PR is ready for review now. Thanks very much and merry Christmas! Please let me know if any further change is required.
@@ -362,6 +362,12 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste
                "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
            }

            # Override the default s3.region if netloc (bucket) resolves to a different region
            try:
                client_kwargs["region"] = resolve_s3_region(netloc)
What about doing this lookup only when the region is not provided explicitly? I think this will do another call to S3.
Thank you Fokko, my understanding is that the problem occurs when the provided region doesn't match the data file's bucket region, and that will fail the file read for pyarrow. By overwriting the provided region with the resolved bucket region (falling back to the provided region if resolution fails), we make sure the real region where a data file is stored takes precedence. (This function is cached when using fs_by_scheme, so it will be called only for new buckets that haven't been resolved previously, to save calls to S3.)
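A small self-contained demo of that caching claim (the resolver here is a fake standing in for the S3 call, and the function names are illustrative): with `lru_cache` on the scheme/netloc pair, the resolver fires only once per distinct bucket.

```python
from functools import lru_cache

calls = []  # records how often the "S3" lookup actually runs

def fake_resolve_region(bucket: str) -> str:
    calls.append(bucket)
    return "us-east-1"

@lru_cache(maxsize=None)
def fs_by_scheme(scheme: str, netloc: str) -> tuple:
    # Resolution only happens on a cache miss, i.e. the first time
    # a given (scheme, netloc) pair is seen.
    return (scheme, fake_resolve_region(netloc))

fs_by_scheme("s3", "bucket-a")
fs_by_scheme("s3", "bucket-a")  # cache hit: no new lookup
fs_by_scheme("s3", "bucket-b")  # new bucket: one more lookup
```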
pyiceberg/io/pyarrow.py (Outdated)

            try:
                client_kwargs["region"] = resolve_s3_region(netloc)
            except (OSError, TypeError):
                pass
Should we emit a warning here?
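One way the suggested warning could look (a sketch, not the PR's actual code; `pick_region`, `resolver`, and the region names are placeholders): warn on resolution failure instead of silently passing, then fall back to the configured region.

```python
import warnings

def pick_region(netloc, configured_region, resolver):
    # Prefer the bucket's actual region; warn and fall back to the
    # configured region if resolution fails (e.g. no network, bad netloc).
    try:
        return resolver(netloc)
    except (OSError, TypeError) as exc:
        warnings.warn(f"Unable to resolve region for bucket {netloc!r}: {exc}")
        return configured_region

def failing_resolver(bucket):
    raise OSError("bucket not found")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    region = pick_region("my-bucket", "us-east-1", failing_resolver)
```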
This PR closes: