[Bug] Cannot use PyIceberg with multiple FS #1041
Comments
This is a good point. I've heard that folks store their metadata on HDFS and the data itself on S3. I don't think the example with the add-files is the best; it would be better to support the …
Oh interesting, thanks! I also found an old issue referencing this: #161. Another example usage is combining a local FS and S3 (MinIO), which might be easier to set up and test against.
I will have a look at this issue.
A generic question: why are we implementing a custom …
(`iceberg-python/pyiceberg/io/__init__.py`, lines 320 to 329 in dc6d242)
(`iceberg-python/pyiceberg/io/__init__.py`, lines 290 to 300 in dc6d242)
instead of using …? (On a side note: this looks a bit confusing to me, as why for … Also, I am not sure why pyarrow is not using fsspec as its IO layer but implements things on its own.) EDIT: …
Read my comment here for the cause of the issue. I don't think fixing … Let me know what you think, then we can come up with a way to properly address this. 😄
Thanks for taking a look at this @TiansuYu
I think custom scheme parsing avoids picking one library over another (…).
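To illustrate the idea, here is a minimal, hypothetical sketch of scheme-based dispatch: the URI scheme alone decides which IO backend handles a path, so neither fsspec nor pyarrow is privileged. The mapping and function names are illustrative, not PyIceberg's real registry (which lives in `pyiceberg/io/__init__.py`).

```python
# Hypothetical sketch of scheme-based filesystem dispatch; the mapping
# below is illustrative, not PyIceberg's actual registry.
from urllib.parse import urlparse

SCHEME_TO_BACKEND = {
    "": "local",       # bare paths like /tmp/warehouse
    "file": "local",
    "s3": "s3",
    "s3a": "s3",
    "hdfs": "hdfs",
}

def resolve_backend(location: str) -> str:
    """Pick an IO backend from the URI scheme of *location*."""
    scheme = urlparse(location).scheme
    try:
        return SCHEME_TO_BACKEND[scheme]
    except KeyError:
        raise ValueError(f"No registered backend for scheme: {scheme!r}")
```

With a mapping like this, `s3://bucket/file.parquet` resolves to the S3 backend while `/tmp/warehouse/v1.metadata.json` resolves to the local one, independently of which library backs each.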
Good catch:
(`iceberg-python/pyiceberg/io/fsspec.py`, lines 161 to 176 in dc6d242)
(`iceberg-python/pyiceberg/io/pyarrow.py`, lines 348 to 403 in dc6d242)
Yeah, the main issue is the assumption that the same … (`iceberg-python/pyiceberg/table/__init__.py`, lines 655 to 657 in dc6d242). Instead, we would want to recreate … Here's another example of passing in the … (`iceberg-python/pyiceberg/table/__init__.py`, lines 530 to 532 in dc6d242).
Generally, this problem should go away if we re-evaluate …
@kevinjqliu I think resolving the fs at file level should make the API cleaner. We can, e.g., if no … I would say one benefit of setting the fs at table level is reusing that fs instance for a performance boost. If we want to keep this, I would say we need two IO configs, one for metadata and one for data, on the …
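The performance concern above can be reconciled with per-file resolution by memoizing filesystem construction. A hedged sketch (the function names are made up for illustration; PyIceberg's real code differs): key the cache on `(scheme, netloc)` so repeated lookups for files in the same bucket reuse one instance.

```python
# Sketch: per-file filesystem resolution with memoization, so resolving
# per file keeps the reuse benefit of a table-level fs instance.
# Function names are illustrative, not PyIceberg's actual API.
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=None)
def filesystem_for(scheme: str, netloc: str) -> dict:
    # A real implementation would construct an S3/HDFS/local client here;
    # a tagged dict keeps the sketch runnable without cloud credentials.
    return {"kind": scheme or "local", "host": netloc}

def filesystem_for_path(path: str) -> dict:
    """Resolve (and cache) a filesystem from the file's own URI."""
    parsed = urlparse(path)
    return filesystem_for(parsed.scheme, parsed.netloc)
```

Two data files in the same S3 bucket then share one cached filesystem object, while a local metadata path gets its own, so no table-level assumption is needed.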
My preference is resolving …
I will make a PR according to this:
Also, reading on here: there might be some opportunity to simplify the split between the arrow and fsspec file systems.
yep! There are definitely opportunities to consolidate the two. I opened #310 with some details.
Reading the table spec, I just realised that there is a field …
It's configurable via the write properties. See this comment: #1041 (comment)
Any updates on this issue? I'm facing a similar issue when creating a table on S3 as well.
Hey guys, I can pick this up together with #1279 if no one is currently working on it.
assigned to you @jiakai-li |
@kevinjqliu I guess we can close this issue and #1279 now? In the meantime, I'm keen to work on the …
@jiakai-li we can close this issue! Fixed by #1453
Let's open a new issue for those: #1492
Apache Iceberg version
main (development)
Please describe the bug 🐞
PyIceberg assumes the same FS implementation is used for reading both metadata and data.
However, I want to use a catalog with local FS as the warehouse while referencing S3 files as data.
See this example Jupyter notebook to reproduce
Problem
The `fs` implementation is determined by the metadata location and is then passed down to the function that reads the data file (`iceberg-python/pyiceberg/io/pyarrow.py`, lines 1428 to 1430 in d8b5c17).
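The failure mode can be demonstrated without any cloud setup. In this sketch (paths are made up for illustration), the filesystem is chosen once from the metadata location's scheme and then reused for a data file on a different scheme, which is exactly the mismatch described above.

```python
# Sketch of the failure mode: the filesystem is chosen from the metadata
# location, then reused for data files that live on a different scheme.
from urllib.parse import urlparse

def scheme_of(path: str) -> str:
    """URI scheme of a path; bare local paths count as 'file'."""
    return urlparse(path).scheme or "file"

# Illustrative locations: local-FS warehouse, S3-hosted data file.
metadata_location = "file:///tmp/warehouse/db/tbl/metadata/v1.metadata.json"
data_file = "s3://bucket/data/00000.parquet"

fs_scheme = scheme_of(metadata_location)   # resolved once, from metadata
# A local filesystem cannot open "s3://…" paths, so the data read fails.
mismatch = fs_scheme != scheme_of(data_file)
```

Here `fs_scheme` is `"file"` while the data file needs `"s3"`, so any read that reuses the metadata-derived filesystem cannot succeed.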
Possible solution
Determine the `fs` implementation based on the file path of the current file.
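A minimal sketch of that direction, with hypothetical names (not PyIceberg's actual API): each read resolves its own filesystem from the file's URI scheme at the point of reading, instead of inheriting the one resolved for the table's metadata.

```python
# Illustrative sketch of the proposed fix: every read derives the
# filesystem from the file's own path. Names here are hypothetical.
from urllib.parse import urlparse

def fs_scheme_for(path: str) -> str:
    """Scheme that should select the filesystem for this particular file."""
    return urlparse(path).scheme or "file"

def read_file(path: str) -> str:
    # A real implementation would dispatch to an S3/HDFS/local client;
    # here we just report which filesystem would be used.
    return f"reading {path} via {fs_scheme_for(path)} filesystem"
```

With this shape, a table whose metadata lives on a local warehouse can still read `s3://…` data files, because nothing about the metadata location leaks into data-file reads.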