-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38364: [Python] Initialize S3 on first use #38375
Conversation
|
[ RUN ] TestS3FS.GetFileInfoGeneratorStress
API: DeleteMultipleObjects(bucket=stress, multiObject=true, numberOfObjects=1000)
Time: 12:20:20 UTC 10/20/2023
DeploymentID: 8b5b1ae6-66cf-4ead-a4fd-3ef2b430d16c
RequestID: 178FD01F921FEEC8
RemoteHost: 127.0.0.1
Host: 127.0.0.8:51944
UserAgent: aws-sdk-cpp/1.11.68 Windows/10.0.17763.4252 AMD64 MSVC/1929
Error: remove C:\Users\appveyor\AppData\Local\Temp\1\s3fs-test-30c6atjk/.minio.sys/buckets/stress/1/230/fs.json: The process cannot access the file because it is being used by another process. (*fs.PathError)
6: internal\logger\logger.go:278:logger.LogIf()
5: cmd\fs-v1-helpers.go:430:cmd.fsDeleteFile()
4: cmd\fs-v1.go:1210:cmd.(*FSObjects).DeleteObject()
3: cmd\fs-v1.go:1142:cmd.(*FSObjects).DeleteObjects()
2: cmd\bucket-handlers.go:592:cmd.objectAPIHandlers.DeleteMultipleObjectsHandler()
1: net\http\server.go:2047:http.HandlerFunc.ServeHTTP()
API: DeleteMultipleObjects(bucket=stress, multiObject=true, numberOfObjects=1000)
Time: 12:20:21 UTC 10/20/2023
DeploymentID: 8b5b1ae6-66cf-4ead-a4fd-3ef2b430d16c
RequestID: 178FD01F921FEEC8
RemoteHost: 127.0.0.1
Host: 127.0.0.8:51944
UserAgent: aws-sdk-cpp/1.11.68 Windows/10.0.17763.4252 AMD64 MSVC/1929
Error: remove C:\Users\appveyor\AppData\Local\Temp\1\s3fs-test-30c6atjk/.minio.sys/buckets/stress/1/230/fs.json: The process cannot access the file because it is being used by another process. (*fs.PathError)
5: internal\logger\logger.go:278:logger.LogIf()
4: cmd\api-errors.go:2123:cmd.toAPIErrorCode()
3: cmd\api-errors.go:2148:cmd.toAPIError()
2: cmd\bucket-handlers.go:620:cmd.objectAPIHandlers.DeleteMultipleObjectsHandler()
1: net\http\server.go:2047:http.HandlerFunc.ServeHTTP()
[FATAL] 2023-10-20 12:19:08.379 WinHttpSyncHttpClient [2536] Failed setting secure crypto protocols with error code: 87
[FATAL] 2023-10-20 12:19:08.950 WinHttpSyncHttpClient [2536] Failed setting secure crypto protocols with error code: 87
C:/projects/arrow/cpp/src/arrow/filesystem/s3fs_test.cc(865): error: Failed
'fs_->DeleteDirContents("stress")' failed with IOError: Got the following 1 errors when deleting objects in S3 bucket 'stress':
- key '1/230': We encountered an internal error, please try again.: cause(remove C:\Users\appveyor\AppData\Local\Temp\1\s3fs-test-30c6atjk/.minio.sys/buckets/stress/1/230/fs.json: The process cannot access the file because it is being used by another process.)
[ FAILED ] TestS3FS.GetFileInfoGeneratorStress (72879 ms)
EDIT: This seems to have been a one time fluke. |
5c5f30a
to
4307861
Compare
Initializing S3 on import may cause undesired consequences for users who do not use S3: - Longer import times; - Consumption of unnecessary resources, e.g., AWS event loop thread(s); - Potential exposure to bugs in S3 package dependencies. Therefore, a more appropriate way to handle S3 initialization seems to be to move it to its first use.
4307861
to
220745f
Compare
(cc @jorisvandenbossche) We are very interested in feedback on this one. The aws-sdk-cpp bug is causing RAPIDS a lot of pain, and so we are very motivated to avoid the import whenever possible. |
python/pyarrow/_s3fs.pyx
Outdated
ensure_s3_initialized() | ||
|
||
self._initialize_s3(access_key=access_key, secret_key=secret_key, session_token=session_token, | ||
anonymous=anonymous, region=region, request_timeout=request_timeout, | ||
connect_timeout=connect_timeout, scheme=scheme, endpoint_override=endpoint_override, | ||
background_writes=background_writes, default_metadata=default_metadata, | ||
role_arn=role_arn, session_name=session_name, external_id=external_id, | ||
load_frequency=load_frequency, proxy_options=proxy_options, | ||
allow_bucket_creation=allow_bucket_creation, allow_bucket_deletion=allow_bucket_deletion, | ||
retry_strategy=retry_strategy) | ||
|
||
def _initialize_s3(self, *, access_key=None, secret_key=None, session_token=None, | ||
bint anonymous=False, region=None, request_timeout=None, | ||
connect_timeout=None, scheme=None, endpoint_override=None, | ||
bint background_writes=True, default_metadata=None, | ||
role_arn=None, session_name=None, external_id=None, | ||
load_frequency=900, proxy_options=None, | ||
allow_bucket_creation=False, allow_bucket_deletion=False, | ||
retry_strategy: S3RetryStrategy = AwsStandardS3RetryStrategy(max_attempts=3)): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this?
It seems that __init__()
already calls ensure_s3_initialized()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I had to move the implementation to the _initialize_s3
method are the cdef
s in https://github.com/apache/arrow/pull/38375/files#diff-afa3ea99a387be221ef1f7230aa309b42001aed318cdc6969e700d5eb04d07b2R294-R296, they will instantiate objects that expect S3 to be already initialized (which is only guaranteed after ensure_s3_initialized()
is called) and Cython will instantiate them before any code in the function runs, including ensure_s3_initialized()
.
Another alternative would be to make them unique_ptr
s or something that won't instantiate the objects immediately at __init__()
's entry, but I think this would make the code more complex than the solution currently proposed. If a pointer is preferred here without deferring to another method I can work on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The given cdef's are std::optional
and std::shared_ptr
, so I don't understand the concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed they are, thanks for calling my attention to that. Now I see what happened, I initially tested this in the 12.0.1
tag where CS3Options was not a std::optional
and that was causing the CS3Options
default constructor to be called before ensure_s3_initialized()
. I can confirm it works as expected in main
without that, and fixed it now via 58097e9 .
@github-actions crossbow submit -g python -g wheel |
Revision: 58097e9 Submitted crossbow builds: ursacomputing/crossbow @ actions-06c6a57213 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @pentschev
@pentschev You're welcome! |
Thanks all! This is a big help for RAPIDS. |
Thanks all! 🙏 |
@pentschev No problem! Could you remove the newly added arrow/python/pyarrow/_s3fs.pyx Lines 272 to 280 in 9e2d4aa
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7751444. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change In accordance to apache#38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first-use. ### What changes are included in this PR? - Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`; - Move `ensure_s3_intialized()` calls to `python/pyarrow/_s3fs.pyx` module; - Add global flag to mark whether S3 has been previously initialized and `atexit` handlers registered. ### Are these changes tested? Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception. ### Are there any user-facing changes? No, the behavior is now slightly different with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone. **This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shutdown, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3 to operate, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown. * Closes: apache#38364 Lead-authored-by: Peter Andreas Entschev <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change #38375 introduced duplicate calls to `ensure_s3_initialized()`. ### What changes are included in this PR? Deduplicates call to `ensure_s3_initialized()`. ### Are these changes tested? Yes, covered by existing S3 tests. ### Are there any user-facing changes? No. Authored-by: Peter Andreas Entschev <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
FYI we are in the process of releasing pyarrow 14.0. This PR was just too late for being included, but I did label it as "backport-candidate" in case we need to make a new release candidate or would do a bug-fix release. |
Thanks @jorisvandenbossche for considering that, we indeed saw that pyarrow 14.0 release is in-process, so we opted for patching conda-forge packages which is probably sufficient to overcome the issues we've been experiencing, we'll probably need a couple more days to confirm that but it seems like a promising solution with minimal trouble for Arrow maintainers. 🙂 @bdice @vyasr @jakirkham please feel free to correct me if needed here. |
Yes, thank you @jorisvandenbossche. I agree with @pentschev that our use case for pyarrow will be satisfied without a backport to 14.x, as long as we have the patch in the conda-forge packages for Arrow 13 which was just merged. We'll want the same patch to carry forward to Arrow 14 on conda-forge, until Arrow 15 is released with the fix upstreamed.👍 However, I see this fixed an issue reported by someone else, so a backport could be very welcome by other users: rstudio/reticulate#1470 (comment) |
I agree with the above. RAPIDS should be fine with the conda-forge patch and the Arrow 15 bugfix, but it could help other users to get it out sooner. |
### Rationale for this change In accordance to apache#38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first-use. ### What changes are included in this PR? - Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`; - Move `ensure_s3_intialized()` calls to `python/pyarrow/_s3fs.pyx` module; - Add global flag to mark whether S3 has been previously initialized and `atexit` handlers registered. ### Are these changes tested? Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception. ### Are there any user-facing changes? No, the behavior is now slightly different with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone. **This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shutdown, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3 to operate, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown. * Closes: apache#38364 Lead-authored-by: Peter Andreas Entschev <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
) ### Rationale for this change apache#38375 introduced duplicate calls to `ensure_s3_initialized()`. ### What changes are included in this PR? Deduplicates call to `ensure_s3_initialized()`. ### Are these changes tested? Yes, covered by existing S3 tests. ### Are there any user-facing changes? No. Authored-by: Peter Andreas Entschev <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change In accordance to #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first-use. ### What changes are included in this PR? - Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`; - Move `ensure_s3_intialized()` calls to `python/pyarrow/_s3fs.pyx` module; - Add global flag to mark whether S3 has been previously initialized and `atexit` handlers registered. ### Are these changes tested? Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception. ### Are there any user-facing changes? No, the behavior is now slightly different with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone. **This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shutdown, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3 to operate, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown. * Closes: #38364 Lead-authored-by: Peter Andreas Entschev <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change In accordance to apache#38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first-use. ### What changes are included in this PR? - Remove calls to `ensure_s3_initialized()` that were up until now executed during `import pyarrow.fs`; - Move `ensure_s3_intialized()` calls to `python/pyarrow/_s3fs.pyx` module; - Add global flag to mark whether S3 has been previously initialized and `atexit` handlers registered. ### Are these changes tested? Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception. ### Are there any user-facing changes? No, the behavior is now slightly different with S3 initialization not happening immediately after `pyarrow.fs` is imported, but no changes are expected from a user perspective relying on the public API alone. **This PR contains a "Critical Fix".** A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shutdown, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3 to operate, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown. * Closes: apache#38364 Lead-authored-by: Peter Andreas Entschev <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
) ### Rationale for this change apache#38375 introduced duplicate calls to `ensure_s3_initialized()`. ### What changes are included in this PR? Deduplicates call to `ensure_s3_initialized()`. ### Are these changes tested? Yes, covered by existing S3 tests. ### Are there any user-facing changes? No. Authored-by: Peter Andreas Entschev <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
In accordance to #38364, we believe that for various reasons (shortening import time, preventing unnecessary resource consumption and potential bugs with S3 library) it is appropriate to avoid initialization of S3 resources at import time and move that step to occur at first-use.
What changes are included in this PR?
ensure_s3_initialized()
that were up until now executed duringimport pyarrow.fs
;ensure_s3_intialized()
calls topython/pyarrow/_s3fs.pyx
module;atexit
handlers registered.Are these changes tested?
Yes, existing S3 tests check whether it has been initialized, otherwise failing with a C++ exception.
Are there any user-facing changes?
No, the behavior is now slightly different with S3 initialization not happening immediately after
pyarrow.fs
is imported, but no changes are expected from a user perspective relying on the public API alone.This PR contains a "Critical Fix".
A bug in aws-sdk-cpp reported in aws/aws-sdk-cpp#2681 causes segmentation faults under specific circumstances when Python processes shutdown, specifically observed with Dask+GPUs (so far we were unable to pinpoint the exact correlation of Dask+GPUs+S3). While this definitely doesn't seem to affect all users and is not directly sourced in Arrow, it may affect use cases that are completely independent of S3 to operate, which is particularly problematic in CI where all tests pass successfully but the process crashes at shutdown.