Segmentation fault in wait container during Artifact upload #13248
Related Slack conversation: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1719237267885449

https://github.com/argoproj/pkg/blob/235a5432ec982969e2e1987e66458b5a44c2ee6f/s3/s3.go#L245 - @ljyanesm, is there a possibility that some of the artifacts or the directories containing them have unusual permissions or contents, or are being modified whilst upload is being attempted?
Attempt to fix argoproj/argo-workflows#13248 `fi` is null in the stack trace, and `err` isn't being checked here. So check `err` and don't attempt to continue. Signed-off-by: Alan Clucas <[email protected]>
Thanks for adding this check. I am keen, if possible, on running a version of ArgoWF with only these changes in place. Do you have any advice for doing so? I've been having a careful look through the workflows and have found:

To test this you'd need to build a custom argoexec image. Having checked out the argo-workflows code:

go get github.com/argoproj/pkg@s3-err-check
make argoexec-image

You'll then need to push this image to somewhere that your cluster can pull from and set up your workflow controller to use it with

I suggest we chat in slack if you're having problems with this.
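The controller-side setting the comment refers to is truncated above; as a sketch, the executor image can be overridden via the workflow-controller-configmap. The registry and tag below are placeholders, not values from this thread:

```yaml
# Illustrative workflow-controller-configmap fragment pointing the
# controller at a custom-built argoexec image (names are assumptions).
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  executor: |
    image: myregistry.example.com/argoexec:s3-err-check
```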
@Joibel we usually upload a test image somewhere (e.g. personal DockerHub) so folks can run a test
We have made some changes to the workflows where tasks are fully independent. The error was most likely related to some delete operations on the path that was being uploaded as an artifact. This was corrected by moving these files to a different location only available to the pod running the task. @Joibel, @agilgur5,
(PR to update
It's a repo SHA of sorts; I forget how Go modules do it exactly off the top of my head right now, but hopefully that's enough for you to take it from there 😅
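For reference on the "repo SHA of sorts": `go get module@branch` resolves the branch head to a Go pseudo-version, which embeds a commit timestamp and a 12-character SHA prefix in the version string. A hypothetical go.mod fragment (the base version and timestamp are illustrative; the SHA prefix matches the commit linked earlier in this thread):

```
require github.com/argoproj/pkg v0.13.8-0.20240627153400-235a5432ec98
```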
Pre-requisites

- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened/what did you expect to happen?
One of the workflow tasks failed with the following stacktrace:
The expected behaviour was for the Pod to complete successfully and all the artifacts to be deposited correctly.
We have not tested using :latest, as this issue happens in about 1 in 1000 pods within our production environment and it is not reproducible.

Version

v3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
The issue has only been observed infrequently; I do not have a reproducible example.
Logs from the workflow controller
Logs from your workflow's wait container