
Segmentation fault in wait container during Artifact upload #13248

Open
3 of 4 tasks
ljyanesm opened this issue Jun 26, 2024 · 8 comments · Fixed by argoproj/pkg#689 · May be fixed by #13867
Labels
area/artifacts (S3/GCP/OSS/Git/HDFS etc), area/executor, P3 (Low priority), type/bug

Comments


ljyanesm commented Jun 26, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

One of the workflow tasks failed with the following stacktrace:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1e3455f]
goroutine 389 [running]:
github.com/argoproj/pkg/s3.generatePutTasks.func1.1({0xc000df0400, 0x72}, {0x0, 0x0}, {0xc0002b3fa0?, 0xffffffffffffff9c?})
	/go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:245 +0xdf
path/filepath.walk({0xc00108dd50, 0x63}, {0x37f0248, 0xc000dcf380}, 0xc0002b3fa0)
	/usr/local/go/src/path/filepath/path.go:497 +0x1d4
path/filepath.walk({0xc0009e95c0, 0x58}, {0x37f0248, 0xc000dcf2b0}, 0xc0002b3fa0)
	/usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.walk({0xc0009e94a0, 0x52}, {0x37f0248, 0xc000dcf1e0}, 0xc0002b3fa0)
	/usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.walk({0xc000eeb5e0, 0x42}, {0x37f0248, 0xc000c19ba0}, 0xc0002b3fa0)
	/usr/local/go/src/path/filepath/path.go:501 +0x257
path/filepath.Walk({0xc000eeb5e0, 0x42}, 0xc0004dd7a0)
	/usr/local/go/src/path/filepath/path.go:572 +0x66
github.com/argoproj/pkg/s3.generatePutTasks.func1()
	/go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:243 +0x68
created by github.com/argoproj/pkg/s3.generatePutTasks in goroutine 1
	/go/pkg/mod/github.com/argoproj/pkg@v0.13.7-0.20230901113346-235a5432ec98/s3/s3.go:242 +0xe5

The expected behaviour was for the Pod to complete successfully and all the artifacts to be deposited correctly.

We have not tested using :latest, as this issue happens in about 1 in 1000 pods within our production environment and it is not reproducible.

Version

v3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The issue has only been observed infrequently; I do not have a reproducible example.

Logs from the workflow controller

N/A

Logs from your workflow's wait container

The only relevant logs are shared in the summary.
Joibel (Member) commented Jun 26, 2024

Related Slack conversation: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1719237267885449

Joibel (Member) commented Jun 26, 2024

https://github.com/argoproj/pkg/blob/235a5432ec982969e2e1987e66458b5a44c2ee6f/s3/s3.go#L245 - fi is being used without checking err. This should be fixed in pkg so it reports err rather than crashing.
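For illustration, here is a minimal, self-contained sketch of the pattern in question, not the actual argoproj/pkg code: filepath.Walk hands its callback a nil FileInfo together with a non-nil error when lstat on an entry fails, so the error has to be checked before the FileInfo is touched.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// collectFiles walks rootDir and gathers file paths, roughly the shape of
// generatePutTasks in the stack trace above. The important part is the err
// check at the top of the callback: when os.Lstat fails on an entry (for
// example because it was deleted mid-walk), filepath.Walk passes a nil
// FileInfo along with the error, and dereferencing fi at that point is the
// panic seen at s3.go:245.
func collectFiles(rootDir string) ([]string, error) {
	var files []string
	err := filepath.Walk(rootDir, func(path string, fi os.FileInfo, err error) error {
		if err != nil {
			// fi may be nil here; report the error instead of continuing.
			return err
		}
		if fi.IsDir() {
			return nil
		}
		files = append(files, path)
		return nil
	})
	return files, err
}

func main() {
	files, err := collectFiles(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Println(len(files), "files found")
}
```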

@ljyanesm, is there a possibility that some of the artifacts, or the directories containing them, have unusual permissions or contents, or are being modified whilst the upload is being attempted?

Joibel added a commit to argoproj/pkg that referenced this issue Jun 26, 2024
Attempt to fix argoproj/argo-workflows#13248

`fi` is null in the stack trace, and `err` isn't being checked
here. So check `err` and don't attempt to continue.

Signed-off-by: Alan Clucas <[email protected]>
ljyanesm (Author) commented:

Thanks for adding this check. If possible, I'm keen to run a version of Argo Workflows with only these changes in place. Do you have any advice for doing so?

I've been having a careful look through the workflows and have found:

  • This happens in about 1 in 1000 pods, always for the same type of task, suggesting that this task is implicated somehow.
  • I'll need a bit of time to check exactly what is happening within the task with regard to your questions. Permissions are not unusual, but it may be that the contents of the directory being uploaded are changing while the upload is in progress (see the sketch below).
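To make that suspicion concrete, here is a hedged, self-contained sketch (not our actual workload) that walks a directory while another goroutine deletes entries from it. Because filepath.Walk reads the directory listing before lstat-ing each entry, a concurrent delete can make the callback fire with a nil FileInfo and a non-nil error, which is the case the unpatched code did not guard. Being a race, any given run may or may not hit it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// A timing-dependent demo: walk a directory while another goroutine deletes
// its entries. When a file vanishes between the directory read and the lstat
// that filepath.Walk performs, the callback receives a nil FileInfo together
// with a non-nil error, which is exactly what the unpatched s3.go:245 dereferenced.
func main() {
	dir, err := os.MkdirTemp("", "walk-race")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	const n = 20000
	for i := 0; i < n; i++ {
		if err := os.WriteFile(filepath.Join(dir, fmt.Sprintf("f%05d", i)), nil, 0o644); err != nil {
			panic(err)
		}
	}

	// Delete the files concurrently while the walk below runs.
	go func() {
		for i := 0; i < n; i++ {
			os.Remove(filepath.Join(dir, fmt.Sprintf("f%05d", i)))
		}
	}()
	time.Sleep(10 * time.Millisecond) // let some deletes land mid-walk

	vanished := 0
	_ = filepath.Walk(dir, func(path string, fi os.FileInfo, err error) error {
		if err != nil && fi == nil {
			vanished++ // the situation that crashed the wait container
			return nil // keep walking for the purposes of the demo
		}
		return nil
	})
	fmt.Printf("callback saw %d entries with fi == nil and err != nil\n", vanished)
}
```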

Joibel (Member) commented Jun 26, 2024

To test this you'd need to build a custom argoexec image. Having checked out the argo-workflows code:

go get github.com/argoproj/pkg@s3-err-check
make argoexec-image

You'll then need to push this image to a registry that your cluster can pull from and configure your workflow controller to use it with --executor-image.

I suggest we chat in Slack if you're having problems with this.

agilgur5 added the area/artifacts (S3/GCP/OSS/Git/HDFS etc) label on Jun 26, 2024
agilgur5 (Contributor) commented:

@Joibel we usually upload a test image somewhere (e.g. personal DockerHub) if folks can run a test

agilgur5 added the P3 (Low priority) label on Jun 29, 2024
ljyanesm (Author) commented:

We have made some changes to the workflows so that tasks are fully independent. The error was most likely related to delete operations on the path that was being uploaded as an artifact.

This was corrected by moving these files to a different location available only to the pod running the task.

@Joibel, @agilgur5: do you think the changes to the pkg repo, which should help identify the issue more readily next time, are enough to close this ticket?

agilgur5 (Contributor) commented Jul 20, 2024

(A PR to update pkg in this repo is still necessary.)

agilgur5 (Contributor) commented Nov 6, 2024

It's a repo SHA of sorts; I forget exactly how Go modules does it off the top of my head right now, but hopefully that's enough for you to take it from there 😅
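Concretely, the pseudo-version already visible in the stack trace paths breaks down like this (my annotation of how Go modules encode a commit; the layout below is illustrative):

```
v0.13.7-0.20230901113346-235a5432ec98
│         │              └─ first 12 characters of the commit SHA
│         └─ commit timestamp in UTC (yyyymmddhhmmss)
└─ version prefix the Go toolchain derives from the nearest preceding tag
```

The 12-character suffix is a prefix of the commit 235a5432ec982969e2e1987e66458b5a44c2ee6f that Joibel's s3.go link above points at, so the pseudo-version pins exactly that revision of pkg.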

tooptoop4 linked a pull request on Nov 6, 2024 that will close this issue