
Fix #35 'kubectl logs -f' implemented #39

Merged

Conversation

antoinetran (Contributor)

Summary

Implements kubectl logs -f on the InterLink Slurm Plugin side. Also, a sessionContext (e.g. GetLogs#12345) is now created and added as an HTTP header, so that a single request can be followed from the VK to the API to the Plugin under the same id.
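The sessionContext idea could be sketched roughly like this (an illustration only, not the actual plugin code: the header name, endpoint, and random-id scheme are assumptions):

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// newSessionContext builds an id such as "GetLogs#12345", combining the
// API call name with a random number, so that every log line belonging
// to one request can be grepped across VK, API and plugin.
func newSessionContext(call string) string {
	return fmt.Sprintf("%s#%d", call, rand.Intn(100000))
}

func main() {
	// Hypothetical endpoint and header name, for illustration only.
	req, err := http.NewRequest(http.MethodGet, "http://localhost:4000/getLogs", nil)
	if err != nil {
		panic(err)
	}
	sessionContext := newSessionContext("GetLogs")
	req.Header.Set("InterLink-Http-Session", sessionContext)
	fmt.Println("forwarding request with session", req.Header.Get("InterLink-Http-Session"))
}
```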

In follow mode, this reads the container log continuously with a buffer, sleeping for a hard-coded 4 s between reads to avoid stressing the file-system. It only ends when:

  • the container ends (meaning the status file is written)
  • the job ends forcibly (e.g. scancel, or the walltime is reached, meaning the status file is NOT written by the job), in which case the InterLink Slurm Plugin asks Slurm for the job exit code and writes it as if the container had written it. For this to work, squeue must also report dead jobs, so "squeue is missing requesting ALL states" #38 is necessary (states=all), as is "getStatus based on container.status is not reliable" #37, to get the exit code. In fact, maybe we could completely remove the exit code written by the job script, and rely only on asking Slurm for the job exit code and writing it.
  • the pod is deleted (meaning the JID disappears from the JIDs object), in which case one rmdir is attempted; if it fails because some log files are still open, the plugin waits for a hard-coded 5 s (which is longer than the 4 s logs loop) and does a second rmdir.
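Under those rules, the follow loop could be sketched like this (a simplified illustration, assuming the main.out/main.status file layout visible in the logs below and an in-memory JIDs map; the real plugin code differs):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// pollInterval is the hard-coded pause between reads, chosen to avoid
// stressing the shared file-system.
const pollInterval = 4 * time.Second

// containerEnded reports whether the container terminated normally,
// i.e. its status file was written.
func containerEnded(statusFile string) bool {
	_, err := os.Stat(statusFile)
	return err == nil
}

// podDeleted reports whether the pod was deleted, i.e. its JID no
// longer appears in the in-memory JIDs map.
func podDeleted(jids map[string]bool, jid string) bool {
	return !jids[jid]
}

// followLogs streams new bytes of logFile to out until the container is
// found dead; death is checked before each read, so the final read
// after detection drains the remaining log lines.
func followLogs(logFile, statusFile, jid string, jids map[string]bool, out *os.File) error {
	var offset int64
	for {
		dead := containerEnded(statusFile) || podDeleted(jids, jid)
		f, err := os.Open(logFile)
		if err != nil {
			return err
		}
		if _, err := f.Seek(offset, 0); err != nil {
			f.Close()
			return err
		}
		buf := make([]byte, 4096)
		for {
			n, _ := f.Read(buf)
			if n == 0 {
				break
			}
			offset += int64(n)
			out.Write(buf[:n])
		}
		f.Close()
		if dead {
			// Death was detected before the read above, so all
			// remaining logs were drained: exit follow mode.
			return nil
		}
		time.Sleep(pollInterval)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "pod")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	logFile := filepath.Join(dir, "main.out")
	statusFile := filepath.Join(dir, "main.status")
	os.WriteFile(logFile, []byte("hello from the job\n"), 0o644)
	os.WriteFile(statusFile, []byte("0"), 0o644) // container already dead
	jids := map[string]bool{"278417": true}
	if err := followLogs(logFile, statusFile, "278417", jids, os.Stdout); err != nil {
		panic(err)
	}
	fmt.Println("follow mode ended")
}
```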

Related issues:
interTwin-eu/interLink#332
#35
#37
#38

@antoinetran (Contributor, Author)

Tested case: 1 h walltime with an unlimited sleep job, with logs being followed.
Result: kubectl logs -f stops automatically after the walltime.

The exit code is taken from the job exit code, instead of being written by the script:

kubectl get pods =>
registerdcc-j7872-registerdcc-1008352040-fork-interlink   0/1     Error: 15   0          68m   10.244.1.204   interlink-slurm-node   <none>           <none>
logs =>
time="2024-12-03T15:09:53Z" level=debug msg="InterLink: status [{\"name\":\"registerdcc-j7872-registerdcc-1008352040-fork-interlink\",\"UID\":\"d7af6be5-379c-40d8-a201-4f20c2e180a7\",\"namespace\":\"argo-workflows\",\"JID\":\"\",\"containers\":[{\"name\":\"main\",\"state\":{\"terminated\":{\"exitCode\":15,\"startedAt\":\"2024-12-03T14:01:44Z\",\"finishedAt\":\"2024-12-03T14:31:46Z\"}},\"lastState\":{},\"ready\":false,\"restartCount\":0,\"image\":\"\",\"imageID\":\"\"}]}]"
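Asking Slurm for the job exit code and parsing it could look roughly like this (hypothetical sketch: the plugin's actual query is not shown in this thread; it assumes sacct's colon-separated ExitCode field, "returncode:signal"):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSlurmExitCode extracts the job return code from a Slurm ExitCode
// field of the form "returncode:signal" (e.g. "15:0").
func parseSlurmExitCode(field string) (int, error) {
	code := strings.SplitN(strings.TrimSpace(field), ":", 2)[0]
	return strconv.Atoi(code)
}

func main() {
	// In the real plugin this value would come from something like
	// `sacct -j <JID> --format=ExitCode --noheader` (an assumption).
	code, err := parseSlurmExitCode("15:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("terminated with exitCode", code)
}
```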

If we read the logs again (kubectl logs -f), we get the logs, and the internal logs read, as expected:

time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: Docker Sidecar: received GetLogs call"
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: reading file /home/username/.interlink/argo-workflows-d7af6be5-379c-40d8-a201-4f20c2e180a7/main.out"
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: reading file /home/username/.interlink/argo-workflows-d7af6be5-379c-40d8-a201-4f20c2e180a7/job.out"
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: writing response headers and OK status"
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: writing response body len: 454"
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: written response body len: 454"
time="2024-12-03T15:11:26Z" level=debug msg="HTTP InterLink session GetLogs#71421: flushing after wrote response body bytes: 454"
time="2024-12-03T15:11:26Z" level=debug msg="HTTP InterLink session GetLogs#71421: Check container status/home/username/.interlink/argo-workflows-d7af6be5-379c-40d8-a201-4f20c2e180a7/main.status with current length/offset: 287"
time="2024-12-03T15:11:26Z" level=debug msg="HTTP InterLink session GetLogs#71421: opened for follow mode the container logs /home/username/.interlink/argo-workflows-d7af6be5-379c-40d8-a201-4f20c2e180a7/main.out"
time="2024-12-03T15:11:26Z" level=debug msg="HTTP InterLink session GetLogs#71421: wrote some logs, now flushing..."
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: Container is found dead thanks to status file, reading last logs..."
time="2024-12-03T15:11:26Z" level=debug msg="HTTP InterLink session GetLogs#71421: wrote some logs, now flushing..."
time="2024-12-03T15:11:26Z" level=info msg="HTTP InterLink session GetLogs#71421: Container was found dead and no more logs are found at this step, exiting following mode..."

@antoinetran (Contributor, Author)

Test case: delete a pod while multiple clients follow its logs.
Result: success.
kubectl logs -f stops automatically after the pod is deleted.

The InterLink Slurm Plugin logs also show that the first rmdir attempt fails, the plugin waits for the followers to close the log files, then the second rmdir attempt succeeds.

time="2024-12-03T15:15:51Z" level=info msg="HTTP InterLink session NoSessionFound#0: Slurm Sidecar: received Stop call"
time="2024-12-03T15:15:51Z" level=info msg="- Deleting Job for pod bdf49441-c5fb-484c-89ce-05cd4b3d584d"
time="2024-12-03T15:15:52Z" level=info msg="- Deleted Job 278417"
time="2024-12-03T15:15:52Z" level=debug msg="Attempt 1 of deletion failed, not really an error! Probably log file still opened, waiting for close... Error: unlinkat /home/username/.interlink/argo-workflows-bdf49441-c5fb-484c-89ce-05cd4b3d584d: directory not empty"
time="2024-12-03T15:15:54Z" level=debug msg="HTTP InterLink session GetLogs#44846: wrote some logs, now flushing..."
time="2024-12-03T15:15:54Z" level=info msg="HTTP InterLink session GetLogs#44846: Container is found dead thanks to missing JID, reading last logs..."
time="2024-12-03T15:15:54Z" level=debug msg="HTTP InterLink session GetLogs#44846: wrote some logs, now flushing..."
time="2024-12-03T15:15:54Z" level=info msg="HTTP InterLink session GetLogs#44846: Container was found dead and no more logs are found at this step, exiting following mode..."
time="2024-12-03T15:15:54Z" level=debug msg="HTTP InterLink session GetLogs#58538: wrote some logs, now flushing..."
time="2024-12-03T15:15:54Z" level=info msg="HTTP InterLink session GetLogs#58538: Container is found dead thanks to missing JID, reading last logs..."
time="2024-12-03T15:15:54Z" level=debug msg="HTTP InterLink session GetLogs#58538: wrote some logs, now flushing..."
time="2024-12-03T15:15:54Z" level=info msg="HTTP InterLink session GetLogs#58538: Container was found dead and no more logs are found at this step, exiting following mode..."
time="2024-12-03T15:15:55Z" level=debug msg="HTTP InterLink session GetLogs#31041: wrote some logs, now flushing..."
time="2024-12-03T15:15:55Z" level=info msg="HTTP InterLink session GetLogs#31041: Container is found dead thanks to missing JID, reading last logs..."
time="2024-12-03T15:15:55Z" level=debug msg="HTTP InterLink session GetLogs#31041: wrote some logs, now flushing..."
time="2024-12-03T15:15:55Z" level=info msg="HTTP InterLink session GetLogs#31041: Container was found dead and no more logs are found at this step, exiting following mode..."
time="2024-12-03T15:15:57Z" level=info msg="Attempt 2 of deletion succeeded!"

@antoinetran (Contributor, Author)

TODO for later: when Ctrl+C is sent to kubectl logs -f, following should stop for that session. But I don't yet know how to detect that and react to it.

@dciangot (Collaborator) left a comment

Thanks again @antoinetran! Looks good, I just have a couple of comments that I put inline!

Then I'll merge this into main and give it a try before creating a release.

.project (review comment, outdated, resolved)
pkg/slurm/GetLogs.go (review comment, outdated, resolved)
pkg/slurm/GetLogs.go (review comment, resolved)
@antoinetran force-pushed the fix_issue35_kubectllogsfollow_rebased branch from b8f69a6 to 5880378 on December 12, 2024 16:30
@dciangot self-requested a review on December 13, 2024 07:34
@dciangot (Collaborator) left a comment

LGTM; tests are good, although failing for independent reasons.

@dciangot merged commit 813e44c into interTwin-eu:main on Dec 13, 2024 (1 of 2 checks passed)