Enable user to provision shared memory for pipeline node #2838
Comments
After quick research it appears that Kubernetes currently does not support setting the pod's shared memory size.
Thank you! We are investigating and will report back.
If there's any follow-up required on our end please re-open the issue!
Hi @ptitzler, for this issue, I tried the following:
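What the comment most likely refers to is the standard Kubernetes workaround of backing /dev/shm with a Memory-medium emptyDir volume. A minimal sketch of that approach follows; the volume name, image, and 1Gi size limit are illustrative assumptions, not the exact values used in this comment:

```yaml
# Sketch: back /dev/shm with a Memory-medium emptyDir so the container
# gets more shared memory than the default tmpfs allocation.
# Volume name, image, and sizeLimit are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-node
spec:
  containers:
    - name: training
      image: pytorch/pytorch:latest
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
```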
The pipeline ran successfully without any deadlocks. The increased shm size would help deep learning workloads that use multiprocessing. Could that possibly be added to the node properties in the UI and to the Kubeflow workload YAML in the backend?
Is your feature request related to a problem? Please describe.
When using FARM to manage PyTorch KPI training and extraction, runs would deadlock: os-climate/aicoe-osc-demo#174
I solved this problem by limiting the use of multiprocessing so that shared memory did not need to be allocated. But a better solution would be to allocate sufficient shared memory to allow the runs to complete. Elyra pipelines already have an elegant way for users to specify how many CPUs and GPUs, as well as how much RAM, should be allocated to a pipeline node. That pattern could be applied to other resources, like shared memory. Docker run, for example, supports the --shm-size=SIZE parameter.
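For reference, this is roughly what that looks like at the Docker level; the image and command here are illustrative, only the flag matters:

```shell
# Give the container 1 GB of shared memory instead of Docker's 64 MB default.
docker run --shm-size=1g pytorch/pytorch:latest python train.py
```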
Describe the solution you'd like
I'd like to specify, on a per-node basis, a non-default amount of shared memory to allocate. In my case, I'd like to see whether 512 MB is enough, or whether 1 GB or 2 GB is needed. I should have the freedom to specify an amount (possibly with units).
Describe alternatives you've considered
I have already implemented changes to disable multiprocessing, but this makes poor use of the powerful CPUs our cluster makes available. Another possibility would be to write Operate First scripts to control the Kubeflow execution parameters outside of Elyra, but why not expose this parameter that is so critical to specific node tasks?
Additional context
My own project repo is here: https://github.com/MichaelTiemannOSC/aicoe-osc-demo/tree/cdp-fixups
@Shreyanand
@ptitzler