
Enable user to provision shared memory for pipeline node #2838

Closed
MichaelTiemannOSC opened this issue Jul 18, 2022 · 4 comments · Fixed by #2942
Labels
component:pipeline-editor (pipeline editor) · component:pipeline-runtime (issues related to pipeline runtimes, e.g. Kubeflow Pipelines) · kind:enhancement (New feature or request)
Comments

@MichaelTiemannOSC

Is your feature request related to a problem? Please describe.
When using FARM to manage PyTorch KPI training and extraction, runs would deadlock: os-climate/aicoe-osc-demo#174

I solved this problem by limiting the use of multiprocessing so that shared memory did not need to be allocated. A better solution, though, would be to allocate sufficient shared memory to let the runs complete. The Elyra pipeline editor already gives users an elegant way to specify how many CPUs and GPUs, and how much RAM, should be allocated to a pipeline node. That pattern could be applied to other resources, such as shared memory. Docker run supports the --shm-size=SIZE parameter, for example.

Describe the solution you'd like
I'd like to specify, on a per-node basis, a non-default amount of shared memory to allocate. In my case, I'd like to see whether 512 MB is enough, or whether 1 GB or 2 GB is needed. I should have the freedom to specify an amount (possibly with units).

Describe alternatives you've considered
I have already implemented changes to disable multiprocessing, but this makes poor use of the powerful CPUs our cluster makes available. Another possibility would be to write Operate First scripts to control the Kubeflow execution parameters outside of Elyra, but why not expose this parameter that is so critical to specific node tasks?

Additional context
My own project repo is here: https://github.com/MichaelTiemannOSC/aicoe-osc-demo/tree/cdp-fixups

@Shreyanand
@ptitzler

MichaelTiemannOSC added the kind:enhancement label on Jul 18, 2022
ptitzler added the component:pipeline-runtime label on Jul 19, 2022
@ptitzler
Member

After some quick research, it appears that Kubernetes currently does not support setting a pod's shared memory size (https://github.com/kubernetes/kubernetes/issues/28272). Using an emptyDir, as shown in https://docs.openshift.com/online/pro/dev_guide/shared_memory.html, in combination with a size limit might be a possible approach worth considering.
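For illustration, a minimal sketch of that approach on a plain Kubernetes pod spec. The pod/container names, image, and size limit below are placeholders, not anything Elyra generates today:

```yaml
# Sketch only: back /dev/shm with a memory-backed emptyDir and cap its size.
# Names, image, and sizeLimit are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo
spec:
  containers:
    - name: pipeline-node
      image: example.io/notebook-image:latest
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm      # replaces the default 64Mi shm mount
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory             # tmpfs-backed volume
        sizeLimit: 1Gi             # cap shared memory at 1 GiB
```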

@MichaelTiemannOSC
Author

Thank you! We are investigating and will report back.

@ptitzler
Member

If there's any follow-up required on our end please re-open the issue!

@Shreyanand

Shreyanand commented Sep 21, 2022

Hi @ptitzler, for this issue, I tried the following:

  1. Exported the pipeline yaml (it's a very useful and powerful feature!)
  2. Edited the YAML as per this comment, which essentially mounts an shm dir and allows for larger shared memory for multiprocessing (a rough sketch of the edit is shown below)
  3. Imported the yaml in a kubeflow pipeline run

The pipeline ran successfully without any deadlocks. The increased shm size would help deep learning workloads that use multiprocessing. Could that possibly be added to the node properties in the UI and to the Kubeflow workload YAML in the backend?
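For anyone reproducing this by hand, the edit amounts to adding a memory-backed emptyDir mounted at /dev/shm to the affected step in the exported workflow YAML. A rough sketch follows, assuming a KFP v1 (Argo Workflow) style export; the template name, image, and 2Gi limit are illustrative, not taken from the actual pipeline:

```yaml
# Sketch of the hand edit to an exported (Argo-style) Kubeflow pipeline YAML.
# Template name, image, and sizeLimit are placeholders.
spec:
  templates:
    - name: train-kpi-model
      container:
        image: example.io/aicoe-osc-demo:latest
        volumeMounts:
          - name: dshm
            mountPath: /dev/shm    # larger shared memory for DataLoader workers
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
```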
