Enable user to provision shared memory for pipeline node #2838
Comments
After quick research it appears that Kubernetes currently does not support setting the pod's shared memory size.
Thank you! We are investigating and will report back.
If there's any follow-up required on our end please re-open the issue!
Hi @ptitzler, for this issue, I tried the following:
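What the comment most likely refers to is the standard Kubernetes workaround of backing /dev/shm with a Memory-medium emptyDir volume. A minimal sketch of that approach follows; the volume name, image, and 1Gi size limit are illustrative assumptions, not the exact values used in this comment:

```yaml
# Sketch: back /dev/shm with a Memory-medium emptyDir so the container
# gets more shared memory than the default tmpfs allocation.
# Volume name, image, and sizeLimit are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-node
spec:
  containers:
    - name: training
      image: pytorch/pytorch:latest
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
```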
The pipeline ran successfully without any deadlocks. The increased shm size would help deep learning workloads that use multiprocessing. Could that possibly be added to the node properties in the UI and to the Kubeflow workload YAML in the backend?
Is your feature request related to a problem? Please describe.
When using FARM to manage PyTorch KPI training and extraction, runs would deadlock: os-climate/aicoe-osc-demo#174
I solved this problem by limiting the use of multiprocessing so that shared memory did not need to be allocated. But a better solution would be to allocate sufficient shared memory to allow the runs to complete. Elyra pipelines already have an elegant way for users to specify how many CPUs and GPUs, as well as how much RAM, should be allocated to a pipeline node. That pattern could be applied to other resources, like shared memory. Docker run, for example, supports the --shm-size=SIZE parameter.
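For reference, this is roughly what that looks like at the Docker level; the image and command here are illustrative, only the flag matters:

```shell
# Give the container 1 GB of shared memory instead of Docker's 64 MB default.
docker run --shm-size=1g pytorch/pytorch:latest python train.py
```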
Describe the solution you'd like
I'd like to specify, on a per-node basis, a non-default amount of shared memory to allocate. In my case, I'd like to see whether 512 MB is enough, or whether 1 GB or 2 GB is needed. I should have the freedom to specify an amount (possibly with units).
Describe alternatives you've considered
I have already implemented changes to disable multiprocessing, but this makes poor use of the powerful CPUs our cluster makes available. Another possibility would be to write Operate First scripts to control the Kubeflow execution parameters outside of Elyra, but why not expose this parameter that is so critical to specific node tasks?
Additional context
My own project repo is here: https://github.com/MichaelTiemannOSC/aicoe-osc-demo/tree/cdp-fixups
@Shreyanand
@ptitzler