Please list any open Torque/Moab queue issues here #147

Closed
jchodera opened this issue Oct 20, 2014 · 10 comments

@jchodera
Member

We're capturing a list of outstanding issues with Torque/Moab in advance of our discussion with Adaptive on Tuesday.

Please comment here with any issues (or links to issues) that are still problematic.

This is time-sensitive, so please get your issues in TODAY (Mon 20 Oct) if possible.

@jchodera jchodera changed the title Please comment on any open Torque/Moab queue issues here Please list any open Torque/Moab queue issues here Oct 20, 2014
@jchodera
Member Author

@tatarsky says: Compile Torque/PBSmom to use libcgroup (instead of manual cpuset) to allow Docker to be used.
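A hypothetical node pre-check sketch (not part of Torque or Adaptive's tooling), reading the kernel's standard /proc/cgroups table; the REQUIRED controller set below is an assumption for illustration, not a documented Torque or Docker requirement:

```python
#!/usr/bin/env python
# Hypothetical pre-check (not part of Torque): list the cgroup controllers the
# kernel reports as enabled in /proc/cgroups and flag any of an assumed required
# set that are missing before rolling out a libcgroup-enabled pbs_mom plus Docker.
REQUIRED = {"cpuset", "cpu", "memory", "devices", "freezer"}  # assumed set

def enabled_controllers(path="/proc/cgroups"):
    """Return the set of controller names marked enabled in /proc/cgroups."""
    enabled = set()
    with open(path) as f:
        for line in f:
            if line.startswith("#"):
                continue  # header row
            name, _hierarchy, _num_cgroups, flag = line.split()
            if flag == "1":
                enabled.add(name)
    return enabled

if __name__ == "__main__":
    missing = REQUIRED - enabled_controllers()
    print("missing controllers: %s" % (", ".join(sorted(missing)) or "none"))
```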

@jchodera
Member Author

We should ask Adaptive what they recommend in terms of computing power for the Torque/Moab head node.

@tatarsky
Contributor

Would like to move all nodes to the CentOS 6.5 userspace before the update. I don't see too many issues with that except controlling the kernel revision and GPFS/NVIDIA. The nodes are currently on 6.4.

@jchodera
Member Author

I do not believe our group-based fairshare system has ever worked as desired. We should revisit this issue.

We may also want finer-grained control over group and group-collaborator weights; a toy sketch of the idea follows.
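Purely illustrative, not Moab's actual fairshare implementation; every group name, weight, target, window value, and decay factor below is a made-up placeholder:

```python
# Toy sketch only; not Moab's fairshare implementation. Illustrates the idea of
# decayed per-window usage combined with per-group weights and targets.

def decayed_usage(window_usage, decay=0.8):
    """Combine per-window usage (most recent first) with exponential decay."""
    return sum(u * decay ** i for i, u in enumerate(window_usage))

def fairshare_bonus(usage, weight, target):
    """Positive when a group is under its target share, negative when over."""
    return weight * (target - usage)

# Made-up example: lab members vs. collaborators with different weights/targets.
groups = {
    "lab-members":   {"windows": [120.0, 90.0, 60.0], "weight": 1.0, "target": 100.0},
    "collaborators": {"windows": [30.0, 10.0, 5.0],   "weight": 0.5, "target": 25.0},
}

for name, cfg in sorted(groups.items()):
    usage = decayed_usage(cfg["windows"])
    bonus = fairshare_bonus(usage, cfg["weight"], cfg["target"])
    print("%s: usage=%.1f bonus=%.1f" % (name, usage, bonus))
```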

@jchodera
Member Author

We need to tackle the setting of CUDA_VISIBLE_DEVICES for the following use cases:

  • single GPU
  • all GPUs on one node
  • multiple nodes, all GPUs on each node
  • a set of N available GPUs distributed randomly among nodes (e.g. used by MPI)

Currently, this last case is the tough one.
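A minimal sketch of one possible approach for that last case, assuming Torque's $PBS_GPUFILE convention of one <hostname>-gpu<index> entry per allocated GPU; the helper and its output format are hypothetical, and an MPI launcher wrapper would still have to export the right string on each host:

```python
#!/usr/bin/env python
# Sketch only: derive per-node CUDA_VISIBLE_DEVICES strings from $PBS_GPUFILE.
# Assumes one "<hostname>-gpu<index>" entry per line.
import os
from collections import defaultdict

def gpu_assignments(gpufile_path):
    """Return {hostname: "0,1,..."} suitable for CUDA_VISIBLE_DEVICES."""
    per_node = defaultdict(list)
    with open(gpufile_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            host, _, index = line.rpartition("-gpu")
            per_node[host].append(index)
    return {host: ",".join(sorted(ids, key=int)) for host, ids in per_node.items()}

if __name__ == "__main__":
    for host, devices in sorted(gpu_assignments(os.environ["PBS_GPUFILE"]).items()):
        print("%s: CUDA_VISIBLE_DEVICES=%s" % (host, devices))
```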

@jchodera
Member Author

Last call for any outstanding issues!

@akahles

akahles commented Oct 28, 2014

Just for completeness (not sure whether this is relevant): getting the Docker infrastructure to run on the cluster nodes. Maybe Adaptive has some experience there.

@tatarsky
Contributor

Already on my list, with the needed build options.


@ratsch

ratsch commented Oct 28, 2014

a) As John mentioned above, I think the fairshare system is not working properly yet. That needs to be tweaked to properly take group membership into account.

b) Short (<1h) jobs in the active queue (interactive jobs) should also be able to suspend batch jobs in order to find a slot more quickly.

c) Jobs in the active queue should run with CPU overcommitment; they don't need a full CPU allocated, since they are mostly idle.

d) The number of interactive jobs allowed per user could be increased once b) is in place.

@tatarsky
Contributor

Folded into #197
