Python submitted DRMAA jobs are not running on the worker nodes #246
Comments
That bash error looks suspiciously like what the bash patch for "shellshock" says when an improper function invocation is attempted.
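For background: the post-shellshock bash updates export shell functions such as module through environment variables named along the lines of BASH_FUNC_module() or BASH_FUNC_module%%, and a bash that receives a definition in a format it does not accept prints this kind of "error importing function definition" message. A minimal sketch for spotting such exported functions in the environment a job would inherit (plain Python, nothing cluster-specific assumed):

```python
import os

# List environment variables that carry exported bash functions.
# Post-shellshock bash versions name these "BASH_FUNC_<name>()" or
# "BASH_FUNC_<name>%%"; if a job inherits one in a format the worker
# node's bash rejects, bash complains about importing the definition.
for name, value in sorted(os.environ.items()):
    if "BASH_FUNC" in name:
        print("%s => %s" % (name, value[:60]))
```

Running this on the submit host and inside a qlogin shell on the worker node, and comparing the output, would show whether the exported module function rather than the script itself is what bash is choking on.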
BTW, are you saying "this worked before yesterday"???
Yes, these scripts were running perfectly until yesterday.
Did you perhaps "add a module" yesterday? As far as I can tell, that error has to do with the "modules" package.
No
Your .bashrc was modified as of this morning... what was changed?
Also attempting to reproduce...
I just deleted an empty line, that is what I remember...
Well, I'll have to look around. No changes I can think of on the cluster except the epilog script, which only fires if the queue is "active".
I didn't make any changes to the scripts. Thank you @tatarsky.
Do you know, BTW, what "the worker node" was in that message? I can dig around, but if you already know it would be appreciated.
gpu-1-13
I asked for
Yeah, I saw that. Does your script contain an attempt to use "module"? Or perhaps provide me the location of the item you run? That error is coming from the modules /etc/profile.d/modules.sh as far as I can tell, which is untouched, so I'm curious what's calling it.
Sending an email with details.
Thanks!
Made the title of this more specific for my tracking purposes.
Under some condition, the method DRMAA Python uses to submit jobs appears to get blocked from submitting more data. I have @vipints running again, but am chasing down what the resolution was for this Torque mailing list discussion: http://www.supercluster.org/pipermail/torqueusers/2014-January/016732.html
I do not believe the hotfix "introduced" this problem, as the date of that discussion is old. Opening a ticket with Adaptive to enquire.
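For context, the submission path being discussed here is the drmaa Python bindings on top of pbs-drmaa, which in turn talks to Torque. A minimal sketch of that workflow, with placeholder command, arguments, and resource string rather than the actual scripts from this thread:

```python
import drmaa

# Minimal drmaa-python submission: open a session, build a job template,
# submit, wait for completion, then clean up. pbs-drmaa translates this
# into a Torque job behind the scenes.
s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"          # placeholder command
    jt.args = ["30"]                         # placeholder arguments
    jt.nativeSpecification = "-l nodes=1:ppn=1,mem=1gb"  # passed through to qsub
    job_id = s.runJob(jt)
    print("submitted job", job_id)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job %s finished with exit status %s" % (info.jobId, info.exitStatus))
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```

Each runJob call goes through pbs-drmaa's own job bookkeeping, which appears to be where the submission limit discussed below comes from.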
Hi @tatarsky, this morning I noticed that python drmaa-submitted jobs are not dispatching to the worker nodes. I am not able to see the start time, for example:
Don't know what is happening here.
I don't see the same issue as before. Looks to me like a simple case of your jobs being rejected due to resources.
Thanks @tatarsky, I saw this message but forgot to include it in the previous message. Not sure why it got rejected, as I am requesting limited resources.
This is a little weird, perhaps a syntax error?
So it seems to be asking for a feature of a hostname...
It's weird: if you look at the "required hostlist", cpu-6-2 does not appear in it, yet I see you requesting it.
From the queue file...
Yes, correct, I am requesting specific hostnames in my submission argument. Due to the OOM issue I have blacklisted the following nodes:
Try the submit without the blacklist. The OOM issue is not node-related. I continue to work on the best solution to it.
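For reference, host-specific requests like the ones being debugged here are normally passed to pbs-drmaa through the job template's nativeSpecification, which is handed to qsub unchanged. A minimal sketch, with a hypothetical node name and script path:

```python
import drmaa

pinned_node = "gpu-1-13"   # hypothetical; the actual hosts requested are not shown in the thread

s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/path/to/analysis.sh"   # placeholder script
    # Torque's nodes syntax pins the job to a named host. Dropping the pin
    # (a plain "-l nodes=1:ppn=1") is what "try the submit without the
    # blacklist" amounts to.
    jt.nativeSpecification = "-l nodes=%s:ppn=1,mem=2gb" % pinned_node
    job_id = s.runJob(jt)
    print("submitted", job_id, "pinned to", pinned_node)
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```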
That's exciting, because I believe from some threads the limit was 1024. Let's declare victory at 10K ;)
So what do you think the count is at?
So far I have reached
Very cool. I'll ask again in 14 days, which is my guesstimate for 10K. While it seems likely that was the fix, let's let it ride some more.
@tatarsky: by today, I have reached a total number of finished jobs of 5976, but now I am triggering the error message max_num_job_reached from drmaa. Seems like it is not happy with the patch... @cganote: just checking, are your drmaa jobs OK with the patch?
I haven't seen any issues, but maybe I'm not getting enough jobs submitted through drmaa? I certainly haven't had 6000 yet. -Carrie
Thanks @cganote.
That's a rather odd number. I'd like to poke around for a bit before I restart pbs_server to see if I learn anything new. Are you under time pressure to get more of these in?
If you find some time this evening, I will be happy. Thanks!
Similar code 15007 response in logs claiming "unauthorized request".
No new information gained. Restarted pbs_server.
@tatarsky, this time drmaa reached the limit again; seems like an odd behavior this time.
Whenever you have a small time frame, I may need a restart of pbs_server. Thank you.
Restarted. I have this slated for possible test attempts on the new scheduler head I've built. The current issue is that I need some nodes to test that system with, and we're working on a schedule. It seems that the patch does not solve the problem, but it's unclear whether it hurts overall or helps. Seems weird this one didn't even get to the "normal" 1024 or so.
It could be that someone else is also using drmaa to submit jobs to the cluster. The count (366) of jobs is just from my side. Is there anybody else using the drmaa/python combination to submit jobs on hal?
Not that I've ever heard of.
This world's longest issue may be further attacked via #349. However, it's unclear how it would be attacked at this moment in time.
Has there been any progress on this issue? Just encountered it today.
No. No precise solution has ever been found. I will restart pbs_server and you can tell me if it works after that. Then the item will be moved to FogBugz.
I am not sure whether we found a way to fix this. If you are getting the error, it means drmaa reached the max_num_of_jobs limit; to fix this you may need a pbs_server restart from the admins to clear the job IDs.
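For anyone hitting this later, a rough sketch of catching that condition on the Python side; the message check mirrors the max_num_job_reached text seen earlier in this thread and is an assumption, not a documented pbs-drmaa error code:

```python
import drmaa
from drmaa.errors import DrmaaException

def submit_or_flag(session, command, args):
    """Submit one job; surface pbs-drmaa's exhausted job table as a clear error."""
    jt = session.createJobTemplate()
    jt.remoteCommand = command
    jt.args = list(args)
    try:
        return session.runJob(jt)
    except DrmaaException as exc:
        # Assumed message check: the exhausted internal job table showed up in
        # this thread as "max_num_job_reached"; wording may differ by version.
        if "max_num" in str(exc):
            raise RuntimeError(
                "drmaa job-ID table exhausted; ask the admins to restart pbs_server"
            ) from exc
        raise
    finally:
        session.deleteJobTemplate(jt)
```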
Server restarted to confirm your example is a case of this. If so, open a ticket in FogBugz via the email address listed in the /etc/motd on hal. I won't be processing items here further.
Sorry, I meant to report via email to the cbio-admin group. Thanks @tatarsky!
That's fine. This ticket has a long, long, gory history, but all further attempts to figure it out require involvement by the primary support, which as of today is MSKCC staff. I will assist them as needed, but I don't feel this is likely to be trivially fixed. As we both know, DRMAA is quite a hack for Torque.
I do notice that since we last battled this there has been another release of pbs-drmaa, 1.0.19. Perhaps by some Friday miracle they are using the Torque 5.0 submit call instead of the crufty 4.0 one that seems to be buggy.
Yeah, that is correct. Seems like they have support for v5. Maybe we can try after the long weekend; I didn't check the recent release version.
I see we actually noticed it when it came out last year, and I see nothing overly "Torque 5" in it yet. I am unlikely to look at this further today. Confirm/deny that your example now works with pbs_server restarted, and open a ticket for some work next week.
Yes, python drmaa job submission works now.
Kick an email to the address listed for problem reports in /etc/motd (sorry, I'm not placing it again in the public Git) to start tracking it there. We'll reference this Git thread, but we no longer process bugs here.
Not that this one is likely to be fixable anytime soon. We've tried for many years, and DRMAA is basically not well supported by Adaptive.
Since yesterday, I have been struggling to debug an error caused by running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all necessary PATH variables, but it fails just after that (using only a single second). The log file didn't give much information:
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory
I am able to run this Python script without Torque on the login machine and on a worker node (with qlogin).
Did anybody use the drmaa/python combination in cluster computing?
I checked the drmaa job environment; all env PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.
I am not quite sure how to proceed with the debugging or where to look. Any suggestions/help :)
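One way to inspect exactly what environment a drmaa job receives on the worker node is to submit /usr/bin/env itself and read back its output. A minimal sketch, assuming a home directory shared between the submit host and the workers:

```python
import os
import drmaa

out_path = os.path.expanduser("~/drmaa_env_check.out")   # assumes a shared home directory

s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/usr/bin/env"      # dump the job's environment on the worker node
    jt.outputPath = ":" + out_path         # DRMAA output paths take a "[host]:path" form
    jt.joinFiles = True                    # merge stderr into the same file
    job_id = s.runJob(jt)
    s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    s.deleteJobTemplate(jt)
finally:
    s.exit()

# Compare PATH and any BASH_FUNC_* entries against what a qlogin shell shows.
# (Output staging can lag slightly after wait() returns.)
with open(out_path) as fh:
    for line in fh:
        if line.startswith("PATH=") or "BASH_FUNC" in line:
            print(line.rstrip())
```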