
Python submitted DRMAA jobs are not running on the worker nodes #246

Open

vipints opened this issue Apr 16, 2015 · 130 comments

@vipints

vipints commented Apr 16, 2015

Since yesterday I have been struggling to debug an error that occurs when running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all the necessary PATH variables, but it fails right after that (using only a single second). The log file didn't give much information:

-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory

I am able to run this Python script without Torque on the login machine and on a worker node (with qlogin).

Has anybody used the drmaa/Python combination for cluster computing?

I checked the drmaa job environment and all env PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.

I am not quite sure how to proceed with debugging or where to look; any suggestions/help would be appreciated :)
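
For context, here is a minimal sketch of this kind of submission path using the python drmaa bindings; the script name and resource values below are illustrative placeholders, not the actual failing job.

```python
# Minimal pbs-drmaa submission sketch; "run_pipeline.py" and the resource
# string are illustrative placeholders.
import os
import drmaa

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = '/usr/bin/python'
jt.args = ['run_pipeline.py']
jt.jobEnvironment = {'PATH': os.environ['PATH']}         # pass PATH through to the worker node
jt.nativeSpecification = '-l nodes=1:ppn=4 -l mem=12gb'  # Torque-style resource request

jobid = s.runJob(jt)
info = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print(jobid, info.exitStatus)                            # exitStatus 127 is the failure described above

s.deleteJobTemplate(jt)
s.exit()
```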

@tatarsky
Contributor

That bash error looks suspiciously like what the bash patch for "shellshock" says when an improper function invocation is attempted.

@tatarsky
Contributor

BTW are you saying "this worked before yesterday" ???

@vipints
Author

vipints commented Apr 16, 2015

Yes, these scripts were running perfectly until yesterday.

@tatarsky
Contributor

Did you perhaps "add a module" yesterday? As far as I can tell, that error has to do with the "modules" package.

@vipints
Author

vipints commented Apr 16, 2015

No

@tatarsky
Contributor

Your .bashrc was modified as of this morning... what was changed?

@tatarsky
Contributor

Also attempting to reproduce....

@vipints
Author

vipints commented Apr 16, 2015

I just deleted an empty line, that is what I remember...

@tatarsky
Contributor

Well, I'll have to look around. No changes I can think of on the cluster except the epilog script, which only fires if the queue is "active".

@vipints
Author

vipints commented Apr 16, 2015

I didn't make any changes to the scripts. Thank you, @tatarsky.

@tatarsky
Contributor

Do you know, BTW, what "the worker node" was in that message? I can dig around, but if you already know, it would be appreciated.

@vipints
Author

vipints commented Apr 16, 2015

gpu-1-13

@vipints
Author

vipints commented Apr 16, 2015

I asked for nodes=1 and ppn=4 and it dispatched:

exec_host=gpu-1-13/6+gpu-1-17/19+gpu-1-15/10+gpu-3-8/14

@tatarsky
Contributor

Yeah, I saw that. Does your script contain an attempt to use "module"? Or perhaps provide me the location of the item you run? That error is coming from the modules /etc/profile.d/modules.sh as far as I can tell, which is untouched, so I'm curious what's calling it.

@vipints
Author

vipints commented Apr 16, 2015

sending an email with details.

@tatarsky
Contributor

Thanks!

@tatarsky changed the title from "Jobs are not running on the worker nodes" to "Python submitted DRMAA jobs are not running on the worker nodes" on Apr 17, 2015
@tatarsky
Contributor

Made title of this more specific for my tracking purposes.

@tatarsky
Contributor

Under some condition, the method DRMAA Python uses to submit jobs appears to get blocked from submitting more data. I have @vipints running again, but I am chasing down what the resolution was for this Torque mailing list discussion:

http://www.supercluster.org/pipermail/torqueusers/2014-January/016732.html

I do not believe the hotfix "introduced" this problem, as that discussion is old. Opening a ticket with Adaptive to enquire.

@vipints
Author

vipints commented Apr 22, 2015

Hi @tatarsky, this morning I noticed that Python drmaa-submitted jobs are not being dispatched to the worker nodes. I am not able to see a start time; for example, showstart 3122268 reports:

INFO: cannot determine start time for job 3122268

Don't know what is happening here.

@tatarsky
Contributor

I don't see the same issue as before.

Looks to me like a simple case of your jobs being rejected due to resources.

checkjob -v 3122268
Node Availability for Partition MSKCC --------

gpu-3-9                  rejected: Features
gpu-1-4                  rejected: Features
gpu-1-5                  rejected: Features
gpu-1-6                  rejected: Features
gpu-1-7                  rejected: HostList
gpu-1-8                  rejected: HostList
gpu-1-9                  rejected: HostList
gpu-1-10                 rejected: HostList
gpu-1-11                 rejected: HostList
gpu-1-12                 rejected: Features
gpu-1-13                 rejected: Features
gpu-1-14                 rejected: Features
gpu-1-15                 rejected: Features
gpu-1-16                 rejected: Features
gpu-1-17                 rejected: Features
gpu-2-4                  rejected: HostList
gpu-2-5                  rejected: HostList
gpu-2-6                  rejected: Features
gpu-2-7                  rejected: HostList
gpu-2-8                  rejected: Features
gpu-2-9                  rejected: Features
gpu-2-10                 rejected: HostList
gpu-2-11                 rejected: Features
gpu-2-12                 rejected: Features
gpu-2-13                 rejected: HostList
gpu-2-14                 rejected: Features
gpu-2-15                 rejected: Features
gpu-2-16                 rejected: Features
gpu-2-17                 rejected: Features
gpu-3-8                  rejected: Features
cpu-6-1                  rejected: Features
cpu-6-2                  rejected: HostList
NOTE:  job req cannot run in partition MSKCC (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 608  feasible procs:   0

Node Rejection Summary: [Features: 21][HostList: 11]

@vipints
Author

vipints commented Apr 22, 2015

Thanks @tatarsky, I saw this message but forgot to include it in my previous message. Not sure why it got rejected, as I am requesting limited resources: 12gb mem and 40hrs cput_time.

@tatarsky
Contributor

This is a little weird, perhaps a syntax error?

Features: cpu-6-2

So it seems to be requesting a hostname as a feature....

@tatarsky
Contributor

It's weird: if you look at the "Required HostList", cpu-6-2 does not appear in it, yet I see you requesting it.

Opsys: ---  Arch: ---  Features: cpu-6-2
Required HostList: [gpu-1-12:1][gpu-1-13:1][gpu-1-16:1][gpu-1-17:1][gpu-1-14:1][gpu-1-15:1]
  [cpu-6-1:1][gpu-3-8:1][gpu-3-9:1][gpu-1-4:1][gpu-1-5:1][gpu-1-6:1]
  [gpu-2-17:1][gpu-2-16:1][gpu-2-15:1][gpu-2-14:1][gpu-2-12:1][gpu-2-11:1]
  [gpu-2-6:1][gpu-2-9:1][gpu-2-8:1]

@tatarsky
Contributor

From the queue file...

<submit_args flags="1"> -N pj_41d1c2f4-e8c0-11e4-97d2-5fd54d3e274e -l mem=12gb -l vmem=12gb -l pmem=12gb -l pvmem=12gb 
-l nodes=1:ppn=1 -l walltime=40:00:00 -l host=gpu-1-12+gpu-1-13+gpu-1-16+gpu-1-17+gpu-1-14+gpu-1-15+cpu-6-2+cpu-6-1+gpu-3-8
+gpu-3-9+gpu-1-4+gpu-1-5+gpu-1-6+gpu-2-17+gpu-2-16+gpu-2-15+gpu-2-14+gpu-2-12+gpu-2-11+gpu-2-6+gpu-2-9+gpu-2-8</submit_args
>

@vipints
Author

vipints commented Apr 22, 2015

Yes, correct, I am requesting specific hostnames in my submission argument. Due to the OOM issue, I have blacklisted the following nodes: ['gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7', 'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10']
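
For illustration, a sketch of how the -l host=... submit argument seen in the queue file above could be assembled from a node pool minus that blacklist; only the blacklist values come from this comment, and node_pool is a hypothetical subset.

```python
# Hypothetical sketch: build the "-l host=a+b+c" native specification by
# filtering a node pool against the OOM blacklist quoted above.
blacklist = {'gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7',
             'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10'}
node_pool = ['gpu-1-12', 'gpu-1-13', 'gpu-1-16', 'gpu-1-17', 'cpu-6-1', 'cpu-6-2']  # illustrative subset

allowed = [n for n in node_pool if n not in blacklist]
native_spec = ('-l mem=12gb -l nodes=1:ppn=1 -l walltime=40:00:00 '
               '-l host=' + '+'.join(allowed))
# e.g. "... -l host=gpu-1-12+gpu-1-13+gpu-1-16+gpu-1-17+cpu-6-1+cpu-6-2"
# jt.nativeSpecification = native_spec
```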

@tatarsky
Contributor

Try the submit without the blacklist. The OOM issue is not node related. I continue to work on the best solution to it.

@tatarsky
Contributor

That's exciting, because I believe from some threads the limit was 1024. Let's declare victory at 10K ;)

@tatarsky
Contributor

tatarsky commented Nov 9, 2015

So what do you think the count is at?

@vipints
Author

vipints commented Nov 9, 2015

So far I have reached 4287.

@tatarsky
Contributor

tatarsky commented Nov 9, 2015

Very cool. I'll ask again in 14 days, which is my guesstimate for reaching 10K. While it seems likely that was the fix, let's let it ride some more.

@vipints
Author

vipints commented Nov 12, 2015

@tatarsky: as of today I have reached a total of 5976 finished jobs, but now I am triggering the max_num_job_reached error message from drmaa. Seems like it is not happy with the patch...
There is a new version, pbs-drmaa-1.0.19, available; I am just comparing its changes against the previous version we are using.
@cganote: just checking, are your drmaa jobs OK with the patch?
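
A minimal, hedged sketch of catching that condition from the Python side, assuming the message text is the only signal pbs-drmaa gives; the helper name is hypothetical, and drmaa.errors.DrmaaException is the base exception class of the drmaa-python bindings.

```python
# Hypothetical helper: surface the exhausted pbs-drmaa job-id counter as a
# distinct error instead of a generic DRMAA exception.
import drmaa

def submit_or_flag_counter(session, jt):
    try:
        return session.runJob(jt)
    except drmaa.errors.DrmaaException as exc:
        # Per this thread, only a pbs_server restart clears the counter once
        # the max_num_job_reached message starts appearing.
        if 'max_num_job' in str(exc).lower():
            raise RuntimeError('pbs-drmaa job counter exhausted; '
                               'ask the admins to restart pbs_server')
        raise
```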

@cganote

cganote commented Nov 12, 2015

I haven't seen any issues, but maybe I'm not getting enough jobs submitted through drmaa? I certainly haven't had 6000 yet.

-Carrie

@vipints
Author

vipints commented Nov 12, 2015

Thanks @cganote.

@tatarsky
Contributor

That's a rather odd number. I'd like to poke around for a bit before I restart pbs_server, to see if I learn anything new. Are you under time pressure to get more of these in?

@vipints
Author

vipints commented Nov 12, 2015

If you can find some time this evening, I would be happy. Thanks!

@tatarsky
Contributor

I see a similar code 15007 response in the logs claiming "unauthorized request".

@tatarsky
Contributor

No new information gained. Restarted pbs_server.

@vipints
Author

vipints commented Nov 25, 2015

@tatarsky, this time drmaa reached the max_num_jobs limit after just 366 job requests.

@vipints
Author

vipints commented Nov 25, 2015

Seems like odd behavior this time.

@vipints
Author

vipints commented Nov 25, 2015

Whenever you have a small time window, I need a restart of pbs_server, thank you.

@tatarsky
Contributor

Restarted. I have this slated for possible test attempts on the new scheduler head I've built. The current issue is that I need some nodes to test that system with, and we're working on a schedule. It seems that the patch does not solve the problem, but it's unclear whether it hurts overall or helps. Seems weird this one didn't even get to the "normal" 1024 or so.

@vipints
Author

vipints commented Nov 25, 2015

It could be that someone else is also using drmaa to submit jobs to the cluster; the count of 366 jobs is just from my side.

Is anybody else using the drmaa/Python combination to submit jobs on hal?

@tatarsky
Contributor

Not that I've ever heard of.

@tatarsky
Contributor

tatarsky commented Dec 4, 2015

This world's-longest issue may be further attacked via #349. However, it's unclear how it would be attacked at this moment in time.

@raylim

raylim commented Jul 1, 2016

Has there been any progress on this issue? Just encountered it today.

$ ipython
In [1]: import drmaa
In [2]: s = drmaa.Session()
In [3]: s.initialize()
In [4]: jt = s.createJobTemplate()
In [5]: jt.remoteCommand = 'hostname'
In [6]: jobid = s.runJob(jt)
In [7]: retval = s.wait(jobid)
In [8]: retval
Out[8]: JobInfo(jobId=u'7551501.hal-sched1.local', hasExited=False, hasSignal=False, terminatedSignal=u'unknown signal?!', hasCoreDump=False, wasAborted=True, exitStatus=127, resourceUsage={u'mem': u'0', u'start_time': u'1467389618', u'queue': u'batch', u'vmem': u'0', u'hosts': u'gpu-1-4/4', u'end_time': u'1467389619', u'submission_time': u'1467389616', u'cpu': u'0', u'walltime': u'0'})
In [9]: retval.exitStatus
Out[9]: 127
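
As a small follow-up to the session above: a healthy run of hostname should come back with wasAborted=False and exitStatus 0, and the template and session can then be released explicitly (both cleanup calls are part of the standard drmaa-python API). A sketch of such a check:

```python
# Continuation of the IPython session above: a passing run satisfies this check,
# while the failure mode reported here trips it with exitStatus=127.
assert not retval.wasAborted and retval.exitStatus == 0, \
    'job bounced off the worker node (exitStatus=%s)' % retval.exitStatus

# Standard drmaa-python cleanup.
s.deleteJobTemplate(jt)
s.exit()
```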

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

No. No precise solution has ever been found. I will restart pbs_server and you can tell me if it works after that. Then the item will be moved to Fogbugz.

@vipints
Author

vipints commented Jul 1, 2016

I am not sure whether we ever found a way to fix this. If you are getting that error, it means drmaa has reached the max_num_of_jobs limit; to fix it you may need a pbs_server restart by the admins to clear the job ids.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Server restarted to confirm your example is a case of this. If so, open a ticket in FogBugz via the email address listed in the /etc/motd on hal. I won't be processing items here further.

@vipints
Author

vipints commented Jul 1, 2016

Sorry, I meant to report it via email to the cbio-admin group. Thanks @tatarsky!

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

That's fine. This ticket has a long, long, gory history, but all further attempts to figure it out require involvement by the primary support, which as of today is MSKCC staff. I will assist them as needed, but I don't feel this is likely to be trivially fixed. As we both know, DRMAA is quite a hack for Torque.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

I do notice that since we last battled this there has been another release of pbs-drmaa, 1.0.19.

Perhaps by some Friday miracle it uses the Torque 5.0 submit call instead of the crufty 4.0 one that seems to be buggy.

@vipints
Author

vipints commented Jul 1, 2016

Yeah, that is correct; it seems like they have support for v5. Maybe we can try it after the long weekend. I hadn't checked the recent release version.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

I see we actually noticed it when it came out last year. I see nothing overly "Torque 5" in it yet.

I am unlikely to look at this further today. Confirm/deny that your example now works with pbs_server restarted and open a ticket for some work next week.

@raylim

raylim commented Jul 1, 2016

Yes, Python drmaa job submission works now.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Kick an email to the address listed for problem reports in /etc/motd (sorry I'm not placing it again in the public Git) to start tracking it there. We'll reference this Git thread but we no longer process bugs here.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Not that this one is likely to be fixable anytime soon. We've tried for many years and DRMAA is basically not well supported by Adaptive.
