Request to copy stageout files failed (redux) #186
Another one. Could this be related to #182?
Not us right now. We are not running migrations.
Looking for other possible causes as I can.
gpu-1-12 seems extremely loaded down... trying to determine why. It appears in three of today's five messages; gpu-1-4 is the other node.
3 more in the past 20 minutes, all on gpu-1-12.
You have the highest percentage of these in the logs. Anything interesting about the qsubs I can report to Adaptive? I am looking at the section of the pbs_mom that generates this. I suspect it's migration-load related, judging from the relative times. I've backed it off for the morning and will update #182 as the day goes on if that changes.
Not really. The calls are all of the form:
Draining gpu-1-12 for inspection, just in case it's node related. I noted the recent ones from it. But that will be a while.
the option "-d ." might be problematic. Try "-d |
I guess you meant |
I can try this in a future batch of runs. However, I do not think this is the problem for two reasons:
This argues to me that the problem lies with disk access, not with the parameters to the job.
Actually, if you don't mind, in about 4-5 hours there will be a period where the migration is at a step where there should be no disk I/O, similar to when we see these. If you could NOT make any changes on your side of the fence, it would be helpful. I believe (but haven't had time to dig heavily into the code surrounding this error) it's a form of overload manifestation, though I'm not clear why it would "give up" rather than simply wait. So I will look closer, but am very involved in the migration right now as a complex step is coming up.
No problem. The jobs are continuously submitted to the server by a master script so as to not overload the scheduler... if that's still even a problem.
We can test that another day as well ;)
This is just to document that at 5:00PM all migration tasks are idle for a moment. During this time, all I/O is "regular I/O". I expect to return to migration tasks later this evening.
Please note I will re-start looking at this shortly now that at least one very large variable is gone. Not today, but it's back on my run queue.
So the primary hit I see on this error now, post-migration, comes from the jobs of @nrosed I believe. It would be interesting to compare notes on those submit scripts. File system I/O appears quite light.
The incidence of these is being looked at again, as I see no compelling reason these are filesystem related. There are periodic bursts, however. I am looking more closely at the code that surrounds this error message.
We are looking at this in light of the recent E_NOMEM token incident, to see if it is related.
Since you are currently looking at this again: I see it still occurring in my logs once in a while:
And while I've never figured this out, I may attempt to determine whether there is a setting I can alter for this in #349.
So, also in the ancient bug bin, I've been working on trying to figure this one out. I've traced a single occurrence of this problem on the new Torque system to a specific Torque function. I've submitted > 100000 jobs to calculate Pi to 10000 places and got one of these failures today. The code involved appears to have a hardwired 4 attempts, and a rather silly sleep line to wait before trying again. I should note that my jobs and their output areas are NOT in GPFS homedir locations, but in a separate one via NFS that is a convention of the ROCKS environment, so it's a good control case. So I have at least a very specific location in the code I can take to Adaptive again, and I believe I can turn up a debug knob on the new test system (one that would be very impacting on the real one) and perhaps get the precise exit status code if I issue another 100000 jobs ;) Windmills. Tilting. Me.
Oh, and in case I lose where I was: the items in question are in the pbs_mom code at sys/resmom/requests.c, line 3088.
Here BTW is the loop around the attempts to copy:
I believe that results in sleeps of 0, 4, 0, and 7 seconds between the attempts. I wish that were explained; I have no idea why you would use the mod to skip the sleep on the odd-numbered passes of the loop.
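For reference, here is a minimal sketch of the retry pattern as I read it. This is my reconstruction from the description above, not the actual Torque source; the attempt count, the stand-in `try_copy()` helper, and the sleep expression are all assumptions, chosen only because they reproduce the 0/4/0/7-second pattern noted above.

```c
/* Hypothetical reconstruction of the stageout copy retry loop described
 * above -- NOT the actual code from requests.c.  Assumes a hardwired
 * four attempts and a sleep expression guessed to yield 0, 4, 0, 7
 * seconds of delay before the successive attempts. */
#include <stdio.h>
#include <unistd.h>

/* stand-in for the real copy routine invoked by pbs_mom (hypothetical) */
static int try_copy(void)
  {
  return -1;  /* pretend every attempt fails so the whole loop runs */
  }

int main(void)
  {
  int loop;
  int rc = -1;

  for (loop = 0; loop < 4; ++loop)   /* hardwired 4 attempts */
    {
    /* only odd-numbered passes actually sleep: 0, 4, 0, 7 seconds */
    unsigned int delay = (loop % 2) * (3 * (loop / 2) + 4);

    sleep(delay);

    if ((rc = try_copy()) == 0)
      break;                         /* copy succeeded, stop retrying */
    }

  if (rc != 0)
    fprintf(stderr,
            "request to copy stageout files failed after %d attempts\n",
            loop);

  return rc;
  }
```

If that is roughly what the real loop does, the total back-off is only about 11 seconds, which would explain why a transiently overloaded node "gives up" rather than riding out the load.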
Referencing issue #173, which I can't seem to re-open. I hope this is not a bad portent.