Don't immediately resume _monitor_active_jobs on exception #368

Merged (1 commit) on Jun 18, 2024

Conversation

mvdbeek
Member

@mvdbeek mvdbeek commented Jun 18, 2024

Got a bunch of these on rockfish, and I don't think we're helping
ourselves by calling os.listdir every 5ms:
```
2024-06-11 12:42:09,485 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
2024-06-11 12:42:09,489 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
2024-06-11 12:42:09,494 ERROR [pulsar.managers.stateful][[manager=rockfish]-[action=monitor]] Failure in stateful manager monitor step.
Traceback (most recent call last):
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 364, in _run
    self._monitor_active_jobs()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 369, in _monitor_active_jobs
    active_job_ids = self.stateful_manager.active_jobs.active_job_ids()
  File "/data/nekrut/galaxy/main/pulsar/venv/lib/python3.9/site-packages/pulsar/managers/stateful.py", line 310, in active_job_ids
    job_ids = os.listdir(target_directory)
OSError: [Errno 23] Too many open files in system: '/scratch4/nekrut/galaxy/main/pulsar/var/rockfish-active-jobs'
```
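For illustration, the idea in the title (pause after a failed monitor step instead of retrying right away) could look roughly like the sketch below. This is not the actual diff from this PR: `MonitorLoop`, `step`, `interval`, and `failure_delay` are hypothetical names, and the delay values are assumptions chosen for the example.

```python
import logging
import threading

log = logging.getLogger(__name__)


class MonitorLoop:
    """Hypothetical monitor loop; a sketch, not Pulsar's actual class."""

    def __init__(self, step, interval=1.0, failure_delay=5.0):
        self.step = step                    # callable doing one monitor pass
        self.interval = interval            # assumed pause after a successful pass
        self.failure_delay = failure_delay  # assumed longer pause after a failure
        self._stop = threading.Event()

    def stop(self):
        self._stop.set()

    def run(self):
        while not self._stop.is_set():
            try:
                self.step()
                delay = self.interval
            except Exception:
                # Log the failure, then back off rather than retrying at once,
                # so a resource error such as ENFILE is not hammered in a tight loop.
                log.exception("Failure in monitor step.")
                delay = self.failure_delay
            # Event.wait acts as a sleep that stop() can interrupt.
            self._stop.wait(delay)
```

With something like this, a step that keeps raising `OSError: [Errno 23]` would be retried once per `failure_delay` instead of every few milliseconds.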
@mvdbeek mvdbeek merged commit e726e5a into galaxyproject:master Jun 18, 2024
10 of 13 checks passed