This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
I'm facing the following issue: we are using an EC2-backed ECS cluster on AWS and a Prefect flow that triggers a number of subflows using the run_deployment method. Let's say we have capacity for 20 tasks in our cluster and trigger 200: most of the time, the first 20 submitted flows fail immediately (and are not retried). The remaining runs are submitted to the queue and processed correctly. Sometimes submission also fails at random while the flows are running. We have also limited the concurrency on the queue and the Prefect worker according to our available capacity.
Expectation
Flows are submitted successfully or queued if no capacity in the cluster is available.
It looks like the ECS client is sometimes unable to place the task on the cluster, as it fails with the IndexError below. Sadly, I cannot see any more details from the AWS response, so it is hard to analyze why the task cannot be placed.
Failed to submit flow run '1ab332c5-d7e5-43bb-a9de-089eb115b0ec' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 538, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 695, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 691, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 1424, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
IndexError: list index out of range
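For context on why the error carries no detail: when ECS cannot place a task, the boto3 run_task call does not raise. It returns an empty "tasks" list and reports the problem in a separate "failures" list, so indexing ["tasks"][0] discards the actual reason. Below is a minimal sketch of how that reason could be surfaced. The names first_task_or_raise and TaskPlacementError are hypothetical and not part of prefect-aws; only the "tasks" and "failures" response keys come from the ECS API.

```python
class TaskPlacementError(RuntimeError):
    """Hypothetical exception that carries the ECS failure reasons."""


def first_task_or_raise(response: dict) -> dict:
    """Return the first started task, or raise with the reasons ECS reported."""
    tasks = response.get("tasks") or []
    if tasks:
        return tasks[0]

    # An empty "tasks" list means nothing was placed; the explanation,
    # if any, is in "failures" (each entry may have "arn" and "reason").
    failures = response.get("failures") or []
    reasons = ", ".join(
        f"{f.get('reason', 'unknown')} ({f.get('arn', 'no arn')})"
        for f in failures
    ) or "no failure details returned"
    raise TaskPlacementError(f"ECS did not start a task: {reasons}")
```

Used in place of the direct index, this would turn the IndexError into a message that names the placement failure reason reported by ECS.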
Improvement
Make the _create_task_run method more robust for the case where the ECS client cannot place the task on the underlying cluster at the first attempt.
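One possible shape for this, as a rough sketch rather than a proposed patch: retry the submission with exponential backoff while ECS returns no task. The function run_task_with_retry and its parameters (attempts, base_delay, sleep) are hypothetical; the only thing assumed about the client is that run_task returns a dict with a "tasks" list and, on placement problems, a "failures" list.

```python
import time
from typing import Callable


def run_task_with_retry(
    run_task: Callable[..., dict],
    request: dict,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call run_task until ECS returns a task or the attempts are used up."""
    last_failures: list = []
    for attempt in range(attempts):
        response = run_task(**request)
        tasks = response.get("tasks") or []
        if tasks:
            return tasks[0]

        # Nothing was placed: remember why, then back off before retrying.
        last_failures = response.get("failures") or []
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)

    raise RuntimeError(
        f"ECS did not place the task after {attempts} attempts: {last_failures}"
    )
```

With a boto3 client this would be called as run_task_with_retry(ecs_client.run_task, task_run_request). Whether retrying inside the worker or returning the run to the queue is the better behavior is a design question for the maintainers.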
m-steinhauer changed the title from "ECSWorker fails to submit tasks cluster" to "ECSWorker fails to submit tasks to cluster" on Aug 1, 2023
Environment
prefect 2.11.1
prefect-aws 0.3.6
python 3.10
ECS cluster with EC2 instances