ECSWorker fails to submit tasks to cluster #301

m-steinhauer · 2023-08-01T15:54:03Z

I'm facing the following issue: we are using an EC2-backed ECS cluster on AWS and a Prefect flow that triggers a number of subflows using the run_deployment method. Let's say we have capacity for 20 tasks in our cluster and trigger 200 subflows; most of the time, the first 20 submitted flow runs fail immediately (and are not retried). The other flow runs are submitted to the queue and processed correctly. Sometimes the submission also fails at random while the flows are running. We have also limited the concurrency on the queue and on the Prefect worker according to our available capacity.

Expectation

Flows are submitted successfully or queued if no capacity in the cluster is available.

It looks like the ECS client is sometimes unable to place the task on the cluster, as it fails with the IndexError (list index out of range) shown below. Sadly, I cannot see any more details in the AWS response, so it is hard to analyze why the task cannot be placed.
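One way to get more detail might be to look at the `failures` list that the ECS RunTask response returns alongside `tasks`. The sketch below is only an illustration: `summarize_run_task_failures` is a hypothetical helper (not part of prefect-aws), and the sample response uses made-up values to show the shape of the data.

```python
def summarize_run_task_failures(response: dict) -> list[str]:
    """Turn the `failures` entries of a run_task response into readable strings."""
    summaries = []
    for failure in response.get("failures", []):
        arn = failure.get("arn", "<unknown resource>")
        reason = failure.get("reason", "<no reason given>")
        detail = failure.get("detail")
        # Append the optional detail only when ECS provided one.
        summaries.append(f"{arn}: {reason}" + (f" ({detail})" if detail else ""))
    return summaries


# Illustrative response shape only; the ARN and reason are sample values,
# not output captured from the cluster in this issue.
sample_response = {
    "tasks": [],
    "failures": [
        {
            "arn": "arn:aws:ecs:region:account:container-instance/example",
            "reason": "RESOURCE:MEMORY",
        }
    ],
}

for line in summarize_run_task_failures(sample_response):
    print(line)
```

If `tasks` comes back empty, logging these summaries before indexing into the list would probably show why placement failed instead of a bare IndexError.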

Environment

prefect 2.11.1
prefect-aws 0.3.6
python 3.10
ECS cluster with EC2 instances

Traceback

Failed to submit flow run '1ab332c5-d7e5-43bb-a9de-089eb115b0ec' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 538, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 695, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 691, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 1424, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
IndexError: list index out of range

Improvement

Make the _create_task_run method more robust for the case where the ECS client cannot submit the job to the underlying cluster on the first try.
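As a rough sketch of what "more robust" could mean here, the function below retries `run_task` when the response contains no tasks and raises a descriptive error once the attempts are used up. This is an assumption about one possible approach, not the actual prefect-aws implementation: `create_task_run_with_retry`, `TaskPlacementError`, `max_attempts`, and `delay_seconds` are all hypothetical names.

```python
import time


class TaskPlacementError(RuntimeError):
    """Hypothetical error raised when ECS returns no task after all attempts."""


def create_task_run_with_retry(
    ecs_client,
    task_run_request: dict,
    max_attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep=time.sleep,
) -> dict:
    """Call run_task until a task is returned or the attempts are exhausted."""
    last_failures = []
    for attempt in range(1, max_attempts + 1):
        response = ecs_client.run_task(**task_run_request)
        tasks = response.get("tasks", [])
        if tasks:
            return tasks[0]
        # No task was created; remember why so the final error is informative.
        last_failures = response.get("failures", [])
        if attempt < max_attempts:
            sleep(delay_seconds)
    reasons = ", ".join(
        failure.get("reason", "unknown") for failure in last_failures
    ) or "no reason reported"
    raise TaskPlacementError(
        f"ECS did not start a task after {max_attempts} attempts: {reasons}"
    )
```

With a real client this would be called as `create_task_run_with_retry(boto3.client("ecs"), task_run_request)`. A fixed delay keeps the sketch short; exponential backoff may be a better fit when the cluster is at capacity for longer periods.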


m-steinhauer changed the title ~~ECSWorker fails to submit tasks cluster~~ ECSWorker fails to submit tasks to cluster Aug 1, 2023

desertaxle mentioned this issue Aug 7, 2023

Adds retries to ECS task run creation for ECS worker #303

Merged

5 tasks

desertaxle closed this as completed in #303 Aug 7, 2023

coffeeandcloud mentioned this issue Sep 28, 2023

Improve error, if ECS task can not be submitted #282

Closed

1 task
