Possibility of deadlock causing timeout errors #407

Open
ProjectsByJackHe opened this issue Nov 5, 2024 · 1 comment
Labels: azure (Specific to Azure environment), bug (Something isn't working), P2

@ProjectsByJackHe (Collaborator):

As demand for netperf scales, so does the number of concurrent jobs. However, Azure limits how many VMs a pool can create.

Normally this isn't a problem with a pool model like the one 1ES has: a 1-machine pool can still complete N jobs; it just takes a long time.

Here's the problem.

  • Let X be the maximum number of machines a pool can create.
  • Let Y be the number of workflows running concurrently.
  • The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow, and Azure randomly assigns machines to those jobs. If X >= M, every job can run concurrently; otherwise, some random subset runs.
  • Our networking perf jobs each require a pair of machines (a client and a server), so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.

Right now, we have multiple pools and 2 * Z < X for every pool and set of scenarios, so this isn't an issue.

But as we scale, with multiple PRs coming in from multiple projects, we reach a point where Y * (2 * Z) > X. Then we can get unlucky and hit a deadlock. Even without a deadlock, many jobs will sit queued for a long time before a machine is assigned, and they will fail against the timeout restrictions enforced by netperf.
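To make that condition concrete, here is a minimal sketch with made-up numbers; the actual pool sizes and scenario counts are whatever the 1ES pools and workflows define:

```python
# Hypothetical numbers; real values depend on the 1ES pool and the workflows.
X = 8           # maximum machines the pool can create
Z = 3           # perf scenarios defined by one workflow
M = 2 * Z       # jobs one workflow pushes (client + server per scenario)
Y = 2           # workflows running concurrently

requested = Y * M
if requested > X:
    # More machines are requested than the pool can ever provide at once, so
    # some jobs queue for a long time and risk hitting netperf's timeout.
    print(f"Oversubscribed: {requested} machines requested, pool max is {X}")
```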

To illustrate the deadlock possibility:
Let's say we have a 2-machine pool, and a workflow has 2 perf scenarios (A and B), so netperf will generate 4 jobs to request 4 machines from Azure.

Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...

Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
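Neither server job can ever be assigned, because the two client jobs hold the machines while waiting on their peers, so nothing completes and nothing frees up. To get a feel for how often this unlucky outcome happens, here is a minimal Monte Carlo sketch. It assumes Azure hands the free machines to a uniformly random subset of the queued jobs, which is a simplification of whatever 1ES actually does; `unlucky_assignment` is just an illustrative name.

```python
import random

def unlucky_assignment(pool_size: int, scenarios: int, trials: int = 100_000) -> float:
    """Estimate how often a random assignment of `pool_size` machines to the
    2 * `scenarios` queued jobs leaves no scenario with both its client and
    server running, i.e. nothing can complete and free a machine."""
    jobs = [(s, role) for s in range(scenarios) for role in ("client", "server")]
    deadlocks = 0
    for _ in range(trials):
        assigned = random.sample(jobs, min(pool_size, len(jobs)))
        per_scenario = {}
        for s, _role in assigned:
            per_scenario[s] = per_scenario.get(s, 0) + 1
        # Deadlock: every assigned machine is waiting on a peer that never arrives.
        if all(count < 2 for count in per_scenario.values()):
            deadlocks += 1
    return deadlocks / trials

# The 2-machine pool / 2-scenario case above: under this model, roughly
# 2 out of 3 random assignments end up deadlocked (~0.67).
print(unlucky_assignment(pool_size=2, scenarios=2))
```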

@ProjectsByJackHe (Collaborator, Author):


The proposed solution here is in the /jackhe/add-pool-usage-tracker branch, which introduces a LOAD BALANCER.

Essentially, in the Prepare-Matrix step, we call the GitHub API and query for the number of Azure machines dispatched across all concurrent workflow runs. We also know how many machines the current workflow needs.

So if the number of requested machines plus the number of currently dispatched machines exceeds some threshold, we wait in the Prepare-Matrix step instead of queueing the actual Azure jobs and having them time out.
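As a rough sketch of that idea (not the actual code on the /jackhe/add-pool-usage-tracker branch, which may differ in language and detail): query the GitHub Actions REST API for in-progress workflow runs, count their queued and in-progress jobs as a proxy for dispatched machines, and spin in Prepare-Matrix until there is headroom. The repo name, `MAX_MACHINES` threshold, and poll interval below are placeholders, and a real version would also need to look at queued runs, paginate, and only count jobs that actually target the Azure pools.

```python
import os
import time
import requests

OWNER, REPO = "microsoft", "netperf"   # placeholder target repository
MAX_MACHINES = 20                      # placeholder pool-wide machine budget
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def dispatched_machines() -> int:
    """Count queued/in-progress jobs across active workflow runs as a proxy
    for machines currently requested from the Azure pools."""
    runs = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs",
        headers=HEADERS,
        params={"status": "in_progress"},
    ).json().get("workflow_runs", [])
    count = 0
    for run in runs:
        jobs = requests.get(
            f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs/{run['id']}/jobs",
            headers=HEADERS,
        ).json().get("jobs", [])
        count += sum(1 for job in jobs if job["status"] in ("queued", "in_progress"))
    return count

def wait_for_capacity(machines_needed: int, poll_seconds: int = 60) -> None:
    """Block in the Prepare-Matrix step until the machines this workflow needs,
    plus the machines already dispatched, fit under the threshold."""
    while dispatched_machines() + machines_needed > MAX_MACHINES:
        time.sleep(poll_seconds)
```

Waiting before the matrix is emitted means the actual Azure jobs are never queued while the pools are saturated, so they can't sit in the queue long enough to hit netperf's timeout.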
