As the demand for netperf scales, so does the number of concurrent jobs. However, Azure caps the number of VMs a pool can create.
Normally this isn't a problem with a pool model like 1ES's: a 1-machine pool can eventually complete N jobs; it just takes a long time.
Here's the problem.
Let X be the maximum number of machines a pool can create.
Let Y be the number of workflows running concurrently.
The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow; Azure then assigns machines to those jobs in no guaranteed order. If X >= M, every job can run concurrently. Otherwise, some arbitrary subset will run.
Our networking perf jobs require a pair of machines, so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.
Right now we have multiple pools and 2 * Z < X for every pool and scenario set, so this isn't an issue.
But as we scale, with multiple PRs coming in from multiple projects, we reach a point where Y * (2 * Z) > X.
We can get unlucky and hit a deadlock. And even without a deadlock, many jobs will sit queued for a long time before a machine is assigned, and they will fail the timeout restrictions that netperf enforces.
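To make the capacity inequality concrete, here is a quick arithmetic sketch with hypothetical numbers (the values of X, Y, and Z below are illustrative, not our actual pool limits):

```python
# Hypothetical numbers: pool capacity X, concurrent workflows Y,
# perf scenarios per workflow Z (each scenario needs a client + server VM).
X = 10  # maximum machines the pool can create
Y = 3   # concurrent workflow runs
Z = 2   # perf scenarios per workflow

machines_requested = Y * (2 * Z)  # every scenario requests 2 machines
print(machines_requested)         # 12 machines requested vs. capacity of 10
print(machines_requested > X)     # True -> jobs queue up, deadlock is possible
```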
To illustrate the deadlock possibility:
Let's say we have a 2-machine pool, and a workflow has 2 perf scenarios (A and B), so netperf will generate 4 jobs to request 4 machines from Azure.
Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...
Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
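The assignment above can be checked exhaustively. This sketch (a simplified model, not the real scheduler) enumerates every way the pool's 2 machines can land on the 4 generated jobs, treating an assignment as deadlocked when no scenario holds both its client and server job at once, so nothing can finish and free a machine:

```python
import itertools

# A 2-machine pool assigns machines to some subset of the 4 generated jobs.
# A scenario makes progress only if BOTH of its jobs (client + server)
# hold a machine at the same time.
jobs = ["A-client", "A-server", "B-client", "B-server"]
POOL_SIZE = 2

def is_deadlocked(assigned):
    # Deadlock: neither scenario has both of its jobs assigned, so no
    # scenario can complete and release a machine.
    for scenario in ("A", "B"):
        if f"{scenario}-client" in assigned and f"{scenario}-server" in assigned:
            return False
    return True

# Check every possible assignment of the 2 machines to the 4 jobs.
deadlocking = [s for s in itertools.combinations(jobs, POOL_SIZE)
               if is_deadlocked(s)]
print(len(deadlocking))  # 4 of the 6 possible assignments deadlock
```

The assignment in the illustration (A-client and B-client) is one of those four stuck states.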
The proposed solution lives in the /jackhe/add-pool-usage-tracker branch, which introduces a LOAD BALANCER.
Essentially, in the Prepare-Matrix step we call the GitHub API to count the Azure machines dispatched across all concurrent workflow runs, and we know how many machines the current workflow needs.
If the number of requested machines plus the currently dispatched machines exceeds some threshold, we wait inside Prepare-Matrix, so we don't time out when we later queue the actual Azure jobs.
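A minimal sketch of that back-off loop, under stated assumptions: `count_dispatched_machines` stands in for the GitHub API query that tallies machines in use across concurrent runs, and the threshold and poll interval are placeholder values, not the branch's actual constants:

```python
import time

# Assumed values for illustration; the real branch may use different numbers.
POOL_CAPACITY_THRESHOLD = 10  # safety limit, kept below the pool's true max
POLL_INTERVAL_SECONDS = 60

def wait_for_capacity(machines_needed, count_dispatched_machines,
                      sleep=time.sleep):
    """Block in Prepare-Matrix until the pool can absorb this workflow's jobs.

    count_dispatched_machines: callable returning the number of Azure
    machines currently dispatched across all concurrent workflow runs
    (in practice, derived from a GitHub API query).
    """
    while (count_dispatched_machines() + machines_needed
           > POOL_CAPACITY_THRESHOLD):
        sleep(POLL_INTERVAL_SECONDS)

# Example: the pool drains from 9 to 4 dispatched machines; we wait once
# (9 + 4 > 10), then proceed (4 + 4 <= 10).
readings = iter([9, 4])
waits = []
wait_for_capacity(4, lambda: next(readings), sleep=waits.append)
print(waits)  # [60]
```

Waiting before the matrix is emitted means the queued Azure jobs start their own timeout clocks only once capacity is actually available.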