Possibility of deadlock causing timeout errors #407

Open
ProjectsByJackHe opened this issue Nov 5, 2024 · 1 comment
Labels: azure (Specific to Azure environment), bug (Something isn't working), P2

@ProjectsByJackHe (Collaborator):

As demand for netperf scales, so does the number of concurrent jobs. However, Azure limits how many VMs a pool can create.

Normally this isn't a problem with a pool model like the one 1ES has: a 1-machine pool can still complete N jobs; it just takes a long time.

Here's the problem.

  • Let X be the maximum number of machines a pool can create.
  • Let Y be the number of workflows running concurrently.
  • The API for integrating GitHub with Azure 1ES involves pushing M jobs in a workflow, and Azure randomly assigns machines to those jobs. If X >= M, every job can run concurrently; otherwise, some random subset runs.
  • Our networking perf jobs each require a pair of machines (a client and a server), so if a workflow defines Z perf scenarios, netperf will generate M = 2 * Z jobs to request machines from Azure.

Right now, we have multiple pools and 2 * Z < X for every pool and set of scenarios, so this isn't an issue.

But as we scale, with multiple PRs coming in from multiple projects, we reach a point where Y * (2 * Z) > X. Then we can get unlucky and hit a deadlock. Even without a deadlock, many jobs will sit queued for a long time before a machine is assigned, and they will fail against the timeout restrictions enforced by netperf.
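To make that condition concrete, here is a minimal sketch with made-up numbers; the actual pool sizes and scenario counts are whatever the 1ES pools and workflows define:

```python
# Hypothetical numbers; real values depend on the 1ES pool and the workflows.
X = 8           # maximum machines the pool can create
Z = 3           # perf scenarios defined by one workflow
M = 2 * Z       # jobs one workflow pushes (client + server per scenario)
Y = 2           # workflows running concurrently

requested = Y * M
if requested > X:
    # More machines are requested than the pool can ever provide at once, so
    # some jobs queue for a long time and risk hitting netperf's timeout.
    print(f"Oversubscribed: {requested} machines requested, pool max is {X}")
```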

To illustrate the deadlock possibility:
Let's say we have a 2-machine pool, and a workflow has 2 perf scenarios (A and B), so netperf will generate 4 jobs to request 4 machines from Azure.

Generated Job for scenario A (client) - Assigned
Generated Job for scenario A (server) - Waiting...

Generated Job for scenario B (client) - Assigned
Generated Job for scenario B (server) - Waiting...
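Neither server job can ever be assigned, because the two client jobs hold the machines while waiting on their peers, so nothing completes and nothing frees up. To get a feel for how often this unlucky outcome happens, here is a minimal Monte Carlo sketch. It assumes Azure hands the free machines to a uniformly random subset of the queued jobs, which is a simplification of whatever 1ES actually does; `unlucky_assignment` is just an illustrative name.

```python
import random

def unlucky_assignment(pool_size: int, scenarios: int, trials: int = 100_000) -> float:
    """Estimate how often a random assignment of `pool_size` machines to the
    2 * `scenarios` queued jobs leaves no scenario with both its client and
    server running, i.e. nothing can complete and free a machine."""
    jobs = [(s, role) for s in range(scenarios) for role in ("client", "server")]
    deadlocks = 0
    for _ in range(trials):
        assigned = random.sample(jobs, min(pool_size, len(jobs)))
        per_scenario = {}
        for s, _role in assigned:
            per_scenario[s] = per_scenario.get(s, 0) + 1
        # Deadlock: every assigned machine is waiting on a peer that never arrives.
        if all(count < 2 for count in per_scenario.values()):
            deadlocks += 1
    return deadlocks / trials

# The 2-machine pool / 2-scenario case above: under this model, roughly
# 2 out of 3 random assignments end up deadlocked (~0.67).
print(unlucky_assignment(pool_size=2, scenarios=2))
```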

@ProjectsByJackHe (Collaborator, Author):


The proposed solution here is in the /jackhe/add-pool-usage-tracker branch, which introduces a LOAD BALANCER.

Essentially, in the Prepare-Matrix step, we call the GitHub API and query for the number of Azure machines dispatched across all concurrent workflow runs. We also know how many machines the current workflow needs.

So if the number of requested machines plus the number of currently dispatched machines exceeds some threshold, we wait in the Prepare-Matrix step instead of queueing the actual Azure jobs and having them time out.
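As a rough sketch of that idea (not the actual code on the /jackhe/add-pool-usage-tracker branch, which may differ in language and detail): query the GitHub Actions REST API for in-progress workflow runs, count their queued and in-progress jobs as a proxy for dispatched machines, and spin in Prepare-Matrix until there is headroom. The repo name, `MAX_MACHINES` threshold, and poll interval below are placeholders, and a real version would also need to look at queued runs, paginate, and only count jobs that actually target the Azure pools.

```python
import os
import time
import requests

OWNER, REPO = "microsoft", "netperf"   # placeholder target repository
MAX_MACHINES = 20                      # placeholder pool-wide machine budget
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def dispatched_machines() -> int:
    """Count queued/in-progress jobs across active workflow runs as a proxy
    for machines currently requested from the Azure pools."""
    runs = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs",
        headers=HEADERS,
        params={"status": "in_progress"},
    ).json().get("workflow_runs", [])
    count = 0
    for run in runs:
        jobs = requests.get(
            f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs/{run['id']}/jobs",
            headers=HEADERS,
        ).json().get("jobs", [])
        count += sum(1 for job in jobs if job["status"] in ("queued", "in_progress"))
    return count

def wait_for_capacity(machines_needed: int, poll_seconds: int = 60) -> None:
    """Block in the Prepare-Matrix step until the machines this workflow needs,
    plus the machines already dispatched, fit under the threshold."""
    while dispatched_machines() + machines_needed > MAX_MACHINES:
        time.sleep(poll_seconds)
```

Waiting before the matrix is emitted means the actual Azure jobs are never queued while the pools are saturated, so they can't sit in the queue long enough to hit netperf's timeout.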
