Performance metrics vary between active lab machines #221
From many recent Secnetperf runs, there is a noticeable performance difference between some lab machines, even though the code is exactly the same, the BIOS configuration is the same, and SR-IOV is enabled on both.

For example, in this run: https://github.com/microsoft/netperf/actions/runs/11565315387/job/32192210637, lab machines 05 and 10 were assigned to the Windows IOCP test job (WIN-2CSMQHE8ML4), and we were able to get sub-20 Gbps throughput on TCP + IOCP.

But in this run: https://github.com/microsoft/netperf/actions/runs/11564283006/job/32189283432, lab machines 25 and 26 were assigned to the Windows IOCP test job (WIN-B9SEU47NHOT), and we were getting sub-9.8 Gbps throughput on TCP + IOCP.

This is also true when comparing the lab machines hosting the static lab VMs with the dynamic lab VMs from the stateless lab (lab machines 42, 43). We need to investigate why performance data is so different between the static lab and the stateless lab.

This unblocks #73

Comments
Here is what the RSS configurations look like on a "good perf" pair: client: RR1-NETPERF-05-RSS.txt. Here is what the RSS configurations look like on a "bad perf" pair: client: RR1-NETPERF-25-RSS.txt. TL;DR: they look the same.
cc @mtfriesen, do you see any notable differences in these? I agree that the physical adapters ('Mellanox ConnectX-6 Dx Adapter') all look the same. The virtual adapters also look the same, except for some minor naming differences.
I don't see any obvious functional differences, but the two machines were obviously not set up using the same steps, because one set of machines has "vEthernet (200G)" in the NIC name, and the other has the default "vEthernet (Mellanox ConnectX-6 Dx Adapter - Virtual Switch)". If it ain't automated, it ain't gonna be repeatable or consistent, no matter how hard we manually try to get these two to behave the same.
I agree we need to standardize the setup process; that will be done as part of moving to dynamic machines. But since RSS appears to be the same, I suspect the problem is actually in the code. Unless there is code to consistently align threads with the RSS and NUMA configuration, tests will generally end up on different processors/threads/NUMA nodes from run to run; the scheduler or thread pool can only do so much for you (see the sketch below). So, the next step is to grab some CPU traces for 'good' and 'bad' runs. I suspect you will be able to get both on the same set of machines, given enough runs.
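For illustration only, here is a minimal sketch of the kind of NUMA alignment described above, using the documented Win32 APIs GetNumaNodeProcessorMaskEx and SetThreadGroupAffinity. The NIC_NUMA_NODE value is a hypothetical placeholder (in practice it would come from the adapter's RSS configuration, like the dumps attached above), and this is not secnetperf's actual threading code.

```c
// Minimal sketch, assuming the NIC's NUMA node number is already known
// (e.g. from the Get-NetAdapterRss output above). NIC_NUMA_NODE is a
// hypothetical placeholder, not a value taken from the lab machines.
#include <windows.h>
#include <stdio.h>

#define NIC_NUMA_NODE 0  // hypothetical: substitute the adapter's NUMA node

int main(void)
{
    GROUP_AFFINITY nodeAffinity = {0};
    PROCESSOR_NUMBER procNumber;

    // Look up the processor mask covering the NUMA node the NIC sits on.
    if (!GetNumaNodeProcessorMaskEx(NIC_NUMA_NODE, &nodeAffinity)) {
        fprintf(stderr, "GetNumaNodeProcessorMaskEx failed: %lu\n", GetLastError());
        return 1;
    }

    // Pin the current thread (e.g. an I/O worker) to that node, so it runs
    // on the same processors that service the NIC's RSS queues.
    if (!SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, NULL)) {
        fprintf(stderr, "SetThreadGroupAffinity failed: %lu\n", GetLastError());
        return 1;
    }

    // Report which processor the thread landed on; logging this per run is
    // one cheap way to see whether 'good' and 'bad' runs differ in placement.
    GetCurrentProcessorNumberEx(&procNumber);
    printf("Pinned to NUMA node %u: group %u, mask 0x%llx, now on CPU %u.%u\n",
           (unsigned)NIC_NUMA_NODE, (unsigned)nodeAffinity.Group,
           (unsigned long long)nodeAffinity.Mask,
           (unsigned)procNumber.Group, (unsigned)procNumber.Number);
    return 0;
}
```

Without something like this, thread placement is left entirely to the scheduler, which is consistent with seeing both 'good' and 'bad' results on the same hardware across runs.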