
performance waves in seconds #1303

Closed · divfor opened this issue Jan 8, 2020 · 6 comments
Labels: evaluation needed, performance

Comments

divfor (Contributor) commented Jan 8, 2020

I noticed that when I use a powerful machine to stress two other servers, the request rate fluctuates periodically, with two CPU cores pinned at 100%:
[screenshot: request rate over time]

CPU usage at the high point of the period: [screenshot]

CPU usage at the low point of the period: [screenshot]

divfor (Contributor, Author) commented Jan 8, 2020

[screenshot]
k6 v0.26.0-dev (dev build, go1.13.5, linux/amd64) with multiple_NICs_feature
script:

import http from "k6/http";
import { check, sleep, fail } from "k6";
//import { Trend, Rate, Counter, Gauge } from "k6/metrics";

//export let TrendRTT = new Trend("RTT");
//export let RateContentOK = new Rate("Content OK");
//export let GaugeContentSize = new Gauge("ContentSize");
//export let CounterErrors = new Counter("Errors");

export let options = {
	hosts: {
		"s1.fred.com": "27.126.207.43",
		"s2.fred.com": "27.126.207.43",
		"s3.fred.com": "27.126.207.43",
		"s4.fred.com": "27.126.207.43",
		"s5.fred.com": "27.126.207.43",
	},
	vus: 8000,
	//duration: "10h",
	stages: [
		{duration: "1m", target: 8000},
		{duration: "48h", target: 8000},
		{duration: "10s", target: 10},
	],
	//thresholds: {
	        //avg=23.89 min=0 med=21.22 max=70.96 p(90)=60.16 p(95)=65.33
		//"RTT": [ "p(95)<30000", "p(90)<18000", "avg<10000", "med<10000", "min<3000"],
		//"Content OK": ["rate>0.95"],
		//"ContentSize": ["value<4000"],
		//"Errors": ["count<100"]
	//},
	insecureSkipTLSVerify: true,
	noConnectionReuse: true,
	noVUConnectionReuse: true,
	discardResponseBodies: true,
	minIterationDuration: 1,
	setupTimeout: "30m",
	teardownTimeout: "10m",
	userAgent: "FredK6LoadTesting/1.0"
};

export default function() {
	let reqs = [
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
	];
	let res = http.batch(reqs);
	for (let i = 0; i < reqs.length; i++) {
		check(res[i], {
			"is status 200": (r) => r.status === 200,
			//"url is correct": (r) => r.url === reqs[i][1]
		});
		//TrendRTT.add(res[i].timings.duration);
		//RateContentOK.add(contentOK);
		//GaugeContentSize.add(res[i].body.length);
		//CounterErrors.add(!contentOK);
	}
}
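
For readability, the long hand-written batch above can also be generated programmatically. Below is a minimal sketch that produces the same 60 requests in the same order; the helper names and the HOSTS/TIMEOUTS constants are illustrative only:

function https10kGroup() {
	const HOSTS = ["s1", "s2", "s3", "s4", "s5"];
	const TIMEOUTS = [30000, 40000, 40000, 50000, 50000];
	// one HTTPS request for the 10 KB file per host, with the per-host timeouts used above
	return HOSTS.map((h, i) => ["GET", `https://${h}.fred.com/string_10k.txt`, null, { timeout: TIMEOUTS[i] }]);
}

function http100kGroup() {
	const HOSTS = ["s1", "s2", "s3", "s4", "s5"];
	// one plain-HTTP request for the 100 KB file per host, all with a 30 s timeout
	return HOSTS.map((h) => ["GET", `http://${h}.fred.com/string_100k.txt`, null, { timeout: 30000 }]);
}

// the same mix as the original array: nine 10 KB groups and three 100 KB groups
let reqs = [].concat(
	https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup(), https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup(), https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup()
);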

imiric (Contributor) commented Jan 8, 2020

Hi @divfor, thanks for reporting this.

It's a bit difficult to follow all your screenshots, but can you summarize your main complaint in text? Is the issue that k6 is using 100% of 2 CPU cores, or that the CPU usage fluctuates, or that the RPS itself fluctuates?

I would consider the RPS fluctuation to be the only real problem, as the other two can be expected, depending on your environment. RPS variation can also have many causes (the network, the system under test, etc.), so it's difficult to pin it down as a k6 issue.

divfor (Contributor, Author) commented Jan 9, 2020

I think there are one (or two) lock-wait limits inside k6, each of which saturates a core. When scaling up to 500 VUs, those two cores reach 100% and k6 starts fluctuating. If I cap VUs below 500 (so the two CPU cores never hit 100%), the fluctuation does not appear.
I consider this a k6 issue because those two cores stay at 100% while all the other cores are only about half used.
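
One way to confirm this kind of per-core saturation while the test is running is a per-CPU monitor (a sketch, assuming the sysstat package is installed; htop or similar works just as well):

    mpstat -P ALL 1

The pattern described here would show up as one or two CPUs near 100% (largely %sys) while the remaining cores stay around half-used.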

na-- added the evaluation needed and performance labels on Jan 10, 2020
mstoykov (Contributor) commented

I am going to do some experiments and try to reproduce the issue on my dual X5660 server this week, but I am currently making some changes on that machine, so doing performance benchmarking right now is probably not a good idea ;).

mstoykov (Contributor) commented

Hi, I did some testing ... finally. Sorry it took so long; I needed to finish some work on #1007 and #1285 in particular.

My tests were over a 1 Gbps connection, and I was serving 10 KB and 100 KB files through Nginx (as that is what I assumed "10k" and "100k" meant in your file names), with 10 IPs on the interface of the k6 machine, which, as mentioned before, has two X5660s and 24 GB of RAM. The Nginx machine is a 4-core i7.

I more or less ran the exact same script (only with shorter durations and fewer VUs, as I don't have the time to wait for it or the memory required).
One thing I noticed is that I get the same throughput with 10 VUs as with 500 VUs (below are selected metrics from k6's summary and the /usr/bin/time -v output):

Command being timed: "./k6.145a20f7 run --nic lan -u 10 -d 2m -s test.js"
    data_received..............: 14 GB   114 MB/s
    http_reqs..................: 399393  3328.273056/s
    iterations.................: 6650    55.416634/s
        User time (seconds): 443.12
        System time (seconds): 106.97
        Percent of CPU this job got: 448%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:02.74
        Maximum resident set size (kbytes): 209700
Command being timed: "./k6.145a20f7 run --nic lan -u 500 -d 2m -s  test.js"
    data_received..............: 13 GB  105 MB/s
    http_reqs..................: 388656 3238.797456/s
    iterations.................: 6106   50.883293/s
        User time (seconds): 570.84
        System time (seconds): 138.68
        Percent of CPU this job got: 516%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:17.26
        Maximum resident set size (kbytes): 2834372

As seen above, I push the same amount of traffic, iterations, and requests with 50x fewer VUs (and lower CPU and memory usage).

In my particular case I didn't have two CPU cores pinned at 100%; I had one that was close, with 50-80% of that CPU's time spent in system calls. I am pretty sure this is the thread doing epoll for us in Go, but it wasn't affected by disabling or enabling thresholds and so on, as we first thought it would be. It also disappears completely if I just run an empty default function (which still eats a lot of CPU, for the record), so my epoll theory seems plausible.

I did run with GODEBUG=gctrace=1, which prints a line of info every time a garbage collection happens, and it did match up somewhat with the waves/fluctuations in CPU usage. Once I went down to 10 VUs the waves disappeared, as there is then much less memory in use and so less to GC.
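
For reference, enabling the GC trace is just a matter of setting the Go runtime environment variable when starting k6 (the exact k6 flags below mirror the runs above and may differ for your build):

    GODEBUG=gctrace=1 ./k6 run -u 500 -d 2m test.js

Each collection then prints a line starting with "gc <N>" to stderr, which can be correlated with the throughput dips.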

So my recommendation for getting better performance is to check whether lowering the number of VUs still yields the same throughput (you can try with 5-minute tests, for example :)); see the comparison runs sketched below.
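
For instance (using stock k6 flags; the --nic option from the multiple-NICs branch is omitted here), a pair of short comparison runs could look like:

    ./k6 run -u 500 -d 5m test.js
    ./k6 run -u 8000 -d 5m test.js

If the http_reqs rate in the summary ends up roughly the same for both, the extra VUs are only adding CPU and memory overhead.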

You can also try the new JavaScript compatibility mode, which will lower the memory per VU significantly; as your test seems pretty simple, you won't need to rewrite a lot to keep it ES5.1-compatible :).
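
As a sketch of what that would look like (assuming k6 v0.26+, where the base compatibility mode and its require()-style module loading are available), the script header becomes roughly:

var http = require("k6/http");
var k6 = require("k6");

module.exports.options = { /* same options as above */ };

module.exports.default = function () {
	// same body as above, with let/const replaced by var
};

and the test is started with the flag (or its equivalent environment variable):

    ./k6 run --compatibility-mode=base test.js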

On a related note ... the stages implementation prior to #1007 is kind of ... not great, and given what you are doing, you are probably better served by just setting duration and vus, as shown below.
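
In the script above, that would amount to replacing the stages block with something like the following (a sketch; all other options stay as they are):

export let options = {
	vus: 8000,
	duration: "48h",
	// ... rest of the options unchanged
};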

I hope this was useful. I will try to take another stab at it with the changes from #1007 at some point, next time using some profiling and possibly some tracing in order to better understand what is happening :D

oleiade (Member) commented Feb 2, 2024

The team discussed this during our backlog grooming, and we decided we would likely not pursue addressing this (ourselves, anyway).

If you encounter this issue and have the interest and capacity to address it, please feel free to reopen and discuss it further. 🙇‍♂️

oleiade closed this as completed on Feb 2, 2024
github-actions bot added the triage label on Feb 2, 2024
olegbespalov removed their assignment on Feb 2, 2024