
performance waves in seconds #1303

Closed · divfor opened this issue Jan 8, 2020 · 6 comments
Labels: evaluation needed, performance

Comments

divfor (Contributor) commented Jan 8, 2020

I noticed that when I use a powerful machine to stress two other servers, the request rate fluctuates periodically, with two CPU cores pinned at 100%:
[screenshot: request rate over time]

CPU usage at the high point of the period: [screenshot]

CPU usage at the low point of the period: [screenshot]

divfor (Contributor, Author) commented Jan 8, 2020

[screenshot]
k6 v0.26.0-dev (dev build, go1.13.5, linux/amd64) with multiple_NICs_feature
script:

import http from "k6/http";
import { check, sleep, fail } from "k6";
//import { Trend, Rate, Counter, Gauge } from "k6/metrics";

//export let TrendRTT = new Trend("RTT");
//export let RateContentOK = new Rate("Content OK");
//export let GaugeContentSize = new Gauge("ContentSize");
//export let CounterErrors = new Counter("Errors");

export let options = {
	hosts: {
		"s1.fred.com": "27.126.207.43",
		"s2.fred.com": "27.126.207.43",
		"s3.fred.com": "27.126.207.43",
		"s4.fred.com": "27.126.207.43",
		"s5.fred.com": "27.126.207.43",
	},
	vus: 8000,
	//duration: "10h",
	stages: [
		{duration: "1m", target: 8000},
		{duration: "48h", target: 8000},
		{duration: "10s", target: 10},
	],
	//thresholds: {
	        //avg=23.89 min=0 med=21.22 max=70.96 p(90)=60.16 p(95)=65.33
		//"RTT": [ "p(95)<30000", "p(90)<18000", "avg<10000", "med<10000", "min<3000"],
		//"Content OK": ["rate>0.95"],
		//"ContentSize": ["value<4000"],
		//"Errors": ["count<100"]
	//},
	insecureSkipTLSVerify: true,
	noConnectionReuse: true,
	noVUConnectionReuse: true,
	discardResponseBodies: true,
	minIterationDuration: 1,
	setupTimeout: "30m",
	teardownTimeout: "10m",
	userAgent: "FredK6LoadTesting/1.0"
};

export default function() {
	let reqs = [
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "http://s1.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s2.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s3.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s4.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "http://s5.fred.com/string_100k.txt", null, {timeout: 30000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s1.fred.com/string_10k.txt", null, {timeout: 30000}],
		["GET", "https://s2.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s3.fred.com/string_10k.txt", null, {timeout: 40000}],
		["GET", "https://s4.fred.com/string_10k.txt", null, {timeout: 50000}],
		["GET", "https://s5.fred.com/string_10k.txt", null, {timeout: 50000}],
	];
	let res = http.batch(reqs);
	for (let i = 0; i < reqs.length; i++) {
		check(res[i], {
			"is status 200": (r) => r.status === 200,
			//"url is correct": (r) => r.url === reqs[i][1]
		});
		//TrendRTT.add(res[i].timings.duration);
		//RateContentOK.add(contentOK);
		//GaugeContentSize.add(res[i].body.length);
		//CounterErrors.add(!contentOK);
	}
}
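
For readability, the long hand-written batch above can also be generated programmatically. Below is a minimal sketch that produces the same 60 requests in the same order; the helper names and the HOSTS/TIMEOUTS constants are illustrative only:

function https10kGroup() {
	const HOSTS = ["s1", "s2", "s3", "s4", "s5"];
	const TIMEOUTS = [30000, 40000, 40000, 50000, 50000];
	// one HTTPS request for the 10 KB file per host, with the per-host timeouts used above
	return HOSTS.map((h, i) => ["GET", `https://${h}.fred.com/string_10k.txt`, null, { timeout: TIMEOUTS[i] }]);
}

function http100kGroup() {
	const HOSTS = ["s1", "s2", "s3", "s4", "s5"];
	// one plain-HTTP request for the 100 KB file per host, all with a 30 s timeout
	return HOSTS.map((h) => ["GET", `http://${h}.fred.com/string_100k.txt`, null, { timeout: 30000 }]);
}

// the same mix as the original array: nine 10 KB groups and three 100 KB groups
let reqs = [].concat(
	https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup(), https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup(), https10kGroup(), http100kGroup(),
	https10kGroup(), https10kGroup()
);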

imiric (Contributor) commented Jan 8, 2020

Hi @divfor, thanks for reporting this.

It's a bit difficult to follow all your screenshots, but can you summarize your main complaint in text? Is the issue that k6 is using 100% of 2 CPU cores, or that the CPU usage fluctuates, or that the RPS itself fluctuates?

I would consider the RPS fluctuation to be the only real problem, as the other two can be expected, depending on your environment. RPS variation can also have many causes (the network, the system under test, etc.), so it's difficult to pin it down as a k6 issue.

divfor (Contributor, Author) commented Jan 9, 2020

I think there are one (or two) lock-wait limits inside k6, each of which saturates a core. When scaling up to 500 VUs, those two cores reach 100% and k6 starts fluctuating. If I cap VUs below 500 (so the two CPU cores never hit 100%), the fluctuation does not appear.
I consider this a k6 issue because those two cores stay at 100% while all the other cores are only about half used.
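
One way to confirm this kind of per-core saturation while the test is running is a per-CPU monitor (a sketch, assuming the sysstat package is installed; htop or similar works just as well):

    mpstat -P ALL 1

The pattern described here would show up as one or two CPUs near 100% (largely %sys) while the remaining cores stay around half-used.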

na-- added the evaluation needed and performance labels on Jan 10, 2020
mstoykov (Contributor) commented

I am going to do some experiments and try to reproduce the issue on my dual X5660 server this week, but I am currently making some changes on that machine, so doing performance benchmarking right now is probably not a good idea ;).

mstoykov (Contributor) commented

Hi, I did some testing ... finally. Sorry it took so long; I needed to finish some work on #1007 and #1285 in particular.

My tests were over a 1 Gbps connection, and I was serving 10 KB and 100 KB files through Nginx (as that is what I assumed "10k" and "100k" meant in your file names), with 10 IPs on the interface of the k6 machine, which, as mentioned before, has two X5660s and 24 GB of RAM. The Nginx machine is a 4-core i7.

I more or less ran the exact same script (only with shorter durations and fewer VUs, as I don't have the time to wait for it or the memory required).
One thing I noticed is that I get the same throughput with 10 VUs as with 500 VUs (below are selected metrics from k6's summary and the /usr/bin/time -v output):

Command being timed: "./k6.145a20f7 run --nic lan -u 10 -d 2m -s test.js"
    data_received..............: 14 GB   114 MB/s
    http_reqs..................: 399393  3328.273056/s
    iterations.................: 6650    55.416634/s
        User time (seconds): 443.12
        System time (seconds): 106.97
        Percent of CPU this job got: 448%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:02.74
        Maximum resident set size (kbytes): 209700
Command being timed: "./k6.145a20f7 run --nic lan -u 500 -d 2m -s  test.js"
    data_received..............: 13 GB  105 MB/s
    http_reqs..................: 388656 3238.797456/s
    iterations.................: 6106   50.883293/s
        User time (seconds): 570.84
        System time (seconds): 138.68
        Percent of CPU this job got: 516%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:17.26
        Maximum resident set size (kbytes): 2834372

As seen above, I push the same amount of traffic, iterations, and requests with 50x fewer VUs (and lower CPU and memory usage).

In my particular case I didn't have two CPU cores pinned at 100%; I had one that was close, with 50-80% of that CPU's time spent in system calls. I am pretty sure this is the thread doing epoll for us in Go, but it wasn't affected by disabling or enabling thresholds and so on, as we first thought it would be. It also disappears completely if I just run an empty default function (which still eats a lot of CPU, for the record), so my epoll theory seems plausible.

I did run with GODEBUG=gctrace=1, which prints a line of info every time a garbage collection happens, and it did match up somewhat with the waves/fluctuations in CPU usage. Once I went down to 10 VUs the waves disappeared, as there is then much less memory in use and so less to GC.
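
For reference, enabling the GC trace is just a matter of setting the Go runtime environment variable when starting k6 (the exact k6 flags below mirror the runs above and may differ for your build):

    GODEBUG=gctrace=1 ./k6 run -u 500 -d 2m test.js

Each collection then prints a line starting with "gc <N>" to stderr, which can be correlated with the throughput dips.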

So my recommendation for getting better performance is to check whether lowering the number of VUs still yields the same throughput (you can try with 5-minute tests, for example :)); see the comparison runs sketched below.
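
For instance (using stock k6 flags; the --nic option from the multiple-NICs branch is omitted here), a pair of short comparison runs could look like:

    ./k6 run -u 500 -d 5m test.js
    ./k6 run -u 8000 -d 5m test.js

If the http_reqs rate in the summary ends up roughly the same for both, the extra VUs are only adding CPU and memory overhead.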

You can also try the new JavaScript compatibility mode, which will lower the memory per VU significantly; as your test seems pretty simple, you won't need to rewrite a lot to keep it ES5.1-compatible :).
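
As a sketch of what that would look like (assuming k6 v0.26+, where the base compatibility mode and its require()-style module loading are available), the script header becomes roughly:

var http = require("k6/http");
var k6 = require("k6");

module.exports.options = { /* same options as above */ };

module.exports.default = function () {
	// same body as above, with let/const replaced by var
};

and the test is started with the flag (or its equivalent environment variable):

    ./k6 run --compatibility-mode=base test.js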

On a related note ... the stages implementation prior to #1007 is kind of ... not great, and given what you are doing, you are probably better served by just setting duration and vus, as shown below.
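
In the script above, that would amount to replacing the stages block with something like the following (a sketch; all other options stay as they are):

export let options = {
	vus: 8000,
	duration: "48h",
	// ... rest of the options unchanged
};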

I hope this was useful. I will try to take another stab at it with the changes from #1007 at some point, next time using some profiling and possibly some tracing in order to better understand what is happening :D

oleiade (Member) commented Feb 2, 2024

The team discussed this during our backlog grooming, and we decided we would likely not pursue addressing this (ourselves, anyway).

If you encounter this issue and have the interest and capacity to address it, please feel free to reopen and discuss it further. 🙇‍♂️

oleiade closed this as completed on Feb 2, 2024
github-actions bot added the triage label on Feb 2, 2024
olegbespalov removed their assignment on Feb 2, 2024