Pods in error state #52
@yuriipolishchuk Hi
@yuriipolishchuk can you please share the full log from the seleniferous container?
@alcounit I can share mine.
@SebastianPereiro thanks, will check.
@SebastianPereiro, @yuriipolishchuk this should be fixed now; update the seleniferous image version to alcounit/seleniferous:v1.0.4
@alcounit I'm running the tests, so far so good: I haven't seen a single failed pod today.
Hi @alcounit! We ran a much bigger stress test today and unfortunately got more of the same bugs.
Maybe the problem is on the browser side? If so, we could simply change the browser pod's restart policy from "restartPolicy: Never" to "restartPolicy: Always" and ignore it?
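For illustration only, a minimal Go sketch of what that restart-policy change could look like in a pod template built with client-go types; the helper name and surrounding structure are assumptions, not selenosis's actual code:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// browserPod is a hypothetical helper showing the change discussed above:
// RestartPolicyAlways instead of RestartPolicyNever, so the kubelet restarts
// a crashed browser container instead of leaving the pod in an Error state.
func browserPod(name, image string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyAlways, // was: corev1.RestartPolicyNever
			Containers: []corev1.Container{
				{Name: "browser", Image: image},
			},
		},
	}
}
```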
@SebastianPereiro what are the timeouts for the seleniferous container?
@alcounit I didn't change them, so they are the defaults.
I know how to fix this; it will be a quick fix. But in your logs I see that the session timed out and the delete request failed. Is something wrong with the browser?
@alcounit You're right, I assume something is wrong with our QA team's tests.
@SebastianPereiro try the new seleniferous image alcounit/seleniferous:v1.0.5
@alcounit Thank you. I've updated the selenosis config and will run more tests soon.
@alcounit I've just got a fresh error (looks the same). Attaching logs...
@SebastianPereiro I don't see any error in the log you provided; can you give more details on where you see an error?
@alcounit Sorry, my bad (I attached the wrong log). Doing more tests...
Actually, it wasn't the wrong log. I got a new log from a similar pod:
@alcounit Found another strange situation during a prolonged test:
Could it be related to stuck browser containers?
This is odd.
If pods are not deleted, you will quickly reach the quota limit. I need to understand why pods are not deleted after calling the Kubernetes API.
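For context, a minimal client-go sketch of the delete path being discussed (an assumption about the shape of the call, not seleniferous's actual code). Note that a successful Delete call only asks the API server to terminate the pod; it does not guarantee the pod is gone:

```go
package sketch

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePod is a hypothetical helper: even when Delete returns nil, the pod
// may still be terminating, so "call succeeded" and "pod is gone" differ.
func deletePod(ctx context.Context, client kubernetes.Interface, ns, name string) {
	err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
	switch {
	case err == nil:
		log.Printf("delete accepted for pod %s/%s, termination may still be in progress", ns, name)
	case apierrors.IsNotFound(err):
		log.Printf("pod %s/%s is already gone", ns, name)
	default:
		log.Printf("delete failed for pod %s/%s: %v", ns, name, err)
	}
}
```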
@alcounit I also see these errors from time to time:
@alcounit Looks like there is some sort of session-stacking bug. We can see it in Selenosis RAM usage:
Temporary fix: scale the Selenosis deployment down to 0 pods and then back up to the HPA-recommended value.
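For reference, that workaround expressed as a rough client-go sketch; the namespace, deployment name, and replica count are assumptions, and the same effect can be achieved with kubectl scale:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bounceDeployment scales a deployment to zero and then back up, dropping any
// sessions a stuck selenosis replica may be holding in memory.
func bounceDeployment(ctx context.Context, client kubernetes.Interface, ns, name string, replicas int32) error {
	deployments := client.AppsV1().Deployments(ns)

	scale, err := deployments.GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Scale down to zero...
	scale.Spec.Replicas = 0
	scaled, err := deployments.UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
	if err != nil {
		return err
	}

	// ...then back up to the desired replica count.
	scaled.Spec.Replicas = replicas
	_, err = deployments.UpdateScale(ctx, name, scaled, metav1.UpdateOptions{})
	return err
}
```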
@SebastianPereiro sorry for the delays. I'll try to fix the browser pod deletion issue and will post a develop image shortly so you can test it.
@alcounit No problem. Thank you!
Let me know if I can provide more diagnostic information.
@SebastianPereiro can you please try this image: alcounit/seleniferous:develop_v1.0.6
@alcounit Thank you. I tried but browsers failed to start. I got the following errors:
Selenosis logs:
My bad, the image was built for the arm64 architecture; try alcounit/seleniferous:develop_v1.0.6.1
@alcounit Many thanks! Testing...
OK, I ran my tests and some pods still end up in an 'unfinished' state (seleniferous is in the Completed state and the browser is still running).
This looks odd; the pod delete call failed.
Could you please increase the timeout value and add more retries for pod deletion?
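For illustration, a rough client-go sketch of the requested behaviour: a longer per-attempt timeout plus several retries around the delete call. The backoff values, timeout, and helper name are assumptions, not what seleniferous actually implements:

```go
package sketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// deletePodWithRetry retries the pod delete call with exponential backoff and
// gives each attempt its own, longer timeout. A NotFound error is treated as
// success because the pod is already gone.
func deletePodWithRetry(client kubernetes.Interface, ns, name string) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}

	return retry.OnError(backoff,
		func(err error) bool { return true }, // retry on any error
		func() error {
			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()

			err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
			if apierrors.IsNotFound(err) {
				return nil
			}
			return err
		})
}
```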
@SebastianPereiro will do, sorry for the delays
After upgrading the Chrome image to version 101 (this could be a coincidence), we see that some Chrome pods are in an error state with such entries in the log.
I've already tried increasing gracefulTerminationPeriod, but that didn't help.
Here are the timeout values we use: