Instabilities with browsers under the load (300-600 tests in parallel) #1

AlexeyAltunin · 2020-04-18T17:58:54Z

Hi! I have been testing Callisto since last week.

Issue description: containers/browsers freeze at random, leaving hanging pods. This reproduces when running a lot of tests in parallel.

3 types of errors:

WebDriverError: Pod does not have an IP (not critical, happens very rarely)

<center><h1>500 Internal Server Error</h1></center>
 <hr><center>nginx/1.17.2</center>
 </body>

Fixed after increasing resources for nginx

The most critical one: it happens quite often but at random, and it hurts pipeline stability. This log was found in the hanging browser pods:

[91:124:0417/171003.763223:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 376: Permission denied (13)
[91:124:0417/171004.767769:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 380: Permission denied (13)
[91:124:0417/171005.367275:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 384: Permission denied (13)
[91:124:0417/171005.594971:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 389: Permission denied (13)
[91:124:0417/171006.003322:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 393: Permission denied (13)
[91:124:0417/171006.581433:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 397: Permission denied (13)

I didn't find anything useful in the Callisto pod logs.

Our configuration:

300-600 tests in parallel
GCP GKE cluster
Spec:

initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 200
  }

  node_config {
    preemptible  = true
    machine_type = "n2-highcpu-8"

Callisto setup: values.yaml

# Unique ID of callisto instance
instanceID: 'unknown'

rbac:
  create: true

callisto:
...  
  replicas: 1
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "250m"
      memory: "128Mi"
  logLevel: "DEBUG"
  service:
    type: "LoadBalancer"
 
  browser:
    name: "chrome"
    chromeImage: "selenoid/chrome:81.0"
    resources:
      limits:
        cpu: "1000m"
        memory: "1024Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
...
    env:
    - name: TZ
      value: 'UTC'
    - name: ENABLE_VNC
      value: 'true'

nginx:
  image:
    registry:
    repository: nginx
    tag: '1.17.2-alpine'
    pullPolicy: Always

  prometheusExporter:
    image:
      registry:
      repository: nginx/nginx-prometheus-exporter
      tag: '0.4.0'
      pullPolicy: Always
  replicas: 2
  minReadySeconds: 15
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  resources:
    requests:
      cpu: "2000m"
      memory: "1024Mi"
  
...

We also tested Callisto with small suites (30-45 tests in parallel) and it works fine.
Have you faced the same issue, or do you have any ideas on how to fix it?

Thanks in advance!

vigneshfourkites · 2021-04-05T18:39:31Z

Is the issue fixed, @AlexeyAltunin? Did updating the browser version make things stable?

srntqn · 2021-04-06T08:33:24Z

We discussed this issue by email, and the assumption was that the cause is the small size of the cluster nodes: it leads to frequent cluster autoscaling under load, which in turn leads to browser freezes and failures. But it was just an assumption and we didn't verify it. Maybe @AlexeyAltunin has more info.

Here at Wrike we use a 32 vCPU/128 GB RAM node config, and there are no such problems with the browsers.

@vigneshfourkites do you experience the same issue?
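
To put rough numbers on the node-size assumption above, here is a minimal sketch that estimates how many browser pods fit on one node and how many nodes a run needs. It assumes scheduling is bounded only by the browser requests from the values.yaml in the first comment (cpu 500m, memory 512Mi), and it uses nominal machine sizes (8 vCPU/8 GB for n2-highcpu-8). Real allocatable capacity is lower because of system reservations, so treat the results as upper bounds. The function names are made up for illustration.

```python
import math


def pods_per_node(node_cpu_m, node_mem_mi, req_cpu_m, req_mem_mi):
    """Upper bound on pods per node, limited by CPU or memory requests."""
    return min(node_cpu_m // req_cpu_m, node_mem_mi // req_mem_mi)


def nodes_needed(sessions, per_node):
    """Nodes required to host the given number of parallel browser pods."""
    return math.ceil(sessions / per_node)


# Nominal capacities in millicores and MiB (assumption: no system overhead).
small = pods_per_node(8_000, 8 * 1024, 500, 512)      # n2-highcpu-8
large = pods_per_node(32_000, 128 * 1024, 500, 512)   # 32 vCPU / 128 GB

for sessions in (300, 600):
    print(sessions, nodes_needed(sessions, small), nodes_needed(sessions, large))
```

Under these assumptions a 600-session run needs several times more small nodes than large ones, so the autoscaler has to add far more nodes while the tests are already starting.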

vigneshfourkites · 2021-04-06T09:25:16Z

@srntqn No, we are in POC mode and run fewer than 100 browsers. In the future we will certainly scale beyond 300, and knowing the precautions for this issue might give us some ideas for scaling up, so I asked. Thanks for the response!

vigneshfourkites · 2021-06-03T05:04:28Z

@srntqn We are running a 32 GB machine with 50 parallel tests; containers are not destroyed properly and pods taint happening. what is the K8 version you are using? Any benchmark information do you have? The machine config we currently use is 8 vCPU/32 GB RAM.

srntqn · 2021-06-04T08:18:08Z

@vigneshfourkites did you check the Callisto logs? Are there any errors?
Also, it could be helpful to check the logs of the Kubernetes API server and the kubelet.

pods taint happening

Sorry, I may be misunderstanding this. Could you please provide more details? What do you mean here?

what is the K8 version you are using? Any benchmark information do you have?

We currently use Kubernetes 1.18.17. Unfortunately, we have no benchmarks for this version, but there are no problems with pod creation/deletion and the latency is okay.

vigneshfourkites · 2021-06-05T12:26:41Z

@srntqn Yeah, I saw some ERROR logs in the Callisto pods:

2021-06-05 12:12:35,603 unknown ERROR >>> {"tid": "web-2b5f388e811b46d9882d15f45f00b045"}
Traceback (most recent call last):
  File "/app/callisto/libs/middleware.py", line 30, in error_middleware
    return await handler(request)
  File "/app/callisto/web/webdriver_logs.py", line 16, in webdriver_logs_handler
    async for line in await uc.get_logs_stream(pod_name=get_pod_name(request)):
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 39, in __anext__
    rv = await self.read_func()
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 328, in readline
    await self._wait('readline')
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 296, in _wait
    await waiter
concurrent.futures._base.CancelledError

What is the root cause of this error? After the error above, delete/create requests were made, but pods were not removed from or added to the cluster.

vpokotilov · 2021-06-07T16:02:33Z

@vigneshfourkites this particular error is related to displaying logs in Selenoid-UI; it is not related to starting or stopping pods.
It would be helpful to get more logs. For example, you can enable debug logs by setting logLevel: "DEBUG" in the chart values.
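
For illustration, here is a minimal standalone sketch of how a CancelledError like the one in the traceback can surface. It assumes the handler's task is cancelled while it is waiting on a stream read (for example, because the client viewing the logs went away); whether that is exactly what happens inside Callisto is not confirmed here. The names read_lines and main are hypothetical, and an asyncio.Queue stands in for the log stream.

```python
import asyncio


async def read_lines(queue):
    """Stand-in for a log-streaming handler: waits for lines indefinitely."""
    while True:
        await queue.get()  # blocks until data arrives, like readline()


async def main():
    queue = asyncio.Queue()  # never receives data in this sketch
    task = asyncio.ensure_future(read_lines(queue))
    await asyncio.sleep(0)   # let the task start and block on queue.get()
    task.cancel()            # e.g. the consumer of the stream disconnected
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "finished"


result = asyncio.run(main())
print(result)
```

The cancellation is raised at the point where the task was waiting, which matches the last frames of the traceback sitting inside the stream read rather than in pod management code.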

vigneshfourkites · 2021-06-07T16:20:26Z

It is already set to DEBUG. I only see the failures mentioned above in the Callisto pod; no other errors are logged. Do you restrict the browser CPU and Memory utilisation internally anywhere? It seems the CPU is constantly at 100% during execution. @vpokotilov

srntqn · 2021-06-10T15:36:34Z

@vigneshfourkites it looks like you have cluster performance problems. Maybe the cause is the load produced by your tests. Did you try decreasing the number of parallel sessions to see how it affects performance?

Do you restrict the browser CPU and Memory utilisation internally anywhere?

We use only k8s requests/limits.

resources:
  limits:
    cpu: 2500m
    memory: 2000Mi
  requests:
    cpu: 1
    memory: 500Mi
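
As a rough aid for reading the requests/limits above, the sketch below converts Kubernetes quantity strings to numbers and computes how far a pod may burst beyond what the scheduler reserved for it. It is a simplified parser covering only the suffixes used in this thread (m for CPU; Ki, Mi, Gi for memory), not the full Kubernetes quantity grammar, and the helper names are made up for illustration.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity):
    """Return CPU cores as a float; accepts '2500m', '1' or the integer 1."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1]) / 1000
    return float(text)


def parse_memory(quantity):
    """Return bytes; handles only Ki/Mi/Gi suffixes and plain byte counts."""
    text = str(quantity)
    for suffix, factor in MEMORY_UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


# Values copied from the resources block above.
resources = {
    "limits": {"cpu": "2500m", "memory": "2000Mi"},
    "requests": {"cpu": 1, "memory": "500Mi"},
}

cpu_ratio = parse_cpu(resources["limits"]["cpu"]) / parse_cpu(resources["requests"]["cpu"])
mem_ratio = parse_memory(resources["limits"]["memory"]) / parse_memory(resources["requests"]["memory"])
print(cpu_ratio, mem_ratio)
```

With limits above requests, each browser pod can use more CPU than the scheduler accounted for, so a node packed by requests can still run at full CPU when many sessions are busy at once.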
