6.0.24 request routing causes massive performance regression #2579

Open

johnmadden-sp opened this issue Dec 17, 2024 · 7 comments

johnmadden-sp commented Dec 17, 2024

Issue report

Are you sure this is a bug in Passenger?

Not necessarily a bug, but we believe the implementation change caused a severe increase in latency, leading to outages. The 6.0.24 changelog mentions: We changed the way we route requests. Instead of picking the least-busy process, we now prioritize new processes first. Reverting to 6.0.23 fixes the problem.

Please try with the newest version of Passenger to avoid issues that have already been fixed
New issue in 6.0.24

Question 1: What is the problem?

  • What is the expected behavior? No change in performance from 6.0.23
  • What is the actual behavior? An unexplained performance degradation leads to huge latency issues and outages. With a timeout of 2 minutes, passengers are being restarted automatically, leading to increased CPU use on startup. Routing requests to new passengers appears to be the culprit: I think the issue is that these haven't been warmed up yet, which leads to a lot of latency and that passenger being killed off again. The proc manager kills the child passenger, sending a 502 to nginx, etc.
  • How can we reproduce it? Deploy new [Rails] containers on 6.0.24 and send traffic to them. We did nothing special here; the impact was immediate and obvious. These are just Docker containers, no rolling restarts. I think Passenger should instead provide a configuration option that lets the user pick the routing algorithm.

Question 2: Passenger version and integration mode:

  • Enterprise, nginx, libnginx-mod-http-passenger-enterprise

Question 3: OS or Linux distro, platform (including version):

  • Docker image phusion/passenger-ruby32:3.0.7 amd64.

Question 4: Passenger installation method:

Your answer:

  • RubyGems + Gemfile
  • RubyGems, no Gemfile
  • Phusion APT repo
  • Phusion YUM repo
  • OS X Homebrew
  • source tarball
  • Other, please specify:

Phusion-supplied Docker images, but the 6.0.24 upgrade happens via apt-get install libnginx-mod-http-passenger-enterprise.

Question 5: Your app's programming language (including any version managers) and framework (including versions):

  • Ruby 3.2.x, Rails 7.x

Question 6: Are you using a PaaS and/or containerization? If so which one?

  • We're building containers off of phusion/passenger-ruby32:3.0.7

CamJN commented Dec 17, 2024

No rolling restarts are involved, so the difference between 6.0.23 and 6.0.24 is that older processes are prioritized over less-busy processes. The newness bias only comes into play during a rolling restart (i.e. when processes from both before and after the restart was begun are present).

So what you are actually seeing is that older, more warmed-up processes are responding more slowly than less-busy processes would, which is surprising. It's possible that this is due to another change in 6.0.24, or it might just be that the warmed-up processes really are slower; we'll have to gather more data.
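
To make the distinction concrete, here is a minimal sketch of the two policies as described above (illustrative C++ only; Process, routeLeastBusy and routePriorityOrder are made-up names, and this is not Passenger's actual routing code):

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a pool entry; not Passenger's real data structure.
struct Process {
    uint64_t spawnTime;   // lower value = older process
    unsigned sessions;    // requests currently in flight
    unsigned concurrency; // max concurrent requests (e.g. thread count)
    bool hasCapacity() const { return sessions < concurrency; }
};

// 6.0.23-style: pick the process with the fewest in-flight requests.
Process *routeLeastBusy(std::vector<Process> &procs) {
    Process *best = nullptr;
    for (Process &p : procs) {
        if (p.hasCapacity() && (best == nullptr || p.sessions < best->sessions)) {
            best = &p;
        }
    }
    return best;
}

// 6.0.24-style (outside a rolling restart): walk the processes in a fixed
// priority order, oldest first, and take the first one with a free slot.
Process *routePriorityOrder(std::vector<Process> &procs) {
    std::sort(procs.begin(), procs.end(),
              [](const Process &a, const Process &b) {
                  return a.spawnTime < b.spawnTime;
              });
    for (Process &p : procs) {
        if (p.hasCapacity()) {
            return &p;
        }
    }
    return nullptr; // everything is saturated; the request waits in the queue
}

The practical difference: with routeLeastBusy a burst of concurrent requests spreads across all processes almost immediately, while with routePriorityOrder the oldest process keeps receiving requests until all of its threads are busy before the next process sees any traffic.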

Can you provide a bit more information:

  • Are you using multi-threading?
  • How many app processes are you having Passenger create?
  • Are you using Passenger's pre-start option?
  • Can you capture the latencies of the first several requests to a container with 6.0.23 and 6.0.24?

Can you be more detailed with regards to this bit:

With a timeout of 2 minutes, Passengers are being restarted automatically, […] The proc manager kills the child Passenger, sending a 502 to Nginx, etc.

Specifically, is the Passenger process being killed by a watchdog or are the Passenger-managed app processes being killed by Passenger?

CamJN commented Dec 17, 2024

Just a note for myself: I can diff these tags with the following to get the most suspect changes: git diff release-6.0.23 release-6.0.24 -- src ':!src/cxx_supportlib/vendor-modified/boost'

johnmadden-sp (Author) commented

  • Yes to multi-threading
  • 12 procs (in this app group)
  • Further config:
passenger_root /usr/lib/ruby/vendor_ruby/phusion_passenger/locations.ini;
passenger_ruby /usr/bin/ruby;
passenger_instance_registry_dir /var/passenger-instreg;
passenger_core_file_descriptor_ulimit 16384;
passenger_start_timeout 180;
passenger_pre_start http://localhost/websocket;
passenger_show_version_in_header off;
passenger_intercept_errors on;
passenger_rolling_restarts on;
passenger_pool_idle_time 3600;
passenger_max_request_queue_size 1000;
passenger_max_request_time 120;
passenger_concurrency_model thread;
passenger_max_pool_size 15;
passenger_thread_count 10;
passenger_request_queue_overflow_status_code 504;
  • Latencies: no, there'd be no way of capturing the first few. But after 30 minutes of runtime:
    p95 on 6.0.23 ~390ms
    p95 on 6.0.24 ~1800ms
  • As for what's doing the killing,
InternalUtils.cpp:80 ]: [Client 14-3168] Sending 502 response: application did not send a complete response
CheckoutSession.cpp:449 ]: [Client 3-3129] Maximum request time of 120 seconds reached, killing process 851

CamJN commented Dec 17, 2024

OK, so the oldest process is likely being loaded with 10 requests before the next process is chosen, and then Passenger kills the app process because some requests aren't finished within 120s. @FooBarWidget thoughts?
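
To put rough numbers on that, here is a toy simulation using the 12-process / 10-thread configuration from this thread (purely illustrative arithmetic, not measured Passenger behavior):

#include <array>
#include <cstdio>

int main() {
    constexpr int processes = 12; // "12 procs (in this app group)"
    constexpr int threads = 10;   // passenger_thread_count 10
    std::array<int, processes> inFlight{};

    // Assume a burst of 25 requests that all arrive before any finish.
    const int burst = 25;
    for (int r = 0; r < burst; ++r) {
        // Oldest-first routing: the first process with a free slot wins.
        for (int p = 0; p < processes; ++p) {
            if (inFlight[p] < threads) {
                ++inFlight[p];
                break;
            }
        }
    }

    for (int p = 0; p < processes; ++p) {
        std::printf("process %2d: %d in flight\n", p, inFlight[p]);
    }
    // Prints: processes 0 and 1 hold 10 requests each, process 2 holds 5,
    // the rest hold 0. Least-busy routing would spread the same 25 requests
    // roughly 2-3 per process. If the stacked requests are slow because the
    // process is cold, each one is racing passenger_max_request_time (120 s).
    return 0;
}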

johnmadden-sp (Author) commented

We did indeed see high thread use and busyness. (I assumed this was more tied to latency and thus just regular use-the-capacity behavior.) Could we maybe round-robin or otherwise better distribute requests to new processes, or only start sending to new ones once a threshold of them is ready? (But again, we aren't doing rolling restarts; this is happening in new containers on deployment.)
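
The round-robin idea might look something like this (purely hypothetical; RoundRobinRouter is not an existing Passenger option, just a sketch of what we're asking for):

#include <cstddef>
#include <vector>

// Hypothetical process record for illustration only.
struct Proc {
    unsigned sessions;    // requests currently in flight
    unsigned concurrency; // thread count
    bool ready;           // e.g. only true once the process is warmed up
};

// Rotate through ready processes, skipping any whose threads are all busy.
struct RoundRobinRouter {
    std::size_t next = 0;

    // Returns the index of the chosen process, or -1 if none can accept.
    int route(std::vector<Proc> &procs) {
        for (std::size_t i = 0; i < procs.size(); ++i) {
            std::size_t idx = (next + i) % procs.size();
            Proc &p = procs[idx];
            if (p.ready && p.sessions < p.concurrency) {
                next = (idx + 1) % procs.size();
                ++p.sessions;
                return static_cast<int>(idx);
            }
        }
        return -1; // all busy or not yet ready; queue the request
    }
};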

CamJN commented Dec 24, 2024

Can you tell us how you decided on the 12 processes and 10 threads configuration?

CamJN commented Dec 27, 2024

I'm having trouble reproducing this issue. I'm able to load nginx + Passenger Enterprise, set to 12 procs and 10 threads, with 120 concurrent requests and 50,000 total requests, and the worst time I get is 1.6 seconds to finish processing a request + reply, with a mean of 208 ms and a median of 185 ms. And that's with a 6-core machine, so it's actually a bit overloaded in that configuration.
