Critical performance bottleneck with high-throughput GRPC stream handling #2669
We ran the exact same test in Java to confirm that the issue was not with our GRPC server. Here is the result:
Any insight on the rate differential would be appreciated.
Our own benchmark shows significantly better performance than you are seeing there. The "Streaming secure ping pong" test runs a stream in which the client sends a message, then the server sends a response message, then the client sends another message, and so on. The median latency measured is about 135us, which corresponds to about 7000 messages per second. The implementation of that benchmark can be found here.

It's hard to tell from what you have here what the likely cause of the discrepancy is. One likely possibility is protobuf parsing: your messages are probably significantly more complex than the ones in our benchmark. Overall, we would probably want to look at a flame graph or something similar to investigate this performance issue more effectively.

In your original post, you talk about multiprocessing functionality in Node, such as the `cluster` module. You can run gRPC clients or servers in multiple processes or threads, and gRPC does work with the `cluster` module.
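For a sense of what that test shape looks like in code, here is a minimal sketch of a streaming ping-pong loop with @grpc/grpc-js. It is not the actual benchmark implementation: the proto file, service, and method names (`benchmark.proto`, `Benchmark.pingPong`, a `payload` bytes field) are assumptions, and it uses insecure credentials for brevity.

```js
// Sketch of a streaming ping-pong throughput check (hypothetical proto/service names).
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const pkgDef = protoLoader.loadSync('benchmark.proto');
const proto = grpc.loadPackageDefinition(pkgDef).benchmark;
const client = new proto.Benchmark('localhost:50051', grpc.credentials.createInsecure());

const call = client.pingPong(); // bidirectional stream
let roundTrips = 0;
const start = process.hrtime.bigint();

call.on('data', () => {
  roundTrips++;
  call.write({ payload: Buffer.alloc(16) }); // reply to each echo with the next ping
});
call.on('error', (err) => console.error(err));

call.write({ payload: Buffer.alloc(16) }); // kick off the loop

setTimeout(() => {
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`~${Math.round(roundTrips / seconds)} round trips per second`);
  call.end();
}, 10_000);
```

At roughly 135us per round trip, a loop of this shape would land in the same ballpark of about 7000 messages per second.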
Thank you for your insights and the reference to your benchmark. It's perplexing, indeed, that we're observing such a stark difference in performance.
The comparison with our Java benchmarks, which are ~14.7x faster despite using identical proto files, suggests that the issue might not solely lie with the complexity of the protobuf parsing.
I've conducted preliminary debugging on my example code above, which revealed noticeable idle periods where no events are processed, despite a continuous flow from the server. This gap in processing leads to a growing backlog of events, as illustrated in the attached snapshots: With so much idle time, it is indeed peculiar that the throughput is so low -- there is clearly room for more.
The concept of workload distribution was an exploratory measure to alleviate the bottleneck. I wondered if dispersing the substantial event throughput across multiple threads or processes in a round-robin fashion might improve performance. Imagine a theoretical scenario where a client needs to process around 10 million events per second. It's clear that no single Node.js thread could handle this alone, necessitating some form of parallel processing, which is exactly the kind of workload Node.js's `cluster` module and `worker_threads` were introduced for.
The way I would envision it working is that each child process gets to handle the next message in the sequence. So if 10 messages come in and 4 workers are processing them, each worker handles every fourth message in turn.
In this setup, each sub-process sequentially processes a message from the stream, ensuring that the workload is evenly distributed among all available processes. This approach leverages parallel processing to handle high-throughput scenarios more efficiently.
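As a concrete illustration of that fan-out, here is a sketch in which the parent process owns the single gRPC stream and rotates messages across `worker_threads`. The proto, service, and method names (`events.proto`, `EventService.subscribe`) and the worker script path are assumptions, not anything from this project.

```js
// Round-robin fan-out sketch: one stream reader, N worker threads.
const { Worker } = require('worker_threads');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const NUM_WORKERS = 4;
const workers = Array.from({ length: NUM_WORKERS }, () => new Worker('./consumer-worker.js'));

const proto = grpc.loadPackageDefinition(protoLoader.loadSync('events.proto')).events;
const client = new proto.EventService('events.example.com:443', grpc.credentials.createSsl());

const stream = client.subscribe({}); // hypothetical server-streaming method
let next = 0;

stream.on('data', (event) => {
  // Message N goes to worker N % NUM_WORKERS, so no event is duplicated and
  // each worker handles every fourth message.
  workers[next].postMessage(event);
  next = (next + 1) % NUM_WORKERS;
});
stream.on('error', (err) => console.error('stream error', err));
```

Note that this only parallelizes the processing step: the parent that owns the stream still has to receive and deserialize every message, so it does not help if the bottleneck is in how the stream itself is consumed.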
I understand that you can run multiple gRPC client instances across different threads or processes, and that this is compatible with the `cluster` module.
I conducted additional tests on both Java and Node, focusing on the "average delta" for each event received. The "delta" is the difference between when the payload was generated (as marked by the server in the event's payload) and when it was received. I calculated the average delta by summing all deltas and dividing by the total number of events processed, over a 60-second duration for each test. Here are my results:

Node:

Java:
Surprisingly, the Java test exceeded expectations, processing events about 600ms ahead of the server timestamps, which points to significant untapped potential in the Node module. Java processed events at a rate 12.65x higher than Node over the same 60-second timeframe. Node, on the other hand, lagged considerably, trailing by an average of 23 seconds.
In a final (for now) test, which included tracking the peak delta (the longest single delay recorded), it became evident that Java didn't lag at all on any single request, handling the entire load effortlessly (note peak delta of -485, still ahead of the producer):
Conversely, it appears that this module lags up to 54 seconds in the 60s test:
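For reference, the delta bookkeeping described in these comments amounts to roughly the following sketch. `stream` is any grpc-js readable call stream, and `producedAtMs` is a hypothetical field name standing in for whatever timestamp the server stamps on each event.

```js
// Sketch of average/peak delta tracking over a fixed test window.
function trackDeltas(stream, durationMs = 60_000) {
  let totalDeltaMs = 0;
  let peakDeltaMs = -Infinity;
  let count = 0;

  stream.on('data', (event) => {
    const deltaMs = Date.now() - Number(event.producedAtMs); // negative = ahead of the producer
    totalDeltaMs += deltaMs;
    if (deltaMs > peakDeltaMs) peakDeltaMs = deltaMs;
    count++;
  });

  setTimeout(() => {
    console.log(`events processed: ${count}`);
    console.log(`average delta: ${(totalDeltaMs / count).toFixed(1)} ms`);
    console.log(`peak delta: ${peakDeltaMs} ms`);
  }, durationMs);
}
```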
Here's something interesting. I created a sort of "Frankenstein" solution: I set up the Java gRPC consumer used in my tests to send an HTTP request to my Node server for every event received. Using Koa to manage these requests and distributing them across 32 instances, I achieved exactly the desired throughput, with Node now receiving and processing events at breakneck speed:

These findings in particular draw my attention to whatever mechanism this module uses to consume incoming events.
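A sketch of what the Node side of that relay could look like, with `cluster` forking 32 Koa instances that share one port. The port, the instance count, the `koa-bodyparser` middleware, and the `producedAtMs` field are assumptions standing in for whatever the real setup used.

```js
// Sketch: 32 clustered Koa workers receiving one event per HTTP POST.
const cluster = require('cluster');
const Koa = require('koa');
const bodyParser = require('koa-bodyparser');

const INSTANCES = 32;

if (cluster.isPrimary) { // Node 16+; use cluster.isMaster on older versions
  for (let i = 0; i < INSTANCES; i++) cluster.fork();
} else {
  const app = new Koa();
  app.use(bodyParser());
  app.use(async (ctx) => {
    const event = ctx.request.body;                          // one event per request
    const deltaMs = Date.now() - Number(event.producedAtMs); // hypothetical timestamp field
    console.log(`[worker ${cluster.worker.id}] delta=${deltaMs}ms`);
    ctx.status = 204;                                        // acknowledge with no body
  });
  app.listen(3000); // cluster distributes incoming connections across the workers
}
```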
Looking at that flame graph, I agree that the protobuf parsing time (and message processing time in general) is not the likely culprit. My next guess is that this is a flow control issue. It looks like the event processing happens in bursts separated by about 40ms, which is a plausible network round trip time for communication across the Internet. So it could be that the client receives a burst of messages, processes them, sends the WINDOW_UPDATE, and then waits those 40ms for the next burst of messages. This would also explain the discrepancy between your results and our benchmarks: our benchmarks probably run the test client and server in the same data center, so that round trip time would be much smaller.

One possible optimization would be to implement BDP-based dynamic flow control window adjustment, as discussed in #2429. That should allow the flow control window to expand enough that the client spends a lot more time processing messages and a lot less time idle. I know Java already has that, so it could explain the discrepancy between your Node and Java results.
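To put rough numbers on that, here is a purely illustrative back-of-the-envelope calculation. The 40ms round trip comes from the estimate above; the 2 KB average message size is an assumption. With a static window, a receiver-paced stream moves at most about one flow control window per round trip, and BDP-based tuning is essentially about growing that window toward bandwidth times RTT.

```js
// Illustrative flow-control ceiling (assumed message size; RTT from the estimate above).
const INITIAL_WINDOW = 65535; // HTTP/2 default per-stream window, in bytes
const RTT_S = 0.040;          // ~40 ms round trip
const MSG_BYTES = 2048;       // assumed average serialized event size

// Static window: at most ~one window per round trip.
const maxBytesPerSec = INITIAL_WINDOW / RTT_S;                // ~1.6 MB/s
const maxMsgsPerSec = Math.floor(maxBytesPerSec / MSG_BYTES); // ~800 events/s

// To sustain e.g. 15,000 events/s over the same RTT, the window would need to
// grow to roughly the bandwidth-delay product.
const requiredWindow = Math.ceil(15000 * MSG_BYTES * RTT_S);  // ~1.2 MB

console.log({ maxMsgsPerSec, requiredWindow });
```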
Thank you for your reply and the valuable information provided. Could this optimization be applied within our own codebase using this module, or would it require implementation on your end?
That would need to be done in the library.
Okay, thank you for the update. We'll look for other solutions for now and keep an eye on future developments. I'd offer to help with a PR, but BDP-based dynamic flow control sadly seems outside my area of expertise. Kudos to you for the progress made thus far, and I'm keen to see how this library evolves in the future.
It's unlikely that that fix will impact this problem.
Problem description
Our current application uses this module in a Node.js backend environment, specifically to process events transmitted via a gRPC stream. The volume of these events is substantial, frequently reaching several thousand per second, and can escalate to 10-15k events per second during peak periods.
Each event carries a timestamp indicating when it was originally generated by the gRPC server. During our evaluations, we've identified a significant performance limitation with this module: it struggles to process more than approximately 250 events per second. Moreover, a noticeable delay emerges rapidly. Logs comparing the current time against the event timestamps show that we end up processing events that are significantly outdated, sometimes by several minutes.
This performance shortfall renders the task of managing such a high-volume stream through a single Node.js process impractical. Fortunately, our infrastructure includes powerful machines equipped with over 150 vcores and substantial RAM, enabling us to consider distributing the workload across multiple "consumer" sub-processes (be it through `child_process`, `worker_threads`, `cluster.fork()`, etc.) in a round-robin configuration.

Node.js's introduction of worker threads and the `cluster` module was a strategic enhancement to address such challenges, facilitating parallel request handling and optimizing multi-core processing capabilities. Given Node.js's proven capability to handle upwards of 20k transactions per second in benchmarks with frameworks like Express and Koa, it stands to reason that this scenario should be well within Node's operational domain. However, it appears this module lacks support for such a distributed processing approach.
Inquiry
What is the optimal strategy for leveraging this module to handle thousands of events per second efficiently? Is there a method to employ Node.js's native cluster module to distribute the processing of these event transactions across multiple clustered instances in a round-robin manner, without duplicating events between processes?
Reproduction steps
Code used to test throughput:
Results after 30 seconds:
While we acknowledge the challenge in replicating this specific scenario due to our event provider's closed-source nature, we can offer private access to our GRPC server endpoint and our protocol definitions for deeper investigation. Unfortunately, our ability to share further details is limited under these circumstances.
Environment
Additional context
Should this module inherently be incapable of managing such high throughput, we suggest the inclusion of a disclaimer in the documentation to guide users with similar requirements, thereby preventing comparable challenges.