Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Critical performance bottleneck with high-throughput GRPC stream handling #2669

Open
Threebow opened this issue Feb 20, 2024 · 12 comments
Open

Comments

@Threebow
Copy link

Threebow commented Feb 20, 2024

Problem description

Our current application utilizes this module within a Node.js backend environment, specifically designed to process events transmitted via a GRPC stream. The volume of these events is substantial, frequently reaching several thousands per second, and can potentially escalate to 10-15k events per second during peak periods.

Each event carries a timestamp indicating its original compilation time by the GRPC server. During our evaluations, we've identified a significant performance limitation with this module—it struggles to process beyond approximately 250 events per second. Moreover, a noticeable delay emerges rapidly, as indicated by logs comparing the current time against the event timestamps, leading to the processing of events that are significantly outdated, sometimes by several minutes.

This performance shortfall renders the task of managing such a high-volume stream through a single Node.js process impractical. Fortunately, our infrastructure includes powerful machines equipped with over 150 vcores and substantial RAM, enabling us to consider distributing the workload across multiple "consumer" sub-processes (be it through child_process, worker_threads, cluster.fork(), etc.) in a round-robin configuration.

Node.js's introduction of worker threads and the cluster module was a strategic enhancement to address such challenges, facilitating parallel request handling and optimizing multi-core processing capabilities. Given Node.js's proven capability to handle upwards of 20k transactions per second in benchmarks with frameworks like Express and Koa, it stands to reason that this scenario should be well within Node's operational domain.

However, it appears this module lacks support for such a distributed processing approach.

Inquiry

What is the optimal strategy for leveraging this module to handle thousands of events per second efficiently? Is there a method to employ Node.js's native cluster module to distribute the processing of these event transactions across multiple clustered instances in a round-robin manner, without duplicating events between processes?

Reproduction steps

  1. Access a high-throughput GRPC stream.
  2. Attempt to process this high-volume stream.
  3. Observe significant delays in event processing, with events not being processed timely.

Code used to test throughput:

const client = new XClient("endpoint", ChannelCredentials.createInsecure())
const stream = client.SubscribeXUpdates(new SubscribeXUpdatesRequest())

let start: number | undefined
let n = 0

stream.on("data", () => {
	if (!start) start = Date.now()
	n++
})

stream.on("end", () => {
	console.error("Stream ended")
	process.exit(1)
})

stream.on("error", (error) => {
	console.error("Stream error", error)
	process.exit(1)
})

setTimeout(() => {
	if (!start) {
		console.error("No data received from the stream.")
		process.exit(1)
	}
	
	const msElapsed = Date.now() - start
	const sElapsed = msElapsed / 1000
	const rate = n / sElapsed
	
	console.log("intake rate:", rate, "transactions/sec")

	process.exit(0)
}, 30 * 1000)

Results after 30 seconds:

intake rate: 223.4826808496314 transactions/sec

While we acknowledge the challenge in replicating this specific scenario due to our event provider's closed-source nature, we can offer private access to our GRPC server endpoint and our protocol definitions for deeper investigation. Unfortunately, our ability to share further details is limited under these circumstances.

Environment

  • Operating System: Windows 11 (16c/32t, 64GB RAM)
  • Node.js Version: 20.10.1
  • Also appears on Ubuntu 22 server (160 cores, 350GB RAM), with same node version.
  • Docker is not being used.

Additional context

Should this module inherently be incapable of managing such high throughput, we suggest the inclusion of a disclaimer in the documentation to guide users with similar requirements, thereby preventing comparable challenges.

@Threebow
Copy link
Author

Threebow commented Feb 20, 2024

We ran the exact same test in Java to confirm that the issue was not with our GRPC server.

Here is the result:

> Task :Main.main()
Starting test...
Reached target, exiting...
intake rate: 3287.266666666667 transactions/sec

The 223 transactions/sec we saw from this library using Node.js is an incredible drop from the 3287 transactions/sec observed here.

Any insight on the rate differential would be appreciated.

@murgatroid99
Copy link
Member

Our own benchmark shows significantly better performance than you are seeing there. The "Streaming secure ping pong" test runs a stream in which the client sends a message and then the server sends a response message, and then the client sends another message, etc.. The median latency measured is about 135us, which corresponds to about 7000 messages per second. The implementation of that benchmark can be found here.

It's hard to tell from what you have here what the likely cause of the discrepancy is. One likely possibility is protobuf parsing. Your messages are likely significantly more complex than the ones in our benchmark. Overall, we probably would want to look at a flame graph or something similar to investigate this performance issue more effectively.


In your original post, you talk about multiprocessing functionality in node like child_process, worker_threads, and cluster, but I don't see how they would be relevant here. We're looking at the performance of receiving sequential messages in a single gRPC stream, which is an inherently serial operation. And cluster in particular is only relevant to running a server, while this is a purely client-side benchmark.

You can run gRPC clients or servers in multiple processes or threads, and gRPC does work with the cluster module, so in other circumstances you could use those things to improve performance with gRPC.

@Threebow
Copy link
Author

Our own benchmark shows significantly better performance than you are seeing there.

Thank you for your insights and the reference to your benchmark. It's perplexing, indeed, that we're observing such a stark difference in performance.

It's hard to tell from what you have here what the likely cause of the discrepancy is. One likely possibility is protobuf parsing. Your messages are likely significantly more complex than the ones in our benchmark.

The comparison with our Java benchmarks, which are ~14.7x faster despite using identical proto files, suggests that the issue might not solely lie with the complexity of the protobuf parsing.

Overall, we probably would want to look at a flame graph or something similar to investigate this performance issue more effectively.

I've conducted preliminary debugging on my example code above, which revealed noticeable idle periods where no events are processed, despite a continuous flow from the server. This gap in processing leads to a growing backlog of events, as illustrated in the attached snapshots:

chrome_y3zIVhxANq

chrome_6V3gCt16wJ

chrome_n61oYXNwQZ

With so much idle time, it is indeed peculiar that the throughput is so low -- there is clearly room for more.

In your original post, you talk about multiprocessing functionality in node like child_process, worker_threads, and cluster, but I don't see how they would be relevant here.

The concept of workload distribution was an exploratory measure to alleviate the bottleneck. I wondered if dispersing the substantial event throughput across multiple threads or processes in a round-robin fashion might improve performance. Imagine a theoretical scenario where a client needs to process around 10 million events per second. It's clear that no single Node.js thread could handle this alone, necessitating some form of parallel processing. Node.js's cluster module represents our best option for enabling such a level of concurrency.

We're looking at the performance of receiving sequential messages in a single gRPC stream, which is an inherently serial operation.

The way I would envision it working is each process in a child thread gets to process the next message in the sequence. So if there's 10 messages that come in and 4 nodes working to process them, each process's stream.on handler would be called as follows:

Message # 1 2 3 4 5 6 7 8 9 10
Worker # 1 2 3 4 1 2 3 4 1 2

In this setup, each sub-process sequentially processes a message from the stream, ensuring that the workload is evenly distributed among all available processes. This approach leverages parallel processing to handle high-throughput scenarios more efficiently.

And cluster in particular is only relevant to running a server, while this is a purely client-side benchmark.

The cluster module extends beyond server-side load balancing, applicable beyond just backend requests. Additionally, viewing each consumed event as a server request aligns it with how express or koa servers function.

You can run gRPC clients or servers in multiple processes or threads, and gRPC does work with the cluster module, so in other circumstances you could use those things to improve performance with gRPC.

I understand that you can run multiple gRPC client instances across different threads or processes, and that it's compatible with the cluster module. However, if all these instances listen to the same stream, they'd each get every single event. This approach wouldn't really help, as it just multiplies the number of events each stream processes by the number of clusters, without actually distributing the load effectively.

@Threebow
Copy link
Author

I conducted additional tests on both Java and Node, focusing on the "average delta" for each event received. The "delta" is the difference between when the payload was generated (as marked by the server in the event's payload) and when it was received. I calculated the average delta by summing all deltas and dividing by the total number of events processed, over a 60-second duration for each test.

Here are my results:

Node
Starting
intake rate: 294.27679152958467 events/sec. total events received: 17621 avg. delta: 23808.32494435517
Java
Starting test...
Reached target, exiting...
intake rate: 3710.722195394069 events/sec. total events received: 222840 avg. delta: -660

Surprisingly, the Java test outperformed expectations by processing events about 600ms faster than the server time, indicating a significant untapped potential in the Node module. Java managed to process events at a rate 12.65x higher than Node in the same 60-second timeframe. On the other hand, Node lagged considerably, trailing by an average of 23 seconds in processing events.

@Threebow
Copy link
Author

In a final (for now) test, which included tracking the peak delta (the longest single delay recorded), it became evident that Java didn't lag at all on any single request, handling the entire load effortlessly (note peak delta of -485, still ahead of the producer):

Starting test...
Reached target, exiting...
intake rate: 3024.2833333333333 events/sec. total events received: 181457 avg. delta: -734 ms peak delta: -485 ms

Conversely, it appears that this module lags up to 54 seconds in the 60s test:

Starting
intake rate: 268.97703549060543 events/sec. total events received: 16105 avg. delta: 27656.987129142733 ms peak delta: 54213.22265625 ms

@Threebow
Copy link
Author

Threebow commented Feb 21, 2024

Here's something interesting. I created a sort of "frankenstein solution".

I set up the Java RPC consumer used in my tests to send an HTTP request to my Node server for every event received. Using Koa to manage these requests and distributing them across 32 instances, I achieved exactly the desired throughput, with Node receiving and processing events at now-breakneck speeds:

webstorm64_wv9jWixY6F

The number of each log on the left is the pid of that thread. If you extrapolate the # of events and timing numbers in each log by the thread count (32 in this case), you will get a better idea of how much the app is processing as a whole.

My findings here in particular draw my attention to whatever mechanism this module uses to consume incoming events.

@murgatroid99
Copy link
Member

Looking at that frame graph, I agree that the protobuf parsing time (and message processing time in general) is not the likely culprit.

My next guess is that this is a flow control issue. It looks like the event processing happens in bursts separated by about 40ms. That seems like a plausible network round trip time communicating across the Internet. So it could be that the client receives a burst of messages and processes them and sends the WINDOW_UPDATE, and then waits those 40ms for the next burst of messages. This would also explain the discrepancy between your results and our benchmarks: our benchmarks run the test client and server probably in the same data center, so that round trip time would likely be much smaller.

One possible optimization would be to implement BDP-based dynamic flow control window adjustment, as discussed in #2429. That should allow the flow control window to expand enough that the client spends a lot more time processing messages, and a lot less time idle. I know Java already has that, so that could explain your discrepancy between the Node and Java results.

@Threebow
Copy link
Author

Thank you for your reply and the valuable information provided.

Could this optimization be applied within our own codebase using this module, or would it require implementation on your end?

@murgatroid99
Copy link
Member

That would need to be done in the library.

@Threebow
Copy link
Author

Okay, thank you for the update. We'll look for other solutions for now and keep an eye on future developments. I'd offer to help with a PR, but BDP-based dynamic flow control sadly seems outside my area of expertise. Kudos to you for the progress made thus far, and I'm keen to see how this library evolves in the future.

@Farenheith
Copy link

Hello!

@Threebow could you check if after #2788 solution has there been any performance improvement for this issue?

@murgatroid99
Copy link
Member

It's unlikely that that fix will impact this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants