Handle Packet Delivery Related Syscalls with Batch Processing #7
Comments
I remember having looked at this with lemoer. There was still a performance disadvantage compared to an in-kernel tunnel like WireGuard because of the copying of packets between userspace and the kernel. Is there a possibility to introduce something comparable to MSG_ZEROCOPY with io_uring? I've always wondered why Jason Donenfeld, the OpenVPN team or the tinc team didn't work on exposing the virtualization TAP sockets...
Hm... I just had a look at the current state of the code. SOCK_ZEROCOPY has already been implemented.
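For reference, a minimal sketch of the generic Linux zero-copy transmit path on a UDP socket, not necessarily how fastd wires it up; the helper names and the fallback behaviour are assumptions for illustration:

```c
/* Minimal sketch (assumed helpers, not fastd's actual code) of the generic
 * Linux MSG_ZEROCOPY transmit path on a UDP socket. */
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60           /* value from asm-generic/socket.h */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000   /* value from linux/socket.h */
#endif

/* Opt the socket in once; fails on kernels without zero-copy support. */
static int enable_zerocopy(int fd) {
	int one = 1;
	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

static ssize_t send_zerocopy(int fd, const void *buf, size_t len) {
	/* The kernel pins the pages instead of copying them; the buffer must
	 * stay untouched until a completion notification is read from the
	 * error queue (recvmsg() with MSG_ERRQUEUE, SO_EE_ORIGIN_ZEROCOPY). */
	return send(fd, buf, len, MSG_ZEROCOPY);
}
```

The catch is that completions have to be reaped from the socket error queue, so the saved copy is partly paid for with extra bookkeeping, which is why zero copy typically only pays off for larger payloads.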
Here's a proof-of-concept patched fastd branch which lemoer and I created: Indeed, the syscall overhead can be reduced with io_uring. Unfortunately, a kernel version > 5.7 is required to allow poll-retry/fast poll, which is crucial for the performance gain. Furthermore, alongside a number of minor bugs, some race conditions seem to occur unless the operations on an individual socket are hardlinked. The patch works around this issue, which introduces a slight performance penalty. It might have been fixed upstream already and needs further testing. I'll open a pull request once NeoRaider has reworked the buffer management to reduce the allocation overhead.
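To make the hardlinking workaround concrete, here is a rough liburing sketch (buffer names and the helper are made up, this is not the patched branch's code): successive operations on the same socket are chained with IOSQE_IO_HARDLINK so the kernel cannot race them, at the cost of losing some parallelism.

```c
/* Sketch with liburing: two receives on the same socket are chained with
 * IOSQE_IO_HARDLINK so the kernel processes them strictly in order instead
 * of racing. Error handling is omitted for brevity; a kernel with fast-poll
 * support is assumed, as noted above. */
#include <liburing.h>

static int submit_linked_recvs(struct io_uring *ring, int sockfd,
                               void *buf0, void *buf1, size_t buflen) {
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, sockfd, buf0, buflen, 0);
	sqe->flags |= IOSQE_IO_HARDLINK;   /* next SQE runs only after this one */
	io_uring_sqe_set_data(sqe, buf0);

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, sockfd, buf1, buflen, 0);
	io_uring_sqe_set_data(sqe, buf1);

	/* a single syscall submits both receives */
	return io_uring_submit(ring);
}
```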
@NeoRaider are you done with the buffer pool? I've got a commit somewhere where I started to implement a dynamic buffer pool (which grows when there is high demand and shrinks when buffers are no longer needed). It looks like your changes are compatible. A dynamic buffer pool is needed to get good performance with io_uring while keeping a low memory footprint. BTW... what about introducing shared memory to implement threading support? Have you given it a thought already? I guess with io_uring the crypto performance will become the bottleneck. Is it possible to do the crypto with packets out of order? Otherwise I'd at least hope to make use of more cores on the servers with multiple slave processes.
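As a rough illustration of what such a dynamic pool could look like (names and the soft cap are made up, this is not the actual fastd buffer code):

```c
/* Illustrative sketch of a dynamic buffer pool: fixed-size packet buffers
 * are cached for reuse, the pool grows on demand and frees surplus buffers
 * when they are returned while the cache is already full. */
#include <stdlib.h>

#define POOL_KEEP 64            /* hypothetical soft cap on cached buffers */

struct buf_pool {
	void *free[POOL_KEEP];
	size_t n_free;
	size_t buf_size;
};

static void *pool_get(struct buf_pool *p) {
	if (p->n_free)
		return p->free[--p->n_free];
	return malloc(p->buf_size); /* grow: allocate when the cache is empty */
}

static void pool_put(struct buf_pool *p, void *buf) {
	if (p->n_free < POOL_KEEP)
		p->free[p->n_free++] = buf;
	else
		free(buf);              /* shrink: drop surplus buffers */
}
```

Something along these lines keeps steady-state allocations near zero while still bounding the memory that is kept around when traffic drops.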
The new buffer implementation is finished. I don't understand the question about shared memory - threads always share their memory? Doing packet processing in threads should be fine on multi-core systems (but it will require some careful locking and/or barriers to ensure that no state is changed when the worker threads do not expect it). I think packet processing for each peer should be serialized to avoid introducing additional reordering (fastd can handle packets reordered by up to 64 sequence numbers, but the transported network protocols may not), but as multi-core systems usually play a central role in a network and are connected to many peers, this could still provide some speedup.
@NeoRaider Sorry, I meant subprocesses, not pthreading - shared memory between processes. Pthreads wouldn't bring a performance increase, I guess, would they? Indeed, I aim to make use of multiple cores, which isn't possible with pthreads only, is it?
Using multiple cores is the main use case of threads. In fact, the Linux kernel does not really distinguish between processes and threads - a thread is just a process that shares its PID, memory, file descriptors, and a few other things with its parent. Using multiple processes as workers only makes sense when you need to isolate them from each other, for example to contain crashes or security issues. For fastd, multithreading is the way to go: it should be easier to implement for our use case and uses fewer resources (as almost all memory is shared).
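As a sketch of how the per-peer serialization mentioned above could be combined with worker threads (all names are assumptions, not fastd APIs): packets of a given peer are always dispatched to the same worker, so no additional reordering is introduced within a peer's stream while different peers are processed on different cores.

```c
/* Sketch of per-peer serialization across worker threads. worker_enqueue()
 * stands in for whatever queue the worker threads consume (assumed to exist
 * elsewhere); struct packet stands in for fastd's buffer type. */
#include <stdint.h>

#define N_WORKERS 4                        /* assumed number of worker threads */

struct packet;                             /* placeholder for the buffer type */

/* assumed to exist: hands a packet to the given worker's queue */
void worker_enqueue(unsigned worker, struct packet *pkt);

static void dispatch_packet(uint64_t peer_id, struct packet *pkt) {
	/* same peer -> same worker -> per-peer ordering is preserved */
	worker_enqueue((unsigned)(peer_id % N_WORKERS), pkt);
}
```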
(Originally raised in freifunk-gluon/gluon#2019)
Currently, fastd needs one syscall to obtain/deliver every packet from/to the kernel. The idea is to avoid this by obtaining and delivering multiple packets per syscall.
@NeoRaider wrote:
A preliminary test using recvmmsg/sendmmsg showed a performance gain of approximately 30% on a small MIPS-based router with a batch size of 64 (see the original thread for details).
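For illustration, a minimal batched receive along those lines might look like the following (names and sizes are assumptions; this is not the code used for the measurement). The send path works analogously with sendmmsg().

```c
/* Sketch of a batched receive: one recvmmsg() call pulls up to BATCH
 * datagrams from the UDP socket instead of one recvfrom() per packet. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 64                /* batch size used in the preliminary test */
#define MTU   1500              /* assumed per-packet buffer size */

static int recv_batch(int sockfd) {
	static char bufs[BATCH][MTU];
	struct iovec iov[BATCH];
	struct mmsghdr msgs[BATCH];

	memset(msgs, 0, sizeof(msgs));
	for (int i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = MTU;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* One syscall delivers up to BATCH datagrams; msgs[i].msg_len holds
	 * the length of the i-th packet on return. */
	int n = recvmmsg(sockfd, msgs, BATCH, 0, NULL);
	return n;                   /* number of packets received, or -1 on error */
}
```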