Handle Package Delivery Related Syscalls with Batch Processing #7

Open · lemoer opened this issue May 10, 2020 · 7 comments
Labels: enhancement (New feature or request)

Comments

@lemoer commented May 10, 2020

(Originally raised in freifunk-gluon/gluon#2019)

Currently, fastd needs one syscall to obtain or deliver every single packet from or to the kernel. The idea is to avoid this overhead by obtaining and delivering multiple packets per syscall.

@NeoRaider wrote:

With Kernel 5.4, we have everything we need in io_uring, making any sendmmsg/recvmmsg-based solution inferior (and we don't need additional kernel patches). So if we make the effort to rework the way fastd handles packets, it should be based on io_uring.

A preliminary test using recvmmsg/sendmmsg showed a performance gain of approximately 30% on a small MIPS-based router with a batch size of 64 (see the original thread for details).
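
For illustration, here is a minimal sketch of the recvmmsg()-based batching idea. It is not the code from the preliminary test; the batch size of 64 and the buffer size are only illustrative.

```c
/* Minimal sketch of batched reception with recvmmsg(2); not the code from
 * the preliminary test. Batch and buffer sizes are only illustrative. */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <string.h>

#define BATCH   64
#define BUFSIZE 1500

static int receive_batch(int sockfd) {
	static char bufs[BATCH][BUFSIZE];
	struct iovec iov[BATCH];
	struct mmsghdr msgs[BATCH];

	memset(msgs, 0, sizeof(msgs));
	for (int i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = BUFSIZE;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* One syscall returns up to BATCH datagrams instead of a single one. */
	int n = recvmmsg(sockfd, msgs, BATCH, MSG_DONTWAIT, NULL);
	for (int i = 0; i < n; i++) {
		/* msgs[i].msg_len is the length of the i-th datagram; hand each
		 * packet to the normal processing path here. */
	}
	return n;
}
```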

@CodeFetch

I remember having looked at this with lemoer. There was still a performance disadvantage compared to an in-kernel tunnel like WireGuard, because the packets have to be copied between user space and kernel space.

Is there a possibility to introduce something comparable to MSG_ZEROCOPY with io_uring? I've always wondered why Jason Donenfeld, the OpenVPN team, or the tinc team didn't work on exposing the virtualization TAP sockets...
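
For reference, MSG_ZEROCOPY already exists for plain sockets; whether and how it can be combined with io_uring and a TAP device is exactly the open question here. A rough sketch of the plain socket API follows, assuming a recent kernel and glibc; send_zerocopy() is a made-up helper name.

```c
/* Rough sketch of MSG_ZEROCOPY on a plain socket (requires a recent kernel
 * and glibc); send_zerocopy() is a hypothetical helper, not fastd code. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static int send_zerocopy(int sockfd, const void *pkt, size_t len) {
	int one = 1;

	/* Opt in once per socket. */
	if (setsockopt(sockfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -1;

	/* The kernel pins the pages instead of copying; the buffer must stay
	 * valid until a completion notification arrives on the error queue
	 * (read via recvmsg() with MSG_ERRQUEUE, carrying a
	 * struct sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY). */
	return send(sockfd, pkt, len, MSG_ZEROCOPY) < 0 ? -1 : 0;
}
```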

@CodeFetch

Hm... I just had a look at the current state of the code. SOCK_ZEROCOPY has already been implemented.
https://elixir.bootlin.com/linux/v5.8-rc3/source/drivers/net/tap.c#L693
So the only big performance difference between an in-kernel tunnel and a userland one is the additional copy of an skb into an iov. We can't get rid of that one, because skbs can't be forced to reside in the user memory region. Actually, I expected the impact of the copying to be much lower... By the way, @lemoer, we already talked to NeoRaider about io_uring last year on IRC...

@CodeFetch

Here's a proof-of-concept patched fastd branch which lemoer and I created:
https://github.com/CodeFetch/fastd/tree/final

Indeed, the syscall overhead can be reduced with io_uring. Unfortunately, a kernel version >5.7 is required to allow poll-retry/fastpoll, which is crucial for the performance gain.

Furthermore, alongside a number of minor bugs, some race conditions seem to occur unless the operations on an individual socket are hard-linked. The patch works around this issue, which introduces a slight performance penalty. It might have been fixed upstream already and needs further testing. I'll open a pull request once NeoRaider has reworked the buffer management to reduce the allocation overhead.
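
Not the code from the linked branch, but a minimal liburing-based sketch of the two points above: checking for the fast-poll feature at setup time, and hard-linking consecutive SQEs on a socket as a workaround. setup_ring() and queue_recv() are hypothetical helper names.

```c
/* Sketch of an io_uring receive path using liburing (not the code from the
 * proof-of-concept branch). Shows the fast-poll feature check and the
 * IOSQE_IO_HARDLINK workaround mentioned above. */
#include <liburing.h>
#include <stdio.h>

#define QUEUE_DEPTH 64

static int setup_ring(struct io_uring *ring) {
	struct io_uring_params params = {0};

	if (io_uring_queue_init_params(QUEUE_DEPTH, ring, &params) < 0)
		return -1;

	/* Without fast poll, recv SQEs on an empty socket are punted to
	 * kernel worker threads and the batching gains largely disappear. */
	if (!(params.features & IORING_FEAT_FAST_POLL))
		fprintf(stderr, "warning: no IORING_FEAT_FAST_POLL support\n");

	return 0;
}

static void queue_recv(struct io_uring *ring, int fd, struct msghdr *msg,
                       int hardlink) {
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_recvmsg(sqe, fd, msg, 0);
	/* Workaround: hard-link consecutive operations on the same socket so
	 * they complete in order; this serializes them and costs some speed. */
	if (hardlink)
		sqe->flags |= IOSQE_IO_HARDLINK;
	io_uring_sqe_set_data(sqe, msg);
}
```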

@CodeFetch

@NeoRaider are you done with the buffer pool? I've got a commit somewhere where I started to implement a dynamic buffer pool (one that grows under high demand and shrinks when the buffers are no longer needed). It looks like your changes are compatible. A dynamic buffer pool is needed to get good performance out of io_uring while keeping a low memory footprint. A hypothetical sketch of such a pool follows below.
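
Purely for illustration, a hypothetical sketch of such a dynamic pool (not the commit mentioned above, and not fastd's new buffer code): a free list that allocates on demand and frees buffers beyond an idle cap.

```c
/* Hypothetical sketch of a dynamic buffer pool (not fastd code): a free list
 * that grows on demand and is trimmed once too many buffers sit idle. */
#include <stdlib.h>

struct buf {
	struct buf *next;
	size_t len;
	char data[2048];            /* illustrative fixed packet buffer size */
};

struct buf_pool {
	struct buf *free_list;
	size_t free_count;
	size_t max_idle;            /* shrink target when demand drops */
};

static struct buf *pool_get(struct buf_pool *pool) {
	struct buf *b = pool->free_list;
	if (b) {
		pool->free_list = b->next;
		pool->free_count--;
		return b;
	}
	return malloc(sizeof(*b));  /* grow: allocate when the pool is empty */
}

static void pool_put(struct buf_pool *pool, struct buf *b) {
	if (pool->free_count >= pool->max_idle) {
		free(b);                /* shrink: drop buffers beyond the idle cap */
		return;
	}
	b->next = pool->free_list;
	pool->free_list = b;
	pool->free_count++;
}
```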

BTW, what about introducing shared memory to implement threading support? Have you given it a thought already? I guess that with io_uring the crypto performance will become the bottleneck. Is it possible to do the crypto with packets out of order? Otherwise, I'd at least hope to make use of more cores on the servers with multiple slave processes.

@neocturne (Owner)

The new buffer implementation is finished.

I don't understand the question about shared memory - threads always share their memory? Doing packet processing in threads should be fine on multi-core systems (but it will require some careful locking and/or barriers to ensure that no state is changed when the worker threads do not expect it).

I think packet processing for each peer should be serialized to avoid introducing additional reordering (fastd can handle packets reordered by up to 64 sequence numbers, but the transported network protocols may not), but as multi-core systems usually play a central role in a network and are connected to many peers, this could still provide some speedup.
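
A hypothetical sketch of what such per-peer serialization could look like (not fastd code): each peer is pinned to one worker thread, so its packets stay in order, while different peers can run on different cores.

```c
/* Hypothetical sketch of per-peer serialization (not fastd code): packets of
 * a given peer always go to the same worker thread, so that peer's packets
 * stay in order, while different peers can be handled on different cores.
 * Mutex/condvar initialization and the worker loop are omitted for brevity. */
#include <pthread.h>

#define N_WORKERS 4

struct packet {
	struct packet *next;
	/* payload omitted */
};

struct worker {
	pthread_mutex_t lock;
	pthread_cond_t wake;
	struct packet *queue;       /* simple LIFO, just for the sketch */
};

static struct worker workers[N_WORKERS];

static void dispatch(unsigned int peer_id, struct packet *pkt) {
	/* Pinning a peer to a fixed worker serializes that peer's traffic;
	 * load is balanced across peers, not across individual packets. */
	struct worker *w = &workers[peer_id % N_WORKERS];

	pthread_mutex_lock(&w->lock);
	pkt->next = w->queue;
	w->queue = pkt;
	pthread_cond_signal(&w->wake);
	pthread_mutex_unlock(&w->lock);
}
```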

@CodeFetch

@NeoRaider Sorry, I meant subprocesses, not pthreads - shared memory between processes. Pthreads wouldn't bring a performance increase, I guess, would they? Indeed, I aim to make use of multiple cores, which isn't possible with pthreads alone, is it?

@neocturne (Owner)

Using multiple cores is the main use case of threads. In fact, the Linux kernel does not really distinguish between processes and threads - a thread is just a process that shares its PID, memory, file descriptors, and a few other things with its parent.

Using multiple processes as workers only makes sense when you need to isolate them from each other, for example to contain crashes or security issues. For fastd, multithreading is the way to go: It should be easier to implement for our use case and uses fewer resources (as almost all memory is shared).
