Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Followup items from #758 Async IO threads #761

Open
1 of 10 tasks
madolson opened this issue Jul 9, 2024 · 15 comments
Open
1 of 10 tasks

Followup items from #758 Async IO threads #761

madolson opened this issue Jul 9, 2024 · 15 comments

Comments

@madolson
Copy link
Member

madolson commented Jul 9, 2024

List of pending items from #758, all of these are future enhancements, and not pending issues that need to be merged before Valkey 8 (but they could be).

@uriyage
Copy link
Contributor

uriyage commented Aug 6, 2024

  • Avoid recalculating slots when the I/O thread has already done so

@zuiderkwast
Copy link
Contributor

zuiderkwast commented Aug 27, 2024

  • Offload PING command

Edit: Let's not do it. See discussion below.

@zuiderkwast zuiderkwast changed the title Followup items from https://github.com/valkey-io/valkey/pull/758 Followup items from #758 Async IO threads Aug 29, 2024
@rjd15372
Copy link
Contributor

Hi all, I'll start working on the "Offload PING command".

@rjd15372
Copy link
Contributor

rjd15372 commented Oct 8, 2024

I've been working on the offload of the PING command and I came up with two approaches that I would like your opinion on.

Approach A)

The IO thread that is choose to handle the read of the PING command, read, parses, and executes the command, and then also writes the reply, before processing a new IO job,

Approach B)

After the PING command has been read and parsed, the main thread schedules the execution of the PING command (enqueues a new job in the IO queue). The IO thread that picks this "execution" job, executes the job and enqueues the write job in the IO queue.

Approach A is more simple, but it might block the IO thread if the socket is not available to write, while approach B is more complex but more asynchronous.

I'm in favor of approach B, even though it will require a generalization of the IO threads component to support a different kind (non-IO job), and also support the concurrent insertion of jobs in the shared queue.

@zuiderkwast @madolson thoughts?

@zuiderkwast
Copy link
Contributor

@uriyage, you know IO threading better than anyone. Can you help?

@rjd15372 Approach B sounds like it will not be worth the overhead. Executing PING is very fast, maybe even faster executing in the main thread than delegating it to some thread to execute it.

I like approach A, which should add zero work to the main thread if the socket is writable, which is probably the normal case. If it's not writable, we need to register a write handler in the event loop and add the client to the global list server.clients_pending_io_write, which is not thread safe, so this probably needs to be done by the main thread (unless we refactor this in some way).

When the main thread delegates a read-and-parse-command job to an IO thread, it does among other things..

    c->io_read_state = CLIENT_PENDING_IO;
    connSetPostponeUpdateState(c->conn, 1);
    IOJobQueue_push(jq, ioThreadReadQueryFromClient, c);
    c->flag.pending_read = 1;
    listLinkNodeTail(server.clients_pending_io_read, &c->pending_read_list_node);

Later, the main thread checks if the clients are ready by calling processIOThreadsReadDone which loops over all clients in this list. Here, can we indicate that instead of parsing a command, the thread wants to write something to the client instead? We have the c->io_read_state and c->io_write_state variables to indicate these things back to the main thread. Or we can add a new flag in c->read_flags for this special case. I'm not quite sure about this. @uriyage WDYT?

Now, I'm just thinking, is it possible that the main thread has delegated read-and-parse for a client to one IO thread and also delegated write to that same client to another IO thread? That may be a problem... Have we've already prevented this due to TLS?

@uriyage
Copy link
Contributor

uriyage commented Oct 9, 2024

@zuiderkwast @rjd15372
Approach A could lead to potential issues if the client already has pending data in the reply buffer or if the main thread simultaneously writes to the client after delegating the read (e.g., if the client is subscribed to a channel or MONITOR).
As you mentioned, we have read flags that the main thread can use to indicate whether the I/O thread can write to the client (when the client has no pending replies and is not subscribed or in a similar state). We currently use the same flags to indicate if the I/O thread can parse a command (if the client is not blocked) or if it should just read.
When the read job returns, we will need to:

  1. Use flags to indicate that a command was executed
  2. Update the cmd_stat and other relevant statistics
  3. Handle the out-bytes stats as well

If the socket is not writeable, we will have to store the command output (or partial output if we succeed in writing part of it) in the client's reply buffer to be sent later.

Overall, I believe the performance gain from both approaches is marginal and may not justify the added complexity. Perhaps we should consider this approach only for more time-consuming commands that can be delegated to the I/O threads.

Regarding your question, we already ensure that the same thread that reads is the one that writes to the client.

@rjd15372
Copy link
Contributor

rjd15372 commented Oct 9, 2024

@uriyage @zuiderkwast
I also agree that for the specific case of PING, we will not gain a lot by doing this, but the changes that we make for PING will be the same for any other commands that we will want to offload in the future.

My idea is to make it generic, and add a new command flag that states if the command can be offloaded.

@zuiderkwast
Copy link
Contributor

Thanks Uri! Good points. I guess offloading PING may be not worth it then.

I don't see the point of doing it for PING just as a preparation for other commands. It's better to do it for another command where it actually matters. I don't know any command that doesn't need to access any global data though. Do you?

Perhaps it's better to focus on another of the follow-up items?

@uriyage
Copy link
Contributor

uriyage commented Nov 7, 2024

  • Remove err_clean in tls

#761

@uriyage
Copy link
Contributor

uriyage commented Nov 21, 2024

@hwware
Copy link
Member

hwware commented Nov 21, 2024

I've been working on the offload of the PING command and I came up with two approaches that I would like your opinion on.

Approach A)

The IO thread that is choose to handle the read of the PING command, read, parses, and executes the command, and then also writes the reply, before processing a new IO job,

Approach B)

After the PING command has been read and parsed, the main thread schedules the execution of the PING command (enqueues a new job in the IO queue). The IO thread that picks this "execution" job, executes the job and enqueues the write job in the IO queue.

Approach A is more simple, but it might block the IO thread if the socket is not available to write, while approach B is more complex but more asynchronous.

I'm in favor of approach B, even though it will require a generalization of the IO threads component to support a different kind (non-IO job), and also support the concurrent insertion of jobs in the shared queue.

@zuiderkwast @madolson thoughts?

Approach B is what we want, PING command, even including hello command, they are health status commands of the server, they should be executed by separated thread without any data command blocking.

@madolson
Copy link
Member Author

I don't want to offload the ping command. I feel like part of the job of the ping command is to determine the health of the server, so it could be offloaded even if the server is in some type of dead loop.

@arukiidou
Copy link

arukiidou commented Dec 2, 2024

@uriyage
Copy link
Contributor

uriyage commented Dec 9, 2024

@uriyage
Copy link
Contributor

uriyage commented Dec 17, 2024

ranshid added a commit that referenced this issue Dec 18, 2024
## TLS Negotiation Offloading to I/O Threads

### Overview
This PR introduces the ability to offload TLS handshake negotiations to
I/O threads, significantly improving performance under high TLS
connection loads.

### Key Changes
- Added infrastructure to offload TLS negotiations to I/O threads
- Refactored SSL event handling to allow I/O threads modify conn flags.
- Introduced new connection flag to identify client connections

### Performance Impact
Testing with 650 clients with SET commands and 160 new TLS connections
per second in the background:

#### Throughput Impact of new TLS connections
- **With Offloading**: Minimal impact (1050K → 990K ops/sec)
- **Without Offloading**: Significant drop (1050K → 670K ops/sec)

#### New Connection Rate
- **With Offloading**: 
  - 1,757 conn/sec
- **Without Offloading**: 
  - 477 conn/sec

### Implementation Details
1. **Main Thread**:
   - Initiates negotiation-offload jobs to I/O threads
- Adds connections to pending-read clients list (using existing read
offload mechanism)
   - Post-negotiation handling:
     - Creates read/write events if needed for incomplete negotiations
     - Calls accept handler for completed negotiations

2. **I/O Thread**:
   - Performs TLS negotiation
   - Updates connection flags based on negotiation result

Related issue:#761

---------

Signed-off-by: Uri Yagelnik <[email protected]>
Signed-off-by: ranshid <[email protected]>
Co-authored-by: ranshid <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
zuiderkwast pushed a commit that referenced this issue Jan 2, 2025
Support Primary client IO offload.

Related issue: #761

---------

Signed-off-by: Uri Yagelnik <[email protected]>
ranshid pushed a commit that referenced this issue Jan 8, 2025
# Refactor client structure to use modular data components

## Current State
The client structure allocates memory for replication / pubsub /
multi-keys / module / blocked data for every client, despite these
features being used by only a small subset of clients. In addition the
current field layout in the client struct is suboptimal, with poor
alignment and unnecessary padding between fields, leading to a larger
than necessary memory footprint of 896 bytes per client. Furthermore,
fields that are frequently accessed together during operations are
scattered throughout the struct, resulting in poor cache locality.

## This PR's Change

1.  Lazy Initialization 
- **Components are only allocated when first used:**
  - PubSubData: Created on first SUBSCRIBE/PUBLISH operation
  - ReplicationData: Initialized only for replica connections
  - ModuleData: Allocated when module interaction begins
  - BlockingState: Created when first blocking command is issued
  - MultiState: Initialized on MULTI command

2. Memory Layout Optimization:
   - Grouped related fields for better locality
   - Moved rarely accessed fields (e.g., client->name) to struct end
   - Optimized field alignment to eliminate padding

3. Additional changes:
   - Moved watched_keys to be static allocated in the `mstate` struct
   - Relocated replication init logic to replication.c
  

### Key Benefits
- **Efficient Memory Usage:**
- 45% smaller base client structure - Basic clients now use 528 bytes
(down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance
regressions in other cases.


### Performance Impact

Tested with 650 clients and 512 bytes values.

#### Single Thread Performance
| Operation   | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET        | 1 key   | 261,799      | 258,261      | +1.37%    |
| SET        | 3M keys | 209,134      | ~209,000     | ~0%       |
| GET        | 1 key   | 281,564      | 277,965      | +1.29%    |
| GET        | 3M keys | 231,158      | 228,410      | +1.20%    |

#### 8 IO Threads Performance
| Operation   | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET        | 1 key   | 1,331,578    | 1,331,626    | -0.00%    |
| SET        | 3M keys | 1,254,441    | 1,152,645    | +8.83%    |
| GET        | 1 key   | 1,293,149    | 1,289,503    | +0.28%    |
| GET        | 3M keys | 1,152,898    | 1,101,791    | +4.64%    |

#### Pipeline Performance (3M keys)
| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|--------------|---------------|---------------|-----------|
| SET       | 10          | 548,964      | 538,498      | +1.94%    |
| SET       | 20          | 606,148      | 594,872      | +1.89%    |
| SET       | 30          | 631,122      | 616,606      | +2.35%    |
| GET       | 10          | 628,482      | 624,166      | +0.69%    |
| GET       | 20          | 687,371      | 681,659      | +0.84%    |
| GET       | 30          | 725,855      | 721,102      | +0.66%    |

### Observations:
1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large
datasets:
   - SET with 3M keys: +8.83% improvement
   - GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
   - SET operations: +1.89% to +2.35%
   - GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario


Related issue:#761

---------

Signed-off-by: Uri Yagelnik <[email protected]>
Signed-off-by: uriyage <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants