Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: (0.10) Ensure Respond Waits for IO Completion #1180

Merged
merged 1 commit into from
Jul 20, 2024

Conversation

drmingdrmer
Copy link
Member

@drmingdrmer drmingdrmer commented Jul 19, 2024

Changelog

Fix: (0.10) Ensure Respond Waits for IO Completion

This commit addresses the premature sending of Respond messages in
response to AppendEntries RPCs and similar requests. Previously,
responses could be sent before the associated IO operations were
confirmed as complete, potentially leading to inconsistencies.

Changes:

  • The Respond now blocks until the corresponding IO operation
    is completed and properly recorded in IOProgress.

  • RequestVote and AppendEntries RPC handlers now include a
    Condition that remains unsatisfied until operations like SaveVote
    or AppendInputEntries are flushed to disk.

  • Add IOProgress:

    • accepted: Tracks the ID of the last IO operation that was accepted but not yet submitted.
    • submitted: The ID of the last IO operation submitted to either RaftLogStorage or RaftStateMachine.
    • flushed: The ID of the last IO operation successfully flushed to storage.
  • Remove command_seq: Instead of using command_seq to track the
    completion of state machine commands, this update uses
    IOState.applied and IOState.snapshot. This change accounts for the
    non-sequential execution of some SM commands, such as BuildSnapshot,
    which runs in a separate task.

  • Remove IOState.vote: The IOState.vote field has been
    removed. The system now utilizes
    IOState.io_progress.flushed().voted_ref() for tracking vote
    operations.


This change is Reviewable

@drmingdrmer drmingdrmer marked this pull request as draft July 19, 2024 14:22
@drmingdrmer drmingdrmer force-pushed the 103-fix-callbak branch 5 times, most recently from 1161d52 to 00f413e Compare July 19, 2024 15:19
@drmingdrmer drmingdrmer marked this pull request as ready for review July 19, 2024 15:28
This commit addresses the premature sending of `Respond` messages in
response to `AppendEntries` RPCs and similar requests. Previously,
responses could be sent before the associated IO operations were
confirmed as complete, potentially leading to inconsistencies.

Changes:

- The `Respond` now blocks until the corresponding IO operation
  is completed and properly recorded in `IOProgress`.

- `RequestVote` and `AppendEntries` RPC handlers now include a
  `Condition` that remains unsatisfied until operations like `SaveVote`
  or `AppendInputEntries` are flushed to disk.

- Add `IOProgress`:
    - **`accepted`:** Tracks the ID of the last IO operation that was accepted but not yet submitted.
    - **`submitted`:** The ID of the last IO operation submitted to either `RaftLogStorage` or `RaftStateMachine`.
    - **`flushed`:** The ID of the last IO operation successfully flushed to storage.

- Remove `command_seq`: Instead of using `command_seq` to track the
  completion of state machine commands, this update uses
  `IOState.applied` and `IOState.snapshot`. This change accounts for the
  non-sequential execution of some SM commands, such as `BuildSnapshot`,
  which runs in a separate task.

- Remove `IOState.vote`: The `IOState.vote` field has been
  removed. The system now utilizes
  `IOState.io_progress.flushed().voted_ref()` for tracking vote
  operations.
@drmingdrmer drmingdrmer changed the title Fix: in 0.10: Respond should not be send to early Fix: (0.10) Ensure Respond Waits for IO Completion Jul 19, 2024
@drmingdrmer drmingdrmer requested a review from schreter July 19, 2024 15:37
Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The principle with accepted/submitted/flushed is good, but the handling is getting a bit convoluted.

I also thought about how to handle most of openraft internal communication in constant memory and more streamlined. Currently, there are too many memory allocations, which may run into OOM. Further, although operations are batched (which is good), the batches are given by openraft and not taken by the storage/log as-needed. A Stream-like interface sourcing entries to apply and to replicate with another Stream-like consuming (potentially out-of-order) replies would be helpful to get from RPC-like to dataflow-like interface for apply/replicate.

If the Stream is built on top of a circular buffer, then it would be possible to also apply backpressure to slow down previous steps in order not to overload the network, for instance. An example would be the log coming in from the client_write() on the leader or apply_entries() on the follower, which would simply put the entries into the circular buffer, from which they can be consumed by keeping appropriate pointers by log write, apply and replication w/o further copy or stream (just watching a pointer update in respective task).

I'll try to formulate more complete design in a new issue as a basis for discussion.

Reviewed 24 of 51 files at r1, 3 of 3 files at r2, 6 of 14 files at r3, 18 of 18 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @drmingdrmer)


openraft/src/core/raft_core.rs line 1296 at r5 (raw file):

            Notification::LocalIO { io_id } => {
                self.engine.state.io_state.io_progress.flush(io_id);

I'm not quite clear on how do you ensure that I/Os completing out-of-order won't confirm the older ones still in progress. For log, that's not likely, though not impossible (for state machine, apply()'s I/Os can complete in arbitrary order).

In our project, we solved the reordering issue at another place using a BinaryHeap. I.e., push the completed I/O sequence into the heap. Then, while the heap smallest value is the expected I/O sequence, pop it out and confirm it. Thus, "too new" I/Os will just push its sequence number, "expected" I/Os can fill a gap and complete a larger range.

@drmingdrmer drmingdrmer requested a review from schreter July 20, 2024 02:17
Copy link
Member Author

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. That's a good point to avoid memory allocation by Openraft itself.

I will investigate whether it is feasible for OpenRaft to operate without any memory allocation.

Reviewed 35 of 51 files at r1, 3 of 3 files at r2, 14 of 14 files at r3, 18 of 18 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @schreter)


openraft/src/core/raft_core.rs line 1296 at r5 (raw file):

Previously, schreter wrote…

I'm not quite clear on how do you ensure that I/Os completing out-of-order won't confirm the older ones still in progress. For log, that's not likely, though not impossible (for state machine, apply()'s I/Os can complete in arbitrary order).

In our project, we solved the reordering issue at another place using a BinaryHeap. I.e., push the completed I/O sequence into the heap. Then, while the heap smallest value is the expected I/O sequence, pop it out and confirm it. Thus, "too new" I/Os will just push its sequence number, "expected" I/Os can fill a gap and complete a larger range.

RaftLogStorage operates under the assumption that it must apply and flush I/O operations in the order they are received. If a previous I/O operation has not yet been flushed, RaftLogStorage must not proceed to flush the next I/O. Failing to adhere to this order will compromise Raft consistency.

This rule also applies to RaftStateMachine.

Therefore, an out-of-order callback confirms an I/O operation and all preceding ones.

The BinaryHeap seems a bit clumsy in this situation If the sequence number window for out-of-order I/O operations is relatively small (e.g., a few thousand), employing a VecDeque would likely be more convenient and efficient.

Copy link
Member Author

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @schreter)


openraft/src/core/raft_core.rs line 1296 at r5 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

RaftLogStorage operates under the assumption that it must apply and flush I/O operations in the order they are received. If a previous I/O operation has not yet been flushed, RaftLogStorage must not proceed to flush the next I/O. Failing to adhere to this order will compromise Raft consistency.

This rule also applies to RaftStateMachine.

Therefore, an out-of-order callback confirms an I/O operation and all preceding ones.

The BinaryHeap seems a bit clumsy in this situation If the sequence number window for out-of-order I/O operations is relatively small (e.g., a few thousand), employing a VecDeque would likely be more convenient and efficient.

Hmmm... I changed my mind. BinaryHeap would be more simple in this case.

@drmingdrmer drmingdrmer merged commit ee460f3 into databendlabs:main Jul 20, 2024
32 of 33 checks passed
@drmingdrmer drmingdrmer deleted the 103-fix-callbak branch July 20, 2024 02:56
@drmingdrmer drmingdrmer mentioned this pull request Jul 20, 2024
10 tasks
Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion


openraft/src/core/raft_core.rs line 1296 at r5 (raw file):

Previously, drmingdrmer (张炎泼) wrote…

Hmmm... I changed my mind. BinaryHeap would be more simple in this case.

I created an issue describing concepts how to avoid memory allocations in steady state. With that concept, it would be the responsibility of the user to confirm writes in proper order, since the user will have to just update applied or flushed pointer. Then, there is no issue on openraft side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants