-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: (0.10) Ensure Respond
Waits for IO Completion
#1180
Conversation
1161d52
to
00f413e
Compare
This commit addresses the premature sending of `Respond` messages in response to `AppendEntries` RPCs and similar requests. Previously, responses could be sent before the associated IO operations were confirmed as complete, potentially leading to inconsistencies. Changes: - The `Respond` now blocks until the corresponding IO operation is completed and properly recorded in `IOProgress`. - `RequestVote` and `AppendEntries` RPC handlers now include a `Condition` that remains unsatisfied until operations like `SaveVote` or `AppendInputEntries` are flushed to disk. - Add `IOProgress`: - **`accepted`:** Tracks the ID of the last IO operation that was accepted but not yet submitted. - **`submitted`:** The ID of the last IO operation submitted to either `RaftLogStorage` or `RaftStateMachine`. - **`flushed`:** The ID of the last IO operation successfully flushed to storage. - Remove `command_seq`: Instead of using `command_seq` to track the completion of state machine commands, this update uses `IOState.applied` and `IOState.snapshot`. This change accounts for the non-sequential execution of some SM commands, such as `BuildSnapshot`, which runs in a separate task. - Remove `IOState.vote`: The `IOState.vote` field has been removed. The system now utilizes `IOState.io_progress.flushed().voted_ref()` for tracking vote operations.
00f413e
to
877b464
Compare
Respond
should not be send to earlyRespond
Waits for IO Completion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The principle with accepted/submitted/flushed is good, but the handling is getting a bit convoluted.
I also thought about how to handle most of openraft
internal communication in constant memory and more streamlined. Currently, there are too many memory allocations, which may run into OOM. Further, although operations are batched (which is good), the batches are given by openraft
and not taken by the storage/log as-needed. A Stream
-like interface sourcing entries to apply and to replicate with another Stream
-like consuming (potentially out-of-order) replies would be helpful to get from RPC-like to dataflow-like interface for apply/replicate.
If the Stream
is built on top of a circular buffer, then it would be possible to also apply backpressure to slow down previous steps in order not to overload the network, for instance. An example would be the log coming in from the client_write()
on the leader or apply_entries()
on the follower, which would simply put the entries into the circular buffer, from which they can be consumed by keeping appropriate pointers by log write, apply and replication w/o further copy or stream (just watching a pointer update in respective task).
I'll try to formulate more complete design in a new issue as a basis for discussion.
Reviewed 24 of 51 files at r1, 3 of 3 files at r2, 6 of 14 files at r3, 18 of 18 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @drmingdrmer)
openraft/src/core/raft_core.rs
line 1296 at r5 (raw file):
Notification::LocalIO { io_id } => { self.engine.state.io_state.io_progress.flush(io_id);
I'm not quite clear on how do you ensure that I/Os completing out-of-order won't confirm the older ones still in progress. For log, that's not likely, though not impossible (for state machine, apply()
's I/Os can complete in arbitrary order).
In our project, we solved the reordering issue at another place using a BinaryHeap
. I.e., push the completed I/O sequence into the heap. Then, while the heap smallest value is the expected I/O sequence, pop it out and confirm it. Thus, "too new" I/Os will just push its sequence number, "expected" I/Os can fill a gap and complete a larger range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. That's a good point to avoid memory allocation by Openraft itself.
I will investigate whether it is feasible for OpenRaft to operate without any memory allocation.
Reviewed 35 of 51 files at r1, 3 of 3 files at r2, 14 of 14 files at r3, 18 of 18 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @schreter)
openraft/src/core/raft_core.rs
line 1296 at r5 (raw file):
Previously, schreter wrote…
I'm not quite clear on how do you ensure that I/Os completing out-of-order won't confirm the older ones still in progress. For log, that's not likely, though not impossible (for state machine,
apply()
's I/Os can complete in arbitrary order).In our project, we solved the reordering issue at another place using a
BinaryHeap
. I.e., push the completed I/O sequence into the heap. Then, while the heap smallest value is the expected I/O sequence, pop it out and confirm it. Thus, "too new" I/Os will just push its sequence number, "expected" I/Os can fill a gap and complete a larger range.
RaftLogStorage
operates under the assumption that it must apply and flush I/O operations in the order they are received. If a previous I/O operation has not yet been flushed, RaftLogStorage
must not proceed to flush the next I/O. Failing to adhere to this order will compromise Raft consistency.
This rule also applies to RaftStateMachine
.
Therefore, an out-of-order callback confirms an I/O operation and all preceding ones.
The BinaryHeap seems a bit clumsy in this situation If the sequence number window for out-of-order I/O operations is relatively small (e.g., a few thousand), employing a VecDeque
would likely be more convenient and efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @schreter)
openraft/src/core/raft_core.rs
line 1296 at r5 (raw file):
Previously, drmingdrmer (张炎泼) wrote…
RaftLogStorage
operates under the assumption that it must apply and flush I/O operations in the order they are received. If a previous I/O operation has not yet been flushed,RaftLogStorage
must not proceed to flush the next I/O. Failing to adhere to this order will compromise Raft consistency.This rule also applies to
RaftStateMachine
.Therefore, an out-of-order callback confirms an I/O operation and all preceding ones.
The BinaryHeap seems a bit clumsy in this situation If the sequence number window for out-of-order I/O operations is relatively small (e.g., a few thousand), employing a
VecDeque
would likely be more convenient and efficient.
Hmmm... I changed my mind. BinaryHeap would be more simple in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion
openraft/src/core/raft_core.rs
line 1296 at r5 (raw file):
Previously, drmingdrmer (张炎泼) wrote…
Hmmm... I changed my mind. BinaryHeap would be more simple in this case.
I created an issue describing concepts how to avoid memory allocations in steady state. With that concept, it would be the responsibility of the user to confirm writes in proper order, since the user will have to just update applied
or flushed
pointer. Then, there is no issue on openraft
side.
Changelog
Fix: (0.10) Ensure
Respond
Waits for IO CompletionThis commit addresses the premature sending of
Respond
messages inresponse to
AppendEntries
RPCs and similar requests. Previously,responses could be sent before the associated IO operations were
confirmed as complete, potentially leading to inconsistencies.
Changes:
The
Respond
now blocks until the corresponding IO operationis completed and properly recorded in
IOProgress
.RequestVote
andAppendEntries
RPC handlers now include aCondition
that remains unsatisfied until operations likeSaveVote
or
AppendInputEntries
are flushed to disk.Add
IOProgress
:accepted
: Tracks the ID of the last IO operation that was accepted but not yet submitted.submitted
: The ID of the last IO operation submitted to eitherRaftLogStorage
orRaftStateMachine
.flushed
: The ID of the last IO operation successfully flushed to storage.Remove
command_seq
: Instead of usingcommand_seq
to track thecompletion of state machine commands, this update uses
IOState.applied
andIOState.snapshot
. This change accounts for thenon-sequential execution of some SM commands, such as
BuildSnapshot
,which runs in a separate task.
Remove
IOState.vote
: TheIOState.vote
field has beenremoved. The system now utilizes
IOState.io_progress.flushed().voted_ref()
for tracking voteoperations.
This change is