This repository has been archived by the owner on Mar 31, 2022. It is now read-only.

Handle "repair finished" message not reaching all peers #124

Open
Bj0rnen opened this issue Oct 5, 2015 · 0 comments

Bj0rnen commented Oct 5, 2015

We recently ran into a hanging repair run. Some segments kept getting postponed indefinitely because one of the involved nodes reported that it was still participating in a repair. We got the repair session's hash from that node's log. The other nodes' logs reported this session as finished, but that node's log did not. So apparently the cross-node communication within Cassandra had failed there.

Reaper, on the other hand, was notified that the repair was done, so it moved on to the remaining segments. But segments within that node's range were then stopped by SegmentRunner::canRepair, because according to the node a repair was already underway.

Potential fix: when SegmentRunner::canRepair discovers a node that's already busy with a repair, compare with Reaper's storage to determine whether that node really should have a repair ongoing. If not, use JmxProxy::cancelAllRepairs to clear the node's state.
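A rough sketch of what that check could look like, to make the idea concrete. This is not Reaper's actual code: `NodeProxy`, `RepairStorage`, `hasRunningSegmentOn`, `FakeNode` and `StaleRepairCheck` are hypothetical stand-ins made up for illustration. Only the names `canRepair` and `cancelAllRepairs` come from the description above, and their real signatures may differ.

```java
// Hypothetical stand-in for the small part of JmxProxy this sketch needs.
interface NodeProxy {
    boolean isRepairRunning();   // what the node itself reports
    void cancelAllRepairs();     // clears the node's repair state
}

// Hypothetical view of Reaper's storage: does Reaper itself believe
// it has a segment running that involves this host?
interface RepairStorage {
    boolean hasRunningSegmentOn(String host);
}

// In-memory fake node, only here so the sketch can be run on its own.
final class FakeNode implements NodeProxy {
    private boolean repairRunning;
    int cancelCalls = 0;

    FakeNode(boolean repairRunning) { this.repairRunning = repairRunning; }

    @Override public boolean isRepairRunning() { return repairRunning; }

    @Override public void cancelAllRepairs() {
        cancelCalls++;
        repairRunning = false;
    }
}

final class StaleRepairCheck {
    private StaleRepairCheck() {}

    /** Returns true if the node is free to take part in a new segment. */
    static boolean canRepair(String host, NodeProxy node, RepairStorage storage) {
        if (!node.isRepairRunning()) {
            return true;                 // node is idle, nothing to reconcile
        }
        if (storage.hasRunningSegmentOn(host)) {
            return false;                // Reaper agrees: a real repair is ongoing, postpone
        }
        // Node says "busy" but Reaper's storage has nothing running there:
        // assume the "repair finished" message was lost and clear the node.
        node.cancelAllRepairs();
        return !node.isRepairRunning();  // re-check instead of assuming the cancel worked
    }

    public static void main(String[] args) {
        FakeNode stuck = new FakeNode(true);
        // Storage knows of no running segment, so the stale state gets cleared.
        System.out.println(canRepair("10.0.0.1", stuck, host -> false));
        System.out.println(stuck.cancelCalls);
    }
}
```

One thing to keep in mind with this approach: a repair started outside Reaper (for example manually on the node) would not be in Reaper's storage either, so this check would cancel it as well.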
