Handle "repair finished" message not reaching all peers #124

Bj0rnen · 2015-10-05T14:25:41Z

We recently ran into a hanging repair run. Some segments kept getting postponed indefinitely because one of the involved nodes reported that it was still participating in a repair. We got the repair session's hash from that node's log. The other nodes' logs showed this session as finished, but that node's log did not. So apparently the cross-communication within Cassandra had failed there, and the "repair finished" message never reached that node.

Reaper, on the other hand, was notified that the repair was done, so it moved on to the remaining segments. But segments within that node's range were then blocked by SegmentRunner::canRepair, because according to the node a repair was already underway.

Potential fix: when SegmentRunner::canRepair discovers a node that's already busy with a repair, compare with Reaper's storage to determine whether that node really should have a repair ongoing. If not, use JmxProxy::cancelAllRepairs to clear that node's state.
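
To make the idea concrete, here is a rough sketch of the check in Python. It is not Reaper's actual code: `FakeNode`, `Storage`, `hosts_with_running_segments` and `can_repair` are hypothetical stand-ins that only mirror the roles of JmxProxy, Reaper's storage and SegmentRunner::canRepair, and the sketch assumes storage can tell us which hosts Reaper itself believes are currently repairing.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class FakeNode:
    """Hypothetical stand-in for a JmxProxy connection to one Cassandra node."""
    host: str
    repair_running: bool = False

    def is_repair_running(self) -> bool:
        return self.repair_running

    def cancel_all_repairs(self) -> None:
        # Plays the role of JmxProxy::cancelAllRepairs: clears the node's repair state.
        self.repair_running = False


@dataclass
class Storage:
    """Hypothetical view of Reaper's storage: hosts that Reaper believes are repairing."""
    hosts_with_running_segments: Set[str] = field(default_factory=set)


def can_repair(nodes: Iterable[FakeNode], storage: Storage) -> bool:
    """Return True if a new segment may start on all of the given nodes."""
    for node in nodes:
        if not node.is_repair_running():
            continue
        if node.host in storage.hosts_with_running_segments:
            # Reaper knows about this repair, so it is genuine: postpone.
            return False
        # The node claims to be repairing, but Reaper has no record of it.
        # Treat it as a lingering session and clear the node's state.
        node.cancel_all_repairs()
        if node.is_repair_running():
            # Cancelling did not help; stay conservative and postpone.
            return False
    return True
```

One caveat with this approach: a repair started outside Reaper (for example, manually on the node) would look just like a lingering one and get cancelled too, so the comparison against storage probably needs to be applied with some care.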

Bj0rnen mentioned this issue Nov 10, 2015

Bj0rnen/kill lingering repairs #128

Merged
