It looks like if a slave dies, some % of slave requests will simply fail. Feature request: support failure detection, and some kind of circuit-breaker recovery mechanism.
There are two kinds of failure in this scenario: transient and permanent.
When a slave fails permanently, the infrastructure shuffling that follows is tightly coupled to each organisation's setup.
One could put VIPs in front of each slave and keep the reshuffling below the application layer, but I don't think most setups are that advanced yet. If the master fails, things get even dirtier with manual or automatic failover: what previously was a slave could now be a master, which breaks all the assumptions in this library. With so many fragile moving parts, this sort of dynamic reconfiguration is perhaps better handled by the user of the library, by re-instantiating the DB object (see the sketch below).
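To make that last point concrete, here is a minimal sketch of what keeping the reconfiguration in the application could look like. Everything in it (the `openCluster` constructor, the `cluster` type, the DSN list) is hypothetical and not part of this library's API:

```go
// Sketch: the application owns the DB handle and swaps in a freshly built one
// when its own tooling detects a failover or topology change. The library is
// never asked to reason about a slave becoming a master.
package dbconf

import "sync/atomic"

// cluster is a hypothetical stand-in for whatever handle the library returns.
type cluster struct {
	// master and slave connections would live here; omitted in this sketch.
	dsns []string
}

// openCluster is a hypothetical constructor; a real application would open
// the master and slave connections from the given DSNs here.
func openCluster(dsns []string) (*cluster, error) {
	return &cluster{dsns: dsns}, nil
}

// current holds the cluster handle the rest of the application reads from.
var current atomic.Pointer[cluster]

// reconfigure re-instantiates the DB object with the new topology and swaps
// it in atomically, leaving the old handles to be drained and closed.
func reconfigure(dsns []string) error {
	db, err := openCluster(dsns)
	if err != nil {
		return err
	}
	old := current.Swap(db)
	_ = old // close the old master/slave handles here once in-flight queries drain
	return nil
}
```

The point is only that the application decides *when* to rebuild, based on whatever failover tooling the organisation already runs.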
With transient failures the story is different, as they can originate from a wide range of events such as:
Network connectivity loss
Network connectivity degradation
Load spikes
Slow queries
Replication lag
Distinguishing between these is hard or impossible from the application layer. How do you envision a circuit-breaking mechanism with such a varied array of failure modes and such coarse detection capabilities? Detection, and more generally infrastructure state introspection, is an orthogonal concern specific to each organisation, and I don't see how I could account for it in a general way. I'm very happy to discuss it, though!
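For reference, this is roughly the coarse mechanism I understand the request to mean: count consecutive errors per slave and take that slave out of rotation for a cooldown period, without trying to tell the failure modes above apart. The `breaker` type, its thresholds and the error handling are all assumptions on my part, not existing code:

```go
// Sketch: a per-slave circuit breaker that trips after a number of
// consecutive failures and reopens after a cooldown, regardless of whether
// the cause was network loss, load, slow queries or replication lag.
package slavecb

import (
	"errors"
	"sync"
	"time"
)

type breaker struct {
	mu        sync.Mutex
	failures  int       // consecutive failures seen so far
	openUntil time.Time // while in the future, the slave is out of rotation

	threshold int           // consecutive failures before opening the circuit
	cooldown  time.Duration // how long the slave stays out of rotation
}

var errOpen = errors.New("circuit open: slave temporarily out of rotation")

// do runs a query against one slave, tracking consecutive failures.
func (b *breaker) do(query func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return errOpen // caller should fall back to another slave or the master
	}
	b.mu.Unlock()

	err := query()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0
	return nil
}
```

Even this blunt version raises the policy questions above: where the threshold and cooldown come from is exactly the kind of organisation-specific knob I'd rather not bake into the library.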