[Slony1-general] Proposed Failover changes for 2.2

Tue Dec 13 07:07:04 PST 2011

On Tue, Dec 13, 2011 at 9:32 AM, Steve Singer <ssinger at ca.afilias.info> wrote:
> I also think it is safer for slonik to make the most ahead node the new
> master and then let you reshape the cluster with move set.  Today if
> additional things go wrong in the middle of a FAILOVER procedure it can
> be very difficult to recover the cluster.  I feel that if we just
> promote the most ahead node to the new master things will be safer.

I'm not thrilled with this introducing a rather non-deterministic
factor into things.

That is, you don't know what reshapings you need to do until *after* FAILOVER.

I suspect we may want a bit of tooling to dump the shape of the
cluster; I could see people being irritated if we tell them...

"After FAILOVER, you'll have to puzzle through some SQL queries
against, erm, some of the nodes to figure out where things are, before
reshaping it to the way you now want."
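
As a rough illustration of what such a shape-dumping tool might print, here
is a minimal Python sketch. It is not an existing Slony tool. The function
name cluster_shape and the input format are hypothetical; the
(provider, receiver) pairs are assumed to have been fetched beforehand for a
single set, for instance from the sub_provider and sub_receiver columns of
sl_subscribe, and the subscription graph is assumed to be a tree rooted at
the set's origin.

```python
from collections import defaultdict


def cluster_shape(origin, subscriptions):
    """Render one set's subscription tree as indented text.

    origin        -- node id of the set's origin (assumed known)
    subscriptions -- iterable of (provider, receiver) node-id pairs
                     (hypothetical input, gathered from the cluster beforehand)
    """
    # Map each provider to the nodes that subscribe through it.
    children = defaultdict(list)
    for provider, receiver in subscriptions:
        children[provider].append(receiver)

    lines = []

    def walk(node, depth):
        # One line per node, indented by its distance from the origin.
        lines.append("  " * depth + f"node {node}")
        for child in sorted(children[node]):
            walk(child, depth + 1)

    walk(origin, 0)
    return "\n".join(lines)


# Example: node 1 feeds nodes 2 and 3; node 3 cascades to node 4.
print(cluster_shape(1, [(1, 2), (1, 3), (3, 4)]))
```

Run after a FAILOVER, output along these lines would let an admin see at a
glance which node became the origin and who is subscribing through whom.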

But you're likely right that what may be preferable is for Slony to
do the best failover that it can, and leave it to admins to figure
out what comes next.

If there are 3 nodes, then there are 7 different failures that may
occur (i.e. - 1, 2, 3, 1+2, 1+3, 2+3, 1+2+3).  If node #1 is the origin,
there are 3 of those cases that permit FAILOVER to succeed (i.e. - 1,
1+2, 1+3).

And we really can't predict which of those will have occurred until
they actually have occurred.

(If there are 4 nodes, there would be a rather larger set of possible
failovers, and the set grows for larger clusters.)
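
To make the counting concrete, here is a small Python sketch that enumerates
the cases. It treats every non-empty subset of nodes as one possible failure
and assumes FAILOVER can succeed only when the origin is among the failed
nodes and at least one node survives. The function names are illustrative
only, not part of Slony.

```python
from itertools import combinations


def failure_cases(nodes):
    """Every non-empty set of nodes that could fail together."""
    return [set(combo)
            for size in range(1, len(nodes) + 1)
            for combo in combinations(nodes, size)]


def failover_possible(failed, nodes, origin):
    """Assumption: FAILOVER applies when the origin is down and some node is left."""
    return origin in failed and len(failed) < len(nodes)


for count in (3, 4):
    nodes = list(range(1, count + 1))
    cases = failure_cases(nodes)
    recoverable = [c for c in cases if failover_possible(c, nodes, origin=1)]
    print(f"{count} nodes: {len(cases)} failure cases, "
          f"{len(recoverable)} permit FAILOVER")
```

Under those assumptions an n-node cluster has 2^n - 1 failure cases, of which
2^(n-1) - 1 involve the origin while leaving a survivor, so the number of
scripts an admin would have to prepare in advance roughly doubles with each
added node.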

In order for an admin to be properly prepared, they'd need a FAILOVER
script for each of those cases, and they'd need to pick the right one.
I don't imagine it's fundamentally worse for Slony to say "I'll fix
as well as I can; reshape subscriptions once I'm done."