[Slony1-hackers] Failover never completes

Mon Oct 15 20:20:56 PDT 2012

On 10/15/2012 07:49 PM, Steve Singer wrote:
>> all commands run from C
>>
>> * switchover from A to B
>> * clone A to make C
>> * switchback from B to A
> 
> Do you make sure that all nodes have confirmed the switchback before
> proceeding to the failover below?  If not it would be better if you did.

Yes -- in fact we wait for confirmation, and then do a sync on each node
and wait for confirmation of those as well.
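
A minimal sketch of that sync-and-wait step, written as a small Python helper that prints a slonik script. The cluster name, node IDs, and conninfo strings are hypothetical placeholders, and the helper itself is illustrative only; the slonik statements it emits are the standard SYNC and WAIT FOR EVENT commands.

```python
def sync_and_wait_script(cluster, conninfo):
    """Build a slonik script that issues a SYNC on each node and then
    waits until every other node has confirmed that event.

    cluster  -- cluster name (placeholder)
    conninfo -- dict mapping node ID to an admin conninfo string (placeholders)
    """
    lines = [f"cluster name = {cluster};"]
    # Preamble: slonik needs an admin conninfo for every node it talks to.
    for node_id in sorted(conninfo):
        lines.append(f"node {node_id} admin conninfo = '{conninfo[node_id]}';")
    # One SYNC per node, each followed by a wait for all confirmations.
    for node_id in sorted(conninfo):
        lines.append(f"sync (id = {node_id});")
        lines.append(
            f"wait for event (origin = {node_id}, confirmed = all, "
            f"wait on = {node_id});"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical three-node cluster; feed the output to slonik.
    print(sync_and_wait_script(
        "testcluster",
        {1: "dbname=a host=hosta", 2: "dbname=b host=hostb", 3: "dbname=c host=hostc"},
    ))
```

Running the generated script before the failover only shows that the SYNC events were confirmed; it does not by itself rule out the leftover sl_subscribe row discussed below.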

>>   sl_path looks correct
>>   sl_subscribe has an extra row marked active=false with
>>     B as the provider (leftover from the switchback?)
> 
> Exactly which version of slony are you using?   I assume this isn't bug
> http://www.slony.info/bugzilla/show_bug.cgi?id=260 by any chance?

We are using 2.1.0. We tried upgrading to 2.1.2 but got stuck because we
cannot have a mixed 2.1.0/2.1.2 cluster. We have constraints that do not
allow for upgrade-in-place of existing nodes, which is why we want to
add a new node and fail over to it (to facilitate upgrades of components
other than slony, e.g. postgres itself).

I guess if you think this bug is our problem we can set up an entirely
2.1.2 test environment, but it will be painful, and not solve all our
problems as we have some 2.1.0 clusters that we eventually need to upgrade.

Is bug 260 issue #2 deterministic or a race condition? Our current
process works 9 out of 10 times...

FWIW we only have one set so I don't think issue #1 applies.

>>   sl_set still has set_origin pointing to A
>>   sl_node still shows all 4 nodes as active=true
>>
>> So questions:
>> 1) Is bug 80 still open?
>> 2) Any plan to fix it or even ideas how to fix it?
> 
> I substantially rewrote a lot of the failover logic for 2.2 (grab master
> from git).  One of the big things holding up a 2.2 release is that it
> needs people other than myself to test it to verify that I haven't
> missed something obvious and that the new behaviours are sane.
> 
> A FAILOVER in 2.2 no longer involves that 'faked event' from the old
> origin.  The changes in 2.2 also allow you to specify multiple failed
> nodes as arguments to the FAILOVER command.  The hope is that it
> addresses the issues Jan alludes to with multiple failed nodes.

Interesting, but even more difficult to test in our environment for
reasons I cannot really go into on a public list.

Thanks for the reply.

Joe

-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support