Christopher Browne cbbrowne at ca.afilias.info
Tue Feb 16 13:30:58 PST 2010
Andy Dale <andy.dale at gmail.com> writes:
>     However, I have found that if I put a sleep (even 1 second works in my
>     environment), the DROP_NODE command succeeds, and everything proceeds
>     happily.
>
>
>
> I can also confirm that adding a sleep period between the FAILOVER and DROP
> NODE commands seems to work (for 10 seconds in my case)

I had a chat with Jan about this on Friday; we both agreed that there
seems to be something wrong with having the failover event treated as
though it had been issued by the failed node.

After all, if that node is about to be treated as "shunned," it
doesn't make much sense to have *any* events coming out of it.  I'd tend
to think that the *new* origin ought to be a good source for that event.

Jan thought there might have been some reason why the event *was*
submitted on behalf of the failed node.  The reason may no longer hold;
it'll take a bit of research to determine that.

I suppose that a thing to think about is what could break if the event
were submitted as being from the new origin.

A road to think down...

 - Suppose the event is treated as coming from the new origin

 - Suppose there is a subscriber that is somewhat behind

 - Can that subscriber:
    a) get confused
    b) outright lose data (say, because sl_log_* gets trimmed too early)
   as a result of a FAILOVER that takes place under these circumstances?

No-development-required answer: Don't drop the failed node until all the
other nodes are aware that they shouldn't be using it anymore.
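
As a rough illustration of that workaround, here is a minimal sketch that
polls instead of using a fixed sleep.  It is not part of Slony-I or
slonik.  The fetch_view callable is hypothetical: I am assuming it returns,
for each surviving node, the set of provider node IDs that node currently
lists (for instance, gathered from sub_provider in each node's
sl_subscribe table).  All names below are made up for the example.

```python
import time
from typing import Callable, Dict, Set

# Hypothetical shape: surviving node id -> provider node ids that node lists.
ProviderView = Dict[int, Set[int]]


def nodes_still_using(view: ProviderView, failed_node: int) -> Set[int]:
    """Return the nodes whose configuration still names failed_node as a provider."""
    return {node for node, providers in view.items() if failed_node in providers}


def wait_until_unused(
    fetch_view: Callable[[], ProviderView],
    failed_node: int,
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no node uses failed_node as a provider, or give up at timeout.

    Returns True when it looks safe to proceed, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Re-read every node's view on each pass; a lagging node may catch up.
        if not nodes_still_using(fetch_view(), failed_node):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The idea would be to run this between FAILOVER and DROP NODE and only
issue the DROP NODE when it returns True, rather than guessing whether 1
second or 10 seconds is long enough.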

Possible change to DROP NODE command:

  Perhaps DROP NODE should check against *all nodes* that this node
  isn't considered a provider, and fail if it is.

One of the problems with Slony-I, in retrospect, is that it tries a bit
too hard to be asynchronous, and that makes it rather hard to debug
issues surrounding configuration changes to the cluster.  Perhaps
configuration should be pretty much synchronous, checking state against
all the nodes, and griping if *any* of them disagree.

For instance, in this case, if DROP NODE were changed to verify on *all*
nodes that the node is no longer in use, that would protect against the
problem observed in this discussion thread.
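
To make the "check everywhere, and gripe if anyone disagrees" idea a bit
more concrete, here is a sketch of the guard in Python.  This is only an
illustration of the logic, not how DROP NODE is implemented.  The views
argument is hypothetical input: for each node ID, the provider IDs that
node reports.  The exception and function names are invented for the
example.

```python
from typing import Dict, List, Set


class NodeStillInUse(Exception):
    """Raised when at least one node still lists the target as a provider."""


def check_drop_node(target: int, views: Dict[int, Set[int]]) -> None:
    """Refuse to drop `target` unless every other node has stopped using it.

    `views` maps each node id to the provider ids that node reports
    (an assumed input). The target's own view is ignored.
    """
    objectors: List[int] = sorted(
        node
        for node, providers in views.items()
        if node != target and target in providers
    )
    if objectors:
        # Fail loudly and name the nodes that disagree, rather than
        # letting the drop proceed asynchronously.
        raise NodeStillInUse(
            f"cannot drop node {target}: still a provider on node(s) {objectors}"
        )
```

The design choice here is to fail immediately and name the nodes that
disagree, so the person running the command knows which nodes have not
yet caught up.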
-- 
let name="cbbrowne" and tld="ca.afilias.info" in name ^ "@" ^ tld;;
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"

