Christopher Browne cbbrowne at ca.afilias.info
Fri Dec 10 13:54:48 PST 2010
Steve Singer <ssinger at ca.afilias.info> writes:
> The Problem
> ----------------
> An informal survey of the slony mailing list shows that almost no users 
> understand how WAIT FOR should be used.

I daresay that as a not-unsophisticated user of Slony, I don't always
get it right.  I think there's only one person who has successful
"intuition" on WAIT FOR behaviour, and I don't think it takes much
guessing as to who that is!

> Proposal
> ------------
> The goal is to make slonik handle the proper waiting between events. If 
> based on the previous set of slonik commands the next command needs to 
> wait until a previous command is confirmed by another node then slonik 
> should be smart enough to figure this out and to wait.

...

> What would solve this?
>
> 1) If we had a global ordering on events, maybe assigned by a cluster 
> coordinator, node (c) would be able to process the events in the right order.
> 2) When an event is created on node (b), if we store the fact that it has 
> already seen/confirmed event 1234 from node (a), we could transmit 
> this pre-condition as part of the event, so node (c) can know that it 
> can't process the event from (b) until it has seen 1234 from (a).  This 
> way node (c) will process things in the right order, but we can submit 
> events to (b) - which is up to date - without having to wait for the busy 
> node (c) to get caught up.
> 3) We could disallow or discourage the use of multiple event nodes and 
> require all slonik command events to originate on a single cluster node 
> (other than store path and maybe subscribe set) and provide facilities 
> for dealing with cases where that event node fails or is split.
> 4) We really do require the cluster be caught up before using a 
> different event node.  This is where we automatically do the WAIT FOR ALL.
>
> The approach proposed here is to go with (4), where before switching 
> event nodes slonik will WAIT FOR all nodes to confirm the last event.

Related to #2...  We might introduce a new event that tries to
coordinate between nodes.

In effect, a "WAIT FOR EVENT" event...

  So, we submit, against node #1, WAIT_FOR_EVENT (2,355).

  The intent of this event is that processing of the stream of events
  for node #1 holds back until it has received event #355 from node #2.

That doesn't mandate waiting for *EVERY* node, just one node.  Multiple
WAIT FOR EVENT requests could get you a "wait on all."  Note that this
is on the slon side, not so much the slonik side...
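
To make that concrete: the precondition check the consuming slon would
have to make before moving past such an event in node #1's stream can
already be expressed against the existing sl_event catalog.  A minimal
sketch, with the WAIT_FOR_EVENT event type itself being hypothetical and
"_mycluster" standing in for the real cluster schema name:

  -- Have we (the local node) already received event #355 from node #2?
  -- If true, the WAIT_FOR_EVENT (2,355) precondition is satisfied and we
  -- may keep consuming node #1's events; if not, sleep and re-check.
  SELECT EXISTS (
      SELECT 1
        FROM "_mycluster".sl_event
       WHERE ev_origin = 2
         AND ev_seqno >= 355
  ) AS precondition_met;

A "wait on all" then just becomes one such predicate per origin we care
about.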

> 1) STORE PATH - the event node is dictated by how you are setting up the 
> path. Furthermore, if the backwards path isn't yet set up the node won't 
> receive the confirm message

  There's an argument to be made that STORE PATH perhaps should be going
  directly to nodes, and doesn't need to be involved in event
  propagation.  It's pretty cool to propagate STORE PATH requests
  everywhere, but it's not hugely necessary.

  ...[erm, rethinking]...

  The conninfo field only ever matters on the node where it is used.
  But computation of listen paths requires that all nodes have the
  [from,to] data.  So there's a partial truth there.  conninfo isn't
  necessary, but [from,to] is...
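
  In catalog terms, that split is already visible in sl_path: the
  (pa_server, pa_client) pairs are what listen-path generation consumes,
  while pa_conninfo only ever gets read on the node that opens the
  connection.  A rough sketch (cluster schema name assumed):

    -- Topology that must be visible on every node, so listen paths can
    -- be computed: just the server/client pairs.
    SELECT pa_server, pa_client
      FROM "_mycluster".sl_path;

    -- The connection string, by contrast, is only dereferenced locally,
    -- on the pa_client node that actually opens the connection.
    SELECT pa_conninfo
      FROM "_mycluster".sl_path
     WHERE pa_client = 3;   -- hypothetical local node id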

> 2) SUBSCRIBE set (in 2.0.5+) always gets submitted at the origin.  So if 
> you are subscribing multiple sets slonik will switch event nodes. This 
> means that subscribing to multiple sets (with different set origins) in 
> parallel will be harder (you will need to disable automatic wait-for or 
> use different slonik invocations). You can still do parallel subscribes 
> to the same set because the subscribe set always goes to the origin in 
> 2.0.5+ not the provider or the receiver.

  I have always been a little uncomfortable about this change, and this
  underlines that discomfort.  But that doesn't mean I'm right...

> 3) STORE/DROP listen goes to specific nodes based on the arguments but 
> you shouldn't need STORE/DROP listen commands anyway in 1.2 or 2.0

  Right.

> 4) CREATE/DROP SET must go to the set origin. If you're creating sets the 
> cluster probably needs to be caught up.

  And if these events are lost due to a FAILOVER partition or such - that
  is, if they were only in the partition of the cluster that was lost - it
  doesn't matter...

> 5) ADD TABLE/ADD SEQUENCE - must go to the origin.  Again, if you're 
> manipulating sets you must stick to a single set origin or have your 
> cluster be caught up
> 6) MOVE TABLE goes to the origin - but the docs already warn you about 
> trying this if your cluster isn't caught up (with respect to this set)
> 8) MOVE SET - Doing this with a behind cluster is already a bad idea
> 9) FAILOVER - See multi-node failover discussion

There's a mix of needful semantics here.

For instance, SET ADD TABLE/SEQUENCE only strictly need to propagate
alongside the successful propagation of subscriptions to those sets.

That's different from the propagation needs for other events.  It seems
to me that we might want to classify the "propagation needs"; if there
are good names for the different classifications, then we're likely
really onto something.  

Good names aren't arriving to me on Friday afternoon :-).
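
Purely as a strawman, and with every name below made up, such a
classification might amount to no more than a lookup from event type to
propagation class:

  -- Hypothetical sketch only; nothing like this exists today.
  CREATE TABLE ev_propagation_class (
      ev_type    text PRIMARY KEY,   -- e.g. 'SET_ADD_TABLE'
      prop_class text NOT NULL       -- how far/urgently it must propagate
  );

  INSERT INTO ev_propagation_class VALUES
      ('SET_ADD_TABLE', 'with-subscription'),   -- only matters where the set is subscribed
      ('STORE_PATH',    'cluster-wide'),        -- topology needed on every node
      ('MOVE_SET',      'caught-up-required');  -- wants the cluster caught up first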

> STORE PATH
> -----------
> A WAIT FOR ALL nodes won't work unless all of the paths are stored.
> When I say 'all' I mean there must exist a route from every node to 
> every other node.  The routes don't need to be direct. There are certain 
> common usage patterns that shouldn't be excluded. It would be good if 
> slonik could detect missing paths before 'changing things', because 
> otherwise users might be left with a half-completed script.

I'd classify this two ways:

a) When bootstrapping a cluster, WAIT FOR ALL can't work if there aren't
enough paths yet.  

I'm not sure it makes sense to go to the extent of computing spanning
trees or such to validate this.
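
For what it's worth, the check itself need not be elaborate; a
reachability walk over sl_path would cover it.  A minimal sketch,
assuming a recursive query is available and arbitrarily asking what
node 1 can reach (cluster schema name assumed):

  -- Nodes with no route, direct or indirect, from node 1 via stored paths.
  WITH RECURSIVE reachable(node) AS (
      SELECT 1
      UNION
      SELECT p.pa_client
        FROM "_mycluster".sl_path p
        JOIN reachable r ON r.node = p.pa_server
  )
  SELECT n.no_id
    FROM "_mycluster".sl_node n
   WHERE n.no_id NOT IN (SELECT node FROM reachable);
  -- Any rows here are nodes a WAIT FOR ALL could end up waiting on forever.

Whether it makes sense to run that sort of validation at every step is
another matter, as the bootstrap sequence below shows.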

If we try to validate at every point, then you can't have a sequence
of...

  Set up all nodes...
   INIT CLUSTER
   STORE NODE
   STORE NODE
   STORE NODE

  Then, set up paths...
   STORE PATH
   STORE PATH
   STORE PATH
   STORE PATH

It seems like a logical idea to construct a cluster by setting up all
nodes, then to set up communications between them.  It doesn't thrill me
if we make that impossible.

> The easy answer is: Don't write scripts that can leave your cluster in 
> an indeterminate state.  What we should do if someone tries is an open 
> question.  We could a) check that all code paths (cross product) leave 
> the cluster consistent/complete, b) assume the try blocks always finish 
> successfully, or c) not do the parse tree analysis described above for the 
> entire script at parse time, but instead do it for each block before 
> entering that block.
> I am leaning towards c.

If we're going down a "prevent indeterminate states" road, then it seems
to me there needs to be a presentation of a would-be algebra of cluster
states so we can talk about this analytically.

I think having that algebra is a prerequisite to deciding between any of
those alternatives.

> - How do we want to handle TRY blocks. See discussion above

WAIT FOR and TRY are right well incompatible with each other, unless we
determine, within the algebra, that there is some subset of state-changing
commands that don't need to be guarded by WAIT FOR and are therefore
permissible in a TRY block.
-- 
select 'cbbrowne' || '@' || 'afilias.info';
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"

