[Slony1-hackers] automatic WAIT FOR proposal

Thu Feb 3 07:19:39 PST 2011

On 11-02-03 09:44 AM, Jan Wieck wrote:
> On 2/2/2011 11:42 AM, Steve Singer wrote:
>> On 10-12-22 04:30 PM, Steve Singer wrote:
>>
>>
>> Since I haven't had much response on this, maybe a plain-language example
>> would be useful.
>>
>> Consider a cluster with paths where node 1 is a provider+origin to all
>> other nodes
>>
>> 4--1----2
>> | \ /
>> |--- 3
>>
>> EXECUTE SCRIPT( FILE=file1.sql, EVENT NODE=1);
>> wait for event(origin=1, confirmed=2, wait on=1);
>> EXECUTE SCRIPT(file=file2.sql, EVENT NODE=2);
>>
>> Take node 3. Does node 3 perform the SQL in file1.sql first or
>> file2.sql first? Today this is non-deterministic; either could win.
>>
>> The two solutions I see are
>> a) Require all nodes to be caught up before going to the next event
>> node. As discussed, this seems somewhat limiting.
>> b) Make slon wait for the event with origin=1 to be applied on node 3
>> before applying the event from node 2 (because the event from node 1 had
>> already been processed on node 2 by the time the node 2 event was
>> generated).
>>
>> b) is what I am proposing to implement here.
>>
>> I can create this type of race condition with other event types as well;
>> it isn't specific to EXECUTE SCRIPT.
>
> What you are basically asking for is a guaranteed total order in which
> events from multiple nodes are processed. Very much like the total order
> guarantees provided by group communication systems.

I'm not going as far as a total order over all events, just an ordering
that covers events that had already been processed by the event origin
when it generated its own event.

For example, suppose these remote events have been processed:

node 1:               node 2:
2,1233                1,1233

(node 1 has seen 2,1233 and node 2 has seen 1,1233)

and then each node does a sync, generating the events
1,1234                2,1234

In the scheme I propose, node 3 can process the events in either this order

1,1233
2,1233
1,1234
2,1234

OR

1,1233
2,1233
2,1234
1,1234

i.e. I am not requiring any ordering constraint between the two events
1,1234 and 2,1234, other than that both must come after 1,1233 and 2,1233.
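
To make the constraint concrete, here is a minimal sketch in Python (not
slon code). It assumes each event is an (origin, seqno) pair and that we
know, for each event, which remote events its origin had already
processed when the event was generated. The SEEN table and the function
name are hypothetical and exist only for this illustration.

```python
from itertools import permutations

# Hypothetical bookkeeping: for each event, the remote events its origin
# had already processed when the event was generated.
SEEN = {
    (1, 1234): {(2, 1233)},  # node 1 had seen 2,1233 before its sync
    (2, 1234): {(1, 1233)},  # node 2 had seen 1,1233 before its sync
}

def is_valid_order(order, seen=SEEN):
    """Return True if a receiver could apply the events in this order."""
    applied = set()
    for origin, seqno in order:
        # Events from a single origin must be applied in sequence order.
        if any(o == origin and s > seqno for o, s in applied):
            return False
        # Everything the origin had already processed must be applied first.
        if not seen.get((origin, seqno), set()) <= applied:
            return False
        applied.add((origin, seqno))
    return True

events = [(1, 1233), (2, 1233), (1, 1234), (2, 1234)]
valid = [p for p in permutations(events) if is_valid_order(p)]
```

Under these assumptions both orders listed above pass the check, while an
order such as 1,1233 / 1,1234 / 2,1233 / 2,1234 is rejected because
1,1234 would be applied before 2,1233.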

What I describe requires no additional communication between nodes beyond
what we are already doing.

The issue I describe isn't specific to two execute scripts.

For example, I have a 3-node cluster with two sets (set 1's origin is node
1, set 2's origin is node 2).

subscribe set(set id=1,provider=1,receiver=2)
subscribe set(set id=2,provider=2,receiver=1)
wait for event(origin=1,confirmed=2,wait on=1)
wait for event(origin=2,confirmed=1,wait on=2)
subscribe set(set id=1,provider=1,receiver=3)
subscribe set(set id=2,provider=2,receiver=3)
#
# subscribing node 3 takes a LONG time
# because it is in a remote data centre
#
# while it is subscribing I discover
# I need to make an emergency schema change
# via EXECUTE SCRIPT such that I can't wait
# for node 3 to finish subscribing before
# making the change on node 1 and 2.

If I use node 1 or node 2 as the event node, the script might get applied
on node 3 before the subscription to the other node's set finishes.

---------

Here is an example that doesn't involve EXECUTE SCRIPT (assume the same
cluster config as in my last example).

create set(id=1, origin=1);
set add table(set id=1, origin=1, fully qualified table='public.foo');
#commands execute, dba notices a mistake
drop set(set id=1,event node=1);
wait for event(origin=1,confirmed=2,wait on=1);
create set(id=2, origin=2);
set add table(set id=2,origin=3);
set add table(set id=2, origin=2, fully qualified table='public.foo');

Node 3 might process the add table from node 2 BEFORE it processes the
drop set from node 1.  The above example probably happens in the real
world quite a bit: a DBA creates a set, then notices they are hosting it
on the wrong node and wants to fix things.
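
As a rough sketch of the receiver-side behaviour I have in mind (Python,
not slon code): assume each event arrives tagged with the highest event
number from every other origin that its origin had already processed.
That tagging, the event numbers 10 and 20, and the function name are all
made up for this illustration. The receiver simply defers an event until
the events it depends on have been applied locally.

```python
from collections import deque

def apply_in_safe_order(arrivals):
    """Apply events in arrival order, deferring any whose dependencies
    (events its origin had already processed) are not yet applied here.

    arrivals: list of (origin, seqno, name, seen), where seen maps
    origin -> highest seqno the event's origin had processed.
    """
    applied_upto = {}          # origin -> highest seqno applied locally
    pending = deque(arrivals)
    applied = []
    stalled = 0                # consecutive deferrals without progress
    while pending and stalled < len(pending):
        origin, seqno, name, seen = pending.popleft()
        ready = all(applied_upto.get(o, 0) >= s for o, s in seen.items())
        if ready:
            applied_upto[origin] = seqno
            applied.append(name)
            stalled = 0
        else:
            pending.append((origin, seqno, name, seen))
            stalled += 1
    return applied

# Node 3 receives node 2's add table before node 1's drop set.
arrivals = [
    (2, 20, "SET ADD TABLE public.foo", {1: 10}),  # node 2 had seen 1,10
    (1, 10, "DROP SET 1", {}),
]
result = apply_in_safe_order(arrivals)
```

Here result lists the drop set before the add table even though node 3
received them the other way round.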

>
> While the example above seems to be possible, I don't know why someone
> would actually attempt such. If node 1 is the origin of everything, it
> doesn't even make sense to use node 2 as the event node unless node 2
> also is the ONLY node to execute it.
>
> The design of EXECUTE SCRIPT expects the event node to be the origin of
> the objects modified, so that the SQL statements inside the script are
> executed at the same data SYNC point on all nodes. Since it is
> impractical to perform sanity checks against the script to ensure that
> the user is actually doing that, all we can and should do is to make
> this requirement clearer in the documentation.
>
>
> Jan
>