[Slony1-general] Graceful switchover gone wrong.

Wed Apr 5 16:15:45 PDT 2006

Oh dear - why is nothing ever easy? :(

I've just tried to gracefully swap master/slave roles with little success...

cayenne:~# /usr/lib/postgresql/8.1/bin/slonik <slonswap.txt
... which provided no output for 5 minutes at which point I CTRL-C'd...

The script is:

cluster name = replication;
 node 1 admin conninfo='host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX';
 node 2 admin conninfo='host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX';

lock set (id = 1, origin = 1);
wait for event (origin = 1, confirmed = 2);
move set (id = 1, old origin = 1, new origin = 2);
wait for event (origin = 1, confirmed = 2);

i.e. copy+paste directly from slony_115/failover.html

The log for node 1 shows:
2006-04-05 23:49:32 BST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2006-04-05 23:49:32 BST DEBUG1 remoteWorkerThread_2: helper thread for provider 2 created
2006-04-05 23:49:32 BST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2006-04-05 23:49:35 BST DEBUG1 remoteWorkerThread_2: connected to data provider 2 on 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX'

During this 'dead time' I tried to execute a simple UPDATE on node 1, and was told "ERROR:  Slony-I: Table pbx_ext_state is replicated and cannot be modified on a subscriber node" - great - that's exactly what I'd expect.

Unfortunately, I was told the same thing when I executed the same query on node 2! At this point I panicked and executed 'uninstall node (id=1)' to clean out the Slony config on node 1, so I could at least bring our website back online again.

The next line in node 1's log after the above is:

2006-04-05 23:56:55 BST FATAL  syncThread: "start transaction;set transaction isolation level serializable;select last_value from "_replication".sl_action_seq;" - ERROR:  schema "_replication" does not exist

which unsurprisingly is when I uninstalled node 1 :) I also saw a process on node 1 during the 'dead time' marked as 'idle in transaction'.

I've not touched node 2 - what can the various sl_ tables tell me about why this process froze, and what should I be looking for if it happens again when I retry tomorrow? :(
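
As a starting point, here is a rough sketch of queries that could be run in psql on node 2 (and on node 1 during any future 'dead time'). It assumes the cluster schema is "_replication", as in the log above, and that the Slony-I 1.1.x and PostgreSQL 8.1 catalog column names are as written here; those names are assumptions and worth checking with \d first.

```sql
-- Sketch only: assumes schema "_replication" and Slony-I 1.1.x column names.

-- Which node does this database think is the origin of each set?
select set_id, set_origin, set_locked
  from "_replication".sl_set;

-- Most recent events known on this node, newest first.
select ev_origin, ev_seqno, ev_timestamp, ev_type
  from "_replication".sl_event
 order by ev_timestamp desc
 limit 20;

-- Highest event from each origin that each receiver has confirmed.
select con_origin, con_received, max(con_seqno) as last_confirmed
  from "_replication".sl_confirm
 group by con_origin, con_received
 order by con_origin, con_received;

-- Backends idle inside an open transaction (PostgreSQL 8.1 column names;
-- current_query is only populated when stats_command_string is on).
select procpid, usename, query_start, current_query
  from pg_stat_activity
 where current_query like '<IDLE> in transaction%';

-- Lock requests that are still waiting.
select pid, locktype, relation, mode
  from pg_locks
 where not granted;
```

If sl_set on node 2 still reports set_origin = 1, that would be consistent with node 2 refusing the UPDATE, i.e. the move never being processed there. Ungranted rows in pg_locks alongside an 'idle in transaction' backend might point to slonik waiting on a lock held by that session.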

<sigh> Maybe I should just run away and join the circus...

Cheers,
Gavin