stephen.hindmarch at bt.com stephen.hindmarch
Thu Sep 29 10:25:31 PDT 2005
Thanks for your response, Chris.

I did get further by stepping through the slonik code to see where the
problem was.

It appears that slonik is waiting forever for the slon daemon to
restart. You'll see in the log that slon gets the restart signal but
never seems to do anything about it.

The workaround was to stop slon before running the script. The script
then runs through to completion, and I can restart slon and drop node 2
with no problems.
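
In case it helps anybody else, the sequence that worked for me is roughly
the following (only a sketch; slon_stop.sh and slon_start.sh stand in for
whatever you use to start and stop the slon daemon on your own boxes):

    # Stop the local slon daemon first, so slonik is not left waiting
    # for a restart notification that slon never acts on.
    ./slon_stop.sh

    # Run the failover script quoted below - with slon stopped it now
    # runs through to completion.
    ./ha-failover.sh

    # Restart slon against the new origin, then drop node 2 as usual.
    ./slon_start.sh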

Thanks for the advice on how to do the failover. I think I'll stick with
keeping replication in place because
a) it should make recovery of the failed node simpler, as I only have to
worry about bringing one node back in rather than rebuilding the whole
cluster, and
b) the way the project is going, it won't be long before somebody asks me
to do a three-node solution.

If I have any time, I'll have a look at what is happening in slon. If
there are any tests you think I could run to give more clues, I'd be glad
to try them out.

Steve Hindmarch
BT Exact

-----Original Message-----
From: Christopher Browne [mailto:cbbrowne at ca.afilias.info] 
Sent: 28 September 2005 20:20
To: Hindmarch,SJ,Stephen,XBD R
Cc: slony1-general at gborg.postgresql.org
Subject: Re: [Slony1-general] Failover Stalls


stephen.hindmarch at bt.com wrote:

>I have set up Slony-1 v 1.1.0 on two servers, each with identical 
>databases, and have organised replication between the two of them.
>
>I can get a switchover to work, but when I do a failover the failover 
>script stops in the middle of the failover command.
>
>Here is an extract from the script. The variables are set by the rest of
>the script: nodeId is the name of the subscriber, remoteId is the name
>of the origin, and the idea of the script is to run it on the surviving
>server after something nasty has happened to the other server.
>
>log "Attempting failover to local node ($nodeId)"
>log `date`
>slonik <<EOF
>    cluster name = $CLUSTER_NAME;
>    node 1 admin conninfo = '$one_conninfo';
>    node 2 admin conninfo = '$two_conninfo';
>    echo 'Failing over to node $nodeId';
>    failover ( id=$remoteId, backup node = $nodeId);
>    echo 'Failover complete';
>EOF
>
>In my test scenario, node 2 is the origin. I kill the postmaster on 
>node 2 to simulate the server dying a horrible death. The slon daemon 
>on node 2 dies and the slon daemon on node 1 starts to complain of 
>being unable to access the node 2 database (I've x'd out the true IP 
>address) :-
>
>2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm
>1,9841 received by 2
>2005-09-27 14:33:40 BST ERROR  remoteListenThread_2: "select ev_origin,
>ev_seqno, ev_timestamp,        ev_minxid, ev_maxxid, ev_xip,
>ev_type,        ev_data1, ev_data2,        ev_data3, ev_data4,
>ev_data5, ev_data6,        ev_data7, ev_data8 from "_dot_ha".sl_event e
>where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin,
>e.ev_seqno" - server closed the connection unexpectedly
>	This probably means the server terminated abnormally
>	before or while processing the request.
>2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
>2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
>2005-09-27 14:33:50 BST ERROR  slon_connectdb: PQconnectdb("dbname=DOT
>host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to
>server: Connection refused
>	Is the server running on host "xxx.xxx.xxx.xxx" and accepting
>	TCP/IP connections on port 5432?
>2005-09-27 14:33:50 BST WARN   remoteListenThread_2: DB connection
>failed - sleep 10 seconds
>
>At this point I now have no origin, but a working subscriber, up to 
>date, at least up to the time of the last synch.
>
>I want to make this subscriber the new origin. A switchover won't work 
>so I execute the above script on node 1 and I get the following 
>output:-
>
>ha-failover.sh: Attempting failover to local node (1)
>ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
><stdin>:4: Failing over to node 1
><stdin>:5: NOTICE:  failedNode: set 1 has no other direct receivers - 
>move now
>
>And then the script just hangs there (I've left it running for over an 
>hour). It seems to be stuck on the failover line as it never reaches 
>the second echo statement.
>  
>
That is somewhat curious.  I'll see if I can see why that would be.

"Wild speculation" (which is no more valuable than "speculative gossip")
would be that perhaps it's waiting to tell all the remaining subscribers
something, and since there aren't any, there's something confused about
that.

Your scenario here is one where it would be about as useful to simply do
an UNINSTALL NODE on node 1, because once the FAILOVER is done, there
will be nothing other than node 1 in the cluster.  With no subscribers,
the presence of replication is pretty well a "historical curiosity."

Under such a circumstance, with two nodes, and the master dead, I'd be
inclined to simply drop replication, as, with only one node, you don't
honestly have replication going on anymore...
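
If you do decide to just drop replication on the survivor, that amounts to
something like this (a sketch only, reusing the variables from your own
failover script):

slonik <<EOF
    cluster name = $CLUSTER_NAME;
    node 1 admin conninfo = '$one_conninfo';
    echo 'Removing Slony-I from the surviving node';
    uninstall node ( id = 1 );
EOF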

>When I look at the node 1 database I see that I can now update the 
>replicated tables, so node 1 now thinks it is the master. I can check 
>this by inspecting sl_set and see the origin for my replication set is 
>now node 1. The sl_subscribe table is empty. The sl_node table shows 
>both nodes and both are active, which strikes me as suspicious.
>
>DOT=# select * from _dot_ha.sl_node;
> no_id | no_active |   no_comment    | no_spool
>-------+-----------+-----------------+----------
>     1 | t         | Node One        | f
>     2 | t         | Node Two        | f
>(2 rows)
>
>  
>
This actually *isn't* suspicious.  This is normal.

FAILOVER doesn't actually drop out the failed node.

Dropping a node has, alas, side effects, notably purging out information
about the events coming from that node. Suppose we had a node 3 that was
more up to date than node 1: we would then find ourselves in the
regrettable position of knowing node 3 had better data, but having no way
to apply it to node 1 to bring it up to speed. That would add insult to
injury; node 3 was in better shape, but we would have to drop it, too,
because there would be no way to get at its data :-(.

Anyhoo, node 2 won't go away until you explicitly drop it, which should
wait until the reformed cluster is working OK...

And as for sl_subscribe, well, there is no longer any subscriber to set
1.  Node #1 is the only node still working; nothing is subscribing to
it.  The emptiness of sl_subscribe is just fine.
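
If you want to reassure yourself, you can look at the set origin and the
subscriptions directly on node 1, along these lines (assuming the DOT
database and _dot_ha schema from your logs):

psql -d DOT -c "select set_id, set_origin from _dot_ha.sl_set;"
psql -d DOT -c "select * from _dot_ha.sl_subscribe;"

The first should show set 1 with origin 1, and the second should come
back empty.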

>There is only one line in the slon daemon log that is of significance 
>at the moment of the failover:-
>
>2005-09-27 14:34:10 BST WARN   remoteListenThread_2: DB connection
>failed - sleep 10 seconds
>2005-09-27 14:34:17 BST INFO   localListenThread: got restart
>notification - signal scheduler
>2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 
>9845
>
>Can anybody give me any clues as to what is going on?
>  
>
It seems to me as though everything is actually OK.

You'll want to drop node 2...
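
Something along these lines should do it, once you're happy that the new
origin is behaving (again just a sketch, reusing the variables from your
failover script):

slonik <<EOF
    cluster name = $CLUSTER_NAME;
    node 1 admin conninfo = '$one_conninfo';
    echo 'Dropping failed node 2';
    drop node ( id = 2, event node = 1 );
EOF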

