Thu Sep 29 10:25:31 PDT 2005
- Previous message: [Slony1-general] Failover failures
- Next message: [Slony1-general] SQL query for "acks outstanding"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Thanks for your response Chris. I did get further by stepping through
the slonik code to see where the problem was. It appears that slonik is
waiting forever for the slon daemon to restart. You'll see in the log
that slon gets the restart signal but never seems to do anything about
it. The workaround was to stop slon before running the script. The
script then progresses properly, and I can restart slon and drop node 2
with no problems.

Thanks for the advice on how to do the failover. I think I'll stick
with keeping replication in place because a) it should make recovery of
the failed node simpler, as I only have to worry about bringing one
node back in rather than rebuilding the whole cluster, and b) the way
the project is going it won't be long before somebody asks me to do a
three-node solution.

If I have any time I'll have a look at what is happening in slon. If
there are any tests you think I could run to give more clues, I'd be
glad to try them out.

Steve Hindmarch
BT Exact

-----Original Message-----
From: Christopher Browne [mailto:cbbrowne at ca.afilias.info]
Sent: 28 September 2005 20:20
To: Hindmarch,SJ,Stephen,XBD R
Cc: slony1-general at gborg.postgresql.org
Subject: Re: [Slony1-general] Failover Stalls

stephen.hindmarch at bt.com wrote:

>I have set up Slony-1 v 1.1.0 on two servers, each with identical
>databases, and have organised replication between the two of them.
>
>I can get a switchover to work, but when I do a failover the failover
>script stops in the middle of the failover command.
>
>Here is an extract from the script. The variables are set by the rest
>of the script: nodeId is the name of the subscriber, remoteId is the
>name of the origin, and the idea of the script is to run it on the
>surviving server after something nasty has happened to the other
>server.
>
>log "Attempting failover to local node ($nodeId)"
>log `date`
>slonik <<EOF
>  cluster name = $CLUSTER_NAME;
>  node 1 admin conninfo = '$one_conninfo';
>  node 2 admin conninfo = '$two_conninfo';
>  echo 'Failing over to node $nodeId';
>  failover ( id=$remoteId, backup node = $nodeId);
>  echo 'Failover complete';
>EOF
>
>In my test scenario, node 2 is the origin. I kill the postmaster on
>node 2 to simulate the server dying a horrible death. The slon daemon
>on node 2 dies and the slon daemon on node 1 starts to complain of
>being unable to access the node 2 database (I've x'd out the true IP
>address):-
>
>2005-09-27 14:33:39 BST DEBUG2 remoteWorkerThread_2: forward confirm
>1,9841 received by 2
>2005-09-27 14:33:40 BST ERROR remoteListenThread_2: "select ev_origin,
>ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip,
>ev_type, ev_data1, ev_data2, ev_data3, ev_data4,
>ev_data5, ev_data6, ev_data7, ev_data8 from "_dot_ha".sl_event e
>where (e.ev_origin = '2' and e.ev_seqno > '26') order by e.ev_origin,
>e.ev_seqno" - server closed the connection unexpectedly
>        This probably means the server terminated abnormally
>        before or while processing the request.
>2005-09-27 14:33:49 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC 9842
>2005-09-27 14:33:49 BST DEBUG2 localListenThread: Received event 1,9842 SYNC
>2005-09-27 14:33:50 BST ERROR slon_connectdb: PQconnectdb("dbname=DOT
>host=xxx.xxx.xxx.xxx user=postgres") failed - could not connect to
>server: Connection refused
>        Is the server running on host "xxx.xxx.xxx.xxx" and accepting
>        TCP/IP connections on port 5432?
>2005-09-27 14:33:50 BST WARN remoteListenThread_2: DB connection
>failed - sleep 10 seconds
>
>At this point I now have no origin, but a working subscriber, up to
>date, at least up to the time of the last synch.
>
>I want to make this subscriber the new origin.
>A switchover won't work, so I execute the above script on node 1 and I
>get the following output:-
>
>ha-failover.sh: Attempting failover to local node (1)
>ha-failover.sh: Tue Sep 27 14:34:17 BST 2005
><stdin>:4: Failing over to node 1
><stdin>:5: NOTICE: failedNode: set 1 has no other direct receivers -
>move now
>
>And then the script just hangs there (I've left it running for over an
>hour). It seems to be stuck on the failover line, as it never reaches
>the second echo statement.
>
That is somewhat curious. I'll see if I can see why that would be.
"Wild speculation" (which is no more valuable than "speculative
gossip") would be that perhaps it's waiting to tell all the remaining
subscribers something, and since there aren't any, there's something
confused about that.

Your scenario here is one where it would be about as useful to simply
do an UNINSTALL NODE on node 1, because once the FAILOVER is done,
there will be nothing other than node 1 in the cluster. With no
subscribers, the presence of replication is pretty well a "historical
curiosity."

Under such a circumstance, with two nodes and the master dead, I'd be
inclined to simply drop replication, as, with only one node, you don't
honestly have replication going on anymore...

>When I look at the node 1 database I see that I can now update the
>replicated tables, so node 1 now thinks it is the master. I can check
>this by inspecting sl_set and see the origin for my replication set is
>now node 1. The sl_subscribe table is empty. The sl_node table shows
>both nodes and both are active, which strikes me as suspicious.
>
>DOT=# select * from _dot_ha.sl_node;
> no_id | no_active |   no_comment    | no_spool
>-------+-----------+-----------------+----------
>     1 | t         | Node One        | f
>     2 | t         | Node Two        | f
>(2 rows)
>
This actually *isn't* suspicious. This is normal. FAILOVER doesn't
actually drop out the failed node.
Dropping a node has, alas, side-effects, notably purging out
information about the events coming from that node. Suppose we had a
node 3 that was more up to date than node 1. We would then find
ourselves in the regrettable position where we knew node 3 had some
better data, but had no way to properly apply it to node 1 to get it
up to speed. That would essentially add insult to injury; node 3 was
in better shape, but we would have to drop it, too, because there's no
way to get at its data :-(.

Anyhoo, node 2 won't go away until you explicitly drop it. Which
should wait until the reformed cluster is working OK...

And as for sl_subscribe, well, there is no longer any subscriber to
set 1. Node #1 is the only node still working; nothing is subscribing
to it. The emptiness of sl_subscribe is just fine.

>There is only one line in the slon daemon log that is of significance
>at the moment of the failover:-
>
>2005-09-27 14:34:10 BST WARN remoteListenThread_2: DB connection
>failed - sleep 10 seconds
>2005-09-27 14:34:17 BST INFO localListenThread: got restart
>notification - signal scheduler
>2005-09-27 14:34:20 BST DEBUG2 syncThread: new sl_action_seq 25 - SYNC
>9845
>
>Can anybody give me any clues as to what is going on?
>
It seems to me as though everything is actually OK. You'll want to
drop node 2...
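[Putting the thread's advice together, the whole recovery sequence on
the surviving node might look roughly like the sketch below. This is
an illustration, not taken verbatim from the thread: the cluster name
"dot_ha" is inferred from the "_dot_ha" schema in the logs, while the
conninfo strings, host names, and slon invocation are placeholders you
would adapt to your own setup.]

# On node 1, after node 2 has died:

# 1. Stop the local slon daemon first -- Steve's workaround for
#    slonik hanging while waiting on the slon restart notification.
pkill -f 'slon dot_ha'

# 2. Fail over the set to node 1 (node 2 was the origin).
slonik <<EOF
cluster name = dot_ha;
node 1 admin conninfo = 'dbname=DOT host=node1 user=postgres';
node 2 admin conninfo = 'dbname=DOT host=node2 user=postgres';
failover ( id = 2, backup node = 1 );
EOF

# 3. Restart slon against the new origin.
slon dot_ha 'dbname=DOT host=node1 user=postgres' &

# 4. Once the reformed cluster looks healthy, drop the failed node.
slonik <<EOF
cluster name = dot_ha;
node 1 admin conninfo = 'dbname=DOT host=node1 user=postgres';
drop node ( id = 2, event node = 1 );
EOF

[Alternatively, as Chris suggests, with only one surviving node you
could skip the FAILOVER/DROP NODE dance entirely and just run
"uninstall node ( id = 1 );" in slonik to remove replication, since a
one-node cluster isn't really replicating anything.]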
More information about the Slony1-general mailing list