Richard Yen dba at richyen.com
Thu Jun 18 11:18:23 PDT 2009
Hi,

I've been trying to get failover to work in 2.0.2, but it seems to hang.

I have a 3-node cluster and have followed the instructions at http://www.slony.info/documentation/failover.html#COMPLEXFAILOVER

Here's how I do it (node 1 is the provider, and node 2 is the failover node):
    -- subscribe node 3 to node 2
    -- execute FAILOVER
    -- slonik hangs

If I go into node 2 and look at sl_subscribe, there is only one
row, with provider=2, subscriber=3 (which is correct and expected).
Looking at sl_status, everything appears to be running just
fine (sl_event_lag and sl_time_lag go up and down, as if there's
activity). HOWEVER, if I do an update on node 2, the update never
makes it to node 3. (Node 1 still says provider=1, subscriber=2 AND
provider=2, subscriber=3.)

slonik is still running/hanging during all this.

If I strace the slonik process, I see the following:

======BEGIN STRACE======
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(3, "Q\0\0\0\30begin transaction; \0"..., 25, 0, NULL, 0) = 25
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "C\0\0\0\nBEGIN\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 17
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(3, "Q\0\0\0Wselect nl_backendpid from \"_sltest\".sl_nodelock     where nl_backendpid <> 28927; \0"..., 88, 0, NULL, 0) = 88
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "T\0\0\0&\0\1nl_backendpid\0\304\27Dn\0\3\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\17\0\1\0\0\0\00529006D\0\0\0\17\0\1\0\0\0\00529011D\0\0\0\17\0\1\0\0\0\00529012C\0\0\0\vSELECT\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 105
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(3, "Q\0\0\0\32rollback transaction;\0"..., 27, 0, NULL, 0) = 27
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I"..., 16384, 0, NULL, NULL) = 20
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(4, "Q\0\0\0\30begin transaction; \0"..., 25, 0, NULL, 0) = 25
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=4, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "C\0\0\0\nBEGIN\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 17
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(4, "Q\0\0\0Wselect nl_backendpid from \"_sltest\".sl_nodelock     where nl_backendpid <> 16155; \0"..., 88, 0, NULL, 0) = 88
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=4, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "T\0\0\0&\0\1nl_backendpid\0\0\1\"\203\0\3\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\17\0\1\0\0\0\00517510D\0\0\0\17\0\1\0\0\0\00517511C\0\0\0\vSELECT\0Z\0\0\0\5T"..., 16384, 0, NULL, NULL) = 89
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
sendto(4, "Q\0\0\0\32rollback transaction;\0"..., 27, 0, NULL, 0) = 27
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=4, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I"..., 16384, 0, NULL, NULL) = 20
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})               = 0
======END STRACE======

This repeats over and over again in the log (infinite loop?)
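
To make the trace easier to read, here is a minimal Python sketch that decodes the two recvfrom replies to the sl_nodelock query. It assumes the standard PostgreSQL v3 backend message framing (a 1-byte message type followed by a 4-byte big-endian length that counts itself) and text-format columns. The helper names parse_messages and data_row_values are made up for this illustration, and the byte strings are copied from the strace output with the line wraps removed.

```python
import struct

# Replies to "select nl_backendpid from "_sltest".sl_nodelock ..." as seen
# in the strace output above (fd 3 first, then fd 4).
reply_fd3 = (
    b"T\0\0\0&\0\1nl_backendpid\0\304\27Dn\0\3\0\0\0\27\0\4\377\377\377\377\0\0"
    b"D\0\0\0\17\0\1\0\0\0\00529006"
    b"D\0\0\0\17\0\1\0\0\0\00529011"
    b"D\0\0\0\17\0\1\0\0\0\00529012"
    b"C\0\0\0\vSELECT\0"
    b"Z\0\0\0\5T"
)
reply_fd4 = (
    b"T\0\0\0&\0\1nl_backendpid\0\0\1\"\203\0\3\0\0\0\27\0\4\377\377\377\377\0\0"
    b"D\0\0\0\17\0\1\0\0\0\00517510"
    b"D\0\0\0\17\0\1\0\0\0\00517511"
    b"C\0\0\0\vSELECT\0"
    b"Z\0\0\0\5T"
)


def parse_messages(buf):
    """Split a backend byte stream into (type, body) pairs."""
    messages = []
    pos = 0
    while pos < len(buf):
        mtype = buf[pos:pos + 1]
        # The length field includes its own 4 bytes but not the type byte.
        (length,) = struct.unpack("!I", buf[pos + 1:pos + 5])
        messages.append((mtype, buf[pos + 5:pos + 1 + length]))
        pos += 1 + length
    return messages


def data_row_values(body):
    """Return the column values of one DataRow ('D') body as strings."""
    (ncols,) = struct.unpack("!H", body[:2])
    pos = 2
    values = []
    for _ in range(ncols):
        (n,) = struct.unpack("!i", body[pos:pos + 4])
        pos += 4
        if n == -1:          # -1 marks a NULL column
            values.append(None)
        else:
            values.append(body[pos:pos + n].decode("ascii"))
            pos += n
    return values


def backend_pids(reply):
    """Collect the single nl_backendpid column from every DataRow."""
    return [data_row_values(body)[0]
            for mtype, body in parse_messages(reply) if mtype == b"D"]


pids_fd3 = backend_pids(reply_fd3)
pids_fd4 = backend_pids(reply_fd4)
print("fd 3:", pids_fd3)
print("fd 4:", pids_fd4)
```

Decoded this way, each pass of the loop gets three nl_backendpid rows back on fd 3 (29006, 29011, 29012) and two on fd 4 (17510, 17511), rolls back, and then sleeps for one second before starting over.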

I also tried it separately with the script provided by slony-ctl,
but no luck. (It DOES, however, work when there are only 2 nodes.)

Are there any known issues with 3+ node failover in 2.0.2?

Would anyone be able to walk me through this, if perhaps I'm doing  
something wrong?

Thanks!
--Richard

