Tue Aug 23 19:33:07 PDT 2005
- Previous message: [Slony1-general] Failover failures
- Next message: [Slony1-general] Failover failures
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Tue, Aug 23, 2005 at 11:29:34AM -0700, elein wrote:
>
> On Tue, Aug 23, 2005 at 11:40:45AM -0400, Christopher Browne wrote:
> > elein wrote:
> >
> > >Slony 1.1. Three nodes. 10 set(1) => 20 => 30.
> > >
> > >I ran failover from node10 to node20.
> > >
> > >On node30, the origin of the set was changed
> > >from 10 to 20, however, drop node10 failed
> > >because of the row in sl_setsync.
> > >
> > >This causes slon on node30 to quit and the cluster to
> > >become unstable, which in turn prevents putting
> > >node10 back into the mix.
> > >
> > >Please tell me I'm not the first one to run into
> > >this...
> > >
> > >The only clean workaround I can see is to drop
> > >node 30. Re-add it. And then re-add node10. This
> > >leaves us w/o a backup for the downtime.
> > >
> > >
> > >This is what is in some of the tables for node20:
> > >
> > >gb2=# select * from sl_node;
> > > no_id | no_active |         no_comment         | no_spool
> > >-------+-----------+----------------------------+----------
> > >    20 | t         | Node 20 - gb2 at localhost | f
> > >    30 | t         | Node 30 - gb3 at localhost | f
> > >(2 rows)
> > >
> > >gb2=# select * from sl_set;
> > > set_id | set_origin | set_locked |     set_comment
> > >--------+------------+------------+----------------------
> > >      1 |         20 |            | Set 1 for gb_cluster
> > >gb2=# select * from sl_setsync;
> > > ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> > >-----------+------------+-----------+------------+------------+---------+-----------------
> > >(0 rows)
> > >
> > >This is what I have for node30:
> > >
> > >gb3=# select * from sl_node;
> > > no_id | no_active |         no_comment         | no_spool
> > >-------+-----------+----------------------------+----------
> > >    10 | t         | Node 10 - gb at localhost  | f
> > >    20 | t         | Node 20 - gb2 at localhost | f
> > >    30 | t         | Node 30 - gb3 at localhost | f
> > >(3 rows)
> > >
> > >gb3=# select * from sl_set;
> > > set_id | set_origin | set_locked |     set_comment
> > >--------+------------+------------+----------------------
> > >      1 |         20 |            | Set 1 for gb_cluster
> > >(1 row)
> > >
> > >gb3=# select * from sl_setsync;
> > > ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
> > >-----------+------------+-----------+------------+------------+---------+-----------------
> > >         1 |         10 |       235 |    1290260 |    1290261 |         |
> > >(1 row)
> > >
> > >frustrated,
> > >--elein
> > >_______________________________________________
> > >Slony1-general mailing list
> > >Slony1-general at gborg.postgresql.org
> > >http://gborg.postgresql.org/mailman/listinfo/slony1-general
> > >
> > >
> > That error message in your other email was VERY helpful in pointing at
> > least at what clues to look for...
> >
> > I /think/ that the FAILOVER_SET event hasn't yet been processed on node
> > 30, which would be consistent with everything we see.
> >
> > Can you check logs on node 30 or sl_event on node 30 to see if
> > FAILOVER_SET has made it there?
> >
> > What seems not to have happened is for the FAILOVER_SET event to process
> > on node 30; when that *does* happen, it would delete out the sl_setsync
> > entry pointing to node 10 and create a new one pointing to node 20.
> > (This is in the last 1/2 of function failoverSet_int().)
> >
> > I'll bet that sl_subscribe on node 30 is still pointing to node 10; that
> > would be further confirmation that FAILOVER_SET hasn't processed on node 30.
> >
> > If that event hasn't processed, then we can at least move the confusion
> > from being:
> > "Help!  I don't know why the configuration is so odd on node 30!"
> > to
> > "Hmm.  The config is consistent with FAILOVER not being done yet.  What
> > prevented the FAILOVER_SET event from processing on node 30?"
> >
> > We're not at a full answer, but the latter question points to a more
> > purposeful search :-).
> >
>
> I'm fairly certain the failover did not make it to node30.  I have
> the setup here and will rerun it clean and send you logs and whatever
> tables you would like.  But not until this evening PST.
>
> --elein

One more hint.  node30 was originally getting set 1 (originating on
node10) from node20.  Perhaps the failover logic thought it did not
have to do so much in that case?

If only I could think of everything in one message :)

elein
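Chris's checks can be run straight against the Slony-I catalog tables on
node 30.  The queries below are only a sketch: they assume the cluster is
named gb_cluster (so its schema is "_gb_cluster"; substitute your own
cluster name), that the usual Slony-I 1.1 catalog columns are present
(ev_type in sl_event, sub_provider/sub_receiver in sl_subscribe), and that
the event type string in sl_event matches the event name Chris cites.

    -- run on node 30 (gb3); schema name assumes the cluster is gb_cluster

    -- has a FAILOVER_SET event arrived here at all, and from which origin?
    select ev_origin, ev_seqno, ev_type
      from "_gb_cluster".sl_event
     where ev_type = 'FAILOVER_SET'
     order by ev_seqno;

    -- which provider does node 30 currently list for set 1?
    select sub_set, sub_provider, sub_receiver, sub_active
      from "_gb_cluster".sl_subscribe
     where sub_set = 1;

    -- the stale row: ssy_origin should become 20 once failoverSet_int()
    -- runs on this node; it still reads 10 if FAILOVER_SET never processed
    select ssy_setid, ssy_origin, ssy_seqno
      from "_gb_cluster".sl_setsync
     where ssy_setid = 1;

If sl_event has no FAILOVER_SET row, the event never reached node 30; if
the row is there but sl_setsync still shows ssy_origin = 10, the event
arrived but was not processed, which would match the slon failure elein
describes.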