Fiel Cabral e4696wyoa63emq6w3250kiw60i45e1
Tue Oct 4 23:17:11 PDT 2005
The sl_event table on node 2 contains a FAILOVER_SET event, but the one on
node 3 (the backup node specified in the failover command) does not. Should
the backup node's sl_event table contain the FAILOVER_SET event?

sl_event on node 2 contains a FAILOVER_SET:
ev_timestamp | ev_origin | ev_seqno | ev_type
----------------------------+-----------+----------+---------------------
2005-10-04 17:49:10.487603 | 2 | 1 | STORE_PATH
2005-10-04 17:49:10.70457 | 2 | 2 | STORE_PATH
2005-10-04 17:49:10.712416 | 2 | 3 | STORE_LISTEN
2005-10-04 17:49:10.77891 | 2 | 4 | STORE_LISTEN
2005-10-04 17:49:38.146642 | 2 | 5 | SUBSCRIBE_SET
2005-10-04 17:49:05.608095 | 1 | 306 | STORE_NODE
2005-10-04 17:49:05.608095 | 1 | 307 | ENABLE_NODE
2005-10-04 17:49:08.029042 | 1 | 308 | STORE_NODE
2005-10-04 17:49:08.029042 | 1 | 309 | ENABLE_NODE
2005-10-04 17:49:10.641208 | 1 | 310 | STORE_PATH
2005-10-04 17:49:10.679501 | 1 | 311 | STORE_PATH
2005-10-04 17:49:10.722549 | 1 | 312 | STORE_LISTEN
2005-10-04 17:49:10.751999 | 1 | 313 | STORE_LISTEN
2005-10-04 17:55:02.413185 | 2 | 6 | SYNC
2005-10-04 17:49:42.44082 | 1 | 314 | ENABLE_SUBSCRIPTION
2005-10-04 17:49:10.60801 | 3 | 1 | STORE_PATH
2005-10-04 17:49:42.769833 | 1 | 315 | ENABLE_SUBSCRIPTION
2005-10-04 17:49:10.678128 | 3 | 2 | STORE_PATH
2005-10-04 17:49:10.713706 | 3 | 3 | STORE_LISTEN
2005-10-04 17:49:10.743235 | 3 | 4 | STORE_LISTEN
2005-10-04 17:49:38.417454 | 3 | 5 | SUBSCRIBE_SET
2005-10-04 17:49:52.680621 | 1 | 316 | SYNC
2005-10-04 17:50:53.010532 | 1 | 317 | SYNC
2005-10-04 17:51:53.112317 | 1 | 318 | SYNC
2005-10-04 17:52:53.146222 | 1 | 319 | SYNC
2005-10-04 17:53:53.192119 | 1 | 320 | SYNC
2005-10-04 17:54:53.602106 | 1 | 321 | SYNC
2005-10-04 17:55:53.710807 | 1 | 322 | SYNC
2005-10-04 17:56:02.893106 | 2 | 7 | SYNC
2005-10-04 17:56:42.786823 | 3 | 6 | SYNC
2005-10-04 17:56:53.833985 | 1 | 323 | SYNC
2005-10-04 17:57:03.007883 | 2 | 8 | SYNC
2005-10-04 17:57:43.692981 | 3 | 7 | SYNC
2005-10-04 17:57:53.902912 | 1 | 324 | SYNC
2005-10-04 17:58:03.062867 | 2 | 9 | SYNC
2005-10-04 17:58:43.736478 | 3 | 8 | SYNC
2005-10-04 17:58:53.953325 | 1 | 325 | SYNC
2005-10-04 17:59:03.112996 | 2 | 10 | SYNC
2005-10-04 17:59:43.77303 | 3 | 9 | SYNC
2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
2005-10-04 18:00:03.155204 | 2 | 11 | SYNC
2005-10-04 18:00:43.810793 | 3 | 10 | SYNC
2005-10-04 18:01:03.196571 | 2 | 12 | SYNC
2005-10-04 18:01:43.865925 | 3 | 11 | SYNC
2005-10-04 18:02:03.216029 | 2 | 13 | SYNC
2005-10-04 18:02:43.905505 | 3 | 12 | SYNC
2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
2005-10-04 18:03:38.947704 | 1 | 327 | FAILOVER_SET
2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
2005-10-04 18:03:49.921361 | 2 | 15 | SYNC
2005-10-04 18:04:48.875801 | 3 | 14 | SYNC
2005-10-04 18:04:49.970829 | 2 | 16 | SYNC
2005-10-04 18:05:48.92941 | 3 | 15 | SYNC
2005-10-04 18:05:49.985511 | 2 | 17 | SYNC
2005-10-04 18:06:48.963277 | 3 | 16 | SYNC
2005-10-04 18:06:49.998737 | 2 | 18 | SYNC
2005-10-04 18:07:49.033346 | 3 | 17 | SYNC
2005-10-04 18:07:50.028334 | 2 | 19 | SYNC
2005-10-04 18:08:49.051861 | 3 | 18 | SYNC
2005-10-04 18:08:50.056542 | 2 | 20 | SYNC
2005-10-04 18:09:49.075309 | 3 | 19 | SYNC
2005-10-04 18:09:50.093277 | 2 | 21 | SYNC
(62 rows)

sl_event on node 3 (backup node) does not have the FAILOVER_SET:

ev_timestamp | ev_origin | ev_seqno | ev_type
----------------------------+-----------+----------+---------------------
2005-10-04 17:49:10.60801 | 3 | 1 | STORE_PATH
2005-10-04 17:49:10.678128 | 3 | 2 | STORE_PATH
2005-10-04 17:49:10.713706 | 3 | 3 | STORE_LISTEN
2005-10-04 17:49:10.743235 | 3 | 4 | STORE_LISTEN
2005-10-04 17:49:38.417454 | 3 | 5 | SUBSCRIBE_SET
2005-10-04 17:49:10.487603 | 2 | 1 | STORE_PATH
2005-10-04 17:49:08.029042 | 1 | 308 | STORE_NODE
2005-10-04 17:49:10.70457 | 2 | 2 | STORE_PATH
2005-10-04 17:49:08.029042 | 1 | 309 | ENABLE_NODE
2005-10-04 17:49:10.712416 | 2 | 3 | STORE_LISTEN
2005-10-04 17:49:10.641208 | 1 | 310 | STORE_PATH
2005-10-04 17:49:10.77891 | 2 | 4 | STORE_LISTEN
2005-10-04 17:49:10.679501 | 1 | 311 | STORE_PATH
2005-10-04 17:49:38.146642 | 2 | 5 | SUBSCRIBE_SET
2005-10-04 17:49:10.722549 | 1 | 312 | STORE_LISTEN
2005-10-04 17:55:02.413185 | 2 | 6 | SYNC
2005-10-04 17:56:02.893106 | 2 | 7 | SYNC
2005-10-04 17:49:10.751999 | 1 | 313 | STORE_LISTEN
2005-10-04 17:49:42.44082 | 1 | 314 | ENABLE_SUBSCRIPTION
2005-10-04 17:56:42.786823 | 3 | 6 | SYNC
2005-10-04 17:57:03.007883 | 2 | 8 | SYNC
2005-10-04 17:49:42.769833 | 1 | 315 | ENABLE_SUBSCRIPTION
2005-10-04 17:49:52.680621 | 1 | 316 | SYNC
2005-10-04 17:50:53.010532 | 1 | 317 | SYNC
2005-10-04 17:51:53.112317 | 1 | 318 | SYNC
2005-10-04 17:52:53.146222 | 1 | 319 | SYNC
2005-10-04 17:53:53.192119 | 1 | 320 | SYNC
2005-10-04 17:54:53.602106 | 1 | 321 | SYNC
2005-10-04 17:55:53.710807 | 1 | 322 | SYNC
2005-10-04 17:56:53.833985 | 1 | 323 | SYNC
2005-10-04 17:57:43.692981 | 3 | 7 | SYNC
2005-10-04 17:57:53.902912 | 1 | 324 | SYNC
2005-10-04 17:58:03.062867 | 2 | 9 | SYNC
2005-10-04 17:58:43.736478 | 3 | 8 | SYNC
2005-10-04 17:58:53.953325 | 1 | 325 | SYNC
2005-10-04 17:59:03.112996 | 2 | 10 | SYNC
2005-10-04 17:59:43.77303 | 3 | 9 | SYNC
2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
2005-10-04 18:00:03.155204 | 2 | 11 | SYNC
2005-10-04 18:00:43.810793 | 3 | 10 | SYNC
2005-10-04 18:01:03.196571 | 2 | 12 | SYNC
2005-10-04 18:01:43.865925 | 3 | 11 | SYNC
2005-10-04 18:02:03.216029 | 2 | 13 | SYNC
2005-10-04 18:02:43.905505 | 3 | 12 | SYNC
2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
2005-10-04 18:03:49.921361 | 2 | 15 | SYNC
2005-10-04 18:04:48.875801 | 3 | 14 | SYNC
2005-10-04 18:04:49.970829 | 2 | 16 | SYNC
2005-10-04 18:05:48.92941 | 3 | 15 | SYNC
2005-10-04 18:05:49.985511 | 2 | 17 | SYNC
2005-10-04 18:06:48.963277 | 3 | 16 | SYNC
2005-10-04 18:06:49.998737 | 2 | 18 | SYNC
2005-10-04 18:07:49.033346 | 3 | 17 | SYNC
2005-10-04 18:07:50.028334 | 2 | 19 | SYNC
2005-10-04 18:08:49.051861 | 3 | 18 | SYNC
2005-10-04 18:08:50.056542 | 2 | 20 | SYNC
2005-10-04 18:09:49.075309 | 3 | 19 | SYNC
2005-10-04 18:09:50.093277 | 2 | 21 | SYNC
2005-10-04 18:10:49.100012 | 3 | 20 | SYNC
2005-10-04 18:10:50.117138 | 2 | 22 | SYNC
(61 rows)
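
Comparing two listings of this size by eye is error-prone, so here is a minimal sketch of how the comparison could be automated. It assumes the psql output has been pasted into strings and identifies each event by its (ev_origin, ev_seqno) pair. The helper names parse_events and missing_events are hypothetical and not part of Slony-I; the sample rows are an abbreviated excerpt of the listings above.

```python
import re

# One psql data row: ev_timestamp | ev_origin | ev_seqno | ev_type.
# Header, separator and "(N rows)" lines do not match and are skipped.
ROW = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} [\d:.]+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\w+)\s*$"
)


def parse_events(psql_output):
    """Return {(ev_origin, ev_seqno): ev_type} for every data row."""
    events = {}
    for line in psql_output.splitlines():
        m = ROW.match(line)
        if m:
            _, origin, seqno, ev_type = m.groups()
            events[(int(origin), int(seqno))] = ev_type
    return events


def missing_events(reference, other):
    """Events present in `reference` but absent from `other`, sorted by key."""
    return sorted((key, reference[key]) for key in reference.keys() - other.keys())


# Abbreviated excerpts of the node 2 and node 3 listings (not the full tables).
node2 = """
 2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
 2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
 2005-10-04 18:03:38.947704 | 1 | 327 | FAILOVER_SET
 2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
"""
node3 = """
 2005-10-04 17:59:54.095892 | 1 | 326 | SYNC
 2005-10-04 18:03:03.238632 | 2 | 14 | SYNC
 2005-10-04 18:03:48.819508 | 3 | 13 | SYNC
"""

diff = missing_events(parse_events(node2), parse_events(node3))
for (origin, seqno), ev_type in diff:
    print(f"missing on node 3: origin={origin} seqno={seqno} type={ev_type}")
```

On this excerpt the only event reported is origin 1, seqno 327 (FAILOVER_SET). Run against the full listings, the same comparison would also flag origin 1 events 306 and 307 (STORE_NODE and ENABLE_NODE), which appear on node 2 but not on node 3.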


On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
>
> The problem persists after the node IDs were changed from [1, 2, 3] to
> [10, 20, 30].
>
> Inside gdb, the failedNode2 query did not return an error (function return
> value was 0).
>
> Node 2 was able to change set_origin to node 3.
> Node 3 is stuck with set_origin = node 1.
>
> On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com > wrote:
> >
> > Thanks Elein. I'll run gdb and step through slonik_failed_node to
> > (maybe) see if failedNode2 is failing.
> >
> >
> > On 10/4/05, elein <elein at varlena.com > wrote:
> > >
> > > Fiel,
> > >
> > > In my own tests, with node 10->20->30, failover from 10 to 20 failed
> > > because node 30 was unusable and had to be recreated from scratch.
> > > This is a serious bug in my book.
> > >
> > > In one case the problem seemed to be dropping the first node
> > > "too soon". I have not tested that case so I don't know that
> > > this was the problem.
> > >
> > > > What I have verified is that the third node never received any message
> > > regarding the failover and did not change its information
> > > to get its table set from the new origin, 20.
> > >
> > > Also, try not to use Node 1, 2, 3. Node 1 has some special meaning
> > > in some cases that you will want to avoid.
> > >
> > > We are with you, not ignoring you.
> > >
> > > --elein
> > >
> > > On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
> > > > > Right after running the failover command I issue the DROP NODE command
> > > > > to drop node 1. slonik prints an error message and exits with return
> > > > > value 12:
> > > > >
> > > > > sys:17: TRY: drop node
> > > > > sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1); - ERROR: Slony-I: Node 1 is still origin of one or more sets
> > > > >
> > > > > Something should have changed the origin to node 3 but it isn't happening.
> > > >
> > > >
> > > > > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> > > > >
> > > > > I have 3 nodes. Nodes 2 and 3 are subscribers of node 1 and I'm trying to
> > > > > failover from node 1 to node 3. The failover command succeeds but the
> > > > > database of node 3 is still read-only and the origin is still node 1. I
> > > > > don't have the same problem when doing failover with only two nodes
> > > > > because the set is moved immediately by failedNode.
> > > > >
> > > > > failedNode (in the code below) is able to set the provider successfully.
> > > > >
> > > > > Some code elsewhere is actually moving the replication set. Where is that
> > > > > code? Is it in slon or slonik or in the sql scripts?
> > > > >
> > > > > How do I find out that slon caught the signal and is doing the right thing
> > > > > in response to the signal?
> > > >
> > > > >     raise notice ''failedNode: set % has other direct receivers - change providers only'', v_row.set_id;
> > > > >     -- ----
> > > > >     -- Backup node is not the only direct subscriber. This
> > > > >     -- means that at this moment, we redirect all direct
> > > > >     -- subscribers to receive from the backup node, and the
> > > > >     -- backup node itself to receive from another one.
> > > > >     -- The admin utility will wait for the slon engine to
> > > > >     -- restart and then call failedNode2() on the node with
> > > > >     -- the highest SYNC and redirect this to it on
> > > > >     -- backup node later.
> > > > >     -- ----
> > > > > ... etc ...
> > > > >
> > > > >     -- ----
> > > > >     -- Make sure the node daemon will restart
> > > > >     -- ----
> > > > >     notify "_@CLUSTERNAME@_Restart";
> > > > 816
> > > >
> > > > -Fiel
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > > _______________________________________________
> > > > Slony1-general mailing list
> > > > Slony1-general at gborg.postgresql.org
> > > > http://gborg.postgresql.org/mailman/listinfo/slony1-general
> > >
> > >
> >
>