Tignor, Tom ttignor at akamai.com
Thu Jul 6 05:38:32 PDT 2017
	Hi Steve,
	Your diagrams and description make sense. In our failover, though, we selected node 3 (“failover (id=1, backup node=3)”). The output I provided (see below) seems to show we failed because we were missing the 2<->4 and 2<->5 paths. With slony1-2.2.4 I can actually reproduce this by deleting those paths, and sometimes (not always) the failover will self-correct if I add them back after a prolonged delay. It seems that in the problem case node 2 is farthest ahead at failover time, and while I might have hoped node 2 would only have to feed node 3, it appears it has to feed everybody.
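	For reference, putting the missing paths back looks roughly like the slonik fragment below. It is only a sketch: the cluster name and dbname match what appears in our logs, but the host values are placeholders rather than our real conninfo strings.

	    # sketch only -- placeholder hosts, real conninfo elided
	    cluster name = ams_cluster;
	    node 2 admin conninfo = 'dbname=ams host=node2.example';
	    node 4 admin conninfo = 'dbname=ams host=node4.example';
	    node 5 admin conninfo = 'dbname=ams host=node5.example';
	    store path (server = 2, client = 4, conninfo = 'dbname=ams host=node2.example');
	    store path (server = 4, client = 2, conninfo = 'dbname=ams host=node4.example');
	    store path (server = 2, client = 5, conninfo = 'dbname=ams host=node2.example');
	    store path (server = 5, client = 2, conninfo = 'dbname=ams host=node5.example');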
	Sorry to hear about the Bugzilla spam. We are pretty well invested in slony1 at this point, so if there is a way I can contribute to this or other efforts, certainly let me know.
	Thanks,

	Tom ☺


On 7/5/17, 9:53 PM, "Steve Singer" <steve at ssinger.info> wrote:

    On Wed, 5 Jul 2017, Tignor, Tom wrote:
    
    >
    > 	Interesting. Of course the behavior evident on inspection indicated something like this must be happening.
    > 	It seems the docs could be improved on the subject of required paths. I recall some sections indicate it is not harmful to have a path from each node to every other node. What seems not to be spelled out is that for the service to be highly available, i.e. to have the ability to fail over, each node is *required* to have a path to every other node.
    > 	On a related point, it would be a lot more convenient if we could give each node a default path, instead of re-specifying the same IP for each new subscriber and adding a new line of conninfo to every slonik script.
    > 	Would either of these items be worth writing up in bug tracking and/or contributing a solution for? If so, could I get that link?
    
    You don't need a path to EVERY other node; you just need a path to the nodes 
    that might be the providers as part of the failover.
    
    For example
    
    1-->2-->3
        |
        V
        4
    
    If that is the direction of the replication flow, and of the paths (plus back 
    paths), then node 2 is the only viable failover candidate for node 1.  There 
    isn't a reason why nodes 3 and 4 need to have paths between each other.
    
    However
    
    1--->2
    |\
    V \
    3  4
    
    means that any of nodes 2, 3, or 4 might be failover candidates, and if node 3 
    becomes the new origin then there would need to be paths between node 3 and 
    nodes 2 and 4.
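    
    A minimal sketch of the extra paths that case would need, on top of the usual 
    slonik preamble (cluster name and admin conninfo lines); the conninfo strings 
    here are placeholders:
    
        store path (server = 3, client = 2, conninfo = 'dbname=mydb host=node3');
        store path (server = 2, client = 3, conninfo = 'dbname=mydb host=node2');
        store path (server = 3, client = 4, conninfo = 'dbname=mydb host=node3');
        store path (server = 4, client = 3, conninfo = 'dbname=mydb host=node4');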
    
    I tried to capture a lot of these rules in the sl_failover_targets view.
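    
    You can inspect that view directly; for example (assuming the cluster schema 
    is named _ams_cluster, as it appears in the logs below):
    
        select * from _ams_cluster.sl_failover_targets;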
    
    
    We had to take the slony bugzilla instance offline because of excessive spam.
    
    
    Steve
    
    
    > 	Tom ☺
    >
    >
    > On 7/2/17, 9:30 PM, "Steve Singer" <steve at ssinger.info> wrote:
    >
    >    On Wed, 28 Jun 2017, Tignor, Tom wrote:
    >
    >    >
    >    > 	Hi Steve,
    >    > 	Thanks for the info. I was able to repro this problem in testing, and saw that as soon as I added the missing path back, the still-in-process failover op continued on and completed successfully.
    >    > 	We do issue DROP NODEs in the event we need to restore a replica from scratch, which did occur. However, the restore workflow should also issue STORE PATHs to/from the new replica node and every other node. Still investigating this.
    >    > 	What still confuses me is the recurring “remoteWorkerThread_X: SYNC” output despite there being no configured path. If the path is missing, how does slon continue to get SYNC events?
    >
    >    A slon can get events, including SYNC, from nodes other than the event origin 
    >    if it has a path to that node.  However, a slon can only replicate the data 
    >    from a node it has a path to.
    >
    >
    >    Steve
    >
    >
    >
    >    >
    >    > 	Tom ☺
    >    >
    >    >
    >    > On 6/27/17, 5:04 PM, "Steve Singer" <steve at ssinger.info> wrote:
    >    >
    >    >    On 06/27/2017 11:59 AM, Tignor, Tom wrote:
    >    >
    >    >
    >    >    The disableNode() in the log makes it look like someone did a DROP NODE.
    >    >
    >    >    If the only issue is that you're missing active paths in sl_path, you can
    >    >    add/update the paths with slonik.
    >    >
    >    >
    >    >
    >    >
    >    >    >
    >    >    > Hello Slony-I community,
    >    >    >
    >    >    >              Hoping someone can advise on a strange and serious problem.
    >    >    > We performed a slony service failover yesterday. For the first time
    >    >    > ever, our slony service FAILOVER op errored out. We recently expanded
    >    >    > our cluster to 7 consumers from a single provider. There are no load
    >    >    > issues during normal operations. As the error output below shows,
    >    >    > though, our node 4 and node 5 consumers never got the events they
    >    >    > needed. Here’s where it gets weird: closer inspection has shown that
    >    >    > node 2->4 and node 2->5 path data went missing out of the service at
    >    >    > some point. It seems clear that’s the main issue, but in spite of that,
    >    >    > both node 4 and node 5 continued to find and process node 2 SYNC events
    >    >    > for a full week! The logs show this happened in spite of multiple restarts.
    >    >    >
    >    >    > How can this happen? If missing path data stymies the failover, wouldn’t
    >    >    > it also prevent normal SYNC processing?
    >    >    >
    >    >    > In the case where a failover is begun with inadequate path data, what’s
    >    >    > the best resolution? Can path data be quickly applied to allow failover
    >    >    > to succeed?
    >    >    >
    >    >    >              Thanks in advance for any insights.
    >    >    >
    >    >    > ---- failover error ----
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
    >    >    > calling restart node 1
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
    >    >    > 2017-06-26 18:33:02
    >    >    >
    >    >    > executing preFailover(1,1) on 2
    >    >    >
    >    >    > executing preFailover(1,1) on 3
    >    >    >
    >    >    > executing preFailover(1,1) on 4
    >    >    >
    >    >    > executing preFailover(1,1) on 5
    >    >    >
    >    >    > executing preFailover(1,1) on 6
    >    >    >
    >    >    > executing preFailover(1,1) on 7
    >    >    >
    >    >    > executing preFailover(1,1) on 8
    >    >    >
    >    >    > NOTICE: executing "_ams_cluster".failedNode2 on node 2
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 8 only on event 5000061654, node 4 only
    >    >    > on event 5000061654, node 5 only on event 5000061655, node 3 only on
    >    >    > event 5000061662, node 6\
    >    >    >
    >    >    >   only on event 5000061654, node 7 only on event 5000061656
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061657, node 5 only
    >    >    > on event 5000061663, node 3 only on event 5000061663, node 6 only on
    >    >    > event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663, node 6 only on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    >    >    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    >    >    > on event 5000061663
    >    >    >
    >    >    > ---- node 4 log archive ----
    >    >    >
    >    >    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
    >    >    > pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
    >    >    >
    >    >    > 2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
    >    >    > pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2
    >    >    > pa_client=4 pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2
    >    >    > pa_client=4 pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4
    >    >    > pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
    >    >    >
    >    >    > 2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-21 14:31:42 UTC [6253] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-21 14:35:26 UTC [32367] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:21:25 UTC [9278] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:33:04 UTC [28839] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:33:30 UTC [1785] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > bos-mpt5c:odin-9353 ttignor$
    >    >    >
    >    >    > ---- node 5 log archive ----
    >    >    >
    >    >    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
    >    >    > pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
    >    >    >
    >    >    > 2017-06-15 15:13:56 UTC [20700] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2
    >    >    > pa_client=5 pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-15 15:53:01 UTC [20374] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5
    >    >    > pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-16 17:28:19 UTC [2859] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2
    >    >    > pa_client=5 pa_conninfo="dbname=ams
    >    >    >
    >    >    > 2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
    >    >    >
    >    >    > 2017-06-19 15:11:40 UTC [10753] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-20 18:40:11 UTC [450] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-21 14:31:41 UTC [22300] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-21 14:35:28 UTC [26777] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:21:27 UTC [28366] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:33:04 UTC [29345] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > 2017-06-26 18:33:27 UTC [1299] INFO   localListenThread: got restart
    >    >    > notification
    >    >    >
    >    >    > bos-mpt5c:odin-9353 ttignor$
    >    >    >
    >    >    >              Tom ☺
    >    >    >
    >    >    >
    >    >    >
    >    >    > _______________________________________________
    >    >    > Slony1-general mailing list
    >    >    > Slony1-general at lists.slony.info
    >    >    > http://lists.slony.info/mailman/listinfo/slony1-general
    >    >    >
    >    >
    >    >
    >    >
    >    >
    >
    >
    >
    


