Steve Singer steve at ssinger.info
Sun Jul 2 18:30:12 PDT 2017
On Wed, 28 Jun 2017, Tignor, Tom wrote:

>
> 	Hi Steve,
> 	Thanks for the info. I was able to reproduce this problem in testing and saw that, as soon as I added the missing path back, the still-in-progress failover op continued and completed successfully.
> 	We do issue DROP NODEs in the event we need to restore a replica from scratch, which did occur. However, the restore workflow also should issue store paths to/from the new replica node and every other node. Still investigating this.
> 	What still confuses me is the recurring “remoteWorkerThread_X: SYNC” output despite there being no configured path. If the path is missing, how does slon continue to get SYNC events?

A slon can receive events, including SYNC events, from nodes other than
the event origin, as long as it has a path to the node forwarding them.
However, a slon can only replicate data from a node it has a path to.
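
To make the missing-path check concrete, here is a minimal Python sketch.
It assumes the (pa_server, pa_client) pairs have already been read out of
sl_path, and that every node is meant to have a path to every other node,
as the restore workflow described in this thread implies. The function
names, the sample rows and the conninfo placeholder are all hypothetical;
only the STORE PATH statement form comes from slonik.

```python
from itertools import permutations


def missing_paths(node_ids, existing_paths):
    """Return the (server, client) pairs that have no row in existing_paths.

    Assumes a full mesh is wanted: one path for every ordered pair of nodes.
    """
    have = set(existing_paths)
    return [pair for pair in permutations(sorted(node_ids), 2) if pair not in have]


def store_path_statements(pairs, conninfo_for):
    """Render one slonik STORE PATH statement per (server, client) pair.

    conninfo_for maps a server node id to the conninfo clients use to reach it.
    """
    return [
        f"store path (server = {server}, client = {client}, "
        f"conninfo = '{conninfo_for[server]}');"
        for server, client in pairs
    ]


# Hypothetical snapshot of (pa_server, pa_client) rows for three nodes.
nodes = [2, 4, 5]
paths = [(4, 2), (5, 2), (4, 5), (5, 4)]

# Placeholder only: substitute the real connection string for node 2.
conninfo = {2: "<conninfo for node 2>"}

gaps = missing_paths(nodes, paths)
for statement in store_path_statements(gaps, conninfo):
    print(statement)
```

With the sample rows above, the two gaps reported are 2->4 and 2->5, the
same pair that went missing in this thread. The generated statements still
need the usual cluster name and admin conninfo preamble before slonik will
run them.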


Steve



>
> 	Tom ☺
>
>
> On 6/27/17, 5:04 PM, "Steve Singer" <steve at ssinger.info> wrote:
>
>    On 06/27/2017 11:59 AM, Tignor, Tom wrote:
>
>
>    The disableNode() in the logs makes it look like someone did a DROP NODE.
>
>    If the only issue is that you're missing active paths in sl_path, you can
>    add/update the paths with slonik.
>
>
>
>
>    > Hello Slony-I community,
>    >
>    >              Hoping someone can advise on a strange and serious problem.
>    > We performed a slony service failover yesterday. For the first time
>    > ever, our slony service FAILOVER op errored out. We recently expanded
>    > our cluster to 7 consumers from a single provider. There are no load
>    > issues during normal operations. As the error output below shows,
>    > though, our node 4 and node 5 consumers never got the events they
>    > needed. Here’s where it gets weird: closer inspection has shown that
>    > node 2->4 and node 2->5 path data went missing out of the service at
>    > some point. It seems clear that’s the main issue, but in spite of that,
>    > both node 4 and node 5 continued to find and process node 2 SYNC events
>    > for a full week! The logs show this happened in spite of multiple restarts.
>    >
>    > How can this happen? If missing path data stymies the failover, wouldn’t
>    > it also prevent normal SYNC processing?
>    >
>    > In the case where a failover is begun with inadequate path data, what’s
>    > the best resolution? Can path data be quickly applied to allow failover
>    > to succeed?
>    >
>    >              Thanks in advance for any insights.
>    >
>    > ---- failover error ----
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
>    > calling restart node 1
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
>    > 2017-06-26 18:33:02
>    >
>    > executing preFailover(1,1) on 2
>    >
>    > executing preFailover(1,1) on 3
>    >
>    > executing preFailover(1,1) on 4
>    >
>    > executing preFailover(1,1) on 5
>    >
>    > executing preFailover(1,1) on 6
>    >
>    > executing preFailover(1,1) on 7
>    >
>    > executing preFailover(1,1) on 8
>    >
>    > NOTICE: executing "_ams_cluster".failedNode2 on node 2
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 8 only on event 5000061654, node 4 only
>    > on event 5000061654, node 5 only on event 5000061655, node 3 only on
>    > event 5000061662, node 6 only on event 5000061654, node 7 only on event
>    > 5000061656
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061657, node 5 only
>    > on event 5000061663, node 3 only on event 5000061663, node 6 only on
>    > event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663, node 6 only on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
>    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
>    > on event 5000061663
>    >
>    > ---- node 4 log archive ----
>    >
>    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
>    > pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
>    >
>    > 2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
>    > pa_conninfo="dbname=ams
>    >
>    > 2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2
>    > pa_client=4 pa_conninfo="dbname=ams
>    >
>    > 2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2
>    > pa_client=4 pa_conninfo="dbname=ams
>    >
>    > 2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4
>    > pa_conninfo="dbname=ams
>    >
>    > 2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
>    >
>    > 2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-21 14:31:42 UTC [6253] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-21 14:35:26 UTC [32367] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:21:25 UTC [9278] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:33:04 UTC [28839] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:33:30 UTC [1785] INFO   localListenThread: got restart
>    > notification
>    >
>    > bos-mpt5c:odin-9353 ttignor$
>    >
>    > ---- node 5 log archive ----
>    >
>    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
>    > pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
>    >
>    > 2017-06-15 15:13:56 UTC [20700] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2
>    > pa_client=5 pa_conninfo="dbname=ams
>    >
>    > 2017-06-15 15:53:01 UTC [20374] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5
>    > pa_conninfo="dbname=ams
>    >
>    > 2017-06-16 17:28:19 UTC [2859] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2
>    > pa_client=5 pa_conninfo="dbname=ams
>    >
>    > 2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
>    >
>    > 2017-06-19 15:11:40 UTC [10753] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-20 18:40:11 UTC [450] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-21 14:31:41 UTC [22300] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-21 14:35:28 UTC [26777] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:21:27 UTC [28366] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:33:04 UTC [29345] INFO   localListenThread: got restart
>    > notification
>    >
>    > 2017-06-26 18:33:27 UTC [1299] INFO   localListenThread: got restart
>    > notification
>    >
>    > bos-mpt5c:odin-9353 ttignor$
>    >
>    >              Tom ☺
>    >
>    >
>    >
>    > _______________________________________________
>    > Slony1-general mailing list
>    > Slony1-general at lists.slony.info
>    > http://lists.slony.info/mailman/listinfo/slony1-general
>    >
>
>
>
>

