Tory M Blue tmblue at gmail.com
Sun Sep 11 23:57:48 PDT 2016
Sorry, I'm a bit sleep deprived; this is almost exactly the same thing I
asked for help with in 2014. Jan and Jeff both came in and gave me
suggestions for keepalives that are much more aggressive than what I have
set now.

So I'm going to test with the more aggressive settings from this 2014
thread:
https://www.mail-archive.com/slony1-general@lists.slony.info/msg06967.html
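
For reference, "more aggressive" presumably means something along these
lines in the slon runtime config (the values here are illustrative guesses
on my part, not the exact numbers from that thread):

    tcp_keepalive_idle 60        # first probe after 60s of idle, not 300s
    tcp_keepalive_interval 60    # re-probe every 60s
    tcp_keepalive_count 5        # give up after 5 unanswered probes

The point being that a probe crosses the wire at least once a minute, which
should keep a firewall or VPN state-table entry from aging out.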

How lame that I spaced; I knew Jan had been helpful, but I totally forgot
about this thread. UUGH! Sorry.

And yes, double bad: top posting!!

Tory

On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <tmblue at gmail.com> wrote:

> Jan has helped me before, giving me ideas to help with wide-area
> replication where the connection seems to drop during a large copy set
> and/or an index creation. When no bits are crossing the wire, the
> connections get dropped by the firewall (or something else), so when Slony
> finishes up a table or index creation and attempts to grab the next table,
> the connection is no longer there; Slony reports a failure and tries
> again.
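>
> For what it's worth, one way I can think of to check whether the
> keepalives are actually armed on the replication connections (assuming
> Linux on both ends, and the default PostgreSQL port) is ss:
>
>     ss -to state established '( sport = :5432 or dport = :5432 )'
>
> Sockets with keepalives enabled show a timer:(keepalive,...) field; if no
> timer field shows up at all, the slon settings aren't reaching that
> socket.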
>
> I think I'm running into this between my Colo and Amazon, using their VPN
> gateway.
>
> Here is the snippet of logs. There is no index here; we dropped it on the
> new node so that it would not fail. What's odd here is that it copies all
> the data, and then 35 minutes later it reports the copy time, which tells
> me it's doing something, but I'm not sure what, if there is no index on
> that table. (There is a primary key, which maintains integrity, and we
> didn't think we should drop that.) But there are no other indexes, so the
> 35 minutes or whatever is a mystery...
>
>
> 2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table
> "torque"."adimpressions"
> 2016-09-11 22:39:39 PDT CONFIG remoteWorkerThread_1: 76955497834 bytes
> copied for table "torque"."adimpressions"
> 916499:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: 6121.393
> seconds to copy table "torque"."impressions"
> 916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table
> "torque"."impressions_archive"
> 916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of
> table "torque"."impressions_archive"
> 916811:2016-09-11 23:14:25 PDT ERROR  remoteWorkerThread_1: "select
> "_cls".copyFields(237);"
> 916907:2016-09-11 23:14:25 PDT WARN   remoteWorkerThread_1: data copy for
> set 2 failed 1 times - sleep 15 seconds
> 917014:2016-09-11 23:14:25 PDT INFO   cleanupThread: 7606.655 seconds for
> cleanupEvent()
>
> This run, I added keepalives by the following method (the timing and
> results are the same without them; set 2 fails with error 237).
>
> Adding the following to both slon commands on the origin and the new
> node:
>
> tcp_keepalive_idle 300
> tcp_keepalive_count 5
> tcp_keepalive_interval 300
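>
> Doing the math on those numbers (if I understand standard TCP keepalive
> semantics right, so take this as my reading rather than gospel): a dead
> peer is only declared dead after
>
>     idle + count * interval = 300 + 5 * 300 = 1800 seconds = 30 minutes
>
> and the first probe doesn't even go out until the socket has been idle
> for a full 5 minutes. If the firewall/VPN idle timeout is anywhere near
> 30 minutes, these values may be too lazy to keep the state entry alive,
> which would explain why the 2014 advice is so much more aggressive.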
>
> Now, I'm not entirely sure how this is supposed to work, or whether I
> tuned it right. It obviously fails at the 30-minute mark (this one is 25
> minutes), yet the servers never lose connection: I have a ping running
> (not quite the same thing, I know), and it shows zero packet loss over
> the 2+ hours that these replication attempts take. So maybe someone
> smarter than me can advise how I should tune the keepalives, if that's
> what is happening.
>
> I thought it would only use the keepalives if it felt the partner was no
> longer there, but since I know the pings show there are no connectivity
> issues, I'm at a loss. AGAIN :)
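>
> (Reading up on it, keepalive probes are apparently sent by the kernel any
> time the connection has simply been idle for the configured time, whether
> or not the partner looks gone; the per-socket slon settings just override
> the Linux system defaults, which you can check with
>
>     sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
>
> and which default to 7200 / 75 / 9, i.e. over two hours before the first
> probe even goes out.)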
>
> Thanks for the assist
>
> Tory
>