[Slony1-general] Need some help again, keep alives, wide area replication , set failure STILL

Mon Sep 12 11:33:38 PDT 2016

> On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <tmblue at gmail.com> wrote:
>
>> Jan has helped me before with ideas for wide area replication, where it
>> seems that the connection drops during a large copy set and/or an index
>> creation. While no bits are crossing the wire, the connection is dropped
>> by the FW or some other device. Slony then finishes the table or index
>> creation and attempts to grab the next table, but the connection is no
>> longer there, so Slony reports a failure and tries again.
>>
>> I think I'm running into this between my Colo and Amazon, using their VPN
>> gateway.
>>
>> Here is a snippet of the logs. There is no index here; we dropped it on
>> the new node so that it would not fail. What's odd is that it copies all
>> the data and then reports the time 35 minutes later, which tells me it's
>> doing something, but I'm not sure what, since there is no index on that
>> table. (There is a primary key, which maintains integrity, and we didn't
>> think we should drop that.) There are no other indexes, so the 35 minutes
>> or so is a mystery.
>>
>>
>> 2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table
>> "torque"."adimpressions"
>> 2016-09-11 *22:39:39 *PDT CONFIG remoteWorkerThread_1: 76955497834 bytes
>> copied for table "torque"."adimpressions"
>> 916499:2016-09-11 *23:14:25 *PDT CONFIG remoteWorkerThread_1: 6121.393
>> seconds to copy table "torque"."impressions"
>> 916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table
>> "torque"."impressions_archive"
>> 916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of
>> table "torque"."impressions_archive"
>> 916811:2016-09-11 23:14:25 PDT ERROR  remoteWorkerThread_1: "select
>> "_cls".copyFields(237);"
>> 916907:2016-09-11 23:14:25 PDT WARN   remoteWorkerThread_1: data copy for
>> set 2 failed 1 times - sleep 15 seconds
>> 917014:2016-09-11 23:14:25 PDT INFO   cleanupThread: 7606.655 seconds for
>> cleanupEvent()
>>
>> For this run, I added keep-alives by the following method (the timing
>> and results are the same without them; set 2 fails with error 237).
>>
>> I added the following to both slon commands, on the origin and the new node:
>>
>> tcp_keepalive_idle 300 tcp_keepalive_count 5 tcp_keepalive_interval 300
>>
>> Now, I'm not entirely sure how this is supposed to work, or whether I
>> tuned it wrong. It obviously fails at the 30 minute mark, and this is 25
>> minutes; however, the servers never lose connection. (I have a ping
>> running, which is not quite the same, but it shows zero packet loss over
>> the 2+ hours that these replication attempts take.) So maybe someone
>> smarter than me can advise how I should tune the keep-alives, if that's
>> what is happening.
>>
>> I thought it would only use the keep-alives if it felt the partner was no
>> longer there, but since I know pings show there are no connectivity issues,
>> I'm at a loss. AGAIN :)
>>
>> Thanks for the assist
>>
>> Tory
>>
>
Okay, keepalives didn't work, but maybe I configured slon.conf wrong; there
do not appear to be any real examples.

I used:

tcp_keepalive_time = 5

tcp_keepalive_probes = 24

tcp_keepalive_intvl = 5
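
On the question of where values like these take effect: on Linux, keepalive parameters can be set on an individual socket, and a socket that sets them uses its own values rather than the system-wide defaults. The sketch below is a generic Python illustration of that mechanism using the three values above. It is not Slony code, it does not confirm that slon accepts these parameter names in slon.conf, and the function name `enable_keepalive` is made up for the example.

```python
import socket

def enable_keepalive(sock, idle, interval, count):
    """Turn on TCP keepalive for one socket and return the values read back.

    Returns None on platforms that lack the Linux-style per-socket options.
    """
    needed = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")
    if not all(hasattr(socket, name) for name in needed):
        return None
    # Keepalive is off by default; SO_KEEPALIVE has to be enabled as well.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    # Read the options back in (idle, interval, count) order.
    return tuple(
        sock.getsockopt(socket.IPPROTO_TCP, getattr(socket, name))
        for name in needed
    )

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    applied = enable_keepalive(s, idle=5, interval=5, count=24)
    print(applied)
```

If the read-back values match what was requested, the per-socket settings are in place for that socket regardless of the kernel defaults.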

My kernel is set as follows; maybe I need to adjust the kernel as well?

net.ipv4.tcp_keepalive_time = 7200

net.ipv4.tcp_keepalive_probes = 9

net.ipv4.tcp_keepalive_intvl = 75
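
To make the three sets of numbers in this thread easier to compare, here is a small sketch of the usual TCP keepalive arithmetic: the first probe goes out after the idle time, and an unresponsive peer is declared dead after roughly idle + interval * count seconds. The class and variable names are made up for illustration, and the sketch says nothing about which settings were actually in effect on the slon connections.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Keepalive:
    idle: int      # seconds of silence before the first probe is sent
    interval: int  # seconds between probes that get no answer
    count: int     # unanswered probes before the connection is dropped

    def first_probe_after(self) -> int:
        return self.idle

    def dead_peer_after(self) -> int:
        return self.idle + self.interval * self.count

settings = {
    "slon.conf attempt": Keepalive(idle=5, interval=5, count=24),
    "earlier slon command line": Keepalive(idle=300, interval=300, count=5),
    "kernel defaults": Keepalive(idle=7200, interval=75, count=9),
}

for name, ka in settings.items():
    print(f"{name}: first probe after {ka.first_probe_after()} s, "
          f"dead peer detected after {ka.dead_peer_after()} s")
```

With the kernel defaults, the first probe is not sent until the connection has been idle for 7200 seconds (two hours), so a quiet period of about 31 minutes would see no keepalive traffic at all if only those defaults applied.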

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: 1869.486 seconds to
copy table "torque"."impressions_daily"

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: copy table
"torque"."impressions"

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: Begin COPY of table
"torque"."impressions"

NOTICE:  truncate of "torque"."impressions" succeeded

2016-09-12 10:31:09 PDT CONFIG remoteWorkerThread_1: 77048102322 bytes
copied for table "torque"."adimpressions"

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: 5708.515 seconds to
copy table "torque"."impressions"

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: copy table
"torque"."impressions_archive"

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: Begin COPY of table
"torque"."impressions_archive"

2016-09-12 11:02:56 PDT ERROR  remoteWorkerThread_1: "select
"_cls".copyFields(237);"
2016-09-12 11:02:56 PDT WARN   remoteWorkerThread_1: data copy for set 2
failed 1 times - sleep 15 seconds

There are no indexes, so I don't know what Slon is doing for the 31 minutes
between when the data has finished copying and when it attempts to start the
next table.
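
For reference, the quiet period can be measured directly from the log timestamps of both runs. This is a minimal sketch with the timestamps copied from the logs in this thread; the helper names are made up.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def seconds_between(start, end):
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

def phases(begin_copy, bytes_copied, finished):
    """Split one table copy into (transfer seconds, quiet seconds)."""
    return (seconds_between(begin_copy, bytes_copied),
            seconds_between(bytes_copied, finished))

# (Begin COPY, "bytes copied" line, "seconds to copy table" line)
runs = {
    "2016-09-11": ("2016-09-11 21:32:24", "2016-09-11 22:39:39", "2016-09-11 23:14:25"),
    "2016-09-12": ("2016-09-12 09:27:48", "2016-09-12 10:31:09", "2016-09-12 11:02:56"),
}

results = {day: phases(*stamps) for day, stamps in runs.items()}
for day, (transfer, quiet) in results.items():
    print(f"{day}: {transfer} s transferring, {quiet} s quiet, {transfer + quiet} s total")
```

The totals agree with the 6121.393 and 5708.515 seconds that slon reports, so the quiet period (about 35 minutes in the first run and about 32 in the second) falls between the "bytes copied" line and the end of the table copy.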

More suggestions? I know I'm being needy, but I'm spinning my wheels, it seems.

Thanks
Tory