Mon Sep 12 11:33:38 PDT 2016
- Previous message: [Slony1-general] Controlled Switchover
- Next message: [Slony1-general] Need some help again, keep alives, wide area replication , set failure STILL
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <tmblue at gmail.com> wrote:
>
>> Jan has helped me before, giving me ideas to help with wide-area
>> replication, where the connection seems to drop during a large copy set
>> and/or an index creation: when no bits are crossing the wire, the
>> connections are dropped by the firewall or something else. So Slony
>> finishes up a table or index creation and attempts to grab the next
>> table, but the connection is no longer there, so Slony reports a failure
>> and attempts again.
>>
>> I think I'm running into this between my colo and Amazon, using their
>> VPN gateway.
>>
>> Here is the snippet of logs. There is no index here; we dropped it on
>> the new node so that it would not fail. But what's odd here is that it
>> copies all the data, and 35 minutes later it reports the time, which
>> tells me it's doing something, but I'm not sure what, if there is no
>> index on that table. (There is a primary key, which maintains integrity,
>> and we didn't think we should drop that.) But there are no other
>> indexes, so the 35 minutes or whatever is a mystery.
>>
>> 2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."adimpressions"
>> 2016-09-11 22:39:39 PDT CONFIG remoteWorkerThread_1: 76955497834 bytes copied for table "torque"."adimpressions"
>> 916499:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: 6121.393 seconds to copy table "torque"."impressions"
>> 916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions_archive"
>> 916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"
>> 916811:2016-09-11 23:14:25 PDT ERROR remoteWorkerThread_1: "select "_cls".copyFields(237);"
>> 916907:2016-09-11 23:14:25 PDT WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds
>> 917014:2016-09-11 23:14:25 PDT INFO cleanupThread: 7606.655 seconds for cleanupEvent()
>>
>> This run, I added keep-alives by the following method (the timing and
>> results are the same without them: set 2 fails with error 237), adding
>> the following to both slon commands, on the origin and the new node:
>>
>> tcp_keepalive_idle 300
>> tcp_keepalive_count 5
>> tcp_keepalive_interval 300
>>
>> Now, I'm not entirely sure how this is supposed to work, and whether I
>> tuned it right. It obviously fails at the 30-minute mark (this is 25
>> minutes); however, the servers never lose connection. I have a ping
>> running (not quite the same, I know), and it shows zero packet loss over
>> the 2+ hours that these attempts to get things replicated take. So maybe
>> someone smarter than me can advise how I should tune the keep-alives, if
>> that's what is happening.
>>
>> I thought it would only use the keep-alives if it felt the partner was
>> no longer there, but since I know pings show there are no connectivity
>> issues, I'm at a loss
>> AGAIN :)
>>
>> Thanks for the assist
>>
>> Tory
>

Okay, keepalives didn't work, but maybe I configured the slon.conf wrong;
there do not appear to be any real examples. I used:

tcp_keepalive_time = 5
tcp_keepalive_probes = 24
tcp_keepalive_intvl = 5

while my kernel is set at the following (maybe I need to adjust the kernel
as well?):

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: 1869.486 seconds to copy table "torque"."impressions_daily"
2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions"
2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions"
NOTICE: truncate of "torque"."impressions" succeeded
2016-09-12 10:31:09 PDT CONFIG remoteWorkerThread_1: 77048102322 bytes copied for table "torque"."adimpressions"
2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: 5708.515 seconds to copy table "torque"."impressions"
2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions_archive"
2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"
2016-09-12 11:02:56 PDT ERROR remoteWorkerThread_1: "select "_cls".copyFields(237);"
2016-09-12 11:02:56 PDT WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds

There are no indexes, so I don't know what slon is doing for the 31 minutes
between when the data finishes copying and when it attempts to start the
next table.

More suggestions? I know I'm being needy, but it seems I'm spinning my
wheels.

Thanks
Tory
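[Archive editor's note, not part of the original message: the
`net.ipv4.tcp_keepalive_*` sysctls quoted above are kernel-wide defaults,
and they only apply to sockets that have keepalive enabled in the first
place; their names differ from the per-socket option names. A hedged sketch
of lowering them system-wide so probes fire well inside a 30-minute
firewall idle timeout (the file name and values here are illustrative, not
recommendations from the thread):

```
# /etc/sysctl.d/99-keepalive.conf (hypothetical file name)
# Kernel defaults quoted in the message are 7200/9/75, meaning the first
# probe is sent only after two hours of idle time.
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 60

# Apply with: sysctl --system
```
]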
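[Archive editor's note, not part of the original message: the three numbers
being tuned above map onto standard per-socket TCP keepalive options. A
minimal Python sketch (Linux option names; `enable_keepalive` is a
hypothetical helper, not anything from Slony) of what those settings mean
at the socket level:

```python
import socket

def enable_keepalive(sock, idle=300, interval=300, count=5):
    """Turn on TCP keepalive probes for a TCP socket (Linux option names).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between successive unanswered probes
    count:    unanswered probes before the kernel declares the peer dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

# Worst-case dead-peer detection time with idle=300, interval=300, count=5:
# 300 + 5 * 300 = 1800 seconds (30 minutes). A stateful firewall with a
# ~30-minute idle timeout could still drop the connection before the
# probes help, which may be relevant to the failure window described above.
```

With the kernel defaults quoted later in the thread (7200/9/75), the first
probe would not even be sent until two hours of idle time had passed.]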
More information about the Slony1-general mailing list