<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <span dir="ltr">&lt;<a href="mailto:tmblue@gmail.com" target="_blank">tmblue@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Jan has helped me before with ideas for wide area replication, where the connection seems to drop during a large copy set and/or an index creation: while no bits are crossing the wire, the connection is dropped by the FW or some other device. Slony finishes the table or index creation and attempts to grab the next table, but the connection is no longer there, so Slony reports a failure and tries again.<div class="gmail_extra"><br></div><div class="gmail_extra">I think I'm running into this between my Colo and Amazon, using their VPN gateway. </div><div class="gmail_extra"><br></div><div class="gmail_extra">Here is a snippet of the logs. There is no index here; we dropped it on the new node so that it would not fail. What's odd is that it copies all the data and reports the time 35 minutes later, which tells me it's doing something, but I'm not sure what, given that there is no index on that table. (There is a primary key, which maintains integrity, and we didn't think we should drop that.) There are no other indexes, so the 35 minutes or so is a mystery.</div>
<div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra"><div class="gmail_extra"><br></div><div class="gmail_extra">2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."adimpressions"</div><div class="gmail_extra">2016-09-11 <b>22:39:39 </b>PDT CONFIG remoteWorkerThread_1: 76955497834 bytes copied for table "torque"."adimpressions"</div></div><div class="gmail_extra">916499:2016-09-11 <b>23:14:25 </b>PDT CONFIG remoteWorkerThread_1: 6121.393 seconds to copy table "torque"."impressions"</div><div class="gmail_extra">916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table "torque".impressions_archive"</div><div class="gmail_extra">916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"</div><div class="gmail_extra">916811:2016-09-11 23:14:25 PDT ERROR remoteWorkerThread_1: "select "_cls".copyFields(237);" </div><div class="gmail_extra">916907:2016-09-11 23:14:25 PDT WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds</div><div class="gmail_extra">917014:2016-09-11 23:14:25 PDT INFO cleanupThread: 7606.655 seconds for cleanupEvent()</div><div class="gmail_extra"><br></div><div class="gmail_extra">This run, I added keep-alives by the following method. (and the timing and results are the same without them, set 2 fails with error 237).</div><div class="gmail_extra"><br></div><div class="gmail_extra">Adding the following to both slon commands on the origin and the new node</div><div class="gmail_extra">
<p><span>tcp_keepalive_idle 300 tcp_keepalive_count 5 tcp_keepalive_interval 300</span></p><p>Now I'm not entirely sure how this is supposed to work, or whether I tuned it right. It fails at roughly the 30 minute mark, whereas these settings work out to 25 minutes. However, the servers never lose connectivity: I have a ping running (not quite the same thing), and it shows zero packet loss over the 2+ hours that these replication attempts take. So maybe someone smarter than me can advise how I should tune the keepalives, if that's what is happening.</p><p>I thought it would only use the keep-alives if it felt the partner was no longer there, but since I know pings show there are no connectivity issues, I'm at a loss. AGAIN :)</p><p>Thanks for the assist</p><span><font color="#888888"><p>Tory</p></font></span></div></div></div>
</blockquote></div><br></div></blockquote><div>Okay, the keepalives didn't work, but maybe I configured slon.conf wrong; there do not appear to be any real examples.</div><div><br></div><div>I used:</div><div><br></div><div>
<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_time = 5</span></p>
<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_probes = 24</span></p>
<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_intvl = 5</span></p><p class="gmail-p1"><span class="gmail-s1">Meanwhile, my kernel is set to the values below; maybe I need to adjust the kernel as well?</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_time = 7200</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_probes = 9</span></p><p class="gmail-p1"><span class="gmail-s1">
</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_intvl = 75</span></p></div><div><br></div>
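<div>As a rough sanity check on the two sets of values above, the sketch below works out how long a completely idle connection would wait before the first keepalive probe, and how long before an unresponsive peer would be declared dead. It assumes the usual Linux semantics (first probe after <i>time</i> seconds of idle, then <i>probes</i> unanswered probes spaced <i>intvl</i> seconds apart). The function name is hypothetical, and whether slon actually applies the slon.conf values as written is an open question here, not something this confirms.</div>

```python
def keepalive_timeline(time_s, intvl_s, probes):
    """Return (idle seconds before the first probe,
    idle seconds before an unresponsive peer is declared dead)."""
    first_probe = time_s
    declared_dead = time_s + intvl_s * probes
    return first_probe, declared_dead


# Values tried in slon.conf (assumed to be honoured as-is)
slon_values = keepalive_timeline(time_s=5, intvl_s=5, probes=24)

# Kernel sysctl values reported above
kernel_values = keepalive_timeline(time_s=7200, intvl_s=75, probes=9)

print("slon.conf values:", slon_values)
print("kernel values:   ", kernel_values)
```

<div>If only the kernel values were in effect, the first probe would not go out until 7200 seconds (2 hours) of idle, which is far longer than the roughly 30 minute quiet period seen in the logs.</div>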
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: 1869.486 seconds to copy table "torque"."impressions_daily"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions"</span></p>
<p class="gmail-p1"><span class="gmail-s1">NOTICE: truncate of "torque"."impressions" succeeded</span></p><p class="gmail-p1"><span class="gmail-s1">
</span></p><p class="gmail-p1"><span class="gmail-s1">2016-09-12 10:31:09 PDT CONFIG remoteWorkerThread_1: 77048102322 bytes copied for table "torque"."adimpressions"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: 5708.515 seconds to copy table "torque"."impressions"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: copy table "torque"."impressions_archive"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"</span></p>
<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT ERROR remoteWorkerThread_1: "select "_cls".copyFields(237);" </span></p>
<div>2016-09-12 11:02:56 PDT WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds </div><div><br></div><div>There are no indexes, so I don't know what Slon is doing for the 31 minutes between when the data finishes copying and when it attempts to start the next table.</div><div><br></div><div>More suggestions? I know I'm being needy, but it seems I'm spinning my wheels.</div><div><br></div><div>Thanks</div><div>Tory</div></div><br></div></div>