<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <span dir="ltr">&lt;<a href="mailto:tmblue@gmail.com" target="_blank">tmblue@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Jan has helped me before with ideas for wide-area replication, where the connection seems to drop during a large copy set and/or an index creation: while no bits are crossing the wire, the connection is dropped by the FW or some other device. Slony finishes a table or index creation and attempts to grab the next table, but the connection is no longer there, so Slony reports a failure and tries again.<div class="gmail_extra"><br></div><div class="gmail_extra">I think I&#39;m running into this between my Colo and Amazon, using their VPN gateway.  </div><div class="gmail_extra"><br></div><div class="gmail_extra">Here is the snippet of logs. There is no index here; we dropped it on the new node so that it would not fail. What&#39;s odd is that it copies all the data and only 35 minutes later reports the time, which tells me it&#39;s doing something, but I&#39;m not sure what, if there is no index on that table. (There is a primary key, which maintains integrity, and we didn&#39;t think we should drop that.) There are no other indexes, so the 35 minutes or so is a mystery.</div>


<div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra"><div class="gmail_extra"><br></div><div class="gmail_extra">2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table &quot;torque&quot;.&quot;adimpressions&quot;</div><div class="gmail_extra">2016-09-11 <b>22:39:39 </b>PDT CONFIG remoteWorkerThread_1: 76955497834 bytes copied for table &quot;torque&quot;.&quot;adimpressions&quot;</div></div><div class="gmail_extra">916499:2016-09-11 <b>23:14:25 </b>PDT CONFIG remoteWorkerThread_1: 6121.393 seconds to copy table &quot;torque&quot;.&quot;impressions&quot;</div><div class="gmail_extra">916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table &quot;torque&quot;.impressions_archive&quot;</div><div class="gmail_extra">916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of table &quot;torque&quot;.&quot;impressions_archive&quot;</div><div class="gmail_extra">916811:2016-09-11 23:14:25 PDT ERROR  remoteWorkerThread_1: &quot;select &quot;_cls&quot;.copyFields(237);&quot; </div><div class="gmail_extra">916907:2016-09-11 23:14:25 PDT WARN   remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds</div><div class="gmail_extra">917014:2016-09-11 23:14:25 PDT INFO   cleanupThread: 7606.655 seconds for cleanupEvent()</div><div class="gmail_extra"><br></div><div class="gmail_extra">This run,  I added keep-alives by the following method. (and the timing and results are the same without them, set 2 fails with error 237).</div><div class="gmail_extra"><br></div><div class="gmail_extra">Adding the following to both slon commands on the origin and the new node</div><div class="gmail_extra">


<p><span>tcp_keepalive_idle 300 tcp_keepalive_count 5 tcp_keepalive_interval 300</span></p><p>Now, I&#39;m not entirely sure how this is supposed to work, or whether I tuned it right. It obviously fails at the 30-minute mark, and this is 25 minutes; however, the servers never lose connection (I have a ping running (not quite the same thing), and it shows zero packet loss over the 2+ hours that these replication attempts take). So maybe someone smarter than me can advise how I should tune the keepalives, if that&#39;s what is happening.</p><p>I thought it would only use the keepalives if it felt the partner was no longer there, but since I know pings show there are no connectivity issues, I&#39;m at a loss. AGAIN :)</p><p>Thanks for the assist</p><span><font color="#888888"><p>Tory</p></font></span></div></div></div>

</blockquote></div><br></div></blockquote><div>Okay, keepalives didn&#39;t work, but maybe I configured slon.conf wrong; there do not appear to be any real examples.</div><div><br></div><div>I used:</div><div><br></div><div>


<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_time = 5</span></p>

<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_probes = 24</span></p>

<p class="gmail-p1"><span class="gmail-s1">tcp_keepalive_intvl = 5</span></p><p class="gmail-p1"><span class="gmail-s1">My kernel, meanwhile, is set as follows; maybe I need to adjust the kernel as well?</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_time = 7200</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_probes = 9</span></p><p class="gmail-p1"><span class="gmail-s1">net.ipv4.tcp_keepalive_intvl = 75</span></p></div><div><br></div>
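<div>For reference, here is a rough sketch of the arithmetic behind these numbers. It assumes the usual TCP keepalive semantics: the first probe goes out after the idle time, further probes follow once per interval, and the connection is dropped after the configured number of unanswered probes. The function name <code>keepalive_timeline</code> and the labels are illustrative only, and the sketch says nothing about whether slon actually applied any of these settings.</div>

```python
def keepalive_timeline(idle_s, probes, interval_s):
    """Return (idle seconds before the first probe is sent,
    idle seconds before an unresponsive peer is declared dead)."""
    first_probe = idle_s
    declared_dead = idle_s + probes * interval_s
    return first_probe, declared_dead


# Values quoted in this thread; the labels are only for display.
settings = {
    "slon, first attempt (idle/count/interval)": (300, 5, 300),
    "slon.conf, second attempt": (5, 24, 5),
    "kernel sysctl values": (7200, 9, 75),
}

for label, (idle_s, probes, interval_s) in settings.items():
    first, dead = keepalive_timeline(idle_s, probes, interval_s)
    print(f"{label}: first probe after {first} s idle, "
          f"dead peer detected after {dead} s")
```

<div>If those semantics hold, the kernel values on their own would not send a first probe until 7200 seconds (two hours) of idle time, well past the roughly 30-minute gap in the logs, while either slon setting would send a probe within 5 minutes.</div>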


<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: 1869.486 seconds to copy table &quot;torque&quot;.&quot;impressions_daily&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: copy table &quot;torque&quot;.&quot;impressions&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: Begin COPY of table &quot;torque&quot;.&quot;impressions&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">NOTICE:  truncate of &quot;torque&quot;.&quot;impressions&quot; succeeded</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 10:31:09 PDT CONFIG remoteWorkerThread_1: 77048102322 bytes copied for table &quot;torque&quot;.&quot;adimpressions&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: 5708.515 seconds to copy table &quot;torque&quot;.&quot;impressions&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: copy table &quot;torque&quot;.&quot;impressions_archive&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: Begin COPY of table &quot;torque&quot;.&quot;impressions_archive&quot;</span></p>

<p class="gmail-p1"><span class="gmail-s1">2016-09-12 11:02:56 PDT ERROR  remoteWorkerThread_1: &quot;select &quot;_cls&quot;.copyFields(237);&quot; </span></p>

<div>2016-09-12 11:02:56 PDT WARN   remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds </div><div><br></div><div>There are no indexes, so I don&#39;t know what Slon is doing for the 31 minutes between when the data finishes copying and when it attempts to start the next table.</div><div><br></div><div>More suggestions? I know I&#39;m being needy, but it seems I&#39;m spinning my wheels.</div><div><br></div><div>Thanks</div><div>Tory</div></div><br></div></div>