Steve Singer ssinger_pg at sympatico.ca
Tue Dec 27 17:25:58 PST 2011
On Tue, 27 Dec 2011, Mike Wilson wrote:

> Thanks for the reply.  Per your questions here is additional information.  For the purposes of this discussion db1 (node: 101) is our Slave and db2 (node:102) is our Slony Master.
>
>> On Tue, 27 Dec 2011, Mike Wilson wrote:

>>
>> 2. Now that you have killed the old vacuums you need to determine what if anything slony is doing.
> During the period where replication stopped I found several vacuums that had been running for upwards of 11 days.  I killed those processes (using pg_cancel_backend) and shortly after that I started to get pings from Nagios (which monitors the sl_status table) that replication was falling behind.
>>
>> a) Are new SYNC events being generated on the master?
> I'm not sure how I determine this.

select max(ev_seqno) FROM sl_event where ev_origin=102;
Every few seconds that number should be increasing.
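You can also watch what the slave has confirmed back to the master.  Assuming the cluster schema is on your search_path (as it appears to be for the query above) and that node 101 is the slave, something along these lines should do it:

select max(con_seqno)
  from sl_confirm
 where con_origin = 102 and con_received = 101;

Compare that against max(ev_seqno) above; the gap between the two is roughly how far the slave is behind.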

>
> Here is the sl_status table from the master over the same few minutes:
> =# select * from sl_status;
> st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
>       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:40:41.911154
>
> =# select * from sl_status;
> st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
>       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:42:49.222796
>
> As you can see the slave's st_last_event value appears to be increasing, whereas the master's appears to be static.

This tells me that (according to the master) the slave has so far processed 
event (102,5003985169) and the last event generated on the master is 
(102,5004119000).  This means that your slave is 133,831 events behind.  You 
didn't tell me what time you ran the above query at.

If the above query was run at about 2011-12-27 10:06:xx then it probably 
means that replication is now working normally but your slave is about 
133,000 events behind.
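sl_status already exposes that gap directly, so a quick way to watch the trend 
(same session as your queries above) is something like:

select st_origin, st_received,
       st_last_event - st_last_received as events_behind,
       st_lag_time
  from sl_status;

If events_behind shrinks between runs the slave is catching up; if it grows, 
the slave is falling further behind.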

The line from your slave.slon.log below
2011-12-27 14:06:32 PST INFO   remoteWorkerThread_102: SYNC 5003985339 done in 80.973 seconds

tells me that at 14:06:32 the slave processed a SYNC from the master that 
took 80 seconds to process.  Every 80 seconds or so you should be seeing 
another similar line.  The difference between the SYNC # in the previous 
log line and this one tells you how many SYNCs were processed in that 
80.973 seconds.

Now that it is 6-8 or so hours later, is your cluster closer to being 
caught up, or has it fallen further behind?

If you're getting caught up then you should be able to estimate when you will 
be caught up.
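(A rough back-of-the-envelope with made-up numbers, just to show the 
arithmetic: if the SYNC # in consecutive "SYNC ... done" lines advances by 
about 200 every ~81 seconds, the slave is working off roughly 2.5 SYNCs per 
second, while a master generating one SYNC every 2 seconds adds 0.5 per 
second, for a net catch-up rate of about 2 SYNCs per second.  A backlog of 
133,831 events would then clear in roughly 133831 / 2 = ~67,000 seconds, or 
about 18-19 hours.  Substitute the numbers from your own logs.)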

If you're falling even farther behind, then your options are:

1) Unsubscribe + resubscribe the slave.  As I mentioned this morning, if 
your database is small enough and your network is fast enough this 
might be faster than letting it get caught up.
2) If you can't resubscribe, then tune sync_interval (make it larger) 
on the master and tune sync_group_maxsize (make it larger) on the slave, 
as described at 
http://www.slony.info/documentation/2.0/slon-config-interval.html, to make 
the master generate fewer SYNCs and the slave process more of those SYNCs 
together (see the sketch after this list).
3) Consider upgrading your cluster to 2.1.0; you can upgrade from 2.0.7 to 
2.1.0 on a cluster that is not caught up.
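As a rough sketch of option 2 (the parameter names are the ones documented at 
the URL above; the values are only illustrative, so check the documented 
defaults and ranges before using them), the settings go in each slon's config 
file:

    # config for the master's slon (node 102):
    # check for / generate SYNC events less often
    sync_interval = 10000          # milliseconds between SYNC generation checks

    # config for the slave's slon (node 101):
    # apply more SYNC events together in one group
    sync_group_maxsize = 100       # maximum number of SYNCs grouped per apply

Restart the slons after changing the config so the new values take effect.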

If my assumptions are correct then this is an instance of Slony bug #167 
(fixed in 2.1.0) where SYNC processing in a cluster that is behind takes a 
lot longer than it should.  Your long-running transactions made sl_log get 
very big, and now Slony takes longer to process a SYNC than it takes to 
generate a new one.  2.1.0 should process SYNC events in this case much 
faster.
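If you want to see just how big the log tables got, a generic size check on 
the master is enough (replace _yourcluster with your actual cluster schema 
name):

select pg_size_pretty(pg_total_relation_size('_yourcluster.sl_log_1')) as sl_log_1_size,
       pg_size_pretty(pg_total_relation_size('_yourcluster.sl_log_2')) as sl_log_2_size;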




> Thanks so much for your help thus far.  Please advise with any attempted 
> fixes you can think of or if you can possibly describe what is currently 
> going on.  Cheers.
>
>>
>> Steve
>>
>>> Under incredible load last week during the Christmas season our primary PG (8.4.7; Slony 2.0.7) stopped replicating.  Now that we are past the Christmas season I need to figure out how to clear the backlog of replication rows in sl_log_1/2.  This is all running on our commercial website and if possible I would prefer not to have to restart the PG instance, as it would require a scheduled maintenance window during a week where everyone is out of the office.  In an attempt to fix the issue without restarting the PG instance I've already restarted the Slony services on the primary PG node as a first attempt at a fix.  This did not get replication working again and I'm still getting the same error from Slony in the logs: log switch to sl_log_1 still in progress - sl_log_2 not truncated
>>>
>>> From my research I can see that this error message is emitted when the function logswitch_finish() is called.  I did have some hung vacuums during this period of high load on the server, but I have killed them with pg_cancel_backend.  From other lock analysis I can see that nothing is currently running or locked in the db (nothing more than a few milliseconds old, at least).  I'm certain whatever transaction was in progress that prevented the switch from occurring is long since past.
>>>
>>> Any ideas on the best way to get replication working again?  I'm averse to rebooting the PG instance, but I am willing to do it if someone more knowledgeable out there thinks it would fix this issue.  We currently are operating without a backup of all of our Xmas sales data and I *really* want to get this data replicated.  Any help would be appreciated.
>>>
>>> Mike Wilson
>>> Predicate Logic
>>> Cell: (310) 600-8777
>>> SkypeID: lycovian
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Slony1-general mailing list
>>> Slony1-general at lists.slony.info
>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>>
>>
>
>


