Jan Wieck JanWieck at Yahoo.com
Sun Mar 27 06:03:41 PDT 2011
On 3/26/2011 8:02 PM, Tim Lloyd wrote:
> It wasn't mission-critical changes that were lost. The Postgres log was full of messages saying it couldn't switch the log because a switch was already in progress. Uptime was showing load averages of 60. Checking sl_log_1, it only had 4 entries. Nuking it and re-initing the log switch reduced the load average to between 4 and 12.
>

The Slony cleanup thread, up through 1.2, does delete sl_log_X entries 
that are no longer needed. So the fact that you found anything in there 
means you deleted rows that probably had not yet replicated to all 
nodes. You personally may not care about a few lost updates, but for 
most of us, what you suggested doing is by itself a good reason to 
start rebuilding all replicas.

The actual reason a large backlog in sl_log_X causes problems is that 
the query plan for selecting from that log scans it from the beginning, 
no matter how far catch-up has already progressed. The startup cost of 
the log selection therefore keeps growing until the entire sl_log_X has 
been processed, and all that time the log switch cannot, and should 
not, finish. We have a fix for this in the current 2.1 development tree 
and are considering backpatching that logic into 2.0.
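To illustrate the effect (this is a hypothetical sketch, not Slony's actual code or the 2.1 fix), here is a toy model of the two plan shapes: a naive plan that re-walks the log from row zero and skips already-confirmed rows, versus a seek-style plan that jumps straight to the first unconfirmed row. The batch produced is identical; only the work done before the first useful row differs, and for the naive plan that work grows with the backlog already replicated.

```python
import bisect

def fetch_batch_scan_from_start(log, last_confirmed, batch_size):
    """Naive plan: walk the whole log from the start, skipping rows
    already replicated. Work done before the first useful row is
    proportional to last_confirmed -- the 'startup cost' that grows."""
    rows_touched = 0
    batch = []
    for seq, payload in log:
        rows_touched += 1
        if seq > last_confirmed:
            batch.append((seq, payload))
            if len(batch) == batch_size:
                break
    return batch, rows_touched

def fetch_batch_seek(log, last_confirmed, batch_size):
    """Seek-style plan: jump directly to the first unconfirmed row.
    (Building the key list here is itself O(n); a real database uses
    an index for the seek -- the point is the scan-vs-seek shape.)"""
    keys = [seq for seq, _ in log]
    start = bisect.bisect_right(keys, last_confirmed)
    batch = log[start:start + batch_size]
    return batch, len(batch)

# A log of 100,000 ordered changes, 90,000 of them already confirmed.
log = [(i, f"change-{i}") for i in range(100_000)]

naive_batch, naive_cost = fetch_batch_scan_from_start(log, 90_000, 10)
seek_batch, seek_cost = fetch_batch_seek(log, 90_000, 10)

# Same 10-row batch, but the naive plan touched 90,011 rows to get it.
print(naive_cost, seek_cost)  # -> 90011 10
```

The names `fetch_batch_scan_from_start` and `fetch_batch_seek` are invented for this sketch; the takeaway is that per-batch startup cost under the naive shape scales with everything replicated so far, which is why a big backlog keeps the switch from finishing.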


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
