Christopher Browne cbbrowne at ca.afilias.info
Tue Nov 23 08:53:33 PST 2010
Steve Singer <ssinger at ca.afilias.info> writes:
> On 10-11-23 09:48 AM, Vick Khera wrote:
>> On Tue, Nov 23, 2010 at 9:31 AM, Steve Singer <ssinger at ca.afilias.info> wrote:
>>> Slony can get into a state where it can't keep up/catch up with
>>> replication because the sl_log table is so large.
>>>
>>>
>>> Does this problem bite people often enough in the real world for us to
>>> devote effort to fixing?
>>>
>>
>> It used to happen to me a lot when I had my origin running on spinning
>> media.  Ever since I moved to an SSD, it doesn't really happen.  At
>> worst when I do a large delete I fall behind by a few minutes but it
>> catches up quickly.  For me, it didn't even require taking the DB down
>> for any extended period... just running a large update or delete that
>> touched many, many rows (i.e., generated a lot of events in sl_log) could
>> send the system into a tailspin that would take hours or possibly days
>> (until we hit a weekend) to recover.
>>
>> I am not sure it was caused by the log being too big... because
>> sometimes reindexing the tables on the replica would clear up the
>> backlog quickly too.  But I may be sniffing down the wrong trail.
>>
>
> The other place this will hit busy systems is during the initial sync.
> If your database is very large (or very busy) a lot of log rows can 
> accumulate while that initial sync is going on.   OMIT_COPY doesn't help 
> you because it requires an outage to get the master and slave in sync 
> (just the loading time on a 1TB database is a while).
>
> CLONE PREPARE/FINISH also aren't of help because a) these only work if 
> you already have at least one subscriber set up, and b) after you do the 
> clone prepare any later transactions still need to be kept in sl_log 
> until the new slave is up and running.

I'm not sure that we gain much by splitting the logs into a bunch of
pieces for that case.

It's still the same huge backlog, and until it gets worked down, it's
bloated, period.
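
For what it's worth, if you want a rough feel for how big that backlog
has gotten, a quick look at the sl_log tables on the origin tells you.
This is just a sketch, and it assumes a cluster named "mycluster", so
the Slony schema is _mycluster:

   -- Rough size of the replication backlog on the origin.
   -- Assumes a cluster named "mycluster", hence the _mycluster schema.
   select (select count(*) from _mycluster.sl_log_1)
        + (select count(*) from _mycluster.sl_log_2)        as pending_log_rows,
          pg_size_pretty(pg_total_relation_size('_mycluster.sl_log_1')
                       + pg_total_relation_size('_mycluster.sl_log_2')) as log_size;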

A different suggestion, one that doesn't involve any changes to Slony...

Initially, it might be a good idea to set up the new subscriber with
FORWARD=no...

   subscribe set (id=1, provider=1, receiver=2, forward=no);

That means that log data won't get captured in sl_log_(1|2) on the
subscriber while the subscription is catching up.  Once it's reasonably
caught up, you submit:

   subscribe set (id=1, provider=1, receiver=2, forward=yes);

which turns that logging on, so that the new node becomes a failover
target and a legitimate target to feed other subscriptions.
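
As for judging when it's "reasonably caught up": a quick query against
sl_status on the origin shows the lag.  Again just a sketch, assuming
the same _mycluster schema and node #2 as the new subscriber:

   -- How far behind is node 2?  Run this on the origin.
   -- sl_status lives in the cluster schema, assumed here to be _mycluster.
   select st_received, st_lag_num_events, st_lag_time
     from _mycluster.sl_status
    where st_received = 2;

Once st_lag_time is down to something you'd be comfortable failing over
across, flipping the subscription to forward=yes is the cheap part.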

While node #2 is catching up, it's a crummy candidate for a failover
target anyway.  So while this strategy completely gives up the ability
to use it as a failover target while it's catching up, remember that it
was moving from 18-ish hours behind towards caught up, which already
made it a crummy failover target.  I don't think anything hugely useful
is being lost here.
-- 
select 'cbbrowne' || '@' || 'ca.afilias.info';
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"

