[Slony1-general] Initial subscribe results in slave never catching up

Fri Apr 21 06:16:48 PDT 2006

Hello.  

I've held off several days before posting this to be sure I didn't
miss anything in the docs.

Slony version is 1.1.5, Postgres 8.0.7 on Solaris 2.9.  

I had no trouble configuring and bringing up a simple master/slave
cluster between two lightly loaded machines (same large DB as
production), but I can't get this working on our production system
after 3 tries now.

Our initial goal is to use Slony to make a Pg upgrade to 8.1.x happen
quickly.  Soon after we'll need it for replication to hot-standby and
reports app servers.

The DB is quite large, with 270 tables and about as many sequences.
The system is quite busy during the 5-6 hours that the subscription
takes.  Total DB size is approx 25GB of raw disk space.  The system is
well maintained and handles a decent traffic load easily in routine
use.

I have changed the sync parameters in order to prevent the master from
querying the sequences too often.  Below are (presently) the only
options that I'm overriding for the two slon daemons:

master -s 30000  (resulted in much smaller growth of sl_seqlog table)
slave  -g 60

On the master side I commonly find 2 queries with runtimes of
several seconds:

fetch 100 from log;
commit transaction;

The end result is that by the time the subscription is finished, we're
5+ hours behind, and the sl_status view shows the lag time
increasing, not decreasing.
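
One way to put a number on "increasing" is to take two readings of
st_lag_time from sl_status some minutes apart and divide the change in
lag by the elapsed wall-clock time.  A minimal Python sketch follows;
the function name and the sample readings are made up for illustration
and are not measurements from this system.

```python
from datetime import timedelta


def lag_slope(first, last):
    """Change in replication lag per second of wall-clock time.

    Each argument is a (seconds_since_start, lag) pair, where lag is a
    timedelta such as an st_lag_time value read from sl_status.
    """
    (t0, lag0), (t1, lag1) = first, last
    if t1 <= t0:
        raise ValueError("readings must be in increasing time order")
    return (lag1 - lag0).total_seconds() / (t1 - t0)


# Hypothetical readings taken 10 minutes apart (not real data).
slope = lag_slope((0, timedelta(hours=5)),
                  (600, timedelta(hours=5, minutes=2)))
print(slope)
```

A result above zero means the slave is falling further behind, zero
means it is holding steady, and below zero means it is catching up.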

Very strange is that on one of the 3 separate tries, the system did
reach up-to-date status.  Unfortunately, a nightly batch job that did
a TRUNCATE on one of the master tables caused a duplicate PK violation
on reinsert, and I had to start again.

I do not understand why catch-up succeeded on that one day.  Perhaps,
because I had started the subscribe a bit earlier, the backlog volume
was smaller and I stayed on the good side of a "tipping point" in
terms of performance?
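
The tipping-point guess can be stated as simple arithmetic.  While the
slave replays the backlog, the master keeps producing one hour of new
activity per hour, so the backlog only shrinks if the slave replays
more than one hour of master activity per wall-clock hour.  A rough
Python sketch, where the function name, the "apply ratio" and the
numbers are all assumptions for illustration rather than anything
measured:

```python
def hours_to_catch_up(backlog_hours, apply_ratio):
    """Estimate wall-clock hours until the slave's lag reaches zero.

    backlog_hours: hours of master activity queued when the initial
                   copy finishes (about 5 in the case described here).
    apply_ratio:   hours of master activity the slave can replay per
                   wall-clock hour (an assumed figure).
    Returns None when the slave can never catch up.
    """
    if apply_ratio <= 1.0:
        # The master produces work at least as fast as the slave applies it.
        return None
    # Net progress per hour is what is replayed minus what is newly produced.
    return backlog_hours / (apply_ratio - 1.0)


print(hours_to_catch_up(5, 1.5))
print(hours_to_catch_up(5, 0.9))
```

If the apply ratio itself drops as the backlog grows, a slightly
larger backlog could push it below 1, which would look exactly like a
tipping point.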

When it fails to catch up, the slave system shows very little activity
as reported by pg_stat_activity.  It does process sync events, and I
can see the queries being run, but most of the time the slave is idle.
I believe it could catch up a lot faster; apparently the master cannot
send it work quickly enough for this to happen.

My questions include:

1. Is it doomed to fail with so many tables and sequences in a single
   replication set?

2. What other slon options might I try changing?  (I have hacked on
   them but found little improvement over what is shown above.)

3. Why does the commit on the master side take so long?

4. How can I tell, from what is observed at runtime, which of the slon
   parameters should be tuned and in which direction?  (Advice beyond
   what is given in the docs is needed.)

Thank you.

-- 
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile)   http://www.JerrySievers.com/