Brian Fehrle brianf at consistentstate.com
Thu Sep 6 14:09:05 PDT 2012
Hi all,

postgresql 8.4
slony 1.2
centos system

I have a single-master to single-slave Slony cluster where the slave is
very far behind. There are no errors in the logs or anything, but what
appears to be happening is that the queries Slony runs against its log
table on the master are taking a very long time to complete.

The query is "fetch 100 from LOG;", which normally completes in
milliseconds but is now taking over 15 minutes at times. Each time it
finally completes, we process one event.
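
For context on what that fetch is pulling, my understanding is that the
slon remote worker declares a cursor named LOG over the sl_log tables
on the master and then fetches from it in chunks of 100. Very roughly
(this is a simplified sketch from memory; the real query slon builds
also filters by SYNC event boundaries, and '_mycluster' / origin 1 are
just placeholders for our cluster):

    begin;
    declare LOG cursor for
        select log_origin, log_xid, log_tableid,
               log_actionseq, log_cmdtype, log_cmddata
          from _mycluster.sl_log_2
         where log_origin = 1
         order by log_actionseq;
    fetch 100 from LOG;
    close LOG;
    commit;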

At this point, 'fetch 100 from LOG' takes about 20 minutes to complete
on the master; once it finishes, the slave processes one more event,
and then 'fetch 100 from LOG' kicks off again and takes yet another 20
or so minutes. So the slave is processing roughly one event every 20
minutes.

As for a cause, I believe it's the follow-up work after adding a column
to a table in replication. After the column was added, the table was
updated to set values in the newly added column. That could have
generated millions of rows for Slony to process, which may have made
the log tables so large that the queries are falling back to sequential
scans or something (a sketch of what I mean is below).
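
In other words, something like this (hypothetical table and column
names), where every row touched by the backfill becomes a row in
sl_log_2 that the slave then has to replay:

    -- DDL pushed out through slonik EXECUTE SCRIPT
    alter table public.orders add column status_code integer;

    -- backfill run on the origin afterwards; each updated row is
    -- logged individually in sl_log_2 for replication
    update public.orders set status_code = 0;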

I'm trying to dig in and see what exactly 'fetch 100 from LOG' is doing
on the master, and whether I can speed it up. Is this querying the
sl_log_<1/2> tables?
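
One thing I plan to try is an explain of a simplified version of what I
think the cursor is selecting, to see whether the planner picks an
index scan or a seq scan on sl_log_2 (again, '_mycluster' and origin 1
are placeholders):

    explain
    select log_actionseq, log_cmdtype, log_cmddata
      from _mycluster.sl_log_2
     where log_origin = 1
     order by log_actionseq
     limit 100;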

The pg_notify table has no outstanding dead rows; it's at 0. Also, out
of all of the tables in the Slony schema, the one with the most dead
rows is at about 2K.

sl_log_1 has 0 rows, sl_log_2 has about 9,326,260 rows (and zero dead rows).
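
For reference, I'm watching the live/dead tuple counts with something
like this ('_mycluster' again standing in for our cluster schema):

    select relname, n_live_tup, n_dead_tup
      from pg_stat_user_tables
     where schemaname = '_mycluster'
     order by n_dead_tup desc;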

I'm going to see if we can reduce the sync group size, to find out
whether that shrinks the query's result set enough to get an index scan
instead of a sequential scan (if that's even the issue).
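
If it matters, the knob I'm planning to turn is the sync group size on
the subscriber's slon, either in slon.conf or on the command line (the
value 10 here is just an example, and the conninfo is made up):

    # slon.conf on the subscriber
    sync_group_maxsize = 10

    # or equivalently on the command line
    slon -g 10 mycluster 'dbname=replica host=slavehost user=slony'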

Any help is greatly appreciated.

- Brian F

