[Slony1-general] Replication node suddenly lagging, CPU bound postmaster

Wed Aug 20 08:24:09 PDT 2008

On Wednesday 20 August 2008, Benjamin Pineau <bpineau at elma.fr> wrote:
> Hi everyone.
>
> I have a replicating node that suddenly started to lag, on a 4-node
> Slony cluster that had worked well for months. This node is powerful enough
> (i.e. older, slower machines in the cluster manage to keep up).
> Network and block devices are mostly idle (with regard to
> interrupts/second and throughput). Strangely, the replication on this
> node seems CPU bound by the postmaster process doing the actual
> inserts/updates for slon (this postmaster process has been stuck at 99% CPU
> usage since the beginning of the problem).
> Neither Slony (at "slon -d2" level) nor PostgreSQL logged any warning or
> error message, and replication has not stopped on this node (it makes
> progress, but too slowly to keep up, so it's now 3 days behind the master).
>
> Any clue?

Look at pg_stat_activity for the slon connection on the slave - you'll
probably see a bunch of updates or deletes that look like they should finish
fast, but don't. When this happens here it's usually because the table
causing the problem needs an ANALYZE (for us it usually happens on the first
of the month, when the new month's data starts showing up and the planner,
working from stale statistics, loses its mind and stops using the primary
key to find the rows to update/delete).
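
If you want to script that check, here is a rough sketch, not anything that ships with Slony. It assumes you have already fetched rows from pg_stat_activity into dicts with the keys "query" and "query_start". Those key names are my own choice. On 8.x the underlying columns are current_query and query_start, and newer releases call the first one query. The helper names, the 30-second threshold, and the statement shapes the regex expects (an UPDATE or DELETE FROM with an optional ONLY) are assumptions too, so adjust them to what you actually see.

```python
import re
from datetime import datetime, timedelta

# Rows are assumed to come from something like (8.x column names):
#   SELECT current_query AS query, query_start FROM pg_stat_activity;
# restricted to the slon connection however suits your setup.

# Captures the target table of an UPDATE or DELETE FROM, with optional ONLY.
# The table may be schema-qualified and may be double-quoted.
_TARGET = re.compile(
    r'^\s*(?:update|delete\s+from)\s+(?:only\s+)?'
    r'((?:"[^"]+"|\w+)(?:\.(?:"[^"]+"|\w+))?)',
    re.IGNORECASE,
)


def slow_targets(rows, now, threshold=timedelta(seconds=30)):
    """Return the sorted tables hit by UPDATE/DELETE statements that have
    been running longer than `threshold` (hypothetical helper)."""
    tables = set()
    for row in rows:
        match = _TARGET.match(row["query"])
        if match and now - row["query_start"] > threshold:
            tables.add(match.group(1))
    return sorted(tables)


def analyze_statements(tables):
    """Build one ANALYZE statement per suspect table."""
    return ["ANALYZE %s;" % table for table in tables]
```

Feed the result of slow_targets() to analyze_statements() and run the output on the slave by hand. I would not execute it automatically until you have confirmed with EXPLAIN that the slow statement really is skipping the primary key.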

-- 
Alan