Marc Munro
Wed Aug 17 20:43:32 PDT 2005
Brad, thank you so much for your response.  I have more information but
still no solution.

In response to Brad's suggestions:

1) pg_stat_activity
Interesting.  There is an "idle in transaction" session from the provider.
Shutting down and restarting the database resulted in a new session with
the same status.  This appears to be related to Slony (see the query
sketch after this list).

2) vacuum subscriber's pg_listener
No bloat to report.  Only a handful of rows existed.

3) vacuum subscriber's sl_log_1
Nothing - there are no rows in the subscriber.

4) Vacuuming subscriber
Already done.
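
For reference, the check behind point 1 can be done with something along
these lines (a sketch only; column names assume a 7.4/8.0-era
pg_stat_activity, where an idle-in-transaction backend reports
'<IDLE> in transaction' as its current query, and stats_command_string
must be enabled for current_query to be populated):

-- sketch: list idle-in-transaction backends on the subscriber
select procpid, usename, current_query
  from pg_stat_activity
 where current_query like '<IDLE> in transaction%';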

Results of further investigation:
I think the problem may simply be that the load on the provider during
my initial sync was very high.  This has resulted in sl_log_1 becoming
huge.  In my slony logs on the subscriber I often see lines like:

2005-08-17 09:08:47 PDT DEBUG2 remoteHelperThread_1_1: 83.952 seconds
delay for first row

Vacuuming the sl_log_1 table on the provider shows it to be pretty big:

INFO:  --Relation _cage2.sl_log_1--
INFO:  Pages 714558: Changed 1084, Empty 0; Tup 35986240: Vac 0, Keep 0,
UnUsed 7172.
	Total CPU 52.50s/9.31u sec elapsed 177.23 sec.
INFO:  --Relation pg_toast.pg_toast_2177383887--
INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
	Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing _cage2.sl_log_1
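
For a quicker check of sl_log_1 growth between vacuums, the planner
statistics in pg_class can be queried directly (a sketch; reltuples and
relpages are only estimates refreshed by VACUUM/ANALYZE, and the size
calculation assumes the default 8 KB block size):

-- sketch: estimated row count and on-disk size of sl_log_1
select c.relname,
       c.reltuples           as estimated_rows,
       c.relpages            as pages,
       c.relpages * 8 / 1024 as approx_mb
  from pg_class c, pg_namespace n
 where c.relnamespace = n.oid
   and n.nspname = '_cage2'
   and c.relname = 'sl_log_1';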

I have tracked down a specific slony query that takes a long time to run
on the provider (100+ secs):

select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
log_cmddata from "_cage2".sl_log_1 where log_origin = 1 and (  (
 log_tableid in
(100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,99999)
    and (log_xid < '2106689363' and "_cage2".xxid_lt_snapshot(log_xid,
'2106433975:2106689363:''2106433975'',''2106688480'''))
    and (log_xid >= '2106433975' and "_cage2".xxid_ge_snapshot(log_xid,
'2106433975:2106689266:''2106433975'',''2106688480'''))
) ) order by log_actionseq;

This returns 662 rows.  If I remove the xxid_ge_snapshot and
xxid_lt_snapshot function calls it returns 2135674 rows.  This would seem
to be the root of my performance problem.
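
The comparison above was done roughly as follows (a sketch; the table-id
list is abbreviated here, the real statements use the full list from the
query above):

-- all predicates, as in the replication query (662 rows)
select count(*)
  from "_cage2".sl_log_1
 where log_origin = 1
   and log_tableid in (100, 200, 300 /* ... full list as above ... */)
   and (log_xid < '2106689363' and "_cage2".xxid_lt_snapshot(log_xid,
        '2106433975:2106689363:''2106433975'',''2106688480'''))
   and (log_xid >= '2106433975' and "_cage2".xxid_ge_snapshot(log_xid,
        '2106433975:2106689266:''2106433975'',''2106688480'''));

-- same statement with only the two function calls removed (2135674 rows)
select count(*)
  from "_cage2".sl_log_1
 where log_origin = 1
   and log_tableid in (100, 200, 300 /* ... full list as above ... */)
   and log_xid < '2106689363'
   and log_xid >= '2106433975';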

I don't understand what these functions do, but the fact that they prune
the result set so drastically means the query does a lot of work for
relatively few rows.
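
If it helps anyone reproduce this, the plan and per-node timings can be
captured by prefixing the same statement with EXPLAIN ANALYZE (again just
a sketch, with the table-id list abbreviated):

explain analyze
select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
       log_cmddata
  from "_cage2".sl_log_1
 where log_origin = 1
   and log_tableid in (100, 200, 300 /* ... full list as above ... */)
   and (log_xid < '2106689363' and "_cage2".xxid_lt_snapshot(log_xid,
        '2106433975:2106689363:''2106433975'',''2106688480'''))
   and (log_xid >= '2106433975' and "_cage2".xxid_ge_snapshot(log_xid,
        '2106433975:2106689266:''2106433975'',''2106688480'''))
 order by log_actionseq;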

Can anyone explain what is going on here?

Thanks
__
Marc Munro

> From: Brad Nicholson <bnichols at ca.afilias.info>
> Subject: Re: [Slony1-general] Slow catchup
> To: slony1-general at gborg.postgresql.org
> [. . . ]
> >What can I look for in order to track down this problem?  
> >  
> >
> 
> A few things:
> 
> 1) Have a look at pg_stat_activity on the subscriber and see what's 
> going on with the database.  Slony doesn't like 'idle in transaction' 
> transactions (they cause pg_listener bloat).  Look for anything else out 
> of the ordinary, like long-running transactions.
> 
> 2) On the subscriber, try
> VACUUM ANALYZE VERBOSE pg_listener;
> and have a look at the output.  pg_listener bloat can cause the replicas 
> to fall behind.  You may have to kill off some connections to get the 
> vacuum to actually do anything.
> 
> 3) Again, on the subscriber, run
> VACUUM ANALYZE VERBOSE _clustername.sl_log_1;
> 
> I've seen replicas fall behind due to a heavily bloated index on sl_log_1.
> 
> 4) Are you vacuuming your subscriber?
> 