Christopher Browne cbbrowne
Wed Aug 17 22:25:09 PDT 2005
Marc Munro wrote:

>Brad, thank you so much for your response.  I have more information but
>still no solution.
>
>In response to Brad's suggestions:
>
>1) pg_stat_activity
>Interesting.  There is an idle in transaction process from the provider.
>Shutting down and restarting the database resulted in a new session with
>the same status.  This appears to be something to do with slony.
>  
>
That may be fine, as long as the idle-in-transaction process isn't old.
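If you want to check, something along these lines should show roughly how
long each such session has been idle (just a sketch: the column names and
the '<IDLE> in transaction' marker vary a bit between PostgreSQL versions,
current_query is only populated when the stats collector records command
strings, and query_start reflects when the last command began, since older
releases don't track transaction start directly):

  -- assumes a 7.4/8.0-era pg_stat_activity with procpid/current_query/query_start;
  -- newer releases rename these columns
  select procpid, usename, query_start, now() - query_start as idle_for
    from pg_stat_activity
   where current_query = '<IDLE> in transaction'
   order by query_start;

A long-lived open transaction there is what stops VACUUM from reclaiming
dead tuples, which ties into the next couple of points.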

>2) vacuum subscriber's pg_listener
>No bloat to report.  Only a handful of rows existed.
>
>  
>
The question is how many rows it *couldn't* purge; how many dead tuples
were kicking around.  e.g. the output of VACUUM VERBOSE ANALYZE on the
table...
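For instance, on the subscriber, something like (just a sketch):

  vacuum verbose analyze pg_listener;

gives the same sort of INFO output you quote further down for sl_log_1.
If the Keep/UnUsed counts there are large compared to the live tuples,
pg_listener is bloated even though only a handful of rows are visible -
and a long-lived open transaction (see above) is the usual reason dead
tuples land in "Keep" instead of being reclaimed.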

>3) vacuum subscribers sl_log_1
>Nothing - there are no rows in the subscriber.
>
>4) Vacuuming subscriber
>Already done.
>
>Results of further investigation:
>I think the problem may simply be that the load on the provider during
>my initial sync was very high.  This has resulted in sl_log_1 becoming
>huge.  In my slony logs on the subscriber I often see lines like:
>
>2005-08-17 09:08:47 PDT DEBUG2 remoteHelperThread_1_1: 83.952 seconds
>delay for first row
>
>Vacuuming the sl_log_1 table on the provider shows it to be pretty big:
>
>INFO:  --Relation _cage2.sl_log_1--
>INFO:  Pages 714558: Changed 1084, Empty 0; Tup 35986240: Vac 0, Keep 0,
>UnUsed 7172.
>	Total CPU 52.50s/9.31u sec elapsed 177.23 sec.
>INFO:  --Relation pg_toast.pg_toast_2177383887--
>INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
>	Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
>INFO:  Analyzing _cage2.sl_log_1
>
>I have tracked down a specific slony query that takes a long time to run
>on the provider (100+ secs):
>
>select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
>log_cmddata from "_cage2".sl_log_1 where log_origin = 1 and (  (
> log_tableid in
>(100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,99999)
>    and (log_xid < '2106689363' and "_cage2".xxid_lt_snapshot(log_xid,
>'2106433975:2106689363:''2106433975'',''2106688480'''))
>    and (log_xid >= '2106433975' and "_cage2".xxid_ge_snapshot(log_xid,
>'2106433975:2106689266:''2106433975'',''2106688480'''))
>) ) order by log_actionseq;
>
>This returns 662 rows.  If I remove the xxid_ge_snapshot and
>xxid_lt_snapshot function calls it returns 2135674.  This would seem to
>be the root of my performance problem.  
>
>I don't understand what the functions do, but allowing them to so
>drastically prune the result set means that the query does a lot of work
>for relatively few rows.
>
>Can anyone explain what is going on here?
>  
>
You're hitting a "Rod Taylor" style situation, where there is a pretty
enormous amount of activity going on.

Each SYNC is running for a really long time, and there's not much way of
avoiding that :-(.

What Rod did under analogous conditions was to group a lot of SYNCs
together.  Since it'll probably be doing a Seq Scan of sl_log_1 anyway,
you might as well try to process most of it in each pass.
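If you want to confirm that, prefixing the big query you quoted with
EXPLAIN ANALYZE on the provider will show whether it really is seq-scanning
all ~36 million tuples and where the time goes (it will, of course, take
about as long as the query itself):

  explain analyze
  select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
         log_cmddata
    from "_cage2".sl_log_1
   where log_origin = 1
     and ( ...same log_tableid / log_xid conditions as in your query... )
   order by log_actionseq;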

With the default "-g 6", you'll probably see replication falling behind :-(

Suggestion: set the -g option as high as you can get away with.  The
SYNCs will run for quite a long time, but at least they'll find lots of
data to replicate.  In 1.0.x, the maximum out-of-the-box setting for -g
is "-g 100", which tries to process up to 100 SYNCs at a time if you're
running behind.

If you're running 1.1.x, the maximum value is 10000, and you'd probably
find it useful to set it pretty high until things catch up.  And set the
"desired_sync_time" value (the -o option to slon) to 0, so that it
doesn't do the "adaptive" sizing of the sync groups.
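Purely as an illustration (the conninfo here is made up - substitute your
own - and the -g value is just a plausible starting point for 1.1.x; on
1.0.x you're capped at -g 100 as above), that works out to starting the
subscriber's slon along the lines of:

  # hypothetical conninfo; cluster name taken from the _cage2 namespace
  slon -g 500 -o 0 cage2 "dbname=mydb host=subscriber-host user=slony"

Once sl_log_1 is back down to a sane size, you can drop -g back to
something modest and let desired_sync_time do its usual job.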


