hubert depesz lubaczewski depesz
Sun Dec 11 10:50:34 PST 2005
hi
i have a database of about 40G running under postgresql 8.0.4.
i added slony 1.1.2 replication for all tables (about 50 tables, in 5
schemas).
the database is quite loaded - both with a lot of selects, and tens of
thousands of updates/inserts daily.

everything worked really fine.

i made an mrtg graph showing replication lag (taken from st_lag_time in
the sl_status view).
during the day the lag was up to 5 seconds, with usually 2-3 peaks of up to
200 seconds every night (big, fast imports take place at this time).
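(for reference, the probe behind that graph is essentially just pulling the
lag interval out of sl_status as seconds - a sketch, with the cluster schema
name "_irr" assumed from the plan below:)

```sql
-- mrtg probe sketch: replication lag in seconds, read from the
-- slony status view (cluster schema "_irr" assumed)
SELECT EXTRACT(epoch FROM st_lag_time) AS lag_seconds
  FROM _irr.sl_status;
```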
then - on friday - replication seemed to stop, or to work *very* slowly.
even a plain
select * from sl_status
takes a lot of time:
# explain analyze SELECT * from sl_status ;

                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4487.07..4497.20 rows=1 width=48) (actual time=7527.381..7527.442 rows=1 loops=1)
   ->  Nested Loop  (cost=4487.07..4490.56 rows=1 width=60) (actual time=7527.302..7527.316 rows=1 loops=1)
         ->  Merge Join  (cost=4487.07..4487.09 rows=1 width=40) (actual time=7527.191..7527.199 rows=1 loops=1)
               Merge Cond: ("outer".ev_origin = "inner".con_origin)
               ->  Sort  (cost=4220.76..4220.76 rows=1 width=24) (actual time=6399.919..6399.920 rows=1 loops=1)
                     Sort Key: e.ev_origin
                     ->  Nested Loop  (cost=4217.27..4220.75 rows=1 width=24) (actual time=6399.896..6399.909 rows=1 loops=1)
                           ->  Subquery Scan "IN_subquery"  (cost=4217.27..4217.28 rows=1 width=12) (actual time=6399.829..6399.834 rows=1 loops=1)
                                 ->  HashAggregate  (cost=4217.27..4217.27 rows=1 width=12) (actual time=6399.824..6399.826 rows=1 loops=1)
                                       ->  Seq Scan on sl_event  (cost=0.00..3963.37 rows=50779 width=12) (actual time=71.569..6251.907 rows=48949 loops=1)
                                             Filter: (ev_origin = _irr.getlocalnodeid('_irr'::name))
                           ->  Index Scan using "sl_event-pkey" on sl_event e  (cost=0.00..3.45 rows=1 width=20) (actual time=0.051..0.054 rows=1 loops=1)
                                 Index Cond: ((e.ev_origin = "outer".ev_origin) AND (e.ev_seqno = "outer".max))
               ->  Sort  (cost=266.31..266.31 rows=2 width=16) (actual time=1127.256..1127.257 rows=1 loops=1)
                     Sort Key: "IN_subquery".con_origin
                     ->  Subquery Scan "IN_subquery"  (cost=266.27..266.30 rows=2 width=16) (actual time=1127.224..1127.229 rows=1 loops=1)
                           ->  HashAggregate  (cost=266.27..266.28 rows=2 width=16) (actual time=1127.218..1127.220 rows=1 loops=1)
                                 ->  Seq Scan on sl_confirm  (cost=0.00..266.02 rows=34 width=16) (actual time=1125.590..1127.156 rows=7 loops=1)
                                       Filter: (con_origin = _irr.getlocalnodeid('_irr'::name))
         ->  Index Scan using "sl_event-pkey" on sl_event ce  (cost=0.00..3.45 rows=1 width=20) (actual time=0.094..0.097 rows=1 loops=1)
               Index Cond: (("outer".ev_origin = ce.ev_origin) AND ("outer".max = ce.ev_seqno))
   ->  Index Scan using sl_confirm_idx1 on sl_confirm c  (cost=0.00..6.61 rows=1 width=24) (actual time=0.038..0.064 rows=1 loops=1)
         Index Cond: ((c.con_origin = "outer".ev_origin) AND (c.con_received = "outer".con_received) AND (c.con_seqno = "outer".max))
 Total runtime: 7527.723 ms
(24 rows)

which is very strange.
as i see it, the problem lies in the sl_event table, which now has 49222
records.
sl_status shows:
# SELECT * from sl_status ;
-[ RECORD 1 ]-------------+---------------------------
st_origin                 | 1
st_received               | 2
st_last_event             | 251011
st_last_event_ts          | 2005-12-11 13:47:00.524709
st_last_received          | 201902
st_last_received_ts       | 2005-12-11 13:44:55.793112
st_last_received_event_ts | 2005-12-09 20:44:57.414378
st_lag_num_events         | 49109
st_lag_time               | 1 day 17:02:03.85666
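(for completeness - the raw per-node confirmation state behind those numbers
can be pulled with something like this; column names as in the plan above:)

```sql
-- which event did each node last confirm? (diagnostic sketch)
SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
  FROM _irr.sl_confirm
 GROUP BY con_origin, con_received
 ORDER BY con_origin, con_received;
```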

most probably the information i am giving you is not enough, but i just
don't know what else i can supply to help you help me.
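(if it would help, i can for example run a verbose vacuum over the event
table and post the dead-row numbers - a sketch:)

```sql
-- report dead-row / page counts for the bloated event table
VACUUM VERBOSE ANALYZE _irr.sl_event;
```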

basically, when the problem showed up, there was nothing special happening
in the database.
right now both slons are reporting things like this:
master:
2005-12-11 12:24:50 MSK DEBUG1 cleanupThread:    5.653 seconds for delete logs
2005-12-11 12:36:52 MSK DEBUG1 cleanupThread:   28.315 seconds for cleanupEvent()
2005-12-11 12:37:01 MSK DEBUG1 cleanupThread:    9.285 seconds for delete logs
2005-12-11 12:48:56 MSK DEBUG1 cleanupThread:   31.703 seconds for cleanupEvent()
2005-12-11 12:49:05 MSK DEBUG1 cleanupThread:    8.735 seconds for delete logs
2005-12-11 13:01:44 MSK DEBUG1 cleanupThread:   27.764 seconds for cleanupEvent()
2005-12-11 13:01:51 MSK DEBUG1 cleanupThread:    7.696 seconds for delete logs
2005-12-11 13:13:36 MSK DEBUG1 cleanupThread:   12.179 seconds for cleanupEvent()
2005-12-11 13:13:38 MSK DEBUG1 cleanupThread:    2.141 seconds for delete logs
2005-12-11 13:26:16 MSK DEBUG1 cleanupThread:   29.539 seconds for cleanupEvent()
2005-12-11 13:26:22 MSK DEBUG1 cleanupThread:    6.845 seconds for delete logs
2005-12-11 13:37:28 MSK DEBUG1 cleanupThread:   23.460 seconds for cleanupEvent()
2005-12-11 13:37:34 MSK DEBUG1 cleanupThread:    5.345 seconds for delete logs

slave:
2005-12-11 12:28:13 MSK DEBUG1 cleanupThread:    0.025 seconds for delete logs
2005-12-11 12:39:07 MSK DEBUG1 cleanupThread:    0.038 seconds for cleanupEvent()
2005-12-11 12:39:07 MSK DEBUG1 cleanupThread:    0.033 seconds for delete logs
2005-12-11 12:50:04 MSK DEBUG1 cleanupThread:    0.033 seconds for cleanupEvent()
2005-12-11 12:50:04 MSK DEBUG1 cleanupThread:    0.026 seconds for delete logs
2005-12-11 13:02:02 MSK DEBUG1 cleanupThread:    0.049 seconds for cleanupEvent()
2005-12-11 13:02:02 MSK DEBUG1 cleanupThread:    0.025 seconds for delete logs
2005-12-11 13:14:07 MSK DEBUG1 cleanupThread:    0.037 seconds for cleanupEvent()
2005-12-11 13:14:07 MSK DEBUG1 cleanupThread:    0.037 seconds for delete logs
2005-12-11 13:25:40 MSK DEBUG1 cleanupThread:    0.032 seconds for cleanupEvent()
2005-12-11 13:25:40 MSK DEBUG1 cleanupThread:    0.025 seconds for delete logs
2005-12-11 13:36:46 MSK DEBUG1 cleanupThread:    0.055 seconds for cleanupEvent()
2005-12-11 13:36:46 MSK DEBUG1 cleanupThread:    0.035 seconds for delete logs
2005-12-11 13:48:20 MSK DEBUG1 cleanupThread:    0.036 seconds for cleanupEvent()
2005-12-11 13:48:20 MSK DEBUG1 cleanupThread:    0.026 seconds for delete logs


could you help me somehow? this replication is very important for me, as we
have to finally deploy it, and we still have problems.

best regards

depesz