[Solved] Re: [Slony1-general] replication Lag, sync grouping not happening

Thu Feb 21 11:37:06 PST 2008

On Thu, 2008-02-21 at 18:47 +0000, Christopher Browne wrote:
> Ow Mun Heng <Ow.Mun.Heng at wdc.com> writes:
> > I'm not sure what is happening. Usually, when the slave lags behind, I
> > stop the slon process and add the -o10 -g500 options to the
> > master process, and I then see the syncs on the subscriber
> > being grouped together.
> >
> > As of right now, I'm not seeing this happen; it's just processing
> > the syncs 1 by 1, and that is taking a long time.
> >
> > I also tried the -o10 -g500 on both the master and the slave and still
> > it goes 1 by 1.
> > 2008-02-22 01:58:51 MYT DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
> > 2008-02-22 01:58:51 MYT DEBUG2 remoteWorkerThread_1: SYNC 483711 done in 88.818 seconds
> > 2008-02-22 02:00:07 MYT DEBUG2 remoteHelperThread_1_1: inserts=1984 updates=0 deletes=0
> > 2008-02-22 02:00:10 MYT DEBUG2 remoteWorkerThread_1: SYNC 483712 done in 78.634 seconds
> > 2008-02-22 02:01:34 MYT DEBUG2 remoteHelperThread_1_1: inserts=529 updates=0 deletes=56
> > 2008-02-22 02:01:36 MYT DEBUG2 remoteWorkerThread_1: SYNC 483713 done in 85.745 seconds
> > 2008-02-22 02:03:25 MYT DEBUG2 remoteHelperThread_1_1: inserts=1532 updates=0 deletes=0
> > 2008-02-22 02:03:28 MYT DEBUG2 remoteWorkerThread_1: SYNC 483714 done in 112.476 seconds
> > 2008-02-22 02:05:47 MYT DEBUG2 remoteHelperThread_1_1: inserts=1557 updates=0 deletes=0
> > 2008-02-22 02:05:49 MYT DEBUG2 remoteWorkerThread_1: SYNC 483715 done in 140.691 seconds
> > 2008-02-22 02:08:26 MYT DEBUG2 remoteHelperThread_1_1: inserts=2600 updates=0 deletes=225
> > 2008-02-22 02:08:27 MYT DEBUG2 remoteWorkerThread_1: SYNC 483716 done in 157.839 seconds
> 
> I believe that -o10 causes Slony-I to try to track having SYNC
> processing time take an estimated time of 10ms per group; the value is
> measured in milliseconds, not seconds.
> 
> That being the case, if the last *single* SYNC took "lots more than
> 10ms," then the slon will not be considering processing several SYNCs
> at once.  (And note that since the times were also >>> 10s, the
> principle would still hold if -o was measuring in seconds.)
> 
> Based on the timings you indicate, the only way that you'll see SYNC
> grouping is if you set the value to something more like 200000.
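
To make the reasoning above concrete, here is a rough Python sketch of the idea. It is an idealized model and not Slony-I's actual code: it simply assumes the slon picks the largest group whose estimated time fits within the -o target, capped by -g. The function name and parameters are hypothetical.

```python
def estimated_group_size(desired_sync_ms, per_sync_ms, max_group):
    """Idealized model (an assumption, not the real slon logic):
    fit as many SYNCs as the -o target allows, between 1 and -g."""
    size = int(desired_sync_ms // per_sync_ms)
    return max(1, min(size, max_group))


# A single SYNC took about 88.8 s (88818 ms) in the log above.
print(estimated_group_size(10, 88818, 500))      # -o10     -> 1
print(estimated_group_size(200000, 88818, 500))  # -o200000 -> 2
```

Under this model, -o10 can never yield a group larger than 1 while a single SYNC takes tens of seconds, which matches the 1-by-1 behaviour in the log.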

Master : slon -d4 -c2 -g500 -s60000 -o200000 -f slon_master.conf
Slave :  slon -d2     -g500         -o200000 -f slon_slave1.conf | egrep -i 'done in|inserts='

2008-02-22 03:11:58 MYT DEBUG2 remoteHelperThread_1_1: inserts=807 updates=0 deletes=130
2008-02-22 03:12:04 MYT DEBUG2 remoteWorkerThread_1: SYNC 483756 done in 62.840 seconds
2008-02-22 03:13:19 MYT DEBUG2 remoteHelperThread_1_1: inserts=4626 updates=0 deletes=8
2008-02-22 03:13:19 MYT DEBUG2 remoteWorkerThread_1: SYNC 483759 done in 75.382 seconds
2008-02-22 03:14:49 MYT DEBUG2 remoteHelperThread_1_1: inserts=8824 updates=0 deletes=418
2008-02-22 03:14:50 MYT DEBUG2 remoteWorkerThread_1: SYNC 483766 done in 90.575 seconds
2008-02-22 03:17:06 MYT DEBUG2 remoteHelperThread_1_1: inserts=19587 updates=0 deletes=566
2008-02-22 03:17:07 MYT DEBUG2 remoteWorkerThread_1: SYNC 483781 done in 136.992 seconds
2008-02-22 03:20:12 MYT DEBUG2 remoteHelperThread_1_1: inserts=24451 updates=1138 deletes=484
2008-02-22 03:20:14 MYT DEBUG2 remoteWorkerThread_1: SYNC 483802 done in 187.493 seconds

It seems the syncs are starting to be processed in groups again.
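
As a quick check on the grouping, here is a small Python sketch (my own helper, not part of Slony-I) that reads the "done in" lines from the log above and reports how many SYNCs each step covered, taken as the difference between consecutive SYNC numbers. The names `SYNC_RE` and `sync_groups` are made up for illustration.

```python
import re

# "done in" lines copied from the slave log above.
LOG = """\
2008-02-22 03:12:04 MYT DEBUG2 remoteWorkerThread_1: SYNC 483756 done in 62.840 seconds
2008-02-22 03:13:19 MYT DEBUG2 remoteWorkerThread_1: SYNC 483759 done in 75.382 seconds
2008-02-22 03:14:50 MYT DEBUG2 remoteWorkerThread_1: SYNC 483766 done in 90.575 seconds
2008-02-22 03:17:07 MYT DEBUG2 remoteWorkerThread_1: SYNC 483781 done in 136.992 seconds
2008-02-22 03:20:14 MYT DEBUG2 remoteWorkerThread_1: SYNC 483802 done in 187.493 seconds
"""

SYNC_RE = re.compile(r"SYNC (\d+) done in ([\d.]+) second")


def sync_groups(log_text):
    """Return (sync_id, group_size, seconds) for each line after the first."""
    results = []
    prev = None
    for match in SYNC_RE.finditer(log_text):
        sync_id, secs = int(match.group(1)), float(match.group(2))
        if prev is not None:
            # Group size = how far the SYNC number advanced in this step.
            results.append((sync_id, sync_id - prev, secs))
        prev = sync_id
    return results


for sync_id, size, secs in sync_groups(LOG):
    print(f"SYNC {sync_id}: group of {size}, {secs / size:.1f} s per SYNC")
```

On these lines the group sizes come out as 3, 7, 15 and 21, and the time per SYNC drops from roughly 25 s to about 9 s as the groups grow.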

To be frank, the -o and -s options really befuddle me. I've read the docs,
but I guess I don't understand them well enough to know whether these
options apply on the master or the slave (hence, as above, I just put them
on both master and slave).

On another front, I tend to believe that one of the reasons for the lag
is that my disk is slow (1x 500GB IDE 7200 rpm, and it's bogged
down: atop shows 90% usage nearly 80% of the time). On top of that, I
noticed that things slow down even more once sl_log_1/2 grows
large (~2GB), and no amount of vacuum/reindex/recreate index gets it
back up to speed (fetch 100 from log becomes really slow, >500 secs).

Chris (you) already showed me how to manually force a log switch, so
now I'm considering setting up a job to force the switch every
6 hours or so, just to keep the size under control. Is this a good idea?