Andrew Hammond andrew.george.hammond at gmail.com
Mon Sep 10 15:44:20 PDT 2007
On 9/10/07, Cyril SCETBON <cscetbon.ext at orange-ftgroup.com> wrote:
>
>
> Cyril SCETBON wrote:
> >
> >
> > Jan Wieck wrote:
> >> On 9/7/2007 9:36 AM, Cyril SCETBON wrote:
> >>> Hi,
> >>>
> >>> I got this configuration                Node1 --> Node2 (5 seconds late)
> >>>                                                           |
> >>>                                                           --> Node3 (2 hours late)
> >>>
> >>> Node2 is processing each SYNC from Node3 and Node1, but Node3 is
> >>> processing each SYNC from Node2 but not from Node1, which is the
> >>> origin of the sets:
> >>>
> >>> On Node3 we see `grep processing /var/log/slony1/node3-pns_profiles_preprod.log|awk '{print $5}'|sort|uniq -c`
> >>>      19 remoteWorkerThread_1:
> >>>     963 remoteWorkerThread_2:
> >>>
> >>> On Node2 we see `grep processing /var/log/slony1/node2-pns_profiles_preprod.log |awk '{print $5}'|sort|uniq -c`
> >>>    1570 remoteWorkerThread_1:
> >>>     865 remoteWorkerThread_3:
> >>>
> >>> Why are there so many SYNCs not processed on Node3?
> >>>
> >>> Node3 got 22440 queue event and 25 Received event from
> >>> remoteWorkerThread_1, while Node2 got 4467 queue event and 1578
> >>> Received event from the same worker.
> >>>
> >>> Is there something to do ?
> >>
> >> How about looking for some error messages?
> > None.
> I've put slon in debug level 2
> >>
> >> What comes to mind would be that sl_event is grossly out of shape and
> >> that the event selection times out.
> > It seems vacuuming sl_log_1 takes too long because of
> > vacuum_cost_delay, and selecting from this table uses a seq scan.
> > I'm investigating.
> I forced vacuum to go faster and checked the slon logs of the subscribers. They
> have similar disk capabilities, and disk seems to be the bottleneck on all
> nodes (wait io ~=50% in vmstat).
>
> I found replication tasks time are different :
>
> On node 3 :
>                      delay in seconds = 585.974ms
>                      cleanupEvent in seconds = 9.25167s
>
> On node 2 :
>                      delay in seconds = 37.6463ms
>                      cleanupEvent in seconds = 0.203265s
>
> Could these times explain why node 3 is late compared to node 2? What do
> you think I should investigate now?
>
> PS: hosts consume the same processor load but node 2 is a biprocessor
> 2.6Ghz and node 3 is a biprocessor dual core 1.8Ghz (4 processors seen
> by Linux kernel SMP)
>

So... the computer with the slower processor is slower?
What delay are you referring to? If it's from _foo.sl_status.st_lag_time
then you should be aware that its actual precision is about +/-5 seconds.
While the cleanup is disk intensive, it also does a good chunk of number
crunching. I'm surprised to see an order of magnitude in difference, but...
not shocked.
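
For anyone wanting to repeat the per-thread SYNC counts quoted above without
the grep/awk/sort/uniq pipeline, here is a minimal Python sketch of the same
tally. It assumes, as the awk '{print $5}' in the original command implies,
that the worker thread name is the fifth whitespace-separated field of each
log line. The function name is hypothetical, and lines with fewer than five
fields are skipped here, whereas awk would count them under an empty key.

```python
from collections import Counter


def count_processing_by_thread(lines):
    """Count log lines containing 'processing', keyed by their 5th field.

    Rough equivalent of: grep processing LOG | awk '{print $5}' | sort | uniq -c
    Assumption: the 5th whitespace-separated field is the thread name
    (e.g. 'remoteWorkerThread_1:').
    """
    counts = Counter()
    for line in lines:
        # Same filter as `grep processing`: plain substring match.
        if "processing" not in line:
            continue
        fields = line.split()
        # Skip lines too short to have a 5th field.
        if len(fields) >= 5:
            counts[fields[4]] += 1
    return counts
```

Passing an open log file (for example one of the files under /var/log/slony1/
named in the quoted commands) as `lines` gives a Counter that can be compared
between nodes.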

Andrew