Christopher Browne cbbrowne at ca.afilias.info
Thu Jun 14 08:25:05 PDT 2007
Steven Singer <ssinger_pg at sympatico.ca> writes:
> On Fri, 8 Jun 2007, Christopher Browne wrote:
>
> I've been trying 1.2.10 from CVS against 8.2 on a Linux PowerPC(mac) and
>
> I've been occasional failures of the testddl unit test with
>
> <stdin>:5: sleep a couple seconds
> <stdin>:9: timeout exceeded while waiting for event confirmation
>
> Which appears to be from a:wait for event (origin=1, confirmed=2);
>
> Most of the time testddl passes fine but it failed at least once when
> I tried running it. I am still trying to reproduce the failure.

I'm not entirely sure why this happens; it has occasionally happened
to me too.

In the slon logs, the slon for the subscriber node has fallen into the
following "pit":

When node #2 tries to process the subscription...

2007-06-14 15:06:21 UTC DEBUG2 remoteWorkerThread_1: Received event 1,17 ENABLE_SUBSCRIPTION
2007-06-14 15:06:21 UTC WARN   remoteWorkerThread_1: copy set: data provider 1 only on sync -1 - sleep 5 seconds

This is normally the result of node 2's slon not properly listening to
node #1.

If you restart the slon for node #2 (e.g. - kill the slon, run it
again, by hand), it picks itself up out of the "pit" and subscribes
and runs fine.

I figure there must be some race condition surrounding the events that
is taking place here...

I just ran the DDL test three times in a row; twice, it failed with
the above problem; the third time it worked.

I grabbed the logs from the three iterations, and chopped them off
immediately after the ENABLE_SUBSCRIPTION message, which is clearly
the point at which the behaviour is diverging.  I then chopped off
timestamps to make them look temporally similar, and did a sort + diff
to see what the differences look like.

Here's one of the diffs; if we can identify what it is that isn't
being processed properly in the case where it failed (which will
involve the entries marked ">"), then maybe there's a bug to be
stomped :-).

I'm seeing one oddity, that in the "succeeding" slon, the events are
associated with remoteWorkerThread_1, whereas on the other, they're on
remoteListenThread_3.

slony-regress.ks8357/slon_log.2.sorted-diff
::::::::::::::
20d19
< DEBUG1 copy_set 1
39,41c38
< DEBUG2 remoteListenThread_1: queue event 1,16 SUBSCRIBE_SET
< DEBUG2 remoteListenThread_1: queue event 1,17 ENABLE_SUBSCRIPTION
< DEBUG2 remoteListenThread_1: queue event 1,18 SYNC
---
> DEBUG2 remoteListenThread_1: queue event 1,19 SYNC
44,46d40
< DEBUG2 remoteListenThread_1: queue event 3,1 STORE_PATH
< DEBUG2 remoteListenThread_1: queue event 3,2 STORE_PATH
< DEBUG2 remoteListenThread_1: queue event 3,3 SYNC
48a43,48
> DEBUG2 remoteListenThread_3: queue event 1,16 SUBSCRIBE_SET
> DEBUG2 remoteListenThread_3: queue event 1,17 ENABLE_SUBSCRIPTION
> DEBUG2 remoteListenThread_3: queue event 1,18 SYNC
> DEBUG2 remoteListenThread_3: queue event 3,1 STORE_PATH
> DEBUG2 remoteListenThread_3: queue event 3,2 STORE_PATH
> DEBUG2 remoteListenThread_3: queue event 3,3 SYNC
50a51
> DEBUG2 remoteListenThread_3: stop listening for event origin 1
63,64d63
< DEBUG2 remoteWorkerThread_1: forward confirm 1,9 received by 3
< DEBUG2 remoteWorkerThread_1: forward confirm 2,0 received by 3
68d66
< DEBUG2 remoteWorkerThread_1: forward confirm 3,3 received by 1
74a73,76
> DEBUG2 remoteWorkerThread_3: forward confirm 1,18 received by 3
> DEBUG2 remoteWorkerThread_3: forward confirm 1,9 received by 3
> DEBUG2 remoteWorkerThread_3: forward confirm 2,0 received by 1
> DEBUG2 remoteWorkerThread_3: forward confirm 2,0 received by 3
75a78,79
> DEBUG2 remoteWorkerThread_3: forward confirm 3,0 received by 1
> DEBUG2 remoteWorkerThread_3: forward confirm 3,3 received by 1
82,83c86,87
< DEBUG2 slon: watchdog ready - pid = 8471
< DEBUG2 slon: worker process created - pid = 8473
---
> DEBUG2 slon: watchdog ready - pid = 8279
> DEBUG2 slon: worker process created - pid = 8280
84a89
> WARN   remoteWorkerThread_1: copy set: data provider 1 only on sync -1 - sleep 5 seconds

Jan just IM'ed me that he just committed a change that evidently
affects this, so this may already be fixed :-).
-- 
output = reverse("ofni.sailifa.ac" "@" "enworbbc")
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)


More information about the Slony1-general mailing list