[Slony1-general] Postmaster restart breaks slony

Wed Apr 19 23:24:54 PDT 2006

On 4/19/06, Christopher Browne <cbbrowne at ca.afilias.info> wrote:
> Ian Burrell wrote:
> >
> > We didn't notice the node1 slon was down until a few hours later.  I
> > started the node1 slon daemon which inserted SYNC events.  The slave
> > slon daemon then started processing the SYNC events and trying to
> > transfer data.  Since the data connection had failed, the slave
> > slon daemons started failing.  I noticed the error and restarted all
> > the slon daemons, which fixed the problem.
> >
> > Shouldn't the slon daemons reconnect if the remoteWorkerThread
> > connection goes down?  Even dying and being restarted would be better
> > than continuously failing in a loop.  We are using 1.1.0 with most of
> > the 1.1.1 patches.  Has this problem been fixed in 1.1.5?
> >
> Unfortunately, the real fix to this is in CVS HEAD/1.2, which fairly
> significantly restructures the thread handling, needful for Windows
> support...
>
> I wish there were a better answer; unfortunately, the "sorta-internal
> watchdog" that was added in 1.1.0 leaves something to be desired.  The
> fix is the "big fix," which is what's in CVS HEAD.
>
> All I can recommend for now is to make sure that there's a watchdog running.
>

The problem is that the watchdog does not help with the slave slon
daemons, because they never die.  They kept trying to use the broken
database connection, aborting the SYNC event, and then trying again.
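
A watchdog could catch this case if it checked replication progress
instead of only checking whether the process exists.  Below is a minimal
sketch in Python, not anything shipped with Slony-I.  The names
(decide, Action) and the ten-minute threshold are hypothetical, and how
the caller finds out when the last SYNC was applied is left open:

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    NONE = "none"        # daemon is alive and making progress
    START = "start"      # daemon process is gone
    RESTART = "restart"  # daemon is alive but stuck


def decide(process_alive: bool,
           last_sync_applied: datetime,
           now: datetime,
           max_lag: timedelta = timedelta(minutes=10)) -> Action:
    """Pick a watchdog action from liveness plus replication progress.

    last_sync_applied is assumed to be supplied by the caller (for
    example from a query against the subscriber); this sketch does not
    say where it comes from.
    """
    if not process_alive:
        # The classic watchdog case: the process died.
        return Action.START
    if now - last_sync_applied > max_lag:
        # The case described above: the process is running but has
        # not applied a SYNC for too long, so treat it as stuck.
        return Action.RESTART
    return Action.NONE
```

A process-only watchdog corresponds to the first branch alone; the
second branch is what would have caught the looping slave daemons.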

We also had a separate problem: the slon_watchdog script itself died
instead of starting the master slon daemon.

 - Ian