Christopher Browne cbbrowne
Wed Apr 19 15:14:07 PDT 2006
Ian Burrell wrote:
> We ran into a problem with the postmaster restarting database
> connections causing replication to break.  A segfault in one of the
> backends running a custom C aggregate caused the postmaster to restart
> all of the other backend processes.  This broke all of the existing
> connections to the database.  The node1 slon died.  For some reason
> the slon_watchdog.pl script also died, but that is a different problem.
> The slave slon daemons both reconnected the remoteListenThread
> connection.  But the remoteWorkerThread connection for transferring
> data was unused and left broken.
>
> We didn't notice the node1 slon was down until a few hours later.  I
> started the node1 slon daemon, which inserted SYNC events.  The slave
> slon daemons then started processing the SYNC events and trying to
> transfer data.  Since the data connection had failed, the slave slon
> daemons started failing.  I noticed the error and restarted all the
> slon daemons, which fixed the problem.
>
> Shouldn't the slon daemons reconnect if the remoteWorkerThread
> connection goes down?  Even dying and being restarted would be better
> than continuously failing in a loop.  We are using 1.1.0 with most of
> the 1.1.1 patches.  Has this problem been fixed in 1.1.5?
>   
Unfortunately, the real fix to this is in CVS HEAD/1.2, which
restructures the thread handling fairly significantly, as is needed
for Windows support...

I wish there were a better answer; unfortunately, the "sorta-internal
watchdog" that was added in 1.1.0 leaves something to be desired.  The
fix is the "big fix," which is what's in CVS HEAD.

All I can recommend for now is to make sure that there's a watchdog running.

For 1.2, I've got what I think is a "better mousetrap" for
watchdogging.  And nothing about it is really 1.2-specific; I'm running
it now...

You basically write slon .conf files into a well-defined directory
structure, and those .conf files have the respective slons write .pid
files nearby.
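
For instance, something like this (the directory and cluster names
here are made up for illustration; cluster_name, conn_info, and
pid_file are standard slon config options):

    # /etc/slony/mycluster/node1.conf
    cluster_name='mycluster'
    conn_info='host=db1 dbname=app user=slony'
    pid_file='/etc/slony/mycluster/node1.pid'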

A "watchdog" basically goes in every so often and checks to see if
there's a slon running for each of the .conf/.pid combination,
restarting whatever falls over.  It's a *real* simple shell script.

http://gborg.postgresql.org/cgi-bin/cvsweb.cgi/slony1-engine/tools/launch_clusters.sh?cvsroot=slony1

The script is only a page long, and can manage a whole bunch of Slony-I
clusters all at once...
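
In rough strokes, it boils down to something like this (a
stripped-down sketch, not the actual launch_clusters.sh; the
directory and the exact slon invocation are assumptions to adjust to
your own layout):

    #!/bin/sh
    # For each slon .conf file, check the matching .pid file; if
    # there's no live process behind it, (re)start slon with that
    # config.  (Logging/daemonizing omitted for brevity.)
    CONFDIR=/etc/slony/mycluster
    for conf in $CONFDIR/*.conf; do
        pidfile="${conf%.conf}.pid"
        if [ -f "$pidfile" ] && kill -0 `cat $pidfile` 2>/dev/null; then
            continue        # this slon is still alive
        fi
        slon -f $conf &     # slon writes pid_file itself, per the .conf
    done

Run that from cron every minute or so, and anything that falls over
gets restarted on its own.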


