[Slony1-general] Postmaster restart breaks slony

Thu Apr 20 05:53:31 PDT 2006

On 4/20/2006 2:24 AM, Ian Burrell wrote:

> On 4/19/06, Christopher Browne <cbbrowne at ca.afilias.info> wrote:
>> Ian Burrell wrote:
>> >
>> > We didn't notice the node1 slon was down until a few hours later.  I
>> > started the node1 slon daemon, which inserted SYNC events.  The slave
>> > slon daemons then started processing the SYNC events and trying to
>> > transfer data.  Since the data connection had failed, the slave
>> > slon daemons started failing.  I noticed the error and restarted all
>> > the slon daemons, which fixed the problem.

From the CVS log for src/slon/remote_worker.c:

revision 1.86.2.5
date: 2005/10/08 19:37:29;  author: wieck;  state: Exp;  lines: +10 -1
Check existing provider DB connection in sync event processing.
A DB connection loss during fetching of log rows does not cause
the database connection to be dropped within the helper thread.
This was able to cause a dead connection to stall replication.

This fix was released with version 1.1.2.
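
For readers who want the idea behind that log entry spelled out, here is a
minimal sketch of "check the existing provider connection before each SYNC,
and drop it if it turns out to be dead". It is not the code from
remote_worker.c (which is C on top of libpq); the names FakeConnection,
ConnectionLost and SyncWorker are hypothetical and exist only for this
illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class ConnectionLost(Exception):
    """Raised by the hypothetical connection when the link is dead."""


@dataclass
class FakeConnection:
    """Stand-in for a provider DB connection (illustration only)."""
    alive: bool = True
    fail_next_fetch: bool = False  # simulates loss while fetching log rows

    def fetch_log_rows(self, sync_id: int) -> List[str]:
        if self.fail_next_fetch or not self.alive:
            self.alive = False
            raise ConnectionLost("provider connection is dead")
        return [f"row-for-sync-{sync_id}"]

    def close(self) -> None:
        self.alive = False


class SyncWorker:
    """Processes SYNC events, never reusing a connection known to be dead."""

    def __init__(self, connect: Callable[[], FakeConnection]):
        self._connect = connect
        self._conn: Optional[FakeConnection] = None
        self.connects = 0  # how many times a connection was opened

    def _ensure_connection(self) -> FakeConnection:
        # Check the existing connection before every SYNC; replace it if dead.
        if self._conn is None or not self._conn.alive:
            self._conn = self._connect()
            self.connects += 1
        return self._conn

    def process_sync(self, sync_id: int) -> List[str]:
        conn = self._ensure_connection()
        try:
            return conn.fetch_log_rows(sync_id)
        except ConnectionLost:
            # Drop the dead connection so the next attempt reconnects
            # instead of retrying on the same broken handle.
            conn.close()
            self._conn = None
            raise
```

The point of the except branch is the failure mode described in this
thread: if the dead handle is kept, every later SYNC fails the same way
and replication stalls without the daemon ever exiting.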

Jan

>> >
>> > Shouldn't the slon daemons reconnect if the remoteWorkerThread
>> > connection goes down?  Even dying and being restarted would be better
>> > than continuously failing in a loop.  We are using 1.1.0 with most of
>> > the 1.1.1 patches.  Has this problem been fixed in 1.1.5?
>> >
>> Unfortunately, the real fix to this is in CVS HEAD/1.2, which fairly
>> significantly restructures the thread handling, needful for Windows
>> support...
>>
>> I wish there were a better answer; unfortunately, the "sorta-internal
>> watchdog" that was added in 1.1.0 leaves something to be desired.  The
>> fix is the "big fix," which is what's in CVS HEAD.
>>
>> All I can recommend for now is to make sure that there's a watchdog running.
>>
> 
> The problem is that the watchdog does not help for the slave slon
> daemons because they do not die.  They kept trying to use the broken
> database connection, aborting the sync event, and then trying again.
> 
> We had a different problem with the slon_watchdog script dying itself
> instead of starting the master slon daemon.
> 
>  - Ian

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #