[Slony1-general] Keeping the 'slon' daemons running

Wed Oct 20 12:39:38 PDT 2004

>>> When shutting the database engines down, the associated "slon"
>>> processes tend to exit too.
>>>
>>> My preference is that when the database is stopped/started (or
>>> during a network outage) that the daemon remains running, and
>>> attempts periodic reconnects.
>>>
>>> What is the suggested approach for managing the daemon processes
>>> so that replication starts up again when the database becomes
>>> available?
>>
>> There's a watchdog process written in Perl in the "altperl" directory
>> (recent CVS, whether for STABLE or for HEAD) that has some capability to
>> do this.
>
> I have a case where I have 3 computers
>
> db1 runs postgres slave
> db2 runs postgres master and masters slon
> db3 runs slaves slon
>
> after a network outage of a few minutes which disconnects db1 (slave pg)
> from the rest, the following happens
>
> slon on db3 dies (with Timeout message in log)
> slon on db2 keeps running but one of the threads is in R state
>         and keeps eating all CPU it can.
>
> I have to manually start slon on db3 and kill main slon and restart it on
> db2 to be back to normal ops.
>
> Unfortunately these are production machines so I can't attach gdb to the
> CPU-eating slon to see what's up
>
> Can your Perl script handle this situation too?

I can't verify the "eating all CPU it can" part, but yes, the
"slon_watchdog2.pl" script would handle this.

1.  When the slon actually dies, the watchdog fairly quickly (default: 2
minutes) notices that it is gone and restarts it.

2.  When one of the threads breaks, the slon stops updating the subscriber,
which the watchdog eventually notices (we set it to "in 20 minutes"; I
think CVS says "in 40 minutes").

We get a situation similar to this where a particular network connection
going across a VPN layer will stop responding; it normally only affects
one of the slons by killing one of the threads.

The policy in that watchdog is NOT universally perfect.  In particular, it
is quite inappropriate to use it while a subscriber is doing the initial
subscription: during that phase, it could take hours for the SUBSCRIBE_SET
event to complete.  Since that's way more than either 20 minutes or 40
minutes, the watchdog would conclude that the thread was dead and that it
should restart slon.
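
As a rough illustration of that policy, here is a minimal sketch in Python.
It is NOT the actual Perl in "slon_watchdog2.pl"; the function name, the
parameter names, and the Action type are made up for this example, and the
20-minute default simply mirrors the value we use.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART = "restart"


def watchdog_action(slon_alive: bool,
                    minutes_since_last_update: float,
                    stale_after_minutes: float = 20.0) -> Action:
    """Decide what one watchdog pass should do (hypothetical helper).

    slon_alive: whether the slon process was found running.
    minutes_since_last_update: how long ago the subscriber last advanced.
    stale_after_minutes: assumed threshold; 20 here, reportedly 40 in CVS.
    """
    # Case 1: the slon process is gone, so restart it.
    if not slon_alive:
        return Action.RESTART
    # Case 2: the process exists but the subscriber has stopped advancing,
    # which is treated as a dead thread.
    if minutes_since_last_update > stale_after_minutes:
        return Action.RESTART
    return Action.NONE


if __name__ == "__main__":
    # A healthy slon that updated 5 minutes ago is left alone.
    print(watchdog_action(True, 5))
    # A three-hour initial subscription looks exactly like a dead thread.
    print(watchdog_action(True, 180))
```

Note that the second call returns a restart: the check cannot tell a long
SUBSCRIBE_SET apart from a broken thread, which is why the watchdog should
not be running during the initial subscription.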

This situation is one where it would be really nice if the slon wrote out
_something_ on the node that it was still talking to that could be
interpreted as "Help!  I'm not working anymore!  Replace me with a new
slon!"  Perhaps it should even terminate itself.

Find me a log excerpt showing what exactly gets recorded when the slon
gets into that state, and we can compare logs and look for a common answer
to it...