[Slony1-general] Smarter Watchdog Approach

Tue Sep 21 21:35:41 PDT 2004

After watching which "bump in the night" scenarios commonly come up,
I want to set up a smarter "watchdog" script that watches each slon
instance to see whether it needs to be restarted.

At present, there's a script that basically "zaps" the slons every now
and again and restarts them.  That is far from ideal, for a couple of
reasons that I can see:

 1.  It leaves PG backends around waiting for notifications, which
     leaves dead tuples lingering in pg_listener, along with whatever
     other ills "zombie" transactions engender.

 2.  Sometimes it causes the slon instances to get a bit deranged such
     that they need a "restart node".

My thought is to have the "watchdog" be smarter in three ways:

 a) It should only kill the slon if there seems reason to do so.

    The case where we _definitely_ need it is when a VPN network
    connection falls down, so that events no longer get through.

    That suggests looking to see how recently events have made it
    through.

    Here's the query I'm thinking of.

oxrslive=# select now() - ev_timestamp > '00:20:00'::interval as event_old, now() - ev_timestamp as age, 
oxrslive-#        ev_timestamp, ev_seqno, ev_origin as origin
oxrslive-# from _oxrslive.sl_event events, _oxrslive.sl_subscribe slony_master
oxrslive-#   where 
oxrslive-#      events.ev_origin = slony_master.sub_provider and
oxrslive-#      not exists (select * from _oxrslive.sl_subscribe providers 
oxrslive(#                   where providers.sub_receiver = slony_master.sub_provider and
oxrslive(#                         providers.sub_set = slony_master.sub_set and
oxrslive(#                         slony_master.sub_active = 't' and
oxrslive(#                         providers.sub_active = 't')
oxrslive-# order by ev_origin desc, ev_seqno desc limit 1;
 event_old |       age       |        ev_timestamp        | ev_seqno | origin 
-----------+-----------------+----------------------------+----------+--------
 f         | 00:00:01.025902 | 2004-09-21 19:16:43.804917 |   621069 |      1
(1 row)

    It looks for the latest timestamp associated with an event coming
    from a "master" node, and returns "t" in the first field if the
    interval since the last event exceeds 20 minutes (which I'm
    treating as a provisional parameter value).

    Is there anything particularly deranged about that?  Or should I
    be looking to see which 'active' origin has checked in least
    recently?
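
    The "least recently checked-in origin" alternative can be sketched
    outside the database.  This is only an illustration: it assumes the
    watchdog has already collected the newest ev_timestamp per origin
    into a dictionary (how that is fetched is left open), and the names
    stalest_origin and STALE_AFTER are made up for the example.

```python
from datetime import datetime, timedelta

# Provisional threshold from the post: 20 minutes without a new event.
STALE_AFTER = timedelta(minutes=20)


def stalest_origin(last_event_by_origin, now, stale_after=STALE_AFTER):
    """Return (origin, age, is_stale) for the origin heard from least recently.

    last_event_by_origin maps a node id to the newest ev_timestamp seen
    from that node.  This input shape is an assumption of the sketch.
    """
    if not last_event_by_origin:
        raise ValueError("no origins to check")
    # The smallest timestamp belongs to the origin that has been quiet longest.
    origin, last_seen = min(last_event_by_origin.items(), key=lambda kv: kv[1])
    age = now - last_seen
    return origin, age, age > stale_after
```

    Checking the quietest origin rather than the single newest event
    means one healthy origin cannot mask another that has stopped
    delivering events.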

 b) It should submit a "restart node" if it notices, in the logs:

    FATAL  localListenThread: Another slon daemon is serving this node already

    Question: How exuberant should it be about this?  Tell all the
    nodes to restart?  Or just the offending one?
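
    Detecting that message could look roughly like the following.  It
    is a sketch only: it assumes the watchdog can read the slon log as
    lines of text, and needs_restart_node is a hypothetical helper name.
    It says nothing about which nodes to restart, which is still the
    open question.

```python
# Message text as quoted above; whitespace is normalized before matching
# because spacing in real log lines may differ.
DUPLICATE_SLON_MARKER = (
    "FATAL localListenThread: Another slon daemon is serving this node already"
)


def needs_restart_node(log_lines, marker=DUPLICATE_SLON_MARKER):
    """Return True if any log line contains the duplicate-daemon FATAL message."""
    for line in log_lines:
        normalized = " ".join(line.split())  # collapse runs of whitespace
        if marker in normalized:
            return True
    return False
```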

 c) If the slon process has died, it should restart it, and probably
    raise a "Help!  Call a DBA!" alert if this has happened too many
    times recently.
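
    "Too many times recently" can be made concrete with a sliding
    window.  A minimal sketch, assuming limits of 3 restarts per hour
    (both numbers are placeholders, not values proposed in this post);
    RestartTracker is a hypothetical name.

```python
from collections import deque


class RestartTracker:
    """Remember recent restarts and say when a human should be called."""

    def __init__(self, max_restarts=3, window_seconds=3600.0):
        # Both limits are illustrative placeholders.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, now):
        """Record a restart at time `now` (in seconds).

        Returns True when the restarts inside the window exceed
        max_restarts, i.e. when it is time to call a DBA.
        """
        self._restarts.append(now)
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

    The watchdog would call record_restart(time.time()) each time it
    restarts a dead slon, and send the alert when it returns True.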
-- 
let name="cbbrowne" and tld="ca.afilias.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)