[Slony1-general] slon + nagios monitoring

Mon Jan 31 16:37:37 PST 2005

Very useful info - many thanks.

Will have a thorough read through, and get a plugin script built in the 
next day or so (hopefully).

Thanks.

John Sidney-Woollett

Christopher Browne wrote:

> Andrew Sullivan <ajs at crankycanuck.ca> writes:
> 
>>On Sat, Jan 29, 2005 at 10:35:17AM +0000, John Sidney-Woollett wrote:
>>
>>>I think that the slon_watchdog2 script queries the database to see 
>>>whether the node is working properly - is this the best approach?
>>
>>It's a good approach in that it checks not only whether the daemon
>>is listening, but also whether it's actually working.  The latter
>>question is not a trivial one -- if something is wrong, you want to
>>know it sooner rather than later.  The watchdog also logs in a way
>>designed to be convenient for Nagios.  Chris can provide more
>>details.
> 
> 
> It's not all quite up to what managerial types might want to term
> "best practices."  (At some point I should go through the
> documentation and make sure the term gets used wherever it might
> apply, so as to make life easier for those Gentle Readers who have
> PHBs that get all excited about that...)
> 
> The watchdog _isn't_ particularly logging for Nagios; what is logging
> for Nagios is the separate "replication test" script,
> test_slony_replication.pl.
> 
> But yes, the way slon_watchdog2 queries the DB to see if the node is
> working is certainly pointing towards "best practices."  
> 
> - It isn't sufficient to simply see if the slon process is there, as
>   some threads may shut down without the whole thing dying
> 
> - It isn't sufficient to simply see if events are propagating: a big
>   COPY_SET under way represents One Big Event that might run for 28
>   hours if it's a big replication set.
> 
> Thus, whether a particular slon is "doing OK" seems best measured by
> the combination of:
> 
> 1. If it's not running ===> Problem
> 2. If it's running, and events are making it through ===> OK
> 3. If it's running, and a COPY_SET is in progress ===> Treat as OK
> 
> 4. What's left over:
>       -> It's running (#1 wasn't the case)
>       -> It's not propagating events (#2 wasn't the case)
>       -> There's no COPY_SET, so there's no good excuse for not
>          propagating events
> 
>    Which implies something's broken.  A specific scenario that we have
>    run into is where a node is running on a remote site via a WAN, and
>    the connectivity is a little flaky such that DB connections
>    sometimes silently die.
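
A minimal sketch of the four rules above, in Python rather than the Perl of the actual tools. The function and parameter names are hypothetical, and how each of the three facts is obtained (process check, event query, COPY_SET detection) is left out; this is not the code in slon_watchdog2.

```python
from enum import Enum


class SlonState(Enum):
    OK = "ok"
    PROBLEM = "problem"


def assess_slon(running: bool,
                events_propagating: bool,
                copy_set_in_progress: bool) -> SlonState:
    """Apply the four rules in order; the first match decides."""
    if not running:
        return SlonState.PROBLEM   # rule 1: no slon process
    if events_propagating:
        return SlonState.OK        # rule 2: events are making it through
    if copy_set_in_progress:
        return SlonState.OK        # rule 3: one long event, treat as OK
    return SlonState.PROBLEM       # rule 4: running, idle, and no excuse
```

Under a policy like the one described, only the rule 4 outcome (running but stuck) would call for restarting an existing slon; rule 1 means there is nothing running to restart.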
> 
> It may be that there are further problem scenarios worth pointing out,
> although the above logic has been pretty successful thus far, not so
> much for telling Nagios anything as for setting the policy by which
> we tell slon processes to restart themselves.
>   
> There is a set of queries in <tools/test_slony_state.pl> (see CVS
> HEAD) which rummage through the state of each node.  It seems to me
> that THAT is the way to get to further Nagios tests.  
> 
>  - Some of the tests check to see if sl_listen configuration seems
>    OK/is broken
> 
>  - Other tests look for growth of tables like sl_log_1, pg_listener
> 
>  - Others look to see if some nodes aren't confirming results
> 
>  - Still others look at whether there seem to be long running
>    transactions on some nodes.
> 
> These all represent sorts of problems that are at least somewhat
> orthogonal to whether slon is running.
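
As one illustration of the "nodes aren't confirming" kind of test, here is a hedged Python sketch. It assumes the caller has already fetched, by whatever queries it prefers, the latest event sequence number per origin node and the highest confirmed sequence number per (origin, receiver) pair. The function name, the input shapes, and the threshold are all hypothetical; the actual queries in test_slony_state.pl are not reproduced here.

```python
def lagging_receivers(latest_event, confirmed, max_lag):
    """Return (origin, receiver, lag) for receivers too far behind.

    latest_event: {origin: newest event seqno}            (assumed input)
    confirmed:    {(origin, receiver): confirmed seqno}   (assumed input)
    max_lag:      largest tolerated gap, in events        (hypothetical)
    """
    lagging = []
    for (origin, receiver), seqno in sorted(confirmed.items()):
        # If the origin is unknown, treat the receiver as caught up.
        lag = latest_event.get(origin, seqno) - seqno
        if lag > max_lag:
            lagging.append((origin, receiver, lag))
    return lagging
```

For example, with node 1 at event 1000, node 2 confirmed up to 995 and node 3 up to 400, a threshold of 100 flags only node 3.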
> 
> What is unfortunate is that Nagios has rather limited capabilities to
> cope with the possibility that multiple things are broken.  But given
> that that is a possible problem...
> 
> It seems to me that the "Best Practices" would be to improve the
> replication test by adding in the tests from the watchdog as well as
> some tests based on test_slony_state.pl.
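
One way a single Nagios check could report several broken things at once is to run every test, return the worst status, and list each failure in the one-line output. The sketch below assumes that approach; it relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The check names, the "SLONY" prefix, and the choice to rank UNKNOWN below WARNING are illustrative assumptions.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Assumed ranking for "worst": CRITICAL > WARNING > UNKNOWN > OK.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def summarize(results):
    """Collapse (name, status, message) tuples into (exit_code, output_line)."""
    if not results:
        return UNKNOWN, "SLONY UNKNOWN - no checks were run"
    worst = max((status for _, status, _ in results),
                key=SEVERITY.__getitem__)
    failing = [f"{name}: {msg}" for name, status, msg in results
               if status != OK]
    detail = "; ".join(failing) if failing else f"{len(results)} checks passed"
    return worst, f"SLONY {LABELS[worst]} - {detail}"
```

A plugin built this way would print the returned line and exit with the returned code, so the operator sees every failing test in one alert instead of only the first.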
> 
> At present, the replication test requires adding in a custom table,
> and does quite a bit of work pushing some largely useless updates
> through that table.  If sl_event/sl_confirm were more extensively
> tested, that would probably make the use of the table "slony_test"
> unnecessary.