Mon Jan 31 16:37:37 PST 2005
- Previous message: [Slony1-general] slon + nagios monitoring
- Next message: [Slony1-general] slon + nagios monitoring
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Very useful info - many thanks.

Will have a thorough read through, and get a plugin script built in the
next day or so (hopefully).

Thanks.

John Sidney-Woollett

Christopher Browne wrote:
> Andrew Sullivan <ajs at crankycanuck.ca> writes:
>
>>On Sat, Jan 29, 2005 at 10:35:17AM +0000, John Sidney-Woollett wrote:
>>
>>>I think that the slon_watchdog2 script queries the database to see
>>>whether the node is working properly - is this the best approach?
>>
>>It's a good approach in that it checks not only whether the daemon
>>is listening, but also whether it's actually working. The latter
>>question is not a trivial one -- if something is wrong, you want to
>>know it sooner rather than later. The watchdog also logs in a way
>>designed to be convenient for Nagios. Chris can provide more
>>details.
>
>
> It's not all quite up to what managerial types might want to term
> "best practices." (I at some point should go through the
> documentation to see about making sure that places where the term
> might apply, it gets used, so as to make life easier for those Gentle
> Readers who have PHBs that get all excited about that...)
>
> The watchdog _isn't_ particularly logging for Nagios; what is logging
> for Nagios is the separate "replication test" script,
> test_slony_replication.pl.
>
> But yes, the way slon_watchdog2 queries the DB to see if the node is
> working is certainly pointing towards "best practices."
>
> - It isn't sufficient to simply see if the slon process is there, as
>   some threads may shut down without the whole thing dying
>
> - It isn't sufficient to simply see if events are propagating, as if a
>   big COPY_SET is under way, that represents One Big Event that might
>   run for 28 hours if it's a big replication set.
>
> Thus, whether a particular slon is "doing OK" seems best measured by
> the combination of:
>
> 1. If it's not running ===> Problem
> 2. If it's running, and events are making it through ===> OK
> 3. If it's running, and a COPY_SET is in progress ===> Treat as OK
>
> 4. What's left over:
>    -> It's running (#1 wasn't the case)
>    -> It's not propagating events (#2 wasn't the case)
>    -> There's no COPY_SET, so there's no good excuse for not
>       propagating events
>
>    Which implies something's broken. A specific scenario that we have
>    run into is where a node is running on a remote site via a WAN, and
>    the connectivity is a little flakey such that DB connections
>    sometimes silently die.
>
> It may be that there are further problem scenarios worth pointing out,
> although the above logic has been pretty successful thus far. Not so
> much for telling Nagios anything, but rather for indicating policy
> where we tell slon processes to restart themselves.
>
> There are a set of queries in <tools/test_slony_state.pl> (see CVS
> HEAD) which rummage thru the state of each node. It seems to me that
> THAT is the way to get to further Nagios tests.
>
> - Some of the tests check to see if sl_listen configuration seems
>   OK/is broken
>
> - Other tests look for growth of tables like sl_log_1, pg_listener
>
> - Others look to see if some nodes aren't confirming results
>
> - Still others look at whether there seem to be long running
>   transactions on some nodes.
>
> These all represent sorts of problems that are at least somewhat
> orthogonal to whether slon is running.
>
> What is unfortunate is that Nagios has rather limited capabilities to
> cope with the possibility that multiple things are broken. But given
> that that is a possible problem...
>
> It seems to me that the "Best Practices" would be to add the tests
> from the watchdog as well as some tests based on test_slony_state.pl
> in to improve the replication test.
>
> At present, the replication test requires adding in a custom table,
> and does quite a bit of work surrounding pushing thru some largely
> useless updates thru that table. If sl_event/sl_confirm were more
> extensively tested, that would probably make the use of the table
> "slony_test" unnecessary.
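
The four-way decision in the quoted message (not running, events flowing, COPY_SET in progress, otherwise broken) can be sketched as a small classifier. This is only a minimal Python sketch under stated assumptions, not the logic of slon_watchdog2 or test_slony_replication.pl themselves: the function name `classify_slon` and its three boolean inputs are hypothetical, and how each input would be gathered (process check, queries against the node) is deliberately left out. The exit codes 0 and 2 follow the usual Nagios plugin convention for OK and CRITICAL.

```python
# Hypothetical sketch of the decision rules quoted above; not taken from
# slon_watchdog2 or any other Slony-I tool.

# Conventional Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2


def classify_slon(running, events_propagating, copy_set_in_progress):
    """Return (exit_code, message) for one slon, following rules 1-4."""
    if not running:
        # Rule 1: no slon process at all.
        return NAGIOS_CRITICAL, "slon is not running"
    if events_propagating:
        # Rule 2: running and events are making it through.
        return NAGIOS_OK, "slon running, events propagating"
    if copy_set_in_progress:
        # Rule 3: one long-running COPY_SET explains the lack of new events.
        return NAGIOS_OK, "slon running, COPY_SET in progress"
    # Rule 4: running, no events, and no COPY_SET to excuse it.
    return NAGIOS_CRITICAL, "slon running but not propagating events"


if __name__ == "__main__":
    code, message = classify_slon(
        running=True, events_propagating=False, copy_set_in_progress=False
    )
    print(code, message)
```

In the quoted message this logic is described as driving the policy for restarting slon processes rather than for alerting Nagios, so the exit codes here are just one possible way to expose the result.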