bugzilla-daemon at main.slony.info
Fri Dec 3 13:45:56 PST 2010
http://www.slony.info/bugzilla/show_bug.cgi?id=175

           Summary: Monitoring cluster better
           Product: Slony-I
           Version: devel
          Platform: All
                URL: http://wiki.postgresql.org/wiki/SlonyBrainstorming#Slon_monitoring
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: low
         Component: core scripts
        AssignedTo: slony1-bugs at lists.slony.info
        ReportedBy: cbbrowne at ca.afilias.info
                CC: slony1-bugs at lists.slony.info
   Estimated Hours: 0.0


* slon records in a queryable form what it's working on
* Requires writing (+COMMIT) at the start of the event loop
* [As noted by Yehuda] This ought to also be useful with slonik, allowing it to
report what it is currently doing

Debate took place over various mechanisms.  The only one without dramatically
unacceptable properties was to add a table to the Slony-I schema.

; SNMP
: Existing standard for network-based monitoring
: Downside - requires extra infrastructure including libraries and possibly
additional tooling to use it
; [http://www.spread.org/ Spread]
: Fast
: Downside - requires extra infrastructure that's not particularly "standard"
; NOTIFY/LISTEN
: Built-in to Postgres - no new infrastructure needed
: In Postgres 9.0+ a NOTIFY can carry a payload (see the sketch after this list)
: Only becomes visible upon COMMIT
: Results are not stored; listeners must pre-declare their interest
; SQL table
: Built-in to Postgres - no new infrastructure needed
: Only becomes visible upon COMMIT
: Another table to manage and clean up
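
As an illustration only, here is a minimal sketch of how the payload-bearing
form might be used for monitoring; the channel name and message format are
hypothetical:

<pre>
-- a monitoring client declares its interest ahead of time
LISTEN slony_monitor;

-- a component reports its state; the payload form requires PostgreSQL 9.0+
NOTIFY slony_monitor, 'node 1: remote worker applying SYNC 5004217';
-- or, from inside a function:
SELECT pg_notify('slony_monitor', 'node 1: remote worker applying SYNC 5004217');

-- note: the notification is delivered only after COMMIT, and only to sessions
-- that had already issued LISTEN; nothing is stored for later inspection
</pre>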

==== Monitoring Requirements ====

Crucial facts that we want to know about:
# Is replication behind?
# What are components (e.g. - slon, slonik) doing? <BR> Note that slon has
numerous threads
# Recent abnormal states for events (e.g. - error messages)
# Are any non-SYNC events outstanding?
# Backlog volume?
# What is the cluster's configuration?

===== Replication Behind? =====

The existing view sl_status captures this from a what-is-confirmed perspective.
That is not perfect, but it is not obvious that enhancing it is a high priority.
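
For instance, assuming the usual sl_status columns (the cluster name
''mycluster'' below is only a placeholder), lag can be checked with something
like:

<pre>
-- per-subscriber lag as seen from the origin, based on confirmations
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
  FROM _mycluster.sl_status
 ORDER BY st_received;
</pre>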

===== What are components doing? =====

Nothing relevant is captured in a usable fashion.

It is thought that we may add a table in which each thread records
''what am I doing?'', with each new entry replacing whatever the thread
previously recorded (a possible layout is sketched below).

This table would contain a tuple for:
# Each remote worker thread
# Cleanup thread
# Each remote listener thread
# Local SYNC thread

It would track things such as:
# Time processing started
# What thread/process am I?
# What node am I for?
# What am I doing?  <BR> Possibly in several pieces, to cover the following
sorts of facts:
## Event ID
## Event type <BR> Though this could be pulled from sl_event, given node and
event ID
## Additional event activity <BR> Again, could be pulled from sl_event, given
node and event ID

Note that the contents of this table should be quite tiny; a tuple per slon
thread on a node.
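
As a rough sketch of what such a table might look like (the table and column
names here are hypothetical, not a settled schema):

<pre>
-- one row per slon thread on the node; each update overwrites the thread's
-- previous entry, so the table stays tiny
CREATE TABLE _mycluster.sl_components (
    co_actor      text        NOT NULL PRIMARY KEY, -- e.g. 'remote worker 2', 'cleanup'
    co_pid        integer     NOT NULL,             -- slon process id
    co_node       integer     NOT NULL,             -- node the thread is working for
    co_starttime  timestamptz NOT NULL,             -- when the current activity began
    co_event      bigint,                           -- event id being processed, if any
    co_eventtype  text,                             -- e.g. 'SYNC'; could instead be
                                                    -- looked up in sl_event
    co_activity   text                              -- free-form "what am I doing?"
);
</pre>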

This also needs to be able to capture what '''slonik''' is doing; this seems
more troublesome.
# It is possible to have multiple slonik instances acting concurrently -
multiple concurrent events!
# There is no natural "event loop" such that slonik activities would be
expected to clean themselves up over time

====== Suggested slon implementation ======

Two approaches emerged for establishing the connections used to capture this
monitoring data:
# Each thread opens its own DB connection <BR> Unacceptable: leads to an
''enormous'' increase in the number of DB connections, most of them sitting idle
# Establish a "monitoring thread"
## A message queue allows other threads to stow entries (complete with
timestamps) that the monitoring thread periodically flushes to the database, as
sketched below
## It is plausible that this thread could be merged into the existing local
SYNC thread, which is not terribly busy
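
A sketch of the flush the monitoring thread might perform for one queued entry,
reusing the hypothetical sl_components table above (values are illustrative):

<pre>
UPDATE _mycluster.sl_components
   SET co_starttime = '2010-12-03 13:45:00-08', -- timestamp taken when queued
       co_event     = 5004217,
       co_eventtype = 'SYNC',
       co_activity  = 'applying log rows'
 WHERE co_actor = 'remote worker 2';

-- if no row was updated, the monitoring thread INSERTs one instead; the whole
-- batch of queued entries is flushed in a single transaction, so the new state
-- becomes visible to observers at COMMIT
</pre>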

===== Recent abnormal states for events =====

This captures messages about the most recent problem that occurred, storing
(see the sketch after this list):
# Time of abnormality
# Event ID
# Node ID
# Description / Error Message
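
A hypothetical layout for such a log table (names are illustrative only):

<pre>
-- most recent abnormal condition(s); kept deliberately small
CREATE TABLE _mycluster.sl_event_errors (
    ee_time     timestamptz NOT NULL,  -- when the abnormality was observed
    ee_origin   integer     NOT NULL,  -- node the event originated on
    ee_seqno    bigint      NOT NULL,  -- event id
    ee_message  text        NOT NULL   -- description / error message
);
</pre>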

===== non-SYNC events outstanding? =====

This information is already captured, and may be revealed by running a query on
the source node (for instance the sketch below) that asks for all events that:
# Are not SYNC events
# Have not yet been confirmed by the subscribers
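
Assuming the usual sl_event / sl_confirm / sl_node layout (the cluster name
''mycluster'' is only a placeholder), a query along these lines, run on the
source node, would do:

<pre>
-- non-SYNC events not yet confirmed by every other node
SELECT e.ev_origin, e.ev_seqno, e.ev_type, e.ev_timestamp
  FROM _mycluster.sl_event e
 WHERE e.ev_type <> 'SYNC'
   AND EXISTS (
         SELECT 1
           FROM _mycluster.sl_node n
          WHERE n.no_id <> e.ev_origin
            AND NOT EXISTS (
                  SELECT 1
                    FROM _mycluster.sl_confirm c
                   WHERE c.con_origin   = e.ev_origin
                     AND c.con_received = n.no_id
                     AND c.con_seqno   >= e.ev_seqno));
</pre>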

===== Backlog volume =====

[http://www.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]

This seems troublesome; calculating the number of sl_log_* tuples involved in a
particular SYNC requires running the same complex query that the remote_worker
thread uses to determine which tuples are to be applied.

This query is complex to generate and fairly expensive to run.

Note that [http://www.slony.info/bugzilla/show_bug.cgi?id=167 Bug #167] is
changing this query.

===== Cluster configuration =====

There is an existing tool that does some analysis of cluster configuration; see
[http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master
test_slony_state.pl]

It is desirable to have something that generates diagrams of the relationships
between nodes (see the sketch after this list), capturing:
# Nodes
# Subscription Sets, and the paths they take
# Paths between nodes
# Listen paths
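
As a starting point, the path information could be turned into a Graphviz
diagram with something as simple as the following (the cluster name is, again,
a placeholder):

<pre>
-- emit one Graphviz edge per communication path between nodes;
-- wrap the output in "digraph slony { ... }" and feed it to dot
SELECT 'node' || pa_client || ' -> node' || pa_server || ';'
  FROM _mycluster.sl_path
 ORDER BY pa_client, pa_server;
</pre>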

It would be nice for the Subscription Set diagram to indicate the replication
state/lag of each node, including things like:
# Event Number
# Events Behind Parent
# Time Behind Parent
# Events Behind Origin
# Time Behind Origin
