Fri Dec 3 13:45:56 PST 2010
- Previous message: [Slony1-bugs] [Bug 174] Isolated Nodes
- Next message: [Slony1-bugs] [Bug 175] Monitoring cluster better
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
http://www.slony.info/bugzilla/show_bug.cgi?id=175

           Summary: Monitoring cluster better
           Product: Slony-I
           Version: devel
          Platform: All
               URL: http://wiki.postgresql.org/wiki/SlonyBrainstorming#Slon_monitoring
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: low
         Component: core scripts
        AssignedTo: slony1-bugs at lists.slony.info
        ReportedBy: cbbrowne at ca.afilias.info
                CC: slony1-bugs at lists.slony.info
   Estimated Hours: 0.0

* slon records, in a queryable form, what it's working on
* Requires writing (+COMMIT) at the start of the event loop
* [As noted by Yehuda] This ought to also be useful with slonik, to allow indicating "whazzup?"

Debate took place surrounding various mechanisms. The only one without dramatically unacceptable properties was to add a table to the Slony-I schema.

; SNMP
: Existing standard for network-based monitoring
: Downside - requires extra infrastructure, including libraries and possibly additional tooling to use it

; [http://www.spread.org/ Spread]
: Fast
: Downside - requires extra infrastructure that's not particularly "standard"

; NOTIFY/LISTEN
: Built in to Postgres - no new infrastructure needed
: In Postgres 9.0+, can carry a payload
: Only becomes visible upon COMMIT
: Results are not stored; listeners must pre-declare their interest

; SQL table
: Built in to Postgres - no new infrastructure needed
: Only becomes visible upon COMMIT
: Another table to manage and clean up

==== Monitoring Requirements ====

Crucial facts that we want to know about:

# Is replication behind?
# What are components (e.g. slon, slonik) doing? <BR> Note that slon has numerous threads
# Recent abnormal states for events (e.g. error messages)
# Are any non-SYNC events outstanding?
# Backlog volume?
# What is the cluster's configuration?

===== Replication Behind? =====

The existing view sl_status captures this from a what-is-confirmed perspective. That is not perfect, but it is not obvious that enhancing it is a high priority.
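As a concrete illustration of the confirmation-based perspective mentioned above, a query against sl_status might look like the following sketch. The schema name "_mycluster" is a placeholder (it depends on the cluster name), and the column selection is illustrative:

```sql
-- Sketch: how far is each subscriber behind, per confirmations?
-- "_mycluster" is a placeholder; the schema name depends on the cluster name.
SELECT st_origin,          -- node the events originate on
       st_received,        -- subscriber node confirming them
       st_lag_num_events,  -- events not yet confirmed
       st_lag_time         -- elapsed time behind the origin
  FROM "_mycluster".sl_status
 ORDER BY st_lag_time DESC;
```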
===== What are components doing? =====

Nothing relevant is captured in a usable fashion. The thought is that we may add a table where each thread would record ''what am I doing?'', with each new entry replacing whatever that thread previously recorded.

This table would contain a tuple for:

# Each remote worker thread
# The cleanup thread
# Each remote listener thread
# The local SYNC thread

It would track things such as:

# Time processing started
# What thread/process am I?
# What node am I for?
# What am I doing? <BR> Possibly in several pieces, to cover the following sorts of facts:
## Event ID
## Event type <BR> Though this could be pulled from sl_event, given node and event ID
## Additional event activity <BR> Again, this could be pulled from sl_event, given node and event ID

Note that the contents of this table should be quite tiny: a tuple per slon thread on a node.

This also needs to be able to capture what '''slonik''' is doing, which seems more troublesome:

# It is possible to have multiple slonik instances acting concurrently - multiple concurrent events!
# There is no natural "event loop" through which slonik activities would be expected to clean themselves up over time

====== Suggested slon implementation ======

Two approaches emerged for establishing the connections to capture this monitoring data:

# Each thread opens its own DB connection <BR> Unacceptable: leads to an ''enormous'' increase in the number of DB connections, most of them basically idle
# Establish a "monitoring thread"
## A message queue allows other threads to stow entries (complete with timestamps) that the monitoring thread periodically flushes to the database
## It is plausible that this thread could be merged into the existing local SYNC thread, which isn't terribly busy

===== Recent abnormal states for events =====

This captures messages about the most recent problem that occurred, storing:

# Time of abnormality
# Event ID
# Node ID
# Description / error message
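A hypothetical sketch of the per-thread activity table proposed above follows. The table and column names are purely illustrative; nothing here is part of the Slony-I schema, and "_mycluster" is again a placeholder for the cluster's schema:

```sql
-- Hypothetical sketch of the proposed "what am I doing?" table.
-- Names are illustrative only, not part of the Slony-I schema.
-- One tuple per slon thread, overwritten in place on each update.
CREATE TABLE "_mycluster".sl_thread_activity (
    ta_node      integer     NOT NULL,  -- node this slon serves
    ta_pid       integer     NOT NULL,  -- process identity
    ta_thread    text        NOT NULL,  -- 'remote worker', 'cleanup', ...
    ta_starttime timestamptz NOT NULL,  -- time processing started
    ta_event     bigint,                -- event ID being processed, if any
    ta_activity  text,                  -- free-form description of activity
    PRIMARY KEY (ta_node, ta_pid, ta_thread)
);
```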
===== non-SYNC events outstanding? =====

This information is already captured, and may be revealed by running a query that asks, on the source node, for all events that:

# Are not SYNC events
# Have not been confirmed by the subscriber

===== Backlog volume =====

[http://www.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]

This seems troublesome; calculating the number of sl_log_* tuples involved in a particular SYNC requires running the same complex query that the remote_worker thread uses to determine which tuples are to be applied. That query is complex to generate and fairly expensive to run. Note that [http://www.slony.info/bugzilla/show_bug.cgi?id=167 Bug #167] is changing this query.

===== Cluster configuration =====

There is an existing tool that does some analysis of cluster configuration; see [http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master test_slony_state.pl]

It would be desirable to have something that generates diagrams of the relationships between nodes, capturing:

# Nodes
# Subscription sets, and the paths they take
# Paths between nodes
# Listen paths

It would be nice for the subscription set diagram to include an indication of replication state/lag for each node, showing things like:

# Event number
# Events behind parent
# Time behind parent
# Events behind origin
# Time behind origin

-- 
Configure bugmail: http://www.slony.info/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are the assignee for the bug.