CVS User Account cvsuser
Wed Mar 29 08:48:17 PST 2006
Log Message:
-----------
Add FAQ entry to explain the sl_nodelock interlock error message.

Made &lslon; a short form for <xref linkend="slon">.

Modified Files:
--------------
    slony1-engine/doc/adminguide:
        bestpractices.sgml (r1.15 -> r1.16)
        faq.sgml (r1.54 -> r1.55)
        man.sgml (r1.6 -> r1.7)
        plainpaths.sgml (r1.11 -> r1.12)
        slony.sgml (r1.28 -> r1.29)

-------------- next part --------------
Index: man.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/man.sgml,v
retrieving revision 1.6
retrieving revision 1.7
diff -Ldoc/adminguide/man.sgml -Ldoc/adminguide/man.sgml -u -w -r1.6 -r1.7
--- doc/adminguide/man.sgml
+++ doc/adminguide/man.sgml
@@ -45,6 +45,7 @@
   <!ENTITY slconfirm "<envar>sl_confirm</envar>">
   <!ENTITY bestpracticelink "Best Practice">
   <!ENTITY pglistener "<envar>pg_listener</envar>">
+  <!ENTITY lslon "<xref linkend=slon>">
 ]>
 
 <book id="slony">
Index: plainpaths.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/plainpaths.sgml,v
retrieving revision 1.11
retrieving revision 1.12
diff -Ldoc/adminguide/plainpaths.sgml -Ldoc/adminguide/plainpaths.sgml -u -w -r1.11 -r1.12
--- doc/adminguide/plainpaths.sgml
+++ doc/adminguide/plainpaths.sgml
@@ -24,15 +24,15 @@
 connections using <link linkend="tunnelling">SSH
 tunnelling</link>.</para></listitem>
 
-<listitem><para> The <xref linkend="slon"> DSN parameter. </para> 
+<listitem><para> The &lslon; DSN parameter. </para>
 
-<para> The DSN parameter passed to each <xref linkend="slon">
-indicates what network path should be used to get from the slon
-process to the database that it manages.</para> </listitem>
+<para> The DSN parameter passed to each &lslon; indicates what network
+path should be used to get from the &lslon; process to the database
+that it manages.</para> </listitem>
 
 <listitem><para> <xref linkend="stmtstorepath"> - controlling how
-<xref linkend="slon"> daemons communicate with remote nodes.  These
-paths are stored in <xref linkend="table.sl-path">.</para>
+&lslon; daemons communicate with remote nodes.  These paths are stored
+in <xref linkend="table.sl-path">.</para>
 
 <para> You forcibly <emphasis>need</emphasis> to have a path between
 each subscriber node and its provider; other paths are optional, and
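
The paths that <xref linkend="stmtstorepath"> records end up in sl_path, so
they can be inspected directly on any node.  A minimal sketch, assuming a
cluster schema named "_clustername" and the stock sl_path column names
(pa_server, pa_client, pa_conninfo):

    -- Illustrative only: list the network paths this node knows about.
    -- Assumes the cluster schema is "_clustername" and the usual Slony-I
    -- column names pa_server, pa_client, pa_conninfo.
    SELECT pa_server, pa_client, pa_conninfo
      FROM "_clustername".sl_path
     ORDER BY pa_server, pa_client;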
Index: bestpractices.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/bestpractices.sgml,v
retrieving revision 1.15
retrieving revision 1.16
diff -Ldoc/adminguide/bestpractices.sgml -Ldoc/adminguide/bestpractices.sgml -u -w -r1.15 -r1.16
--- doc/adminguide/bestpractices.sgml
+++ doc/adminguide/bestpractices.sgml
@@ -121,18 +121,18 @@
 </listitem>
 
 <listitem>
-<para> Running all of the <xref linkend="slon"> daemons on a
+<para> Running all of the &lslon; daemons on a
 central server for each network has proven preferable. </para> 
 
-<para> Each <xref linkend="slon"> should run on a host on the same
+<para> Each &lslon; should run on a host on the same
 local network as the node that it is servicing, as it does a
 <emphasis>lot</emphasis> of communications with its database.  </para>
 
 <para> In theory, the <quote>best</quote> speed would come from
-running the <xref linkend="slon"> on the database server that it is
+running the &lslon; on the database server that it is
 servicing. </para>
 
-<para> In practice, having the <xref linkend="slon"> processes strewn
+<para> In practice, having the &lslon; processes strewn
 across a dozen servers turns out to be really inconvenient to manage,
 as making changes to their configuration requires logging onto a whole
 bunch of servers.  In environments where it is necessary to use
@@ -248,9 +248,9 @@
 </listitem>
 
 <listitem>
-<para> Configuring <xref linkend="slon"> </para> 
+<para> Configuring &lslon; </para> 
 
-<para> As of version 1.1, <xref linkend="slon"> configuration may be
+<para> As of version 1.1, &lslon; configuration may be
 drawn either from the command line or from configuration files.
 <quote>Best</quote> practices have yet to emerge from the two
 options:</para>
@@ -266,7 +266,7 @@
 active are visible in the process environment.  (And if there are a
 lot of them, they may be a nuisance to read.)</para>
 
-<para> Unfortunately, if you invoke <xref linkend="slon"> from the
+<para> Unfortunately, if you invoke &lslon; from the
 command line, you could <emphasis>forget</emphasis> to include
 &logshiplink; configuration and thereby destroy the sequence of logs
 for a log shipping node. </para>
@@ -274,7 +274,7 @@
 
 <listitem> <para> Unlike when command line options are used, the
 active options are <emphasis>not</emphasis> visible.  They can only be
-inferred from the name and/or contents of the <xref linkend="slon">
+inferred from the name and/or contents of the &lslon;
 configuration file, and will not reflect subsequent changes to the
 configuration file.  </para>
 
@@ -371,7 +371,7 @@
 </para> 
 
 <para> Several things can be done that will help, involving
-careful selection of <xref linkend="slon"> parameters:</para>
+careful selection of &lslon; parameters:</para>
 </listitem>
 </itemizedlist>
 
@@ -383,13 +383,13 @@
 before version 1.1.1, see <filename> slony1_base.sql </filename> for
 the exact form that the index setup should take. </para> </listitem>
 
-<listitem><para> On the subscriber's <xref linkend="slon">, increase
+<listitem><para> On the subscriber's &lslon;, increase
 the number of <command>SYNC</command> events processed together, with
 the <xref linkend= "slon-config-sync-group-maxsize"> parameter to some
 value that allows it to process a significant portion of the
 outstanding <command>SYNC</command> events. </para> </listitem>
 
-<listitem><para> On the subscriber's <xref linkend="slon">, set the
+<listitem><para> On the subscriber's &lslon;, set the
 <xref linkend="slon-config-desired-sync-time"> to 0, as the adaptive
 <command>SYNC</command> grouping system will start with small
 groupings that will, under these circumstances, perform
Index: faq.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/faq.sgml,v
retrieving revision 1.54
retrieving revision 1.55
diff -Ldoc/adminguide/faq.sgml -Ldoc/adminguide/faq.sgml -u -w -r1.54 -r1.55
--- doc/adminguide/faq.sgml
+++ doc/adminguide/faq.sgml
@@ -99,7 +99,7 @@
 <qandaentry id="threadsafety">
 
 <question><para> &slony1; seemed to compile fine; now, when I run a
-<xref linkend="slon">, some events are moving around, but no
+&lslon;, some events are moving around, but no
 replication is taking place.</para>
 
 <para> Slony logs might look like the following:
@@ -170,7 +170,7 @@
 <question><para>I looked for the <envar>_clustername</envar> namespace, and
 it wasn't there.</para></question>
 
-<answer><para> If the DSNs are wrong, then <xref linkend="slon">
+<answer><para> If the DSNs are wrong, then &lslon;
 instances can't connect to the nodes.</para>
 
 <para>This will generally lead to nodes remaining entirely untouched.</para>
@@ -395,6 +395,67 @@
 further at <xref linkend="listenpaths"> </para></answer>
 </qandaentry>
 
+<qandaentry id="multipleslonconnections">
+
+<question><para> I was starting a &lslon;, and got the
+following <quote>FATAL</quote> messages in its logs.  What's up??? </para>
+<screen>
+2006-03-29 16:01:34 UTC CONFIG main: slon version 1.2.0 starting up
+2006-03-29 16:01:34 UTC DEBUG2 slon: watchdog process started
+2006-03-29 16:01:34 UTC DEBUG2 slon: watchdog ready - pid = 28326
+2006-03-29 16:01:34 UTC DEBUG2 slon: worker process created - pid = 28327
+2006-03-29 16:01:34 UTC CONFIG main: local node id = 1
+2006-03-29 16:01:34 UTC DEBUG2 main: main process started
+2006-03-29 16:01:34 UTC CONFIG main: launching sched_start_mainloop
+2006-03-29 16:01:34 UTC CONFIG main: loading current cluster configuration
+2006-03-29 16:01:34 UTC CONFIG storeSet: set_id=1 set_origin=1 set_comment='test set'
+2006-03-29 16:01:34 UTC DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + worker signaled)
+2006-03-29 16:01:34 UTC DEBUG2 main: last local event sequence = 7
+2006-03-29 16:01:34 UTC CONFIG main: configuration complete - starting threads
+2006-03-29 16:01:34 UTC DEBUG1 localListenThread: thread starts
+2006-03-29 16:01:34 UTC FATAL  localListenThread: "select "_test1538".cleanupNodelock(); insert into "_test1538".sl_nodelock values (    1, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key violates unique constraint "sl_nodelock-pkey"
+
+2006-03-29 16:01:34 UTC FATAL  Do you already have a slon running against this node?
+2006-03-29 16:01:34 UTC FATAL  Or perhaps a residual idle backend connection from a dead slon?
+</screen>
+
+</question>
+
+<answer><para> The table <envar>sl_nodelock</envar> is used as an
+<quote>interlock</quote> to prevent two &lslon; processes from trying
+to manage the same node at the same time.  The &lslon; tries inserting
+a record into the table; it can only succeed if it is the only node
+manager. </para></answer>
+
+<answer><para> This error message is typically a sign that you have
+started up a second &lslon; process for a given node.  The &lslon; asks
+the obvious question: <quote>Do you already have a slon running
+against this node?</quote> </para></answer>
+
+<answer><para> Supposing you experience some sort of network outage,
+the connection between the &lslon; and the database may fail, and the
+&lslon; may notice this long before the &postgres; instance it was
+connected to does.  The result is a number of idle connections left on
+the database server that won't be closed out until the TCP/IP timeouts
+complete, which normally seems to take about two hours.  For that
+two-hour period, the &lslon; will try to connect, over and over, and
+will get the above fatal message, over and
+over. </para>
+
+<para> An administrator may clean this out by logging onto the server
+and issuing <command>kill -2</command> against any of the offending
+connections.  Unfortunately, since the problem took place within the
+networking layer, neither &postgres; nor &slony1; has a direct way of
+detecting this. </para>
+
+<para> You can <emphasis>mostly</emphasis> avoid this by making sure
+that &lslon; processes always run close to the server that
+each one manages.  If the &lslon; runs on the same server as the
+database it manages, any <quote>networking failure</quote> that could
+interrupt local connections would likely be serious enough to
+threaten the entire server.  </para></answer>
+</qandaentry>
+
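
To make the interlock concrete, here is roughly what the exchange above
boils down to in SQL, using the cluster schema "_test1538" from the log
output; the statements are the ones quoted in the FATAL message:

    -- The first slon to register against node 1 succeeds:
    select "_test1538".cleanupNodelock();
    insert into "_test1538".sl_nodelock
           values (1, 0, "pg_catalog".pg_backend_pid());

    -- A second slon (or one retrying while a stale backend still holds
    -- the lock row) runs the same statements and fails on the primary key:
    --   ERROR:  duplicate key violates unique constraint "sl_nodelock-pkey"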
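
To track down the residual connection, it can help to compare the backend
PID recorded in sl_nodelock with what pg_stat_activity reports.  A sketch,
assuming the stock sl_nodelock column names (nl_nodeid, nl_backendpid) and
the pg_stat_activity layout of PostgreSQL releases of that era (procpid,
current_query):

    -- Illustrative only: match the PID each slon recorded in sl_nodelock
    -- against the backends PostgreSQL still shows as connected.
    SELECT nl.nl_nodeid,
           nl.nl_backendpid,
           sa.usename,
           sa.current_query,
           sa.query_start
      FROM "_test1538".sl_nodelock nl
      LEFT JOIN pg_catalog.pg_stat_activity sa
             ON sa.procpid = nl.nl_backendpid;
    -- A backend sitting <IDLE> long after its slon died is the residual
    -- connection; kill -2 against that PID on the database server clears
    -- it, as described above.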
 </qandadiv>
 
 <qandadiv id="faqconfiguration"> <title> &slony1; FAQ: Configuration Issues </title>
@@ -650,7 +711,7 @@
 
 </question>
 
-<answer><para> If you see a <xref linkend="slon"> shutting down with
+<answer><para> If you see a &lslon; shutting down with
 <emphasis>ignore new events due to shutdown</emphasis> log entries,
 you typically need to step back in the log to
 <emphasis>before</emphasis> they started failing to see indication of
@@ -717,7 +778,7 @@
 
 </para></question>
 
-<answer><para> If you see a <xref linkend="slon"> shutting down with
+<answer><para> If you see a &lslon; shutting down with
 <emphasis>ignore new events due to shutdown</emphasis> log entries,
 you'll typically have to step back to <emphasis>before</emphasis> they
 started failing to see indication of the root cause of the problem.
@@ -988,7 +1049,7 @@
 linkend="table.sl-log-1">/<xref linkend="table.sl-log-2"> tables, and
 do a summary to see if there are any really enormous &slony1;
 transactions in there.  Up until at least 1.0.2, there needs to be a
-<xref linkend="slon"> connected to the origin in order for
+&lslon; connected to the origin in order for
 <command>SYNC</command> events to be generated.</para>
 
 <para>If none are being generated, then all of the updates until the
@@ -1046,7 +1107,7 @@
 </qandadiv>
 <qandadiv id="faqbugs"> <title> &slony1; FAQ: &slony1; Bugs in Elder Versions </title>
 <qandaentry>
-<question><para>The <xref linkend="slon"> processes servicing my
+<question><para>The &lslon; processes servicing my
 subscribers are growing to enormous size, challenging system resources
 both in terms of swap space as well as moving towards breaking past
 the 2GB maximum process size on my system. </para> 
@@ -1056,14 +1117,14 @@
 Perhaps that is somehow relevant? </para> </question>
 
 <answer> <para> Yes, those very large records are at the root of the
-problem.  The problem is that <xref linkend="slon"> normally draws in
+problem.  The problem is that &lslon; normally draws in
 about 100 records at a time when a subscriber is processing the query
 which loads data from the provider.  Thus, if the average record size
 is 10MB, this will draw in 1000MB of data which is then transformed
 into <command>INSERT</command> or <command>UPDATE</command>
-statements, in the <xref linkend="slon"> process' memory.</para>
+statements, in the &lslon; process' memory.</para>
 
-<para> That obviously leads to <xref linkend="slon"> growing to a
+<para> That obviously leads to &lslon; growing to a
 fairly tremendous size. </para>
 
 <para> The number of records that are fetched is controlled by the
@@ -1085,7 +1146,7 @@
 
 <para> If you are experiencing this problem, you might modify the
 definition of <envar> SLON_DATA_FETCH_SIZE </envar>, perhaps reducing
-by a factor of 10, and recompile <xref linkend="slon">.  There are two
+by a factor of 10, and recompile &lslon;.  There are two
 definitions as <envar> SLON_CHECK_CMDTUPLES</envar> allows doing some
 extra monitoring to ensure that subscribers have not fallen out of
 SYNC with the provider.  By default, this option is turned off, so the
@@ -1112,7 +1173,7 @@
 consumption of this sort to about 5MB.  This value is not a strict
 upper bound; if you have a tuple with attributes 50MB in size, it
 forcibly <emphasis>must</emphasis> be loaded into memory.  There is no
-way around that.  But <xref linkend="slon"> at least won't be trying
+way around that.  But &lslon; at least won't be trying
 to load in 100 such records at a time, chewing up 10GB of memory by
 the time it's done.  </para> </listitem>
 </itemizedlist>
@@ -1327,7 +1388,7 @@
 <qandadiv id="faqobsolete"> <title> &slony1; FAQ: Hopefully Obsolete Issues </title>
 
 <qandaentry>
-<question><para> <xref linkend="slon"> does not restart after
+<question><para> &lslon; does not restart after
 crash</para>
 
 <para> After an immediate stop of &postgres; (simulation of system
@@ -1798,7 +1859,7 @@
 <qandaentry> 
 
 <question><para> Node #1 was dropped via <xref
-linkend="stmtdropnode">, and the <xref linkend="slon"> one of the
+linkend="stmtdropnode">, and the &lslon; for one of the
 other nodes is repeatedly failing with the error message:</para>
 
 <screen>
Index: slony.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/slony.sgml,v
retrieving revision 1.28
retrieving revision 1.29
diff -Ldoc/adminguide/slony.sgml -Ldoc/adminguide/slony.sgml -u -w -r1.28 -r1.29
--- doc/adminguide/slony.sgml
+++ doc/adminguide/slony.sgml
@@ -46,6 +46,7 @@
   <!ENTITY rplainpaths "<xref linkend=plainpaths>">
   <!ENTITY rlistenpaths "<xref linkend=listenpaths>">
   <!ENTITY pglistener "<envar>pg_listener</envar>">
+  <!ENTITY lslon "<xref linkend=slon>">
 
 ]>
 


