CVS User Account cvsuser
Wed Feb 2 18:34:56 PST 2005
Log Message:
-----------
Improved discussion of FAILOVER and why it is vital that it abandon the
failed node.

Modified Files:
--------------
    slony1-engine/doc/adminguide:
        failover.sgml (r1.7 -> r1.8)
        faq.sgml (r1.13 -> r1.14)
        slonik_ref.sgml (r1.9 -> r1.10)

-------------- next part --------------
Index: slonik_ref.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/slonik_ref.sgml,v
retrieving revision 1.9
retrieving revision 1.10
diff -Ldoc/adminguide/slonik_ref.sgml -Ldoc/adminguide/slonik_ref.sgml -u -w -r1.9 -r1.10
--- doc/adminguide/slonik_ref.sgml
+++ doc/adminguide/slonik_ref.sgml
@@ -1693,9 +1693,9 @@
     <title>Description</title>
     
     <para>
-     The failover command causes the backup node to take over all sets
+     The <command>FAILOVER</command> command causes the backup node to take over all sets
      that currently originate on the failed
-     node. <application>Slonik</application> will contact all other
+     node. <application>slonik</application> will contact all other
      direct subscribers of the failed node to determine which node has
      the highest sync status for each set. If another node has a
      higher sync status than the backup node, the replication will
@@ -1703,19 +1703,21 @@
      that other node, before assuming the origin role and allowing
      update activity.
     </para>
+
     <para>
      After successful failover, all former direct subscribers of the
      failed node become direct subscribers of the backup node. The
-     failed node can and should be removed from the configuration with
-     <command><link linkend="stmtdropnode"> DROP NODE</link>
-     </command>.
+     failed node is abandoned, and can and should be removed from the
+     configuration with <command><link linkend="stmtdropnode"> DROP
+     NODE</link> </command>.
     </para>
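+
+<para> For illustration, failing over from failed origin node 1 to
+backup node 2 would be invoked as follows (the node ids here are
+examples only):
+
+<programlisting>
+failover (id = 1, backup node = 2);
+</programlisting>
+</para>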
     
     <warning><para> This command will abandon the status of the failed
       node.  There is no possibility to let the failed node join the
-      cluster again without rebuilding it from scratch as a slave slave.
-      It would often be highly preferable to use <command> <link
-	linkend="stmtmoveset"> MOVE SET </link> </command> instead.
+      cluster again without rebuilding it from scratch as a slave.  If
+      at all possible, you would likely prefer to use <command> <link
+	linkend="stmtmoveset"> MOVE SET </link> </command> instead, as
+      that does <emphasis>not</emphasis> abandon the failed node.
      </para></warning>
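+
+<para> For comparison, a controlled switchover using <command>MOVE
+SET</command> might look like the following minimal sketch (set and
+node ids are examples; the set must first be locked on the old
+origin):
+
+<programlisting>
+lock set (id = 1, origin = 1);
+move set (id = 1, old origin = 1, new origin = 2);
+</programlisting>
+</para>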
     
     <variablelist>
Index: faq.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/faq.sgml,v
retrieving revision 1.13
retrieving revision 1.14
diff -Ldoc/adminguide/faq.sgml -Ldoc/adminguide/faq.sgml -u -w -r1.13 -r1.14
--- doc/adminguide/faq.sgml
+++ doc/adminguide/faq.sgml
@@ -684,20 +684,21 @@
 
 <itemizedlist>
 
-<listitem><para> The <quote>glitch</quote> seems to coincide with some sort
-of outage; it has been observed both in cases where databases were
-suffering from periodic "SIG 11" problems, where backends were falling
-over, as well as when temporary network failure seemed
-likely.</para></listitem>
+<listitem><para> The <quote>glitch</quote> has occasionally coincided
+with some sort of outage; it has been observed both in cases where
+databases were suffering from periodic <quote>SIG 11</quote> problems,
+where backends were falling over, as well as when temporary network
+failure seemed likely.</para></listitem>
 
 <listitem><para> The scenario seems to involve a delete transaction
-having been missed by <productname>Slony-I</productname>.</para></listitem>
+having been missed by <productname>Slony-I</productname>. </para>
+</listitem>
 
 </itemizedlist></para>
 
 <para>By the time we notice that there is a problem, the missed delete
-transaction has been cleaned out of sl_log_1, so there is no recovery
-possible.</para>
+transaction has been cleaned out of <envar>sl_log_1</envar>, so there
+is no recovery possible.</para>
 
 <para>What is necessary, at this point, is to drop the replication set
 (or even the node), and restart replication from scratch on that
@@ -712,7 +713,17 @@
 from the condition or at least diagnosing it more exactly.  And
 perhaps the problem is that sl_log_1 was being purged too
 aggressively, and this will resolve the issue completely.</para>
-</answer></qandaentry>
+</answer>
+<answer><para> Unfortunately, this problem has been observed in 1.0.5,
+so it appears that this bug still exists.</para>
+
+<para> It is a shame to have to reconstruct a large replication node
+because of this; if you discover that the problem recurs, it may be
+worthwhile to break replication down into multiple sets in order to
+diminish the work involved in restarting replication.  If only one
+set has broken, you need only unsubscribe/drop and resubscribe that
+one set, as sketched below.
+</para></answer>
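+
+<answer><para> As a minimal sketch (set and node ids here are
+examples), dropping and re-establishing the subscription for a single
+broken set on receiver node 2 might look like:
+
+<programlisting>
+unsubscribe set (id = 1, receiver = 2);
+subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);
+</programlisting>
+</para></answer>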
+</qandaentry>
 
 <qandaentry>
 
@@ -1159,6 +1170,37 @@
 </para></answer>
 </qandaentry>
 
+<qandaentry>
+<question> <para> I had a network <quote>glitch</quote> that led to my
+using <command><link linkend="stmtfailover">FAILOVER</link></command>
+to fail over to an alternate node.  The failure wasn't a disk problem
+that would corrupt databases; why do I need to rebuild the failed node
+from scratch? </para></question>
+
+<answer><para> The action of <command><link
+linkend="stmtfailover">FAILOVER</link></command> is to
+<emphasis>abandon</emphasis> the failed node so that no more
+<productname>Slony-I</productname> activity goes to or from that node.
+As soon as that takes place, the failed node will fall further and
+further out of sync.
+</para></answer>
+
+<answer><para> The <emphasis>big</emphasis> problem with trying to
+recover the failed node is that it may contain updates that were
+committed there but never replicated to the rest of the cluster.  If
+they are retried on the new origin, you may find that you have
+conflicting updates.  In any case, you have a sort of
+<quote>logical</quote> corruption of the data even if no disk failure
+ever made it <quote>physical.</quote>
+</para></answer>
+
+<answer><para> As discussed in the section on <link
+linkend="failover"> Doing switchover and failover with Slony-I</link>,
+using <command><link linkend="stmtfailover">FAILOVER</link></command>
+should be considered a <emphasis>last resort</emphasis> as it implies
+that you are abandoning the origin node as being corrupted.
+</para></answer>
+</qandaentry>
+
 </qandaset>
 
 <!-- Keep this comment at the end of the file Local variables:
Index: failover.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/failover.sgml,v
retrieving revision 1.7
retrieving revision 1.8
diff -Ldoc/adminguide/failover.sgml -Ldoc/adminguide/failover.sgml -u -w -r1.7 -r1.8
--- doc/adminguide/failover.sgml
+++ doc/adminguide/failover.sgml
@@ -98,13 +98,13 @@
 where it can limp along long enough to do a controlled switchover,
 that is <emphasis>greatly</emphasis> preferable.</para>
 
-<para> Slony does not provide any automatic detection for failed
-systems.  Abandoning committed transactions is a business decision
-that cannot be made by a database.  If someone wants to put the
-commands below into a script executed automatically from the network
-monitoring system, well ... it's <emphasis>your</emphasis> data, and
-it's <emphasis>your</emphasis> failover policy.
-</para>
+<para> <productname>Slony-I</productname> does not provide any
+automatic detection for failed systems.  Abandoning committed
+transactions is a business decision that cannot be made by a database
+system.  If someone wants to put the commands below into a script
+executed automatically from the network monitoring system, well
+... it's <emphasis>your</emphasis> data, and it's
+<emphasis>your</emphasis> failover policy. </para>
 
 <itemizedlist>
 
@@ -139,10 +139,10 @@
 </listitem>
 
 <listitem>
-<para> After the failover is complete and node2 accepts
-write operations against the tables, remove all remnants of node1's
-configuration information with the <link linkend="slonik"><application>slonik</application></link> 
-<command><link linkend="stmtdropnode">DROP NODE</link></command> command:
+<para> After the failover is complete and node2 accepts write
+operations against the tables, remove all remnants of node1's
+configuration information with the <command><link
+linkend="stmtdropnode">DROP NODE</link></command> command:
 
 <programlisting>
 drop node (id = 1, event node = 2);
@@ -154,17 +154,28 @@
 
 <sect2><title>After Failover, Reconfiguring node1</title>
 
-<para> After the above failover, the data stored on node1 is
-considered out of sync with the rest of the nodes, and must be treated
-as corrupt.  Therefore, the only way to get node1 back and transfer
-the origin role back to it is to rebuild it from scratch as a
+<para> After the above failover, the data stored on node1 will fall
+progressively further out of sync with the rest of the nodes, and must
+be treated as corrupt.  Therefore, the only way to get node1 back and
+transfer the origin role back to it is to rebuild it from scratch as a
 subscriber, let it catch up, and then follow the switchover
 procedure.</para>
 
+<para> A good reason <emphasis>not</emphasis> to do this automatically
+is the fact that important updates (from a
+<emphasis>business</emphasis> perspective) may have been
+<command>commit</command>ted on the failing system.  You probably want
+to analyze the last few transactions that made it into the failed node
+to see if some of them need to be reapplied on the <quote>live</quote>
+cluster.  For instance, if someone was entering bank deposits
+affecting customer accounts at the time of failure, you wouldn't want
+to lose that information.</para>
+
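+<para> As a starting point for such an analysis, one might inspect
+the most recent entries in the failed node's log table.  This is only
+a sketch: it assumes a cluster named <quote>movies</quote> (hence the
+<envar>_movies</envar> schema) and the <envar>sl_log_1</envar> layout
+of Slony-I 1.x:
+
+<programlisting>
+select log_xid, log_actionseq, log_cmdtype, log_cmddata
+from _movies.sl_log_1
+order by log_actionseq desc limit 20;
+</programlisting>
+</para>
+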
 <para> If the database is very large, it may take many hours to
-recover node1 as a functioning <productname>Slony-I</productname> node; that is
-another reason to consider failover as an undesirable <quote>final
-resort.</quote></para>
+recover node1 as a functioning <productname>Slony-I</productname>
+node; that is another reason to consider failover as an undesirable
+<quote>final resort.</quote></para>
+
 </sect2>
 
 </sect1>

