Wed Feb 2 18:34:56 PST 2005
- Previous message: [Slony1-commit] By cbbrowne: Major additions to comments by Steve Simms
- Next message: [Slony1-commit] By darcyb: add request for upgrade docs, and more examples.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Log Message:
-----------
Improved discussion of FAILOVER and why it is vital that it abandon
the failed node.

Modified Files:
--------------
slony1-engine/doc/adminguide:
        failover.sgml (r1.7 -> r1.8)
        faq.sgml (r1.13 -> r1.14)
        slonik_ref.sgml (r1.9 -> r1.10)

-------------- next part --------------
Index: slonik_ref.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/slonik_ref.sgml,v
retrieving revision 1.9
retrieving revision 1.10
diff -Ldoc/adminguide/slonik_ref.sgml -Ldoc/adminguide/slonik_ref.sgml -u -w -r1.9 -r1.10
--- doc/adminguide/slonik_ref.sgml
+++ doc/adminguide/slonik_ref.sgml
@@ -1693,9 +1693,9 @@
     <title>Description</title>
     <para>
-     The failover command causes the backup node to take over all sets
+     The <command>FAILOVER</command> command causes the backup node to take over all sets
      that currently originate on the failed
-     node. <application>Slonik</application> will contact all other
+     node. <application>slonik</application> will contact all other
      direct subscribers of the failed node to determine which node
      has the highest sync status for each set.  If another node has a
      higher sync status than the backup node, the replication will
@@ -1703,19 +1703,21 @@
      that other node, before assuming the origin role and allowing
      update activity.
     </para>
+
     <para>
      After successful failover, all former direct subscribers of the
      failed node become direct subscribers of the backup node.  The
-     failed node can and should be removed from the configuration with
-     <command><link linkend="stmtdropnode"> DROP NODE</link>
-     </command>.
+     failed node is abandoned, and can and should be removed from the
+     configuration with <command><link linkend="stmtdropnode"> DROP
+     NODE</link> </command>.
     </para>
     <warning><para>
      This command will abandon the status of the failed node.  There
      is no possibility to let the failed node join the
-     cluster again without rebuilding it from scratch as a slave.
-     It would often be highly preferable to use <command> <link
-     linkend="stmtmoveset"> MOVE SET </link> </command> instead.
+     cluster again without rebuilding it from scratch as a slave.  If
+     at all possible, you would likely prefer to use <command> <link
+     linkend="stmtmoveset"> MOVE SET </link> </command> instead, as
+     that does <emphasis>not</emphasis> abandon the failed node.
     </para></warning>
     <variablelist>
Index: faq.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/faq.sgml,v
retrieving revision 1.13
retrieving revision 1.14
diff -Ldoc/adminguide/faq.sgml -Ldoc/adminguide/faq.sgml -u -w -r1.13 -r1.14
--- doc/adminguide/faq.sgml
+++ doc/adminguide/faq.sgml
@@ -684,20 +684,21 @@
 <itemizedlist>

-<listitem><para> The <quote>glitch</quote> seems to coincide with some sort
-of outage; it has been observed both in cases where databases were
-suffering from periodic "SIG 11" problems, where backends were falling
-over, as well as when temporary network failure seemed
-likely.</para></listitem>
+<listitem><para> The <quote>glitch</quote> has occasionally coincided
+with some sort of outage; it has been observed both in cases where
+databases were suffering from periodic <quote>SIG 11</quote> problems,
+where backends were falling over, as well as when temporary network
+failure seemed likely.</para></listitem>

 <listitem><para> The scenario seems to involve a delete transaction
-having been missed by <productname>Slony-I</productname>.</para></listitem>
+having been missed by <productname>Slony-I</productname>.
+</para>
+</listitem>
 </itemizedlist></para>

 <para>By the time we notice that there is a problem, the missed delete
-transaction has been cleaned out of sl_log_1, so there is no recovery
-possible.</para>
+transaction has been cleaned out of <envar>sl_log_1</envar>, so there
+is no recovery possible.</para>

 <para>What is necessary, at this point, is to drop the replication
 set (or even the node), and restart replication from scratch on that
@@ -712,7 +713,17 @@
 from the condition or at least diagnosing it more exactly.  And
 perhaps the problem is that sl_log_1 was being purged too
 aggressively, and this will resolve the issue completely.</para>
-</answer></qandaentry>
+</answer>
+<answer><para> Unfortunately, this problem has been observed in 1.0.5,
+so this appears to represent a bug still in existence.</para>
+
+<para> It is a shame to have to reconstruct a large replication node
+for this; if you discover that this problem recurs, it may be an idea
+to break replication down into multiple sets in order to diminish the
+work involved in restarting replication.  If only one set has broken,
+you need only unsubscribe/drop and resubscribe the one set.
+</para></answer>
+</qandaentry>

 <qandaentry>
@@ -1159,6 +1170,37 @@
 </para></answer>
 </qandaentry>

+<qandaentry>
+<question> <para> I had a network <quote>glitch</quote> that led to my
+using <command><link linkend="stmtfailover">FAILOVER</link></command>
+to fail over to an alternate node.  The failure wasn't a disk problem
+that would corrupt databases; why do I need to rebuild the failed node
+from scratch? </para></question>
+
+<answer><para> The action of <command><link
+linkend="stmtfailover">FAILOVER</link></command> is to
+<emphasis>abandon</emphasis> the failed node so that no more
+<productname>Slony-I</productname> activity goes to or from that node.
+As soon as that takes place, the failed node will progressively fall
+further and further out of sync.
+</para></answer>
+
+<answer><para> The <emphasis>big</emphasis> problem with trying to
+recover the failed node is that it may contain updates that never made
+it out of the origin.  If they get retried on the new origin, you may
+find that you have conflicting updates.  In any case, you do have a
+sort of <quote>logical</quote> corruption of the data even if there
+never was a disk failure making it <quote>physical.</quote>
+</para></answer>
+
+<answer><para> As discussed in the section on <link
+linkend="failover"> Doing switchover and failover with Slony-I</link>,
+using <command><link linkend="stmtfailover">FAILOVER</link></command>
+should be considered a <emphasis>last resort</emphasis> as it implies
+that you are abandoning the origin node as corrupted.
+</para></answer>
+</qandaentry>
+
 </qandaset>
 <!-- Keep this comment at the end of the file
 Local variables:
Index: failover.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/failover.sgml,v
retrieving revision 1.7
retrieving revision 1.8
diff -Ldoc/adminguide/failover.sgml -Ldoc/adminguide/failover.sgml -u -w -r1.7 -r1.8
--- doc/adminguide/failover.sgml
+++ doc/adminguide/failover.sgml
@@ -98,13 +98,13 @@
 where it can limp along long enough to do a controlled switchover,
 that is <emphasis>greatly</emphasis> preferable.</para>

-<para> Slony does not provide any automatic detection for failed
-systems.  Abandoning committed transactions is a business decision
-that cannot be made by a database.  If someone wants to put the
-commands below into a script executed automatically from the network
-monitoring system, well ... it's <emphasis>your</emphasis> data, and
-it's <emphasis>your</emphasis> failover policy.
-</para>
+<para> <productname>Slony-I</productname> does not provide any
+automatic detection for failed systems.  Abandoning committed
+transactions is a business decision that cannot be made by a database
+system.  If someone wants to put the commands below into a script
+executed automatically from the network monitoring system, well
+... it's <emphasis>your</emphasis> data, and it's
+<emphasis>your</emphasis> failover policy. </para>

 <itemizedlist>
@@ -139,10 +139,10 @@
 </listitem>

 <listitem>
-<para> After the failover is complete and node2 accepts
-write operations against the tables, remove all remnants of node1's
-configuration information with the <link linkend="slonik"><application>slonik</application></link>
-<command><link linkend="stmtdropnode">DROP NODE</link></command> command:
+<para> After the failover is complete and node2 accepts write
+operations against the tables, remove all remnants of node1's
+configuration information with the <command><link
+linkend="stmtdropnode">DROP NODE</link></command> command:

 <programlisting>
 drop node (id = 1, event node = 2);
@@ -154,17 +154,28 @@

 <sect2><title>After Failover, Reconfiguring node1</title>

-<para> After the above failover, the data stored on node1 is
-considered out of sync with the rest of the nodes, and must be treated
-as corrupt.  Therefore, the only way to get node1 back and transfer
-the origin role back to it is to rebuild it from scratch as a
+<para> After the above failover, the data stored on node1 will rapidly
+become increasingly out of sync with the rest of the nodes, and must
+be treated as corrupt.  Therefore, the only way to get node1 back and
+transfer the origin role back to it is to rebuild it from scratch as a
 subscriber, let it catch up, and then follow the switchover
 procedure.</para>

+<para> A good reason <emphasis>not</emphasis> to do this automatically
+is the fact that important updates (from a
+<emphasis>business</emphasis> perspective) may have been
+<command>commit</command>ted on the failing system.  You probably want
+to analyze the last few transactions that made it into the failed node
+to see if some of them need to be reapplied on the <quote>live</quote>
+cluster.  For instance, if someone was entering bank deposits
+affecting customer accounts at the time of failure, you wouldn't want
+to lose that information.</para>
+
 <para> If the database is very large, it may take many hours to
-recover node1 as a functioning <productname>Slony-I</productname> node; that is
-another reason to consider failover as an undesirable <quote>final
-resort.</quote></para>
+recover node1 as a functioning <productname>Slony-I</productname>
+node; that is another reason to consider failover as an undesirable
+<quote>final resort.</quote></para>
+
 </sect2>
 </sect1>
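As background for the procedure these docs describe, a minimal slonik sketch of the failover-then-abandon sequence might look like the following. The cluster name, conninfo strings, and node ids are hypothetical; per the warning added in this commit, such a script should only ever be run as a deliberate last resort, since it abandons node 1 permanently:

```
-- Hypothetical cluster: node 1 is the failed origin, node 2 the backup.
cluster name = testcluster;
node 2 admin conninfo = 'dbname=mydb host=backuphost user=slony';
node 3 admin conninfo = 'dbname=mydb host=otherhost user=slony';

-- Promote node 2 to origin of all sets that originated on failed node 1.
failover (id = 1, backup node = 2);

-- Node 1 is now abandoned; remove all remnants of its configuration.
drop node (id = 1, event node = 2);
```

Bringing the old node 1 back afterwards means rebuilding it from scratch as a subscriber and, once it has caught up, using MOVE SET if the origin role should return to it.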