CVS User Account cvsuser
Fri Dec 10 23:53:17 PST 2004
Log Message:
-----------
Add in FAQ to the admin guide

Added Files:
-----------
    slony1-engine/doc/adminguide:
        faq.sgml (r1.1)

-------------- next part --------------
--- /dev/null
+++ doc/adminguide/faq.sgml
@@ -0,0 +1,640 @@
+<qandaset>
+
+<qandaentry>
+<question><para>I looked for the <envar/_clustername/ namespace, and
+it wasn't there.</question>
+
+<answer><para> If the DSNs are wrong, then slon instances can't connect to the nodes.
+
+<para>This will generally lead to nodes remaining entirely untouched.
+
+<para>Recheck the connection configuration.  By the way, since
+<application/slon/ links to libpq, password information could also be
+coming from <filename><envar>$HOME</envar>/.pgpass</filename>, which may
+be filling in part of the authentication information, rightly or wrongly.
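+
+<para>A quick check (a sketch; substitute the host, port, user, and
+database from the DSN that your slon configuration actually uses):
+
+<programlisting>
+# Does the DSN from the slon configuration actually connect?
+psql -h host004 -p 5432 -U postgres -d pgbenchrep -c 'select 1;'
+
+# If it connects, list the schemas and look for the _clustername namespace
+psql -h host004 -p 5432 -U postgres -d pgbenchrep -c '\dn'
+</programlisting>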
+</answer>
+</qandaentry>
+
+<qandaentry id="SlonyFAQ02">
+<question><para>
+Some events moving around, but no replication
+
+<para> Slony logs might look like the following:
+
+<screen>
+DEBUG1 remoteListenThread_1: connected to 'host=host004 dbname=pgbenchrep user=postgres port=5432'
+ERROR  remoteListenThread_1: "select ev_origin, ev_seqno, ev_timestamp,		  ev_minxid, ev_maxxid, ev_xip,		  ev_type,		  ev_data1, ev_data2,		  ev_data3, ev_data4,		  ev_data5, ev_data6,		  ev_data7, ev_data8 from "_pgbenchtest".sl_event e where (e.ev_origin = '1' and e.ev_seqno > '1') order by e.ev_origin, e.ev_seqno" - could not receive data from server: Operation now in progress
+</screen>
+
+<answer><para>
+On AIX and Solaris (and possibly elsewhere), both Slony-I <emphasis/and PostgreSQL/ must be compiled with the <option/--enable-thread-safety/ option.  The error above results when PostgreSQL has not been compiled that way.
+
+<para>What breaks here is that the (threadsafe) libc and the (non-threadsafe) libpq use different memory locations for errno, so the request fails.
+
+<para>Problems like this crop up with dismaying regularity on AIX
+and Solaris; it may take something of an <quote/object code audit/ to
+make sure that <emphasis/ALL/ of the necessary components have been
+compiled and linked with <option/--enable-thread-safety/.
+
+<para>For instance, I ran into the problem once when
+<envar/LD_LIBRARY_PATH/ had been set, on Solaris, to point to
+libraries from an old PostgreSQL compile.  That meant that even though
+the database <emphasis/had/ been compiled with
+<option/--enable-thread-safety/, and <application/slon/ had been
+compiled against that, <application/slon/ was being dynamically linked
+to the <quote/bad old thread-unsafe version,/ so slon didn't work.  It
+wasn't clear that this was the case until I ran <command/ldd/ against
+<application/slon/.
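+
+<para>A quick way to check both pieces (a sketch, assuming the
+<command/pg_config/ and <application/slon/ found on your <envar/PATH/
+are the ones actually in use):
+
+<programlisting>
+# Was this PostgreSQL build configured for thread safety?
+pg_config --configure | grep -- --enable-thread-safety
+
+# Which libpq is slon dynamically linked against?
+ldd `which slon` | grep libpq
+</programlisting>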
+
+<qandaentry>
+<question> <para>I tried creating a CLUSTER NAME with a "-" in it.
+That didn't work.
+
+<answer><Para> Slony-I uses the same rules for unquoted identifiers as the PostgreSQL
+main parser, so no, you probably shouldn't put a "-" in your
+identifier name.
+
+<para> You may be able to defeat this by putting "quotes" around
+identifier names, but it's liable to bite you some, so this is
+something that is probably not worth working around.
+
+<qandaentry>
+<question><para> slon does not restart after crash
+
+<para> After an immediate stop of PostgreSQL (simulating a system crash),
+a tuple with relname='_${cluster_name}_Restart' remains in
+pg_catalog.pg_listener.  slon won't start because it thinks another
+process is serving the cluster on this node.  What can I do?  The
+tuples can't be dropped from this relation.
+
+<para> The logs claim that "Another slon daemon is serving this node already"
+
+<answer>
+<para>It's handy to keep a slonik script like the following one around to
+run in such cases:
+
+<programlisting>
+twcsds004[/opt/twcsds004/OXRS/slony-scripts]$ cat restart_org.slonik 
+cluster name = oxrsorg ;
+node 1 admin conninfo = 'host=32.85.68.220 dbname=oxrsorg user=postgres port=5532';
+node 2 admin conninfo = 'host=32.85.68.216 dbname=oxrsorg user=postgres port=5532';
+node 3 admin conninfo = 'host=32.85.68.244 dbname=oxrsorg user=postgres port=5532';
+node 4 admin conninfo = 'host=10.28.103.132 dbname=oxrsorg user=postgres port=5532';
+restart node 1;
+restart node 2;
+restart node 3;
+restart node 4;
+</programlisting>
+
+<para> <command/restart node n/ cleans up dead notifications so that you can restart the node.
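+
+<para>To confirm that a stale restart notification is indeed what is
+in the way (a sketch; substitute your own database name):
+
+<programlisting>
+# look for the leftover '_clustername_Restart' listener entry
+psql -d oxrsorg -c "select relname, listenerpid from pg_catalog.pg_listener where relname like '%Restart';"
+</programlisting>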
+
+<para>As of version 1.0.5, the startup process of slon looks for this
+condition, and automatically cleans it up.
+
+<qandaentry>
+<question><Para>
+ps finds passwords on command line
+
+<para> If I run a <command/ps/ command, I, and everyone else, can see passwords
+on the command line.
+
+<answer>
+<para>Take the passwords out of the Slony configuration, and put them into
+<filename><envar>$HOME</envar>/.pgpass</filename>.
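+
+<para>A minimal sketch of that file (the standard libpq format; the
+values shown are placeholders):
+
+<programlisting>
+# $HOME/.pgpass -- one line per server: hostname:port:database:username:password
+host004:5432:pgbenchrep:postgres:secretpassword
+</programlisting>
+
+<para>Make sure the file is readable only by its owner, e.g. with
+<command>chmod 600 $HOME/.pgpass</command>, or libpq may refuse to use it.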
+
+<qandaentry>
+<question><Para>Slonik fails - cannot load PostgreSQL library - <command>PGRES_FATAL_ERROR load '$libdir/xxid';</command>
+
+<para> When I run the sample setup script I get an error message similar
+to:
+
+<command>
+stdin:64: PGRES_FATAL_ERROR load '$libdir/xxid';  - ERROR:  LOAD:
+could not open file '$libdir/xxid': No such file or directory
+</command>
+
+<answer><para> Evidently, you haven't got the <filename/xxid.so/
+library in the <envar/$libdir/ directory that the PostgreSQL instance
+is using.  Note that the Slony-I components need to be installed in
+the PostgreSQL software installation for <emphasis/each and every one/
+of the nodes, not just on the <quote/master node./
+
+<para>This may also point to there being some other mismatch between
+the PostgreSQL binary instance and the Slony-I instance.  If you
+compiled Slony-I yourself, on a machine that may have multiple
+PostgreSQL builds <quote/lying around,/ it's possible that the slon or
+slonik binaries are asking to load something that isn't actually in
+the library directory for the PostgreSQL database cluster that it's
+hitting.
+
+<para>Long and short: This points to a need to <quote/audit/ what
+installations of PostgreSQL and Slony you have in place on the
+machine(s).  Unfortunately, just about any mismatch will cause things
+not to link up quite right.  See also <link linkend="SlonyFAQ02">
+SlonyFAQ02 </link> concerning threading issues on Solaris ...
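+
+<para>A quick audit (a sketch, assuming the <command/pg_config/ on
+your <envar/PATH/ belongs to the PostgreSQL installation that the
+database cluster is actually running):
+
+<programlisting>
+# where does this PostgreSQL installation load modules from?
+pg_config --pkglibdir
+
+# is the Slony-I xxid module installed there?
+ls `pg_config --pkglibdir` | grep xxid
+</programlisting>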
+
+<qandaentry>
+<question><para>Table indexes with FQ namespace names
+
+<para>When adding a table, the table name is given with a fully
+qualified namespace; should the key (index) name be qualified as well?
+
+<programlisting>
+set add table (set id = 1, origin = 1, id = 27, 
+               full qualified name = 'nspace.some_table', 
+               key = 'key_on_whatever', 
+               comment = 'Table some_table in namespace nspace with a candidate primary key');
+</programlisting>
+
+<answer><para> No.  If you specify <command/key = 'nspace.key_on_whatever'/
+the request will <emphasis/FAIL/; the key (index) name must not be
+qualified with a namespace.
+
+<qandaentry>
+<question><Para>
+I'm trying to get a slave subscribed, and get the following
+messages in the logs:
+
+<screen>
+DEBUG1 copy_set 1
+DEBUG1 remoteWorkerThread_1: connected to provider DB
+WARN	remoteWorkerThread_1: transactions earlier than XID 127314958 are still in progress
+WARN	remoteWorkerThread_1: data copy for set 1 failed - sleep 60 seconds
+</screen>
+
+<para>Oops.  What I forgot to mention, as well, was that I was trying
+to add <emphasis/TWO/ subscribers, concurrently.
+
+<answer><para> That doesn't work out: Slony-I won't run the
+<command/COPY/ commands concurrently.  See
+<filename>src/slon/remote_worker.c</filename>, function
+<function/copy_set()/.
+
+<para>This has the (perhaps unfortunate) implication that you cannot
+populate two slaves concurrently.  You have to subscribe one to the
+set, and only once it has completed setting up the subscription
+(copying table contents and such) can the second subscriber start
+setting up the subscription.
+
+<para>It could also be possible for there to be an old outstanding
+transaction blocking Slony-I from processing the sync.  You might want
+to take a look at pg_locks to see what's up:
+
+<screen>
+sampledb=# select * from pg_locks where transaction is not null order by transaction;
+ relation | database | transaction |  pid    |     mode      | granted 
+----------+----------+-------------+---------+---------------+---------
+          |          |   127314921 | 2605100 | ExclusiveLock | t
+          |          |   127326504 | 5660904 | ExclusiveLock | t
+(2 rows)
+</screen>
+
+<para>See?  127314921 is indeed older than 127314958, and it's still running.
+
+<screen>
+$ ps -aef | egrep '[2]605100'
+postgres 2605100  205018	0 18:53:43  pts/3  3:13 postgres: postgres sampledb localhost COPY 
+</screen>
+
+<para>This happens to be a <command/COPY/ transaction involved in setting up the
+subscription for one of the nodes.  All is well; the system is busy
+setting up the first subscriber; it won't start on the second one
+until the first one has completed subscribing.
+
+<para>By the way, if there is more than one database on the PostgreSQL
+cluster, and activity is taking place on the OTHER database, you will
+still see <quote/transactions earlier than XID whatever/ reported as
+being in progress.  The fact that it's a separate database on the
+cluster is irrelevant; Slony-I will wait until those old transactions
+terminate.
+<qandaentry>
+<question><Para>
+ERROR: duplicate key violates unique constraint "sl_table-pkey"
+
+<para>I tried setting up a second replication set, and got the following error:
+
+<screen>
+stdin:9: Could not create subscription set 2 for oxrslive!
+stdin:11: PGRES_FATAL_ERROR select "_oxrslive".setAddTable(2, 1, 'public.replic_test', 'replic_test__Slony-I_oxrslive_rowID_key', 'Table public.replic_test without primary key');  - ERROR:  duplicate key violates unique constraint "sl_table-pkey"
+CONTEXT:  PL/pgSQL function "setaddtable_int" line 71 at SQL statement
+</screen>
+
+<answer><para>
+The table IDs used in SET ADD TABLE are required to be unique <emphasis/ACROSS
+ALL SETS/.  Thus, you can't restart numbering at 1 for a second set; if
+you are numbering them consecutively, a subsequent set has to start
+with IDs after where the previous set(s) left off.
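+
+<para>A quick way to see where the previous set(s) left off (a sketch;
+substitute your own cluster schema and database):
+
+<programlisting>
+psql -d oxrslive -c "select max(tab_id) from _oxrslive.sl_table;"
+</programlisting>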
+<qandaentry>
+<question><Para>I need to drop a table from a replication set
+<answer><para>
+This can be accomplished several ways, not all equally desirable ;-).
+
+<itemizedlist>
+<listitem><para> You could drop the whole replication set, and recreate it with just the tables that you need.  Alas, that means recopying a whole lot of data, and kills the usability of the cluster on the rest of the set while that's happening.
+
+<listitem><para> If you are running 1.0.5 or later, there is the command SET DROP TABLE, which will "do the trick."
+
+<listitem><para> If you are still using 1.0.1 or 1.0.2, the <emphasis/essential/ functionality of SET DROP TABLE involves the functionality in droptable_int().  You can fiddle this by hand by finding the table ID for the table you want to get rid of, which you can find in sl_table (see the sketch after this list), and then run the following three queries, on each host:
+
+<programlisting>
+  select _slonyschema.alterTableRestore(40);
+  select _slonyschema.tableDropKey(40);
+  delete from _slonyschema.sl_table where tab_id = 40;
+</programlisting>
+
+<para>The schema will obviously depend on how you defined the Slony-I
+cluster.  The table ID, in this case, 40, will need to change to the
+ID of the table you want to have go away.
+
+<para>You'll have to run these three queries on all of the nodes, preferably
+firstly on the "master" node, so that the dropping of this propagates
+properly.  Implementing this via a slonik statement with a new Slony
+event would do that; so would submitting the three queries using EXECUTE
+SCRIPT.  It is also possible to connect to each database and submit the
+queries by hand.
+</itemizedlist>
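+
+<para>To find the table ID referred to above (a sketch; the schema and
+database names are placeholders):
+
+<programlisting>
+psql -d mydb -c "select tab_id, tab_comment from _slonyschema.sl_table;"
+</programlisting>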
+<qandaentry>
+<question><Para>I need to drop a sequence from a replication set
+
+<answer><para>If you are running 1.0.5 or later, there is a
+<command/SET DROP SEQUENCE/ command in Slonik to allow you to do this,
+paralleling <command/SET DROP TABLE./
+
+<para>If you are running 1.0.2 or earlier, the process is a bit more manual.
+
+<para>Supposing I want to get rid of the two sequences listed below,
+<envar/whois_cachemgmt_seq/ and <envar/epp_whoi_cach_seq_/, we start
+by needing the <envar/seq_id/ values.
+
+<screen>
+oxrsorg=# select * from _oxrsorg.sl_sequence  where seq_id in (93,59);
+ seq_id | seq_reloid | seq_set |       seq_comment				 
+--------+------------+---------+-------------------------------------
+     93 |  107451516 |       1 | Sequence public.whois_cachemgmt_seq
+     59 |  107451860 |       1 | Sequence public.epp_whoi_cach_seq_
+(2 rows)
+</screen>
+
+<para>The entries that need to be deleted to stop Slony-I from
+continuing to replicate these sequences are:
+
+<programlisting>
+delete from _oxrsorg.sl_seqlog where seql_seqid in (93, 59);
+delete from _oxrsorg.sl_sequence where seq_id in (93,59);
+</programlisting>
+
+<para>Those two queries could be submitted to all of the nodes via
+<function/ddlscript()/ / <command/EXECUTE SCRIPT/, thus eliminating
+the sequence everywhere <quote/at once./ Or they may be applied by
+hand to each of the nodes.
+
+<para>Similarly to <command/SET DROP TABLE/, this should be in place for Slony-I version
+1.0.5 as <command/SET DROP SEQUENCE./
+<qandaentry>
+<question><Para>Slony-I: cannot add table to currently subscribed set 1
+
+<para> I tried to add a table to a set, and got the following message:
+
+<screen>
+	Slony-I: cannot add table to currently subscribed set 1
+</screen>
+
+<answer><para> You cannot add tables to sets that already have
+subscribers.
+
+<para>The workaround to this is to create <emphasis/ANOTHER/ set, add
+the new tables to that new set, subscribe the same nodes subscribing
+to "set 1" to the new set, and then merge the sets together.
+
+<qandaentry>
+<question><Para>Some nodes start consistently falling behind
+
+<para>I have been running Slony-I on a node for a while, and am seeing
+system performance suffering.
+
+<para>I'm seeing long running queries of the form:
+<screen>
+	fetch 100 from LOG;
+</screen>
+
+<answer><para> This is characteristic of pg_listener (which is the table containing
+<command/NOTIFY/ data) having plenty of dead tuples in it.  That makes <command/NOTIFY/
+events take a long time, and causes the affected node to gradually
+fall further and further behind.
+
+<para>You quite likely need to do a <command/VACUUM FULL/ on <envar/pg_listener/, to vigorously clean it out, and need to vacuum <envar/pg_listener/ really frequently.  Once every five minutes would likely be AOK.
+
+<para> Slon daemons already vacuum a bunch of tables, and
+<filename/cleanup_thread.c/ contains a list of tables that are
+frequently vacuumed automatically.  In Slony-I 1.0.2,
+<envar/pg_listener/ is not included.  In 1.0.5 and later, it is
+regularly vacuumed, so this should cease to be a direct issue.
+
+<para>There is, however, still a scenario where this will
+"bite."  Vacuums cannot delete tuples that were made "obsolete" at any
+time after the start time of the eldest transaction that is still
+open.  Long running transactions will cause trouble, and should be
+avoided, even on "slave" nodes.
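+
+<para>One way to arrange both the one-time <command/VACUUM FULL/ and
+the recurring vacuum mentioned above (a sketch; adjust the database
+name and connection options to your environment):
+
+<programlisting>
+# one-time aggressive cleanup of the notification table
+psql -d pgbenchrep -c 'vacuum full verbose pg_catalog.pg_listener;'
+
+# crontab entry to keep it tidy every five minutes thereafter
+*/5 * * * * psql -d pgbenchrep -c 'vacuum pg_catalog.pg_listener;' >/dev/null 2>&1
+</programlisting>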
+
+<qandaentry>
+<question><Para>I started doing a backup using pg_dump, and suddenly Slony stops
+
+<answer><para>Ouch.  What happens here is a conflict between:
+<itemizedlist>
+
+<listitem><para> <application/pg_dump/, which has taken out an <command/AccessShareLock/ on all of the tables in the database, including the Slony-I ones, and
+
+<listitem><para> A Slony-I sync event, which wants to grab a <command/AccessExclusiveLock/ on	 the table <envar/sl_event/.
+</itemizedlist>
+
+<para>The initial query that will be blocked is thus:
+
+<screen>
+select "_slonyschema".createEvent('_slonyschema, 'SYNC', NULL);	  
+</screen>
+
+<para>(You can see this in <envar/pg_stat_activity/, if you have query
+display turned on in <filename/postgresql.conf/.)
+
+<para>The actual query combination that is causing the lock is from
+the function <function/Slony_I_ClusterStatus()/, found in
+<filename/slony1_funcs.c/, and is localized in the code that does:
+
+<programlisting>
+  LOCK TABLE %s.sl_event;
+  INSERT INTO %s.sl_event (...stuff...)
+  SELECT currval('%s.sl_event_seq');
+</programlisting>
+
+<para>The <command/LOCK/ statement will sit there and wait until <command/pg_dump/ (or whatever else has pretty much any kind of access lock on <envar/sl_event/) completes.  
+
+<para>Every subsequent query submitted that touches <envar/sl_event/ will block behind the <function/createEvent/ call.
+
+<para>There are a number of possible answers to this:
+<itemizedlist>
+
+<listitem><para> Have pg_dump specify the schema dumped using
+--schema=whatever, and don't try dumping the cluster's schema; see the
+sketch after this list.
+
+<listitem><para> It would be nice to add an "--exclude-schema" option
+to pg_dump to exclude the Slony cluster schema.  Maybe in 8.0 or
+8.1...
+
+<listitem><para>Note that 1.0.5 uses a more precise and less
+exclusive lock, which alleviates this problem.
+</itemizedlist>
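+
+<para>For the first option above, a sketch (schema and database names
+are illustrative):
+
+<programlisting>
+# dump only the application schema, leaving the Slony-I schema (and its locks) alone
+pg_dump --schema=public sampledb > sampledb_public.sql
+</programlisting>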
+<qandaentry>
+
+<question><Para>The slons spent the weekend out of commission [for
+some reason], and it's taking a long time to get a sync through.
+
+<answer><para>
+You might want to take a look at the sl_log_1/sl_log_2 tables, and do
+a summary to see if there are any really enormous Slony-I transactions
+in there.  Up until at least 1.0.2, there needs to be a slon connected
+to the master in order for <command/SYNC/ events to be generated.
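+
+<para>Such a summary might look like this (a sketch; substitute your
+cluster schema and database):
+
+<programlisting>
+# which transactions account for the most rows waiting in sl_log_1?
+psql -d oxrsorg -c "select log_xid, count(*) from _oxrsorg.sl_log_1 group by log_xid order by count(*) desc limit 10;"
+</programlisting>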
+
+<para>If none are being generated, then all of the updates until the next
+one is generated will collect into one rather enormous Slony-I
+transaction.
+
+<para>Conclusion: Even if there is not going to be a subscriber around, you
+<emphasis/really/ want to have a slon running to service the <quote/master/ node.
+
+<para>Some future version (probably 1.1) may provide a way for
+<command/SYNC/ counts to be updated on the master by the stored
+function that is invoked by the table triggers.
+
+<qandaentry>
+<question><Para>I pointed a subscribing node to a different parent and it stopped replicating
+
+<answer><para>
+We noticed this happening when we wanted to re-initialize a node,
+where we had configuration thus:
+
+<itemizedlist>
+<listitem><para> Node 1 - master
+<listitem><para> Node 2 - child of node 1 - the node we're reinitializing
+<listitem><para> Node 3 - child of node 2 - node that should keep replicating
+</itemizedlist>
+
+<para>The subscription for node 3 was changed to have node 1 as
+provider, and we did <command/DROP SET//<command/SUBSCRIBE SET/ for
+node 2 to get it repopulating.
+
+<para>Unfortunately, replication suddenly stopped to node 3.
+
+<para>The problem was that there was not a suitable set of <quote/listener paths/
+in sl_listen to allow the events from node 1 to propagate to node 3.
+The events were going through node 2, and blocking behind the
+<command/SUBSCRIBE SET/ event that node 2 was working on.
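+
+<para>To see what listener paths are actually in place (a sketch;
+substitute your own cluster schema and database):
+
+<programlisting>
+psql -d oxrslive -c "select li_origin, li_provider, li_receiver from _oxrslive.sl_listen order by li_origin, li_receiver;"
+</programlisting>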
+
+<para>The following slonik script dropped out the listen paths where node 3
+had to go through node 2, and added in direct listens between nodes 1
+and 3.
+
+<programlisting>
+cluster name = oxrslive;
+ node 1 admin conninfo='host=32.85.68.220 dbname=oxrslive user=postgres port=5432';
+ node 2 admin conninfo='host=32.85.68.216 dbname=oxrslive user=postgres port=5432';
+ node 3 admin conninfo='host=32.85.68.244 dbname=oxrslive user=postgres port=5432';
+ node 4 admin conninfo='host=10.28.103.132 dbname=oxrslive user=postgres port=5432';
+try {
+  store listen (origin = 1, receiver = 3, provider = 1);
+  store listen (origin = 3, receiver = 1, provider = 3);
+  drop listen (origin = 1, receiver = 3, provider = 2);
+  drop listen (origin = 3, receiver = 1, provider = 2);
+}
+</programlisting>
+
+<para>Immediately after this script was run, <command/SYNC/ events started propagating
+again to node 3.
+
+<para>This points out two principles:
+<itemizedlist>
+
+<listitem><para> If you have multiple nodes, and cascaded subscribers,
+you need to be quite careful in populating the <command/STORE LISTEN/
+entries, and in modifying them if the structure of the replication
+"tree" changes.
+
+<listitem><para> Version 1.1 probably ought to provide better tools to
+help manage this.
+
+</itemizedlist>
+
+<para>The issues of "listener paths" are discussed further at <link
+linkend="ListenPaths"> Slony Listen Paths </link>.
+
+<qandaentry>
+<question><Para>After dropping a node, sl_log_1 isn't getting purged out anymore.
+
+<answer><para> This is a common scenario in versions before 1.0.5, as
+the "clean up" that takes place when purging the node does not include
+purging out old entries from the Slony-I table, sl_confirm, for the
+recently departed node.
+
+<para> The node is no longer around to update confirmations of what
+syncs have been applied on it, and therefore the cleanup thread that
+purges log entries thinks that it can't safely delete entries newer
+than the final sl_confirm entry, which rather curtails the ability to
+purge out old logs.
+
+<para>Diagnosis: Run the following query to see if there are any
+"phantom/obsolete/blocking" sl_confirm entries:
+
+<screen>
+oxrsbar=# select * from _oxrsbar.sl_confirm where con_origin not in (select no_id from _oxrsbar.sl_node) or con_received not in (select no_id from _oxrsbar.sl_node);
+ con_origin | con_received | con_seqno |        con_timestamp                  
+------------+--------------+-----------+----------------------------
+          4 |          501 |     83999 | 2004-11-09 19:57:08.195969
+          1 |            2 |   3345790 | 2004-11-14 10:33:43.850265
+          2 |          501 |    102718 | 2004-11-14 10:33:47.702086
+        501 |            2 |      6577 | 2004-11-14 10:34:45.717003
+          4 |            5 |     83999 | 2004-11-14 21:11:11.111686
+          4 |            3 |     83999 | 2004-11-24 16:32:39.020194
+(6 rows)
+</screen>
+
+<para>In version 1.0.5, the "drop node" function purges out entries in
+sl_confirm for the departing node.  In earlier versions, this needs to
+be done manually.  Supposing the node number is 3, then the query
+would be:
+
+<command>
+delete from _namespace.sl_confirm where con_origin = 3 or con_received = 3;
+</command>
+
+<para>Alternatively, to go after <quote/all phantoms,/ you could use
+<screen>
+oxrsbar=# delete from _oxrsbar.sl_confirm where con_origin not in (select no_id from _oxrsbar.sl_node) or con_received not in (select no_id from _oxrsbar.sl_node);
+DELETE 6
+</screen>
+
+<para>General "due diligance" dictates starting with a
+<command/BEGIN/, looking at the contents of sl_confirm before,
+ensuring that only the expected records are purged, and then, only
+after that, confirming the change with a <command/COMMIT/.  If you
+delete confirm entries for the wrong node, that could ruin your whole
+day.
+
+<para>You'll need to run this on each node that remains...
+
+<para>Note that in 1.0.5, this is no longer an issue at all, as it purges unneeded entries from sl_confirm in two places:
+<itemizedlist>
+<listitem><para> At the time a node is dropped
+<listitem><para> At the start of each "cleanupEvent" run, which is the event in which old data is purged from sl_log_1 and sl_seqlog
+</itemizedlist>
+
+<qandaentry>
+<question><Para>Replication Fails - Unique Constraint Violation
+
+<para>Replication has been running for a while, successfully, when a
+node encounters a "glitch," and replication logs are filled with
+repetitions of the following:
+
+<screen>
+DEBUG2 remoteWorkerThread_1: syncing set 2 with 5 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 1 with 41 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 5 with 1 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 3 with 1 table(s) from provider 1
+DEBUG2 remoteHelperThread_1_1: 0.135 seconds delay for first row
+DEBUG2 remoteHelperThread_1_1: 0.343 seconds until close cursor
+ERROR  remoteWorkerThread_1: "insert into "_oxrsapp".sl_log_1          (log_origin, log_xid, log_tableid,                log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '34', '35090538', 'D', '_rserv_ts=''9275244''');
+delete from only public.epp_domain_host where _rserv_ts='9275244';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '34', '35090539', 'D', '_rserv_ts=''9275245''');
+delete from only public.epp_domain_host where _rserv_ts='9275245';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '26', '35090540', 'D', '_rserv_ts=''24240590''');
+delete from only public.epp_domain_contact where _rserv_ts='24240590';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '26', '35090541', 'D', '_rserv_ts=''24240591''');
+delete from only public.epp_domain_contact where _rserv_ts='24240591';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '26', '35090542', 'D', '_rserv_ts=''24240589''');
+delete from only public.epp_domain_contact where _rserv_ts='24240589';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '11', '35090543', 'D', '_rserv_ts=''36968002''');
+delete from only public.epp_domain_status where _rserv_ts='36968002';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '11', '35090544', 'D', '_rserv_ts=''36968003''');
+delete from only public.epp_domain_status where _rserv_ts='36968003';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '24', '35090549', 'I', '(contact_id,status,reason,_rserv_ts) values (''6972897'',''64'','''',''31044208'')');
+insert into public.contact_status (contact_id,status,reason,_rserv_ts) values ('6972897','64','','31044208');insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '24', '35090550', 'D', '_rserv_ts=''18139332''');
+delete from only public.contact_status where _rserv_ts='18139332';insert into "_oxrsapp".sl_log_1	  (log_origin, log_xid, log_tableid,		log_actionseq, log_cmdtype,		log_cmddata) values	  ('1', '919151224', '24', '35090551', 'D', '_rserv_ts=''18139333''');
+delete from only public.contact_status where _rserv_ts='18139333';" ERROR:  duplicate key violates unique constraint "contact_status_pkey"
+ - qualification was: 
+ERROR  remoteWorkerThread_1: SYNC aborted
+</screen>
+
+<para>The transaction rolls back, and Slony-I tries again, and again,
+and again.  The problem is with one of the <emphasis/last/ SQL statements, the
+one with <command/log_cmdtype = 'I'/.  That isn't quite obvious; what takes
+place is that Slony-I groups 10 update queries together to diminish
+the number of network round trips.
+
+<answer><para> A <emphasis/definite/ cause for this has not yet been
+determined.  The factors that <emphasis/appear/ to go together to
+contribute to this scenario are as follows:
+
+<itemizedlist>
+
+<listitem><para> The "glitch" seems to coincide with some sort of
+outage; it has been observed both in cases where databases were
+suffering from periodic "SIG 11" problems, where backends were falling
+over, as well as when temporary network failure seemed likely.
+
+<listitem><para> The scenario seems to involve a delete transaction
+having been missed by Slony-I.
+
+</itemizedlist>
+
+<para>By the time we notice that there is a problem, the missed delete
+transaction has been cleaned out of sl_log_1, so there is no recovery
+possible.
+
+<para>What is necessary, at this point, is to drop the replication set
+(or even the node), and restart replication from scratch on that node.
+
+<para>In Slony-I 1.0.5, the handling of purges of sl_log_1 is rather
+more conservative, refusing to purge entries that haven't been
+successfully synced for at least 10 minutes on all nodes.  It is not
+certain that that will prevent the "glitch" from taking place, but it
+seems likely that it will leave enough sl_log_1 data to be able to do
+something about recovering from the condition or at least diagnosing
+it more exactly.  And perhaps the problem is that sl_log_1 was being
+purged too aggressively, and this will resolve the issue completely.
+
+<qandaentry>
+
+<question><Para> If you have a slonik script something like this, it
+will hang on you and never complete, because you can't have
+<command/wait for event/ inside a <command/try/ block. A <command/try/
+block is executed as one transaction, and the event that you are
+waiting for can never arrive inside the scope of the transaction.
+
+<programlisting>
+try {
+      echo 'Moving set 1 to node 3';
+      lock set (id=1, origin=1);
+      echo 'Set locked';
+      wait for event (origin = 1, confirmed = 3);
+      echo 'Moving set';
+      move set (id=1, old origin=1, new origin=3);
+      echo 'Set moved - waiting for event to be confirmed by node 3';
+      wait for event (origin = 1, confirmed = 3);
+      echo 'Confirmed';
+} on error {
+      echo 'Could not move set for cluster foo';
+      unlock set (id=1, origin=1);
+      exit -1;
+}
+</programlisting>
+
+<answer><para> You must not invoke <command/wait for event/ inside a
+<quote/try/ block.
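+
+<para>One way to restructure the script above (a sketch; the preamble
+conninfo values are placeholders) is to keep the waits outside the
+<command/try/ block:
+
+<programlisting>
+$ cat move_set.slonik
+cluster name = foo;
+node 1 admin conninfo = 'host=node1host dbname=foodb user=postgres';
+node 3 admin conninfo = 'host=node3host dbname=foodb user=postgres';
+try {
+      echo 'Moving set 1 to node 3';
+      lock set (id=1, origin=1);
+      echo 'Set locked';
+      move set (id=1, old origin=1, new origin=3);
+} on error {
+      echo 'Could not move set for cluster foo';
+      unlock set (id=1, origin=1);
+      exit -1;
+}
+echo 'Set moved - waiting for event to be confirmed by node 3';
+wait for event (origin = 1, confirmed = 3);
+echo 'Confirmed';
+$ slonik move_set.slonik
+</programlisting>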
+
+</qandaentry>
+</qandaset>
+<!-- Keep this comment at the end of the file
+Local variables:
+mode:sgml
+sgml-omittag:nil
+sgml-shorttag:t
+sgml-minimize-attributes:nil
+sgml-always-quote-attributes:t
+sgml-indent-step:1
+sgml-indent-data:t
+sgml-parent-document:slony.sgml
+sgml-default-dtd-file:"./reference.ced"
+sgml-exposed-tags:nil
+sgml-local-catalogs:("/usr/lib/sgml/catalog")
+sgml-local-ecat-files:nil
+End:
+-->
\ No newline at end of file

