Fri Dec 10 23:53:17 PST 2004
- Previous message: [Slony1-commit] By darcyb: Fix cast to pointer from integer of different size on 64bit
- Next message: [Slony1-commit] By darcyb: Remove warning of ProcessConfigFile not defined.
Log Message:
-----------
Add in FAQ to the admin guide

Added Files:
-----------
slony1-engine/doc/adminguide:
        faq.sgml (r1.1)

-------------- next part --------------
--- /dev/null
+++ doc/adminguide/faq.sgml
@@ -0,0 +1,640 @@
+<qandaset>
+
+<qandaentry>
+<question><para>I looked for the <envar/_clustername/ namespace, and
+it wasn't there.</para></question>
+
+<answer><para> If the DSNs are wrong, then slon instances can't connect to the nodes.
+
+<para>This will generally lead to nodes remaining entirely untouched.
+
+<para>Recheck the connection configuration. By the way, since
+<application/slon/ links to libpq, password information may also be
+coming from <filename><envar>$HOME</envar>/.pgpass</filename>,
+filling in authentication details there, rightly or wrongly.
+</answer>
+</qandaentry>
+
+<qandaentry id="SlonyFAQ02">
+<question><para>
+Some events are moving around, but no replication is taking place
+
+<para> Slony logs might look like the following:
+
+<screen>
+DEBUG1 remoteListenThread_1: connected to 'host=host004 dbname=pgbenchrep user=postgres port=5432'
+ERROR remoteListenThread_1: "select ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type, ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from "_pgbenchtest".sl_event e where (e.ev_origin = '1' and e.ev_seqno > '1') order by e.ev_origin, e.ev_seqno" - could not receive data from server: Operation now in progress
+</screen>
+
+<answer><para>
+On AIX and Solaris (and possibly elsewhere), both Slony-I <emphasis/and PostgreSQL/ must be compiled with the <option/--enable-thread-safety/ option. The above error results when PostgreSQL isn't so compiled.
+
+<para>What breaks here is that the thread-safe libc and the
+non-thread-safe libpq use different memory locations for
+<envar/errno/, which causes the request to fail.
+
+<para>Problems like this crop up with disadmirable regularity on AIX
+and Solaris; it may take something of an <quote/object code audit/ to
+make sure that <emphasis/ALL/ of the necessary components have been
+compiled and linked with <option/--enable-thread-safety/.
+
+<para>For instance, I once ran into a problem where
+<envar/LD_LIBRARY_PATH/ had been set, on Solaris, to point to
+libraries from an old PostgreSQL compile. That meant that even though
+the database <emphasis/had/ been compiled with
+<option/--enable-thread-safety/, and <application/slon/ had been
+compiled against that, <application/slon/ was being dynamically linked
+to the <quote/bad old thread-unsafe version,/ so slon didn't work. It
+wasn't clear that this was the case until I ran <command/ldd/ against
+<application/slon/.
+
+<qandaentry>
+<question> <para>I tried creating a CLUSTER NAME with a "-" in it.
+That didn't work.
+
+<answer><para> Slony-I uses the same rules for unquoted identifiers as the PostgreSQL
+main parser, so no, you probably shouldn't put a "-" in your
+identifier name.
+
+<para> You may be able to defeat this by putting "quotes" around
+identifier names, but it's liable to cause problems later, so it's
+probably not worth working around.
+
+<qandaentry>
+<question><para> slon does not restart after crash
+
+<para> After an immediate stop of PostgreSQL (simulating a system
+crash), a tuple with relname='_${cluster_name}_Restart' remains in
+pg_catalog.pg_listener. slon doesn't start because it thinks another
+process is serving the cluster on this node. What can I do? The
+tuples can't be dropped from this relation.
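+
+<para>You can see the offending tuple with a query along these lines
+(a sketch; for a cluster named <envar/mycluster/, the entry would
+have relname <envar/_mycluster_Restart/):
+
+<screen>
+test=# select relname, listenerpid from pg_catalog.pg_listener
+        where relname like '%Restart%';
+</screen>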
+
+<para> The logs claim that "Another slon daemon is serving this node already"
+
+<answer>
+<para>It's handy to keep a slonik script like the following one around to
+run in such cases:
+
+<programlisting>
+twcsds004[/opt/twcsds004/OXRS/slony-scripts]$ cat restart_org.slonik
+cluster name = oxrsorg ;
+node 1 admin conninfo = 'host=32.85.68.220 dbname=oxrsorg user=postgres port=5532';
+node 2 admin conninfo = 'host=32.85.68.216 dbname=oxrsorg user=postgres port=5532';
+node 3 admin conninfo = 'host=32.85.68.244 dbname=oxrsorg user=postgres port=5532';
+node 4 admin conninfo = 'host=10.28.103.132 dbname=oxrsorg user=postgres port=5532';
+restart node 1;
+restart node 2;
+restart node 3;
+restart node 4;
+</programlisting>
+
+<para> <command/restart node n/ cleans up dead notifications so that you can restart the node.
+
+<para>As of version 1.0.5, the startup process of slon looks for this
+condition, and automatically cleans it up.
+
+<qandaentry>
+<question><para>
+ps finds passwords on command line
+
+<para> If I run a <command/ps/ command, I, and everyone else, can see passwords
+on the command line.
+
+<answer>
+<para>Take the passwords out of the Slony configuration, and put them into
+<filename><envar>$HOME</envar>/.pgpass</filename>.
+
+<qandaentry>
+<question><para>Slonik fails - cannot load PostgreSQL library - <command>PGRES_FATAL_ERROR load '$libdir/xxid';</command>
+
+<para> When I run the sample setup script I get an error message similar
+to:
+
+<command>
+stdin:64: PGRES_FATAL_ERROR load '$libdir/xxid'; - ERROR: LOAD: 
+could not open file '$libdir/xxid': No such file or directory
+</command>
+
+<answer><para> Evidently, you haven't got the <filename/xxid.so/
+library in the <envar/$libdir/ directory that the PostgreSQL instance
+is using. Note that the Slony-I components need to be installed in
+the PostgreSQL software installation for <emphasis/each and every one/
+of the nodes, not just on the <quote/master node./
+
+<para>This may also point to there being some other mismatch between
+the PostgreSQL binary instance and the Slony-I instance. If you
+compiled Slony-I yourself, on a machine that may have multiple
+PostgreSQL builds <quote/lying around,/ it's possible that the slon or
+slonik binaries are asking to load something that isn't actually in
+the library directory for the PostgreSQL database cluster that it's
+hitting.
+
+<para>Long and short: this points to a need to <quote/audit/ what
+installations of PostgreSQL and Slony you have in place on the
+machine(s). Unfortunately, just about any mismatch will cause things
+not to link up quite right. See also <link linkend="SlonyFAQ02">
+SlonyFAQ02 </link> concerning threading issues on Solaris ...
+
+<qandaentry>
+<question><para>Table indexes with FQ namespace names
+
+<programlisting>
+set add table (set id = 1, origin = 1, id = 27,
+               full qualified name = 'nspace.some_table',
+               key = 'key_on_whatever',
+               comment = 'Table some_table in namespace nspace with a candidate primary key');
+</programlisting>
+
+<answer><para> The table name must be fully qualified, but the key
+(index) name must not be; if you have <command/key =
+'nspace.key_on_whatever'/ the request will <emphasis/FAIL/.
+
+<qandaentry>
+<question><para>
+I'm trying to get a slave subscribed, and get the following
+messages in the logs:
+
+<screen>
+DEBUG1 copy_set 1
+DEBUG1 remoteWorkerThread_1: connected to provider DB
+WARN remoteWorkerThread_1: transactions earlier than XID 127314958 are still in progress
+WARN remoteWorkerThread_1: data copy for set 1 failed - sleep 60 seconds
+</screen>
+
+<para>Oops. What I forgot to mention, as well, was that I was trying
+to add <emphasis/TWO/ subscribers, concurrently.
+
+<answer><para> That doesn't work out: Slony-I won't run the
+<command/COPY/ commands concurrently. See
+<filename>src/slon/remote_worker.c</filename>, function
+<function/copy_set()/.
+
+<para>This has the (perhaps unfortunate) implication that you cannot
+populate two slaves concurrently. You have to subscribe one to the
+set, and only once it has completed setting up the subscription
+(copying table contents and such) can the second subscriber start
+setting up the subscription.
+
+<para>It is also possible that an old outstanding transaction is
+blocking Slony-I from processing the sync. You might want to take a
+look at pg_locks to see what's up:
+
+<screen>
+sampledb=# select * from pg_locks where transaction is not null order by transaction;
+ relation | database | transaction |   pid   |     mode      | granted
+----------+----------+-------------+---------+---------------+---------
+          |          |   127314921 | 2605100 | ExclusiveLock | t
+          |          |   127326504 | 5660904 | ExclusiveLock | t
+(2 rows)
+</screen>
+
+<para>See? 127314921 is indeed older than 127314958, and it's still running.
+
+<screen>
+$ ps -aef | egrep '[2]605100'
+postgres 2605100  205018   0 18:53:43  pts/3  3:13 postgres: postgres sampledb localhost COPY
+</screen>
+
+<para>This happens to be a <command/COPY/ transaction involved in setting up the
+subscription for one of the nodes. All is well; the system is busy
+setting up the first subscriber; it won't start on the second one
+until the first one has completed subscribing.
+
+<para>By the way, if there is more than one database on the PostgreSQL
+cluster, and activity is taking place on the OTHER database, that will
+lead to <quote/transactions earlier than XID whatever/ being reported
+as still in progress. The fact that it's a separate database on the
+cluster is irrelevant; Slony-I will wait until those old transactions
+terminate.
+<qandaentry>
+<question><para>
+ERROR: duplicate key violates unique constraint "sl_table-pkey"
+
+<para>I tried setting up a second replication set, and got the following error:
+
+<screen>
+stdin:9: Could not create subscription set 2 for oxrslive!
+stdin:11: PGRES_FATAL_ERROR select "_oxrslive".setAddTable(2, 1, 'public.replic_test', 'replic_test__Slony-I_oxrslive_rowID_key', 'Table public.replic_test without primary key'); - ERROR: duplicate key violates unique constraint "sl_table-pkey"
+CONTEXT: PL/pgSQL function "setaddtable_int" line 71 at SQL statement
+</screen>
+
+<answer><para>
+The table IDs used in SET ADD TABLE are required to be unique <emphasis/ACROSS
+ALL SETS/. Thus, you can't restart numbering at 1 for a second set; if
+you are numbering them consecutively, a subsequent set has to start
+with IDs after where the previous set(s) left off.
+<qandaentry>
+<question><para>I need to drop a table from a replication set
+<answer><para>
+This can be accomplished several ways, not all equally desirable ;-).
+
+<itemizedlist>
+<listitem><para> You could drop the whole replication set, and recreate it with just the tables that you need. Alas, that means recopying a whole lot of data, and kills the usability of the cluster on the rest of the set while that's happening.
+
+<listitem><para> If you are running 1.0.5 or later, there is the command SET DROP TABLE, which will "do the trick," as sketched below.
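+
+<para>For instance, a minimal slonik sketch (the cluster name,
+conninfo, and table ID 40 here are all hypothetical; substitute your
+own):
+
+<programlisting>
+cluster name = slonyschema;
+node 1 admin conninfo = 'host=localhost dbname=mydb user=postgres port=5432';
+set drop table (origin = 1, id = 40);
+</programlisting>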
+
+<listitem><para> If you are still using 1.0.1 or 1.0.2, the
+<emphasis/essential/ functionality of SET DROP TABLE is found in
+droptable_int(). You can do this by hand: find the ID of the table
+you want to get rid of, which you can find in sl_table, and then run
+the following three queries on each host:
+
+<programlisting>
+  select _slonyschema.alterTableRestore(40);
+  select _slonyschema.tableDropKey(40);
+  delete from _slonyschema.sl_table where tab_id = 40;
+</programlisting>
+
+<para>The schema will obviously depend on how you defined the Slony-I
+cluster. The table ID, in this case, 40, will need to change to the
+ID of the table you want to have go away.
+
+<para>You'll have to run these three queries on all of the nodes,
+preferably first on the "master" node, so that the drop propagates
+properly. Implementing this via a slonik statement with a new Slony
+event would do that; submitting the three queries using EXECUTE
+SCRIPT could do that. It is also possible to connect to each database
+and submit the queries by hand.
+</itemizedlist>
+<qandaentry>
+<question><para>I need to drop a sequence from a replication set
+
+<answer><para>If you are running 1.0.5 or later, there is a
+<command/SET DROP SEQUENCE/ command in Slonik to allow you to do this,
+paralleling <command/SET DROP TABLE./
+
+<para>If you are running 1.0.2 or earlier, the process is a bit more manual.
+
+<para>Suppose I want to get rid of the two sequences listed below,
+<envar/whois_cachemgmt_seq/ and <envar/epp_whoi_cach_seq_/. We start
+by finding their <envar/seq_id/ values.
+
+<screen>
+oxrsorg=# select * from _oxrsorg.sl_sequence where seq_id in (93,59);
+ seq_id | seq_reloid | seq_set |             seq_comment
+--------+------------+---------+-------------------------------------
+     93 |  107451516 |       1 | Sequence public.whois_cachemgmt_seq
+     59 |  107451860 |       1 | Sequence public.epp_whoi_cach_seq_
+(2 rows)
+</screen>
+
+<para>The data that needs to be deleted to stop Slony from continuing to
+replicate these sequences is:
+
+<programlisting>
+delete from _oxrsorg.sl_seqlog where seql_seqid in (93, 59);
+delete from _oxrsorg.sl_sequence where seq_id in (93,59);
+</programlisting>
+
+<para>Those two queries could be submitted to all of the nodes via
+<function/ddlscript()/ / <command/EXECUTE SCRIPT/, thus eliminating
+the sequence everywhere <quote/at once./ Or they may be applied by
+hand to each of the nodes.
+<qandaentry>
+<question><para>Slony-I: cannot add table to currently subscribed set 1
+
+<para> I tried to add a table to a set, and got the following message:
+
+<screen>
+	Slony-I: cannot add table to currently subscribed set 1
+</screen>
+
+<answer><para> You cannot add tables to sets that already have
+subscribers.
+
+<para>The workaround to this is to create <emphasis/ANOTHER/ set, add
+the new tables to that new set, subscribe the same nodes subscribing
+to "set 1" to the new set, and then merge the sets together.
+
+<qandaentry>
+<question><para>Some nodes start consistently falling behind
+
+<para>I have been running Slony-I on a node for a while, and am seeing
+system performance suffering.
+
+<para>I'm seeing long running queries of the form:
+<screen>
+	fetch 100 from LOG;
+</screen>
+
+<answer><para> This is characteristic of pg_listener (which is the table containing
+<command/NOTIFY/ data) having plenty of dead tuples in it. That makes <command/NOTIFY/
+events take a long time, and causes the affected node to gradually
+fall further and further behind.
+
+<para>You quite likely need to do a <command/VACUUM FULL/ on
+<envar/pg_listener/, to vigorously clean it out, and then to vacuum
+<envar/pg_listener/ really frequently. Once every five minutes would
+likely be enough.
+
+<para> Slon daemons already vacuum a bunch of tables, and
+<filename/cleanup_thread.c/ contains a list of tables that are
+frequently vacuumed automatically. In Slony-I 1.0.2,
+<envar/pg_listener/ is not included. In 1.0.5 and later, it is
+regularly vacuumed, so this should cease to be a direct issue.
+
+<para>There is, however, a scenario where this will still
+"bite." Vacuums cannot delete tuples that were made "obsolete" at any
+time after the start time of the eldest transaction that is still
+open. Long-running transactions will cause trouble, and should be
+avoided, even on "slave" nodes.
+
+<qandaentry>
+<question><para>I started doing a backup using pg_dump, and suddenly Slony stops
+
+<answer><para>Ouch. What happens here is a conflict between:
+<itemizedlist>
+
+<listitem><para> <application/pg_dump/, which has taken out an <command/AccessShareLock/ on all of the tables in the database, including the Slony-I ones, and
+
+<listitem><para> A Slony-I sync event, which wants to grab a <command/AccessExclusiveLock/ on the table <envar/sl_event/.
+</itemizedlist>
+
+<para>The initial query that will be blocked is thus:
+
+<screen>
+select "_slonyschema".createEvent('_slonyschema', 'SYNC', NULL);
+</screen>
+
+<para>(You can see this in <envar/pg_stat_activity/, if you have query
+display turned on in <filename/postgresql.conf/.)
+
+<para>The actual query combination that is causing the lock is from
+the function <function/Slony_I_ClusterStatus()/, found in
+<filename/slony1_funcs.c/, and is localized in the code that does:
+
+<programlisting>
+  LOCK TABLE %s.sl_event;
+  INSERT INTO %s.sl_event (...stuff...)
+  SELECT currval('%s.sl_event_seq');
+</programlisting>
+
+<para>The <command/LOCK/ statement will sit there and wait until <command/pg_dump/ (or whatever else holds pretty much any kind of access lock on <envar/sl_event/) completes.
+
+<para>Every subsequent query submitted that touches <envar/sl_event/ will block behind the <function/createEvent/ call.
+
+<para>There are a number of possible answers to this:
+<itemizedlist>
+
+<listitem><para> Have pg_dump specify the schema dumped using
+--schema=whatever, and don't try dumping the cluster's schema.
+
+<listitem><para> It would be nice to add an "--exclude-schema" option
+to pg_dump to exclude the Slony cluster schema. Maybe in 8.0 or
+8.1...
+
+<listitem><para>Note that 1.0.5 uses a more precise lock that is less
+exclusive, which alleviates this problem.
+</itemizedlist>
+<qandaentry>
+
+<question><para>The slons spent the weekend out of commission [for
+some reason], and it's taking a long time to get a sync through.
+
+<answer><para>
+You might want to take a look at the sl_log_1/sl_log_2 tables, and do
+a summary to see if there are any really enormous Slony-I transactions
+in there; a sample query is sketched below. Up until at least 1.0.2,
+there needs to be a slon connected to the master in order for
+<command/SYNC/ events to be generated.
+
+<para>If none are being generated, then all of the updates until the next
+one is generated will collect into one rather enormous Slony-I
+transaction.
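+
+<para>A query along these lines will show which transactions account
+for the most log rows (a sketch; substitute your own cluster schema
+for <envar/_mycluster/):
+
+<screen>
+mydb=# select log_xid, count(*) from _mycluster.sl_log_1
+        group by log_xid order by count(*) desc limit 10;
+</screen>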
+
+<para>Conclusion: Even if there is not going to be a subscriber around, you
+<emphasis/really/ want to have a slon running to service the <quote/master/ node.
+
+<para>Some future version (probably 1.1) may provide a way for
+<command/SYNC/ counts to be updated on the master by the stored
+function that is invoked by the table triggers.
+
+<qandaentry>
+<question><para>I pointed a subscribing node to a different parent and it stopped replicating
+
+<answer><para>
+We noticed this happening when we wanted to re-initialize a node,
+where we had configuration thus:
+
+<itemizedlist>
+<listitem><para> Node 1 - master
+<listitem><para> Node 2 - child of node 1 - the node we're reinitializing
+<listitem><para> Node 3 - child of node 2 - node that should keep replicating
+</itemizedlist>
+
+<para>The subscription for node 3 was changed to have node 1 as
+provider, and we did <command/DROP SET//<command/SUBSCRIBE SET/ for
+node 2 to get it repopulating.
+
+<para>Unfortunately, replication suddenly stopped to node 3.
+
+<para>The problem was that there was not a suitable set of <quote/listener paths/
+in sl_listen to allow the events from node 1 to propagate to node 3.
+The events were going through node 2, and blocking behind the
+<command/SUBSCRIBE SET/ event that node 2 was working on.
+
+<para>The following slonik script dropped out the listen paths where node 3
+had to go through node 2, and added in direct listens between nodes 1
+and 3.
+
+<programlisting>
+cluster name = oxrslive;
+ node 1 admin conninfo='host=32.85.68.220 dbname=oxrslive user=postgres port=5432';
+ node 2 admin conninfo='host=32.85.68.216 dbname=oxrslive user=postgres port=5432';
+ node 3 admin conninfo='host=32.85.68.244 dbname=oxrslive user=postgres port=5432';
+ node 4 admin conninfo='host=10.28.103.132 dbname=oxrslive user=postgres port=5432';
+try {
+    store listen (origin = 1, receiver = 3, provider = 1);
+    store listen (origin = 3, receiver = 1, provider = 3);
+    drop listen (origin = 1, receiver = 3, provider = 2);
+    drop listen (origin = 3, receiver = 1, provider = 2);
+}
+</programlisting>
+
+<para>Immediately after this script was run, <command/SYNC/ events started propagating
+again to node 3.
+
+<para>This points out two principles:
+<itemizedlist>
+
+<listitem><para> If you have multiple nodes, and cascaded subscribers,
+you need to be quite careful in populating the <command/STORE LISTEN/
+entries, and in modifying them if the structure of the replication
+"tree" changes.
+
+<listitem><para> Version 1.1 probably ought to provide better tools to
+help manage this.
+
+</itemizedlist>
+
+<para>The issues of "listener paths" are discussed further at <link
+linkend="ListenPaths"> Slony Listen Paths </link>
+
+<qandaentry>
+<question><para>After dropping a node, sl_log_1 isn't getting purged out anymore.
+
+<answer><para> This is a common scenario in versions before 1.0.5, as
+the "clean up" that takes place when purging the node does not include
+purging out old entries from the Slony-I table, sl_confirm, for the
+recently departed node.
+
+<para> The node is no longer around to update confirmations of what
+syncs have been applied on it, and therefore the cleanup thread that
+purges log entries thinks that it can't safely delete entries newer
+than the final sl_confirm entry, which rather curtails the ability to
+purge out old logs.
+
+<para>Diagnosis: Run the following query to see if there are any
+"phantom/obsolete/blocking" sl_confirm entries:
+
+<screen>
+oxrsbar=# select * from _oxrsbar.sl_confirm where con_origin not in (select no_id from _oxrsbar.sl_node) or con_received not in (select no_id from _oxrsbar.sl_node);
+ con_origin | con_received | con_seqno |       con_timestamp
+------------+--------------+-----------+----------------------------
+          4 |          501 |     83999 | 2004-11-09 19:57:08.195969
+          1 |            2 |   3345790 | 2004-11-14 10:33:43.850265
+          2 |          501 |    102718 | 2004-11-14 10:33:47.702086
+        501 |            2 |      6577 | 2004-11-14 10:34:45.717003
+          4 |            5 |     83999 | 2004-11-14 21:11:11.111686
+          4 |            3 |     83999 | 2004-11-24 16:32:39.020194
+(6 rows)
+</screen>
+
+<para>In version 1.0.5, the "drop node" function purges out entries in
+sl_confirm for the departing node. In earlier versions, this needs to
+be done manually. Supposing the node number is 3, then the query
+would be:
+
+<command>
+delete from _namespace.sl_confirm where con_origin = 3 or con_received = 3;
+</command>
+
+<para>Alternatively, to go after <quote/all phantoms,/ you could use
+<screen>
+oxrsbar=# delete from _oxrsbar.sl_confirm where con_origin not in (select no_id from _oxrsbar.sl_node) or con_received not in (select no_id from _oxrsbar.sl_node);
+DELETE 6
+</screen>
+
+<para>General "due diligence" dictates starting with a
+<command/BEGIN/, examining the contents of sl_confirm first, ensuring
+that only the expected records are purged, and only then confirming
+the change with a <command/COMMIT/. If you delete confirm entries for
+the wrong node, that could ruin your whole day.
+
+<para>You'll need to run this on each node that remains...
+
+<para>Note that in 1.0.5, this is no longer an issue at all, as it purges unneeded entries from sl_confirm in two places:
+<itemizedlist>
+<listitem><para> At the time a node is dropped
+<listitem><para> At the start of each "cleanupEvent" run, which is the event in which old data is purged from sl_log_1 and sl_seqlog
+</itemizedlist>
+
+<qandaentry>
+<question><para>Replication Fails - Unique Constraint Violation
+
+<para>Replication has been running for a while, successfully, when a
+node encounters a "glitch," and replication logs are filled with
+repetitions of the following:
+
+<screen>
+DEBUG2 remoteWorkerThread_1: syncing set 2 with 5 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 1 with 41 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 5 with 1 table(s) from provider 1
+DEBUG2 remoteWorkerThread_1: syncing set 3 with 1 table(s) from provider 1
+DEBUG2 remoteHelperThread_1_1: 0.135 seconds delay for first row
+DEBUG2 remoteHelperThread_1_1: 0.343 seconds until close cursor
+ERROR remoteWorkerThread_1: "insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '34', '35090538', 'D', '_rserv_ts=''9275244''');
+delete from only public.epp_domain_host where _rserv_ts='9275244';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '34', '35090539', 'D', '_rserv_ts=''9275245''');
+delete from only public.epp_domain_host where _rserv_ts='9275245';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '26', '35090540', 'D', '_rserv_ts=''24240590''');
+delete from only public.epp_domain_contact where _rserv_ts='24240590';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '26', '35090541', 'D', '_rserv_ts=''24240591''');
"_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '26', '35090541', 'D', '_rserv_ts=''24240591'''); +delete from only public.epp_domain_contact where _rserv_ts='24240591';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '26', '35090542', 'D', '_rserv_ts=''24240589'''); +delete from only public.epp_domain_contact where _rserv_ts='24240589';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '11', '35090543', 'D', '_rserv_ts=''36968002'''); +delete from only public.epp_domain_status where _rserv_ts='36968002';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '11', '35090544', 'D', '_rserv_ts=''36968003'''); +delete from only public.epp_domain_status where _rserv_ts='36968003';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '24', '35090549', 'I', '(contact_id,status,reason,_rserv_ts) values (''6972897'',''64'','''',''31044208'')'); +insert into public.contact_status (contact_id,status,reason,_rserv_ts) values ('6972897','64','','31044208');insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '24', '35090550', 'D', '_rserv_ts=''18139332'''); +delete from only public.contact_status where _rserv_ts='18139332';insert into "_oxrsapp".sl_log_1 (log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) values ('1', '919151224', '24', '35090551', 'D', '_rserv_ts=''18139333'''); +delete from only public.contact_status where _rserv_ts='18139333';" ERROR: duplicate key violates unique constraint "contact_status_pkey" + - qualification was: +ERROR remoteWorkerThread_1: SYNC aborted +</screen> + +<para>The transaction rolls back, and Slony-I tries again, and again, +and again. The problem is with one of the <emphasis/last/ SQL statements, the +one with <command/log_cmdtype = 'I'/. That isn't quite obvious; what takes +place is that Slony-I groups 10 update queries together to diminish +the number of network round trips. + +<answer><para> + +<para> A <emphasis/certain/ cause for this has not yet been arrived +at. The factors that <emphasis/appear/ to go together to contribute +to this scenario are as follows: + +<itemizedlist> + +<listitem><para> The "glitch" seems to coincide with some sort of +outage; it has been observed both in cases where databases were +suffering from periodic "SIG 11" problems, where backends were falling +over, as well as when temporary network failure seemed likely. + +<listitem><para> The scenario seems to involve a delete transaction +having been missed by Slony-I. + +</itemizedlist> + +<para>By the time we notice that there is a problem, the missed delete +transaction has been cleaned out of sl_log_1, so there is no recovery +possible. + +<para>What is necessary, at this point, is to drop the replication set +(or even the node), and restart replication from scratch on that node. + +<para>In Slony-I 1.0.5, the handling of purges of sl_log_1 are rather +more conservative, refusing to purge entries that haven't been +successfully synced for at least 10 minutes on all nodes. 
+certain that this will prevent the "glitch" from taking place, but it
+seems likely to leave enough sl_log_1 data to allow recovering from
+the condition, or at least diagnosing it more exactly. And perhaps
+the problem is simply that sl_log_1 was being purged too aggressively,
+in which case this will resolve the issue completely.
+
+<qandaentry>
+
+<question><para> If you have a slonik script something like this, it
+will hang on you and never complete, because you can't have
+<command/wait for event/ inside a <command/try/ block. A <command/try/
+block is executed as one transaction, and the event that you are
+waiting for can never arrive inside the scope of the transaction.
+
+<programlisting>
+try {
+      echo 'Moving set 1 to node 3';
+      lock set (id=1, origin=1);
+      echo 'Set locked';
+      wait for event (origin = 1, confirmed = 3);
+      echo 'Moving set';
+      move set (id=1, old origin=1, new origin=3);
+      echo 'Set moved - waiting for event to be confirmed by node 3';
+      wait for event (origin = 1, confirmed = 3);
+      echo 'Confirmed';
+} on error {
+      echo 'Could not move set for cluster foo';
+      unlock set (id=1, origin=1);
+      exit -1;
+}
+</programlisting>
+
+<answer><para> You must not invoke <command/wait for event/ inside a
+<quote/try/ block.
+
+</qandaentry>
+</qandaset>
+<!-- Keep this comment at the end of the file
+Local variables:
+mode:sgml
+sgml-omittag:nil
+sgml-shorttag:t
+sgml-minimize-attributes:nil
+sgml-always-quote-attributes:t
+sgml-indent-step:1
+sgml-indent-data:t
+sgml-parent-document:slony.sgml
+sgml-default-dtd-file:"./reference.ced"
+sgml-exposed-tags:nil
+sgml-local-catalogs:("/usr/lib/sgml/catalog")
+sgml-local-ecat-files:nil
+End:
+-->
\ No newline at end of file