Thu Sep 20 00:17:33 PDT 2007
Jeff Frost wrote:
> On Wed, 19 Sep 2007, Christopher Browne wrote:
>
>>> Bails out is the wrong description. Clobbered by the pgsql deadlock
>>> detection system, leaving the cluster in an unstable state would be
>>> more accurate, if that's what happened. I don't know that there's a
>>> way to catch that clobberage and run a "finally" type thing.
>>
>> The thing is, that's not what happens.
>>
>> Unless there's a COMMIT somewhere in the DDL script, in which case
>> all bets are off, everywhere, the deadlock should lead to one of two
>> things happening:
>>
>> 1. The other process that was holding onto the tables might fail and
>> roll back, and the DDL script would complete, or
>>
>> 2. The DDL script will fail and roll back, leading to the state of
>> the tables falling back to what it was before DDL script processing
>> began.
>>
>> In either case, the results should leave the node in a consistent
>> state, either:
>> a) With the DDL request having gone in, or
>> b) With the DDL request *not* having gone in.
>>
>> Unless there's an extra COMMIT in the code (and I just looked at the
>> code, and Did Not See One), the failure resulting from a deadlock
>> should be benign, restoring the tables to the "previously altered
>> state."
>>>
>>> So, is there a reasonable way to fix this without
>>> dropping/resubscribing the node?
>>>
>>> Well, to start with, you might want to figure out why your
>>> application is taking such aggressive locks. And make sure in the
>>> future that it doesn't happen again (not much point fixing it if
>>> it's just going to recur). If you are using a separate superuser
>>> account to connect to your database and run your slons (generally
>>> the "slony" user), then this is really easy to do: tweak your pg_hba
>>> to only allow connections from slony and then kick all active
>>> non-slony connections. Revert your pg_hba at the end of the
>>> maintenance.
>>>
>>> If you're willing to experiment with Slony-I internal functions,
>>> you could put the table in question into altered state. Something
>>> like this on the offending node might work:
>>>
>>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
>>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
>>>
>>> Or of course it might mess things up even more. :)
>>
>> The one change to suggest that comes to *my* mind is that we perhaps
>> ought to change Slony-I to aggressively lock the replicated tables
>> ASAP at the start of the DDL script event.
>>
>> That will either immediately succeed, and eliminate any possibility
>> of future deadlock, or immediately fail, before we even try to alter
>> anything.
>>
>> This same change was made to the SUBSCRIBE SET process (well,
>> "COPY_SET", strictly speaking, but that's not at all user
>> visible...), as we saw cases where gradually escalating locks on
>> tables led to deadlocks that could waste hours' worth of subscription
>> work.
>>
>> But it seems likely to me that there's more to the problem than we're
>> hearing, because a deadlock shouldn't cause any corruption of anything
>> - to the contrary, it may be expected to prevent it.
>
> This is the latest SQL that caused the problem (note there is no
> COMMIT in the SQL):
>
> --------
> CREATE TABLE orders.amazon_items
> (
>   id serial NOT NULL,
>   order_id integer NOT NULL,
>   item_id integer NOT NULL,
>   amazon_item_id character varying(14) NOT NULL,
>   CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
>   CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
>     REFERENCES orders.orders (id) MATCH SIMPLE
>     ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
>     REFERENCES orders.items (id) MATCH SIMPLE
>     ON UPDATE NO ACTION ON DELETE NO ACTION
> )
> WITH OIDS;
> ALTER TABLE orders.amazon_items OWNER TO thenerds;
> --------
>
> It was called by the following slonik script:
>
> --------
> #!/usr/bin/slonik
> include </nerds/preamble.slonik>;
>
> EXECUTE SCRIPT (
>   SET ID = 1,
>   FILENAME = '/nerds/thenerds.sql',
>   EVENT NODE = 1
> );
> --------
>
> and caused the following deadlock to occur:
>
> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); " PGRES_FATAL_ERROR
> Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR: deadlock detected
> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL: Process 12263 waits for
> AccessExclusiveLock on relation 121589880 of database 121589046; blocked by process 12096.
> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> AccessShareLock on relation 121589817 of database 121589046; blocked by process 12263.
>
> Which then left some of the tables on that slave in a bad state,
> breaking replication:
>
> 2007-09-18 15:56:06 EDT ERROR remoteWorkerThread_1: "select
> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
> CONTEXT: SQL statement "SELECT "_nerdcluster".alterTableRestore( $1 )"
> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
>
> Note that it's just an AccessShareLock that's killing us. Looks like
> that's caused by a select query which does searches. Our application
> does not produce any extraneous locking; it simply does SELECTs on
> that server.
>
> Interestingly, before we started using the slave for queries, we would
> have the deadlocks happen on the master when doing DDL changes, but
> this never caused the tables on the master to get into a bad state.
> You could just re-run your EXECUTE SCRIPT and it would usually work
> fine the second time.
>
> What other info can I provide?
>

First, I don't think you have to use the EXECUTE SCRIPT command to
create a new table.

I'd also like to know whether you use pgpool: on the slave, the master,
or both?

Does re-executing the script offer any solution?

Cheers,
SAS
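The internal-function repair suggested earlier in the thread, adapted to the table named in the error log, would look roughly like the sketch below. It assumes the cluster schema is "_nerdcluster" (as the logs show), that the stuck table is public.carts, and that calling alterTableForReplication() by hand is acceptable on that node; as noted above, it could also make things worse.

--------
-- Run on the subscriber where ddlScript_prepare_int() failed.
-- Look up the Slony-I id of public.carts and put the table back
-- into replication-altered state.
SELECT "_nerdcluster".alterTableForReplication(tab_id)
  FROM "_nerdcluster".sl_table
 WHERE tab_reloid = 'public.carts'::regclass;
--------

If more than one table was left half-restored, the same statement could be repeated for each table reported as "not in altered state".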
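Christopher Browne's idea of taking all the locks up front can be illustrated with plain SQL: a transaction that requests an ACCESS EXCLUSIVE lock on every replicated table before touching anything, so it either acquires them all at once or fails cleanly with NOWAIT instead of deadlocking partway through. The table names below are simply the ones mentioned in this thread, standing in for the full replication set; this is a sketch of the locking pattern, not a change to Slony-I itself.

--------
BEGIN;
-- Request the strongest lock on each replicated table first; NOWAIT
-- raises an error immediately instead of queueing behind readers, so
-- the transaction fails before any table has been altered.
LOCK TABLE public.carts  IN ACCESS EXCLUSIVE MODE NOWAIT;
LOCK TABLE orders.orders IN ACCESS EXCLUSIVE MODE NOWAIT;
LOCK TABLE orders.items  IN ACCESS EXCLUSIVE MODE NOWAIT;
-- ... one LOCK per remaining replicated table ...
-- The DDL against those tables would follow here.
COMMIT;
--------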