[Slony1-bugs] Slony 1.2.10: Deadlock on slave during execute script

Wed Sep 26 08:25:28 PDT 2007

For extra background on this bug, see the thread here:
http://lists.slony.info/pipermail/slony1-general/2007-September/006687.html

This is on Slony1-1.2.10, PostgreSQL-8.1.9:

I've run into a situation on a client cluster composed of a master and 2 slave 
nodes where a deadlock on one of the slaves happens quite regularly during any 
EXECUTE SCRIT commands.  It seems if slony loses the deadlock, some of the 
tables are left in a not altered for replication state and this breaks 
replication.

This is the latest SQL that caused the problem (note there is not a COMMIT in
the sql):

--------
CREATE TABLE orders.amazon_items
(
    id serial NOT NULL,
    order_id integer NOT NULL,
    item_id integer NOT NULL,
    amazon_item_id character varying(14) NOT NULL,
    CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
    CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
        REFERENCES orders.orders (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION,
    CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
        REFERENCES orders.items (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH OIDS;
ALTER TABLE orders.amazon_items OWNER TO thenerds;
--------

It was called by the following slonik script:

--------
#!/usr/bin/slonik
include </nerds/preamble.slonik>;

          EXECUTE SCRIPT (
                  SET ID = 1,
                  FILENAME = '/nerds/thenerds.sql',
                  EVENT NODE = 1
          );
--------

and caused the following deadlock to occur:

15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
remoteWorkerThread_1: "select "_nerdcluster".ddlScript
_complete_int(1, -1); " PGRES_FATAL_ERROR
Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
AccessExclusiveLock on relation 121589880 of databas
e 121589046; blocked by process 12096.
Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
AccessShareLock on relation 121589817 of database 121589046;
blocked by process 12263.

Which then left some of the tables on that slave in a bad state breaking
replication:

2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
"_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
PL/pgSQL function "ddlscript_prepare_int" line 46 at perform

Note that it's just an AccessShareLock that's killing us.  Looks like that's
caused by a select query which does searches.  Our application does not
produce any extraneous locking, it simply does SELECTS on that server.

Interestingly, before we started using the slave for queries, the deadlocks 
would happen on the master when doing DDL changes, but this never caused the 
tables on the master to get into a bad state.  You could just re-run your 
EXECUTE SCRIPT and it would usually work fine the second time.

-- 
Jeff Frost, Owner 	<jeff at frostconsultingllc.com>
Frost Consulting, LLC 	http://www.frostconsultingllc.com/
Phone: 650-780-7908	FAX: 650-649-1954