Stéphane Schildknecht stephane.schildknecht at postgresqlfr.org
Thu Sep 20 00:17:33 PDT 2007
Jeff Frost wrote:
> On Wed, 19 Sep 2007, Christopher Browne wrote:
>
>>> "Bails out" is the wrong description. "Clobbered by the pgsql deadlock
>>> detection system, leaving the cluster in an unstable state" would be
>>> more accurate, if that's what happened. I don't know of a way to catch
>>> that clobbering and run a "finally"-type cleanup.
>> The thing is, that's not what happens.
>>
>> Unless there's a COMMIT somewhere in the DDL script (in which case
>> all bets are off, everywhere), the deadlock should lead to one of two
>> things happening:
>>
>> 1. The other process that was holding onto the tables might fail and
>> roll back, and the DDL script would complete, or
>>
>> 2. The DDL script will fail and roll back, returning the tables to
>> the state they were in before DDL script processing began.
>>
>> In either case, the results should leave the node in a consistent
>> state, either:
>> a) With the DDL request having gone in, or
>> b) With the DDL request *not* having gone in.
>>
>> Unless there's an extra COMMIT in the code (and I just looked at the
>> code, and Did Not See One), the failure resulting from a deadlock
>> should be benign, restoring the tables to the "previously altered
>> state."
>>>
>>>     So, is there a reasonable way to fix this without
>>>     dropping/resubscribing the node?
>>>
>>>
>>> Well, to start with, you might want to figure out why your
>>> application is taking such aggressive locks, and make sure it doesn't
>>> happen again in the future (there's not much point fixing it if it's
>>> just going to recur). If you use a separate superuser account to
>>> connect to your database and run your slons (generally the "slony"
>>> user), then this is really easy to do: tweak your pg_hba to only
>>> allow connections from slony, kick all active non-slony connections,
>>> and revert your pg_hba at the end of the maintenance.
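>>>
>>> As a rough sketch (the addresses and exact pg_hba.conf layout below
>>> are made up for illustration, and the pg_stat_activity columns assume
>>> an 8.x-era server), the maintenance pg_hba.conf might be reduced to:
>>>
>>>   # allow only the slony superuser during maintenance
>>>   local   all   slony                       md5
>>>   host    all   slony   192.168.0.0/24      md5
>>>   # reject everyone else
>>>   host    all   all     0.0.0.0/0           reject
>>>
>>> and then, after a reload, something like this lists the remaining
>>> non-slony backends so you know which ones to kick:
>>>
>>>   SELECT procpid, usename, client_addr
>>>     FROM pg_stat_activity
>>>    WHERE usename <> 'slony';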
>>>
>>> If you're willing to experiment with using slony internal functions,
>>> you could put the table in question into altered state. Something
>>> like this on the offending node might work.
>>>
>>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
>>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
>>>
>>> Or of course it might mess things up even more. :)
>> The one change that comes to *my* mind is that we perhaps ought to
>> change Slony-I to aggressively lock the replicated tables ASAP at the
>> start of the DDL script event.
>>
>> That will either immediately succeed, and eliminate any possibility
>> of future deadlock, or immediately fail, before we even try to alter
>> anything.
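>>
>> Roughly, the idea (just a sketch, not the actual Slony-I code; the
>> table list here is only the tables mentioned in this thread) would be
>> to take every lock up front inside the DDL event's transaction:
>>
>>   BEGIN;
>>   -- grab all replicated tables at once: this either succeeds
>>   -- immediately or fails before anything has been altered
>>   LOCK TABLE public.carts, orders.orders, orders.items
>>        IN ACCESS EXCLUSIVE MODE;
>>   -- ... alterTableRestore() / DDL script / alterTableForReplication() ...
>>   COMMIT;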
>>
>> This same change was made to the SUBSCRIBE SET process (well,
>> "COPY_SET", strictly speaking, but that's not at all user
>> visible...), as we saw cases where gradually escalating locks on
>> tables led to deadlocks that could waste hours worth of subscription
>> work.
>>
>> But it seems likely to me that there's more to the problem than we're
>> hearing, because deadlock shouldn't cause any corruption of anything
>> - to the contrary, it may be expected to prevent it.
>
> This is the latest SQL that caused the problem (note that there is no
> COMMIT in the SQL):
>
> --------
> CREATE TABLE orders.amazon_items
> (
>   id serial NOT NULL,
>   order_id integer NOT NULL,
>   item_id integer NOT NULL,
>   amazon_item_id character varying(14) NOT NULL,
>   CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
>   CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
>       REFERENCES orders.orders (id) MATCH SIMPLE
>       ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
>       REFERENCES orders.items (id) MATCH SIMPLE
>       ON UPDATE NO ACTION ON DELETE NO ACTION
> )
> WITH OIDS;
> ALTER TABLE orders.amazon_items OWNER TO thenerds;
> --------
>
> It was called by the following slonik script:
>
> --------
> #!/usr/bin/slonik
> include </nerds/preamble.slonik>;
>
>         EXECUTE SCRIPT (
>                 SET ID = 1,
>                 FILENAME = '/nerds/thenerds.sql',
>                 EVENT NODE = 1
>         );
> --------
>
> and caused the following deadlock to occur:
>
> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
> PGRES_FATAL_ERROR
> Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
> AccessExclusiveLock on relation 121589880 of database 121589046; blocked by
> process 12096.
> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> AccessShareLock on relation 121589817 of database 121589046; blocked by
> process 12263.
>
> This then left some of the tables on that slave in a bad state,
> breaking replication:
>
> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered
> state
> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
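>
> To see which tables the node currently believes are in altered state,
> something like this should show it (a sketch that assumes a Slony-I
> 1.x schema where sl_table carries a tab_altered flag; check your
> version's schema before relying on it):
>
> --------
> SELECT t.tab_id, c.relname, t.tab_altered
>   FROM "_nerdcluster".sl_table t
>   JOIN pg_class c ON c.oid = t.tab_reloid
>  ORDER BY t.tab_id;
> --------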
>
> Note that it's just an AccessShareLock that's killing us. That looks
> like it's caused by a SELECT query that does searches. Our application
> doesn't take any extraneous locks; it only runs SELECTs on that
> server.
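>
> To see exactly who is holding what on that table during one of these
> windows, a query along these lines should work (a sketch; join it to
> pg_stat_activity via the 8.x procpid column if you also want the
> query text):
>
> --------
> SELECT l.pid, l.mode, l.granted, c.relname
>   FROM pg_locks l
>   JOIN pg_class c ON c.oid = l.relation
>  WHERE c.relname = 'carts'
>  ORDER BY l.granted DESC, l.pid;
> --------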
>
> Interestingly, before we started using the slave for queries, the
> deadlocks would happen on the master when doing DDL changes, but that
> never left the tables on the master in a bad state. You could just
> re-run your EXECUTE SCRIPT and it would usually work fine the second
> time.
>
> What other info can I provide?
>
First, I don't think you have to use the EXECUTE SCRIPT command just to
create a new table.
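
A rough sketch of the usual alternative (the set id, table id and node
numbers below are made up, and the table has to be created on every
node first, e.g. with psql) would be to add it through a new set and
then merge:

--------
#!/usr/bin/slonik
include </nerds/preamble.slonik>;

        CREATE SET (ID = 2, ORIGIN = 1, COMMENT = 'new tables');
        SET ADD TABLE (SET ID = 2, ORIGIN = 1, ID = 100,
                FULLY QUALIFIED NAME = 'orders.amazon_items');
        SUBSCRIBE SET (ID = 2, PROVIDER = 1, RECEIVER = 2, FORWARD = NO);
        # wait for the subscription to complete before merging
        MERGE SET (ID = 1, ADD ID = 2, ORIGIN = 1);
--------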

I'd also like to know whether you use pgpool, and if so, whether it
runs on the slave, the master, or both.

Does re-executing the script offer any solution?

Cheers,
SAS

