[Slony1-bugs] Slony 1.2.10: Deadlock on slave during execute script

Thu Sep 27 15:51:04 PDT 2007

Jeff Frost wrote:
> For extra background on this bug, see the thread here:
> http://lists.slony.info/pipermail/slony1-general/2007-September/006687.html 
>
>
> This is on Slony1-1.2.10, PostgreSQL-8.1.9:
>
> I've run into a situation on a client cluster composed of a master and 
> 2 slave nodes where a deadlock on one of the slaves happens quite 
> regularly during any EXECUTE SCRIT commands.  It seems if slony loses 
> the deadlock, some of the tables are left in a not altered for 
> replication state and this breaks replication.
I am now mighty confused as to how this could happen.

I headed down a pretty void rabbit trail, trying to duplicate the 
condition by trying to induce a deadlock.

I had not much luck trying to do that; the tendancy was for deadlocks to 
occur against the queries with the lower locking levels.

So I decided to instrument a slon and try to induce it to get into the 
described "no longer altered for replication" state. 

I added a nice "exit(-1);" at exactly the point at which you found the 
induction of the deadlock.  The slon fell over at the end of processing 
the DDL script, as expected.

And when it fell over, the transaction was (implicitly) rolled back, so 
that the condition of the subscriber was recovered to consistency.

2007-09-27 20:21:16 UTC DEBUG2 remoteWorkerThread_1: Received event 1,30 
DDL_SCRIPT
2007-09-27 20:21:16 UTC INFO   Checking local node id
2007-09-27 20:21:16 UTC INFO   Found local node id
2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
2007-09-27 20:21:16 UTC DEBUG4 remoteWorkerThread_3: update provider 
configuration
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL request with 10 
statements
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 0: 
[create sequence ncolseq;]
2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress3 
host=localhost user=cbbrowne port=5834" is 80300
2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 1: [
alter table table4 add column newcol timestamptz;]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 2: [
alter table table4 alter column newcol set default now();]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 3: [
alter table table4 rename column id1 to col1;]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 4: [
alter table table4 rename column id2 to col2;]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 5: [

ALTER TABLE billing_discount ADD column "use_term" character(1);]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 6: [
ALTER TABLE billing_discount ALTER column use_term set default 'n';]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 7: [
ALTER TABLE billing_discount add constraint use_term_cons check 
(use_term in ('y','n'));]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 8: [

alter table table1 rename column id to col1;]
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 9: []
2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_EMPTY_QUERY
2007-09-27 20:21:16 UTC DEBUG2 slon: child terminated status: 65280; 
pid: 24126, current worker pid: 24126
2007-09-27 20:21:16 UTC DEBUG1 slon: restart of worker in 10 seconds
2007-09-27 20:21:20 UTC DEBUG1 slon: shutdown requested
2007-09-27 20:21:20 UTC DEBUG2 slon: notify worker process to shutdown
2007-09-27 20:21:20 UTC DEBUG1 slon: done
2007-09-27 20:21:20 UTC DEBUG2 slon: exit(0)

So, for instance, statement #6 was going to alter table billing_discount 
to add a CHECK constraint:

As we can see, that constraint isn't there.

slonyregress2@[local]:5834=# \d billing_discount
                                           Table "public.billing_discount"
       Column        |           Type           
|                             Modifiers                             
---------------------+--------------------------+--------------------------------------------------------------------
 discount_code       | character(2)             | not null
 billing_object_type | character varying(10)    | not null
 billing_action_type | character varying(10)    | not null
 discount_amount     | numeric(7,2)             | not null
 start_date          | timestamp with time zone | not null
 end_date            | timestamp with time zone | not null
 billing_discount_id | integer                  | not null default 
nextval(('billing_discount_seq'::text)::regclass)
 registrar_id        | integer                  |
 tld_id              | integer                  |
 zone_id             | integer                  |
Indexes:
    "billing_discount_pkey" PRIMARY KEY, btree (billing_discount_id)
Triggers:
    _slony_regress1_denyaccess_5 BEFORE INSERT OR DELETE OR UPDATE ON 
billing_discount FOR EACH ROW EXECUTE PROCEDURE 
_slony_regress1.denyaccess('_slony_regress1')

slonyregress2@[local]:5834=#

This heads me back to the "wait, that's impossible!" expectation that I 
started with.

When the ERROR message was submitted to the slon log, that means that 
the transaction rolls back, and everything recovers to where it was.

That's what I see happen here; I can't quite fathom why that's not 
happening there.