Jeff Frost jeff at frostconsultingllc.com
Thu Sep 27 15:56:41 PDT 2007
On Thu, 27 Sep 2007, Christopher Browne wrote:

> Jeff Frost wrote:
>> For extra background on this bug, see the thread here:
>> http://lists.slony.info/pipermail/slony1-general/2007-September/006687.html 
>> 
>> This is on Slony1-1.2.10, PostgreSQL-8.1.9:
>> 
>> I've run into a situation on a client cluster composed of a master and 2 
>> slave nodes where a deadlock on one of the slaves happens quite regularly 
>> during any EXECUTE SCRIPT commands.  It seems that if slony loses the deadlock,
>> some of the tables are left in a "not altered for replication" state, and this
>> breaks replication.
> I am now mighty confused as to how this could happen.
>
> I headed down a fairly unproductive rabbit trail, trying to duplicate the
> condition by inducing a deadlock.
>
> I didn't have much luck doing that; the tendency was for deadlocks to
> occur against the queries with the lower locking levels.
>
> So I decided to instrument a slon and try to induce it to get into the 
> described "no longer altered for replication" state. 
> I added a nice "exit(-1);" at exactly the point where you saw the deadlock
> induced.  The slon fell over at the end of processing the DDL script, as
> expected.
>
> And when it fell over, the transaction was (implicitly) rolled back, so that 
> the condition of the subscriber was recovered to consistency.
>
> 2007-09-27 20:21:16 UTC DEBUG2 remoteWorkerThread_1: Received event 1,30 
> DDL_SCRIPT
> 2007-09-27 20:21:16 UTC INFO   Checking local node id
> 2007-09-27 20:21:16 UTC INFO   Found local node id
> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
> 2007-09-27 20:21:16 UTC DEBUG4 remoteWorkerThread_3: update provider 
> configuration
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL request with 10 
> statements
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 0: [create 
> sequence ncolseq;]
> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress3 
> host=localhost user=cbbrowne port=5834" is 80300
> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is 80300
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 1: [
> alter table table4 add column newcol timestamptz;]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 2: [
> alter table table4 alter column newcol set default now();]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 3: [
> alter table table4 rename column id1 to col1;]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 4: [
> alter table table4 rename column id2 to col2;]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 5: [
>
> ALTER TABLE billing_discount ADD column "use_term" character(1);]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 6: [
> ALTER TABLE billing_discount ALTER column use_term set default 'n';]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 7: [
> ALTER TABLE billing_discount add constraint use_term_cons check (use_term in 
> ('y','n'));]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 8: [
>
> alter table table1 rename column id to col1;]
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 9: []
> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_EMPTY_QUERY
> 2007-09-27 20:21:16 UTC DEBUG2 slon: child terminated status: 65280; pid: 
> 24126, current worker pid: 24126
> 2007-09-27 20:21:16 UTC DEBUG1 slon: restart of worker in 10 seconds
> 2007-09-27 20:21:20 UTC DEBUG1 slon: shutdown requested
> 2007-09-27 20:21:20 UTC DEBUG2 slon: notify worker process to shutdown
> 2007-09-27 20:21:20 UTC DEBUG1 slon: done
> 2007-09-27 20:21:20 UTC DEBUG2 slon: exit(0)
>
> So, for instance, statement #7 was going to alter table billing_discount to
> add a CHECK constraint.
>
> As we can see below, that constraint isn't there:
>
> slonyregress2@[local]:5834=# \d billing_discount
>                                          Table "public.billing_discount"
>        Column        |           Type           |                              Modifiers
> ---------------------+--------------------------+---------------------------------------------------------------------
>  discount_code       | character(2)             | not null
>  billing_object_type | character varying(10)    | not null
>  billing_action_type | character varying(10)    | not null
>  discount_amount     | numeric(7,2)             | not null
>  start_date          | timestamp with time zone | not null
>  end_date            | timestamp with time zone | not null
>  billing_discount_id | integer                  | not null default nextval(('billing_discount_seq'::text)::regclass)
>  registrar_id        | integer                  |
>  tld_id              | integer                  |
>  zone_id             | integer                  |
> Indexes:
>     "billing_discount_pkey" PRIMARY KEY, btree (billing_discount_id)
> Triggers:
>     _slony_regress1_denyaccess_5 BEFORE INSERT OR DELETE OR UPDATE ON billing_discount FOR EACH ROW EXECUTE PROCEDURE _slony_regress1.denyaccess('_slony_regress1')
>
> slonyregress2@[local]:5834=#
>
> This heads me back to the "wait, that's impossible!" expectation that I 
> started with.
>
> When that ERROR message shows up in the slon log, it means the transaction
> rolls back, and everything recovers to where it was.
>
> That's what I see happen here; I can't quite fathom why that's not happening 
> there.

I guess it's fun to be unique. :-)
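
Just to make sure I'm following the rollback reasoning: transactional DDL
should indeed behave the way you describe.  A minimal sketch, hand-typed here
rather than pulled from either cluster, using the same billing_discount
column from the script:

    BEGIN;
    ALTER TABLE billing_discount ADD COLUMN use_term character(1);
    -- kill the backend / drop the connection here, or simply:
    ROLLBACK;
    -- afterwards \d billing_discount shows no use_term column,
    -- matching the output you pasted

So if the slon dies mid-script, the subscriber's schema should come back
whole, which is exactly what your test shows.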

And you induced the failure mode on the slave and not the master, correct?
Were you testing with 1.2.11 rather than 1.2.10?  Could I possibly be
misinterpreting the log messages?
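
One way to double-check the slave's state directly, rather than reading the
tea leaves in the log, would be to look for the denyaccess triggers
themselves.  A rough sketch (the cluster schema/trigger naming is taken from
your \d output above, so swap in the real cluster name; the table list is
just the ones mentioned in the script):

    SELECT c.relname AS table_name,
           t.tgname  AS trigger_name
      FROM pg_class c
      LEFT JOIN pg_trigger t
        ON t.tgrelid = c.oid
       AND t.tgname LIKE '_slony_regress1_denyaccess%'
     WHERE c.relname IN ('billing_discount', 'table1', 'table4')
     ORDER BY c.relname;

A NULL trigger_name there would mean the table was left without its
replication trigger, i.e. the "not altered for replication" state I was
describing.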

BTW, thanks for looking into this.
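
In case a concrete interleaving is useful for reproducing it on your end,
here's a hypothetical sequence (sketched from the script's statements, not
captured from our logs) of how the EXECUTE SCRIPT could end up deadlocked
against an ordinary read-only query on the slave:

    -- Session A: an application backend on the slave
    BEGIN;
    SELECT * FROM billing_discount;   -- takes AccessShareLock on billing_discount

    -- Session B: the slon applying the EXECUTE SCRIPT
    BEGIN;
    ALTER TABLE table4 ADD COLUMN newcol timestamptz;
                                      -- AccessExclusiveLock on table4
    ALTER TABLE billing_discount ADD COLUMN use_term character(1);
                                      -- blocks behind session A's lock

    -- Session A again:
    SELECT * FROM table4;             -- blocks behind session B
                                      -- => deadlock detected; one side gets the ERROR

Which side gets the error presumably depends on which backend runs the
deadlock check first.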

-- 
Jeff Frost, Owner 	<jeff at frostconsultingllc.com>
Frost Consulting, LLC 	http://www.frostconsultingllc.com/
Phone: 650-780-7908	FAX: 650-649-1954

