Stéphane Schildknecht stephane.schildknecht at postgresqlfr.org
Fri Sep 28 07:20:25 PDT 2007
Jeff Frost wrote:
> On Thu, 27 Sep 2007, Christopher Browne wrote:
>
>> Jeff Frost wrote:
>>> For extra background on this bug, see the thread here:
>>> http://lists.slony.info/pipermail/slony1-general/2007-September/006687.html
>>>
>>> This is on Slony1-1.2.10, PostgreSQL-8.1.9:
>>>
>>> I've run into a situation on a client cluster composed of a master
>>> and 2 slave nodes where a deadlock on one of the slaves happens
>>> quite regularly during any EXECUTE SCRIPT command.  It seems that if
>>> slony loses the deadlock, some of the tables are left in a state where
>>> they are no longer altered for replication, and this breaks replication.
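
(For reference: on a subscriber, a table that is "altered for replication"
carries the Slony-generated triggers, like the denyaccess trigger visible in
the \d output further down.  A rough way to see which tables still have
theirs; the LIKE pattern matches the 1.2-style trigger names, so adjust it
for your own cluster:

    SELECT c.relname, t.tgname
      FROM pg_trigger t
      JOIN pg_class c ON c.oid = t.tgrelid
     WHERE t.tgname LIKE '%denyaccess%';

Tables registered for replication that no longer show up here are candidates
for the broken state described above.)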
>> I am now mighty confused as to how this could happen.
>>
>> I headed down a fairly fruitless rabbit trail, trying to duplicate
>> the condition by inducing a deadlock.
>>
>> I didn't have much luck with that; the tendency was for deadlocks
>> to occur against the queries with the lower locking levels.
>>
>> So I decided to instrument a slon and try to induce it into the
>> described "no longer altered for replication" state.  I added a
>> nice "exit(-1);" at exactly the point where you found the deadlock
>> being induced.  The slon fell over at the end of processing the
>> DDL script, as expected.
>>
>> And when it fell over, the transaction was (implicitly) rolled back,
>> so that the condition of the subscriber was recovered to consistency.
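
(That recovery relies on PostgreSQL's transactional DDL: the remote worker
applies the whole script inside a single transaction on the subscriber, so
an abort, whether from an ERROR or from the process dying, undoes all of it.
A plain-psql illustration, reusing two statements from the script below:

    BEGIN;
    ALTER TABLE billing_discount ADD COLUMN "use_term" character(1);
    ALTER TABLE billing_discount ADD CONSTRAINT use_term_cons
          CHECK (use_term IN ('y','n'));
    ROLLBACK;            -- stands in for the slon dying before COMMIT

After the ROLLBACK, neither the column nor the constraint exists.)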
>>
>> 2007-09-27 20:21:16 UTC DEBUG2 remoteWorkerThread_1: Received event
>> 1,30 DDL_SCRIPT
>> 2007-09-27 20:21:16 UTC INFO   Checking local node id
>> 2007-09-27 20:21:16 UTC INFO   Found local node id
>> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is
>> 80300
>> 2007-09-27 20:21:16 UTC DEBUG4 remoteWorkerThread_3: update provider
>> configuration
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL request with
>> 10 statements
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 0:
>> [create sequence ncolseq;]
>> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress3
>> host=localhost user=cbbrowne port=5834" is 80300
>> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is
>> 80300
>> 2007-09-27 20:21:16 UTC DEBUG4 version for "dbname=slonyregress2" is
>> 80300
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 1: [
>> alter table table4 add column newcol timestamptz;]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 2: [
>> alter table table4 alter column newcol set default now();]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 3: [
>> alter table table4 rename column id1 to col1;]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 4: [
>> alter table table4 rename column id2 to col2;]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 5: [
>>
>> ALTER TABLE billing_discount ADD column "use_term" character(1);]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 6: [
>> ALTER TABLE billing_discount ALTER column use_term set default 'n';]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 7: [
>> ALTER TABLE billing_discount add constraint use_term_cons check
>> (use_term in ('y','n'));]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 8: [
>>
>> alter table table1 rename column id to col1;]
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_COMMAND_OK
>> 2007-09-27 20:21:16 UTC CONFIG remoteWorkerThread_1: DDL Statement 9: []
>> 2007-09-27 20:21:16 UTC CONFIG DDL success - PGRES_EMPTY_QUERY
>> 2007-09-27 20:21:16 UTC DEBUG2 slon: child terminated status: 65280;
>> pid: 24126, current worker pid: 24126
>> 2007-09-27 20:21:16 UTC DEBUG1 slon: restart of worker in 10 seconds
>> 2007-09-27 20:21:20 UTC DEBUG1 slon: shutdown requested
>> 2007-09-27 20:21:20 UTC DEBUG2 slon: notify worker process to shutdown
>> 2007-09-27 20:21:20 UTC DEBUG1 slon: done
>> 2007-09-27 20:21:20 UTC DEBUG2 slon: exit(0)
>>
>> So, for instance, statement #7 was going to alter table
>> billing_discount to add a CHECK constraint.  As the \d output below
>> shows, that constraint isn't there:
>>
>> slonyregress2@[local]:5834=# \d billing_discount
>>                            Table "public.billing_discount"
>>        Column        |           Type           |                    Modifiers
>> ---------------------+--------------------------+---------------------------------------------------------------------
>>  discount_code       | character(2)             | not null
>>  billing_object_type | character varying(10)    | not null
>>  billing_action_type | character varying(10)    | not null
>>  discount_amount     | numeric(7,2)             | not null
>>  start_date          | timestamp with time zone | not null
>>  end_date            | timestamp with time zone | not null
>>  billing_discount_id | integer                  | not null default nextval(('billing_discount_seq'::text)::regclass)
>>  registrar_id        | integer                  |
>>  tld_id              | integer                  |
>>  zone_id             | integer                  |
>> Indexes:
>>     "billing_discount_pkey" PRIMARY KEY, btree (billing_discount_id)
>> Triggers:
>>     _slony_regress1_denyaccess_5 BEFORE INSERT OR DELETE OR UPDATE ON billing_discount
>>         FOR EACH ROW EXECUTE PROCEDURE _slony_regress1.denyaccess('_slony_regress1')
>>
>> slonyregress2@[local]:5834=#
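
(The absence is also easy to confirm straight from the catalogs; this should
work on 8.1 and later:

    SELECT conname
      FROM pg_constraint
     WHERE conrelid = 'public.billing_discount'::regclass
       AND contype = 'c';    -- 'c' = CHECK constraints

Here it should return no rows, matching the \d output above.)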
>>
>> This brings me back to the "wait, that's impossible!" reaction that
>> I started with.
>>
>> When the ERROR message is written to the slon log, the transaction
>> rolls back, and everything recovers to where it was.
>>
>> That's what I see happen here; I can't quite fathom why that's not
>> happening there.
>
> I guess it's fun to be unique. :-)
>
> And you introduced the failure mode on the slave and not the
> master, correct?  Were you testing with 1.2.11 rather than 1.2.10?
> Could I possibly be misinterpreting the log messages?
>
> BTW, thanks for looking into this.
>
I personally hit these conditions with Slony 1.2.10, PG 8.2.3 or 8.2.4,
and pgpool used by web clients to connect to the slave.

Slony does not connect through pgpool, and pgpool is not installed on
the master.
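
If it is a lock fight between the EXECUTE SCRIPT transaction and the pgpool
client sessions, it should be visible on the slave while the DDL is waiting,
since ALTER TABLE needs exclusive locks that the web sessions can block.
Something along these lines (8.2 catalogs) shows who is stuck on what:

    SELECT l.pid, l.mode, l.granted, c.relname
      FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
     WHERE NOT l.granted;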

Regards,

SAS

