Sean Staats sean at ftdna.com
Tue Jun 2 13:49:26 PDT 2009
Thanks for the idea, Jeff.  I inspected the log more closely and found 
what I think is the _real_ problem (unexpected chunk number)...

2009-05-28 23:49:31 CDT DEBUG2 remoteWorkerThread_1: Begin COPY of table 
"finch"."blastres_qry"
2009-05-28 23:49:31 CDT DEBUG2 remoteWorkerThread_1:  nodeon73 is 0
NOTICE:  truncate of "finch"."blastres_qry" failed - doing delete
2009-05-28 23:49:31 CDT DEBUG2 remoteListenThread_1: LISTEN
2009-05-28 23:49:33 CDT DEBUG2 remoteListenThread_1: queue event 1,4141 SYNC
2009-05-28 23:49:33 CDT DEBUG2 remoteListenThread_1: UNLISTEN
2009-05-28 23:49:34 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 2412
2009-05-28 23:49:38 CDT DEBUG2 localListenThread: Received event 2,2412 SYNC
2009-05-28 23:49:44 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 2413
2009-05-28 23:49:44 CDT DEBUG2 localListenThread: Received event 2,2413 SYNC
2009-05-28 23:49:45 CDT DEBUG2 remoteListenThread_1: queue event 1,4142 SYNC
2009-05-28 23:49:53 CDT DEBUG2 remoteListenThread_1: LISTEN
2009-05-28 23:49:54 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 2414
2009-05-28 23:49:56 CDT DEBUG2 localListenThread: Received event 2,2414 SYNC
2009-05-28 23:50:00 CDT ERROR  remoteWorkerThread_1: copy to stdout on 
provider - PGRES_FATAL_ERROR ERROR:  unexpected chunk number 8618 
(expected 426) for toast value 34675334
2009-05-28 23:50:00 CDT WARN   remoteWorkerThread_1: data copy for set 1 
failed - sleep 30 seconds
2009-05-28 23:50:00 CDT DEBUG2 remoteListenThread_1: queue event 1,4143 SYNC
2009-05-28 23:50:00 CDT DEBUG2 remoteListenThread_1: UNLISTEN
WARNING:  there is no transaction in progress

Any ideas on what could cause such an error?
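
In case it helps with diagnosis: as I understand it, that error means a
damaged TOAST entry in the provider's copy of "finch"."blastres_qry", so
the COPY dies when it detoasts that value.  Here's a rough sketch of how
one might hunt for the offending row on the provider.  The function name
find_bad_toast() and the column name big_col are placeholders; substitute
whatever TOASTable column(s) the table actually has:

-- Hypothetical helper: force a detoast of every row and report the ctid
-- of any row whose TOASTed value is corrupt.
CREATE OR REPLACE FUNCTION find_bad_toast() RETURNS void AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM finch.blastres_qry LOOP
        BEGIN
            -- length() on the cast forces detoasting; a corrupt chunk
            -- raises the same "unexpected chunk number" error here.
            PERFORM length(big_col::text)
              FROM finch.blastres_qry
             WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'detoast failed at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT find_bad_toast();

Once the bad ctid is known, that row can be deleted or restored from a
backup before retrying the subscribe.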

Sean


Jeff Frost wrote:
> Jeff Frost wrote:
>> Andrew Sullivan wrote:
>>> On Tue, Jun 02, 2009 at 10:30:45AM -0500, Sean Staats wrote:
>>>   
>>>> I created a new replication cluster.  It turns out that starting the  
>>>> table IDs at id=1 and the sequence IDs at id=1001 didn't make any  
>>>> difference, as Slony gave me the same error (sequence ID 1001 has already  
>>>> been assigned).  Increasing the log verbosity to 4 doesn't produce any  
>>>> more useful debugging information.  Time for another approach.
>>>>
>>>> Would it make sense to create 2 different sets - one to replicate the  
>>>> tables and one to replicate the sequences?  Is there a downside to this  
>>>> kind of workaround?
>>>>     
>>>
>>> It'd be better to figure out what the duplication is caused by.  Have
>>> a look in the _slony tables and check to see what's in there.  Where's
>>> the collision?
>>>
>>>   
>> I've seen this issue recently when the initial sync fails.  If you 
>> scroll further back in your logs, do you have a failure for the 
>> initial copy_set?  When this happens to me, Slony seems to leave the 
>> slave DB in a half-replicated state; it then reattempts the initial 
>> sync, finds that the sequences are already in the _cluster.sl_sequence 
>> table, and errors out.  This requires dropping the node and starting 
>> over.  This is with version 1.2.16.  I recall previous versions being 
>> able to recover from a failed initial sync without intervention, but 
>> my memory could be mistaken.
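>>
>> For what it's worth, a quick way to confirm the leftover state is to 
>> look at the sl_sequence catalog on the subscriber, roughly like this 
>> ("_cluster" is just a placeholder for the real cluster schema, e.g. 
>> _engage_cluster in my case):
>>
>>   -- Run on the subscriber; any rows already present for the set being
>>   -- copied are what setAddSequence_int() later trips over.
>>   SELECT *
>>     FROM _cluster.sl_sequence
>>    ORDER BY seq_id;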
> In fact, here's how it looks in my logs:
>
> Jun  2 13:09:36 localhost slon[1867]: [274-1] 2009-06-02 13:09:36 PDT 
> ERROR  remoteWorkerThread_1: "select 
> "_engage_cluster".tableHasSerialKey('"archive"."invitation"');"
> Jun  2 13:09:36 localhost slon[1867]: [274-2]  could not receive data 
> from server: Connection timed out
> Jun  2 13:09:36 localhost slon[1867]: [275-1] 2009-06-02 13:09:36 PDT 
> WARN   remoteWorkerThread_1: data copy for set 1 failed - sleep 30 seconds
> Jun  2 13:09:36 localhost postgres[1880]: [26-1] NOTICE:  there is no 
> transaction in progress
> Jun  2 13:10:06 localhost slon[1867]: [276-1] 2009-06-02 13:10:06 PDT 
> DEBUG1 copy_set 1
> Jun  2 13:10:06 localhost slon[1867]: [277-1] 2009-06-02 13:10:06 PDT 
> DEBUG1 remoteWorkerThread_1: connected to provider DB
> Jun  2 13:10:09 localhost slon[1867]: [278-1] 2009-06-02 13:10:09 PDT 
> ERROR  remoteWorkerThread_1: "select 
> "_engage_cluster".setAddSequence_int(1, 4,
> Jun  2 13:10:09 localhost slon[1867]: [278-2]  
> '"public"."tracking_sequence"', 'public.tracking_sequence sequence')" 
> PGRES_FATAL_ERROR ERROR:  Slony-I: setAddSequence_int():
> Jun  2 13:10:09 localhost slon[1867]: [278-3]  sequence ID 4 has 
> already been assigned
> Jun  2 13:10:09 localhost slon[1867]: [279-1] 2009-06-02 13:10:09 PDT 
> WARN   remoteWorkerThread_1: data copy for set 1 failed - sleep 60 seconds
>
> The DB in question is 144GB and it's being replicated over a 
> relatively slow link.  It seems to do about 1GB/hr, but never gets 
> past 10GB.  It always dies at that same point. 
> -- 
> Jeff Frost, Owner 	<jeff at frostconsultingllc.com>
> Frost Consulting, LLC 	http://www.frostconsultingllc.com/
> Phone: 916-647-6411	FAX: 916-405-4032
>   
