[Slony1-general] [Slony 2.0.1] Events and Confirmations not propagating

Wed Apr 29 11:58:29 PDT 2009

Any help would be great ...

Cyril Scetbon wrote:
> I'm still investigating ....
>
> The slon restart does not resolve the error. After the restart I have 
> something like :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
>        date_trunc('minutes', min(now() - ev_timestamp)),
>        date_trunc('minutes', max(now() - ev_timestamp)),
>        min(now() - ev_timestamp) > '00:30:00' as agehi
>    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
>       103 |      65 |      66 | 00:01:00   | 00:47:00   | f
>       104 |      60 |      61 | 00:01:00   | 00:47:00   | f
>       102 |      33 |      34 | 00:01:00   | 00:47:00   | f
>       101 | 4541581 | 4542350 | 00:00:00   | 00:13:00   | f
> (4 rows)
>
> So, another event is generated (for example, 66 from node 103)
>
> log extract :
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 33 
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_102: no sets need 
> syncing for this event
> 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 
> 102,33 SYNC
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 102,33 
> ignored - duplicate
> 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 
> 104,60 SYNC
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 104,60 
> ignored - duplicate
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_104: SYNC 60 
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_104: no sets need 
> syncing for this event
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_103: SYNC 65 
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_103: no sets need 
> syncing for this event
>
> after a few minutes I have :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
>        date_trunc('minutes', min(now() - ev_timestamp)),
>        date_trunc('minutes', max(now() - ev_timestamp)),
>        min(now() - ev_timestamp) > '00:30:00' as agehi
>    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
>       103 |      66 |      66 | 00:12:00   | 00:12:00   | f
>       104 |      61 |      61 | 00:12:00   | 00:12:00   | f
>       102 |      34 |      34 | 00:12:00   | 00:12:00   | f
>       101 | 4542310 | 4543008 | 00:00:00   | 00:11:00   | f
>
> log extract :
> 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_103: SYNC 66 
> processing
> 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_103: no sets need 
> syncing for this event
> 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_104: SYNC 61 
> processing
> 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_104: no sets need 
> syncing for this event
> 2009-04-28 16:37:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 34 
> processing
> 2009-04-28 16:37:53 CEST DEBUG1 remoteWorkerThread_102: no sets need 
> syncing for this event
>
> Precedent events (65,60 and 33) have been cleared, and a new one per 
> node has been generated. But, the error continues and after 30 minutes 
> I have :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
>        date_trunc('minutes', min(now() - ev_timestamp)),
>        date_trunc('minutes', max(now() - ev_timestamp)),
>        min(now() - ev_timestamp) > '00:30:00' as agehi
>    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
>       103 |      66 |      66 | 00:41:00   | 00:41:00   | t
>       104 |      61 |      61 | 00:41:00   | 00:41:00   | t
>       102 |      34 |      34 | 00:41:00   | 00:41:00   | t
>       101 | 4543634 | 4544715 | 00:00:00   | 00:18:00   | f
>
> In Summary :
>
> Events are not correctly propagated from nodes 102, 103 and 104. If I 
> restart every slons processes new events are generated on these nodes 
> but still not correctly propagated. However the last event on each 
> node seems to be propagated thanks to the restart stage.
>
> Any idea ?
>
> Cyril Scetbon wrote:
>> Hi,
>>
>> I found in slony log files that the worker has restarted the process 
>> on each host in my configuration (1 provider, 3 receivers) at the 
>> same time :
>>
>> 2009-04-28 08:47:37 CEST INFO   remoteWorkerThread_101: SYNC 4516360 
>> sync_event timing:  pqexec (s/count)- provider 0.001/1 - subscriber 
>> 0.035/1 - IUD 0.000/2
>> 2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
>> ev_origin, ev_seqno, ev_timestamp,        ev_snapshot,        
>> "pg_catalog".txid_snapshot_xmin(ev_snapshot),        
>> "pg_catalog".txid_snapshot_xmax(ev_snapshot),        ev_type,        
>> ev_data1, ev_data2,        ev_data3, ev_data4,        ev_data5,
>> ev_data6,        ev_data7, ev_data8 from 
>> "_pns_slony_voila_archi_0".sl_event e where (e.ev_origin = '102' and 
>> e.ev_seqno > '29') or (e.ev_origin = '103' and e.ev_seqno > '62') 
>> order by e.ev_origin, e.ev_seqno limit 40" - FATAL:  terminating 
>> connection due to administrator command
>> 2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
>> con_origin, con_received,     max(con_seqno) as con_seqno,     
>> max(con_timestamp) as con_timestamp from 
>> "_pns_slony_voila_archi_0".sl_confirm where con_received <> 104 group 
>> by con_origin, con_received" 2009-04-28 08:47:38 CEST CONFIG slon: 
>> child terminated status: 11; pid: 28152, current worker pid: 28152
>> 2009-04-28 08:47:38 CEST CONFIG slon: restart of worker in 10 seconds
>>
>> The origin of the first error is that postgresql has been restarted 
>> on one receiver without stopping slon before. Is it a known source of 
>> errors ?
>> The slon process has been terminated with status=0 on the receiver 
>> where postgresql has been restarted and has segfault (chid status=11) 
>> on the others (each watchdog has restart a new slon process). This 
>> fact seems to cause the error with events and confirmations.
>>
>> Regards.
>>
>> Cyril Scetbon wrote:
>>> A simple restart of all slon processes seems to have resolved the 
>>> issue. weird ....
>>>
>>> Cyril Scetbon wrote:
>>>> I use test_slony_state and see some informations like :
>>>>
>>>> Check of event info
>>>> ---------------------------------------------------
>>>> Problem : Events not propagating to node 2
>>>> Problem : Events not propagating to node 4
>>>> Problem : Events not propagating to node 3
>>>>
>>>> Check of sl_confirm aging
>>>> ---------------------------------------------------
>>>> Confirmations not propagating from 2 to 1
>>>> Confirmations not propagating from 2 to 3
>>>> Confirmations not propagating from 2 to 4
>>>> Confirmations not propagating from 3 to 1
>>>> Confirmations not propagating from 3 to 2
>>>> Confirmations not propagating from 3 to 4
>>>> Confirmations not propagating from 4 to 1
>>>> Confirmations not propagating from 4 to 2
>>>> Confirmations not propagating from 4 to 3
>>>>
>>>> You can see the results on one of my databases :
>>>>
>>>> - for Confirmations
>>>>
>>>> select con_origin, con_received, min(con_seqno) as minseq, 
>>>> max(con_seqno) as maxseq, date_trunc('minutes', 
>>>> min(now()-con_timestamp)) as age1, date_trunc('minutes', 
>>>> max(now()-con_timestamp)) as age2, min(now() - con_timestamp) > 
>>>> '00:30:00' as tooold from _pns_slony_voila_archi_0.sl_confirm group 
>>>> by con_origin, con_received order by con_origin, con_received;
>>>> con_origin | con_received | minseq  | maxseq  |   age1   |   age2   
>>>> | tooold 
>>>> ------------+--------------+---------+---------+----------+----------+-------- 
>>>>
>>>>        101 |          102 | 4464029 | 4464792 | 00:00:00 | 00:16:00 
>>>> | f
>>>>        101 |          103 | 4464027 | 4464792 | 00:00:00 | 00:16:00 
>>>> | f
>>>>        101 |          104 | 4464024 | 4464792 | 00:00:00 | 00:16:00 
>>>> | f
>>>>        102 |          101 |      29 |      29 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        102 |          103 |      29 |      29 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        102 |          104 |      29 |      29 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        103 |          101 |      62 |      62 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        103 |          102 |      62 |      62 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        103 |          104 |      62 |      62 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        104 |          101 |      57 |      57 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        104 |          102 |      57 |      57 | 03:39:00 | 03:39:00 
>>>> | t
>>>>        104 |          103 |      57 |      57 | 03:39:00 | 03:39:00 
>>>> | t
>>>>
>>>>
>>>> - for Events
>>>>
>>>> select ev_origin, min(ev_seqno), max(ev_seqno),
>>>>         date_trunc('minutes', min(now() - ev_timestamp)),
>>>>         date_trunc('minutes', max(now() - ev_timestamp)),
>>>>         min(now() - ev_timestamp) > '00:30:00' as agehi
>>>>     from _pns_slony_voila_archi_0.sl_event group by ev_origin;
>>>> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
>>>> -----------+---------+---------+------------+------------+-------
>>>>       103 |      62 |      62 | 03:49:00   | 03:49:00   | t
>>>>       104 |      57 |      57 | 03:49:00   | 03:49:00   | t
>>>>       102 |      29 |      29 | 03:49:00   | 03:49:00   | t
>>>>       101 | 4464493 | 4465346 | 00:00:00   | 00:14:00   | f
>>>>
>>>>
>>>> What can be the source of these errors and how to track them ? FYI, 
>>>> I have logs but in debug level 1.
>>>>
>>>> I did not have issues in 1.2.15.
>>>>
>>>> Regards.
>>>
>>
>

-- 
Cyril SCETBON