[Slony1-general] Replication stopping

Fri Sep 14 03:23:52 PDT 2007

Hello, I am having problems with the stability of Slony-I (version
1.2.6).  I have a simple setup with 1 master and 1 slave database.
Both are running on 2 GHz, SuSE 9.2 Linux servers connected directly via
an ethernet cable.  I'm also running High-Availability Linux which I'm
using to manage the virtual database IP addresses and handle
network/machine failure events.

The test I'm doing is writing a UNIX timestamp to the primary database
and checking that both databases are updated with the timestamp.  This
runs fine for a number of hours, then a number of problems occur
(sometimes independently):
1) the slave database is no longer updated with the timestamp (the
master is updated)
2) the database primary role changes (master becomes slave) but
according to HA Linux no failure has occurred.
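
A minimal sketch of this kind of check, in Python.  This is not the
actual test script: the function name, the three callables and the
timeout values are all hypothetical, and the callables stand in for
whatever code writes to and reads from the two databases.

```python
import time
from typing import Callable, Optional


def check_replication(
    write_primary: Callable[[int], None],
    read_primary: Callable[[], Optional[int]],
    read_replica: Callable[[], Optional[int]],
    timeout: float = 10.0,
    poll_interval: float = 0.5,
) -> str:
    """Write a UNIX timestamp to the primary and wait for it on the replica.

    Returns "ok", "primary_not_updated" or "replica_not_updated".
    All three callables are hypothetical placeholders for real database access.
    """
    stamp = int(time.time())
    write_primary(stamp)

    # The primary should show the new value immediately.
    if read_primary() != stamp:
        return "primary_not_updated"

    # Replication is asynchronous, so poll the replica until the deadline.
    deadline = time.monotonic() + timeout
    while True:
        if read_replica() == stamp:
            return "ok"
        if time.monotonic() >= deadline:
            return "replica_not_updated"
        time.sleep(poll_interval)
```

Problem 1) above corresponds to the "replica_not_updated" result:
the write reaches the master, but the slave never catches up within
the timeout.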

From the Slony log file I can see some errors which occur every 30
seconds or so:

2007-09-12 13:39:15 GMT ERROR  remoteWorkerThread_1: "begin transaction;
set transaction isolation level serializable; lock table
"_t1".sl_config_lock; select "_t1".failoverSet_int(1, 2, 1, 10787);
notify "_t1_Event"; notify "_t1_Confirm"; insert into "_t1".sl_event
(ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip,
ev_type , ev_data1, ev_data2, ev_data3    ) values ('1', '10787',
'2007-09-12 07:34:50.791482', '9692768', '9692769', '', 'FAILOVER_SET',
'1', '2', '1'); insert into "_t1".sl_confirm      (con_origin,
con_received, con_seqno, con_timestamp)    values (1, 2, '10787',
now()); commit transaction;" PGRES_FATAL_ERROR ERROR:  duplicate key
violates unique constraint "pg_trigger_tgrelid_tgname_index"

From log file slon-smsdb-node2.err (where smsdb is the name of my
database)

WATCHDOG: No Slon is running for node node2!
WATCHDOG: You ought to check the postmaster and slon for evidence of a
crash!
WATCHDOG: I'm going to restart slon for node2...
WATCHDOG: Restarted slon for the t1 cluster, PID 3240

From the PostgreSQL log file

2007-09-13 04:16:53 LOG:  SSL SYSCALL error: EOF detected
2007-09-13 04:16:53 LOG:  could not receive data from client: Connection
reset by peer
2007-09-13 04:16:53 LOG:  unexpected EOF on client connection

So the questions I have:
1) Where (i.e. log files) can I find out more information about what's
happening?
2) If Slony-I fails and it looks like the watchdog cannot recover from
it, how can I restart it?
3) And of course, any ideas why Slony is failing?

Thank you for your help,
Slawek