Wed Apr 19 15:14:07 PDT 2006
- Previous message: [Slony1-general] Postmaster restart breaks slony
- Next message: [Slony1-general] Postmaster restart breaks slony
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Ian Burrell wrote:
> We ran into a problem with the postmaster restarting database
> connections causing replication to break. A segfault in one of the
> backends running a custom C aggregate caused the postmaster to restart
> all of the other backend processes. This broke all of the existing
> connections to the database. The node1 slon died. For some reason the
> slon_watchdog.pl script also died but that is a different problem.
> The slave slon daemons both reconnected the remoteListenThread
> connection. But the remoteWorkerThread connection for transferring
> data was unused and left broken.
>
> We didn't notice the node1 slon was down until a few hours later. I
> started the node1 slon daemon which inserted SYNC events. The slave
> slon daemon then started processing the SYNC events and trying to
> transfer data. Since the data connection had failed and the slave
> slon daemons started failing. I noticed the error and restart all the
> slon daemons which fixed the problem.
>
> Shouldn't the slon daemons reconnect if the remoteWorkerThread
> connection goes down? Even dying and being restarted would be better
> than continuously failing in a loop. We are using 1.1.0 with most of
> the 1.1.1 patches. Has this problem been fixed in 1.1.5?
>

Unfortunately, the real fix to this is in CVS HEAD/1.2, which fairly
significantly restructures the thread handling, needful for Windows
support...

I wish there were a better answer; unfortunately, the "sorta-internal
watchdog" that was added in 1.1.0 leaves something to be desired. The
fix is the "big fix," which is what's in CVS HEAD.

All I can recommend for now is to make sure that there's a watchdog
running.

For 1.2, I've got what I think is a "better mousetrap" for watchdogging.
And nothing about it is really 1.2-specific; I'm running it now...

You basically write slon .conf files into a well-defined directory
structure, and those .conf files have the respective slons write .pid
files nearby.

A "watchdog" basically goes in every so often and checks to see if
there's a slon running for each of the .conf/.pid combinations,
restarting whatever falls over. It's a *real* simple shell script.

http://gborg.postgresql.org/cgi-bin/cvsweb.cgi/slony1-engine/tools/launch_clusters.sh?cvsroot=slony1

The script is only a page long, and can manage a whole bunch of Slony-I
clusters all at once...
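To make the .conf/.pid idea concrete, here is a minimal Python sketch of one watchdog pass. It is not launch_clusters.sh and does not reproduce its layout: the assumption that each `name.conf` has its PID file at `name.pid` in the same directory, and the function names `pid_is_alive`, `confs_needing_restart`, and `watchdog_pass`, are all hypothetical and chosen only for illustration. The default launcher assumes `slon` is on the PATH and is started with `-f <conf file>`.

```python
import os
import subprocess
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def confs_needing_restart(root):
    """List .conf files under root whose sibling .pid file is missing,
    unreadable, or names a process that is no longer running.

    Assumption: foo.conf pairs with foo.pid in the same directory.
    """
    dead = []
    for conf in sorted(Path(root).rglob("*.conf")):
        pid_file = conf.with_suffix(".pid")
        try:
            pid = int(pid_file.read_text().strip())
        except (OSError, ValueError):
            dead.append(conf)  # no usable PID file: treat the slon as down
            continue
        if not pid_is_alive(pid):
            dead.append(conf)
    return dead


def watchdog_pass(root, launch=None):
    """Run one check and relaunch a slon for every dead .conf found."""
    if launch is None:
        # Assumed invocation: slon reading its settings from the .conf file.
        def launch(conf):
            subprocess.Popen(["slon", "-f", str(conf)])
    restarted = confs_needing_restart(root)
    for conf in restarted:
        launch(conf)
    return restarted
```

Calling `watchdog_pass("/path/to/conf/root")` from cron every few minutes would approximate the "goes in every so often" behaviour described above; the `launch` parameter lets the restart command be swapped out or stubbed for a dry run.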