Tue Oct 12 19:04:05 PDT 2004
- Previous message: [Slony1-general] Slony stops replicating during nightly periodic + small patch
- Next message: [Slony1-general] .cleanuplistener() does not exist
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 10/12/2004 12:57 PM, Jacques Caron wrote:
> Hi,
>
> First of all, many thanks for the great work on slony!
>
> I use slony 1.0.2 to replicate two Postgresql 7.4.3 databases running on
> FreeBSD 5.2.1-p9, and see that slony stops replicating every night (with a
> couple minor exceptions) during the periodic process that does the backups,
> vacuuming, etc. I use the standard 502.pgsql script that comes with the
> postgresql port on FreeBSD (not quite sure whether it's part of the port or
> the original source tree of Postgresql), which basically does a pg_dump and
> a vacuum analyze.
>
> Every night, I get this on stdout from slon:
> ERROR remoteListenThread_1: timeout for event selection
> And this on stderr:
> sched_mainloop: select(): Bad file descriptor

This problem will be fixed in 1.0.3. The function storing the events is
trying to grab a too strong lock on the sl_event table, causing it to
wait for the pg_dump to finish, and everyone else using that table is
waiting for that one ... and so the whole thing goes kablooy.

Jan

>
> Setting debug level to 4 does not give much more information, just says
> after the timeout that the remoteListenThread is done.
>
> Trying to figure out the whole scheduling mechanism, I found this little
> issue: in scheduler.c, a temporary copy of the fdsets for select is made
> first, and then some checks are done to remove some FDs which may not be
> needed any more from the global fdsets. I believe this must be an
> oversight, and is the reason for the select error, which in turn sets
> sched_status to an error value, and causes sched_msleep to return with an
> error value and the remote listener thread to stop.
>
> I moved the copy further down (just before the select) and last night slony
> did not stop replicating even though it logged several of the "timeout for
> event selection" errors. Probably should wait a couple more periodic runs
> to claim victory, but I believe the patch should at the very least not
> cause any problems and solve a few, so here it is (including a couple of
> typo fixes):
>
> %diff -u scheduler.c.orig scheduler.c
> --- scheduler.c.orig Mon Oct 11 17:00:30 2004
> +++ scheduler.c Tue Oct 12 18:54:09 2004
> @@ -452,21 +452,8 @@
>  	struct timeval timeout;
>  
>  	/*
> -	 * Make copies of the file descriptor sets for select(2)
> -	 */
> -	FD_ZERO(&rfds);
> -	FD_ZERO(&wfds);
> -	for (i = 0; i < sched_numfd; i++)
> -	{
> -		if (FD_ISSET(i, &sched_fdset_read))
> -			FD_SET(i, &rfds);
> -		if (FD_ISSET(i, &sched_fdset_write))
> -			FD_SET(i, &wfds);
> -	}
> -
> -	/*
>  	 * Check if any of the connections in the wait queue
> -	 * have reached there timeout. While doing so, we also
> +	 * have reached their timeout. While doing so, we also
>  	 * remember the closest timeout in the future.
>  	 */
>  	tv = NULL;
> @@ -560,6 +547,19 @@
>  	}
>  
>  	/*
> +	 * Make copies of the file descriptor sets for select(2)
> +	 */
> +	FD_ZERO(&rfds);
> +	FD_ZERO(&wfds);
> +	for (i = 0; i < sched_numfd; i++)
> +	{
> +		if (FD_ISSET(i, &sched_fdset_read))
> +			FD_SET(i, &rfds);
> +		if (FD_ISSET(i, &sched_fdset_write))
> +			FD_SET(i, &wfds);
> +	}
> +
> +	/*
>  	 * Do the select(2) while unlocking the master lock.
>  	 */
>  	pthread_mutex_unlock(&sched_master_lock);
> @@ -776,7 +776,7 @@
>  
>  
>  /* ----------
> - * sched_add_fdset
> + * sched_remove_fdset
>   *
>   * Remove a file descriptor from one of the global scheduler sets and
>   * adjust sched_numfd accordingly.
>
> Hope that helps,
>
> Jacques.
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at gborg.postgresql.org
> http://gborg.postgresql.org/mailman/listinfo/slony1-general

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #
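
The ordering issue Jacques describes in scheduler.c can be illustrated with a small standalone sketch. This is not Slony's code: global_read, global_numfd, snapshot() and the two stale_*() functions are hypothetical names made up for this illustration, and only the read set is modelled. It shows that a copy of an fd_set taken before a descriptor is removed from the global set still contains that descriptor, while a copy taken afterwards does not.

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical stand-ins for the scheduler's global read set and fd count. */
static fd_set global_read;
static int    global_numfd;

/* Copy every descriptor below numfd that is set in src into dst. */
static void snapshot(fd_set *dst, fd_set *src, int numfd)
{
    int i;

    FD_ZERO(dst);
    for (i = 0; i < numfd; i++)
        if (FD_ISSET(i, src))
            FD_SET(i, dst);
}

/* Register fd as the only descriptor in the hypothetical global set. */
static void reset_globals(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&global_read);
    FD_SET(fd, &global_read);
    global_numfd = fd + 1;
}

/*
 * Order before the patch: copy first, then remove fd from the global set.
 * Returns 1 if fd is still in the copy that would go to select(2).
 */
int stale_after_copy_first(int fd)
{
    fd_set rfds;

    reset_globals(fd);
    snapshot(&rfds, &global_read, global_numfd);
    FD_CLR(fd, &global_read);
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/*
 * Order after the patch: remove fd from the global set, then copy.
 * Returns 1 if fd is still in the copy that would go to select(2).
 */
int stale_after_prune_first(int fd)
{
    fd_set rfds;

    reset_globals(fd);
    FD_CLR(fd, &global_read);
    snapshot(&rfds, &global_read, global_numfd);
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

For a valid descriptor number such as 5, stale_after_copy_first(5) returns 1 and stale_after_prune_first(5) returns 0. If a descriptor left behind in the copy has been closed by the time select(2) runs, select(2) fails with EBADF, which is reported as "Bad file descriptor".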