[Slony1-general] Slony stops replicating during nightly periodic + small patch

Tue Oct 12 19:04:05 PDT 2004

On 10/12/2004 12:57 PM, Jacques Caron wrote:

> Hi,
> 
> First of all, many thanks for the great work on slony!
> 
> I use slony 1.0.2 to replicate two PostgreSQL 7.4.3 databases running on 
> FreeBSD 5.2.1-p9, and see that slony stops replicating every night (with a 
> couple of minor exceptions) during the periodic process that does the 
> backups, vacuuming, etc. I use the standard 502.pgsql script that comes with 
> the postgresql port on FreeBSD (not quite sure whether it's part of the port 
> or the original source tree of PostgreSQL), which basically does a pg_dump 
> and a vacuum analyze.
> 
> Every night, I get this on stdout from slon:
> ERROR  remoteListenThread_1: timeout for event selection
> And this on stderr:
> sched_mainloop: select(): Bad file descriptor

This problem will be fixed in 1.0.3. The function that stores the events 
tries to take too strong a lock on the sl_event table, so it has to wait 
for the pg_dump to finish, and everyone else using that table then waits 
behind it ... and so the whole thing goes kablooy.

Jan

> 
> Setting debug level to 4 does not give much more information, just says 
> after the timeout that the remoteListenThread is done.
> 
> Trying to figure out the whole scheduling mechanism, I found this little 
> issue: in scheduler.c, a temporary copy of the fdsets for select is made 
> first, and then some checks are done to remove some FDs which may not be 
> needed any more from the global fdsets. I believe this must be an 
> oversight, and is the reason for the select error, which in turn sets 
> sched_status to an error value, and causes sched_msleep to return with an 
> error value and the remote listener thread to stop.
> 
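The ordering problem described above can be reproduced outside slon with a
small sketch. This is not code from scheduler.c: every demo_* name is made up
for illustration, a pipe stands in for a database connection, and it assumes
the descriptor is closed at about the time it is dropped from the global set.
It only shows that a working copy taken before the cleanup still names the
dead descriptor, so select(2) fails with EBADF, while a copy taken after the
cleanup does not.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-ins for the scheduler's global read set and fd count. */
static fd_set demo_fdset_read;
static int    demo_numfd = 0;

static void demo_add_fd(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_SET(fd, &demo_fdset_read);
    if (fd >= demo_numfd)
        demo_numfd = fd + 1;
}

static void demo_remove_fd(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_CLR(fd, &demo_fdset_read);
}

/* Build the working copy for select(2), like the loop moved by the patch. */
static void demo_copy_fds(fd_set *dst)
{
    int i;

    FD_ZERO(dst);
    for (i = 0; i < demo_numfd; i++)
        if (FD_ISSET(i, &demo_fdset_read))
            FD_SET(i, dst);
}

/* Non-blocking select(2); returns 0 on success, else the errno it set. */
static int demo_poll(fd_set *rfds)
{
    struct timeval tv = {0, 0};

    if (select(demo_numfd, rfds, NULL, NULL, &tv) < 0)
        return errno;
    return 0;
}

/*
 * copy_late == 0: copy the set, then drop the connection (old order).
 * copy_late != 0: drop the connection, then copy the set (patched order).
 * Returns 0, the errno from select(2), or -1 if the pipe cannot be created.
 */
int demo_run(int copy_late)
{
    int    pfd[2];
    fd_set rfds;
    int    rc;

    FD_ZERO(&demo_fdset_read);
    demo_numfd = 0;
    if (pipe(pfd) != 0)
        return -1;
    demo_add_fd(pfd[0]);

    if (!copy_late)
        demo_copy_fds(&rfds);

    /* Stand-in for the timeout check: the connection goes away here. */
    demo_remove_fd(pfd[0]);
    close(pfd[0]);

    if (copy_late)
        demo_copy_fds(&rfds);

    rc = demo_poll(&rfds);
    close(pfd[1]);
    return rc;
}
```

Calling demo_run(0) returns EBADF, the "Bad file descriptor" seen on stderr,
and demo_run(1) returns 0, because the copy no longer contains the closed
descriptor.
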
> I moved the copy further down (just before the select) and last night slony 
> did not stop replicating even though it logged several of the "timeout for 
> event selection" errors. Probably should wait a couple more periodic runs 
> to claim victory, but I believe the patch should at the very least not 
> cause any problems and solve a few, so here it is (including a couple of 
> typo fixes):
> 
> %diff -u scheduler.c.orig scheduler.c
> --- scheduler.c.orig    Mon Oct 11 17:00:30 2004
> +++ scheduler.c Tue Oct 12 18:54:09 2004
> @@ -452,21 +452,8 @@
>                  struct timeval  timeout;
> 
>                  /*
> -                * Make copies of the file descriptor sets for select(2)
> -                */
> -               FD_ZERO(&rfds);
> -               FD_ZERO(&wfds);
> -               for (i = 0; i < sched_numfd; i++)
> -               {
> -                       if (FD_ISSET(i, &sched_fdset_read))
> -                               FD_SET(i, &rfds);
> -                       if (FD_ISSET(i, &sched_fdset_write))
> -                               FD_SET(i, &wfds);
> -               }
> -
> -               /*
>                   * Check if any of the connections in the wait queue
> -                * have reached there timeout. While doing so, we also
> +                * have reached their timeout. While doing so, we also
>                   * remember the closest timeout in the future.
>                   */
>                  tv = NULL;
> @@ -560,6 +547,19 @@
>                  }
> 
>                  /*
> +                * Make copies of the file descriptor sets for select(2)
> +                */
> +               FD_ZERO(&rfds);
> +               FD_ZERO(&wfds);
> +               for (i = 0; i < sched_numfd; i++)
> +               {
> +                       if (FD_ISSET(i, &sched_fdset_read))
> +                               FD_SET(i, &rfds);
> +                       if (FD_ISSET(i, &sched_fdset_write))
> +                               FD_SET(i, &wfds);
> +               }
> +
> +               /*
>                   * Do the select(2) while unlocking the master lock.
>                   */
>                  pthread_mutex_unlock(&sched_master_lock);
> @@ -776,7 +776,7 @@
> 
> 
>   /* ----------
> - * sched_add_fdset
> + * sched_remove_fdset
>    *
>    *     Remove a file descriptor from one of the global scheduler sets and
>    *     adjust sched_numfd accordingly.
> 
> Hope that helps,
> 
> Jacques.
> 
> 

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #