[Slony1-general] Leaking file descriptor in 1.2

Fri Dec 1 08:45:23 PST 2006

Jan Wieck wrote:
> On 11/30/2006 3:07 PM, Christopher Browne wrote:
>> Niels Breet wrote:
>>> I had this problem:
>>> 2006-11-29 20:51:41 CET FATAL  slon: sched_wakeuppipe create failed -(24) Too many open files
>>>
>>> When slon is started and the local database is down, slon now tries
>>> to reconnect. Before 1.2 we would bail out, but now we restart the
>>> thread. slon_terminate_worker() doesn't close the currently opened
>>> sched_wakeuppipe, so after a few loops we have too many open files.
>>>
>>> I just added this to the end of slon_terminate_worker():
>>>         close(sched_wakeuppipe[0]);
>>>         close(sched_wakeuppipe[1]);
>>>
>>> I'm sure there is a better solution, but that solves the problem.
>>>
>>>   
>> That doesn't look like a half bad answer, all by itself.  I'm taking a
>> look at how those objects (some are pipes) get opened to see if there's
>> some more general approach.  But that looks like a good common place
>> where they do need closing.
>
> It looks like a perfectly fine solution to me. Fact is that file
> descriptors of any kind (pipe, socket or whatever) are objects on the
> process level. There is no mechanism to tell the OS to clean anything
> up automagically on thread termination. So one can either close them
> explicitly or attempt to reuse the existing pipe. I am in favor of
> closing and a fresh start.
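
For illustration, here is a minimal sketch of the "close explicitly and start fresh" approach described above. It is not slon's actual code: wakeup_pipe, worker_start(), worker_terminate() and restart_cycles() are hypothetical stand-ins, and only pipe() and close() are real POSIX calls.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical stand-in for slon's scheduler wakeup pipe; -1 means "not open". */
static int wakeup_pipe[2] = { -1, -1 };

/* Create a fresh pipe when the worker starts. Returns 0 on success, -1 on failure. */
int worker_start(void)
{
    /* A still-open pipe here would mean the previous run leaked descriptors. */
    assert(wakeup_pipe[0] == -1 && wakeup_pipe[1] == -1);
    return pipe(wakeup_pipe);
}

/* Close both ends on termination; the OS will not do this when a thread exits. */
void worker_terminate(void)
{
    for (int i = 0; i < 2; i++) {
        if (wakeup_pipe[i] >= 0) {
            close(wakeup_pipe[i]);
            wakeup_pipe[i] = -1;
        }
    }
}

/* Simulate n start/terminate cycles; returns how many cycles succeeded. */
int restart_cycles(int n)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        if (worker_start() != 0)
            break;              /* pipe() failed, e.g. errno 24 (EMFILE) */
        worker_terminate();
        ok++;
    }
    return ok;
}
```

Because every cycle closes what it opened, restart_cycles() can run far past the per-process descriptor limit. Without the close() calls, each cycle would leak two descriptors until pipe() failed with "Too many open files".
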
There's still an issue here even once we fix this: there are now clearly
some cases where the slon ought to be sleeping 10 seconds (Jan, you
certainly recall that change) but isn't.

Based on Niels' logs:

2006-11-29 20:51:41 CET CONFIG main: slon version 1.2.1 starting up
2006-11-29 20:51:41 CET DEBUG2 slon: watchdog process started
2006-11-29 20:51:41 CET DEBUG2 slon: watchdog ready - pid = 9712
2006-11-29 20:51:41 CET FATAL  main: Cannot connect to local database - could not connect to server: Connection refused
        Is the server running on host "localhost" and accepting
        TCP/IP connections on port 3000?

This seems to me to be a good case for the restart/retry to sleep a few seconds so that this doesn't turn into some "thundering herd" of continual connection requests.

Jan's recent change took out the 10s sleeps for when "child" threads have unimportant failures. I'm going to add 10s sleeps for the evident converse case, when the main thread encounters non-fatal exceptions in the main loop, so that the slon doesn't try to reconnect as many times as it can per second while the node it wishes to serve is inaccessible.
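
As a rough sketch of the idea, and not the actual patch, a retry loop with a pause between failed attempts could look like this. connect_fn, connect_with_retry() and fail_n_times() are hypothetical names made up for the example; in slon the delay would be the 10 seconds mentioned above.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical callback type: returns 0 once the connection succeeds. */
typedef int (*connect_fn)(void *ctx);

/*
 * Try to connect up to max_attempts times, sleeping delay_sec seconds
 * after each failure so retries do not hammer an unreachable server.
 * Returns the number of attempts used on success, -1 if all failed.
 */
int connect_with_retry(connect_fn try_connect, void *ctx,
                       int max_attempts, unsigned int delay_sec)
{
    assert(try_connect != NULL);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_connect(ctx) == 0)
            return attempt;
        if (attempt < max_attempts)
            sleep(delay_sec);   /* back off before the next attempt */
    }
    return -1;
}

/* Demo stub: fails until the counter pointed to by ctx reaches zero. */
int fail_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return -1;
    }
    return 0;
}
```

With a delay of 10 and a database that stays down, this makes at most one connection attempt every 10 seconds instead of as many as the loop can manage per second.
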