Christopher Browne cbbrowne at ca.afilias.info
Wed Apr 30 14:31:08 PDT 2008
Jacques Caron <jc at oxado.com> writes:
> Hi,
>
> At 17:54 24/04/2008, Christopher Browne wrote:
>>I'm not sure we want to stop queueing events altogether for any
>>extended period of time.  *That* seems like a risky thing to do.
>
> I'm not quite sure why? I'd rather have them stay untouched in the DB
> rather than have slon grow (potentially a lot) for no good reason.

After a chat with Jan, I have to back away from suggesting it's
particularly risky.  It shouldn't be.

>>- If the problem is that there is a backlog of SYNCs that *will* need
>>   to be processed, then metering them in, via "LIMIT N" + some delays,
>>   should prevent the slon from blowing up.  If it *does* blow up, it'll
>>   restart, *hopefully* after getting some work done.
>
> It does indeed, but in the meantime you have requested lots of events
> that ended up not being used (and which you will fetch again on the
> next run), you have used memory that could be more useful as OS cache,
> and in many cases you actually end up stopping slon quite abruptly
> while it's fetching data, with postgres continuing to work on that
> fetch while you start a new one. And really the "oh anyway it will
> crash and restart" approach gives me goose bumps for something related
> to DB replication!

That *seems* fair.

My counterargument is that it looks as though implementing this
throttling may be done in one of two ways:

  a) In the exact way you suggest, which seems as though it would
     require quite a lot of code change, as collecting the necessary
     data would require touching quite a lot of code.

  b) If, instead, we use an inexact heuristic, then we have the
     benefit of being able to localize the changes to a small portion of
     program logic.

It looks as though collecting all the data needed for this would be
quite intrusive, and I'm loath to add the complication for something
that people aren't complaining about, when a simpler heuristic might
well be good enough.

>>The alternative solution is to do "strict metering" where we don't
>>allow the queue to grow past some pre-defined size.  But I'm not sure
>>what that size should be.
>
> n x sync_group_maxsize? With n somewhere between 2 and 10, I'd say.

Ah, yes, that seems pretty plausible.  That cuts down on the need for
configuration, as the existing config (sync_group_maxsize) can indeed
reasonably imply this already.

[Rummaging through code...]

Hmm.  It'll be a bit of a pain to implement this in any strict
fashion, as the queues are maintained on a per-node basis, and I don't
see too terribly much value in being really strict about exact
handling (e.g. - to the point of walking thru *all* the nodes'
configuration to analyze this).

Simpler seems better, particularly in that this will (regardless) have
some complicating effects on the code base.  If I can keep the logic
in one spot, that will make this change *way* more maintainable.

Suggestion:

- With 2 variables, I can store the number of event rows pulled for
  this and the last iteration, last_events, and current_events

- We set the query to do "limit (2*sync_group_maxsize)"

- If (last_events + current_events) >= 4*sync_group_maxsize, then we
  sleep for a configurable period, *and* drop back into LISTENing mode.

That allows all of this logic to take place at the end of
remoteListen_receive_events().

It means that the slon never completely ceases to add events to the
queue, but:

 a) The events aren't enormously big, so we should run out of memory
    for other reasons long before event bloat becomes a problem;

 b) If we decelerate it sufficiently, that should be helpful enough.

I'll see about a patch for this.

>>Ah, you could be right there.  Yes, it may be that the "time for first
>>fetch" is nearly constant, and so should be taken out of the estimate.
>
> It's at least somewhat constant for periods of time, when the index
> isn't selective enough and the initial fetch needs a lot of work. Over
> longer periods it does vary quite significantly.
>
>>Mind you, we may be "gilding buggy whips" here; trying to improve an
>>estimate that is fundamentally flawed.  There is the fundamental flaw
>>that there is no real reason to expect two SYNCs to be equally
>>expensive, if there is variation in system load.
>
> Over consecutive runs I would expect them to be quite consistent,
> there would just be an issue at the point where the load changes
> (start or end of a batch job, etc.). Obviously for a system with lots
> of short spikes and low values of desired_sync_time it would not make
> much sense, but then I'm not sure the desired_sync_time would make
> much sense either?

The trouble with changing this to be terribly much more intelligent,
much like the listener case, is that the more sophisticated we get
about this, the more intrusive the code needs to be.

And I'm not just meaning in a "reluctance to alter existing code"
sense.  If it requires a LOT of additional instrumentation, and code
strewn around everywhere, that makes Slony-I harder to maintain, which
is no good thing.
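That said, the narrow adjustment mentioned earlier (taking the
near-constant first-fetch time out of the estimate) could stay quite
localized.  A hedged sketch, using illustrative names rather than
slon's actual variables, though desired_sync_time and
sync_group_maxsize are real config options:

```c
/* Estimate per-SYNC cost, subtracting a (roughly constant)
 * first-fetch overhead first.  Illustrative only, not slon code. */
static double per_sync_estimate(double total_ms,
                                double first_fetch_ms,
                                int syncs_processed)
{
    double work_ms;

    if (syncs_processed <= 0)
        return 0.0;
    work_ms = total_ms - first_fetch_ms;
    if (work_ms < 0.0)
        work_ms = 0.0;
    return work_ms / syncs_processed;
}

/* Pick the next group size aimed at desired_sync_time, clamped to
 * [1, sync_group_maxsize]; with no estimate yet, use the max. */
static int next_group_size(double desired_ms,
                           double est_ms_per_sync,
                           int sync_group_maxsize)
{
    int n;

    if (est_ms_per_sync <= 0.0)
        return sync_group_maxsize;
    n = (int)(desired_ms / est_ms_per_sync);
    if (n < 1)
        n = 1;
    if (n > sync_group_maxsize)
        n = sync_group_maxsize;
    return n;
}
```

As noted, whether this is worth doing depends on how much
instrumentation is needed to measure the first-fetch time in the
first place.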
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://linuxfinances.info/info/nonrdbms.html
"...In my phone conversation with Microsoft's lawyer I copped to the
fact that just maybe his client might see me as having been in the
past just a bit critical of their products and business
practices. This was too bad, he said with a sigh, because they were
having a very hard time finding a reporter who both knew the industry
well enough to be called an expert and who hadn't written a negative
article about Microsoft." -- Robert X. Cringely

