[Slony1-general] Slonik Awareness of Cluster State (was Automatic WAIT FOR EVENT)

Fri Nov 19 01:28:21 PST 2010

On Thu, Nov 18, 2010 at 11:32 PM, Christopher Browne
<cbbrowne at ca.afilias.info> wrote:
> Vick Khera <vivek at khera.org> writes:
>> On Wed, Nov 17, 2010 at 5:06 PM, Christopher Browne
>> <cbbrowne at ca.afilias.info> wrote:
>>> For instance if it takes 20 minutes for SUBSCRIBE SET to complete, it's
>>> pretty likely that you want to wait for that to be complete before
>>> proceeding with other configuration that depends on it.
>>
>> None of my scripts ever have a WAIT in them.  I break up my scripts to
>> be pretty atomic, and I rarely do anything after the subscribe except
>> sit and wait hours for the copy to happen.
>>
>> The one thing I'd like to have is a warning when issuing shape
>> changing commands such as move set if the backlog of events is very
>> high (say, more than 5 or 10).  On occasion, I've forgotten to check
>> and issued the move when there was a 10 minute backlog.  I ended up
>> waiting that out, but it was tense. ;)

I certainly use a lot of WAIT and SYNC statements to ensure I don't
throw Slony into some sort of deadlock - one statement at a time,
please. These are generated scripts that run complex DDL and add or
remove things from replication, or canned processes that build and
drop replicas. When doing manual tasks I do things one statement at a
time and watch the logs. I'd certainly love some sort of mode where
the next statement is not started until the previous statement has
been completed on all nodes, to avoid all the WAIT and SYNC nonsense.

What bites us most often is statements that block on long-running
transactions. At the moment it is left to me to deduce why
something is blocked. I'd love to know what Slony is trying to do and,
if it is blocked, what it is waiting for.

> That suggests to me there being some value to having some sort of "fail
> if not sufficiently up to date" command.

Pretty much all my canned scripts start by issuing a sync and waiting
for it. I think it is dangerous to do otherwise. Failing with a
meaningful error rather than leaving sysadmins staring at a blinking
prompt would be preferable.
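
For illustration, the preamble looks roughly like the sketch below. It
is Python that only builds the slonik text; the function name
sync_and_wait_preamble and the node id are made up for this example,
and it assumes the node you pass in is the origin. The TIMEOUT option
on WAIT FOR EVENT is what makes slonik give up with an error instead
of waiting indefinitely.

```python
def sync_and_wait_preamble(origin_id: int, timeout_s: int = 600) -> str:
    """Build slonik statements that emit a SYNC on one node and then
    wait until every other node has confirmed it.

    origin_id and timeout_s are illustrative parameters, not anything
    slonik itself defines.
    """
    if origin_id < 1:
        raise ValueError("node ids are positive integers")
    if timeout_s < 0:
        raise ValueError("timeout must not be negative")
    return (
        f"sync (id = {origin_id});\n"
        f"wait for event (origin = {origin_id}, confirmed = all, "
        f"wait on = {origin_id}, timeout = {timeout_s});\n"
    )


if __name__ == "__main__":
    # Prepend this to the body of a generated script, after the usual
    # cluster name and admin conninfo lines.
    print(sync_and_wait_preamble(1, timeout_s=300), end="")
```

With a finite timeout the script stops at the top if replication is
behind, rather than partway through a shape change.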

> That's pretty specific to "replication being behind" - a different
> flavour of useful might be to have a series of "fail if condition
> not met" evaluators to check things like:
>
>  - Is a node there?
>  - Is a replication set there?
>  - Is a particular subscription configured?
>  - Run a bit of SQL, and fail if it errors out.
>  - Has a node successfully processed events within [time interval]?
>  - Is a node presently processing a particular event?
>
> That last item is suggestive of another branch in the whole "stream of
> consciousness"...

> It sure would be nice if each slon would note somewhere, in a way that
> can be queried by others, what it's working on.  That would be useful
> for the common scenario we have where beginners (and sometimes even
> experienced folk!) say "I don't quite know if replication is working.
> How can I find that out???"

I'd so love my admins to be able to see that the canned script to add
a new replica is blocked because the 11 hour backup is currently
running.

> BTW, the larger goal here isn't merely to draw out how we might handle
> an "implicit WAIT FOR EVENT" - we (where "we" initially consists of
> Jan, Steve, and myself as 'core') are trying to figure out what sorts of
> enhancements would be worth introducing.
>
> And your comments have introduced at least two additional ideas to my
> list, which I appreciate considerably!

The other thing I bumped into the other day - there is no way to get
slonik to emit the current date/time. I've got scripts running against
staging environments that I'd like to get timings for so we can plan
production updates properly. At the moment, I can only get the timing
of the entire script.
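
One workaround, sketched under assumptions rather than anything slonik
provides: drive slonik from a wrapper that runs the script in pieces
and prints a timestamp around each piece. The names run_timed,
preamble and steps are hypothetical. The sketch assumes slonik is on
the PATH and reads its script from standard input, and that every
piece is prefixed with the usual cluster name and admin conninfo
lines.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, List, Optional, Tuple


def _run_slonik(script: str) -> None:
    # Feed the script to slonik on stdin; a non-zero exit raises
    # subprocess.CalledProcessError.
    subprocess.run(["slonik"], input=script, text=True, check=True)


def run_timed(
    preamble: str,
    steps: List[Tuple[str, str]],
    runner: Optional[Callable[[str], None]] = None,
    log: Callable[[str], None] = print,
) -> List[Tuple[str, float]]:
    """Run each (label, slonik_text) step separately and time it.

    Returns a list of (label, elapsed_seconds) in the order run.
    """
    run = runner if runner is not None else _run_slonik
    timings = []
    for label, body in steps:
        log(f"{datetime.now().isoformat(timespec='seconds')} start {label}")
        started = time.monotonic()
        run(preamble + body)
        elapsed = time.monotonic() - started
        log(f"{datetime.now().isoformat(timespec='seconds')} done  {label} "
            f"({elapsed:.1f}s)")
        timings.append((label, elapsed))
    return timings
```

The runner argument exists so the wrapper can be exercised against a
staging harness or a stub without a live cluster. Splitting a script
this way costs a fresh slonik startup per step, which should be
negligible next to the operations being timed.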

-- 
Stuart Bishop <stuart at stuartbishop.net>
http://www.stuartbishop.net/