[Slony1-general] sync performance

Tue Sep 13 06:59:46 PDT 2016

	Interesting test results. By adding date commands inside and outside the script, it’s clear there are 11-12 secs of startup contact with the nodes before any commands get going. After that, each sync takes anywhere from 6 to 15 secs to execute. Once in a while I’ll also get a postgres timeout error, even though I know the DB hasn’t gone down.
	Early on I adopted a habit of providing conninfo for every node at the start of each script. It seems I should now be aiming for minimal conninfo, fewer scripts, or both.

root@prodrpl-Amst:~# date && akaslonik /tmp/commcheck-2.slk
Tue Sep 13 13:50:14 UTC 2016
/tmp/commcheck-2.slk:44: 2016-09-13 13:50:26
/tmp/commcheck-2.slk:47: 2016-09-13 13:50:34
/tmp/commcheck-2.slk:50: waiting for event (2,5001432258) to be confirmed on node 5
/tmp/commcheck-2.slk:51: 2016-09-13 13:50:47
/tmp/commcheck-2.slk:55: 2016-09-13 13:50:53
root@prodrpl-Amst:~#
root@prodrpl-Amst:~#
root@prodrpl-Amst:~# date && akaslonik /tmp/commcheck-2.slk
Tue Sep 13 13:51:01 UTC 2016
/tmp/commcheck-2.slk:44: 2016-09-13 13:51:12
/tmp/commcheck-2.slk:46: waiting for event (2,5001432264) to be confirmed on node 5
/tmp/commcheck-2.slk:47: 2016-09-13 13:51:22
/tmp/commcheck-2.slk:51: 2016-09-13 13:51:31
/tmp/commcheck-2.slk:55: 2016-09-13 13:51:40
root@prodrpl-Amst:~#
root@prodrpl-Amst:~# date && akaslonik /tmp/commcheck-2.slk
Tue Sep 13 13:51:47 UTC 2016
/tmp/commcheck-2.slk:44: 2016-09-13 13:51:58
/tmp/commcheck-2.slk:46: waiting for event (2,5001432272) to be confirmed on node 5
/tmp/commcheck-2.slk:47: 2016-09-13 13:52:12
/tmp/commcheck-2.slk:50: waiting for event (2,5001432274) to be confirmed on node 5
/tmp/commcheck-2.slk:51: 2016-09-13 13:52:23
/tmp/commcheck-2.slk:54: waiting for event (2,5001432276) to be confirmed on node 5
/tmp/commcheck-2.slk:55: 2016-09-13 13:52:38
root@prodrpl-Amst:~#
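
For anyone who wants to check the arithmetic, here is a minimal Python sketch that turns the timestamps from the first run into per-step durations. It assumes the first timestamp (script line 44) marks the end of slonik startup and that each later timestamp follows one sync/wait pair; the script itself is not shown here, so that mapping is an inference from the output. The names `START`, `STAMPS` and `step_durations` are made up for the illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# First run from the transcript: shell `date` output, then the
# (script line, timestamp) pairs printed by the script.
START = "2016-09-13 13:50:14"
STAMPS = [
    (44, "2016-09-13 13:50:26"),
    (47, "2016-09-13 13:50:34"),
    (51, "2016-09-13 13:50:47"),
    (55, "2016-09-13 13:50:53"),
]


def step_durations(start, stamps):
    """Return [(script_line, seconds elapsed since the previous timestamp)]."""
    prev = datetime.strptime(start, FMT)
    result = []
    for line_no, text in stamps:
        now = datetime.strptime(text, FMT)
        result.append((line_no, int((now - prev).total_seconds())))
        prev = now
    return result


durations = step_durations(START, STAMPS)
for line_no, secs in durations:
    print(f"line {line_no}: {secs}s")
```

Under those assumptions the first entry is the startup delay and the remaining entries are the individual sync/wait times.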

	Tom    ☺

On 9/12/16, 4:38 PM, "Steve Singer" <steve at ssinger.info> wrote:

    On 09/12/2016 11:39 AM, Tignor, Tom wrote:
    >                  Seems I have an additional data point: the sync test
    > always takes longer (> 20 secs) if I include conninfo for all cluster
    > nodes instead of just the local node. I had previously thought conninfo
    > data was only used when needed. Is this not the case?

    What if you do

      sync(id=2);

      wait for event (origin=2, confirmed=5, wait on=2, timeout=30);

      sync(id=2);

      wait for event (origin=2, confirmed=5, wait on=2, timeout=30);

      sync(id=2);

      wait for event (origin=2, confirmed=5, wait on=2, timeout=30);

    3 times (or more) in a row, does it still take about the same amount of
    time as 1 sync?

    When slonik starts up, it contacts every node it has admin conninfo for
    to get the current state/last event from each node. Maybe your time
    is being spent establishing all those connections over SSL.

    >
    >                  Tom ☺
    >
    > *From: *Tom Tignor <ttignor at akamai.com>
    > *Date: *Monday, September 12, 2016 at 10:52 AM
    > *To: *"slony1-general at lists.slony.info" <slony1-general at lists.slony.info>
    > *Subject: *sync performance
    >
    >                  Hello slony1 community,
    >
    >                  We’ve recently been testing communication reliability
    > between our cluster nodes. Our config is a simple setup with one
    > provider producing a modest volume of changes (measured in KB/s)
    > consumed by 5 direct subscribers, though these are geographically
    > distributed. The test is just a sync event followed by a wait on the
    > sync originator. Example:
    >
    > cluster name = ams_cluster;
    >
    > node 5 admin
    >        conninfo='dbname=ams
    >        host=23.79.242.182
    >        user=ams_slony
    >        sslmode=verify-ca
    >        sslcert=/usr/local/akamai/.ams_certs/complete-ams_slony.crt
    >        sslkey=/usr/local/akamai/.ams_certs/ams_slony.private_key
    >        sslrootcert=/usr/local/akamai/etc/ssl_ca/canonical_ca_roots.pem';
    >
    > node 2 admin conninfo = 'dbname=ams user=ams_slony';
    >
    > sync(id=2);
    >
    > wait for event (origin=2, confirmed=5, wait on=2, timeout=30);
    >
    >                  Tests show the script takes 10-20 secs to run on
    > different nodes.
    >
    >                  Can anyone explain what’s happening internally during
    > this time, and why it takes so long? On a healthy, lightly loaded
    > system, we might have hoped for a sync response in just a couple
    > seconds. Our slon daemons are running with mostly default startup options.
    >
    >                  Thanks in advance,
    >
    >                  Tom ☺
    >