Christopher Browne cbbrowne at ca.afilias.info
Tue Nov 9 13:27:49 PST 2010
sharadov <sreddy at spark.net> writes:
> We have slony replication set up, and the replication on the slave has
> fallen behind by 10 days. On investigating I noticed that the sl_log_1 table
> has 25K records, but the sl_log_2 table has over 100 million rows, and they
> keep going up. How do I go about troubleshooting this?
>
> I am a newbie to slony, and would appreciate all the help that I can get

You should consider running the "test_slony_state" script which pokes at
various parts of the configuration with a view to seeing what might be
wrong.

  <http://slony.info/documentation/2.0/monitoring.html>

Some questions...

  - Why didn't you notice for 10 days?

    Presumably monitoring hasn't been done right.  I'd suggest running
    test_slony_state on an hourly basis; it complains only if something
    seems broken...
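
    A cron wrapper for that might look something like this sketch; the
    install path, config file name, and --config option are assumptions
    on my part, so check the usage notes in the script itself (it ships
    in the tools/ directory of the Slony-I source):

```shell
#!/bin/sh
# Hourly cron wrapper around test_slony_state.pl (sketch).
# Paths and the --config option are assumptions; adjust to your install.
STATE_CHECK=/usr/local/slony/tools/test_slony_state.pl
CONF=/etc/slony/state.config
if [ -x "$STATE_CHECK" ]; then
    # The script only complains when something looks broken, so cron's
    # mail-on-output behaviour makes a cheap alarm.
    perl "$STATE_CHECK" --config "$CONF"
fi
```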

  - Are the slon processes running?

    Usually /usr/bin/ps can help find them...
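
    For instance (the bracketed [s] keeps the grep command itself out
    of the results):

```shell
# Sketch: look for running slon daemons; there should be one per node
# that this host is responsible for.
ps auxww | grep '[s]lon' || echo "no slon processes found"
```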

  - Is the slon for the subscriber actually replicating data?

    You should search in the slon logs for the subscriber for lines
    looking like:

       DEBUG2: remoteWorkerThread_%d: SYNC %d done in %.3f seconds
       DEBUG2: remoteWorkerThread_%d_d: inserts=%d updates=%d deletes=%d

    That should give you an idea as to whether replication work is
    actually taking place.
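
    That search might be scripted along these lines; the log path is
    just an example, since slon logs wherever you pointed it:

```shell
# Sketch: pull recent SYNC-completion lines out of a subscriber's slon
# log.  The path is an example; point LOG at your actual log file.
LOG=/var/log/slony/node2.log
if [ -f "$LOG" ]; then
    grep -E 'remoteWorkerThread_[0-9]+: SYNC [0-9]+ done' "$LOG" | tail -5
fi
```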

    If it is running into errors before doing real work, then there is
    some problem that needs to be rectified.

    There is also a "worst case" scenario where there are far too many
    events to process and a timeout gets exceeded:

       ERROR: remoteListenThread_%d: timeout for event selection

    This means that the listener thread (src/slon/remote_listener.c)
    timed out when trying to determine what events were outstanding for
    it.

    This could occur because network connections broke, in which case
    restarting the slon might help.

    Alternatively, it might occur because the slon for this node has
    been broken for a long time, leaving an enormous number of entries
    in sl_event on this or other nodes for it to work through, so that
    the query takes more than slon_conf_remote_listen_timeout seconds
    to run.  In older versions of Slony-I, that configuration parameter
    did not exist; the timeout was fixed at 300 seconds.  And then
    investigate why nobody was monitoring things such that replication
    stayed broken for so long...

    If this proves to be the problem, raise the listen timeout in the
    slon config file to something rather larger than 300 seconds, and
    hopefully the slon can get past the too-many-events backlog.
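
   A config fragment for that might look like the sketch below; the
   value is an arbitrary example, and the exact option name is worth
   checking against the slon run-time configuration docs for your
   version:

```shell
# slon config file fragment (sketch).  Raises the event-selection
# timeout well above the 300-second default so a badly backlogged
# node has time to catch up; 3600 is an arbitrary example value.
remote_listen_timeout=3600
```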
-- 
"cbbrowne","@","ca.afilias.info"
Christopher Browne
"Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
phasers on the Heffalump, Piglet, meet me in transporter room three"