sharadov sreddy at spark.net
Tue Nov 9 17:09:26 PST 2010
I'll start working through what you suggested.


Christopher Browne wrote:
> 
> sharadov <sreddy at spark.net> writes:
>> We have slony replication set up, and the replication on the slave has
>> fallen behind by 10 days. On investigating I noticed that the sl_log_1
>> table
>> has 25K records, but the sl_log_2 table has over 100 million rows, and
>> they
>> keep going up. How do I go about troubleshooting this?
>>
>> I am a newbie to slony, and would appreciate all the help that I can get.
> 
> You should consider running the "test_slony_state" script which pokes at
> various parts of the configuration with a view to seeing what might be
> wrong.
> 
>   <http://slony.info/documentation/2.0/monitoring.html>
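A quick way to see the lag directly is the sl_status view that Slony-I maintains in the cluster's schema. This is only a sketch; "_mycluster" is a placeholder for your actual cluster name:

```sql
-- Replace "_mycluster" with your cluster's name.
-- st_lag_num_events and st_lag_time show how far each subscriber
-- is behind its origin.
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
  FROM "_mycluster".sl_status;
```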
> 
> Some questions...
> 
>   - Why didn't you notice for 10 days?
> 
>     Presumably monitoring hasn't been done right.  I'd suggest running
>     test_slony_state on an hourly basis; it complains only if something
>     seems broken...
> 
>   - Are the slon processes running?
> 
>     Usually /usr/bin/ps can help find them...
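On most systems a quick check looks like this (the bracketed `[s]` keeps grep from matching its own process in the listing):

```shell
# List any running slon processes; prints a fallback message if none.
ps auxww | grep '[s]lon' || echo "no slon processes found"
```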
> 
>   - Is the slon for the subscriber actually replicating data?
> 
>     You should search in the slon logs for the subscriber for lines
>     looking like:
> 
>        DEBUG2: remoteWorkerThread_%d: SYNC %d done in %.3f seconds
>        DEBUG2: remoteWorkerThread_%d_d: inserts=%d updates=%d deletes=%d
> 
>     That should give you an idea as to whether replication work is
>     actually taking place.
> 
>     If it's running into errors before doing real work, then there's
>     some problem that needs to be rectified.
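As a sketch of what to look for in the logs (the file path and sample lines below are fabricated; real lines follow the DEBUG2 formats quoted above):

```shell
# Fabricated sample of subscriber slon log lines:
cat > /tmp/slon_sample.log <<'EOF'
DEBUG2: remoteWorkerThread_1: SYNC 5017 done in 0.254 seconds
DEBUG2: remoteWorkerThread_1: SYNC 5018 done in 0.312 seconds
ERROR: remoteListenThread_1: timeout for event selection
EOF

# Count completed SYNCs -- a steadily rising count means replication
# is actually moving; zero means the slon is stuck on errors.
grep -c 'SYNC [0-9]* done' /tmp/slon_sample.log
# prints 2
```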
> 
>     There's also a "worst case scenario" where there are far too many
>     events to process, and a timeout gets exceeded:
> 
>        ERROR: remoteListenThread_%d: timeout for event selection
> 
>      This means that the listener thread (src/slon/remote_listener.c)
>      timed out when trying to determine what events were outstanding for
>      it.
> 
>      This could occur because network connections broke, in which case
>      restarting the slon might help.
> 
>      Alternatively, this might occur because the slon for this node has
>      been broken for a long time, and there are an enormous number of
>      entries in sl_event on this or other nodes for the node to work
>      through, so that the query takes more than
>      slon_conf_remote_listen_timeout seconds to run. In older versions
>      of Slony-I, that configuration parameter did not exist; the
>      timeout was fixed at 300 seconds. In newer versions, you can
>      increase that timeout in the slon config file so that the query
>      can run to completion. And then investigate why nobody was
>      monitoring things, such that replication stayed broken for so
>      long...
> 
>    If this proves to be the problem, then you can change the listen
>    timeout to something rather larger than 300 seconds.  And hopefully
>    the slon can get past the too-many-events problem.
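If the timeout does prove to be the problem, the relevant slon runtime option (per the Slony-I docs; verify the exact name and default against your version) goes in the slon conf file, e.g.:

```
# slon.conf -- raise the remote listener timeout from its 300 s default
# so the event-selection query can finish on a badly lagged node.
remote_listen_timeout=1800
```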
> -- 
> "cbbrowne","@","ca.afilias.info"
> Christopher Browne
> "Bother,"  said Pooh,  "Eeyore, ready  two photon  torpedoes  and lock
> phasers on the Heffalump, Piglet, meet me in transporter room three"
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general
> 
> 

-- 
View this message in context: http://old.nabble.com/Sl_log-table-is-huge%2C-over-100-million-rows-tp30173901p30176891.html
Sent from the Slony-I -- General mailing list archive at Nabble.com.


