Zac Bentley zbbentley at gmail.com
Wed Nov 30 05:55:37 PST 2011
We recently had a database load failure, and it seems to have been caused
by Slony. I'm wondering why/how this occurred, and how it could be
prevented in the future.


We run a webapp with a relatively high rate of database queries per minute.
We have a pair of database servers replicated via Slony: one master, one
slave. The slave runs the slons for both nodes. The two are in physically
distant locations across the US from each other, but our hosting provider
maintains a very high-bandwidth, low-latency link between them; it
transfers data faster than our 100 Mbit onsite LAN does. We use Postgres
8.4, Slony 2.0, and Apache/PHP for the webapp. All of our Slony options are
the defaults.


A while ago, we set up replication between the two servers and completed
the initial subscription process. Everything went well, and replication was
verified to be working. Then, due to a firewall problem, node 2 (the slave)
couldn't talk to node 1 (the master) for 10 days. During that time, there
was a LOT of database activity (DML only) on the master, but Slony wasn't
replicating any of it.


When we finally fixed the problem after 10 days without syncing, I could
see hundreds of SYNC events being received, queued, and processed in the
slave's log. I figured it would take a day or more to catch up, but that
wasn't a problem.
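
For context, this is roughly how we were watching the catch-up progress,
via Slony's sl_status view on the master (a sketch; '_mycluster' is a
placeholder for the schema Slony creates, an underscore plus the cluster
name):

    -- Replication lag per subscriber, as seen from the origin node.
    SELECT st_origin,
           st_received,
           st_lag_num_events,  -- SYNC events not yet confirmed by the subscriber
           st_lag_time         -- wall-clock lag behind the origin
      FROM _mycluster.sl_status;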

Around an hour after re-establishing the link, our webapp crashed. Checking
Apache's monitoring showed that all available database connections were in
use (Apache caps them on its side; we have not capped them in Postgres) and
waiting on the database to respond: a garden-variety database load failure.
We purged the connections and restarted Postgres on the master.
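
For reference, here is the kind of snapshot we were taking while the
connections piled up (Postgres 8.4 system views, hence the 'waiting'
column):

    -- Backends per user and waiting state; on 8.4, waiting = true means
    -- the backend is blocked on a lock.
    SELECT usename, waiting, count(*)
      FROM pg_stat_activity
     GROUP BY usename, waiting
     ORDER BY count(*) DESC;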


Then it happened again, and again 20 minutes later. I checked pgAdmin on
the master and saw a fair amount of replication activity, but it appeared
to be nothing more than the generation of a lot of SYNC events. However,
the number of locks on important tables was so large, and growing so
rapidly, that it was causing load failures. These failures kept occurring
until we disconnected the slon host (and the slave database, which runs on
the same machine) from the network again.
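
To put numbers on the growth, this is roughly the query we used to watch
the lock counts climb (again 8.4 catalogs):

    -- Relation-level locks per table and mode, busiest tables first.
    SELECT c.relname, l.mode, l.granted, count(*) AS locks
      FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
     WHERE l.locktype = 'relation'
     GROUP BY c.relname, l.mode, l.granted
     ORDER BY count(*) DESC;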


Why would SYNC catch-up cause lock bloat on our master node? According to
the SYNC documentation, no locking is supposed to take place. Is this
caused by the fact that our slon daemons run remotely from the master? Is
this normal behavior for Slony when a slave has a many-days-out-of-date
database that needs to be caught up?
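
If it helps with diagnosis, next time it happens I plan to tie the locks
back to the sessions holding them with something along these lines (a
sketch; it assumes the slons connect as a dedicated 'slony' role, and it
uses the 8.4 procpid/current_query columns):

    -- Which backends hold or wait on relation locks, and what are they running?
    -- Filter on usename to single out the slon connections ('slony' here is
    -- whatever role your slons connect as).
    SELECT a.procpid, a.usename, c.relname, l.mode, l.granted, a.current_query
      FROM pg_locks l
      JOIN pg_class c         ON c.oid = l.relation
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE l.locktype = 'relation'
     ORDER BY c.relname, a.procpid;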