Wed Nov 30 05:55:37 PST 2011
- Previous message: [Slony1-general] Re : Re : Re : dump, restore & --exclude-schema
- Next message: [Slony1-general] Excessive locking during Slony catch-up
We recently had a database load failure, and it appears to have been caused by Slony. I'm wondering why and how this occurred, and how it can be prevented in the future.

We run a webapp with a relatively high number of database hits per minute. We have a pair of database servers replicated via Slony: one master, one slave. The slave runs the slons for both nodes. The two are in physically distant locations across the US from each other, but our hosting provider maintains a very high-bandwidth, low-latency link between them; it transfers data faster than our 100 Mbit onsite LAN does. We use Postgres 8.4, Slony 2.0, and Apache/PHP for the webapp. All of our Slony options are the defaults.

A while ago, we set up replication between the two servers and completed the initial subscription process. Everything went well, and replication was tested and working. Then, due to a firewall problem, node 2 (the slave) couldn't talk to node 1 (the master) for 10 days. During that time there was a LOT of database activity (DML only) on the master, but Slony wasn't replicating any of it.

When we finally fixed the problem after 10 days of non-syncing, I could see hundreds of sync requests being received, queued, and processed in the slave's log. I figured it would take a day or more to catch up, but that wasn't a problem.

Around an hour after re-establishing the link, our webapp crashed. Apache's monitoring showed that all available database connections were filled (Apache limits them; Postgres allows unlimited) and waiting for the database to respond: a garden-variety database load failure. We purged the connections and restarted Postgres on the master. Then it happened again. And again, 20 minutes later.

I checked pgAdmin on the master and saw a fair amount of replication activity, but it appeared to be the generation of a lot of SYNC events, nothing more. However, the number of locks on important tables was so large and growing so rapidly that it was causing load failures. These failures kept occurring until we disconnected the slon host (and the slave database, which runs on the same computer) from the network again.

Why would SYNC catch-up cause lock bloat on our master node? According to the SYNC documentation, no locking is supposed to take place. Is this caused by the fact that our slon daemons run remotely from the master? Is this normal behavior for Slony when a slave has a many-days-out-of-date database that needs to be caught up?
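For reference, here is roughly what I was watching while this was happening. On 8.4, a query along these lines (a sketch, not our exact monitoring) lists the backends stuck waiting on locks:

    -- Backends currently blocked waiting for a lock.
    -- (8.4 column names: procpid and current_query were renamed in later releases.)
    SELECT procpid, usename, waiting, current_query
    FROM pg_stat_activity
    WHERE waiting;

The WHERE clause filters to just the blocked sessions; dropping it shows everything connected.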
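To see where those sessions were piling up, a per-table lock count like this (again just a sketch) makes the lock bloat visible; in our case the top rows were the important tables I mentioned above:

    -- Lock count per relation and lock mode, granted and waiting alike;
    -- a rapidly growing count flags the contended tables.
    SELECT c.relname, l.mode, count(*) AS locks
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE l.locktype = 'relation'
    GROUP BY c.relname, l.mode
    ORDER BY locks DESC;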
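And to gauge how far behind the slave still was, Slony's sl_status view in the cluster schema, run on the master, reports the lag ("_ourcluster" below stands in for our real cluster name):

    -- Replication lag per subscriber, in events and in wall-clock time.
    -- "_ourcluster" is a placeholder for the actual Slony cluster schema.
    SELECT st_origin, st_received, st_lag_num_events, st_lag_time
    FROM _ourcluster.sl_status;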