Steve Singer ssinger at ca.afilias.info
Wed Nov 30 13:17:33 PST 2011
On 11-11-30 08:55 AM, Zac Bentley wrote:


You might want to review 
http://bugs.slony.info/bugzilla/show_bug.cgi?id=222 and 
http://bugs.slony.info/bugzilla/show_bug.cgi?id=167

Bug 167 means that the load on the master will be very high after 10
days of no replication.  I am not sure whether bug 222 would cause the
issue you are describing.
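
The practical effect of a long outage is that sl_log_1/sl_log_2 on the
origin keep accumulating every unreplicated row change, and the remote
slon has to select from them for each group of SYNCs it catches up on.
If it helps, here is a rough way to gauge the remaining backlog (just a
sketch; _mycluster is a placeholder for your actual _<clustername>
schema):

   -- _mycluster is a placeholder; substitute your own cluster schema.
   -- Row counts in the log tables show how much unreplicated data is
   -- still queued on the origin.
   SELECT 'sl_log_1' AS log_table, count(*) AS n_rows
     FROM _mycluster.sl_log_1
   UNION ALL
   SELECT 'sl_log_2', count(*)
     FROM _mycluster.sl_log_2;

   -- The gap between the origin's newest event and the subscriber's
   -- last confirmed event tells you how far behind the slave still is.
   SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
     FROM _mycluster.sl_confirm
    GROUP BY con_origin, con_received;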

Exactly which locks was PostgreSQL waiting to obtain, and which
transactions held those locks?
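
In case it is useful, something along these lines will show, for each
ungranted lock request, which sessions hold a lock on the same object
(a sketch against the 8.4 catalogs, where pg_stat_activity still has
procpid and current_query rather than pid and query):

   -- Pair each waiting (ungranted) lock with a granted lock on the
   -- same object held by a different backend, and show both queries.
   SELECT blocked.pid                AS blocked_pid,
          blocked_act.current_query  AS blocked_query,
          blocker.pid                AS blocking_pid,
          blocker_act.current_query  AS blocking_query,
          blocked.locktype,
          blocked.relation::regclass AS relation
     FROM pg_locks blocked
     JOIN pg_locks blocker
       ON blocker.granted
      AND NOT blocked.granted
      AND blocked.pid <> blocker.pid
      AND blocked.locktype = blocker.locktype
      AND blocked.relation IS NOT DISTINCT FROM blocker.relation
      AND blocked.transactionid IS NOT DISTINCT FROM blocker.transactionid
     JOIN pg_stat_activity blocked_act ON blocked_act.procpid = blocked.pid
     JOIN pg_stat_activity blocker_act ON blocker_act.procpid = blocker.pid;

That output, together with the sl_log table sizes, would make it much
easier to tell whether you are hitting 222, 167, or something else.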




> We recently had a database load failure, and it seems to have been
> caused by Slony. I'm wondering why/how this occurred, and how it could
> be prevented in the future.
>
>
> We run a webapp with a relatively high number of database hits/minute.
> We have a pair of database servers replicated via Slony: one master, one
> slave. The slave runs the slons for both nodes. The two are in
> physically distant locations across the US from each other, but our
> hosting provider maintains a very high-bandwidth, low-latency link
> between the two; it transfers data faster than our 100mbit onsite LAN
> does. We use Postgres 8.4, Slony 2.0, and apache/php for the webapp. All
> of our Slony options are the defaults.
>
>
> A while ago, we set up replication between the two servers and completed
> the initial subscription process. Everything went well, and replication
> was tested working. Then, due to a firewall problem, node 2 (the slave)
> couldn’t talk to node 1 (the master) for 10 days.  During that time,
> there was a LOT of database activity (DML only) on the master, but Slony
> wasn’t replicating any of it.
>
>
> When we finally fixed the problem after 10 days of non-syncing, I could
> see hundreds of sync requests being received, queued, and processed in
> the slave’s log. I figured it would take a day or more to catch up, but
> that wasn’t a problem.
>
> Around an hour after re-establishing the link, our webapp crashed.
> Checking apache’s monitoring showed that all available database
> connections were filled (apache limits them; Postgres allows unlimited)
> and waiting for the database to respond: a garden-variety database load
> failure. We purged the connections and restarted Postgres on the master.
>
>
> Then it happened again. And again, 20 minutes later. I checked pgadmin
> for the master, and saw a fair amount of replication activity, but it
> appeared to be the generation of a lot of SYNC events, nothing more.
> However, the number of locks on important tables was so large and
> growing so rapidly that it was causing load failures. These failures
> kept occurring until we disconnected the slon host (and the slave
> database—same computer) from the network again.
>
>
> Why would SYNC catch-up cause lock bloat on our master node? According
> to the SYNC documentation, no locking is supposed to take place. Is this
> caused by the fact that our slon daemons run remotely from the master?
> Is this normal behavior for Slony when a slave has a
> many-days-out-of-date database that needs to be caught up?
>
>
>
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general


