Brad Nicholson bnichols
Wed May 31 08:02:24 PDT 2006
Rod Taylor wrote:
>>> If you have more than about 500GB of data on a heavily used OLTP
>>> database, I strongly advise posting a detailed plan of attack or hiring
>>> a consultant. There are lots of gotchas at that level of the game just
>>> due to the impact of the time involved.
>> Some of the frontline consultants are also in this mess...
> 
> I've replicated a moderately sized DB (around the 400GB mark last time
> not including indexes) through a few DB upgrades using Slony with the
> same physical machine as the source and destination.
> 
> It's not difficult but it does require quite a bit of thought on how to
> keep the Slony transactions as short as possible AND keep the number of
> sets required as low as possible.
> 
> My main issue was actually scheduling regular system vacuums on DB hot
> spots to run between Slony transactions to ensure that tables that bloat
> quickly didn't grow more than 100 times their normal size. COPY and
> building the PRIMARY KEY for the large tables were the difficult parts.
> 
> This required all kinds of lead in work that included changing primary
> keys to a SERIAL instead of several varchars and that type of thing.
> Having a parallel index builder within PostgreSQL would pretty much
> eliminate that type of problem.
> 
>>> 4) Stampeding Slony. If for any reason Slony is unable to complete the
>>> replication work available within the timeframe allotted (5 minutes I
>>> think), Slony will abandon the first connection and establish a new one
>>> to retry.
>> Exactly what "replication work" do you mean?  One table? All tables being copied?
>> In my situation I have 6500*5 + 100 tables to copy.  No way is that going to be
>> completed in 5 minutes, no matter how small the tables are.  (And no
>> I did not design the schema :)
> 
> The standard sync process that happens in the background on a day to day
> basis. I don't know if it is because pg_listener passes a threshold and
> locks get in the way or if sl_event or sl_log_1 grow too large, but at
> some point it can take Slony significant amounts of time to replicate a
> 'sync' event.
> 
> The queries that scoop data from sl_log_1 start to take a long time and
> can pass the timeout. After that it seems to run away with itself.
> 

I've seen both cases happen.  With long-running transactions, you pretty
much need to eliminate the transactions and then get a vacuum of
pg_listener through.  A VACUUM FULL of pg_listener is sometimes needed.
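
Roughly, something along these lines (just a sketch; the pg_stat_activity
column names here, procpid and current_query, are the 8.x ones, so adjust
for your version):

  -- find the long-running backends that are holding things back
  SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
   ORDER BY query_start;

  -- once those transactions are gone, a plain vacuum may be enough
  VACUUM VERBOSE pg_listener;

  -- if it's already badly bloated (note: takes an exclusive lock)
  VACUUM FULL VERBOSE pg_listener;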

If sl_log_1 gets huge, clustering the table might help you, but that's
going to block access while it runs.  If memory serves me, when sl_log_1
grows big there are usually a lot of dead index rows that don't clear
out easily.
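
If you do go the clustering route, it looks roughly like this (a sketch
only; "_mycluster" stands in for your Slony cluster's schema, and the
index name is the one Slony-I creates by default, so check what actually
exists in your schema first):

  -- rewrites the table and its indexes in index order; takes an
  -- exclusive lock, so replication will block while it runs
  CLUSTER sl_log_1_idx1 ON _mycluster.sl_log_1;

  -- lighter option when it's mainly the indexes that are bloated
  REINDEX TABLE _mycluster.sl_log_1;

Either way, schedule it for a quiet window, since both hold locks that
the Slony log trigger has to wait on.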

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



