[Slony1-general] Do finishTableAfterCopy and ANALYZE need to be serialized with data copy?

Tue Nov 18 14:48:28 PST 2008

> Richard Yen <dba at richyen.com> writes:
>> This might be moot with the coming release of Slony 2.0.0, but I was
>> wondering if there are any thoughts about the following question:
>>
>> Do the finishTableAfterCopy() and ANALYZE of each table need to happen
>> in serial with the data copy from stdin?  i.e., can we create a new
>> thread that will do these two things while slon proceeds to copy the
>> data of the next table?
>>
>> I raise this question because for large data sets, I think the
>> copy_set process time could be improved by 30-40% if we can split
>> these two stages.  I have some large tables that take 30 min or so to
>> copy, then another 15-20 min to finishTableAfterCopy() and ANALYZE.
>>
>>

> On Nov 18, 2008, at 2:22 PM, Christopher Browne wrote:
>>  Step 2.  Order the requests so as to maximize parallelism.
>>
>>       Thus, we subscribe to tables in reverse order of their
>>       estimated size (pg_class.relpages should be a reasonable
>>       approximation).
>>
>>       This means that we tend to push the bigger tables onto the
>>       "reindex queue" as early as possible in the subscription
>>       process.

We have this same request.  In our case we may or may not have anything
reading the slave right away, so simply having an option to skip the
ANALYZE would be nice.  That way a separate process can run ANALYZE
whenever we feel it is appropriate (or we can leave it to autovacuum).

It would also be very nice to be able to specify a level of parallelism
for the COPY processes.  In many cases there is no benefit to running
everything in serial.  By randomizing the order in which tables are
COPY'd, you stand a reasonable chance of every process getting about
the same amount of work, and so of the fastest completion time.  A more
elaborate algorithm based on table size could balance the work better
still.
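
To make the size-based idea concrete, here is a rough sketch of one way
it could work.  It is not anything slon does today.  It assumes we
already have a page count per table (pg_class.relpages, as suggested
above) and it hands each table, largest first, to whichever COPY worker
currently has the least work queued.  The function names, table names
and page counts are all made up for illustration.

```python
import heapq


def assign_largest_first(table_pages, workers):
    """Greedy largest-first split of tables across COPY workers.

    table_pages maps a table name to its estimated size in pages
    (assumed to come from pg_class.relpages). Returns one list of
    table names per worker.
    """
    # Min-heap of (pages assigned so far, worker index).
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    plan = [[] for _ in range(workers)]

    # Biggest tables first, each going to the least-loaded worker.
    by_size = sorted(table_pages.items(), key=lambda kv: kv[1], reverse=True)
    for name, pages in by_size:
        load, w = heapq.heappop(heap)
        plan[w].append(name)
        heapq.heappush(heap, (load + pages, w))
    return plan


def worker_loads(plan, table_pages):
    """Total estimated pages assigned to each worker."""
    return [sum(table_pages[t] for t in group) for group in plan]


# Hypothetical page counts, for illustration only.
sizes = {"a": 100, "b": 60, "c": 50, "d": 30, "e": 20}
plan = assign_largest_first(sizes, 2)
print(plan, worker_loads(plan, sizes))
```

With two workers the hypothetical tables above split into two groups of
130 pages each.  Random assignment gets close to this on average when
there are many similar tables, but a few very large tables are where
the size-based ordering pays off.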

thx
Kenny Gorman
www.hi5.com