Christopher Browne cbbrowne at ca.afilias.info
Sun Sep 16 22:11:27 PDT 2007
Jan Wieck <JanWieck at Yahoo.com> writes:
> On 9/16/2007 10:06 AM, Cyril SCETBON wrote:
>> Jan Wieck wrote:
>>> On 9/16/2007 4:27 AM, Cyril SCETBON wrote:
>>>>
>>>> Filip Rembiałkowski wrote:
>>>>> Guys, this discussion went quite off-topic, but maybe you missed one
>>>>> fact that Cyril may not know about.
>>>>>
>>>>> In postgres, setting N columns to NULL is just N bits of physical
>>>>> writes.
>>>>>
>>>>> So the overhead of
>>>>>
>>>>> INSERT INTO t1 ( id, data1, data2, data3, ..., data100 )
>>>>> VALUES( 12345, 'the only non-null data', NULL,NULL, ..., NULL )
>>>>>
>>>>> is not so terrible.
>>>>>
>>>> Thanks Filip, but I wasn't talking about the performance of
>>>> applying this query; my point was that storing a longer statement
>>>> than a simple insert into t1(col1,col2) values(valcol1,valcol2)
>>>> causes the slony tables to grow faster and needs more network
>>>> bandwidth, that's all :-)
>>>
>>> And we can't do that because imagine you have a schema
>>>
>>>     create table t1 (
>>>         a int primary key,
>>>         b text default 'foo'
>>>     );
>>>
>>> and then do
>>>
>>>     insert into t1 (a, b) values (1, NULL);
>>>
>>> the resulting row, the one that makes it into the master's table, will be
>>>
>>>     (1, NULL)
>>>
>>> Yeah, another one of those little things where MySQL behaves
>>> differently and where Postgres is right according to ANSI. If we omit
>>> that NULL column now from the INSERT on the subscriber, the default
>>> will be used there, resulting in
>>>
>>>     (1, 'foo')
>>>
>>> and we have the subscriber out of sync.
>>>
>>>
>>> Jan
>>>
>> The suggestion was to capture the query as entered, so here (a,b) (1,
>> NULL) would be stored. But if the user enters insert into t1(b)
>> values('b'), then (b) ('b') would be stored, and as every subscriber
>> has the same schema, the default value would be used on them.
>
> And what exactly do you capture as "request entered" if the original
> query was "insert into foo (a, b, c) select x, y, z from temp_bar;" ?
>
> There certainly is some optimization possible. No doubt. But it isn't
> achievable as cheaply as you think. One would first have to agree that
> all nodes in the whole cluster have to have tables containing the same
> columns. Then, the column names can be mapped to short integers in a
> new slony configuration table. Finally, the sl_log tables' query field
> has to be split into a short integer array specifying the table's
> column order, plus a second field containing all the values.
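
To make that concrete, a rough sketch of the split (the table and
column names here are hypothetical, not actual Slony-I catalog
objects) might look something like:

    -- map each replicated table's columns to short integers
    create table sl_colmap (
        cm_tabid    integer  not null,   -- replicated table id
        cm_colid    smallint not null,   -- short integer for the column
        cm_colname  name     not null,
        primary key (cm_tabid, cm_colid)
    );

    -- log rows would then carry the column order as an array of those
    -- integers plus the matching values, instead of full SQL text
    create table sl_log_compact (
        log_origin   integer,
        log_xid      bigint,
        log_tableid  integer,
        log_cmdtype  char(1),
        log_cols     smallint[],   -- e.g. '{1,2}' meaning columns (a, b)
        log_vals     text[]        -- e.g. '{"1", NULL}'
    );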

I would question what benefit this would really provide, and at what
cost.

This looks like an optimization for the case where tables are badly
normalized, at the cost of:
a) Adding a lot of complexity, generally, and
b) Probably making it *more* expensive to replicate tables that are
   well normalized.

And I don't see the "win" in this, either.  If there are 40 columns in
a table, there need to be 40 elements in the array for each tuple,
which will be plenty expensive.  I don't see it being cheaper in any
material way than the present representation.

You mentioned to me an idea that seemed pretty sound for optimizing
things; it would be a slick, slick idea for SYNCs to use COPY to
quickly get data out of sl_log_* on the provider and into sl_log_*
on the subscriber, and then process application to the tables inside a
stored procedure on the subscriber.  *That* strikes me as potentially
a pretty big win.
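
A minimal sketch of that flow, assuming a made-up staging table and
apply function (sl_log_stage and apply_log_stage are not real Slony-I
objects, and the column list is simplified):

    -- on the provider: stream the SYNC's log rows out in one bulk COPY
    COPY (SELECT log_actionseq, log_cmddata
            FROM sl_log_1
           WHERE log_origin = 1)        -- plus the usual xid-range filter
      TO STDOUT;

    -- on the subscriber: land them in a staging table ...
    COPY sl_log_stage (log_actionseq, log_cmddata) FROM STDIN;

    -- ... and apply them from a stored procedure, so the row-by-row work
    -- happens server-side instead of over the client connection
    CREATE FUNCTION apply_log_stage() RETURNS bigint AS $$
    DECLARE
        r  record;
        n  bigint := 0;
    BEGIN
        FOR r IN SELECT log_cmddata
                   FROM sl_log_stage
                  ORDER BY log_actionseq
        LOOP
            EXECUTE r.log_cmddata;  -- assumes the staged row holds a full statement
            n := n + 1;
        END LOOP;
        RETURN n;
    END;
    $$ LANGUAGE plpgsql;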

But the column-array idea seems like grasping for ways to make it
sound like we're optimizing, when it would quite likely worsen
behaviour for well-designed OLTP apps.
-- 
output = ("cbbrowne" "@" "linuxfinances.info")
http://linuxfinances.info/info/x.html
"I once went  to a shrink.  He  told me to speak freely.   I did.  The
damn fool tried to charge me $90 an hour."
-- jimjr at qis.net (Jim Moore Jr)

