Geoffrey lists at serioustechnology.com
Wed Feb 20 16:13:37 PST 2008
Christopher, I appreciate your efforts as well as those of everyone else 
on the list.  I'm glad to see you folks haven't given up on me yet. :)

Christopher Browne wrote:
> Geoffrey <lists at serioustechnology.com> writes:
>> Andrew Sullivan wrote:
>>> I am by no means willing to dismiss the suggestion that there are bugs in
>>> Slony; but this still looks to me very much like there's something we don't
>>> know about what happened, that explains the errors you're seeing.
>> I would so love to figure out this issue.  I appreciate your efforts.
>>
>> I simply don't understand how one table in particular could get so far
>> out of sync.  We're talking 300 records.
>>
>> I can't imagine that slony is that fragile.  There's got to be
>> something going on that we don't see.
> 
> I agree.  From what I have heard, it doesn't sound like you have
> experienced anything that should be scratching any of the edge points
> of Slony-I.
> 
> 300 records don't just disappear.
> 
> When I put this all together, I'm increasingly suspicious that you may
> have experienced hardware problems or some such thing that might cause
> data loss that Slony-I would have no way to address.

Understand, I'm not saying that I'm losing data, just that there are 
inconsistencies between the replication server and the primary.  I don't 
believe we are losing data on the primary at all.  What I see is that the 
record counts in some tables don't match between the two nodes, so the 
replication process is not working as expected.  The weird thing is that 
not every table is affected, just a handful: out of 88 tables and 84 
sequences, only 4 tables have problems.  Here's a comparison of the 
record counts from the two nodes:

table    node 1   node 2
adest     54055    54056
mcarr     22560    22572
mcust     63757    63774
tract     75380    75420
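
For what it's worth, a comparison like the one above is easy to script; 
here's a minimal sketch (the DSNs and host names are placeholders, not 
our actual setup):

# Sketch: compare row counts for the suspect tables on two nodes.
# psycopg2 is assumed; DSNs and the table list are placeholders.
import psycopg2

TABLES = ["adest", "mcarr", "mcust", "tract"]

def counts(dsn):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    result = {}
    for t in TABLES:
        # Identifiers can't be bound as query parameters; TABLES is a
        # fixed list we control, so the interpolation is safe here.
        cur.execute("SELECT count(*) FROM %s" % t)
        result[t] = cur.fetchone()[0]
    conn.close()
    return result

a = counts("host=node1 dbname=mwr")
b = counts("host=node2 dbname=mwr")
for t in TABLES:
    if a[t] != b[t]:
        print(f"count mismatch for {t}: {a[t]} vs {b[t]}")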

This hardware has been rock solid since it was installed.  If we were 
losing data on the primary, we would definitely hear about it.  One 
thing I didn't mention is the actual configuration: two boxes connected 
to a single data silo, in a hot/hot configuration, with a separate 
postmaster for each database.  Half the postmasters run on one server, 
the other half on the other; if one fails, the survivor picks up its 
postmaster processes.  Each database has its own IP, so I reference the 
host by multiple host names: we connect to database mwr via host mwr, 
and in the event of a failure, the mwr IP is moved to the other machine.
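
To illustrate the aliasing (only mwr is a real name here, and the retry 
policy is made up for illustration), clients connect by the per-database 
host name and simply retry while the floating IP moves:

# Sketch of the per-database host aliasing; "mwr" is the only real
# name here, and the retry policy is made up for illustration.
import time
import psycopg2

def connect(dbname, attempts=5):
    # The host name matches the database name, and its IP floats
    # between the two boxes; on failover, retry until the IP has moved.
    for _ in range(attempts):
        try:
            return psycopg2.connect(host=dbname, dbname=dbname)
        except psycopg2.OperationalError:
            time.sleep(2)
    raise RuntimeError(f"could not reach database host {dbname}")

conn = connect("mwr")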

<snip>

> You've grown suspicious about *every* component, which, on the one
> hand, is unsurprising, but on the other, not very useful.  I haven't
> heard you mention anything that would cause me to expect Slony-I to
> have eaten data, or to have even "started to look hungrily at the
> data."

The only reason I keep looking at Slony is that the rest of the system 
is rock solid.  We don't lose data, these boxes are up 24/7, and folks 
are hitting them constantly.  Slony is the only new part of the equation.

> The notices you have mentioned are all benign things.  The one
> question that comes to mind: Any interesting ERROR messages in the
> PostgreSQL logs?  I'm getting more and more suspicious that something
> about the entire DB cluster has gotten unstable, and if that's the
> case, Slony-I wouldn't do any better than the DB it is running on...

There are no PostgreSQL errors to speak of on the primary.

I do see the following in the PostgreSQL log on the slave:

2008-02-19 19:30:59 [3216] NOTICE:  type "_mwr_cluster.xxid" is not yet defined
DETAIL:  Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE:  argument type _mwr_cluster.xxid is only a shell
2008-02-19 19:30:59 [3216] NOTICE:  type "_mwr_cluster.xxid_snapshot" is not yet defined
DETAIL:  Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE:  argument type _mwr_cluster.xxid_snapshot is only a shell

Since these are NOTICEs, I assume this is normal.
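
If I understand it correctly, these come from the way Slony installs 
its xxid types: the type's input/output functions are declared before 
the type itself, so PostgreSQL first creates a placeholder ("shell") 
type and mentions it.  Here's a toy sketch of that pattern (the type 
name is made up, the I/O functions borrow PostgreSQL's built-in text 
routines rather than Slony's actual C library, and it needs superuser):

# Toy reproduction of the shell-type NOTICEs.  The type name is
# hypothetical and the I/O functions borrow PostgreSQL's built-in
# text routines; requires a superuser connection.
import psycopg2

ddl = """
CREATE FUNCTION toy_in(cstring) RETURNS toytype    -- NOTICE: shell type created
    AS 'textin' LANGUAGE internal STRICT;
CREATE FUNCTION toy_out(toytype) RETURNS cstring   -- NOTICE: type is only a shell
    AS 'textout' LANGUAGE internal STRICT;
CREATE TYPE toytype ( INPUT = toy_in, OUTPUT = toy_out, INTERNALLENGTH = variable );
"""

conn = psycopg2.connect("dbname=scratch")  # placeholder DSN
conn.cursor().execute(ddl)
print(conn.notices)  # the shell-type NOTICEs are collected here
conn.commit()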

During the initial replication, I do see a number of:

2008-02-19 19:32:28 [2463] LOG:  checkpoints are occurring too frequently (6 seconds apart)

But our problem doesn't seem to start until after the initial replication.
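
As an aside, I gather that warning just means the initial COPY generates 
WAL faster than the checkpoint settings expect, and that raising 
checkpoint_segments for the duration of the initial sync would quiet it.  
A quick sketch for inspecting the relevant settings (the DSN is a 
placeholder):

# Sketch: inspect the settings behind the "checkpoints are occurring
# too frequently" warning.  The GUC names are real; the DSN is not.
import psycopg2

conn = psycopg2.connect("host=mwr dbname=mwr")
cur = conn.cursor()
for guc in ("checkpoint_segments", "checkpoint_warning"):
    cur.execute("SHOW " + guc)
    print(guc, "=", cur.fetchone()[0])
# Raising checkpoint_segments in postgresql.conf (the 8.x default is
# only 3) spaces checkpoints out during bulk loads like the initial COPY.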

-- 
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
  - Benjamin Franklin

