Brian Fehrle brianf at consistentstate.com
Thu May 27 12:14:23 PDT 2010
Jaime Casanova wrote:
> On Wed, May 26, 2010 at 4:13 PM, "Stéphane A. Schildknecht"
> <stephane.schildknecht at postgresql.fr> wrote:
>   
>> Could you check that this table is in sl_table, and which set it is in?
>>
>> Maybe this set isn't subscribed.
>>
>>     
>
>   
Those all look fine: the table exists in the replication set on both
machines, and the tab_relname matches the correct table in
pg_catalog.pg_class.

Just a little status update on the problem. The master and slave
databases still do not match: the slave is missing a small chunk of
data. Yet replication is still taking place; any new data inserted into
the master ends up on the slave. After looking into it further, all the
data that is missing from the slave was added to the master in a certain
window of time. We're looking into what happened during that period via
the logs.

Our daemons are started with the -a option, and I have a copy of every
archive log from the slony slave from the point of adding that table to
replication until now. I got a list of the IDs of all the rows that are
missing from the slony slave, and wrote a little script to search for
each of those row IDs in each of the slony archive logs. None of them
were present. So I think we can conclude that the data was not deleted
from the slave underneath slony by a user, but rather was never
replicated to the slave in the first place.
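
For anyone wanting to repeat that kind of check, here is a minimal
Python sketch of the search. It is not the actual script; it assumes
the row IDs are integers, that the archive log files end in .sql, and
that an ID shows up in a log as a plain run of digits. The function
names are made up for illustration.

```python
import re
from pathlib import Path

# Matches every run of digits; assumes row IDs are non-negative integers.
_INT_TOKEN = re.compile(r"\d+")


def ids_in_text(text, wanted_ids):
    """Return the subset of wanted_ids that occur as whole digit runs in text."""
    tokens = {int(tok) for tok in _INT_TOKEN.findall(text)}
    return tokens & set(wanted_ids)


def scan_archive_logs(log_dir, wanted_ids, pattern="*.sql"):
    """Map each ID that was found to the list of log file names containing it.

    The "*.sql" pattern is an assumption about how the archive files are named.
    """
    hits = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        text = path.read_text(errors="replace")
        for row_id in ids_in_text(text, wanted_ids):
            hits.setdefault(row_id, []).append(path.name)
    return hits
```

Any number in a log that happens to equal an ID counts as a hit, so
this can report false positives, but as long as the digit-run
assumption holds it will not miss an ID. That makes an empty result
meaningful: IDs absent from scan_archive_logs(log_dir, missing_ids)
never appeared in any archive log.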

One thing to note is that we had a daemon that was supposed to start the
slon daemons once every minute if they were not already running. Due to
a bug, it ended up starting a new set of daemons every minute. This was
happening before, during, and after the period in which the missing data
was generated. Each minute it generated an error message saying
"duplicate key value violates unique constraint "sl_nodelock-pkey"",
which points to the new daemon realizing there is already a daemon
running and then exiting. No other errors pertaining to the replicated
table in question were present in the postgres logs at this time.

At this point we will probably remove the table from replication, then
add it again and let it sync up.

A question: I'm still a little unfamiliar with a couple of aspects of
slony, but from my understanding (correct me if I'm wrong), when adding
a table to replication, slonik modifies the table by adding a trigger,
so that whenever an insert, update, or delete happens, the trigger
alerts slony to the existence of data that needs to be sent to the slave
nodes. I guess my question is: is there a way to insert data into the
table without that trigger being executed? And if that is possible,
could it cause a situation of "missing data" that slony itself doesn't
even know about (since it's reporting everything is in sync)? If so,
then I may have a situation where a user is inserting data in an odd
way that prevents it from being replicated.

Thanks,
    Brian Fehrle
> or maybe the table has the wrong tab_reloid in the slave. you can
> probe that with this simple query (the same for sequences):
> select * from _cluster_name.sl_table where tab_reloid <> (tab_nspname
> || '.' || tab_relname)::regclass;
>
>   


