Brian Fehrle brianf at consistentstate.com
Thu Sep 23 14:09:05 PDT 2010
Steve Singer wrote:
> On 10-09-21 07:23 PM, Brian Fehrle wrote:
>> I got some time and decided to test this again on some VM boxes rather
>> than our live environment, but had little luck.
>>
>> Simply so I can have this logged in the mailing list with what was done
>> (and hopefully a solution in the near future), here's the process I
>> performed.
>>
>> I created two clusters that mirror our live boxes as closely as 
>> possible.
>> - PostgreSQL version 8.4.2
>> - Slony version 1.2.20
>> - both installed via source
>>
>> I created the master cluster as:
>> # initdb -D /usr/local/pgsql/encoding_master/ --locale=C 
>> --encoding=LATIN1
>> I created the slave cluster as:
>> # initdb -D /usr/local/pgsql/encoding_slave/ --locale=C 
>> --encoding=SQL_ASCII
>>
>> I set up a master -> slave Slony cluster and replicated a single table
>> in a single replication set, and verified that replication was taking 
>> place.
>>
>> I wrote a small daemon that inserts a row into the table being
>> replicated on the master once a minute.
>>
>> I brought down the slon daemons, and performed a pg_dump on the slave:
>> # pg_dump -p 5433 -Fc postgres > /tmp/postgres_dump.sql
>>
>> I brought down the slave cluster, then created a new one with the LATIN1
>> encoding:
>> # initdb -D /usr/local/pgsql/encoding_slave_latin/ --locale=C
>> --encoding=LATIN1
>>
>> I brought the cluster online and started up the slon daemons. The slave
>> slon daemon reported remoteworker and remote listener threads, and
>> reported increasing SYNC numbers; however, it did not actually replicate
>> data from the master to the slave, and _slony.sl_log_1 on the master
>> grew with every insert that took place. NOTE: This is the
>> same behavior I experienced before on our live servers.
>>
>> I then executed the following:
>> #!/bin/bash
>> . etc/slony.env
>> echo "Repair config"
>>
>> slonik<<_EOF_
>> cluster name = $CLUSTERNAME ;
>> node 1 admin conninfo = 'dbname=$MASTERDBNAME host=$MASTERHOST
>> port=$MASTERPORT user=$REPUSER';
>> node 2 admin conninfo = 'dbname=$SLAVEDBNAME host=$SLAVEHOST
>> port=$SLAVEPORT user=$REPUSER';
>> REPAIR CONFIG (SET ID = 1, EVENT NODE = 1, EXECUTE ONLY ON = 2);
>> _EOF_
>>
>
> Try
>
> REPAIR CONFIG (SET ID=1, EVENT NODE=2, EXECUTE ONLY ON=2);
>
> I tried a somewhat similar sequence to what you described (though with 
> a different PostgreSQL and Slony version) and the REPAIR CONFIG did 
> not seem to do anything on node 2, i.e. the oid values in sl_table did 
> NOT match what was in pg_class. When I ran it with event node=2, it 
> did seem to update sl_table on node 2.
>
Ok, so I tried this, and while it did update the row in sl_table to point 
to the correct oid, the slave's slon daemon kills itself and no 
replication takes place. Starting the daemon again results in another 
untimely death.
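For reference, this is the script I ran with that change (same 
environment file as before; only the REPAIR CONFIG line differs from my 
earlier script):

#!/bin/bash
. etc/slony.env
echo "Repair config"

slonik<<_EOF_
cluster name = $CLUSTERNAME ;
node 1 admin conninfo = 'dbname=$MASTERDBNAME host=$MASTERHOST
port=$MASTERPORT user=$REPUSER';
node 2 admin conninfo = 'dbname=$SLAVEDBNAME host=$SLAVEHOST
port=$SLAVEPORT user=$REPUSER';
REPAIR CONFIG (SET ID=1, EVENT NODE=2, EXECUTE ONLY ON=2);
_EOF_

Result from the slave's log when executing REPAIR CONFIG (SET ID=1, 
EVENT NODE=2, EXECUTE ONLY ON=2):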

2010-09-23 12:36:55 MDT FATAL  localListenThread: event 31: Unknown 
event type: RESET_CONFIG

Out of just wanting to try EVERYTHING, I set everything up on the new 
encoding cluster again; however, instead of running REPAIR CONFIG, and 
before starting up the slon daemons, I manually updated the row in 
sl_table, setting the tab_reloid column to the oid of the table in the 
newly restored cluster.
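Roughly, the update was along these lines, run against the slave (the 
table name and tab_id below are placeholders for the single replicated 
table in this test; the _slony schema matches the cluster name used 
throughout this thread):

#!/bin/bash
. etc/slony.env
psql -h $SLAVEHOST -p $SLAVEPORT -d $SLAVEDBNAME -U $REPUSER <<_EOF_
-- Point tab_reloid at the oid the table received in the restored
-- cluster. 'public.mytable' is a placeholder for the real table name,
-- and tab_id = 1 assumes the single-table replication set.
UPDATE _slony.sl_table
   SET tab_reloid = 'public.mytable'::regclass
 WHERE tab_id = 1;
_EOF_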

I started up the slon daemons; replication works without any 
warnings/errors, and the daemons stay alive. Any new data inserted into 
that table gets replicated.
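A quick way to double-check that (again with placeholder table and 
column names) is to insert a marker row on the master and look for it 
on the slave after a sync:

#!/bin/bash
. etc/slony.env
# Insert a marker row on the master...
psql -h $MASTERHOST -p $MASTERPORT -d $MASTERDBNAME -U $REPUSER \
    -c "INSERT INTO mytable (data) VALUES ('replication test');"
# ...give replication a chance to sync...
sleep 60
# ...and confirm the row arrived on the slave.
psql -h $SLAVEHOST -p $SLAVEPORT -d $SLAVEDBNAME -U $REPUSER \
    -c "SELECT count(*) FROM mytable WHERE data = 'replication test';"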

I also turned on query logging in the slave's database, set it to log 
all queries, and monitored the output. I see the COPY statements that 
copy the data from the master to the slave, and it all matches both the 
data received and the data in the log shipping logs that the slave 
daemon generates via the -x option.
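For good measure, here's a quick sanity check on the slave that sl_table 
now agrees with pg_class (if relname comes back NULL for any row, that 
row's tab_reloid is still dangling):

#!/bin/bash
. etc/slony.env
psql -h $SLAVEHOST -p $SLAVEPORT -d $SLAVEDBNAME -U $REPUSER <<_EOF_
SELECT t.tab_id, t.tab_reloid, c.relname
  FROM _slony.sl_table t
  LEFT JOIN pg_catalog.pg_class c ON c.oid = t.tab_reloid;
_EOF_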

So I'm wondering: is updating sl_table myself like that safe? I know 
it's highly discouraged to modify anything in the Slony system tables 
myself, but since the REPAIR CONFIG command doesn't seem to be working 
for me, I'm not sure I have another option.

- Brian

>> it executed without error; however, replication did not start working,
>> and the slave daemon started acting strangely, with the child process
>> being terminated constantly, then restarted every 10 seconds just to be
>> terminated again.


