[Slony1-general] recreating a cluster when the master dies

Sat Apr 17 08:22:55 PDT 2010

Hi Steve,

2010/4/16 Steve Singer <ssinger at ca.afilias.info>

> albert wrote:
>
>> Greetings all,
>>
>> I have a master-slave setup and am trying to automate a recovery situation
>> where the master fails and it is recreated from scratch based on a dump from
>> the slave's database.
>>
>
> You don't tell us which version of slony you're using (this can be useful to
> know)
>
>
I downloaded the slony source code version 2.0.2 and compiled it in place.
Compilation and installation ran perfectly. I am running the tests against
Postgres 8.4.1, though the final setup will be running against Postgres
8.4.2.

>
>> Here's the flow of events I am using to test the transition:
>>
>> 1. the cluster is registered, the master and slave are in sync, all good.
>> 2. the master dies. the master database is recreated from scratch using a
>> dump from the slave's database
>>
>
> When you take the dump of the slave database it still has slony installed
> on it.  Once you've restored this on the master, your master has the slave's
> slony configuration on it.  It is probably a good idea to not start any
> slons up until after your uninstall node is finished (or to not restore the
> _my_cluster schema) though I don't think this is your problem.
>

That is precisely right. I make sure the slony processes are stopped before
jumping into step number 2. Also, after several failed attempts, I decided
to dump back the slave database including the _my_cluster schema, and remove
all slony definitions by running the code snippet (see below). I take that
approach because if I dumped the database back excluding the _my_cluster
schema, the cluster redefinition failed, telling me the cluster was already
defined (I am guessing this was caused by the slony triggers defined
on my public schema tables).

>> 3. the master-slave replication cluster is deleted using the following code
>> snippet:
>>
>> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE -
>> unknown event type
>>
>
> This is very strange; the error is saying that the big if/else block in
> remote_worker.c isn't matching the events, even though the event name as
> printed in the above message looks okay.
>
>
Well, that sounds interesting...

Here's part of the log messages from the slony process running against the
slave database during step 1, that is, when replication is set up for the
first time and data is moved across correctly. Note that the above TODO:
messages are also printed, and then replication messages are logged and data
is moved correctly. (My test inserts some random data into the master
database, sleeps for a while, then dumps both the slave and master databases
and diffs them.)

2010-04-17 09:20:39 AST INFO   remoteListenThread_1: thread starts
2010-04-17 09:20:39 AST INFO   remoteWorkerThread_1: thread starts
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_slave user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_slave user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: update provider configuration
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_master host=localhost user=postgres" is 80401
TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE - unknown event type
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
TODO: ********** remoteWorkerThread: node 1 - EVENT 1,28 ENABLE_NODE - unknown event type
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='t'
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST INFO   copy_set 1
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_master host=localhost user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: connected to provider DB
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: prepare to copy table "public"."domain"

Additional messages are logged and data is replicated correctly.

> If you have the ability, I'd be curious to have you attach a debugger to the
> slon process when it gets to this state and see what event->ev_type looks
> like at line 715 (in the 1.2.21 source, or the equivalent line in whatever
> version you're on).
>

I am more than happy to do that. I lack the advanced skills for it,
though... I can see and attach to both slony processes running against the
slave, but I can't find a way to switch to the proper thread's context to
inspect the event symbol (there are 8 threads running in one process, and 1
thread in the other). I can inspect some symbols and I can see the source
code from gdb, so it would appear symbolic information is present in the
binaries. Could you please give me a hint on how to extract that
event->ev_type info?

>

> The strcmp against "STORE_NODE" should be matching and it should be going
> into that if block instead of falling to the last else where it prints the
> above error message.
>
>
>
>
>
>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2
>> li_provider=1
>> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,28 ENABLE_NODE -
>> unknown event type
>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2
>> li_provider=1
>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2
>> li_provider=1
>> 2010-04-16 11:39:42 AST CONFIG remoteWorkerThread_1: update provider
>> configuration
>>
>> These log events are the same when the cluster is working flawlessly
>> (although more events are logged after these, of course).
>> It looks as though the replication silently stops working for no
>> apparent reason.
>>
>
> I would not expect to see those 'TODO: **************** ..... unknown event
> type ' lines when the cluster is working flawlessly. Are you saying that you
> always get them?
>
>
That is correct. I always see them (see comments above).

>

>> Could anyone please help me understand what might be going wrong?
>>
>> Thanks
>> Albert
>>
>>
>
>
> --
> Steve Singer
> Afilias Canada
> Data Services Developer
> 416-673-1142
>