[Slony1-general] recreating a cluster when the master dies

Fri Apr 16 11:01:53 PDT 2010

albert wrote:
> Greetings all,
> 
> I have a master-slave setup and am trying to automate a recovery 
> situation where the master fails and it is recreated from scratch based 
> on a dump from the slave's database.

You don't tell us which version of slony you're using (this can be useful 
to know).

> 
> Here's the flow of events I am using to test the transition:
> 
> 1. the cluster is registered, the master and slave are in sync, all good.
> 2. the master dies. the master database is recreated from scratch using 
> a dump from the slave's database

When you take the dump of the slave database it still has slony 
installed on it.  Once you've restored this on the master, your master 
has the slave's slony configuration on it.  It is probably a good idea 
not to start any slons until after your uninstall node has finished (or 
not to restore the _my_cluster schema), though I don't think this is your 
problem.

> 3. the master-slave replication cluster is deleted using the following 
> code snippet:
> 
> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE - 
> unknown event type

This is very strange.  The error is saying that the big if/else block in 
remote_worker.c isn't matching the event, even though the event name as 
printed in the above message looks okay.

If you have the ability, I'd be curious to have you attach a debugger to 
the slon process when it gets to this state and see what event->ev_type 
looks like at line 715 (in the 1.2.21 source, or the equivalent line in 
whatever version you're on).

The strcmp against "STORE_NODE" should be matching and it should be 
going into that if block instead of falling to the last else where it 
prints the above error message.

> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 
> li_provider=1
> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,28 ENABLE_NODE - 
> unknown event type
> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 
> li_provider=1
> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 
> li_provider=1
> 2010-04-16 11:39:42 AST CONFIG remoteWorkerThread_1: update provider 
> configuration
> 
> These log events are the same when the cluster is working flawlessly 
> (although more events are logged after these, of course).
> It looks as though the replication silently stops working for no 
> apparent reason.

I would not expect to see those 'TODO: ********** ... unknown event type' 
lines when the cluster is working flawlessly.  Are you saying that you 
always get them?

> Could anyone please help me understand what might be going wrong?
> 
> Thanks
> Albert

-- 
Steve Singer
Afilias Canada
Data Services Developer
416-673-1142