Sat Apr 17 08:22:55 PDT 2010
- Previous message: [Slony1-general] recreating a cluster when the master dies
- Next message: [Slony1-general] recreating a cluster when the master dies
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Steve,

2010/4/16 Steve Singer <ssinger at ca.afilias.info>

> albert wrote:
>> Greetings all,
>>
>> I have a master-slave setup and am trying to automate a recovery
>> situation where the master fails and it is recreated from scratch
>> based on a dump from the slave's database.
>
> You don't tell us which version of slony you're using (this can be
> useful to know)

I downloaded the Slony source code version 2.0.2 and compiled it on the machine itself. Compilation and installation ran perfectly. I am running the tests against Postgres 8.4.1, though the final setup will run against Postgres 8.4.2.

>> Here's the flow of events I am using to test the transition:
>>
>> 1. The cluster is registered, the master and slave are in sync, all good.
>> 2. The master dies. The master database is recreated from scratch
>>    using a dump from the slave's database.
>
> When you take the dump of the slave database it still has slony
> installed on it. Once you've restored this on the master, your master
> has the slave's slony configuration on it. It is probably a good idea
> not to start any slons until after your uninstall node is finished (or
> not to restore the _my_cluster schema), though I don't think this is
> your problem.

That is precisely right. I make sure the slony processes are stopped before jumping into step 2. Also, after several failed attempts, I decided to dump the slave database back including the _my_cluster schema, and to remove all slony definitions by running the code snippet (see below). I took that approach because when I dumped the database back excluding the _my_cluster schema, the cluster redefinition failed, telling me the cluster was already defined (I am guessing this was caused by the slony triggers defined on my public schema tables).

3.
The master-slave replication cluster is deleted using the following code snippet:

>> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE -
>> unknown event type
>
> This is very strange; the error is saying that the big if/else block
> in remote_worker.c isn't matching the events, even though the event
> name as printed in the above message looks okay.

Well, that sounds interesting... Here's part of the log messages from the slony process running against the slave database during step 1, that is, when replication is set up for the first time and data is moved across correctly. Note that the TODO: messages above are also printed, and then replication messages are logged and data is moved correctly. (My test inserts some random data into the master database, sleeps for a while, then dumps both the slave and master databases and diffs them.)

2010-04-17 09:20:39 AST INFO   remoteListenThread_1: thread starts
2010-04-17 09:20:39 AST INFO   remoteWorkerThread_1: thread starts
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_slave user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_slave user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: update provider configuration
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_master host=localhost user=postgres" is 80401
TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE - unknown event type
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
TODO: ********** remoteWorkerThread: node 1 - EVENT 1,28 ENABLE_NODE - unknown event type
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='t'
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2010-04-17 09:20:39 AST INFO   copy_set 1
2010-04-17 09:20:39 AST CONFIG version for "dbname=replica_test_master host=localhost user=postgres" is 80401
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: connected to provider DB
2010-04-17 09:20:39 AST CONFIG remoteWorkerThread_1: prepare to copy table "public"."domain"

Additional messages are logged and data is replicated correctly.

> If you have the ability I'd be curious to attach a debugger to the
> slon process when it gets to this state and see what event->ev_type
> looks like at line 715 (in 1.2.21 source, or the equivalent line on
> whatever version you're on).

I am more than happy to do that. I am lacking advanced skills for that, though... I can see and attach to both slony processes running against the slave, but I can't find the way to switch to the proper thread's context to inspect the event symbol (there are 8 threads running in one process, and 1 thread in the other). I can inspect some symbols and I can see the source code from gdb, so it would appear symbolic information is present in the binaries. Could you please give me a hint on how to extract that event->ev_type info?

> The strcmp against "STORE_NODE" should be matching, and it should be
> going into that if block instead of falling through to the last else
> where it prints the above error message.

>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
>> TODO: ********** remoteWorkerThread: node 1 - EVENT 1,28 ENABLE_NODE - unknown event type
>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
>> 2010-04-16 11:39:42 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
>> 2010-04-16 11:39:42 AST CONFIG remoteWorkerThread_1: update provider configuration
>>
>> These log events are the same when the cluster is working flawlessly
>> (although more events are logged after these, of course).
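For anyone following along, the usual gdb routine for inspecting a symbol inside a particular thread of a multi-threaded process looks roughly like this (the PID, thread number, and breakpoint line are illustrative; line 715 comes from Steve's suggestion and will differ between Slony versions):

```
$ gdb -p <slon-pid>
(gdb) info threads                # list all threads in the slon process
(gdb) thread 3                    # switch to a thread's context (number is illustrative)
(gdb) bt                          # its backtrace shows whether it is the remote worker
(gdb) break remote_worker.c:715   # adjust the line for your source version
(gdb) continue
# when the breakpoint hits in the remote worker thread:
(gdb) print event->ev_type
(gdb) print *event
```

Rather than guessing which of the 8 threads is the remote worker, it is usually easier to set the breakpoint in `remote_worker.c` and let gdb stop in whichever thread reaches it; gdb switches the current context to that thread automatically.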
>> It looks as though the replication silently stops working for no
>> apparent reason.
>
> I would not expect to see those 'TODO: **************** ..... unknown
> event type' lines when the cluster is working flawlessly. Are you
> saying that you always get them?

That is correct. I always see them (see comments above).

>> Could anyone please help me understand what might be going wrong?
>>
>> Thanks
>> Albert
>>
>> _______________________________________________
>> Slony1-general mailing list
>> Slony1-general at lists.slony.info
>> http://lists.slony.info/mailman/listinfo/slony1-general

> --
> Steve Singer
> Afilias Canada
> Data Services Developer
> 416-673-1142
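Since the goal is an automated recovery test, one cheap safeguard is to have the test harness scan the slon log for the "unknown event type" lines discussed above instead of relying only on diffing dumps. A minimal sketch (the `check_slon_log` helper name is mine, not part of Slony; the sample log lines are taken from the excerpt earlier in this message):

```shell
# check_slon_log: print any "unknown event type" lines from a slon log
# and return non-zero if any are found, so a test harness can fail fast.
check_slon_log() {
    if grep 'unknown event type' "$1"; then
        echo "WARNING: unknown-event-type lines found in $1" >&2
        return 1
    fi
    echo "OK: no unknown-event-type lines in $1"
}

# Self-contained demo using lines from the log excerpt above.
cat > /tmp/slon_sample.log <<'EOF'
2010-04-17 09:20:39 AST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
TODO: ********** remoteWorkerThread: node 1 - EVENT 1,27 STORE_NODE - unknown event type
2010-04-17 09:20:39 AST INFO copy_set 1
EOF

check_slon_log /tmp/slon_sample.log || echo "log check flagged the sample log"
```

This only detects the symptom, of course; the underlying dispatch problem in remote_worker.c still needs the debugger session above.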