Gavin Hamill gdh
Fri Mar 31 18:50:29 PST 2006
On Fri, 31 Mar 2006 17:26:07 +0000
Christopher Browne <cbbrowne at ca.afilias.info> wrote:

> I see a first problem here...
> 
> It's useless to use the "-a" option against the "master" node.  The
> master node is not subscribing to any sets, so nothing will be recorded
> in the "log shipping" logs there.

OK fair enough. This evening I decided to uninstall both nodes, drop the _replication schema on both databases, and start again, this time much closer to the HOWTO...

dropdb + createdb on slave, then import the schema only. The following was imported into slonik on Node1 without failure:

cluster name = replication;
 node 1 admin conninfo='host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX';
 node 2 admin conninfo='host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX';

init cluster (id=1, comment = 'Cayenne');

# CREATE SET
  try {
    create set (id = 1, origin = 1, comment = 'Set 1 for replication');
  } on error {
    echo 'Could not create subscription set 1 for replication!';
    exit -1;
  }

# SET ADD TABLE
  echo 'Subscription set 1 created';
  echo 'Adding tables to the subscription set';
  set add table (set id = 1, origin = 1, id = 1,
                 full qualified name = 'public.AccessRoles',
                 comment = 'Table public.AccessRoles with primary key');

... and so on for 85 tables and 89  sequences...

# STORE NODE
  store node (id = 2, event node = 1, comment = 'Node 2 - Jalapeno');
  echo 'Set up replication nodes';

# STORE PATH
  echo 'Next: configure paths for each node/origin';
  store path (server = 2, client = 1, conninfo = 'host=194.24.250.143 dbname=laterooms user=XXX port=5432 password=XXX');
  store path (server = 1, client = 2, conninfo = 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX');

# STORE LISTEN
  store listen (origin = 1, receiver = 2, provider = 1);
  store listen (origin = 2, receiver = 1, provider = 2);
  echo 'Replication nodes prepared';
  echo 'Please start a slon replication daemon for each node';

So, then ..
Node1$ slon replication "host=194.24.250.137 dbname=laterooms user=XXX password=XXX"
Node2$ slon replication "host=194.24.250.143 dbname=laterooms user=XXX password=XXX"

then watching the slon log on Node2....

2006-04-01 01:24:09 BST CONFIG main: slon version 1.1.5 starting up
2006-04-01 01:24:09 BST CONFIG main: local node id = 2
2006-04-01 01:24:09 BST CONFIG main: launching sched_start_mainloop
2006-04-01 01:24:09 BST CONFIG main: loading current cluster configuration
2006-04-01 01:24:09 BST CONFIG storeNode: no_id=1 no_comment='Cayenne'
2006-04-01 01:24:09 BST CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX" pa_connretry=10
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeSet: set_id=1 set_origin=1 set_comment='Set 1 for replication'
2006-04-01 01:24:09 BST WARN   remoteWorker_wakeup: node 1 - no worker thread
2006-04-01 01:24:09 BST CONFIG main: configuration complete - starting threads
2006-04-01 01:24:09 BST DEBUG1 localListenThread: thread starts
2006-04-01 01:24:09 BST CONFIG enableNode: no_id=1
2006-04-01 01:24:09 BST DEBUG1 remoteWorkerThread_1: thread starts
2006-04-01 01:24:09 BST DEBUG1 remoteListenThread_1: thread starts
2006-04-01 01:24:09 BST DEBUG1 cleanupThread: thread starts
2006-04-01 01:24:09 BST DEBUG1 main: running scheduler mainloop
2006-04-01 01:24:09 BST DEBUG1 syncThread: thread starts
2006-04-01 01:24:09 BST DEBUG1 remoteListenThread_1: connected to 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXX'
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:24:09 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1

Now I submit the magic 'subscribe set ( id = 1, provider = 1, receiver = 2, forward = no);'

2006-04-01 01:25:45 BST CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='f'
2006-04-01 01:25:45 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 01:25:45 BST DEBUG1 copy_set 1
2006-04-01 01:25:45 BST DEBUG1 remoteWorkerThread_1: connected to provider DB
2006-04-01 01:34:31 BST DEBUG1 cleanupThread:    0.018 seconds for cleanupEvent()
2006-04-01 01:34:31 BST DEBUG1 cleanupThread:    0.006 seconds for delete logs
2006-04-01 01:44:58 BST DEBUG1 cleanupThread:    0.006 seconds for cleanupEvent()
2006-04-01 01:44:58 BST DEBUG1 cleanupThread:    0.004 seconds for delete logs
2006-04-01 01:55:33 BST DEBUG1 cleanupThread:    0.003 seconds for cleanupEvent()
2006-04-01 01:55:33 BST DEBUG1 cleanupThread:    0.003 seconds for delete logs
2006-04-01 02:06:26 BST DEBUG1 cleanupThread:    0.003 seconds for cleanupEvent()
2006-04-01 02:06:26 BST DEBUG1 cleanupThread:    0.005 seconds for delete logs
2006-04-01 02:16:49 BST DEBUG1 cleanupThread:    0.005 seconds for cleanupEvent()
2006-04-01 02:16:49 BST DEBUG1 cleanupThread:    0.004 seconds for delete logs
2006-04-01 02:27:28 BST DEBUG1 cleanupThread:    0.003 seconds for cleanupEvent()
2006-04-01 02:27:28 BST DEBUG1 cleanupThread:    0.004 seconds for delete logs
2006-04-01 02:39:00 BST DEBUG1 cleanupThread:    0.006 seconds for cleanupEvent()
2006-04-01 02:39:00 BST DEBUG1 cleanupThread:    0.007 seconds for delete logs
2006-04-01 02:49:33 BST DEBUG1 cleanupThread:    0.003 seconds for cleanupEvent()
2006-04-01 02:49:33 BST DEBUG1 cleanupThread:    0.005 seconds for delete logs
2006-04-01 02:59:54 BST DEBUG1 cleanupThread:    0.002 seconds for cleanupEvent()
2006-04-01 02:59:54 BST DEBUG1 cleanupThread:    0.005 seconds for delete logs
2006-04-01 03:10:14 BST DEBUG1 cleanupThread:    0.003 seconds for cleanupEvent()
2006-04-01 03:10:14 BST DEBUG1 cleanupThread:    0.007 seconds for delete logs
2006-04-01 03:13:46 BST DEBUG1 remoteWorkerThread_1: disconnected from provider DB
2006-04-01 03:13:46 BST DEBUG1 copy_set 1 done in 6480.502 seconds
2006-04-01 03:13:46 BST CONFIG enableSubscription: sub_set=1
2006-04-01 03:13:58 BST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1: helper thread for provider 1 created
2006-04-01 03:13:58 BST DEBUG1 remoteWorkerThread_1: connected to data provider 1 on 'host=194.24.250.137 dbname=laterooms user=XXX port=5432 password=XXXXXXXXX'

At this point I would be expecting Node2 to start catching up with whatever Node1's been up to during the 2 hours of the copy_set.. but it does not. If I run this:

$ tcpdump -i bond0 -n host 194.24.250.143 and port 5432 -s 1500 -w out

Then do a trivial UPDATE of one row on a table on Node1, and let tcpdump run for 30 seconds, the only queries I see sent across are "select ev_origin, ev_seqno, ev_timestamp..." - I'm not seeing any real data.

The thing that worries me most is that 'slon' on Node2 is using 100% CPU time and apparently doing nothing at all. There is no CPU use for any postgres process, nor any disk activity (confirmed with iostat).

Related to this /must/ be the fact that there's a postgres process on /both/ master+slave marked as 'idle in transaction'.  I've asked our devels if they're using any transactions in the codebase and they've told me "We don't, but NHibernate likely does" if that makes any difference.

How can I find out a bit more about what transaction the system is stuck in, since it may be causing some kind of busy-wait loop in slon hence the 100% CPU use?

Node1 is an Quad Xeon machine, and Node2 is an AIX 5.3 box (yep, PG compiled --with-thread-safety - slony complained loudly when I didn't have this flag set, so I recompiled PG.) and PG 8.1.3 + Slony 1.1.5 are in use on both machines.

I've followed all the instructions and I understand what each step does - I just can't work out what the problem is :(

Finally, just after copy_set finished, this was noticed in the postgres log:

2006-04-01 03:13:57 BST LOG:  duration: 11414.154 ms  statement: select "_replication".enableSubscription(1, 1, 2); notify "_replication_Event"; notify "_replication_Confirm"; insert into "_replication".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1, ev_data2, ev_data3, ev_data4    ) values ('1', '249', '2006-04-01 01:25:45.586148', '196845367', '196845368', '', 'ENABLE_SUBSCRIPTION', '1', '1', '2', 'f'); insert into "_replication".sl_confirm        (con_origin, con_received, con_seqno, con_timestamp)    values (1, 2, '249', now()); commit transaction;

Yours in a mix of hope + desperation,
Gavin.



More information about the Slony1-general mailing list