TANIDA Yutaka tanida
Mon Mar 20 00:38:14 PST 2006

This looks like a bug in remote_worker.c: rtcfg_lock() is called, but in some
cases the main loop exits without calling rtcfg_unlock().

A diff for src/slon/remote_worker.c is attached.
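
For readers who cannot open the attachment, the pattern described above can be
sketched roughly as follows. This is not the attached diff and not the actual
remote_worker.c code. cfg_lock_t, cfg_lock(), cfg_unlock(), worker_loop() and
the event values are hypothetical names that only stand in for the
rtcfg_lock()/rtcfg_unlock() pair and the worker's main loop, and the "held"
flag exists purely so the outcome can be inspected.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for slon's runtime-config lock (not the real rtcfg code). */
typedef struct {
    pthread_mutex_t mutex;
    int held;                 /* 1 while the lock is held; for inspection only */
} cfg_lock_t;

#define CFG_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

static void cfg_lock(cfg_lock_t *l)
{
    pthread_mutex_lock(&l->mutex);
    l->held = 1;
}

static void cfg_unlock(cfg_lock_t *l)
{
    l->held = 0;
    pthread_mutex_unlock(&l->mutex);
}

/* Hypothetical event types: ordinary work versus a restart request. */
enum event { EV_SYNC, EV_RESTART };

/*
 * Process events until EV_RESTART is seen.
 * release_on_exit == 0: the early exit leaves the lock held (the suspected bug).
 * release_on_exit == 1: the lock is released before leaving the loop.
 * Returns the number of events fully processed.
 */
static int worker_loop(cfg_lock_t *l, const enum event *ev, size_t n,
                       int release_on_exit)
{
    int processed = 0;

    for (size_t i = 0; i < n; i++) {
        cfg_lock(l);
        if (ev[i] == EV_RESTART) {
            if (release_on_exit)
                cfg_unlock(l);    /* without this, the lock stays held */
            break;
        }
        processed++;
        cfg_unlock(l);
    }
    return processed;
}
```

With release_on_exit set to 0, the lock is still held after worker_loop()
returns, so any other thread that later calls cfg_lock() on the same lock
would block until something times out. With it set to 1, the lock is free
again when the loop exits.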

On Wed, 08 Mar 2006 20:34:25 +0900
TANIDA Yutaka <tanida at sraoss.co.jp> wrote:

> Hi.
> 
> I found a bug: slon shuts down unexpectedly when a node is dropped.
> 
> -PostgreSQL 8.1.2
> -Slony-I 1.1.5
> --perltools was not used.
> -RHEL4 update2
> 
> TO REPRODUCE THIS BUG:
> 
> 1. Make sure PostgreSQL 8.1, pgbench and Slony-I are installed.
> 
> 2. Run add.sh, shown below. It creates a cluster named "nodetest" with one
> master and two slaves. Paths exist between nodes 1-2 and 1-3.
> 
> [tanida at srapc2209 sl]$ cat add.sh
> #!/bin/sh
> CLUSTERNAME=nodetest
> 
> RUSER=postgres
> killall slon
> dropdb node1
> dropdb node2
> dropdb node3
> createdb node1
> createdb node2
> createdb node3
> createlang plpgsql node1
> pgbench -i -s 1 node1
> pg_dump -s node1 | psql node2
> pg_dump -s node1 | psql node3
> slonik <<_EOF_
> cluster name = $CLUSTERNAME;
> node 1 admin conninfo = 'dbname=node1';
> node 2 admin conninfo = 'dbname=node2';
> node 3 admin conninfo = 'dbname=node3';
> init cluster ( id=1, comment = 'Master');
> create set (id=1, origin=1, comment='All tables');
> table add key (node id = 1, fully qualified name = 'public.history');
> store node (id=2, comment = 'Slave');
> store node (id=3, comment = 'Slave');
> store path (server = 1, client = 2, conninfo='dbname=node1 ');
> store path (server = 2, client = 1, conninfo='dbname=node2 ');
> store path (server = 1, client = 3, conninfo='dbname=node1 ');
> store path (server = 3, client = 1, conninfo='dbname=node3 ');
> 
> store listen (origin=1, provider = 1, receiver =2);
> store listen (origin=2, provider = 2, receiver =1);
> store listen (origin=1, provider = 1, receiver =3);
> store listen (origin=2, provider = 1, receiver =3);
> store listen (origin=3, provider = 3, receiver =1);
> store listen (origin=3, provider = 1, receiver =2);
> set add table (set id=1, origin=1, id=1, fully qualified name = 'public.accounts', comment='accounts table');
> set add table (set id=1, origin=1, id=2, fully qualified name = 'public.branches', comment='branches table');
> set add table (set id=1, origin=1, id=3, fully qualified name = 'public.tellers', comment='tellers table');
> set add table (set id=1, origin=1, id=4, fully qualified name = 'public.history', comment='history table', key = serial);
> #wait for event(origin=all,confirmed=all);
> _EOF_
> slon nodetest "dbname=node1" >node1.log 2>&1 &
> slon nodetest "dbname=node2" >node2.log 2>&1 &
> slon nodetest "dbname=node3" >node3.log 2>&1 &
> slonik <<_EOF_
> cluster name = $CLUSTERNAME;
> node 1 admin conninfo = 'dbname=node1';
> node 2 admin conninfo = 'dbname=node2';
> node 3 admin conninfo = 'dbname=node3';
> subscribe set ( id = 1, provider = 1, receiver = 2, forward = yes);
> subscribe set ( id = 1, provider = 1, receiver = 3, forward = yes);
> _EOF_
> 
> 3. Run del.sh, shown below. It drops node 3.
> 
> [tanida at srapc2209 sl]$ cat del.sh
> #!/bin/sh
> CLUSTERNAME=nodetest
> 
> slonik <<_EOF_
> cluster name = $CLUSTERNAME;
> node 1 admin conninfo = 'dbname=node1';
> node 2 admin conninfo = 'dbname=node2';
> node 3 admin conninfo = 'dbname=node3';
> drop node (id=3);
> _EOF_
> 
> 4. slon for node3 shuts down immediately, but 20 seconds later slon for
> node1 also shuts down and has to be restarted.
> 
> The log for node1 shows:
> 
> 2006-03-08 20:24:11 JST INFO   localListenThread: got restart notification - signal scheduler
> 2006-03-08 20:24:11 JST DEBUG1 slon: restart requested
> 2006-03-08 20:24:11 JST DEBUG1 cleanupThread: thread done
> 2006-03-08 20:24:11 JST DEBUG1 syncThread: thread done
> 2006-03-08 20:24:11 JST DEBUG1 main: scheduler mainloop returned
> 2006-03-08 20:24:11 JST DEBUG1 localListenThread: thread done
> 2006-03-08 20:24:31 JST WARN   main: shutdown timeout exiting
> 2006-03-08 20:24:31 JST DEBUG1 slon: shutdown now requested
> 
> It seems something unexpected happened in remoteListenThread or
> remoteWorkerThread and it deadlocked, so slon waited 20 seconds and then
> shut down on the timeout.
> 
> This example uses "drop node", but the same thing happens with other
> statements that request a restart, such as "uninstall node", "move set" or
> "failover".
> 
> 
> -- 
> TANIDA Yutaka <tanida at sraoss.co.jp>
> 
> 

-- 
TANIDA Yutaka <tanida at sraoss.co.jp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crush_on_dropNode.diff
Type: application/octet-stream
Size: 686 bytes
Desc: not available
Url : http://gborg.postgresql.org/pipermail/slony1-general/attachments/20060320/f9053aa1/crush_on_dropNode.obj


