Steve Singer ssinger at ca.afilias.info
Fri May 14 13:53:21 PDT 2010
Brian Fehrle wrote:
> Hi all,
> 

> 
> So the problem is that after I drop a node, everything looks great 
> except that the _slony.sl_status table on some or all of the remaining 
> nodes still refers to the node that was just dropped.

sl_status is a view, not a table.
The view definition involves sl_event and sl_confirm.
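
If you want to see exactly where the leftover reference lives, a couple 
of queries along these lines (untested, and assuming the cluster schema 
is the _slony one from your examples) should show whether either of 
those two tables still mentions node 3:

  SELECT count(*) AS leftover_events
    FROM _slony.sl_event
   WHERE ev_origin = 3;

  SELECT count(*) AS leftover_confirms
    FROM _slony.sl_confirm
   WHERE con_origin = 3 OR con_received = 3;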

The dropNode_int function should be deleting everything involving the 
dropped node from both tables.
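
From memory, the relevant part of that cleanup amounts to roughly the 
following (a simplified sketch of what the function is supposed to do, 
not its exact body):

  DELETE FROM _slony.sl_event
   WHERE ev_origin = 3;
  DELETE FROM _slony.sl_confirm
   WHERE con_origin = 3 OR con_received = 3;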

So my question would be: is the sequence number of the last event from 
your dropped node less than the sequence number of your DROP NODE 
command, or is it greater?  I can't tell this from what you've sent.
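
Something like this (untested) would make that comparison easy; it 
lists the DROP_NODE event next to whatever events from node 3 are still 
around (assuming the event type string is 'DROP_NODE'):

  SELECT ev_origin, ev_seqno, ev_type, ev_timestamp
    FROM _slony.sl_event
   WHERE ev_type = 'DROP_NODE' OR ev_origin = 3
   ORDER BY ev_origin, ev_seqno;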


(I am wondering if there is some sort of race condition where the 
dropNode_int() function deletes the rows from sl_event but a new event 
gets added afterwards, though looking at the code I don't see any 
obvious issues.)
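
One way to test that theory would be to look at timestamps rather than 
sequence numbers, e.g. (untested):

  SELECT 'event' AS src, max(ev_timestamp) AS last_seen
    FROM _slony.sl_event
   WHERE ev_origin = 3
  UNION ALL
  SELECT 'confirm', max(con_timestamp)
    FROM _slony.sl_confirm
   WHERE con_origin = 3 OR con_received = 3;

If those times are later than the DROP_NODE event's ev_timestamp, then 
something is re-inserting rows for node 3 after the cleanup has run.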


> 
> I did quite a few test runs of the drop node to try to reproduce and 
> determine the cause. After the drop node, if I look in sl_node, sl_path, 
> sl_event, or any other sl_ table, I see no reference to the third 
> node. However, about half the time I would still get references to the 
> third node in sl_status. This can be on the master node, on the 
> (remaining) slave node, or on both. In one test scenario I monitored 
> the sl_status table and noticed that node 3 disappeared, then 
> reappeared a second later, and then remained.
> 
> Example queries done on node 2 (slave) after dropping node 3 (other slave):
> 
> postgres=# select * from _slony.sl_node;
>  no_id | no_active | no_comment | no_spool
> -------+-----------+------------+----------
>      1 | t         | Server 1   | f
>      2 | t         | Server 2   | f
> (2 rows)
> 
> postgres=# select * from _slony.sl_path ;
>  pa_server | pa_client |                         pa_conninfo                         | pa_connretry
> -----------+-----------+-------------------------------------------------------------+--------------
>          1 |         2 | dbname=postgres host=172.16.44.111 port=5432 user=postgres |           10
>          2 |         1 | dbname=postgres host=172.16.44.129 port=5432 user=postgres |           10
> (2 rows)
> 
> postgres=# select * from _slony.sl_status;
>  st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time
> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
>          2 |           1 |          1649 | 2010-05-10 15:53:16.245529 |             1649 | 2010-05-10 15:53:16.246212 | 2010-05-10 15:53:16.245529 |                 0 | 00:00:05.57205
>          2 |           3 |          1656 | 2010-05-10 15:54:26.280131 |             1636 | 2010-05-10 15:51:05.341512 | 2010-05-10 15:51:05.343754 |                20 | 00:03:22.66664
> 
> 
> Also, another problem that may be linked is that the slon daemon for 
> node 3 does not terminate itself after the drop. Watching that daemon's 
> log output, it shows that it receives the drop node command for itself 
> and drops the _slony schema as intended. However, after that it reports 
> "2010-05-10 15:57:56 MDT FATAL  main: Node is not initialized properly 
> - sleep 10s" and keeps retrying every ten seconds. I'm not sure whether 
> this daemon is somehow causing post-drop-node entries in sl_event that 
> cause the sl_status entry to be recreated.
> 
> In case it helps, here is a copy of the drop node script I'm running.
> 
> #!/bin/bash
> slonik <<_EOF_
> cluster name = slony;
> node 1 admin conninfo = ' dbname=postgres host=172.16.44.111 port=5432 user=postgres';
> node 2 admin conninfo = ' dbname=postgres host=172.16.44.129 port=5432 user=postgres';
> node 3 admin conninfo = ' dbname=postgres host=172.16.44.142 port=5432 user=postgres';
> DROP NODE ( ID = 3, EVENT NODE = 1 );
> _EOF_
> 
> I am running CentOS 5, PostgreSQL 8.4.2, and Slony-I 1.2.20 on all 
> three nodes.
> 
> Thanks in advance,
>     Brian Fehrle


-- 
Steve Singer
Afilias Canada
Data Services Developer
416-673-1142

