Jerry Sievers jerry at jerrysievers.com
Fri Jul 6 09:52:07 PDT 2007
Jan;  A great big thanks for your assistance yesterday. 

I believe the tips on what to manually remove from the events tables
did the trick.  
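
For anyone finding this in the archives, here is a rough sketch of the
check-then-delete pattern behind that cleanup. It is not the exact
procedure we ran. The real tables live in the Slony-I cluster schema on
PostgreSQL; this sketch assumes an in-memory sqlite3 database with
cut-down stand-ins for sl_node and sl_event so it runs anywhere. The
function names are made up for illustration, and the idea that ev_data1
carries the node id of a DROP_NODE event is read off the event dumps
quoted below.

```python
import sqlite3


def build_demo_db():
    """Create simplified stand-ins for sl_node and sl_event (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sl_node (no_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE sl_event ("
        " ev_origin INTEGER, ev_seqno INTEGER, ev_type TEXT, ev_data1 TEXT,"
        " PRIMARY KEY (ev_origin, ev_seqno))"
    )
    # Node 2 has already been dropped, so only 1, 3 and 4 remain.
    conn.executemany("INSERT INTO sl_node VALUES (?)", [(1,), (3,), (4,)])
    conn.executemany(
        "INSERT INTO sl_event VALUES (?, ?, ?, ?)",
        [
            (1, 2225126, "ACCEPT_SET", "1"),
            (1, 2225133, "ACCEPT_SET", "2"),
            (1, 2225224, "DROP_NODE", "2"),
            (4, 1863698, "DROP_NODE", "2"),
        ],
    )
    conn.commit()
    return conn


def find_orphan_drop_events(conn):
    """List DROP_NODE events whose target node is no longer in sl_node."""
    rows = conn.execute(
        """
        SELECT e.ev_origin, e.ev_seqno
          FROM sl_event e
         WHERE e.ev_type = 'DROP_NODE'
           AND NOT EXISTS (SELECT 1 FROM sl_node n
                            WHERE n.no_id = CAST(e.ev_data1 AS INTEGER))
         ORDER BY e.ev_origin, e.ev_seqno
        """
    ).fetchall()
    return [(origin, seqno) for origin, seqno in rows]


def delete_events(conn, events):
    """Delete exactly the listed events in one transaction.

    Rolls back if the number of rows removed differs from the number asked for.
    """
    with conn:  # commit on success, roll back if an exception escapes
        deleted = 0
        for origin, seqno in events:
            cur = conn.execute(
                "DELETE FROM sl_event WHERE ev_origin = ? AND ev_seqno = ?",
                (origin, seqno),
            )
            deleted += cur.rowcount
        if deleted != len(events):
            raise RuntimeError("unexpected row count, nothing deleted")
    return deleted


if __name__ == "__main__":
    db = build_demo_db()
    orphans = find_orphan_drop_events(db)
    print("would delete:", orphans)
    print("deleted rows:", delete_events(db, orphans))
```

The point of the two steps is to look at the candidate rows before
touching them, and to refuse the delete if it would remove anything
other than the rows that were inspected.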

The slon quit failing on the master and, eventually, both of the
slaves came up to date again.  This is a big relief, since these DBs
are about 120GB in size, with 650 max TPS and approx. 300 TPS of
sustained workload.

In recent attempts we've found it impossible to init a new slave,
because of the huge event backlog that has built up by the time the
sets are subscribed.  As such, we can't afford to lose slaves, one of
which we did lose yesterday due to a corrupt application table.

I suppose one option is to reconfigure the system to have several
sets, possibly even sets holding only one large table, and to try
bringing up a new slave that way.  This DB is poorly designed and has
no FKs whatsoever (a bad thing turned good in this case).
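
To make the several-sets idea concrete, here is a minimal sketch of how
tables might be grouped. It is untested against a real cluster. The
table names, sizes, thresholds and the function name plan_sets are all
hypothetical. It gives each large table a set of its own and packs the
small ones together, on the assumption that with no FKs the tables can
be split across sets freely.

```python
def plan_sets(table_sizes_gb, big_table_gb=10.0, max_set_gb=20.0):
    """Group tables into replication sets (illustrative only).

    Tables of at least big_table_gb get a set to themselves. The rest are
    packed first-fit, largest first, into sets of at most max_set_gb.
    """
    # Largest first; ties broken by name so the result is deterministic.
    ordered = sorted(table_sizes_gb.items(), key=lambda kv: (-kv[1], kv[0]))

    sets = []
    small = []
    for name, size in ordered:
        if size >= big_table_gb:
            sets.append([name])
        else:
            small.append((name, size))

    bins = []  # each entry is [total_size_gb, [table names]]
    for name, size in small:
        for entry in bins:
            if entry[0] + size <= max_set_gb:
                entry[0] += size
                entry[1].append(name)
                break
        else:
            bins.append([size, [name]])

    sets.extend(names for _, names in bins)
    return sets


if __name__ == "__main__":
    # Hypothetical sizes, not measurements from the DB discussed here.
    sizes = {"table_a": 60.0, "table_b": 35.0, "table_c": 18.0,
             "table_d": 4.0, "table_e": 1.0}
    for number, tables in enumerate(plan_sets(sizes), start=1):
        print("set", number, tables)
```

Each resulting set could then be subscribed on the new slave one at a
time, so that no single initial copy has to cover the whole DB.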

Anyway, that's a whole other issue.

Have a great weekend. 


Jan Wieck <JanWieck at Yahoo.com> writes:

> On 7/5/2007 5:22 PM, Jerry Sievers wrote:
> 
> > Selecting all non-sync events from each of the 3 nodes ordered by
> > ev_seqno.
> 
> I think I see what's going on here ... maybe.
> 
> This is probably a pilot error in connection with a copy/paste mistake
> sitting in slon for ages.
> 
> The copy/paste mistake is:
>      the error message in disableNode() says "enableNode(): ...".
>      I claim ownership of that one.
> 
> The pilot error is:
>      the dropnode() was issued multiple times against different nodes
>      without giving them time to propagate (in this case nodes 1 and 4).
>      They are events (1,2225224) and (4,1863698).
> 
> Nice screwup. However since all 3 nodes don't have node 2 in the
> sl_node table any more (at least from what I see they should not), it
> is safe to
> 
>      DELETE FROM sl_event WHERE ev_origin = 4 and ev_seqno = 1863698;
>      DELETE FROM sl_event WHERE ev_origin = 1 and ev_seqno = 2225224;
> 
> 
> Jan
> 
> > Thanks!
> > Pager usage is off.
> > Expanded display is on.
> > -[ RECORD 1 ]+------------------------------------------------------------------------
> > ev_origin    | 1
> > ev_seqno     | 2225126
> > ev_timestamp | 05-JUL-07 14:57:16.056801
> > ev_minxid    | 884391402
> > ev_maxxid    | 884391412
> > ev_xip       | '884391409','884391411'
> > ev_type      | ACCEPT_SET
> > ev_data1     | 1
> > ev_data2     | 2
> > ev_data3     | 1
> > ev_data4     |
> > ev_data5     |
> > ev_data6     |
> > ev_data7     |
> > ev_data8     |
> > -[ RECORD 2 ]+------------------------------------------------------------------------
> > ev_origin    | 1
> > ev_seqno     | 2225133
> > ev_timestamp | 05-JUL-07 14:58:26.439281
> > ev_minxid    | 884391608
> > ev_maxxid    | 884391609
> > ev_xip       |
> > ev_type      | ACCEPT_SET
> > ev_data1     | 2
> > ev_data2     | 2
> > ev_data3     | 1
> > ev_data4     |
> > ev_data5     |
> > ev_data6     |
> > ev_data7     |
> > ev_data8     |
> > -[ RECORD 3 ]+------------------------------------------------------------------------
> > ev_origin    | 1
> > ev_seqno     | 2225224
> > ev_timestamp | 05-JUL-07 15:49:54.253471
> > ev_minxid    | 884528335
> > ev_maxxid    | 884697167
> > ev_xip       | '884528335','884697160','884697162','884697161','884587782','884697166'
> > ev_type      | DROP_NODE
> > ev_data1     | 2
> > ev_data2     |
> > ev_data3     |
> > ev_data4     |
> > ev_data5     |
> > ev_data6     |
> > ev_data7     |
> > ev_data8     |
> >
> > Pager usage is off.
> > Expanded display is on.
> > -[ RECORD 1 ]+--------------------------
> > ev_origin    | 4
> > ev_seqno     | 1863698
> > ev_timestamp | 05-JUL-07 15:52:40.518681
> > ev_minxid    | 385609088
> > ev_maxxid    | 385609089
> > ev_xip       |
> > ev_type      | DROP_NODE
> > ev_data1     | 2
> > ev_data2     |
> > ev_data3     |
> > ev_data4     |
> > ev_data5     |
> > ev_data6     |
> > ev_data7     |
> > ev_data8     |
> >
> > Pager usage is off.
> > Expanded display is on.
> > -[ RECORD 1 ]+--------------------------
> > ev_origin    | 4
> > ev_seqno     | 1863698
> > ev_timestamp | 05-JUL-07 15:52:40.518681
> > ev_minxid    | 385609088
> > ev_maxxid    | 385609089
> > ev_xip       |
> > ev_type      | DROP_NODE
> > ev_data1     | 2
> > ev_data2     |
> > ev_data3     |
> > ev_data4     |
> > ev_data5     |
> > ev_data6     |
> > ev_data7     |
> > ev_data8     |
> >
> > Jan Wieck <JanWieck at Yahoo.com> writes:
> >
> >> On 7/5/2007 3:03 PM, Jerry Sievers wrote:
> >> > Crisis today.  Complete power failure leaves a corrupt table on old
> >> > master. I did moveset() and dropnode() to reconfigure the cluster.
> >> > The old master was node 2.  New master is node 1.  There are now
> >> > just 2 slaves 3 and 4.
> >>
> >> Another question: Did you wait for the moveset() to propagate before
> >> you dropped node 2?
> >>
> >> Jan
> >>
> >> > For some reason however, when I try to fire up the slon on the
> >> > master, it complains of node #2 does not exist right after reporting
> >> > having init'd node 4. I have no clue what's going wrong here and
> >> > hope not to have to undo and reconfig the cluster from scratch.
> >> > These DBs are too large now for easy subscription during live
> >> > processing. Any help much appreciated.
> >> >
> >> > -----------------------------------------
> >> > 2007-07-05 18:19:18 GMT CONFIG main: edb-replication version 1.1.5 starting up
> >> > 2007-07-05 18:19:19 GMT CONFIG main: local node id = 1
> >> > 2007-07-05 18:19:19 GMT CONFIG main: launching sched_start_mainloop
> >> > 2007-07-05 18:19:19 GMT CONFIG main: loading current cluster configuration
> >> > 2007-07-05 18:19:19 GMT CONFIG storeNode: no_id=3 no_comment='slave node 3'
> >> > 2007-07-05 18:19:19 GMT CONFIG storeNode: no_id=4 no_comment='slave node 4'
> >> > 2007-07-05 18:19:19 GMT CONFIG storePath: pa_server=3 pa_client=1 pa_conninfo="dbname=rt3_01 host=192.168.30.172 user=slonik password=foo.j1MiTikGop0rytQuedPid8 port=5432" pa_connretry=5
> >> > 2007-07-05 18:19:19 GMT CONFIG storePath: pa_server=4 pa_client=1 pa_conninfo="dbname=rt3_01 host=192.168.30.173 user=slonik password=foo.j1MiTikGop0rytQuedPid8 port=5432" pa_connretry=5
> >> > 2007-07-05 18:19:19 GMT CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
> >> > 2007-07-05 18:19:19 GMT CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
> >> > 2007-07-05 18:19:19 GMT CONFIG storeSet: set_id=1 set_origin=1 set_comment='RT3/VCASE replication set'
> >> > 2007-07-05 18:19:19 GMT CONFIG storeSet: set_id=2 set_origin=1 set_comment='new set for adding tables'
> >> > 2007-07-05 18:19:19 GMT CONFIG main: configuration complete - starting threads
> >> > NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=12520
> >> > 2007-07-05 18:19:19 GMT CONFIG enableNode: no_id=3
> >> > 2007-07-05 18:19:19 GMT CONFIG enableNode: no_id=4
> >> > 2007-07-05 18:19:19 GMT FATAL  enableNode: unknown node ID 2
> >> > 2007-07-05 18:19:19 GMT INFO   remoteListenThread_4: disconnecting from 'dbname=rt3_01 host=192.168.30.173 user=slonik password=foo.j1MiTikGop0rytQuedPid8 port=5432'
> >> > 2007-07-05 18:19:20 GMT INFO   remoteListenThread_3: disconnecting from 'dbname=rt3_01 host=192.168.30.172 user=slonik password=foo.j1MiTikGop0rytQuedPid8 port=5432'
> >> >
> >> -- 
> >> #======================================================================#
> >> # It's easier to get forgiveness for being wrong than for being right. #
> >> # Let's break this rule - forgive me.                                  #
> >> #================================================== JanWieck at Yahoo.com #
> >>
> >
> 
> 
> -- 
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me.                                  #
> #================================================== JanWieck at Yahoo.com #
> 

-- 
-------------------------------------------------------------------------------
Jerry Sievers   732 365-2844 (work)     Production Database Administrator
                305 321-1144 (mobil	WWW E-Commerce Consultant

