Mike Wilson mfwilson at gmail.com
Tue Dec 27 14:18:03 PST 2011
Thanks for the reply.  Per your questions, here is additional information.  For the purposes of this discussion, db1 (node 101) is our slave and db2 (node 102) is our Slony master.

> On Tue, 27 Dec 2011, Mike Wilson wrote:
> 
> 1. The good news is restarting postgresql is unlikely to fix your problem.
I figured as much, but I would still like to try it.  I've hunted around for a way to track down the PG session associated with a given xid, but so far no luck.
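The closest I've found so far is joining pg_locks to pg_stat_activity on the transaction id -- a rough sketch only, and the xid below is just a placeholder:

=# select l.transactionid, a.procpid, a.xact_start, a.current_query
     from pg_locks l
     join pg_stat_activity a on a.procpid = l.pid
    where l.locktype = 'transactionid'
      and l.transactionid = '1234567';   -- placeholder xid

If I understand pg_locks correctly that should show the backend holding that xid (and anything waiting on it), but please correct me if there is a better way.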
> 
> 2. Now that you have killed the old vacuums you need to determine what if anything slony is doing.
During the period when replication stopped I found several vacuums that had been running for upwards of 11 days.  I killed those processes (using pg_cancel_backend) and shortly afterwards started getting alerts from Nagios, which monitors the sl_status table, that replication was falling behind.
> 
> a) Are new SYNC events being generated on the master?
I'm not sure how to determine this.
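Is querying sl_event on the master the way to check that?  Something along these lines, run a few minutes apart (I'm writing "_mycluster" below as a stand-in for our actual cluster schema name):

=# select max(ev_seqno), max(ev_timestamp)
     from _mycluster.sl_event
    where ev_origin = 102 and ev_type = 'SYNC';

I'm assuming that if max(ev_seqno) keeps climbing between runs then new SYNCs are being generated.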
> 
> b) According to the master is the master confirming SYNC events.  Even if the slave  is two weeks behind is the highest confirmed event # increasing over time (ie over an hour).
See next question. The slave's st_last_event value is increasing, whereas the master's is static.
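Or should I be looking at sl_confirm on the master directly?  Something like this, again with "_mycluster" standing in for our cluster schema, to see whether the highest event confirmed by the slave (101) for events originating on the master (102) is moving:

=# select max(con_seqno), max(con_timestamp)
     from _mycluster.sl_confirm
    where con_origin = 102 and con_received = 101;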
> 
> c) Is the slave receiving new SYNC events FROM THE MASTER?
I'm not sure.  How would I tell?  See below for snippets of our current logs.  It does appear that the two nodes are doing "something".
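If it's the same idea as (a), I can run the sl_event check against the slave's copy instead ("_mycluster" again a placeholder) and watch whether max(ev_seqno) for the master's events moves:

=# select max(ev_seqno), max(ev_timestamp)
     from _mycluster.sl_event
    where ev_origin = 102;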
Here is the sl_status output from the slave over a period of a few minutes:
=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |   st_lag_time   
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
       101 |         102 |    5000826162 | 2011-12-27 13:55:56.563939 |       5000824786 | 2011-12-27 10:06:32.519807 | 2011-12-27 10:06:31.719339 |              1376 | 03:49:28.222625

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |  st_lag_time   
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+----------------
       101 |         102 |    5000826238 | 2011-12-27 14:08:36.821359 |       5000824786 | 2011-12-27 10:06:32.519807 | 2011-12-27 10:06:31.719339 |              1452 | 04:02:05.30789


Here is the sl_status table from the master over the same few minutes:
=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time       
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:40:41.911154

=# select * from sl_status;
 st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time       
-----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
       102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:42:49.222796

As you can see, the slave's st_last_event value appears to be increasing, whereas the master's appears to be static.

> 
> d) What does your slon log for the slave slon say it is doing?  Is it processing SYNC events that take a very long time? Is it stuck? How long does it take to process a SYNC event? How many inserts,updates,deletes are in this?
Please see below for snippets of the master and slave slony.log files.
> 
> e) Are there any other ERROR messages in your slon log (use search, don't eyeball it).  ERROR messages contain the word 'ERROR'
There are no recent error messages in either the master's or the slave's Slony logs.  There was an error in the master's slony.log regarding an xid, but when I restarted the Slony processes on our master it apparently overwrote the log, and I no longer have a copy from before the restart.

Here is a small bit of the current slony.log from the master (node 102).  Other than the NOTICE regarding the log switch, the log is clean (I also grep'd it for errors):
2011-12-27 09:23:47 PSTINFO   cleanupThread:    4.996 seconds for cleanupEvent()
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 09:34:37 PSTINFO   cleanupThread:    4.903 seconds for cleanupEvent()
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 09:45:13 PSTINFO   cleanupThread:    2.795 seconds for cleanupEvent()
2011-12-27 09:45:13 PSTINFO   cleanupThread:    0.005 seconds for vacuuming
NOTICE:  Slony-I: log switch to sl_log_1 still in progress - sl_log_2 not truncated
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
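
Since I gather that NOTICE means cleanupEvent still thinks a transaction older than the log switch is open, is this a reasonable way to double-check for old open transactions on the master (the one-hour cutoff is arbitrary)?

=# select procpid, usename, xact_start, current_query
     from pg_stat_activity
    where xact_start < now() - interval '1 hour'
    order by xact_start;

And, in case an old two-phase transaction is what's holding things up:

=# select * from pg_prepared_xacts;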

Here is what I'm seeing in the slave's (node 101) slony.log; it also looks clean, with no recent errors (I grep'd it):
2011-12-27 14:05:11 PSTINFO   remoteWorkerThread_102: syncing set 3 with 24 table(s) from provider 102
2011-12-27 14:05:11 PSTINFO   remoteWorkerThread_102: syncing set 1 with 185 table(s) from provider 102
2011-12-27 14:05:11 PSTINFO   remoteWorkerThread_102: syncing set 2 with 13 table(s) from provider 102
NOTICE:  Slony-I: log switch to sl_log_1 complete - truncate sl_log_2
CONTEXT:  PL/pgSQL function "cleanupevent" line 99 at assignment
2011-12-27 14:05:45 PSTINFO   cleanupThread:    0.037 seconds for cleanupEvent()
2011-12-27 14:05:45 PSTINFO   cleanupThread:    0.008 seconds for vacuuming
2011-12-27 14:06:32 PSTINFO   remoteWorkerThread_102: SYNC 5003985339 done in 80.973 seconds
2011-12-27 14:06:32 PSTINFO   remoteWorkerThread_102: syncing set 3 with 24 table(s) from provider 102
2011-12-27 14:06:32 PSTINFO   remoteWorkerThread_102: syncing set 1 with 185 table(s) from provider 102
2011-12-27 14:06:32 PSTINFO   remoteWorkerThread_102: syncing set 2 with 13 table(s) from provider 102

Note that I have three replication sets defined.

> 
> If replication  is working just very far behind it might be faster for you to create a new slave instead of waiting for the existing slave to catch up. This depends on how large your database is, how far behind it is and how fast your network is.
If at all possible I need to heal the current cluster rather than build a new one.  The slave has numerous other schemas, tables, indexes, etc., so it isn't a straight duplicate of the master.  Because of this I would like to get replication working again, even if it takes weeks for the two systems to come back into sync.

Thanks so much for your help thus far.  Please advise of any fixes I could attempt, or, if you can, describe what is currently going on.  Cheers.

> 
> Steve
> 
>> Under incredible load last week during the Christmas season our primary PG (8.4.7, Slony 2.0.7) stopped replicating.  Now that we are past the Christmas season I need to figure out how to clear the backlog of replication rows in sl_log_1/2.  This is all running on our commercial website, and if possible I would prefer not to restart the PG instance, as that would require a scheduled maintenance window in a week when everyone is out of the office.  In an attempt to fix the issue without rebooting the PG instance I've already restarted the Slony services on the primary PG node.  This did not get replication working again, and I'm still getting the same message from Slony in the logs: log switch to sl_log_1 still in progress - sl_log_2 not truncated
>> 
>> From my research I can see that this message is emitted when the function logswitch_finish() is called.  I did have some hung vacuums during this period of high load on the server, but I have since killed them with pg_cancel_backend.  From other lock analysis I can see that nothing is currently running or locked in the db (nothing more than a few milliseconds old, at least).  I'm certain whatever transaction was in progress that prevented the switch from occurring is long since past.
>> 
>> Any ideas on the best way to get replication working again?  I'm averse to rebooting the PG instance, but I am willing to do it if someone more knowledgeable out there thinks it would fix this issue.  We are currently operating without a backup of all of our Xmas sales data and I *really* want to get this data replicated.  Any help would be appreciated.
>> 
>> Mike Wilson
>> Predicate Logic
>> Cell: (310) 600-8777
>> SkypeID: lycovian
>> 
>> 
>> 
>> 
>> 
> 


