Chris Browne cbbrowne at lists.slony.info
Fri Oct 23 14:18:23 PDT 2009
Update of /home/cvsd/slony1/slony1-engine/src/slon
In directory main.slony.info:/tmp/cvs-serv16900

Modified Files:
	remote_worker.c 
Log Message:
Fix 8.4-ism as found by Jeff Trout...

In a nutshell: in the event loop (in remote_worker.c) we start a transaction and then, unless the event is an ACCEPT_SET event, lock the config lock table.  We then zero out query1.

The ENABLE_SUBSCRIPTION event runs in a while(true) loop.
First it executes query1 (which, thanks to the above, is empty), then it tries copy_set.  If copy_set fails for whatever reason, we ROLLBACK on our local conn (query2) and then loop.

The problem with this is that when we come back around in the next loop iteration we're outside of a transaction, and a new one won't be started because query1 has been reset.  This causes LOCK TABLE to barf on PG 8.4, which requires it to run inside a transaction.  You are stuck forever until you restart slon.  This also explains another problem I've seen a couple of times.

We subscribe to a set with, say, 3 tables.
The initial subscription fails due to an earlier txn wait.
We copy the first table of the set successfully.
Then the second table fails to copy due to some DDL issue (perhaps for some reason a PK or column is missing).  We issue a rollback, but since we are not in a txn, nothing happens.  The event does not succeed, so we try again.
What happens next is that, since our previous work wasn't rolled back, slony sees we've already got the deny trigger & friends on the first table and barfs.  Cue an infinite loop, fixed only by shutting down slon and playing with the sl_ tables.

This patch keeps a count of how many retries we've had on this copy_set.  If we are on retry > 0 then we re-issue a start transaction, set the isolation level, and lock the config table.  My testing has shown that this works.
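
To illustrate the control flow described above, here is a minimal standalone sketch. It is not the slon code: fake_conn, begin_and_lock, rollback_txn, fake_copy_set and subscribe_with_retries are hypothetical names made up for this example, and the "connection" only tracks whether a transaction is open. It assumes, as the log message says, that the first transaction is opened once by the event loop and that the copy step cannot work outside a transaction.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the local DB connection: it only
 * remembers whether a transaction is currently open. */
typedef struct {
    bool in_txn;
} fake_conn;

/* Models "start transaction; set isolation level; lock config table". */
static void begin_and_lock(fake_conn *c) { c->in_txn = true; }

/* Models the ROLLBACK issued via query2 after a failed copy. */
static void rollback_txn(fake_conn *c) { c->in_txn = false; }

/* Hypothetical copy step. Returns 0 on success, -1 on an ordinary
 * copy failure, and -2 when called outside a transaction (the case
 * where LOCK TABLE errors out on 8.4). */
static int fake_copy_set(fake_conn *c, int *failures_left)
{
    if (!c->in_txn)
        return -2;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}

/* Runs the retry loop. `failures` is how many times the copy fails
 * before it would succeed. Returns the number of retries needed, or
 * -1 if the loop ends up outside a transaction and cannot recover. */
int subscribe_with_retries(int failures, bool reopen_txn_on_retry)
{
    fake_conn conn = { false };
    int copy_set_retries = 0;

    /* The event loop opens the transaction once, before the loop. */
    begin_and_lock(&conn);

    for (int attempt = 0; attempt < 100; attempt++) {
        /* The fix: on any retry, re-open the transaction first. */
        if (copy_set_retries != 0 && reopen_txn_on_retry)
            begin_and_lock(&conn);

        int rc = fake_copy_set(&conn, &failures);
        if (rc == 0)
            return copy_set_retries;
        if (rc == -2)
            return -1;          /* stuck: no transaction is open */

        copy_set_retries++;
        rollback_txn(&conn);    /* leaves us outside a transaction */
    }
    return -1;
}
```

In this model, subscribe_with_retries(2, true) recovers after two retries, while subscribe_with_retries(1, false) gets stuck on the second pass, which corresponds to the pre-patch behaviour.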



Index: remote_worker.c
===================================================================
RCS file: /home/cvsd/slony1/slony1-engine/src/slon/remote_worker.c,v
retrieving revision 1.181
retrieving revision 1.182
diff -C2 -d -r1.181 -r1.182
*** remote_worker.c	17 Aug 2009 17:25:50 -0000	1.181
--- remote_worker.c	23 Oct 2009 21:18:21 -0000	1.182
***************
*** 1198,1201 ****
--- 1198,1202 ----
  				int			sub_receiver = (int) strtol(event->ev_data3, NULL, 10);
  				char	   *sub_forward = event->ev_data4;
+ 				int         copy_set_retries = 0;
  
  				/*
***************
*** 1243,1254 ****
  							}
  						}
  
! 						/*
! 						 * Execute the config changes so far, but don't commit
! 						 * the transaction yet. We have to copy the data now
! 						 * ...
  						 */
! 						if (query_execute(node, local_dbconn, &query1) < 0)
! 							slon_retry();
  
  						/*
--- 1244,1266 ----
  							}
  						}
+ 					
  
! 						/* 
! 						 * if we have failed more than once we need to restart
! 						 * our transaction or we can end up with odd results
! 						 * in our subscription tables, and in 8.4+ LOCK
! 						 * TABLE requires you to be in a txn.
  						 */
! 						if(copy_set_retries != 0)
! 						  {
! 							slon_mkquery(&query1, "start transaction;"
! 										 "set transaction isolation level serializable;");
! 							slon_appendquery(&query1,
! 											 "lock table %s.sl_config_lock; ",
! 											 rtcfg_namespace);
! 
! 							if (query_execute(node, local_dbconn, &query1) < 0)
! 							  slon_retry();
! 						  }
  
  						/*
***************
*** 1264,1270 ****
--- 1276,1285 ----
  										sub_set, sub_provider, sub_receiver);
  							sched_rc = SCHED_STATUS_OK;
+ 							copy_set_retries = 0;
  							break;
  						}
  
+ 						copy_set_retries++;
+ 
  						/*
  						 * Data copy for new enabled set has failed. Rollback
***************
*** 1272,1278 ****
  						 */
  						slon_log(SLON_WARN, "remoteWorkerThread_%d: "
! 								 "data copy for set %d failed - "
  								 "sleep %d seconds\n",
! 								 node->no_id, sub_set, sleeptime);
  						if (query_execute(node, local_dbconn, &query2) < 0)
  							slon_retry();
--- 1287,1295 ----
  						 */
  						slon_log(SLON_WARN, "remoteWorkerThread_%d: "
! 								 "data copy for set %d failed %d times - "
  								 "sleep %d seconds\n",
! 								 node->no_id, sub_set, copy_set_retries,
! 								 sleeptime);
! 
  						if (query_execute(node, local_dbconn, &query2) < 0)
  							slon_retry();


