Wed Feb 1 00:02:17 PST 2006
On Tue, 2006-01-31 at 20:14, Michael Crozier wrote:
> Hi,
>
> I encountered some duplicate key errors in my Slony cluster today. Clearly,
> an event/log was replicated more than once.
>
> I believe this may be due to "the Solaris threading issue", but I can't
> find enough clear information about the problem to determine whether I
> failed to avoid it in my builds of PostgreSQL and Slony.
>
> Details:
> Solaris 9 SPARC, PostgreSQL 7.3.13, compiled with --enable-thread-safety
> Solaris 10 Opteron, PostgreSQL 8.0.6, compiled with --enable-thread-safety
> All the slons were running from the 8.0.6 instance/build.
>
> I was able to manually remove the offending rows and get the slons
> processing events again, but I'm worried about a few things:
>
> 1. How is my data? Do I need to re-sync?

Possible. Check your data :)

> 2. How can I prove that this problem is related to the threading issue?

I don't think it is related to the threading issue. If more than 2G
transactions (_xxx_cluster_.sl_log_1.log_xid > 2G) have been executed during
replication without reindexing sl_log_1, then the indexes on xxid start
misbehaving, resulting both in duplicate key errors *and* in some events not
being replicated (i.e. data loss). It should be (but is not) documented in
BIG FRIENDLY LETTERS on the title page of the Slony docs.

> 3. What IS the threading issue? I can't find a good description of the
> problem and the solution.
> 4. If the problem still exists in the 8.0.x build, how do I correct it?

I've heard there are plans to start alternating between sl_log_1 and
sl_log_2, truncating the unused table, in the upcoming v2.0 of Slony. Until
then, the only alternative I know of is to reindex any indexes using
xxid_ops at least once every 1G transactions. And NEVER use a setup where
data from multiple masters goes through the same node, as this greatly
increases the chance of xxids more than 2G apart (due to differing
transaction rates) ending up in the same index, in which case the btree
indexes break.
The 2G difference need not be present at any single moment; it just has to
occur at some point during the lifetime of the index. This behaviour is
especially nasty because it is not detected in testing (unless you are able
to run tests for more than 2G transactions, which takes about 23 days at
1000 trx/sec), and even once it kicks in after 2G transactions it starts
eating your data slowly and undetectably at first: you won't notice the data
loss, only an occasional duplicate key error. If you want to know a little
more about the issue, look for my recent posts on this list.

----------------
Hannu
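For reference, the periodic reindex described above would look something like the statement below. This is a hedged sketch, not an official Slony maintenance script: "_mycluster" is a placeholder for your actual cluster name, and you should check your own schema for every index built with xxid_ops (sl_log_1 is the one named in this thread).

```
-- Hypothetical maintenance statement, to be scheduled at least once per
-- ~1G transactions: rebuild the xxid_ops indexes on the Slony log table.
-- "_mycluster" is a placeholder for your real Slony cluster name.
REINDEX TABLE "_mycluster".sl_log_1;
```

In a real deployment this would be run from cron via psql against each node, well before the 1G-transaction mark rather than at it.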
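To see why a btree index breaks here, note that transaction-ID comparison under 32-bit wraparound is circular rather than total. The sketch below is an illustration of that general idea in Python (an assumption for exposition, not the actual Slony xxid_ops code): once values span more than 2^31 (~2G), the ordering becomes non-transitive, which violates the total-order assumption a btree relies on.

```python
# Sketch: circular (modulo-2^32) transaction-ID comparison, in the style of
# wraparound xid comparison. NOT the actual xxid_ops implementation -- an
# illustration of why the ordering stops being a total order past 2^31.

def xid_precedes(a: int, b: int) -> bool:
    """True if xid `a` logically precedes `b` under modulo-2^32 arithmetic."""
    diff = (a - b) & 0xFFFFFFFF
    # Interpreted as a signed 32-bit value: "negative" means a < b.
    return diff >= 0x80000000

# Three xids spread around the 32-bit circle, pairwise more than 2G apart:
a, b, c = 1, 0x60000000, 0xC0000000
print(xid_precedes(a, b))  # True: a precedes b
print(xid_precedes(b, c))  # True: b precedes c
print(xid_precedes(c, a))  # True as well -- the order is circular, not total
```

Because a < b, b < c, and c < a can all hold at once, a btree built on this comparison can place and look up keys inconsistently, which matches the observed symptoms: duplicate key errors and silently skipped rows.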
More information about the Slony1-general mailing list