Troy Wolf troy at troywolf.com
Fri Jan 16 09:06:34 PST 2009
I consider things carefully before throwing out the "BUG" word, but I
believe I have uncovered a bug related to log shipping. It is not a
show-stopper and does not cause missing data, data corruption, or
anything like that. However, it is a repeatable pattern and makes my
life with Slony more difficult than necessary.

We are a growing organization with a rapidly changing schema, so we
have a relatively frequent need to add new tables to the schema and
into replication. I developed a slonik shell script to make adding
new tables easy. It creates a temporary set (ID = 999), adds the
table to it, subscribes the new set, then merges set 999 into my real
set (ID = 1). When the database is not busy, this works perfectly.
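
For context, the script is essentially a thin wrapper around slonik,
roughly like the sketch below. The cluster name, conninfo strings, and
the table ID are placeholders, not my real configuration:

#!/bin/sh
slonik <<_EOF_
cluster name = my_cluster;
node 1 admin conninfo = 'dbname=mydb host=origin user=slony';
node 2 admin conninfo = 'dbname=mydb host=subscriber user=slony';

# temporary set holding just the new table
create set (id = 999, origin = 1,
    comment = 'temporary merge set to add a table into an existing set');
set add table (set id = 999, origin = 1, id = 50,
    fully qualified name = 'my_schema.my_new_table');

# subscribe it on the subscriber, wait for the copy to be confirmed,
# then fold it into the real set
subscribe set (id = 999, provider = 1, receiver = 2, forward = yes);
wait for event (origin = 1, confirmed = 2, wait on = 1);
merge set (id = 1, add id = 999, origin = 1);
_EOF_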

Before I started using log shipping, we would often see the pattern
that looks like lines 1 through 12 in the log below. We would then
investigate to figure out what the older transaction was and either
eliminate it or simply let Slony wait for it to complete naturally.
In either case, once the older transaction was out of the way, Slony
would complete the merge successfully.
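
(The investigation is usually nothing fancier than checking
pg_stat_activity on the origin for old open transactions, something
like the following. Host and database names are placeholders, and the
column names are the 8.x ones:)

# run against the origin database to see which transactions are old
psql -h origin -d mydb -c "
    SELECT procpid, usename, xact_start, current_query
      FROM pg_stat_activity
     WHERE xact_start IS NOT NULL
     ORDER BY xact_start;"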

Since enabling log shipping, we see the pattern shown below, and it
happens every time the merge has to wait behind older transactions.
The merge starts (line 1), then fails because there are older
transactions in the way. Slony sleeps for 15 seconds and tries again.
He fails again because the older transactions are still in the way.
Slony sleeps for 30 seconds and tries again. This time the older
transaction(s) have cleared, so Slony tries to continue with the
merge. However, he fails because he is unable to write to the log
shipping archive file (line 16).

The BUG, I believe, has something to do with Slony closing the file
handle or otherwise losing access to the archive file when he hits
the wait condition. There is nothing actually wrong with that tmp
file; there is no other reason Slony should be unable to write to it.

At this point, the way I "fix" this is simply to stop and restart the
slon process for the subscriber node (the node that creates the log
shipping files). I then also have to manually drop my temp set (ID =
999) and start the merge over when the database is not so busy.
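
Roughly speaking, the workaround looks like this. How slon is started
and stopped, and the conninfo strings, will obviously differ per
installation; these are placeholders:

# restart the slon daemon for the subscriber node so it reopens
# the log shipping archive file cleanly
/etc/init.d/slon-node2 restart    # or however slon is managed locally

# drop the orphaned temporary set so the merge can be attempted again
slonik <<_EOF_
cluster name = my_cluster;
node 1 admin conninfo = 'dbname=mydb host=origin user=slony';
node 2 admin conninfo = 'dbname=mydb host=subscriber user=slony';
drop set (id = 999, origin = 1);
_EOF_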

By the way, my configuration uses log_level=1 to reduce noise in the
log files. The lines below are all the lines from that time period. I
modified the schema, table, and directory path names to protect the
innocent.

=========================================================================================================
01: 2009-01-15 16:29:07 EST CONFIG storeSet: set_id=999 set_origin=1 set_comment='temporary merge set to add a table into an existing set'
02: 2009-01-15 16:29:07 EST CONFIG storeSubscribe: sub_set=999 sub_provider=1 sub_forward='t'
03: 2009-01-15 16:29:07 EST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
04: 2009-01-15 16:29:07 EST DEBUG1 copy_set 999
05: 2009-01-15 16:29:07 EST DEBUG1 remoteWorkerThread_1: connected to provider DB
06: 2009-01-15 16:29:07 EST WARN   remoteWorkerThread_1: transactions earlier than XID 1077488892 are still in progress
07: 2009-01-15 16:29:07 EST WARN   remoteWorkerThread_1: data copy for set 999 failed - sleep 15 seconds
08: 2009-01-15 16:29:22 EST DEBUG1 copy_set 999
09: 2009-01-15 16:29:22 EST DEBUG1 remoteWorkerThread_1: connected to provider DB
10: 2009-01-15 16:29:22 EST WARN   remoteWorkerThread_1: transactions earlier than XID 1077488892 are still in progress
11: 2009-01-15 16:29:22 EST WARN   remoteWorkerThread_1: data copy for set 999 failed - sleep 30 seconds
12: WARNING:  there is no transaction in progress
13: 2009-01-15 16:29:52 EST DEBUG1 copy_set 999
14: 2009-01-15 16:29:52 EST DEBUG1 remoteWorkerThread_1: connected to provider DB
15: NOTICE:  truncate of "my_schema"."my_new_table" succeeded
16: 2009-01-15 16:29:52 EST ERROR  remoteWorkerThread_1: Cannot write to archive file /opt/slony/slony1_log_2_00000000000001428342.sql.tmp - not open
17: 2009-01-15 16:29:52 EST WARN   remoteWorkerThread_1: data copy for set 999 failed - sleep 60 seconds
18: WARNING:  there is no transaction in progress
=========================================================================================================
