[Slony1-general] node is on archive counter 44, this archive log expects 45

Tue Apr 29 17:23:00 PDT 2008

Hi Jennifer,

Thanks for the info!

Indeed, when I query the sl_archive_tracking table, I see that the most
recently applied SQL file is #44:

shahaf_logshipped_slave=# select * from
_shahaf_cluster.sl_archive_tracking;
 at_counter |        at_created         |         at_applied
------------+---------------------------+----------------------------
         44 | 2008-04-29 13:41:26.20039 | 2008-04-29 13:56:24.255029
(1 row)

So, it makes sense that the logshipped slave would want #45 next.

To produce the SQL files I simply used the -a flag on the slon process
for my regular slave, as described here:
http://lists.slony.info/documentation/logshipping.html

I have a directory (/data/tmp/shahaf/logshipping/SYNC) that now includes
several hundred files like this:

[root at idb SYNC]# ls | more
slony1_log_10_00000000000000000001.sql
slony1_log_10_00000000000000000002.sql
slony1_log_10_00000000000000000003.sql
slony1_log_10_00000000000000000004.sql
slony1_log_10_00000000000000000005.sql

And, this directory does in fact contain files for the segment where the
problem occurred...

<snip>
slony1_log_10_00000000000000000043.sql
slony1_log_10_00000000000000000044.sql
slony1_log_10_00000000000000000045.sql
slony1_log_10_00000000000000000046.sql
slony1_log_10_00000000000000000047.sql
<snip>

Also, I verified that the file name and the contents are in sync.  In
other words, the file that ends in 045.sql includes the following:

select "_shahaf_cluster".archiveTracking_offline('45', '2008-04-29
13:41:36.201339');

To apply these logs, I'm using the /usr/bin/slony_logshipper tool, also
described here:
http://lists.slony.info/documentation/logshipping.html

So, either something is wrong with my setup, or there's something wrong
with the slony_logshipper tool.

To dig a little further, I tried an experiment.  I started with a fresh
shahaf_logshipped_slave DB and I applied the DB dump to it (the one I
got from the slony1_dump.sh script).  I then checked the
sl_archive_tracking table and found that at_counter = 34.  Then I
manually applied the scripts from the SYNC directory by feeding them
into psql one by one, starting with 035.sql.  This worked.  I was able
to go as high as 050.sql and I didn't see the issue described below.

Then I repeated my experiment by once again starting with a fresh DB on
which I applied the dump.  This time I went back to using the
slony_logshipper tool to do the work of choosing and applying the
scripts, one by one.  And once it failed with the same issue, only this
time it happened at #42/43 instead of #44/45.  Here's what I see in the
log file produced by the slony_logshipper tool:

INFO  2008-04-29 17:18:07 > Processing archive file
./SYNC/slony1_log_10_00000000000000000040.sql
INFO  2008-04-29 17:18:07 > NOTICE:  Slony-I: Process archive with
counter 40 created 2008-04-29 13:40:46.188066
INFO  2008-04-29 17:18:07 > Processing archive file
./SYNC/slony1_log_10_00000000000000000041.sql
INFO  2008-04-29 17:18:07 > NOTICE:  Slony-I: Process archive with
counter 41 created 2008-04-29 13:40:56.191449
INFO  2008-04-29 17:18:07 > Processing archive file
./SYNC/slony1_log_10_00000000000000000042.sql
WARN  2008-04-29 17:18:07 > skip archive with counter (r)V - already
applied
INFO  2008-04-29 17:18:07 > Processing archive file
./SYNC/slony1_log_10_00000000000000000043.sql
ERROR 2008-04-29 17:18:07 > PGRES_FATAL_ERROR: ERROR:  Slony-I: node is
on archive counter 41, this archive log expects 42
Query was: select "_shahaf_cluster".archiveTracking_offline('43',
'2008-04-29 13:41:16.196712');
WARN  2008-04-29 17:18:07 > waiting for resume

So, something went sideways when the tool tried to process file #42.  It
decided to skip it (I'm not sure what "counter (r)V" is all about).  As
a result, #43 failed.

How do you go about running the SQL scripts on your logshipped slave?

Thanks,

--S

-----Original Message-----
From: Jennifer Spencer [mailto:jennifer at sun.Stanford.EDU] 
Sent: Tuesday, April 29, 2008 4:39 PM
To: Shahaf Abileah
Cc: slony1-general at lists.slony.info
Subject: Re: [Slony1-general] node is on archive counter 44, this
archive log expects 45

Hi Shahaf,

I have a similar setup, using log shipping.  I have noticed that this
particular error message is not 
as helpful as you might want.  In fact, the log-shipper may actually
want log #46 or #246.  Just not 
the one you gave it (#44).  I think the error message just gives you
N+1, where N is the number of the 
log you tried to feed it.

Your log-shipped slave has a table called
_shahaf_cluster.sl_archive_tracking.  It should have been 
created by the slony1_dump.sh script.  That table will contain one row,
and the "at counter" (first 
field there) is the log number it ingested last.  It wants the
at_counter + 1.  So, in the slavedb, 
select * from _sharaf_cluster.sl_archive_tracking;

Your slony logs should be numbered, monotonically increasing, and you
should have that next log number 
in the directory where those logs live.  If you have that correct log,
have you tried applying it?

Which script are you using to apply the logs, anyway?

As to how it got "off" from the correct order, I am not sure.  You'd
have to check any logging 
programs you have, whose names are hopefully in the log-application
script.

-Jennifer Spencer

Shahaf Abileah wrote:
> Hello,
> 
>  
> 
> I'm trying to use the slony_logshipper tool to update a log-shipped
> slave.  I kicked off the process and monitored the log file.  After a
> short time, I noticed this in the log:
> 
>  
> 
> ERROR 2008-04-29 13:56:24 > PGRES_FATAL_ERROR: ERROR:  Slony-I: node
is
> on archive counter 44, this archive log expects 45
> 
> Query was: select "_shahaf_cluster".archiveTracking_offline('46',
> '2008-04-29 13:41:46.204843');
> 
> WARN  2008-04-29 13:56:24 > waiting for resume
> 
>  
> 
> I can see the archiveTracking_offline function in the dump file I got
> using slony1_dump.sh, the one I used to initialize my logshipped
slave.
> But I'm not familiar with the details of this replication scheme, so
I'm
> not sure why the archive counter would have gotten out of sync
(whatever
> that means).
> 
>  
> 
> Can anyone offer suggestions on how to fix this issue or how to
analyze
> it further?
> 
>  
> 
> Thanks,
> 
>  
> 
> --S
> 
>  
> 
>  
> 
> Shahaf Abileah | Lead Software Developer - Data Team
> 
> shahaf at redfin.com <mailto:shahaf at redfin.com>  | tel: 206.859.2869 |
> cell: 206.331.2057 | www.redfin.com <http://www.redfin.com> 
> 
> 
> 
> 
>
------------------------------------------------------------------------
> 
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general