Scott Marlowe smarlowe
Wed Jun 14 13:50:50 PDT 2006
Not sure why I'm getting this error.

We had an outage in our production systems on Saturday when our shared
storage went kablooey and the systems went offline for about 7 hours.

Most everything came back up ok, BUT, I got a failure on replication of
one set about an hour later, and I can't get it to come back.

urg!

So, if I look for the slon daemons, there are none.

ps af|grep "slon repo"

If I run my script to start replication, I get three slon processes:

ps af|grep "slon repo"
 5954 pts/0    S      0:00 /data01/pg/bin/slon reporting dbname=production_reporting user=postgres host=pg01
 5956 pts/0    S      0:00 /data01/pg/bin/slon reporting dbname=production_reporting user=postgres host=pg01
 5957 pts/0    S      0:00  \_ /data01/pg/bin/slon reporting dbname=production_reporting user=postgres host=pg01

I have two other replication sets on this machine.  They are running
normally, and each has 8 total processes / threads.  The statistic
database, for instance, looks like this:

 4882 pts/0    S      0:00 /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4884 pts/0    S      0:00 /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4885 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4886 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4888 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4890 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4891 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
 4892 pts/0    S      0:00  \_ /data01/pg/bin/slon statistic dbname=production_statistic user=postgres host=pg01
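
To compare the counts without reading the listings by eye, something like
the following might do.  It is only a sketch: count_slon_by_cluster is a
made-up helper name, and it assumes each process shows up in the ps output
as ".../slon <cluster> <conninfo>".

```shell
#!/bin/sh
# Hypothetical helper: count slon processes per cluster name.
# Reads ps-style lines on stdin and assumes the command part looks like
# "/some/path/slon <cluster> <conninfo>" (an assumption about the setup).
count_slon_by_cluster() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                # The field naming the slon binary is followed by the cluster name.
                if ($i == "slon" || $i ~ /\/slon$/) {
                    count[$(i + 1)]++
                    break
                }
            }
        }
        END {
            for (c in count) print c, count[c]
        }
    ' | sort
}
```

Fed the output of something like `ps axww -o args`, this should print one
line per cluster with its process count.  For the two listings above that
would be "reporting 3" and "statistic 8".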


So, I'm running my slon.master script by hand, and I see this:

/data01/pg/bin/slon reporting "dbname=production_reporting user=postgres host=pg01"
CONFIG main: slon version 1.0.5 starting up
CONFIG main: local node id = 1
CONFIG main: loading current cluster configuration
CONFIG storeNode: no_id=2 no_comment='Slave node'
CONFIG storePath: pa_server=2 pa_client=1 pa_conninfo="dbname=production_reporting host=pg02 user=postgres" pa_connretry=10
CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
CONFIG storeSet: set_id=1 set_origin=1 set_comment='All reporting tables'
CONFIG main: configuration complete - starting threads
FATAL  localListenThread: Another slon daemon is serving this node already

Note that I've filled in the env vars to make it easier to read.  In
reality this script looks more like /data01/pg/bin/slon $CLUSTERNAME
"dbname=$MASTERDBNAME user=$REPLICATIONUSER host=$MASTERHOST" 
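
Spelled out, the script amounts to roughly the following.  This is a
sketch, not the actual slon.master: SLON_BIN and slon_command are names
made up here, and the values are the ones filled in above.

```shell
#!/bin/sh
# Values as filled in by hand above; the real script takes them from the
# environment.
SLON_BIN=/data01/pg/bin/slon        # hypothetical variable name
CLUSTERNAME=reporting
MASTERDBNAME=production_reporting
REPLICATIONUSER=postgres
MASTERHOST=pg01

# Print the command line instead of running it, so the expansion can be
# checked without starting a daemon.
slon_command() {
    printf '%s %s "%s"\n' "$SLON_BIN" "$CLUSTERNAME" \
        "dbname=$MASTERDBNAME user=$REPLICATIONUSER host=$MASTERHOST"
}
```

Calling slon_command prints the same command line shown in the manual run
above.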

I've done an lsof, and can find nothing sitting on any ports for the
postgres user or with slon in its name.

Any hints would be greatly appreciated, as I'd rather not have to reboot
a production server to get replication up and running.

ipcs output looks normal too.


