Jason Chen yunfeng82 at gmail.com
Sun Oct 31 21:53:23 PDT 2010
Hi Steve,

After checking the logs, the core dump, and the sl_event table in detail, I
have figured out the root cause: it is a time change issue.

During the master node configuration, the system time changed after the slon
service was configured. I am configuring a VM on an ESX host whose time is 7
hours earlier than the NTP server that the VM uses after configuration. This
causes the slon sched thread to continuously check and wait for 7 hours. The
simple workaround here is to restart the slon service so that the sched
thread picks up the current time.
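
For illustration, here is a minimal sketch (the general pattern only, not
slon's actual scheduler.c code; wait_until is a hypothetical helper) of why a
wait whose deadline is taken from the wall clock stalls when the system time
is stepped backwards:

    #include <sys/select.h>
    #include <sys/time.h>
    #include <time.h>

    /* Hypothetical helper, not from slony: sleep until a wall-clock
     * deadline. */
    static void wait_until(time_t deadline)
    {
        struct timeval now, timeout;

        for (;;)
        {
            gettimeofday(&now, NULL);      /* wall-clock time */
            if (now.tv_sec >= deadline)
                break;                     /* deadline reached */

            /*
             * If the clock is stepped back (e.g. by 7 hours), this
             * difference grows by the size of the step and the thread
             * sits in select() until the wall clock catches up.
             */
            timeout.tv_sec  = deadline - now.tv_sec;
            timeout.tv_usec = 0;
            select(0, NULL, NULL, NULL, &timeout);
        }
    }

    int main(void)
    {
        wait_until(time(NULL) + 10);   /* step the clock back during this
                                          wait and it extends accordingly */
        return 0;
    }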

This might suggest a new requirement for Slony. Do we have any mechanism to
handle a time change other than restarting the slon service? Consider one
scenario: the slon service is configured successfully and many SYNC events
have been generated; then the user configures a new external NTP server whose
time is several days behind the current time. All previously unconfirmed SYNC
events then cannot be synced until the clock has caught up.
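
One possible direction (only a sketch of an idea, not something slony
implements today; cond_init_monotonic and wait_seconds are hypothetical names)
would be to time relative waits against CLOCK_MONOTONIC, which NTP steps
cannot move, e.g. via pthread_condattr_setclock:

    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond;

    static void cond_init_monotonic(void)
    {
        pthread_condattr_t attr;

        pthread_condattr_init(&attr);
        /* Time waits against the monotonic clock, so stepping the
         * system time neither shortens nor lengthens them. */
        pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
        pthread_cond_init(&cond, &attr);
        pthread_condattr_destroy(&attr);
    }

    static int wait_seconds(int secs)
    {
        struct timespec deadline;
        int rc;

        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_sec += secs;

        pthread_mutex_lock(&lock);
        rc = pthread_cond_timedwait(&cond, &lock, &deadline);
        pthread_mutex_unlock(&lock);
        return rc;          /* ETIMEDOUT once the interval has elapsed */
    }

    int main(void)
    {
        cond_init_monotonic();
        wait_seconds(5);    /* unaffected by wall-clock steps */
        return 0;
    }

That would keep the sched thread's own timers correct; how to handle event
timestamps already recorded in the database is a separate question.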

Could you share your insight on this potential issue?

Thanks,
Jason


On Sat, Oct 30, 2010 at 7:19 PM, Jason Chen <yunfeng82 at gmail.com> wrote:

> After the failing system ran for several hours, it became normal again, so
> I need to redeploy the testbed to capture the backtrace. Basically, I have
> compared the failing master node with a normal master node. The only
> difference is that the failing master has only 5 threads; the
> remoteListener and remoteWorker threads are missing. If you need the
> details, I will get them and let you know next Monday once I can access my
> system.
>
> Do you think there is any issue in the configuration process?
>
> Here is the backtrace of the error master node, which has now returned to
> normal.
>
> (gdb) thread apply all bt
>
> Thread 7 (Thread 0x4159a940 (LWP 6365)):
> #0  0x00007fc74a5f6da2 in select () from /lib64/libc.so.6
> #1  0x000000000041396e in sched_mainloop (dummy=<value optimized out>) at
> scheduler.c:532
> #2  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #3  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #4  0x0000000000000000 in ?? ()
>
> Thread 6 (Thread 0x4094c940 (LWP 6371)):
> #0  0x00007fc74a8894a6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000041334e in sched_wait_conn (conn=0x630d70, condition=0) at
> scheduler.c:230
> #2  0x00000000004056ee in localListenThread_main (dummy=<value optimized
> out>) at local_listen.c:701
> #3  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
>
> Thread 5 (Thread 0x41d9b940 (LWP 6376)):
> #0  0x00007fc74a8894a6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000041334e in sched_wait_conn (conn=0x6317f0, condition=0) at
> scheduler.c:230
> #2  0x0000000000412a7e in cleanupThread_main (dummy=<value optimized out>)
> at cleanup_thread.c:113
> #3  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
>
> Thread 4 (Thread 0x4274c940 (LWP 6380)):
> #0  0x00007fc74a8894a6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000041334e in sched_wait_conn (conn=0x642650, condition=0) at
> scheduler.c:230
> #2  0x00000000004125b6 in syncThread_main (dummy=<value optimized out>) at
> sync_thread.c:101
> #3  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
>
> Thread 3 (Thread 0x42f4d940 (LWP 8283)):
> #0  0x00007fc74a8894a6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000040c5c3 in remoteWorkerThread_main (cdata=<value optimized
> out>) at remote_worker.c:479
> #2  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #3  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #4  0x0000000000000000 in ?? ()
>
> Thread 2 (Thread 0x4374e940 (LWP 8285)):
> #0  0x00007fc74a8894a6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x000000000041334e in sched_wait_conn (conn=0x6433b0, condition=0) at
> scheduler.c:230
> #2  0x0000000000406b0a in remoteListenThread_main (cdata=<value optimized
> out>) at remote_listen.c:339
> #3  0x00007fc74a8852f7 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007fc74a5fd85d in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
>
> Thread 1 (Thread 0x7fc74aeba6e0 (LWP 6363)):
> #0  0x00007fc74a8865b5 in pthread_join () from /lib64/libpthread.so.0
> #1  0x0000000000413582 in sched_wait_mainloop () at scheduler.c:172
> #2  0x0000000000402f31 in SlonWatchdog () at slon.c:740
> #3  0x0000000000403c58 in main (argc=6, argv=0x7fff9a3202b8) at slon.c:355
>
>
>
> On Sat, Oct 30, 2010 at 4:28 AM, Steve Singer <ssinger at ca.afilias.info> wrote:
>
>> On 10-10-29 11:12 AM, Jason Chen wrote:
>>
>>> That is correct. On the failing node, the master cannot get the
>>> STORE_PATH event and cannot start the remoteListen and remoteWorker
>>> threads.
>>>
>>>
>> You mentioned previously something about gdb.
>>
>> Can you connect to the slon process while it is in this state to see what
>> it is doing?
>>
>> i.e. 'info threads' to display a list of threads,
>>
>> thread 1
>> thread 2
>> etc.
>> to switch between threads,
>>
>> and 'bt' to show the stack trace of each thread.