Core dumping at drop_ex_thread

Post by Mr_Plow » Thu, 07 Oct 2004 07:53:51


I have a multi-threaded C application in Solaris 8 that core dumps under
heavy load.
This load is a large number of x.25 calls, where a new thread is created
to handle each call - then destroyed when the call is terminated.
I can't seem to find any info on the last call in the stack
"drop_ex_thread".
I was directed to look at a Sun Bug (4886961 thr_create() can run on a
stack before the previous owner has
thr_exit()'ed) but haven't been able to obtain the details.
I have used a number of utilities to collect data, but still can't figure
out what might be causing the core dump.
I am assuming the address "ff344764" is out of bounds, but can't figure
out why it would be accessed.

In the following data, you'll see "process_ev_stream_msgs", which is my
function that tries to kill the thread after the stream is closed.

Thanks for any assistance!

core file = core -- program ``X25_MT'' on platform SUNW,Ultra-Enterprise
SIGSEGV: Segmentation Fault
$C
?() at ff344764
[savfp=0x0,savpc=0x0]
data address not found
?
0: 7f454c46 = call 0xfffffffffd153118
-----------------------
WORKSHOP:
t@12277 (l@102) terminated by signal SEGV (no mapping at the fault
address)
-----------------------
PSTACK:
core 'core' of 24636: /usr/gsm/current/bin/X25_MT + 1
----------------- lwp# 102 / thread# 12277 --------------------
ff344764 drop_ex_thread (0, ff179d20, 1, ff344764, 0, 4)
ff152224 _thrp_exit (ed70dd70, ff16e000, 0, ff3451b4, 0, 0) + 18
ff156550 _t_cancel (ff15b824, ff16e000, ed70d070, ff1521fc, ed70de14,
ed70de04) + c4
ff1521fc _thr_exit_common (ed70dee8, ff16ed30, ff16e000, ed70dd70, 0,
0) + 1a4
0002ef68 process_ev_stream_msgs (ee508, fe3f3d10, 1, ff17ad8c, 0, 2) +
3b0
ff15b744 _thread_start (ee508, 0, 0, 0, 0, 0) + 40
-----------------------
PFLAGS:
core 'core' of 24636: /usr/gsm/current/bin/X25_MT + 1
data model = _ILP32 flags = PR_RLC
flttrace = 0xfffffbff
sigtrace = 0xfffffeff 0xffffffff
entryset = 0x00000401 0x04000000 0x00000000 0x00000028
0x80000000 0x00000000 0x00000000 0x00000000
exitset = 0xfffffffe 0xffffffff 0xffffffff 0xffffffd7
0x7fffffff 0xffffffff 0xffffffff 0xffffffff
/102: flags = PR_PCINVAL
sigmask = 0xffffbefc,0x00001fff cursig = SIGSEGV
-----------------------
PCRED:
core of 24636: e/r/suid=110 e/r/sgid=110
------------------------
DBX:
process_ev_stream_msgs(0xee508, 0xfe3f3d10, 0x1, 0xff17ad8c, 0x0,
0x2)
_thr_exit_common(0xed70dee8, 0xff16ed30, 0xff16e000, 0xed70dd70, 0x0,
0x0)
_t_cancel(0xff15b824, 0xff16e000, 0xed70d070, 0xff1521fc, 0xed70de14,
0xed70de04)
_thrp_exit(0xed70dd70, 0xff16e000, 0x0, 0xff3451b4, 0x0, 0x0)
_destroy_tsd(0x0, 0xff179d20, 0x1, 0xff344764, 0x0, 0x4)
drop_ex_thread(0x43535400, 0x43535400, 0xff16e000, 0xff172720,
0xff179d20, 0xff179d40)

0x0002ef68: process_ev_stream_msgs+0x03b0: call thr_exit [PLT]
0xff1521fc: _thr_exit_common+0x01a4: call _tcancel_all
0xff156550: _t_cancel+0x00c4: call _thrp_exit
0xff152224: _thrp_exit+0x0018: call _destroy_tsd
0xff154420: _destroy_tsd+0x0094: jmpl %i3, %o7
0xff344764: drop_ex_thread : ld [%o0], %o3

(dbx) where -l
=>[1] libCrun.so.1:drop_ex_thread(0x43535400, 0x43535400, 0xff16e000,
0xff172720, 0xff179d20, 0xff179d40), at 0xff344764
[2] libthread.so.1:_destroy_tsd(0x0, 0xff179d20, 0x1, 0xff344764, 0x0,
0x4), at 0xff154420
[3] l
 
 
 

Core dumping at drop_ex_thread

Post by Jonathan A » Thu, 07 Oct 2004 11:03:20

In article
< XXXX@XXXXX.COM >,
"Mr_Plow" < XXXX@XXXXX.COM > wrote:



You can use pmap(1m) to look at the address space of the process. But
that's not the address which the program is tripping over -- see below.

...

This looks like an accurate stack trace.

^^^^^^^^^^
...
...

This is part of libCrun, the C++ runtime library. It's the destructor
(see pthread_key_create(3thr)) for a piece of thread-specific data which
holds the state for C++'s exception handling. It decompiles to roughly:

    void
    drop_ex_thread(void *p)
    {
        my_info *ptr = p;
        if (ptr->needs_free) {
            free(ptr);
        }
    }

The argument (%o0) to drop_ex_thread() is 0x43535400 from the above
register dump. That looks pretty bogus -- in fact, it's the string
"CST\0". There's some form of massive memory corruption going on here,
which has corrupted libthread's state with what looks like a time zone
name.
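
To make the "CST\0" reading concrete, here is a small sketch (not from the original poster's code; the function names are made up for illustration) that splits a 32-bit register value into bytes, most significant first, the way a big-endian SPARC stores the word in memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 32-bit word into its four bytes, most significant byte first.
 * On big-endian SPARC this is also the order the bytes have in memory. */
void word_to_bytes(uint32_t word, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(word >> 24);
    out[1] = (unsigned char)(word >> 16);
    out[2] = (unsigned char)(word >> 8);
    out[3] = (unsigned char)word;
}

/* Render the word as text, using '.' for non-printable bytes.
 * dst must have room for 5 chars (4 bytes plus the terminator). */
void word_to_text(uint32_t word, char dst[5])
{
    unsigned char b[4];
    int i;

    assert(dst != NULL);
    word_to_bytes(word, b);
    for (i = 0; i < 4; i++)
        dst[i] = (b[i] >= 0x20 && b[i] < 0x7f) ? (char)b[i] : '.';
    dst[4] = '\0';
}
```

Passing 0x43535400 to word_to_text() yields "CST." (0x43 'C', 0x53 'S', 0x54 'T', then a NUL byte shown as '.'), which is why the value looks like the tail of a formatted time string rather than a pointer.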

I'd investigate getting a malloc debug library.

Cheers,
- jonathan

P.S. I took a look at:
4886961 thr_create() can run on a stack before the previous owner has
thr_exit()'ed

but that only applies to a thread which has already reached the
lwp_exit() system call -- this thread is still processing TSD (which is
one of the first stages of thread exiting).

 
 
 

Core dumping at drop_ex_thread

Post by Mr_Plow » Fri, 08 Oct 2004 06:07:03

Thanks.

Even though it doesn't really matter...with pmap I get this:
FF340000 48K read/exec /usr/lib/libCrun.so.1
FF35A000 8K read/write/exec /usr/lib/libCrun.so.1
FF35C000 16K read/write/exec /usr/lib/libCrun.so.1

I figured this was some malloc thing, so ran Purify on it. But, this
slowed down the application to an extent where this failure no longer
happened.

Who passes this %o0 to drop_ex_thread?
Does it just use %o0 because it's the first one?
How might those strange contents get in there?
If it matters...my application logs a message on each disconnection,
with an entry that includes the time, set with "localtime_r( (time_t
*)&now, &tm_buf )". I vaguely remember having to change something in my
calls to localtime to fix some cores in the past.

I searched for some malloc debug libraries, and came across several. Any
suggestions for which one to use?
 
 
 

Core dumping at drop_ex_thread

Post by Jonathan A » Fri, 08 Oct 2004 07:52:41

In article
< XXXX@XXXXX.COM >,


The thread library does -- it's the contents of the thread-specific data
buffer for that key for that thread. It's stored in an array allocated
using malloc(3C), with other keys nearby.
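
As a rough illustration of that mechanism, the sketch below uses the POSIX interfaces (pthread_key_create, pthread_setspecific) rather than libCrun's internals. The key, the worker and the counters are hypothetical names invented for the example. It stores a heap pointer in a thread's slot and lets the library hand that pointer to the destructor at thread exit:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t demo_key;   /* hypothetical key, for illustration only */
static int destructor_calls;     /* how many times the destructor ran */
static int last_seen_value;      /* value found in the slot at thread exit */

/* Called by the thread library at thread exit with whatever value is
 * stored under demo_key for that thread, but only if it is non-NULL. */
static void demo_destructor(void *p)
{
    int *v = p;

    assert(v != NULL);
    last_seen_value = *v;
    destructor_calls++;
    free(v);
}

/* Thread body: store a heap pointer in this thread's slot, then exit. */
static void *demo_worker(void *arg)
{
    int *v = malloc(sizeof *v);

    (void)arg;
    if (v != NULL) {
        *v = 42;
        if (pthread_setspecific(demo_key, v) != 0)
            free(v);
    }
    return NULL;
}

/* Returns the number of destructor calls seen, or -1 on setup failure. */
int run_tsd_demo(void)
{
    pthread_t tid;

    destructor_calls = 0;
    if (pthread_key_create(&demo_key, demo_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0) {
        pthread_key_delete(demo_key);
        return -1;
    }
    pthread_join(tid, NULL);
    pthread_key_delete(demo_key);
    return destructor_calls;
}
```

The destructor trusts whatever it finds in the slot. If something else in the heap overwrites that slot with a value such as 0x43535400, the destructor is called with that value and dereferences it, which matches the fault seen in drop_ex_thread.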


That's the calling convention on SPARC -- first argument to the function
goes in %o0, next argument is %o1, on up to sixth argument in %o5. %o6
and %o7 are special -- %o6 is also known as %sp (the stack pointer), and
%o7 typically contains the (return address - 8).


My bet would be a too-short buffer being overrun. i.e. someone is doing
something like:

    char *s = malloc(n);
    ascftime(s, "... %Z", &tm_buf);

where 'n' is too small to hold everything generated by ascftime() (or
whatever function is being used for the call).
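
For comparison, a bounded version might look like the sketch below. It swaps in the standard strftime(), which takes the buffer size, in place of ascftime(), which does not. The function name and the format string are assumptions made for illustration, not the application's actual logging code:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Format a timestamp, including the time zone name, into buf.
 * Returns the number of characters written (excluding the terminator),
 * or 0 if the result does not fit in buflen bytes. */
size_t format_log_time(char *buf, size_t buflen, const struct tm *tm)
{
    size_t n;

    assert(buf != NULL && tm != NULL);
    if (buflen == 0)
        return 0;

    /* strftime never writes more than buflen bytes. */
    n = strftime(buf, buflen, "%Y-%m-%d %H:%M:%S %Z", tm);
    if (n == 0)
        buf[0] = '\0';   /* contents are unspecified on failure */
    return n;
}
```

A caller would pass something like `char stamp[64]` together with `sizeof stamp` and treat a return of 0 as "buffer too small" instead of letting the formatted text run past the end of the allocation.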


I haven't looked at them in a while, so I'm not sure which to recommend.

- jonathan
 
 
 

Core dumping at drop_ex_thread

Post by Mr_Plow » Sun, 10 Oct 2004 06:29:08

I took out the logged messages, just to see what it does...and now a
malloc and the 0x43535400 come up on the stack in a new core dump. No big
surprise:
t@51683 (l@48) terminated by signal SEGV (no mapping at the fault
address)

(dbx) where -l
=>[1] libc.so.1:_smalloc(0x10, 0xff0c27b0, 0x4, 0x10, 0x0, 0x0), at
0xff041c78
[2] libc.so.1:realloc(0xff0c0600, 0x10, 0xff0bc004, 0xff172720, 0x10,
0x0), at 0xff041fcc
[3] libthread.so.1:_ti_pthread_setspecific(0xff179d10, 0x4,
0xff16e000, 0x10, 0x3941b8, 0x0), at 0xff1542fc
[4] X25_MT.bk.1007:init_mps_thread(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at
0x4afd8
[5] X25_MT.bk.1007:process_fd_msgs(0x55, 0xfed83d10, 0x0, 0x5, 0x1,
0xfe400000), at 0x38fec
(dbx) regs
current thread: t@51683
current frame: [1]
g0-g3 0x00000000 0xff14c788 0x00000000 0x00000000
g4-g7 0x00000000 0x00000000 0x00000000 0xf330dd70
o0-o3 0x00000018 0x43535400 0xff0c27c4 0xff0bc004
o4-o7 0xff0c2840 0x00000000 0xf330cee0 0xff041bf0
l0-l3 0xff0c0600 0x00000000 0x00000000 0x00000000
l4-l7 0x00000000 0x00000000 0x00000000 0x00000000
i0-i3 0x00000010 0xff0c27b0 0x00000004 0x00000010
i4-i7 0x00000000 0x00000000 0xf330cf40 0xff041fcc
y 0x804c40f1
ccr 0xfe001006
pc 0xff041c78:_smalloc+0x8c ld [%o1 + 0x8], %o0
npc 0xff041c7c:_smalloc+0x90 st %o0, [%i2 + %i1]
(dbx)
0xff041c78: _smalloc+0x008c: ld [%o1 + 0x8], %o0
0x00038fec: process_fd_msgs+0x000c: call init_mps_thread
0x0004afd8: init_mps_thread+0x0098: call pthread_setspecific
[PLT]
0xff1542fc: _ti_pthread_setspecific+0x00b8: call
_PROCEDURE_LINKAGE_TABLE_+0x420 [PLT]
0xff041fcc: realloc+0x005c: call _malloc_unlocked
0xff041c78: _smalloc+0x008c: ld [%o1 + 0x8], %o0
(dbx) dis
0xff041c7c: _smalloc+0x0090: st %o0, [%i2 + %i1]
0xff041c80: _smalloc+0x0094: ld [%o1], %o0
0xff041c84: _smalloc+0x0098: or %o0, 0x1, %o0
0xff041c88: _smalloc+0x009c: st %o0, [%o1]
0xff041c8c: _smalloc+0x00a0: add %o1, 0x8, %o0
0xff041c90: _smalloc+0x00a4: ret
0xff041c94: _smalloc+0x00a8: restore %g0, %o0, %o0
0xff041c98: malloc : save %sp, -0x60, %sp
0xff041c9c: malloc+0x0004: call malloc+0xc
0xff041ca0: malloc+0x0008: sethi %hi(0x7a000), %o1
 
 
 

Core dumping at drop_ex_thread

Post by Jonathan A » Sun, 10 Oct 2004 07:15:15

In article
< XXXX@XXXXX.COM >,



Yup -- heap corruption.

- jonathan
 
 
 

Core dumping at drop_ex_thread

Post by Neem » Tue, 09 Nov 2004 15:32:02

Hi Mr_Plow and Jonathan,
I added this query here because my question is closely related to this
one. I am getting the same problem:
----------------------------------------------------------
detected a multithreaded program
t@1 (l@1) terminated by signal BUS (invalid address alignment)
0xfeec1e58: _malloc_unlocked+0x0164: ld [%i1], %o1
-----------------------------------------------------------
I am assuming heap corruption. Please help me trace the point where my
heap gets corrupted. I have a Purify build of the executable, but that
does not core dump. How can I use it to solve my problem?

Thanks in advance...Please let me know asap.
 
 
 

Core dumping at drop_ex_thread

Post by Markus.Elf » Tue, 09 Nov 2004 19:29:47

> detected a multithreaded program

Can the following articles give you any ideas?
- http://www.yqcomputer.com/
- http://www.yqcomputer.com/
- http://www.yqcomputer.com/