gdb (linux) "print" command clears memory corruption - so how do I find my bug?

Post by kreuite » Fri, 05 Dec 2003 20:01:18


I am looking for some advice on how to debug a program when the
debugger's "print" command actually clears the corruption. This is not
the usual non-initialised memory problem, because the program aborts
with a SIGBUS inside the debugger as well. But when I use the print
command inside the debugger, the program completes normally.

I am using gdb on a linux system. The offending C code is:

memcpy(new_entry, &newloc, IRECPTRLEN);

I display these values just before the memcpy:

printf("Calling memcpy(%p, %p, %d)\n", new_entry, &newloc,
IRECPTRLEN);

... which works. When run straight from gdb (snipped a bit):

$ gdb xwif
(gdb) b src/c_library.c:598
Breakpoint 1 at 0x804bca3: file src/c_library.c, line 598.
(gdb) run
Starting program: /home/dev/bin/xwif -p
Calling memcpy(0x4001f000, 0xbffff04c, 4)

Breakpoint 1, c$keyed_write (p=0x80520a0, record=0x80658a0 "\002") at
src/c_library.c:598
598 memcpy(new_entry, &newloc, IRECPTRLEN);
(gdb) s

Program received signal SIGBUS, Bus error.
0x4207c46c in memcpy () from /lib/i686/libc.so.6

But when I use "print" before "step":


$ gdb xwif
(gdb) b src/c_library.c:598
Breakpoint 1 at 0x804bca3: file src/c_library.c, line 598.
(gdb) r

Starting program: /home/dev/bin/xwif -p
Calling memcpy(0x4001f000, 0xbffff04c, 4)

Breakpoint 1, c$keyed_write (p=0x80520a0, record=0x80658a0 "\002") at
src/c_library.c:598
598 memcpy(new_entry, &newloc, IRECPTRLEN);
(gdb) p new_entry
$1 = 0x4001f000 ""
(gdb) s
599 new_entry += IRECPTRLEN;
(gdb)

... and it completes successfully.

I *know* that I am corrupting memory somewhere (I am calling mmap). I
wrote a small program to test the way I am using mmap(), and it works.
But when I try to include it in a much larger application, it aborts.
I am not asking you to debug my program, nor for help on mmap()
(although, if you really want to spend hours stepping through my code,
I won't object :-) But I am requesting help with techniques to debug
programs exhibiting symptoms like the above.

(I originally posted this to comp.lang.c, but suspect that I might have
chosen the wrong newsgroup. Perhaps someone can also advise me how I
determine which group to post a query to; is there a FAQ on choosing
newsgroups?)
 
 
 


Post by mru » Fri, 05 Dec 2003 20:22:54


XXXX@XXXXX.COM (Gavin Kreuiter) writes:


You have a Schrödinger bug, i.e. one that changes when you observe it.
Since you are running Linux, I'd suggest you try out valgrind,
http://www.yqcomputer.com/ . It's a debugger that will usually catch
most kinds of memory-related bugs.


Browse the list of groups and find those that look appropriate to you.
Then fetch the FAQ for those groups, and see if it looks right. It
might even answer your question.

--
Måns Rullgård
XXXX@XXXXX.COM

 
 
 


Post by Paul Pluzh » Sat, 06 Dec 2003 01:39:53


XXXX@XXXXX.COM (Gavin Kreuiter) writes:


Hmm ... This behaviour is quite rare.
What is the output from "cat /proc/<pid>/maps" just before memcpy()?
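
If stopping in the debugger is itself suspect, the program can dump its own mappings right before the memcpy(). A minimal sketch, assuming Linux's /proc/self/maps is available; dump_maps is a made-up helper name, not something from the original program:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy /proc/self/maps to `out`.
 * Returns the number of lines copied, or -1 if the file cannot be
 * opened (for example on a system without a Linux-style /proc). */
static int dump_maps(FILE *out)
{
    char buf[512];
    int lines = 0;
    FILE *maps;

    assert(out != NULL);
    maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    while (fgets(buf, sizeof buf, maps) != NULL) {
        fputs(buf, out);
        if (strchr(buf, '\n') != NULL)   /* count only completed lines */
            lines++;
    }
    fclose(maps);
    return lines;
}
```

Calling dump_maps(stderr) just before line 598 would show whether 0x4001f000 falls inside a mapping and which file, if any, backs it.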


This doesn't look like memory corruption.
This looks like mmap() with strange/incorrect flags, and possibly
an interaction with or a bug in Linux ptrace (which linux is it BTW?).


Or perhaps the application mmap()s something "on top of" your
previous mapping? [But that should not be affected by "gdb print",
I think].


Well, if you give me access to the debug binary, I can see what I
can do for you (I am curious to find the root cause).

XXXX@XXXXX.COM (Måns Rullgård) writes:


Always good advice.

Though even if valgrind tells what the bug is, I'd still be curious
why it disappears with "gdb print".

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
 
 
 


Post by Bjorn Rees » Sat, 06 Dec 2003 04:22:05


It could be related to the cpu cache (e.g. the new_entry variable
may be cached after the print command) or the cpu pipeline.

The original poster does not mention how the code is compiled, but
it may be worth trying to compile with and without optimization.

--
mail1dotstofanetdotdk
 
 
 


Post by Paul Pluzh » Sat, 06 Dec 2003 04:47:55

"Bjorn Reese" < XXXX@XXXXX.COM > writes:



Given that all the values were also printed from within the program
on the previous line, this is quite an unlikely explanation.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
 
 
 


Post by Bjorn Rees » Sun, 07 Dec 2003 02:52:49


Without further information we can only continue to speculate,
but maybe only memcpy exhibits the cache/pipeline/whatever
problem and printf does not. After all, memcpy tends to be a
highly optimized inlined function/macro, whereas printf is too
complicated to be inlined. So the use of printf may "fix" the
crash in the same manner as the gdb print command does.

I still would be interested in knowing if the code behaves
differently depending on whether optimization has been turned
on or off.

--
mail1dotstofanetdotdk
 
 
 


Post by kreuite » Fri, 19 Dec 2003 21:17:37

Thanks to all who replied. I was actually looking for advice on how
to debug a problem of this nature. Valgrind seems like a good bet for
the future, although it didn't help in this case. As Paul suggested, it
wasn't memory corruption as such; in essence, it was dereferencing an
out-of-bounds pointer (the mmap'd file's disk size is zero).
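
A guard along the following lines might have caught this earlier. It is only a sketch: map_file_checked is a hypothetical name, and the check is deliberately conservative (it refuses any mapping longer than the file, even though bytes in the last partial page past end-of-file can normally be read without a fault):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map the first `len` bytes of `path` shared and
 * read/write, but refuse if the file on disk is shorter than `len`.
 * Returns MAP_FAILED on refusal or on any error. */
static void *map_file_checked(const char *path, size_t len)
{
    struct stat st;
    void *p = MAP_FAILED;
    int fd;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR);
    if (fd == -1)
        return MAP_FAILED;

    /* Pages that lie wholly past end-of-file raise SIGBUS when touched. */
    if (fstat(fd, &st) == 0 && st.st_size >= 0 &&
        (unsigned long long)st.st_size >= len)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);   /* the mapping, if any, stays valid after close() */
    return p;
}
```

With a zero-length file this returns MAP_FAILED instead of handing back a pointer that faults on first use.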

I managed to reproduce gdb's curious behavior in the small program
below, and include a gdb session for the sake of completeness
(although the reason why this occurs remains a mystery):

-------------------------- <snip> ----------------------------
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(void) {
    int i, fd;
    char *ptr;

    fd = open("data", O_RDWR | O_CREAT | O_TRUNC, 0777);
    ptr = mmap(NULL, 32768, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for (i = 4096; i <= 32768; i += 4096) {
        printf("setting file size to %d\n", i);  /* but no ftruncate() follows */
        printf("ptr[%d] = %d\n", i-1, ptr[i-1]); /* line 14: SIGBUS, file is empty */
    }
    return 0;
}
-------------------------- <snip> ----------------------------
$ gdb demo
(gdb) r
setting file size to 4096
Program received signal SIGBUS, Bus error.
0x08048406 in main () at demo.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);

(gdb) b 14
Breakpoint 1 at 0x80483fc: file demo.c, line 14.
(gdb) r
setting file size to 4096
Breakpoint 1, main () at demo.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);

(gdb) p ptr[i-1]
$1 = 0 '\0'

(gdb) c
Continuing.
ptr[4095] = 0
setting file size to 8192
Breakpoint 1, main () at demo.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);

(gdb) p ptr[i-1]
$2 = 0 '\0'

(gdb) c
Continuing.
ptr[8191] = 0
setting file size to 12288
Breakpoint 1, main () at demo.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);

(gdb) c
Continuing.

Program received signal SIGBUS, Bus error.
0x08048406 in main () at demo.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);

-------------------------- <snip> ----------------------------
As can be seen from the above, accessing the out-of-bounds pointer
raises SIGBUS; dereferencing it first from gdb (via "print") somehow
resets this, so that no SIGBUS is produced.

- The demo is a modified version from Stevens, without calling
ftruncate() to increase the file size on disk.
- I am running Red Hat 8.0.
- valgrind terminates with a bus error, without additional info.
- Optimisation had no visible effect.
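
For contrast, here is a sketch of the loop with the missing ftruncate() call put back. The function name grow_and_read and its parameters are mine, not taken from Stevens' code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `total` bytes of a fresh file, then grow the
 * file `step` bytes at a time and read the last byte of each new chunk.
 * Returns the number of chunks read, or -1 if open/mmap fails. */
static int grow_and_read(const char *path, size_t total, size_t step)
{
    char *ptr;
    size_t i;
    int fd, chunks = 0;

    assert(path != NULL && step > 0 && total % step == 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    ptr = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    for (i = step; i <= total; i += step) {
        /* Extend the file first, so ptr[i-1] is backed by real data. */
        if (ftruncate(fd, (off_t)i) == -1)
            break;
        printf("file size %zu: ptr[%zu] = %d\n", i, i - 1, ptr[i - 1]);
        chunks++;
    }

    munmap(ptr, total);
    close(fd);
    return chunks;
}
```

Called as grow_and_read("data", 32768, 4096), every read lands inside the file, so the loop should run to completion whether or not gdb peeks at the memory first.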
 
 
 


Post by Paul Pluzh » Sat, 20 Dec 2003 02:05:17


XXXX@XXXXX.COM (Gavin Kreuiter) writes:


This is a kernel bug: gdb performs "ptrace(PTRACE_PEEKTEXT, pid, ptr+4095, ...)",
and the kernel "automagically" extends the vma to "cover" the
[ptr, ptr+4095] range.


Also reproduces with kernels 2.4.6 and 2.4.20 (RH-9.0).

If anyone could reproduce this on 2.6.0 kernel, this should be
reported, as it makes debugging this particular problem unnecessarily
difficult.

For comparison, here is Solaris behaviour:

Breakpoint 1, main () at mmap4.c:14
14 printf("ptr[%d] = %d\n", i-1, ptr[i-1]);
(gdb) p ptr[i-1]
Cannot access memory at address 0xff390fff.
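
On Linux, one possible workaround is to probe the address from inside the process, so that no ptrace peek is involved. This is only a sketch under assumptions: probe_read is a hypothetical helper, it is not thread-safe, and it temporarily replaces any SIGBUS/SIGSEGV handlers the program already has:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf probe_env;

/* Signal handler: jump back into probe_read() instead of dying. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Hypothetical helper: returns 1 if one byte at `p` can be read,
 * 0 if the read raises SIGBUS or SIGSEGV. */
static int probe_read(const volatile char *p)
{
    struct sigaction sa, old_bus, old_segv;
    volatile int ok = 0;

    assert(p != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old_bus);
    sigaction(SIGSEGV, &sa, &old_segv);

    /* savemask = 1 so the signal mask is restored after the jump */
    if (sigsetjmp(probe_env, 1) == 0) {
        volatile char c = *p;   /* the read that may fault */
        (void)c;
        ok = 1;
    }

    /* Put the previous handlers back. */
    sigaction(SIGBUS, &old_bus, NULL);
    sigaction(SIGSEGV, &old_segv, NULL);
    return ok;
}
```

Logging probe_read(ptr + i - 1) from the program itself, rather than using "p ptr[i-1]" in gdb, should report the page as unreadable without changing what happens next.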

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
 
 
 


Post by mru » Sat, 20 Dec 2003 02:29:53

Paul Pluzhnikov < XXXX@XXXXX.COM > writes:


Linux 2.6.0-test11 does it.

--
Måns Rullgård
XXXX@XXXXX.COM