Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Neil Brow » Thu, 31 Mar 2005 13:10:07



Thanks. I had another look, and think I may be able to see the
problem. If I'm right, it is a problem with this patch.



I think the "!buffer_update(bh)" test is not safe at this point as,
after the wait_on_buffer which could cause a schedule, the bh may
no longer exist, or be for the same block. There doesn't seem to be
any locking or refcounting that would keep it valid.

Note the comment "the journal_head may have been removed now".
If the journal_head is gone, the associated buffer_head is likely gone
as well.

I'm not certain that this is right, but it seems possible and would
explain the symptoms. Maybe Stephen or Andrew could comments?



As an aside, this looks extremely dubious to me.

There is a loop earlier in this routine (do_generic_file_write) that
passes a piece-at-a-time of the write request to prepare_write /
commit_write.
Successes are counted in 'written'. A failure causes the loop to
abort with a status in 'status'.

If some of the write succeeded and some failed, then I believe the
correct behaviour is to return the number of bytes that succeeded.
However this change to the return status (remember the above patch is
a reversal) causes any failure to over-ride any success. This, I
think, is wrong.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Marcelo To » Fri, 01 Apr 2005 02:10:12


Seems to be possible, yes.


Andrew, Stephen?


Yeap, that part also looks wrong.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/

 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Thu, 07 Apr 2005 07:50:07

Hi,




Sorry, was offline for a week last week; I'll try to look at this more
closely tomorrow. Checking the buffer_uptodate() without either a
refcount or a lock certainly looks unsafe at first glance.

There are lots of ways to pin the bh in that particular bit of the
code. The important thing will be to do so without causing leaks if
we're truly finished with the buffer after this flush.



Certainly it's normal for a short read/write to imply either error or
EOF, without the error necessarily needing to be returned explicitly.
I'm not convinced that the Singleunix language actually requires that,
but it seems the most obvious and consistent behaviour.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Thu, 07 Apr 2005 23:30:14

Hi,





Yes. But it is conventional to interpret a short write as being a
failure. Returning less bytes than were requested in the write
indicates that the rest failed. It just doesn't give the exact nature
of the failure (EIO vs ENOSPC etc.) For regular files, a short write is
never permitted unless there are errors of some description.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Fri, 08 Apr 2005 05:20:10

Hi,




Thanks --- it certainly looks like this should fix the dereferencing of
invalid bh's, and this code mirrors what 2.6 does around that area.

However, 2.6 is suspected of still having leaks in ext3. To be certain
that we're not just backporting one of those to 2.4, we need to
understand who exactly is going to clean up these bh's if they are in
fact unused once we complete this code and do the final put_bh().

I'll give this patch a spin with Andrew's fsx-based leak stress and see
if anything unpleasant appears.

Cheers,
Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Sat, 09 Apr 2005 01:00:23

Hi,




I'm currently running with the buffer-trace debug patch, on 2.4, with a
custom patch to put every buffer jbd ever sees onto a per-superblock
list, and remove it only when the bh is destroyed in
put_unused_buffer_head(). At unmount time, we can walk that list to
find stale buffers attached to data pages (invalidate_buffers() already
does such a walk for metadata.)

I just ran it with a quick test and it found 300,000 buffers still
present at unmount. Whoops, I guess I need to move the check to _after_
the final invalidate_inodes() call. :-)

But this method ought to allow us to test for this sort of leak a lot
more reliably, and to report via the usual buffer history tracing the
most recent activity on any bh that does leak.

Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested
for Marcelo. I've got uptodate buffer_trace for 2.6 anyway, so it
shouldn't be too hard.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Andrew Mor » Wed, 13 Apr 2005 05:50:11


Correct.


Presumably these pages have no ->mapping, so try_to_release_page() will
call try_to_free_buffers().


Yes, if the buffers are dirty then 2.4's try_to_free_buffers() won't free
them.


Why do we have dirty buffers left over at umount time?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Wed, 13 Apr 2005 07:40:10

Hi,




Good --- I think I've got the debugging solid against 2.4 for that case,
patching against 2.6 now.


Exactly.


In the leak case we're worried about, the buffers are dirty but the page
is anon so there's nothing to clean them up. The buffers _will_ be
destroyed by unmount, sure; invalidate_bdev() should see to that. But
I'm doing the bh leak check before we get to the final
invalidate_bdev(), so they should still be available for testing at that
point.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Thu, 14 Apr 2005 23:40:15

Hi,




...

In further testing, this chunk is causing some problems. It is entirely
legal for a buffer to be !uptodate here, although the path is somewhat
convoluted. The trouble is, once the IO has completed,
journal_dirty_data() can steal the buffer from the committing
transaction. And once that happens, journal_unmap_buffer() can then
release the buffer and clear the uptodate bit.

I've run this under buffer tracing and can show you this exact path on a
trace. It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have
waited on the data, so it has an opportunity to see the !uptodate state
at that point.

I think it's safe enough to sidestep this by double-checking to see
whether the bh is still on the committing transaction after we've waited
for the buffer and retaken journaling locks, but before testing
buffer_uptodate().

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Fri, 29 Apr 2005 07:50:11

Hi,





It turns out that simply testing for

(!buffer_uptodate(bh) && buffer_mapped(bh))

is enough to fix this: the situation where the uptodate bit is cleared
manually here also clears the buffer_mapped bit, and that is done under
bh lock so we don't even have to worry about the order in which the
bitops occur. There's another path later on in commit.c where the same
test encounters the same problem; that gets cured by the same fix. With
the EIO errors turned into BUG()s for debugging, I've been able to run
under severe load, with 50msec commit intervals, for a day or so on SMP
without running into any false positives.

In checking out the patch one last time, though, I found one anomaly.
The patch that was submitted to 2.6 for this problem also changed
write_inode_now() to return an error value. The patch that got
submitted to 2.4 had no such change. Was there a reason for this
difference between the two versions?

Cheers,
Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

Post by Stephen C. » Fri, 29 Apr 2005 20:20:17

Hi,




At present I'm using your changed version, but that is wrong --- in most
cases, we want to return short writes on errors. It's only in the
synchronous case that we want to do otherwise, and even then only on EIO
on the inode. For example, if we run out of quota halfway through an
O_SYNC write beyond EOF, then the bulk of the IO *has* completed
successfully, and we want to indicate that in the error. So I'll be
re-jigging that bit of the code.

Cheers,
Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/