[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Brow » Sat, 01 Nov 2003 00:30:21



Does that 0xf0000000 (on ia64) match any obvious address mapping
of the null pointer -- like a dma mapping? I'm not sure that if
the HID driver were to pass a null buffer pointer, it would be
caught anywhere.

- Dave



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://www.yqcomputer.com/
Please read the FAQ at http://www.yqcomputer.com/
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sat, 01 Nov 2003 05:20:13

>>>>> On Thu, 30 Oct 2003 07:11:42 -0800, David Brownell < XXXX@XXXXX.COM > said:



>> On x86, there is no OOps, it just freezes. On ia64, I get a nice MCA
>> and from that we can infer that a USB host controller read from
>> address 0xf0000000 caused the problem but since this is asynchronous
>> to the kernel's code path, the instruction pointer etc. in the MCA
>> state dump isn't terribly helpful.

David> Does that 0xf0000000 (on ia64) match any obvious address mapping
David> of the null pointer -- like a dma mapping?

Not really. AFAIK, 0xf0000000 is part of the PCI MMIO address space,
but on the machines that I have access to, this particular address
isn't assigned to any device:

$ lspci -v|fgrep 'Memory at'
Memory at 0000000080000000 (32-bit, prefetchable) [size=128M]
Memory at 0000000088000000 (32-bit, non-prefetchable) [size=512K]
Memory at 00000000d0023000 (32-bit, non-prefetchable) [size=4K]
Memory at 00000000d0022000 (32-bit, non-prefetchable) [size=4K]
Memory at 00000000d0021000 (32-bit, non-prefetchable) [size=256]
Memory at 00000000d0020000 (32-bit, non-prefetchable) [size=4K]
Memory at 00000000d0000000 (32-bit, non-prefetchable) [size=128K]
Memory at 00000000e0200000 (32-bit, non-prefetchable) [size=4K]
Memory at 00000000e0100000 (32-bit, non-prefetchable) [size=1M]
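
The by-eye check against the lspci output above can be sketched in C. This is a minimal illustration, assuming the nine windows listed are the only ones of interest; `struct mem_window` and `find_window()` are hypothetical names, not kernel or pciutils API.

```c
#include <stddef.h>
#include <stdint.h>

/* One PCI memory window, as printed by "lspci -v". */
struct mem_window {
    uint64_t base;
    uint64_t size;
};

/* Base addresses and sizes copied from the lspci output above. */
static const struct mem_window windows[] = {
    { 0x80000000ULL, 128ULL << 20 },  /* 128M */
    { 0x88000000ULL, 512ULL << 10 },  /* 512K */
    { 0xd0023000ULL,   4ULL << 10 },  /* 4K   */
    { 0xd0022000ULL,   4ULL << 10 },  /* 4K   */
    { 0xd0021000ULL, 256ULL       },  /* 256  */
    { 0xd0020000ULL,   4ULL << 10 },  /* 4K   */
    { 0xd0000000ULL, 128ULL << 10 },  /* 128K */
    { 0xe0200000ULL,   4ULL << 10 },  /* 4K   */
    { 0xe0100000ULL,   1ULL << 20 },  /* 1M   */
};

/* Hypothetical helper: return the index of the window that contains
 * addr, or -1 if no listed window covers it. */
static int find_window(uint64_t addr)
{
    size_t n = sizeof(windows) / sizeof(windows[0]);

    for (size_t i = 0; i < n; i++) {
        if (addr >= windows[i].base &&
            addr - windows[i].base < windows[i].size)
            return (int)i;
    }
    return -1;
}
```

Calling `find_window(0xf0000000ULL)` on this table finds no match, which agrees with the statement that the faulting address is not assigned to any device on this machine.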

David> I'm not sure that if the HID driver were to pass a null
David> buffer pointer, it would be caught anywhere.

OK, I'll try to find some time to trace the I/O MMU calls to see if
something isn't kosher at that level. Is there a good way of getting
a relatively high level of tracing in the USB subsystem that would
show me what's going on between the HID and the core USB level?

--david

 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Brow » Sun, 02 Nov 2003 01:40:10


I think there are some devices that *** the HID
code; I recall someone reporting a mouse that did the
same kind of thing. Do other kinds of keyboards do
the same thing, or is it just that one?

Vojtech may have other suggestions.

- Dave

 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 02 Nov 2003 03:40:27

>>>>> On Fri, 31 Oct 2003 08:23:54 -0800, David Brownell < XXXX@XXXXX.COM > said:


>> After spending a bit more time on this, it looks to me like the
>> keyboard is crashing the system very early on.

David.B> I think there are some devices that *** the HID
David.B> code;

And nobody is alarmed by this? Surely crashing the kernel by plugging
in a USB device must be considered a MUST-FIX item. Perhaps I missed
something, but I never saw this mentioned before.

David.B> I recall someone reporting a mouse that did the same kind of
David.B> thing. Do other kinds of keyboards do the same thing, or is
David.B> it just that one?

Ugh, I only have about half a dozen or so different types of USB
devices (and even fewer of them are HID devices), so my experience
isn't exactly a statistically valid sample. Having said that, out of
that 6 or so devices, that particular keyboard is the only one causing
crashes. However, note that it works (mostly) fine under 2.4 and even
if the keyboard were total crap, it certainly shouldn't crash the
kernel.

--david
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by Valdis.Kle » Sun, 02 Nov 2003 04:00:23

On Fri, 31 Oct 2003 10:34:22 PST, David Mosberger said:


Bill Gates. Comdex. Printer. :)

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh version 2.5 07/13/2001

iD8DBQE/oq7kcC3lWbTT17ARAt30AKCH9gDYJJfsdkVppnq1vpCEDDmmDwCgvP68
7K0OKtSp+C4KhCtn+Tj5u/w=
=wHIj
-----END PGP SIGNATURE-----
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Brow » Sun, 02 Nov 2003 04:40:13


You sound alarmed! If that's alarmed enough to find out what
the real problem is, maybe you'll end up fixing it ... :)

I could be wrong about the problem being in the HID code, but
that does look like a likely home for the bug. We know there
are other issues with HID/input/hiddev/... that need attention.



Agreed, oopsing == bad. HID needs more attention. I suspect whoever
dives into that will want to know what you mean by "(mostly) fine";
that might give a clue about what 2.6 changes worsened the failures.

- Dave

 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 02 Nov 2003 05:00:08

>>>>> On Fri, 31 Oct 2003 11:28:20 -0800, David Brownell < XXXX@XXXXX.COM > said:

David.B> You sound alarmed! If that's alarmed enough to find out
David.B> what the real problem is, maybe you'll end up fixing it
David.B> ... :)

Except I know almost nothing about the USB stack.

David.B> Agreed, oopsing == bad. HID needs more attention. I
David.B> suspect whoever dives into that will want to know what you
David.B> mean by "(mostly) fine"; that might give a clue about what
David.B> 2.6 changes worsened the failures.

Are you saying nobody is maintaining HID?

--david
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David S. M » Sun, 02 Nov 2003 05:20:09

On Fri, 31 Oct 2003 11:50:01 -0800



David, get real, this is never an excuse for people of our
caliber. :-)

You, myself, and many others are more than intelligent enough and more
than capable enough to debug subsystems we are not familiar with or
even have never looked at before.

As platforms maintainers, such a skill is nearly a necessity.

When I hit a problem in some subsystem and I can't provide enough
information to the subsystem maintainer for them to fix the bug, I
have to do the debugging work if I want the bug fixed.
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Brow » Tue, 04 Nov 2003 13:00:17


Most of that story is just submitting and completing URBs.

I'd either try changing the spots in drivers/usb/core/hcd.c
marked as appropriate for generic MONITOR_URB hooks (printk
if it's your HID device, maybe), or manually turn on whatever
HCD-specific hooks exist (maybe use a VERBOSE message level).

Such a thing wasn't possible in 2.4 since there were too
many different bizarre (and sometimes buggy) ways for URBs
to return to the usb device drivers and get implicitly
resubmitted.

- Dave




 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Wed, 05 Nov 2003 06:30:30

>>>>> On Sun, 02 Nov 2003 19:46:38 -0800, David Brownell < XXXX@XXXXX.COM > said:

David> I'm not sure that if the HID driver were to pass a null
David> buffer pointer, it would be caught anywhere.
>> OK, I'll try to find some time to trace the I/O MMU calls to see
>> if something isn't kosher at that level. Is there a good way of
>> getting a relatively high level of tracing in the USB subsystem
>> that would show me what's going on between the HID and the core
>> USB level?

Dave.B> Most of that story is just submitting and completing URBs.

Yeah. And it appears that it's the very first call to
hid_submit_ctrl() that's triggering the problem (not always, but about
9 out of 10 times). I dumped some of the key fields for the URB being
submitted and they all looked sane to me.

Dave.B> I'd either try changing the spots in drivers/usb/core/hcd.c
Dave.B> marked as appropriate for generic MONITOR_URB hooks (printk
Dave.B> if it's your HID device, maybe), or manually turn on
Dave.B> whatever HCD-specific hooks exist (maybe use a VERBOSE
Dave.B> message level).

OK, thanks for the suggestion. I'll keep looking, but will be on
travel this week, so I may not be able to spend much time on this
problem.

--david
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 07 Mar 2004 11:20:05

OK, finally a bit of progress. If you remember back in October 2003 I
reported:

> One-line summary: plug-in your USB keyboard, see your machine die.

> So, I have this non-name USB keyboard (with built-in 2-port USB
> hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
> In retrospect, it's clear to me that the same keyboard also
> occasionally crashes 2.4 kernels, but there the problem appears
> more seldom. Perhaps once in 10 reboots and once the machine is
> booted and the keyboard is running, it keeps on working. The
> keyboard in question is a BTC 5141H.

After this, I spent a (small) amount of time looking over the HID code
etc to see what could be causing it. I could find nothing wrong so I
gave up, connected another USB keyboard, and basically ignored the
problem. In retrospect, that was Good Thinking, because I was
apparently looking at the wrong code: the problem _does_ appear to be
coming from the USB HCD, not from the HIDeous code.

Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
I tested would crash as soon as they had _any_ USB keyboard plugged
in. That is, the problem no longer was limited to the BTC keyboard,
which is special because it has a built-in hub. This was encouraging.

Turns out it's this patch that was causing the crashes:

http://linux.bkbits.net:8080/linux-2.5/ XXXX@XXXXX.COM

That was strange, because even to my USB-untrained eye the patch
looked obviously correct. However, I think the root cause of the
problem really has to do with a race-condition between the controller
and the driver. In particular, if I apply the patch below, my USB
keyboards (including the BTC keyboard) work just fine!

===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c Fri Mar 5 17:25:55 2004
@@ -438,7 +451,7 @@
          * behave.  frame_no wraps every 2^16 msec, and changes right before
          * SF is triggered.
          */
-        ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+        ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;

         /* rm_list is just singly linked, for simplicity */
         ed->ed_next = ohci->ed_rm_list;

However, I think the root-cause of the problem may be this optimization
in ohci_irq():

/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */

Indeed, if I apply this patch instead:

===== drivers/usb/host/ohci-hcd.c 1.56 vs edited =====
--- 1.56/drivers/usb/host/ohci-hcd.c Tue Mar 2 05:52:40 2004
+++ edited/drivers/usb/host/ohci-hcd.c Fri Mar 5 17:45:09 2004
@@ -584,7 +584,7 @@
         int ints;

         /* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
-        if ((ohci->hcca->done_head != 0)
+        if (0 && (ohci->hcca->done_head != 0)
                         && ! (le32_to_cpup (&ohci->hcca->done_head) & 0x01)) {
                 ints = OHCI_INTR_WDH;


there are no crashes either.

So my theory is that I was seeing this sequence of events:

- HCD signals WDH interrupt & sends DMA to update the frame number in
the host-controller communication area (HCCA)

- host gets interrupt, but skips readl() and hence reads a stale
frame number N instead of the up-to-date value (N+1)

- HCD cancels a transfer descriptor (TD), moves it to the "remove list"
and calculates the frame number at which it can be removed from
the host-controller's list as N+1

- SOF interrup
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 07 Mar 2004 11:20:11

Typo-alert:


David> - HCD ends up dereferencing a bad pointer and ends up
David> reading from address 0xf0000000, which on our ia64 machines
David> is a read-only area, which then results in a machine-check
David> abort
            ^^^^^^^^^
            make that "write-only"

--david
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Brow » Sun, 07 Mar 2004 14:10:07

David Mosberger wrote:

Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!

There's something weird going on recently for sure, and it's
not caused just by that patch. I'm not sure reverting that
would make things better overall right now; hard to say.

I'm thinking the "disable periodic schedule" patch is also
worth looking at. I can imagine some silicon having bugs
if that's turned off (just like some _systems_ have issues
when it's left on).



I've seen that _change_ behavior in some regression tests
(usbtest test11/test12), as if the extra msec let things
quiesce (so only one of two broken states showed, and
not the oopsable one) but not _fix_ it.



See this post from Martin Diehl ... my response isn't out yet:

http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107850825815775&w=2

The reason I keep ending up thinking that readl-elimination
must be OK (me agreeing with Martin) is that the memory there
came from dma_alloc_coherent() ... so if anything's wrong,
it'd be at most lack of rmb(), not a stale-cache kind of thing.



It reads the frame number directly from the controller, so it's
not possible that it can be so stale that an rmb() wouldn't fix
sequencing issues.

What might be possible though is that the donelist gets modified
by the time the unlinks get processed, with some extra TDs changing
state (from HC perspective) ... haven't explored that possibility.



That'd be an ED on the remove list, not a TD. Also in dma-coherent
memory. The cancelation would apply to one or more of the TDs
queued to that ED.



This ED-level disagreement between the HC and HCD might explain
some issues. I think the current trouble cases are usually where
there's only one TD queued to the ED, so that TD and the dummy
keep ping-ponging back and forth. The HC certainly seems to
overwrite the "dummy TD" --


I'm surprised that DMA from a read-only area would be a problem! :)
If OHCI is getting a PCI error, I'd expect a "UE" IRQ.



Parts of it. There's definite recent nastiness. Of the type
that other eyes sometimes see better.



I still suspect some problem in the HID code. But right now
I'm quite certain of a recent-ish OHCI issue.

- Dave




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to XXXX@XXXXX.COM
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 07 Mar 2004 15:00:08

>>>>> On Fri, 05 Mar 2004 20:55:01 -0800, David Brownell < XXXX@XXXXX.COM > said:
>> Turns out it's this patch that was causing the crashes:

>> http://linux.bkbits.net:8080/linux-2.5/ XXXX@XXXXX.COM

David.B> Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!

Of course. What I'm saying is that in 2.6.0-test{8,9} it was rare to
trigger the problem (only with the BTC keyboard) and the change above
made it trivial to trigger. Basically, your fix in
cset 1.1619.1.17 made it more common for stuff to be unlinked in
the "deferred" (proper) manner and that made it much more likely to
trigger the bug.

David.B> The reason I keep ending up thinking that readl-elimination
David.B> must be OK (me agreeing with Martin) is that the memory
David.B> there came from dma_alloc_coherent() ... so if anything's
David.B> wrong, it'd be at most lack of rmb(), not a stale-cache
David.B> kind of thing.

It's not an issue of DMA coherency, it's an issue of DMA vs. interrupt
ordering. I believe the WDH interrupt is arriving at the CPU before
the DMA update to the HCCA is done. In my second patch, the readl()
at the beginning of the interrupt ensures that the DMA update to
the HCCA is completed before the readl() returns data.

David.B> It reads the frame number directly from the controller, so
David.B> it's not possible that it can be so stale that an rmb()
David.B> wouldn't fix sequencing issues.

Eh, it's read like this:

#define OHCI_FRAME_NO(hccap) ((u16)le32_to_cpup(&(hccap)->frame_no))

finish_unlinks (ohci, OHCI_FRAME_NO(ohci->hcca), ptregs);

The HCCA is in host memory.
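
A userspace sketch of what that macro does may help here. It assumes the frame field is a little-endian 32-bit word in ordinary memory, as the macro implies; `le32_load()`, `frame_no()` and `tick_reached()` are hypothetical stand-ins written for this illustration, not the kernel's own helpers.

```c
#include <stdint.h>

/* Portable stand-in for the kernel's le32_to_cpup(): decode a
 * little-endian 32-bit value from memory on any host byte order. */
static uint32_t le32_load(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Mirrors OHCI_FRAME_NO(): only the low 16 bits form the frame number. */
static uint16_t frame_no(const uint8_t *hcca_frame_field)
{
    return (uint16_t)le32_load(hcca_frame_field);
}

/* Hypothetical helper: has the 16-bit frame counter `now` reached
 * `tick`?  The subtraction is done modulo 2^16 so the comparison keeps
 * working when the counter wraps. */
static int tick_reached(uint16_t now, uint16_t tick)
{
    return (int16_t)(uint16_t)(now - tick) >= 0;
}
```

Nothing in this read touches the controller: the value is whatever the last DMA write left in memory, so it is only as fresh as that write.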

>> - HCD ends up dereferencing a bad pointer and ends up reading
>> from address 0xf0000000, which on our ia64 machines is a
>> read-only area, which then results in a machine-check abort

David.B> I'm surprised that DMA from a read-only area would be a
David.B> problem! :) If OHCI is getting a PCI error, I'd expect a
David.B> "UE" IRQ.

You must have not received my follow-up message. There was a typo in
my message: it was supposed to say "write-only" area.

David.B> I still suspect some problem in the HID code. But right
David.B> now I'm quite certain of a recent-ish OHCI issue.

I'm 99% certain that the problem I saw back in October (BTC keyboard)
was identical to the one triggered by cset 1.1619.1.17.

--david
 
 
 

[linux-usb-devel] serious 2.6 bug in USB subsystem?

Post by David Mosb » Sun, 07 Mar 2004 16:30:11

>>>>> On Fri, 5 Mar 2004 21:49:20 -0800, David Mosberger < XXXX@XXXXX.COM > said:

David> It's not an issue of DMA coherency, it's an issue of DMA
David> vs. interrupt ordering. I believe the WDH interrupt is
David> arriving at the CPU before the DMA update to the HCCA is
David> done.

Actually, it looks like I misunderstood the OHCI spec on first reading.
It seems like the causal relationship goes like this:

(1) Start of Frame -> (2) update HccaFrameNumber -> (3) trigger SF interrupt

Now, suppose you get a WDH interrupt between (1) and (2). You'd read
the old frame-number yet by the time the interrupt from (3) arrives
the HC might already be accessing the ED that you're about to remove.

If this is correct, then the first patch is probably a better
approach:

===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c Tue Mar 2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c Fri Mar 5 17:25:55 2004
@@ -438,7 +451,7 @@
          * behave.  frame_no wraps every 2^16 msec, and changes right before
          * SF is triggered.
          */
-        ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+        ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;

         /* rm_list is just singly linked, for simplicity */
         ed->ed_next = ohci->ed_rm_list;

This actually makes tons of sense if you think of it like jiffies: you
need to make sure you delay at least one full frame-interval. If you
set the tick to "+ 1" and the current tick is almost over, that
requirement is violated. Setting it to "+ 2" should be safe. The
only problem I can think of is if the delay between point (1) and (2)
were to exceed one frame-interval (1 msec). While unlikely, the right
PCI topology and heavy bus traffic perhaps could cause such delays.
However, even then it's probably OK because the HC would presumably
stall when trying to update the HccaFrameNumber the second time and
the previous update hasn't completed yet.
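
The jiffies analogy can be checked with a small timing model. This is only a sketch under simplifying assumptions: the frame counter increments exactly on each 1 msec boundary, the driver always reads an up-to-date counter, and 16-bit wraparound is ignored. `frame_at()`, `wait_us()` and `min_wait_us()` are hypothetical names used for illustration.

```c
#include <stdint.h>

#define FRAME_US 1000u  /* one full-speed USB frame lasts 1 msec */

/* Frame number visible at absolute time t_us in this model. */
static uint32_t frame_at(uint32_t t_us)
{
    return t_us / FRAME_US;
}

/* Microseconds the driver actually waits if it unlinks at t_us and
 * releases the ED when the counter first reaches frame_at(t_us) + delta. */
static uint32_t wait_us(uint32_t t_us, uint32_t delta)
{
    uint32_t tick = frame_at(t_us) + delta;

    return tick * FRAME_US - t_us;
}

/* Smallest wait over every possible unlink time within one frame. */
static uint32_t min_wait_us(uint32_t delta)
{
    uint32_t best = UINT32_MAX;

    for (uint32_t t = 0; t < FRAME_US; t++) {
        uint32_t w = wait_us(t, delta);

        if (w < best)
            best = w;
    }
    return best;
}
```

In this model "+ 1" can leave as little as one microsecond between the unlink and the release, while "+ 2" never leaves less than a full frame interval. A stale frame number read by the driver would shorten the "+ 1" wait further, which the model does not attempt to show.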

Here is one little piece of evidence that's consistent with this
explanation: last week I tried to rip some audio tracks off a CD.
With PIO, this caused interrupts to get delayed 2-3msec and that
caused all kinds of weird effects on the USB bus. Mostly, I'd
suddenly lose the keyboard or the mouse, though reconnecting them
would "fix" the problem for a short time.

--david