ECC errors: is my RAM bad?

Post by Kalle Olav » Sun, 14 Dec 2003 17:45:42


I powered up my AlphaStation 500/266 after it had been off for
several months. Now, during fsck, I get lots of errors like this
(many per second):

CIA machine check: vector=0x630 pc=0xfffffc00004913e8 code=0x86
machine check type: correctable ECC error (retryable)
pc = [<fffffc00004913e8>] ra = [<fffffc000034c1b8>] ps = 0000 Not tainted
v0 = 0000000000000038 t0 = 0000000000000000 t1 = bfffd2e500000000
t2 = 0000000000000000 t3 = 0000000000000030 t4 = 0000000000000020
t5 = 00000001200c6c78 t6 = fffffc00168d9fc8 t7 = fffffc001fbf4000
a0 = fffffc001fbf7eb8 a1 = 000000000007c0a4 a2 = 0000000000000000
a3 = 0000000000002000 a4 = 00000001200c4cb0 a5 = 000000011ffffa80
t8 = 0000000000008000 t9 = 0000020000170e58 t10= 0000000000000008
t11= 0000000000000001 pv = fffffc00004912e0 at = fffffc000034c834
gp = fffffc0000547458 sp = fffffc001fbf7e08

Does this mean the RAM has gone bad? What is the failing address?
The memory configuration is:

Memory Size = 512Mb

Bank    Size/Sets   Base Addr   Speed
------  ----------  ---------   -----
00      256Mb/2     000000000   Fast
01      256Mb/2     010000000   Fast


This machine has SRM V7.2-2 (Apr 4 2000 17:17:43). I tried the
"memory" command, which should run "memtest" with the appropriate
arguments, but it just fails with "Invalid group selection".
Are there any NVRAM variables that could cause ECC errors?

I am using Debian GNU/Linux, kernel-image-2.4.21-5-generic.

Post by Kalle Olav » Sun, 14 Dec 2003 18:59:18

Kalle Olavi Niemitalo < XXXX@XXXXX.COM > writes:


I forgot to mention I also moved it to a different room, where it
now lies flat on the desk. In the previous room, it had been
standing on its right side (where there are no ventilation holes)
in a tower orientation of sorts. I suppose this change might
have loosened a contact, or something.

Anyway, I have now rebooted Linux with mem=256M, and there have been
no ECC errors so far, which makes me suspect the flaw is in the
second bank.
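
The arithmetic behind that suspicion can be sketched in Python. This is only an illustration: it assumes mem=256M caps the memory Linux uses at 256 * 2^20 bytes counted from address 0, and the names BANKS, bank_for_address and banks_within_limit are made up here. The bank bases and sizes are taken from the memory configuration in the first post.

```python
MB = 1024 * 1024

# (bank number, base address, size in bytes), copied from the memory
# configuration listing in the first post.
BANKS = [
    (0, 0x000000000, 256 * MB),
    (1, 0x010000000, 256 * MB),
]


def bank_for_address(addr):
    """Return the number of the bank containing physical address addr, or None."""
    for bank, base, size in BANKS:
        if base <= addr < base + size:
            return bank
    return None


def banks_within_limit(limit):
    """Return the banks that lie entirely below the given memory limit."""
    return [bank for bank, base, size in BANKS if base + size <= limit]


if __name__ == "__main__":
    limit = 256 * MB                  # assumed effect of mem=256M
    print(hex(limit))                 # 0x10000000, the base address of bank 01
    print(banks_within_limit(limit))  # [0]
    print(bank_for_address(limit))    # 1
```

Under that assumption the limit lands exactly on the base address of bank 01, so the kernel would only be touching bank 00.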


Post by mru » Sun, 14 Dec 2003 20:18:42

Kalle Olavi Niemitalo < XXXX@XXXXX.COM > writes:


Most likely. It did when I got it, at least.


Impossible to tell from that information.

--
Måns Rullgård
XXXX@XXXXX.COM

Post by mru » Sun, 14 Dec 2003 20:20:28

Kalle Olavi Niemitalo < XXXX@XXXXX.COM > writes:


Take the memory out, then put it back, just to be sure there isn't a
loose connection.


That seems reasonable to me.

--
Måns Rullgård
XXXX@XXXXX.COM

Post by Kalle Olav » Mon, 15 Dec 2003 03:42:17


XXXX@XXXXX.COM (Måns Rullgård) writes:


No effect. :-(

Post by Kalle Olav » Mon, 15 Dec 2003 04:14:45


XXXX@XXXXX.COM (Måns Rullgård) writes:


There were no other messages in /var/log/kern.log or on the
console between the messages that I already posted.

How can I tell Linux to display the address?
Or doesn't Linux even know it?

I would be interested in knowing whether the fault is at a
specific address or spans the entire bank. I fear it's the
latter, because I get so many of those ECC messages.
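
One thing that can be done with only the messages as posted is to tally them. The sketch below is a hedged illustration in Python: it assumes the log lines look exactly like the two "CIA machine check" and "machine check type" lines in the first post, and the function name summarize is made up. Note that pc is the program counter of the interrupted instruction, not the failing memory address, so this only shows whether the errors cluster around particular code.

```python
import re
from collections import Counter

# Patterns modeled on the two message lines quoted in the first post.
HEADER = re.compile(
    r"CIA machine check: vector=(0x[0-9a-f]+) pc=(0x[0-9a-f]+) code=(0x[0-9a-f]+)"
)
MCTYPE = re.compile(r"machine check type: (.+)")


def summarize(lines):
    """Count machine checks grouped by (pc, code, type description)."""
    counts = Counter()
    current = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            current = header.groups()
            continue
        mctype = MCTYPE.search(line)
        if mctype and current is not None:
            _vector, pc, code = current
            counts[(pc, code, mctype.group(1).strip())] += 1
            current = None
    return counts


if __name__ == "__main__":
    # Path taken from the post; reading it may require root.
    with open("/var/log/kern.log", errors="replace") as log:
        for key, count in summarize(log).most_common(10):
            print(count, *key)
```

If nearly every entry shares one pc, the errors may follow a particular piece of code or buffer; many different pc values would fit a fault that is hit from all over.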

I tried to trigger the error by writing and reading at the
beginning of the second bank in SRM:


I didn't get any unusual messages from this. Does SRM report
recoverable ECC errors at all?

If I specify mem=256M to Linux, can I still access the second
bank via /dev/mem?

Post by mru » Mon, 15 Dec 2003 04:33:35

Kalle Olavi Niemitalo < XXXX@XXXXX.COM > writes:


Then you'll just have to swap it out.

--
Måns Rullgård
XXXX@XXXXX.COM

Post by mru » Mon, 15 Dec 2003 05:03:55

Kalle Olavi Niemitalo < XXXX@XXXXX.COM > writes:


I don't know. Try poking around in arch/alpha/kernel/irq_alpha.c and
core_YOURCHIPSET.c.


It could be that one data pin is broken somehow. That would make the
whole module useless. I once had a 32 MB module that had an error at
just one place. The first 8 MB were fine, IIRC.


You could try playing around with memexer; I think that's what it's
called. It's supposed to loop over a region of memory, reading and
writing until it's stopped. I've never managed to get any useful
information from it, though.
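
The general idea of such a read/write exerciser can be sketched in Python. This is not how memexer itself is implemented; the function name exercise and the 0x55/0xAA patterns are assumptions chosen for illustration, and it runs over an ordinary user-space buffer rather than a chosen physical region.

```python
def exercise(buf, passes=1, patterns=(0x55, 0xAA)):
    """Fill buf with each pattern, read it back, and collect mismatches.

    Returns a list of (offset, expected, actual) tuples.
    """
    mismatches = []
    for _ in range(passes):
        for pattern in patterns:
            # Write phase: store the pattern in every byte.
            for offset in range(len(buf)):
                buf[offset] = pattern
            # Read phase: compare every byte against what was written.
            for offset, value in enumerate(buf):
                if value != pattern:
                    mismatches.append((offset, pattern, value))
    return mismatches


if __name__ == "__main__":
    region = bytearray(1024 * 1024)  # 1 MB of ordinary process memory
    print(exercise(region, passes=2))
```

With a correctable ECC error the hardware hands back corrected data, so a loop like this would probably report no mismatches; its use would be to generate traffic while watching for machine check messages.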


I wouldn't think so.

--
Måns Rullgård
XXXX@XXXXX.COM