Help with Sol 7 Fatal PCI UE Error - intermittent


Post by David McCa » Tue, 03 Feb 2004 18:31:03


Is this a memory module failure or a cpu4 failure?

Thanks in advance.

dmc


Feb 1 22:54:40 e4500a unix: WARNING: uncorrectable error from pci2 (upa mid 3) during dvma read transaction
Feb 1 22:54:40 e4500a unix: Transaction was a block operation.
Feb 1 22:54:40 e4500a unix: AFSR=40000000.63800000 AFAR=00000000.ce9ad898,
Feb 1 22:54:40 e4500a double word offset=3, Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801 id 3.
Feb 1 22:54:40 e4500a unix: panic[cpu4]/thread=2a10025fd60:
Feb 1 22:54:40 e4500a unix: Fatal PCI UE Error
Feb 1 22:54:40 e4500a unix:
Feb 1 22:54:40 e4500a unix: syncing file systems...
Feb 1 22:54:40 e4500a unix: WARNING: md: d11: write error on /dev/dsk/c0t11d0s0
Feb 1 22:54:43 e4500a last message repeated 1 time
Feb 1 22:54:43 e4500a unix: 23
Feb 1 22:54:45 e4500a unix: 5
Feb 1 22:54:46 e4500a unix: 3
Feb 1 22:54:56 e4500a last message repeated 8 times
Feb 1 22:54:57 e4500a unix: cannot sync -- giving up
Feb 1 22:54:58 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 1 22:56:04 e4500a unix: 100% done: 58260 pages dumped, compression ratio 3.10,
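
The "double word offset=3" in the panic message can be cross-checked against the AFAR itself. The sketch below is only an illustration: it assumes 8-byte doublewords inside a 64-byte block, and the function names are made up for this example.

```python
import re

# Assumptions for this sketch: a 64-byte block made of 8-byte doublewords.
LINE_BYTES = 64
DWORD_BYTES = 8

def afar_from_log(line: str) -> int:
    """Pull 'AFAR=hhhhhhhh.llllllll' (or 'AFAR 0x...') out of a log line."""
    m = re.search(r"AFAR[=\s]+(?:0x)?([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})", line)
    if m is None:
        raise ValueError("no AFAR field found")
    # The two halves are the upper and lower 32 bits of one address.
    return (int(m.group(1), 16) << 32) | int(m.group(2), 16)

def doubleword_offset(addr: int) -> int:
    """Index of the 8-byte doubleword within its 64-byte block."""
    return (addr % LINE_BYTES) // DWORD_BYTES

line = "AFSR=40000000.63800000 AFAR=00000000.ce9ad898,"
afar = afar_from_log(line)
print(hex(afar), doubleword_offset(afar))  # 0xce9ad898 3
```

The low six bits of 0xce9ad898 are 0x18, which is doubleword 3, in line with what the kernel printed.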


==============================================================================

System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 84 MHz
Memory size: 4096Mb

========================= CPUs =========================

                   Run    Ecache  CPU     CPU
Brd  CPU  Module   MHz    MB      Impl.   Mask
---  ---  -------  -----  ------  ------  ----
0    0    0        336    4.0     US-II   2.0
0    1    1        336    4.0     US-II   2.0
2    4    0        336    4.0     US-II   2.0
2    5    1        336    4.0     US-II   2.0
4    8    0        336    4.0     US-II   2.0


========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd  Bank   MB    Status   Condition   Speed  Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
0    0      1024  Active   OK          60ns   4-way    A
0    1      1024  Active   OK          60ns   4-way    A
2    0      1024  Active   OK          60ns   4-way    A
2    1      1024  Active   OK          60ns   4-way    A

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot  Name                               Model
---  ----  ----  ----  ---------------------------------  ----------------------
1    PCI   33    0     SUNW,hme-pci108e,1001              SUNW,cheerio
1    PCI   33    1     SUNW,hme-pci108e,1001              SUNW,cheerio
1    PCI   66    2     network-pci108e,2bad               SUNW,pci-gem
1    PCI   33    3     SUNW,isptwo/sd (block)             QLGC,ISP1040B
1    PCI   33    4     SUNW,isptwo-pci1077,1020/sd (blo+  QLGC,ISP1040B

Detached Boards
===============
Slot  State      Type    Info
----  ---------  ------  -----------------------------------------
3     disabled   disk    Disk 0: Target: 10 Disk 1: Target: 11
5     disabled   disk    Disk 0: Target: 12 Disk 1: Target: 13

No failures found in System
===========================

No System Faults found
======================


David McCall

UNIX Administrator

AdvancedTelcomGroup

XXXX@XXXXX.COM

 
 
 


Post by drubsledke » Wed, 04 Feb 2004 02:36:43


I had this exact same symptom. It turned out to be a memory error.
It was a production E3500, and I didn't have the luxury of swapping
out each memory module one at a time to find the bad one, so I just
replaced the whole bank, and that fixed it. I then turned the 8
memory modules over to Sun, and they were able to figure out which one
of the memory modules was bad.

 
 
 


Post by David McCa » Thu, 05 Feb 2004 22:11:46

Now the error has changed somewhat:

Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x00008b68.adea170d
Feb 3 17:30:49 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.a4e4a018
Feb 3 17:30:49 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1008aef8
Feb 3 17:30:49 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 3 17:30:49 e4500a UDBL Syndrome 0x3 Memory Module Board 0 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
Feb 3 17:30:49 e4500a unix: WARNING: [AFT1] errID 0x00008b68.adea170d Syndrome 0x3 indicates that this may not be a memory module problem
Feb 3 17:30:49 e4500a unix: [AFT2] errID 0x00008b68.adea170d PA=0x00000000.a4e4a018
Feb 3 17:30:49 e4500a E$tag 0x00000000.0fc0149c E$State: Modified E$parity 0x07
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x00): 0x00000300.082203f8
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x08): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x18): 0x00000300.08309430 *Bad* PSYND=0x00ff
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x20): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x28): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x30): 0x00000300.07c1fb20
Feb 3 17:30:49 e4500a unix: [AFT2] E$Data (0x38): 0x00000000.00000000
Feb 3 17:30:49 e4500a unix: panic[cpu8]/thread=300083d5960:
Feb 3 17:30:49 e4500a unix: [AFT1] errID 0x00008b68.adea170d UE Error(s)
Feb 3 17:30:49 e4500a See previous message(s) for details
Feb 3 17:30:49 e4500a unix:
Feb 3 17:30:49 e4500a unix: syncing file systems...
Feb 3 17:30:51 e4500a unix: 27
Feb 3 17:31:15 e4500a unix: 4
Feb 3 17:31:38 e4500a unix: 1
Feb 3 17:31:48 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 3 17:31:48 e4500a unix: panic sync timeout
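
One way to read the [AFT2] dump is to pick out which E$Data doubleword is flagged *Bad* and compare its offset with the low bits of the reported PA. This is a rough sketch, assuming the eight E$Data entries are the 8-byte slices of one 64-byte E-cache line; the regex and helper name are invented for illustration.

```python
import re

# Matches lines such as:
#   [AFT2] E$Data (0x18): 0x00000300.08309430 *Bad*
E_DATA = re.compile(
    r"E\$Data \((0x[0-9a-fA-F]+)\): "
    r"(0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})( \*Bad\*)?"
)

def bad_offsets(log_text: str) -> list[int]:
    """Return the line offsets of every E$Data entry marked *Bad*."""
    return [int(off, 16)
            for off, _value, bad in E_DATA.findall(log_text)
            if bad]

sample = """\
[AFT2] errID 0x00008b68.adea170d PA=0x00000000.a4e4a018
[AFT2] E$Data (0x10): 0x00000000.00000000
[AFT2] E$Data (0x18): 0x00000300.08309430 *Bad* PSYND=0x00ff
[AFT2] E$Data (0x20): 0x00000000.00000000
"""

pa = 0xa4e4a018
# Assumption: a 64-byte line, so the low six bits of PA give the offset.
print(bad_offsets(sample), hex(pa % 64))  # [24] 0x18
```

Here the flagged entry (0x18) and the PA's offset within the line (0x18) agree, so the dump and the reported address point at the same doubleword.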




 
 
 


Post by David McCa » Thu, 12 Feb 2004 01:04:58

And yet another permutation:


Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)
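
The raw register values in these messages can be decoded by hand to confirm the flags the kernel prints. The sketch below is deliberately limited: the bit positions are read off this log's own decoding (0x80200000 shown as <PRIV,UE>, UDBL 0x0203 shown as <UE> with ESYND 0x03) rather than taken from a processor manual, so treat them as assumptions. Other AFSR bits are not covered.

```python
# Assumed bit positions, inferred only from the decoded values in the log.
AFSR_FLAGS = {31: "PRIV", 21: "UE"}
UDB_UE_BIT = 9          # 0x0200 in the UDBH/UDBL value
UDB_ESYND_MASK = 0xFF   # low byte carries the ECC syndrome

def decode_afsr(low_word: int) -> list[str]:
    """Names of the known flags set in the low 32 bits of the AFSR."""
    return [name
            for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
            if (low_word >> bit) & 1]

def decode_udb(value: int) -> dict:
    """Split a UDBH/UDBL value into its UE flag and ECC syndrome."""
    return {"UE": bool((value >> UDB_UE_BIT) & 1),
            "ESYND": value & UDB_ESYND_MASK}

print(decode_afsr(0x80200000))  # ['PRIV', 'UE']
print(decode_udb(0x0203))       # {'UE': True, 'ESYND': 3}
print(decode_udb(0x0000))       # {'UE': False, 'ESYND': 0}
```

Decoded this way, only the low half (UDBL) reports an uncorrectable error, with syndrome 0x03, which is the syndrome the kernel's "may not be a memory module problem" warning refers to.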
 
 
 


Post by David McCa » Thu, 12 Feb 2004 06:00:05

And now even more interesting, with Module 0 and PCI2 swapped out of the picture:

Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU8 Data access at TL=0, errID 0x000009ec.bb54d3a5
Feb 10 03:28:33 e4500a AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1001ee94
Feb 10 03:28:33 e4500a UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Feb 10 03:28:33 e4500a UDBL Syndrome 0x3 Memory Module Board 2 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801
Feb 10 03:28:33 e4500a unix: WARNING: [AFT1] errID 0x000009ec.bb54d3a5 Syndrome 0x3 indicates that this may not be a memory module problem
Feb 10 03:28:33 e4500a unix: [AFT2] errID 0x000009ec.bb54d3a5 PA=0x00000000.60f7d1d8
Feb 10 03:28:33 e4500a E$tag 0x00000000.1ec00c1e E$State: Exclusive E$parity 0x0f
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x00): 0x00000000.00000048
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x08): 0xe17a14ae.47f53f01
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x10): 0x00000000.00000060
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x18): 0x213824b0.65e04011 *Bad* PSYND=0x00ff
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x20): 0x1cd232b0.65e24028
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x28): 0x00000010.325436d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x30): 0x65e24053.a29c41d0
Feb 10 03:28:33 e4500a unix: [AFT2] E$Data (0x38): 0x65e24009.00000003
Feb 10 03:28:33 e4500a unix: panic[cpu8]/thread=2a1004b3d60:
Feb 10 03:28:33 e4500a unix: [AFT1] errID 0x000009ec.bb54d3a5 UE Error(s)
Feb 10 03:28:33 e4500a See previous message(s) for details
Feb 10 03:28:33 e4500a unix:
Feb 10 03:28:34 e4500a unix: syncing file systems...
Feb 10 03:28:35 e4500a unix: 13
Feb 10 03:28:58 e4500a unix: 1
Feb 10 03:29:13 e4500a unix: done
Feb 10 03:29:13 e4500a unix: panic[cpu8]/thread=2a1000abd60:
Feb 10 03:29:13 e4500a unix: panic sync timeout
Feb 10 03:29:13 e4500a unix:
Feb 10 03:29:14 e4500a unix: dumping to /dev/md/dsk/d2, offset 105316352
Feb 10 03:31:18 e4500a unix: 100% done: 43367 pages dumped, compression ratio 4.07,
Feb 10 03:31:18 e4500a unix: dump succeeded
 
 
 


Post by David McCa » Thu, 12 Feb 2004 06:30:30

Forgot to mention: all the soft links in / were gone after the last failure.

Dunno if that means anything significant either.

Not much left to replace at this point.