| Title: | AlphaServer 4100 |
| Moderator: | MOVMON::DAVIS S |
| Created: | Tue Apr 16 1996 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 648 |
| Total number of notes: | 3158 |
Hi.
I have a problem with an A4100:
SROM VERSION V1.1 - VERSION 2.0-3, 21-AUG-1996 14:31:24
OPENVMS PALCODE V1.18-8 DIGITAL UNIX PALCODE V1.21-12
Running the "TEST CPU" command in console mode produces the following
output:
PROCESS TIME CPU0: SOFT ERROR DETECTED, VECTOR 00620
MCHK_CODE: 00000000 02040000
The system then goes into a loop, and a hardware restart is necessary.
I tried downgrading the firmware to version 1.2-4,
and that fixed the problem.
But I need version 2.0-3 because Digital UNIX v3.2-g is
installed.
Can anybody help me, please?
Thanks in advance. Regards, Romeo Cesaeato
[Posted by WWW Notes gateway]
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 484.1 | | MAY30::CUMMINS | | Tue Feb 11 1997 19:24 | 24 |
The system is experiencing single-bit (CRD) memory errors. V1.2-4
console did not automatically report soft errors. This feature had
to be enabled by doing SET D_LOGSOFT ON prior to running the V1.2-4
TEST command. V2.0-3 and later consoles automatically enable soft
error reporting. This is why you only see the errors with V2.0-3.
The fact that it "goes into a loop" suggests that the system has
*lots* of these errors - this happens quite frequently. Note: older
revision PCI bridge cards can cause IOD-detected CRD errors with a
particular footprint. I believe older revision motherboards can cause
similar symptoms. You should have your system inventoried as to
revision levels to see whether your hardware is fully up to rev.
What does the SRM console SHOW FRU command return re: part numbers,
serial numbers, and revision levels? It would be helpful if you could
post the SHOW FRU display output as a reply to this note. It would also
be helpful if you could post three successive 620/630 error reports.
This will help us diagnose the problem.
It's quite possible the machine in question has bad memory. But we need
more data before we can know whether this is faulty memory or some
other out-of-rev module.
BC
| |||||
| 484.2 | MEM0H -or- MEM0L, which one? | KAOFS::M_NAKAGAWA | | Tue Feb 11 1997 21:04 | 231 |
re .-1
I have some DECevent samples here.
I believe we have a memory problem with this system.
It has 1024MB of memory (a pair of B3030-FA) and I want to know which
half (MEM0H or MEM0L) is causing it.
I tried the MACHINE CHECK program (4100.digital_unix), but it didn't help.
Some memory problems are discussed in the following BLITZ articles:
[TD 2109] Alpha Server 4100 - Memory Errors
[TD 2226] Alpha Server 4100 - SRM Console V4.8-3
Thanks for your help,
CRDC/Mitz
------------------- MC620/630 DECevent Sample-----------------------
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 3. Bcache error (630 entry)
******************************** ENTRY 6 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 37.
Timestamp of occurrence 03-FEB-1997 15:52:37
Host name emcsats004
System type register x00000016 AlphaStation 4x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x00000003
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Ext Interface Address Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Fill Syndrome Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Interrupt Summary Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
WHOAMI x00000000 Module Revision 0.
MID 0.
GID 0.
Sys Environmental Regs x00000000
Base Addr of Bridge x000000FBE0000000
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
HORSE Module Revision: x00000002
SADDLE Module Revision: x00000002
SADDLE Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
PCI Class Code x00000600
MC Error Info Register 0 x193E3C40
MC Bus Trans Addr<31:4>: 193E3C40
MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
IOD1 Master at Time of Error
Device ID 2 x00000005
MC error info valid
CAP Error Register x89000000 Error Detected but Not Logged
Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
******************************** ENTRY 7 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 36.
Timestamp of occurrence 03-FEB-1997 15:52:37
Host name emcsats004
System type register x00000016 AlphaStation 4x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x00000003
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Ext Interface Address Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Fill Syndrome Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
Interrupt Summary Reg x0000000000000000
Not Valid for 620 System
Correctable Errors
WHOAMI x00000000 Module Revision 0.
MID 0.
GID 0.
Sys Environmental Regs x00000000
Base Addr of Bridge x000000F9E0000000
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
HORSE Module Revision: x00000002
SADDLE Module Revision: x00000002
SADDLE Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
PCI Class Code x00000600
MC Error Info Register 0 x193E3C40
MC Bus Trans Addr<31:4>: 193E3C40
MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
IOD1 Master at Time of Error
Device ID 2 x00000005
MC error info valid
CAP Error Register x89000000 Error Detected but Not Logged
Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
******************************** ENTRY 8 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 35.
Timestamp of occurrence 03-FEB-1997 15:52:37
Host name emcsats004
System type register x00000016 AlphaStation 4x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 3. High Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 3. Bcache error (630 entry)
Software Flags x0000000000000000
Active CPUs x00000003
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Ext Interface Address Reg x0000000000000000
Fill Syndrome Reg x0000000000000000
Interrupt Summary Reg x0000000000000000
WHOAMI x00000000 Module Revision 0.
MID 0.
GID 0.
Sys Environmental Regs x00000000
Base Addr of Bridge x000000F9E0000000
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
HORSE Module Revision: x00000002
SADDLE Module Revision: x00000002
SADDLE Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
PCI Class Code x00000600
MC Error Info Register 0 x193E3C40
MC Bus Trans Addr<31:4>: 193E3C40
MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
IOD1 Master at Time of Error
Device ID 2 x00000005
MC error info valid
CAP Error Register x89000000 Error Detected but Not Logged
Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
===========================================
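All three entries above report the same MC Error Info register values. As a rough sketch, the two registers can be combined into one bus address so that repeated errors are easy to compare. The bit layout used here is an assumption read off the DECevent labels (register 0 carrying address bits <31:4> in place, the low byte of register 1 carrying bits <39:32>), not something taken from a hardware specification, and the function name is hypothetical. The address alone does not say which pair member (MEM0H or MEM0L) is at fault.

```python
def mc_bus_address(info_reg0: int, info_reg1: int) -> int:
    """Combine the two MC Error Info registers into one 40-bit bus address.

    Assumed layout (inferred from the DECevent labels, not from a spec):
      - info_reg0 holds address bits <31:4> in place (low nibble ignored)
      - the low byte of info_reg1 holds address bits <39:32>
    """
    low = info_reg0 & 0xFFFFFFF0   # bits <31:4>
    high = info_reg1 & 0xFF        # bits <39:32>
    return (high << 32) | low


# Register values copied from the entries posted above.
addr = mc_bus_address(0x193E3C40, 0x800F4800)
print(f"MC bus address: 0x{addr:010X}")
```

If every 620/630 burst decodes to the same address, that points toward a fixed failing location rather than a transient.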
| |||||
| 484.3 | | MAY30::CUMMINS | | Wed Feb 12 1997 10:30 | 55 |
This lengthy note is leading somewhere. Bear with me..
The AlphaServer 4100/4000's PCI bridge's ASIC has a bug in it that can
cause data corruption given a certain sequence of events. Basically,
the sequence involves a CSR read from the IOD's B chip while a DMA is
in progress. There are only a couple of registers implemented in the B
chip. These include the SYNDROME and STAT CSRs which are used, in part,
to provide I/O-detected, single-bit error syndrome status. The only
software that would normally access these registers is PALcode. The
operating systems never touch them, as all error data collection is
handled by PAL (and NT HAL) code on the 4100/4000 platform.
We discovered the above data corruption problem prior to FRS. The
program opted to not re-spin the ASIC. Instead, we modified PALcode to
not collect B chip CSR error info on CRD or MCHK errors. Since VMS/UNIX
PALcode attempts to scrub all single-bit memory errors, it was felt
that more often than not, the EV5 would also detect a single-bit error
during the course of scrubbing the location, assuming the error was not
a transient.
The impact of all of this is:
1. The data corruption problem is worked around and effectively
made moot by changes to PALcode. Customers should never see
this problem (though you wouldn't want to write an application
that periodically polled these STAT and SYNDROME CSRs!)
2. The side effect of (1) is that SYNDROME and STAT registers will
always read as zero on I/O-detected CRD errors logged in the
system error log. [See note below..] This will obviously hinder
isolation to a memory pair member.
3. More often than not, PAL scrubbing, which involves reading and
then writing back the data, will generate an EV5-detected CRD
(630 or 620) error. PAL will snapshot the EV5 FILL_SYNDROME IPR
which will then enable isolation to a pair member.
Note: the V3.0-10 SRM console and later versions added an automatic
enable of PAL collection of I/O SYNDROME and STAT data on I/O-detected
620 CRD errors. This is because TEST performs read-only operations to
disk/tape/floppy. No writes.. Therefore, V3.0-10 and greater consoles
can be used to diagnose to a memory pair member assuming the problem is
repeatable under console TEST. And very often it is..
In summary, if your error log does not have any EV5-detected 620 or 630
CRD error entries, then you will not be able to diagnose to a memory
pair member. Are there any CPU-detected CRD errors in the system error
log? The ones you posted were all I/O-detected errors... If you update
to V3.0-10, and re-run the TEST command, you may see EV5-detected CRD
errors. In this case, the error frames displayed will include syndrome
data for isolation to a pair member.
The V3.0-10 console is available on the V3.8 Firmware Update CD (as
well as via our firmware web site..)
If you have questions, let me know.
BC
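The summary above comes down to one check: an error entry is only useful for isolating a pair member if it carries non-zero syndrome data. A minimal sketch of that check against pasted DECevent text follows. The register labels are the ones that appear in the entries posted in .2; the function name is hypothetical, and the sketch does not attempt to map a syndrome value to MEM0H or MEM0L.

```python
import re

# Register labels as they appear in the DECevent output posted in this topic.
SYNDROME_LABELS = (
    "Fill Syndrome Reg",
    "MDPA Error Syndrome Reg",
    "MDPB Error Syndrome Reg",
)


def nonzero_syndromes(entry_text: str) -> dict:
    """Return {label: value} for every syndrome register that is not zero."""
    found = {}
    for label in SYNDROME_LABELS:
        match = re.search(re.escape(label) + r"\s+x([0-9A-Fa-f]+)", entry_text)
        if match and int(match.group(1), 16) != 0:
            found[label] = int(match.group(1), 16)
    return found


# Lines copied from the 630 entry (ENTRY 8) in reply .2.
sample = """\
Fill Syndrome Reg x0000000000000000
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
"""
print(nonzero_syndromes(sample) or "no syndrome data: cannot isolate to a pair member")
```

An empty result for every entry in the log matches the situation described above, where only I/O-detected errors with zeroed SYNDROME and STAT registers were captured.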
| |||||
| 484.4 | Thanks | KAOFS::M_NAKAGAWA | | Wed Feb 12 1997 12:23 | 10 |
Thanks for the info.
>Are there any CPU-detected CRD errors in the system error log?
No, they are all I/O-detected.
A few MC620 errors followed by an MC630, as described in TD #2109.
Thanks again,
CRDC/Mitz
| |||||
| 484.5 | | MAY30::CUMMINS | | Wed Feb 12 1997 12:33 | 5 |
Not sure what you meant in your previous reply. 630 CRD errors are
*always* EV5-detected; never I/O-detected. Are you saying you see 630
errors in the error log?
BC
| |||||
| 484.6 | 620,620,620 then 630 | KAOFS::M_NAKAGAWA | | Wed Feb 12 1997 17:23 | 97 |
re:last >Are you saying you see 630 errors in the error log?
Please refer to 484.2, DECevent entry #8: two MC620s followed by an MC630.
Sometimes we get only MC620s, but sometimes an MC630 follows
immediately after the MC620s, as in the listing below.
When we get an MC630, the "MC Error Info Register 0" always contains
"x193E3C40", and I was wondering if someone could tell whether MEM0H or
MEM0L is at fault, if we do have a memory problem here.
The "CAP Error Register" (err sum) says "Correctable ECC err det by MDPA".
---------------------------------------
Timestamp of occurrence 04-FEB-1997 08:46:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 04-FEB-1997 08:46:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 04-FEB-1997 08:46:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 03-FEB-1997 15:52:37
CPU Minor class 3. Bcache error (630 entry) <----- !!!
Timestamp of occurrence 02-FEB-1997 12:00:02
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 02-FEB-1997 12:00:02
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 02-FEB-1997 12:00:02
CPU Minor class 3. Bcache error (630 entry) <----- !!!
Timestamp of occurrence 31-JAN-1997 16:45:03
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 16:45:03
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 16:45:03
CPU Minor class 3. Bcache error (630 entry) <----- !!!
Timestamp of occurrence 31-JAN-1997 14:05:09
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:09
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 3. Bcache error (630 entry) <------ !!!
Timestamp of occurrence 31-JAN-1997 11:03:30
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 11:03:30
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 11:03:30
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 11:03:27
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 11:03:27
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 11:03:27
CPU Minor class 3. Bcache error (630 entry) <----- !!!
Timestamp of occurrence 31-JAN-1997 10:24:51
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 10:24:51
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 10:24:51
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 10:24:51
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 10:24:51
CPU Minor class 3. Bcache error (630 entry) <----- !!!
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 15:52:50
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 15:52:50
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 15:52:50
CPU Minor class 4. 620 System Correctable Error
---------------------------------------------------------------------------
Thanks again for your help.
CRDC/Mitz
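The pattern in the listing above (a run of 620s, sometimes ending in a 630) is easier to see when the entries are grouped. The sketch below counts 620 and 630 entries per burst from a pasted summary. Grouping by minute is an arbitrary choice made here because entries within one burst can differ by a few seconds; the function name is hypothetical.

```python
def summarize_bursts(summary: str) -> dict:
    """Count 620 and 630 entries per minute from a DECevent summary listing."""
    bursts = {}
    timestamp = None
    for line in summary.splitlines():
        line = line.strip()
        if line.startswith("Timestamp of occurrence"):
            timestamp = line[len("Timestamp of occurrence"):].strip()
        elif line.startswith("CPU Minor class") and timestamp is not None:
            minute = timestamp[:-3]  # drop ":SS" so nearby entries share a key
            counts = bursts.setdefault(minute, {"620": 0, "630": 0})
            if "630 entry" in line:
                counts["630"] += 1
            elif "620" in line:
                counts["620"] += 1
    return bursts


# Two bursts copied from the listing above.
sample = """\
Timestamp of occurrence 31-JAN-1997 14:05:09
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:09
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 31-JAN-1997 14:05:08
CPU Minor class 3. Bcache error (630 entry) <------ !!!
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
Timestamp of occurrence 30-JAN-1997 17:31:38
CPU Minor class 4. 620 System Correctable Error
"""
for minute, counts in summarize_bursts(sample).items():
    print(minute, counts)
```

Bursts with a non-zero 630 count are the ones worth opening in full, since those are the entries that can carry EV5 syndrome data.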
| |||||
| 484.7 | | MAY30::CUMMINS | | Thu Feb 13 1997 10:50 | 11 |
I looked again at the error log snippets from reply .2 and there are
indeed 630 entries. Unfortunately, no EV5 error info is presented. I
am assuming this is because you are running an older DECevent.
I strongly recommend that you update all of your customers to DECevent
V2.3 with the latest 4100/4000 KNL files. Several problems involving
CRD error interpretation / reporting have been resolved in DECevent.
Among other additions and fixes...
Without the syndrome information, it is not possible to diagnose to a
card pair member.
| |||||