| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1215.1 | No functional difference between B01 and B02 54-24721-01 | PROXY::JEAN | MAUREEN JEAN | Mon Jun 02 1997 16:08 | 20 | 
|  | 
The difference between B01 and B02 is that there was
a part number change to the map RAMs that eliminated
a specific RAM vendor from the QVL.   The Toshiba
SRAM is not to be used on the MB.   
As for the failures: this test is a DMA loopback test
from the PCI to the system memory.  The errors in the HPC0
error register indicate that a CSR overrun occurred
as well as a non-existent PCI address error.   
Are both B01's failing with the same exact error?
If so, is there any way I can get a hold of one of these
modules?  I can be reached at DTN 223-6348.
Thanks,
Maureen Jean
RSE Tlaser I/O support
 | 
| 1215.2 | will try to supply the faulty boards | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Tue Jun 03 1997 00:57 | 13 | 
|  |     Many thanks, Maureen.  Yes, the engineer told me that the two B01
    boards fail with exactly the same error.  These two boards are in
    transit from Melbourne to me (in Sydney) so I can test them in the
    Sydney CSC's 8400.  If I confirm the faults I can arrange to send the
    modules to you, although that will take a little while.
    
    In the meantime, we have another module on order and the Australia Post
    8400 is running with the original and intermittently faulty board
    installed.  That board has caused another couple of crashes since
    yesterday, but fortunately the 8400 hadn't been formally accepted at
    the time the problems started.  Even so, if it wasn't for the fact that
    the CSC's 8400 has a DWLPA instead of a DWLPB we'd have given Australia
    Post the board from that to try to improve our relations with them.
 | 
| 1215.3 |  | PROXY::JEAN | MAUREEN JEAN | Tue Jun 03 1997 10:43 | 5 | 
|  | 
What was the reason for the crash on the first
DWLPB motherboard?
Maureen
 | 
| 1215.4 | original motherboard's error | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Wed Jun 04 1997 01:34 | 29 | 
|  |     I don't have a soft copy of the original DWLPB errors, only a fax. 
    However, here's the most significant part of a typical error entry:
MRETRY1                   x00400000
ERR 1                     x00000041  ERROR SUMMARY
                                     DMA READ RETURN DATA PARITY/LENGTH ERR
FADR 1                    x02033440  DMA Read from Memory
IMask PCI Interrupt Mask  x01031001  Slot 0 - Interrupt A Enable
                                     Slot 3 - Interrupt A Enable
DIAG 1                    x00000008  Generate Correct parity
                                     HPC Gate Array Revision = 0
                                     RM Down Hose Translate Ad x00000000
IPEND 1                   x00000000
IPROG 1                   x0000000C  Interrupt Source  Slot 3 INTA
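
    When reading a dump like the one above by hand, it can help to list
    which bit positions are set in each register value before looking
    them up in the hardware documentation.  The following is only a
    minimal Python sketch: the helper names (parse_hex, set_bits) are
    hypothetical, the values are copied from the entry above, and no
    attempt is made to name the bits, since the bit definitions are not
    given in this note.  FADR 1 is left out because it holds an address
    rather than flag bits.

```python
def parse_hex(field: str) -> int:
    """Convert an error-log style hex field such as 'x00000041' to an int."""
    return int(field.lstrip("xX"), 16)


def set_bits(value: int) -> list[int]:
    """Return the positions of the bits set in value, lowest bit first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Register values copied from the error entry above (illustrative only;
# the meaning of each bit must come from the hardware documentation).
REGISTERS = {
    "MRETRY1": "x00400000",
    "ERR 1": "x00000041",
    "IMask": "x01031001",
    "DIAG 1": "x00000008",
    "IPEND 1": "x00000000",
    "IPROG 1": "x0000000C",
}

for name, raw in REGISTERS.items():
    print(f"{name:8} {raw}  bits set: {set_bits(parse_hex(raw))}")
```

    For example, ERR 1 = x00000041 has bits 0 and 6 set, and
    IPROG 1 = x0000000C has bits 2 and 3 set.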
    
    These errors only occur under heavy I/O load.  This motherboard never
    fails self-test.
    
    A third replacement motherboard failed exactly the same way as the
    first two, so we're ordering a complete replacement DWLPB while we try
    to figure out what's going on.
    
    I've just received the first two replacement motherboards, which I'm
    going to test in the Sydney CSC's 8400.
    
    -Bret
    
    PS:
    Maureen, have you been getting the mail I sent you at PROXY::JEAN?
 | 
| 1215.5 | Seen it before (unfortunately). | IJSAPL::RIETKERK | Bart Rietkerk-Hoogeveen-Holland | Wed Jun 04 1997 03:49 | 50 | 
|  |     
    Goodday, downunder.....
    
    We recently had a horror story on a 12-CPU 440 MHz TLASER with
    the same errorlog entry as you entered in .4. I don't have any
    revisions at hand, so FWIW.
    
    April 25	Middle 48V regulator has its amber LED on. Both
    		the other regulators appear to be OK. System running
    		fine. After a complete power down the middle regulator
    		comes back normal. Replaced it anyway as a precaution.
    May 22	System crashes with a DMA READ RETURN DATA PARITY/
    		LENGTH ERROR on PCI box #3. DWLPB motherboard replaced.
    May 26	System crashes 3 times, among other funnies: PCIA MAP
    		RAM PARITY ERROR on PCI box #3 (!) After replacing the
    		hose cable (and a power cycle of course) 2 out of 4
    		PCI boxes (0 and 3) fail their self-test. Solidly blown.
    		Had to replace 2 DWLPB motherboards, and also replaced
    		the TIOP module (1 of the common factors). System running
    		fine again.
    June 3	Replaced middle 48V power regulator again just as a
    		precaution, because it has been swapped into the system
    		recently, and before the troubles started.
    
    The above gives the bare facts. Now for the gut feelings: DWLPA/DWLPB
    is JUNK (!) Sorry to be so blunt, but 8400s are fine and problem-free
    machines, except for the PCI boxes. 1) Construction s*cks. Apart from
    the generally known hints, kinks and blitzes: all those y-cables
    hanging off badly bent PCI (KZPSA) modules will give trouble sooner
    or later. 2) Electronically I don't trust them. Apart from the story 
    above I've seen other intermittent problems, and even 2 motherboards
    with components gone up in smoke (1 of them just during switch-on after
    installing a brand new machine). 3) Power is no good. Apart from
    the funny problems you get because of the mounting of the piggy-back
    power board in the PCI box I suspect you will end up having problems
    on any 8400 after switching off and on often enough.
    
    I am sorry to say all this, but it is my honest opinion. I had a chance
    to configure a Compaq ProLiant a month ago or so (everybody is going
    Bill G.'s way these days), and I think our PCI box designers can learn
    a lot from at least the PCI construction of this machine. We (DIGITAL)
    should be ashamed about the TLASER PCI implementation!
    
    (gut feelings back to normal)
    
    	I hope the Aussies can use my info to solve the case, and they stay
    	ahead of further problems. My guess would be marginal power in the
    	troubled PCI-box.
    
    	Cheers, Bart Rietkerk (looking after 7 Tlasers among other things).
 | 
| 1215.6 | the plot thickens... | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Wed Jun 04 1997 09:46 | 46 | 
|  |     G'day, Bart.  I'm actually reasonably happy with the quality and
    reliability of the electronics in _factory integrated_ Turbolaser
    systems (although the stories I hear about failure rates during FA&T
    are disturbing).  But add-in option quality isn't what I'd like and the
    reliability of Turbolaser modules in MCS spares inventory is
    unacceptable.  This is because the options and spares weren't (still
    aren't?) getting proper burn-in testing.
    
    However, 48V power supply problems on 8400s aren't really a Turbolaser
    issue.  The 48V supplies are carry-overs from the VAX/DEC7000s, and the
    ones I've seen (415V 50Hz 3-phase) have been buy-ins.  Anyway, I've
    found them to be quite reliable.
    
    I mostly agree with you about the Turbolaser mechanical/packaging
    aspect - it's pretty vile, particularly on the 8200, and a big
    reliability and maintainability issue.  The CSS rack-mount Turbolasers
    can be _real_ horrors when it comes to maintainability.
    
    But I'm digressing...
    
    We ordered and installed a whole new DWLPB shelf; it passed self-test
    and DUNIX booted OK.  It got through a complete LSM copy operation with
    no errors, which is much further than the original DWLPB ever got.
    NOTE: The new DWLPB has the old style -02 metalwork, a -01 variant power
          board and a rev B01 motherboard.  The failing DWLPB has the
          following parts:
            PCI metalwork: 70-31092-03 Rev B01
            PCI motherboard : 54-24721-01 Rev B02
            Power board: 54-23470-02 Rev B01
    
    The two rev B01 motherboard spares from Melbourne reached me today and
    both worked OK when I installed them in the CSC 8400's DWLPA shelf. 
    However, that shelf has different metalwork and power board (i.e. a -01
    instead of an -02) to the original DWLPB shelf in the Australia Post
    8400.
    
    The complete Aust. Post shelf will be sent to me for further testing
    and analysis.  It seems likely that there is something about this shelf
    causing a hose cable mating problem.  The problem manifests when the
    rev B01 motherboards from our spares stock are installed, although the
    rev level may not have anything to do with it.  FWIW, we've been
    following the procedure in Blitz TD-2153 when installing motherboards.
    
    I have also asked the site engineer to get Aust. Post's hose cable part
    numbers and rev levels for me tomorrow, just in case they have some
    relevance to the problem.
 | 
| 1215.7 | agree...how about the piggy-back p/s? | IJSAPL::RIETKERK | Bart Rietkerk-Hoogeveen-Holland | Wed Jun 04 1997 10:30 | 24 | 
|  |     
    Hi Bret,
    
    First, I know about the 48V regulators. I don't think that is where
    the problem is either. I've got 2 XMI-based 8400s over here, and about
    20 7000s (AXP and VAX) - hardly any problems with the 48V regulators.
    What I suspect more is the quality (mostly during switching) of the
    PCI-box piggy-back power boards. The 48V regulator swaps I've done
    on the troubled system over here were "just in case" swaps. I can't
    explain solid failure of 2 PCI boxes at the same moment over here.
    There has to be a common factor (or was it just bad luck???)
    
    Has the engineer-on-site that looks after the system where the 3 spare 
    DWLPB motherboards have failed swapped the PCI-box piggy-back P/S?
    If that one is marginal it could explain 1) failure because of
    rev-level difference (marginal change in power consumption?), 2) the
    fact that 2 of those modules run fine in your system, and 3) the fact
    that the original motherboard fails only under heavy load.
    
    Just guessing....
    
    	Good luck!
    
    	Bart Rietkerk.
 | 
| 1215.8 | Interesting Failure!!  :-) | MASS10::geraldo.reo.dec.com::ConnollyG | [email protected] | Wed Jun 04 1997 14:11 | 2 | 
|  | >    motherboard in the DWLPB.  The replacement board failed with a hard-on
 | 
| 1215.9 | I think it's hose connection related | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Thu Jun 05 1997 00:44 | 25 | 
|  |     Bart,
    
    Yes, I think the engineer did swap the power board with the one from
    the 8400's other (working) DWLPB shelf.  He also swapped over the hose
    cable from the other shelf.
    
    Actually, almost from the beginning the errors looked to me like they
    were probably due to hose connection problems.  The errors on the
    original motherboard were DMA READ RETURN DATA PARITY/LENGTH errors,
    which can be caused by a poor hose connection.  The self-test errors
    were CSR overrun errors which also can be caused by a poor hose
    connection, although I've never seen a self-test failure for this
    reason before.
    
    I just didn't know whether it was a board-to-box revision related
    mechanical incompatibility problem or simply faulty DWLPB metalwork. 
    I'm now fairly sure that it's the latter, but I still don't know
    whether it's a one-off or an instance of a wider problem.
    
    I'll know more when I receive the complete faulty DWLPB.  If I find a
    hose connection problem and it looks like it might be manufacturing
    process related, I'll arrange to have the whole DWLPB shipped to
    Maureen Jean of RSE for further analysis.
    
    -Bret
 |