| Title: | FDDI - The Next Generation | 
| Moderator: | NETCAD::STEFANI | 
| Created: | Thu Apr 27 1989 | 
| Last Modified: | Thu Jun 05 1997 | 
| Last Successful Update: | Fri Jun 06 1997 | 
| Number of topics: | 2259 | 
| Total number of notes: | 8590 | 
    We have a problem with a crashing cluster under OpenVMS V6.2-1H2.
    
    Configuration:	OpenVMS V6.2-1H2
    			4 * 2100A with FDDI interface
    			directly connected to a Gigaswitch via one FGL-4
    			DSSI Connection between 2 of the 2100A
    			approx. 50 satellites
    For some reason, one of the Sables gets an error on the FDDI interface
    and starts to init the FGL-4 card. Then the other 3 machines also get
    errors on the FDDI, such as timeouts. Two of them crash with a CLUEXIT,
    as do all the satellites! The two with the DSSI connection between them
    survive. The timeouts generated by the FGL-4 card in the Gigaswitch
    will hopefully be fixed by the new firmware 3.01.
    The question is, why do we get the first error on the FWA0 interface.
    
    He told me that he had crashes approx. 6 months ago with V6.2. After
    crash dump analysis he got a new SYS$FWDRIVER. This driver was linked
    on 13-DEC-1995. No idea if this was the same or a similar problem.
    Now, after upgrading to V6.2-1H2, they have a new SYS$FWDRIVER
    with link date/time 27-JAN-1996.
    Are those changes from the special driver implemented in the V6.2-1H2 one?
    May he use the special one from V6.2 with V6.2-1H2?
    Original V6.2-1H2:	
    		image name: "SYS$FWDRIVER"
    		image file identification: "X-3"
    		image file build identification: "X61Q-SSB-DD00"
    		link date/time:  27-JAN-1996
    
    Special V6.2:
    		image name: "SYS$FWDRIVER"
    		image file identification: "X-3"
    		image file build identification: "X61Q-SSB-0000"
    		link date/time:  13-DEC-1995     <-------------------
    
    Any help will be welcome.
    
    Michael
    
    Here are some extracts from the cluster:
    
    SDA> SHOW LAN /FDDI /ERROR
    LAN Data Structures
    -------------------
                -- FWA Error Information 12-JUN-1996 16:51:47 --
Fatal error count                  6    Last error CSR              00000400
Fatal error code        3-XmtTimeout    Last fatal error     11-JUN 14:54:49
Prev  error code        3-XmtTimeout    Prev fatal error     11-JUN 13:26:07
Transmit timeouts                  6    Last USB time                   None
Control timeouts                   0    Last UUB time        12-JUN 03:58:23
Restart failures                   0    Last CRC time                   None
Power failures                     0    Last CRC srcadr                 None
Bad PTE transmits                  0    Last length erro                None
Loopback failures                  0    Last exc collisi                None
System ID failures                 0    Last carrier fai                None
ReqCounters failures               0    Last late collis                None
    
    And here is a part of the ERRLOG.SYS
 ******************************* ENTRY     759. *******************************
 ERROR SEQUENCE 21121.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:06:59.54                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:55:51
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2
 HW_MODEL: 00000423 Hardware Model = 1059.
 ERL$LOGMESSAGE AlphaServer 2100A 4/200
 NI-SCS SUB-SYSTEM, _AXP601$PEA0:
       PORT HAS CLOSED VIRTUAL CIRCUIT
       LOCAL STATION ADDRESS, FFFFFFFFFF00(X)
       LOCAL SYSTEM ID, 00000000F525(X)
       REMOTE STATION ADDRESS, 0000000000CB(X)
       REMOTE SYSTEM ID, 00000000F5D0(X)
       UCB$L_ERTCNT    00000032
                                       50. RETRIES REMAINING
       UCB$L_ERTMAX    00000032
                                       50. RETRIES ALLOWABLE
       UCB$L_ERRCNT    0000003F
                                       63. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00
 V M S                SYSTEM ERROR REPORT         COMPILED 12-JUN-1996 16:55:07
                                                                      PAGE  25.
 ******************************* ENTRY     765. *******************************
 ERROR SEQUENCE 21127.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:09:19.72                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:58:12
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2
 HW_MODEL: 00000423 Hardware Model = 1059.
 DEVICE ATTENTION AlphaServer 2100A 4/200
 NI-SCS SUB-SYSTEM, AXP601$PEA0:
       FATAL ERROR DETECTED BY DATALINK
       STATUS          0000045C
                       00001201
       DATALINK UNIT       0001
       DATALINK NAME   41574603
                       00000000
                       00000000
                       00000000
                                       DATALINK NAME = FWA1:
       REMOTE NODE     00000000
                       00000000
                       00000000
                       00000000
       REMOTE ADDR     00000000
                           0000
       LOCAL ADDR      000400AA
                           F525
                                       ETHERNET ADDR = 0E-01-01-00-00-00
       ERROR CNT           0001
                                       1. ERROR OCCURRENCES THIS ENTRY
       UCB$L_ERRCNT    00000040
                                       64. ERRORS THIS UNIT
V M S                SYSTEM ERROR REPORT         COMPILED 12-JUN-1996 16:55:07
                                                                      PAGE  26.
 ******************************* ENTRY     766. *******************************
 ERROR SEQUENCE 21128.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:09:22.86                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:58:15
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2
 HW_MODEL: 00000423 Hardware Model = 1059.
 DEVICE ATTENTION AlphaServer 2100A 4/200
 NI-SCS SUB-SYSTEM, AXP601$PEA0:
       FATAL ERROR DETECTED BY DATALINK
       STATUS          8BD4F200
                       00001200
       DATALINK UNIT       0001
       DATALINK NAME   41574603
                       00000000
                       00000000
                       00000000
                                       DATALINK NAME = FWA1:
       REMOTE NODE     00000000
                       00000000
                       00000000
                       00000000
       REMOTE ADDR     00000000
                           0000
       LOCAL ADDR      000400AA
                           F525
                                       ETHERNET ADDR = 0E-01-01-00-00-00
       ERROR CNT           0001
                                       1. ERROR OCCURRENCES THIS ENTRY
       UCB$L_ERRCNT    00000041
                                       65. ERRORS THIS UNIT
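The raw hexadecimal fields in the error log entries above can be cross-checked by hand. The following is a minimal sketch, assuming the dump prints each field as little-endian longwords or words and that DATALINK NAME is a counted ASCII string; the helper names are hypothetical and are not part of any OpenVMS tool. Under those assumptions, 41574603 decodes to the device name "FWA", and LOCAL ADDR 000400AA/F525 decodes to a station address with the DECnet Phase IV prefix AA-00-04-00, whose last two bytes match the LOCAL SYSTEM ID F525.

```python
import struct


def decode_counted_ascii(longwords):
    """Decode a counted ASCII string stored in little-endian longwords.

    Assumption: the first byte is the length, followed by the characters.
    """
    raw = b"".join(struct.pack("<I", lw) for lw in longwords)
    length = raw[0]
    return raw[1:1 + length].decode("ascii")


def decode_station_address(longword, word):
    """Rebuild a 6-byte station address from a longword plus a word.

    Assumption: both values are printed as little-endian integers.
    """
    raw = struct.pack("<IH", longword, word)
    return "-".join(f"{b:02X}" for b in raw)


def decnet_phase_iv(word):
    """Split a 16-bit DECnet Phase IV address into area.node (6 bits + 10 bits)."""
    return f"{word >> 10}.{word & 0x3FF}"


# Values copied from the ERRLOG entries above.
print(decode_counted_ascii([0x41574603, 0, 0, 0]))
print(decode_station_address(0x000400AA, 0xF525))
print(decnet_phase_iv(0xF525))
```

The decoded name agrees with the "DATALINK NAME = FWA1:" translation printed by the error log formatter (device name plus DATALINK UNIT 0001).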
| T.R | Title | User | Personal Name | Date | Lines | 
|---|---|---|---|---|---|
| 2065.1 | 19584::STOCKDALE | Thu Jun 13 1996 07:45 | 14 | ||
| I can't answer your question about why the transmit timeout occurred, but normally it's because the link became unavailable, so rather than hold on to the outstanding transmit forever, FWDRIVER resets the DEFPA and returns the transmit with error status. Perhaps a SHOW LAN/FULL would provide more information.

As to the driver version question: the V6.2 driver enabled parity checking when it shouldn't have. This caused occasional parity error crashes. The new driver disabled parity checking. This change is included in the V6.2-1H* versions. This sounds like a much different problem from the one you are having, which sounds like a network problem.

Dick | |||||
| 2065.2 | GIGASwitch crashing? | CSC32::J_SOBECKI | John Sobecki, DTN 592-4101, CXO3-2/D2 | Thu Jun 13 1996 12:10 | 20 | 
|     Hello,
    
    Usually the transmit timeouts are caused by the loss of physical
    connection, i.e. is the GIGASwitch crashing?  I've never heard of a
    DEFPA causing an FGL-4 card to go down.
    
    Were the previous crashes the UCB R5 cleared crash?  This crash seems
    to not be checked in the recent LAN driver images.  
    
    The V6.2 driver should work fine under V6.2-1H2.  I'd check the errorlog
    on the GIGASwitch to see what's causing the transmit timeouts.  If you
    have more than one SCP, and the SCPs are crashing, the errorlog is
    contained on the SCP itself.  So if the Elected SCP is the secondary
    SCP, you'll need to fail back to the primary SCP to check the errorlog.
    
    Maybe this is a new 2100A related problem.  I'd IPMT the driver issue
    if the crashes have returned.  
    
    Good Day,
    John
 | |||||
| 2065.3 | Get error log from FGL4 if necessary | NPSS::RLEBLANC | Thu Jun 13 1996 15:42 | 7 | |
|     
      If the SCP reports the FGL-4 in question is crashing, please
    also get the error log from the FGL4.
    
    
    						
    
 | |||||
| 2065.4 | FRSIT::MAYER | Mon Jun 17 1996 07:20 | 10 | ||
| Hi,
Next, we will check the GIGAswitch errorlog to see if there is a problem
with the GIGAswitch SCP or line card.
Also a sho lan/full is available on FRSIT::GSI_SDA_LAN.TXT
Regards
Juergen Mayer
                                                             
 | |||||
| 2065.5 | 19584::STOCKDALE | Tue Jun 18 1996 11:17 | 5 | ||
| >>Also a sho lan/full is available on FRSIT::GSI_SDA_LAN.TXT

It doesn't appear to be there.

- Dick | |||||
| 2065.6 | SDA output now available | FRSIT::MAYER | Thu Jun 20 1996 08:15 | 5 | |
|     Sorry,
    
    the sho lan/full is now available on FRSIT::GSI_SDA_LAN.TXT
    
    Regards Juergen
 | |||||
| 2065.7 | 19584::STOCKDALE | Thu Jun 20 1996 15:57 | 63 | ||
| If I extract the significant information from the counters, it shows that the ring went away and came back a few times, resulting in failed transmits (either a timeout after the ring went away or transmits while the ring was not available). The last error CSR shows the port status register contents at the time of the transmit timeout, showing 'link available' and nothing else. This indicates that the FDDI appeared to be ok when the driver declared a transmit timeout and shut down the adapter. Note that the transmit timeout is 5-6 seconds, so the device owned the transmit for that long before the timeout occurred.

Transmit underrun                  0    Dup tokens detected                7
Ring inits received                5    LEM rejects                        0
DAT test failures                  0    Connections completed             10
No work transmits           59193334    Ring avail transitions            10
Buffer_Addr transmits              0    Ring unavail transitions           7

+00 Device interrupts      296991649    +2C Too many segments              0
+08 Transmits failed            2779    +34 RESETs issued                  3
+0C Receive errors                 0    +38 Fatal errs (soft tmo)          2
+10 Transmit timeouts              2    +3C EEPROM update tmo              0

Fatal error count                  2    Last error CSR              00000400
Fatal error code        3-XmtTimeout    Last fatal error     11-JUN 11:09:38
Prev error code         3-XmtTimeout    Prev fatal error      7-JUN 16:50:01
Transmit timeouts                  2    Last USB time                   None

The driver version is the V6.2-1H2 version. There is a later version in V6.2-1H3, but it only has a bug fix for a DEFAA workaround, so although the version is different, the code is identical (since the DEFAA bug fix is in DEFAA conditional code). But the driver consists of a port driver plus the LAN common routines. The LAN common routines have a couple of fixes in V6.2-1H3: one when more than 11 multicast addresses are enabled (this system has exactly 11), and one which affects shared user applications, causing the first packet received by a shared user to be lost (if there was actually a shared user). In this case there are no shared users (although there are two users started in shared mode, there is only one for each protocol type). So neither of these fixes is significant in your case.

So, my guess is that there was a failure of the ring, which is likely something on the ring and not the DEFPA in the system. Perhaps a longer timeout would have allowed the FDDI ring to recover from whatever was going on. Given that the driver would have restarted the users automatically immediately after the error, the cluexits shouldn't have happened, but apparently the FDDI ring did not come back before the reconnect interval expired, so the satellites cluexited. Increasing the reconnect interval may give the nodes enough time for the ring to recover.

>> The question is, why do we get the first error on the FWA0 interface.

Because the FDDI ring became unavailable for more than 5-6 seconds.

>> Are those changes from the special driver implemented in the V6.2-1H2 one?

Yes.

>> May he use the special one from V6.2 with V6.2-1H2?

Yes, as long as he doesn't want a couple of additional bug fixes.

- Dick | |||||
| 2065.8 | FRSIT::MAYER | Fri Jun 21 1996 06:32 | 13 | ||
| Hi Dick,

I also saw the Ring Inits and Connections Completed. So I asked the customer if he was plugging and unplugging the systems from the Gigaswitch. He confirmed that he was moving from one Gigaswitch port to another, but didn't remember how often. So at the moment we don't know how many inits are "homemade" and how many are real failures. Because we only have the counters as they are now, we have to wait until the next failure occurs. We are also focusing on the Gigaswitch counters and errorlogs.

regards
Juergen | |||||
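Since the current counters mix deliberate re-plugging with real failures, one way to isolate the next failure is to record a baseline now and compare it with a snapshot taken after the next incident. The following is a minimal sketch of that comparison; the function name is hypothetical, the baseline values are the ones quoted in reply 2065.7, and the "after" values are invented purely for illustration.

```python
def counter_deltas(before, after):
    """Return name -> increase for every counter that changed between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Baseline taken from the counters quoted in reply 2065.7.
baseline = {
    "Ring inits received": 5,
    "Connections completed": 10,
    "Ring avail transitions": 10,
    "Ring unavail transitions": 7,
    "Transmits failed": 2779,
    "Transmit timeouts": 2,
}

# Hypothetical snapshot after a future failure (illustrative values only,
# not measured on the customer system).
after_failure = {
    **baseline,
    "Ring unavail transitions": 8,
    "Transmit timeouts": 3,
}

for name, increase in counter_deltas(baseline, after_failure).items():
    print(f"{name}: +{increase}")
```

Any counter that increases without a matching manual port move would then point to a real ring failure, not a "homemade" one.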