| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 6400.1 | Does not look right... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Mon Feb 17 1997 15:51 | 14 | 
|  | 
The analysis shows no sense keys. It does not seem right. Maybe Chris Loane will
chime in, but for some reason you got an error with no data to support it in the
log.
As for busting open the SBB, you should not have to, and you may end up eating
the cost if the dispensation we requested on the warranty sticker has not made
it in to manufacturing yet. Have your logistics folks find out why they can't get
the correct variant in stock.
Were/are the termination power jumpers installed in the device as per TIMA BLITZ
TD2041?
roger.
 | 
| 6400.2 |  | KERNEL::LOANE | Comfortably numb!! | Tue Feb 18 1997 02:31 | 23 | 
|  | >Decoded Instance Code is:-
>    The disk device reported standard SCSI Sense Data. Check the service 
>    manual for the device for further instructions.
    The  Instance  code  SUGGESTS  that  the HSJ is about to log all the 
    extended Sense data, but....
>       LONGWORD 16.    0070802A
>                                       /*.p./
>       LONGWORD 17.    00000000
>                                       /..../
>       LONGWORD 18.    00000A00
>                                       /..../
>       LONGWORD 19.    00000000
>                                       /..../
>       LONGWORD 20.    00000000
>                                       /..../
    ......it's  all  zero.....this  is  very   strange   (i.e.   nothing 
    useful/no  further  help).  Were there ANY other errors logged at or 
    around the same time??
    Chris
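    For readers comparing the hex column with the ASCII column in the
    dump quoted above, here is a minimal sketch of how the two relate.
    It assumes each LONGWORD is printed most significant byte first
    while the ASCII column follows little-endian memory order, which is
    consistent with 0070802A being shown as "*.p.". The helper names
    are hypothetical and not part of any DEC tool.

```python
import struct


def longword_bytes(hex_text: str) -> bytes:
    """Return the four bytes of a 32-bit longword in little-endian (memory) order."""
    return struct.pack("<I", int(hex_text, 16))


def ascii_column(data: bytes) -> str:
    """Render bytes the way the dump's ASCII column does: '.' for non-printables."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in data)


# Values copied from the quoted dump (LONGWORD 16 through 20).
LONGWORDS = ["0070802A", "00000000", "00000A00", "00000000", "00000000"]

for number, value in enumerate(LONGWORDS, start=16):
    raw = longword_bytes(value)
    print(f"LONGWORD {number}. {value}  bytes={raw.hex(' ')}  /{ascii_column(raw)}/")
```

    The nonzero bytes in longwords 16 and 18 are easy to spot this way;
    what they mean still depends on the packet layout in the service
    manual.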
 | 
| 6400.3 | GOOD, it wasn't just me ;^) | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Tue Feb 18 1997 07:43 | 7 | 
|  | 
You may want to hook up a printer to the console of the controller and see if
any events are being dumped to the console. It sounds like either something is
being incorrectly reported OR we are missing some magic key to unlock what IS
coming back.
roger.
 | 
| 6400.4 | Action plan | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Tue Feb 18 1997 13:55 | 24 | 
|  |     The customer's merged errorlog and the text output of SWEAT V2.7 are now
    available on node MQOU27 decnet account (FAL$server). I found out that
    the errorlog can be analysed on VAX VMS6.2 even though the errorlog is
    out of a 5.5-2 system. File names are MSE_errlog.sys & MSE_sweat.txt. I
    will try to run DECEVENT from that file this afternoon.
    
    By the way, I checked the jumper on the drive that was replaced this
    weekend, and the jumper was missing. The only jumper present was 1-2. I
    ran SCSIpro on that drive at the office and found no grown defect list
    on the drive. I tried running read scan, write verify, etc., but I
    cannot go past block number 48000. I'm working on this right now. I ran
    format successfully but still cannot go past block 48000.
    
    The plan for tonight is to run DILX on the HSJ50 to exercise the disk
    and see if we can get more accurate info out of the test. Then, maybe
    format the drive if necessary. After that, we will move the drive to a
    different slot in the BA356 and recreate the unit. FMU on both HSJs has
    not shown any problem so far. I will also hook up a printer to the HSJ.
    
    Regards,
    
    Denis Ormaechea
    DTN-632-7942
    
 | 
| 6400.5 | Troubleshooting results. | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Wed Feb 19 1997 19:49 | 365 | 
|  | --------------------------------------------------------------------------------
DENIS ORMAECHEA           <Troubleshooting results.>           19-FEB-1997 21:30
--------------------------------------------------------------------------------
18-Feb-97 Action
	I was onsite almost all day gathering all possible information from the
logs about the problem since it started. The most important info was in node
TS4's errorlog, but the file had corrupted entries that needed to be fixed by
RMS. After fixing everything, I merged all errorlog info from the cluster since
the problem started (20-jan-97). Brought the merged errorlog and SWEAT text
output to the office on tape cartridge.
	Copied the files to the MQOU27 decnet account and asked RDC to run DECEVENT
on them to get more info. The files are called MSE_errlog.sys, MSE_sweat.txt,
and MSE_decevent.txt.
	My first action onsite at 17:30 Hr was to run DILX on both drives to
get better info from SCSI ASC/ASCQ status. The dua1000 disk showed errors
within 2 Mins with the following results:
This is the config:
*****************************************************************************
Controller:
        HSJ50-AX ZG63300559 Firmware V50J-2, Hardware  A01
        Configured for dual-redundancy with ZG63100486
            In dual-redundant configuration
        SCSI address 6
        Time: 18-FEB-1997 17:38:50
Host port:
        Node name: HSJ010, valid CI node 6, 16 max nodes
        System ID 420010061122
        Path A is ON
        Path B is ON
        MSCP allocation class    1
        TMSCP allocation class   1
        CI_ARBITRATION = ASYNCHRONOUS
        MAXIMUM_HOSTS = 31
        NOCI_4K_PACKET_CAPABILITY
Cache:
        128 megabyte write cache, version 3
        Cache is GOOD
        Battery is GOOD
        No unflushed data in cache
        CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
        CACHE_POLICY = A
        NOCACHE_UPS
HSJ010 > sho d1000
MSCP unit                                    Uses
--------------------------------------------------------------
  D1000                                      DISK150
        Switches:
          RUN                    NOWRITE_PROTECT        READ_CACHE            
          WRITEBACK_CACHE       
          MAXIMUM_CACHED_TRANSFER_SIZE = 32
        State:
          AVAILABLE
          No exclusive access
          PREFERRED_PATH = THIS_CONTROLLER
        Size: 523366 blocks
HSJ010 > sho disk150
Name          Type          Port Targ  Lun                    Used by
------------------------------------------------------------------------------
DISK150       disk             1    5    0                    D1000
          DEC      EZ32     (C) DEC V064
        Switches:
          NOTRANSPORTABLE       
          TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
        Size: 523366 blocks
        Configuration being backed up on this container
HSJ010 > sho d1100
MSCP unit                                    Uses
--------------------------------------------------------------
  D1100                                      DISK210
        Switches:
          RUN                    NOWRITE_PROTECT        READ_CACHE            
          WRITEBACK_CACHE       
          MAXIMUM_CACHED_TRANSFER_SIZE = 32
        State:
          ONLINE to the other controller
          No exclusive access
          PREFERRED_PATH = OTHER_CONTROLLER
        Size: 523366 blocks
HSJ010 > sho disk210
Name          Type          Port Targ  Lun                    Used by
------------------------------------------------------------------------------
DISK210       disk             2    1    0                    D1100
          DEC      EZ32     (C) DEC V064
        Switches:
          NOTRANSPORTABLE       
          TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
        Size: 523366 blocks
        Configuration being backed up on this container
These are the results:
*******************************************************************************
HSJ010 > run dilx
Disk Inline Exerciser - version 2.0
Note: DILX will only test units with a single physical device.
The Auto-Configure option will automatically select, for testing, half or
all of the disk units configured. It will perform a very thorough test with
*WRITES* enabled. Only disk units with a single physical device will be
tested. The user will only be able to select the run time and
performance summary options and whether to test a half or full configuration.
The user will not be able to specify specific units to test.
The Auto-Configure option is only recommended for initial installations.
Do you wish to perform an Auto-Configure (y/n) [n] ?
Use all defaults and run in read only mode (y/n) [y] ?n
Enter execution time limit in minutes (1:65535) [10] ?30
Enter performance summary interval in minutes (1:65535) [10] ?
Include performance statistics in performance summary (y/n) [n] ?
Display hard/soft errors (y/n) [n] ?y
Display hex dump of Error Information Packet Requester Specific 
information (y/n) [n] ?
When the hard error limit is reached, the unit will be dropped from testing.
Enter hard error limit (1:65535) [65535] ?
When the soft error limit is reached, soft errors will no longer be
displayed but testing will continue for the unit.
Enter soft error limit (1:65535) [32] ?
Enter IO queue depth (1:12) [4] ?
  *** Available tests are:
    1. Basic Function
    2. User Defined
Use the Basic Function test 99.9% of the time. The User Defined
test is for special problems only.
Enter test number (1:2) [1] ?1
 **CAUTION**
If you answer yes to the next question, user data WILL BE destroyed.
Write enable disk unit(s) to be tested (y/n) [n] ?y
The write percentage will be set automatically. 
Enter read percentage for Random IO and Data Intensive phase (0:100) [67] ?
Enter data pattern number 0=ALL, 19=USER_DEFINED, (0:19) [0] ?
Perform initial write (y/n) [n] ?y
The erase percentage will be set automatically.
Enter access percentage for Seek Intensive phase (0:100) [90] ?
Perform data compare (y/n) [n] ?y
Enter compare percentage (1:100) [5] ?50
Disk unit numbers available for testing on this controller include:
    1000
    1100
Enter unit number to be tested ?1000
Unit 1000 will be write enabled.
Do you still wish to add this unit (y/n) [n] ?y
Enter start block number (0:523365) [0] ?
Enter end block number (0:523365) [523365] ?
Unit 1000 successfully allocated for testing
Select another unit (y/n) [n] ?
   DILX testing started at: 18-FEB-1997 17:57:01
    Test will run for 30 minutes
    Type ^T(if running DILX through VCS) or ^G(in all other cases)
      to get a current performance summary
    Type ^C to terminate the DILX test prematurely
    Type ^Y to terminate DILX prematurely
Error Information Packet in hex
      Cmd Ref Number       000010D5
      Unit Number          000003E8
      Log Sequence         0000002F
      Format               02
      Flags                40
      Event Code           0000000B
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0005
      Unit ID[0]           00000000
      Unit ID[1]           02FF0000
      Unit Software Rev    01
      Unit Hardware Rev    34
      Recovery Level       01
      Retry Count          00
      Serial Number        05590004
      Header Code          00022B8F
      Instance                    0328450A
      Template Type               51
      Requestor Information Size  3C
      Sense Key                   01
      ASC                         17
      ASQ                         07
Error Information Packet in hex
      Cmd Ref Number       000010D5
      Unit Number          000003E8
      Log Sequence         00000030
      Format               02
      Flags                80
      Event Code           0000000B
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0005
      Unit ID[0]           00000000
      Unit ID[1]           02FF0000
      Unit Software Rev    01
      Unit Hardware Rev    34
      Recovery Level       01
      Retry Count          00
      Serial Number        05590004
      Header Code          00022B8F
      Instance                    0328450A
      Template Type               51
      Requestor Information Size  3C
      Sense Key                   01
      ASC                         17
      ASQ                         07
Error Information Packet in hex
      Cmd Ref Number       00000000
      Unit Number          00000000
      Log Sequence         00000032
      Format               00
      Flags                02
      Event Code           0000016A
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0000
      Instance                    03F40064
      Template Type               41
      Requestor Information Size  04
 Bad Value Added Completion Status for unit 1000, end message in hex
      Event Code                 0043
      Op Code                    21
      Cmd Ref Number             000017CE
      Byte Count                 00005A00
      Error Byte Count           00000000
      Sequence Number            0000
      Flags                      00
Error Information Packet in hex
      Cmd Ref Number       000017CE
      Unit Number          000003E8
      Log Sequence         00000031
      Format               02
      Flags                40
      Event Code           0000002B
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0005
      Unit ID[0]           00000000
      Unit ID[1]           02FF0000
      Unit Software Rev    01
      Unit Hardware Rev    34
      Recovery Level       01
      Retry Count          00
      Serial Number        05590004
      Header Code          00026B8B
      Instance                    031A4002
      Template Type               51
      Requestor Information Size  3C
      Sense Key                   04
      ASC                         B0
      ASQ                         00
Error Information Packet in hex
      Cmd Ref Number       00000000
      Unit Number          00000000
      Log Sequence         00000034
      Format               00
      Flags                02
      Event Code           0000016A
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0000
      Instance                    03F40064
      Template Type               41
      Requestor Information Size  04
Error Information Packet in hex
      Cmd Ref Number       000017CE
      Unit Number          000003E8
      Log Sequence         00000033
      Format               02
      Flags                00
      Event Code           0000012B
      Controller ID        63300559 012D0009
      Controller SW ver    50
      Controller HW ver    01
      Multi Unit Code      0005
      Unit ID[0]           00000000
      Unit ID[1]           02FF0000
      Unit Software Rev    01
      Unit Hardware Rev    34
      Recovery Level       01
      Retry Count          00
      Serial Number        05590004
      Header Code          00026B8B
      Instance                    03134002
      Template Type               51
      Requestor Information Size  3C
      Sense Key                   04
      ASC                         E0
      ASQ                         06
  The unit status and/or the unit device type changed unexpectedly.
  Unit 1000 dropped from testing
   DILX Summary at 18-FEB-1997 17:58:34
   Test minutes remaining: 29, expired: 1
Cnt err in HEX  IC:03F40064  PTL:01/05/FF  Key:06  ASC/Q:00/00  HC:0  SC:2
  Total Cntrl Errs   Hard Cnt 0   Soft Cnt 2
Unit 1000     Total IO Requests 6098
  Err in Hex: IC 0328450A  PTL:01/05/00  Key:01  ASC/Q:17/07  HC:0  SC:2
  Err in Hex: IC 031A4002  PTL:01/05/00  Key:04  ASC/Q:B0/00  HC:0  SC:1
  Err in Hex: IC 03134002  PTL:01/05/00  Key:04  ASC/Q:E0/06  HC:1  SC:0
  Total Errs   Hard Cnt 1   Soft Cnt 3
  The unit status and/or the unit device type changed unexpectedly.
  Unit 1000 dropped from testing
Reuse Parameters (stop, continue, restart, change_unit) [stop] ?
DILX - Normal Termination
************************************************************************
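The "Error Information Packet in hex" blocks above are name/value pairs, with every value in hexadecimal. As a minimal sketch (the function name is hypothetical, and it assumes names and values are always separated by two or more spaces, as they are in this log), the fields can be pulled into a dictionary and converted. This shows, for example, that Unit Number 000003E8 is unit 1000 decimal:

```python
import re

# A few fields copied from the first packet in the DILX output above.
SAMPLE_PACKET = """\
      Cmd Ref Number       000010D5
      Unit Number          000003E8
      Log Sequence         0000002F
      Event Code           0000000B
      Controller ID        63300559 012D0009
      Instance                    0328450A
      Sense Key                   01
      ASC                         17
      ASQ                         07
"""


def parse_packet(text: str) -> dict:
    """Split each 'Name    VALUE' line on the first run of 2+ spaces."""
    fields = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


packet = parse_packet(SAMPLE_PACKET)
print("Unit number (decimal):", int(packet["Unit Number"], 16))
print("Log sequence (decimal):", int(packet["Log Sequence"], 16))
print("Sense key / ASC / ASCQ:", packet["Sense Key"], packet["ASC"], packet["ASQ"])
```

Sorting packets by the decimal Log Sequence value would also put the entries above back in logged order (2F, 30, 31, 32, 33, 34), since DILX printed some of them out of sequence.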
Also had these errors:
Unit 1100     Total IO Requests 1136
  Err in Hex: IC 0326450A  PTL:02/01/00  Key:03  ASC/Q:80/00  HC:1  SC:0
  Err in Hex: IC 031A4002  PTL:02/01/00  Key:04  ASC/Q:B0/00  HC:0  SC:1
  Err in Hex: IC 03134002  PTL:02/01/00  Key:04  ASC/Q:E0/06  HC:1  SC:0
  Total Errs   Hard Cnt 2   Soft Cnt 1
  The unit status and/or the unit device type changed unexpectedly.
  Unit 1100 dropped from testing
******************************************************************************
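The per-unit summary lines carry the instance code, the port/target/LUN (PTL), the sense key, ASC/ASCQ, and hard/soft counts. A minimal sketch of a parser for that line format follows. The names are hypothetical, and it assumes all fields, including HC and SC, are hexadecimal, as the "Err in Hex" label suggests. Summing the counts reproduces the "Total Errs" line DILX printed for unit 1000:

```python
import re
from typing import NamedTuple


class DilxError(NamedTuple):
    instance_code: str
    port: int
    target: int
    lun: int
    sense_key: int
    asc: int
    ascq: int
    hard_count: int
    soft_count: int


# Matches both "IC 0328450A" (unit lines) and "IC:03F40064" (controller line).
LINE = re.compile(
    r"IC:?\s*(?P<ic>[0-9A-F]{8})\s+"
    r"PTL:(?P<port>[0-9A-F]{2})/(?P<target>[0-9A-F]{2})/(?P<lun>[0-9A-F]{2})\s+"
    r"Key:(?P<key>[0-9A-F]{2})\s+"
    r"ASC/Q:(?P<asc>[0-9A-F]{2})/(?P<ascq>[0-9A-F]{2})\s+"
    r"HC:(?P<hc>[0-9A-F]+)\s+SC:(?P<sc>[0-9A-F]+)"
)


def parse_summary(text: str) -> list:
    """Return one DilxError per summary line found in the text."""
    return [
        DilxError(
            m["ic"],
            int(m["port"], 16), int(m["target"], 16), int(m["lun"], 16),
            int(m["key"], 16), int(m["asc"], 16), int(m["ascq"], 16),
            int(m["hc"], 16), int(m["sc"], 16),
        )
        for m in LINE.finditer(text)
    ]


# Unit 1000 lines copied from the DILX summary above.
UNIT_1000 = """\
  Err in Hex: IC 0328450A  PTL:01/05/00  Key:01  ASC/Q:17/07  HC:0  SC:2
  Err in Hex: IC 031A4002  PTL:01/05/00  Key:04  ASC/Q:B0/00  HC:0  SC:1
  Err in Hex: IC 03134002  PTL:01/05/00  Key:04  ASC/Q:E0/06  HC:1  SC:0
"""

unit_1000 = parse_summary(UNIT_1000)
hard_total = sum(e.hard_count for e in unit_1000)
soft_total = sum(e.soft_count for e in unit_1000)
print(f"Unit 1000: Hard Cnt {hard_total}  Soft Cnt {soft_total}")
```

The PTL of 01/05/00 parsed here matches the Port 1, Targ 5, Lun 0 shown for DISK150 in the configuration listing.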
	Troubleshooting eliminated the following:
HSJ50 Controller	:By running test from both controllers on both drives
SCSI cables		:By interchanging drives (on different buses) and
BA356 BUS		 again running DILX on both drives from both contr.
BA356 slot		 Same drive was failing.
SCSI terminators
	After T/S, I put the previously replaced drive back in the SBB and ran
the same tests. All tests ran fine. The customer ran INIT/erase on both units
and put them back in their respective shadowsets.
I have ordered an EZ32-VW (whole SBB swap unit) from SR17 with an ETA of 3-march.
I also contacted the customer today and there were still no errors.
 | 
| 6400.6 | ok, | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Thu Feb 20 1997 11:11 | 42 | 
|  | Let's see...
   DILX Summary at 18-FEB-1997 17:58:34
   Test minutes remaining: 29, expired: 1
Cnt err in HEX  IC:03F40064  PTL:01/05/FF  Key:06  ASC/Q:00/00  HC:0  SC:2
  Total Cntrl Errs   Hard Cnt 0   Soft Cnt 2
Unit 1000     Total IO Requests 6098
  Err in Hex: IC 0328450A  PTL:01/05/00  Key:01  ASC/Q:17/07  HC:0  SC:2
  Err in Hex: IC 031A4002  PTL:01/05/00  Key:04  ASC/Q:B0/00  HC:0  SC:1
  Err in Hex: IC 03134002  PTL:01/05/00  Key:04  ASC/Q:E0/06  HC:1  SC:0
  Total Errs   Hard Cnt 1   Soft Cnt 3
  The unit status and/or the unit device type changed unexpectedly.
  Unit 1000 dropped from testing
Reuse Parameters (stop, continue, restart, change_unit) [stop] ?
Unit 1100     Total IO Requests 1136
  Err in Hex: IC 0326450A  PTL:02/01/00  Key:03  ASC/Q:80/00  HC:1  SC:0
  Err in Hex: IC 031A4002  PTL:02/01/00  Key:04  ASC/Q:B0/00  HC:0  SC:1
  Err in Hex: IC 03134002  PTL:02/01/00  Key:04  ASC/Q:E0/06  HC:1  SC:0
  Total Errs   Hard Cnt 2   Soft Cnt 1
  The unit status and/or the unit device type changed unexpectedly.
  Unit 1100 dropped from testing
Unit 1000 had a couple of recoverable errors then disappeared. (E0 and B0 are HSJ events)
Unit 1100 had 1 hard error then disappeared. (E0 and B0 are HSJ events)
Seems strange. Both units are just dropping out of sight. I see from the logs
they are on different ports, so that kinda rules out a power/bus issue. 
I see you did some moving around and reseating of hardware, did you change
anything or did the units just start running?
Did you have a "regular" disk to also use to make sure you did not have a non-ez
problem?
If these units fail again, escalate a case to engineering and have those units
analyzed to make sure you are not fighting a symptom of "something else".
roger.
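For anyone reading the Key and ASC/Q columns quoted above, here is a minimal sketch that names the sense key and flags the vendor-specific ASC range. The sense key names are the standard SCSI ones, and ASC values 80h through FFh are reserved for vendor-specific use, which is consistent with E0 and B0 coming from the HSJ rather than the drive. The function name is hypothetical, and the sketch does not try to decode the HSJ-specific codes.

```python
# Standard SCSI sense key names (4-bit field).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}


def describe(key: int, asc: int, ascq: int) -> str:
    """Name the sense key and say whether the ASC is in the vendor-specific range."""
    name = SENSE_KEYS.get(key, "OTHER")
    origin = "vendor-specific" if asc >= 0x80 else "standard"
    return f"{name}, {origin} ASC/ASCQ {asc:02X}/{ascq:02X}"


# (Key, ASC, ASCQ) triples copied from the DILX summary lines.
for triple in [(0x01, 0x17, 0x07), (0x03, 0x80, 0x00),
               (0x04, 0xB0, 0x00), (0x04, 0xE0, 0x06)]:
    print(describe(*triple))
```

Note that 80/00 on unit 1100 also falls in the vendor-specific range, so its exact meaning has to come from the device or controller documentation rather than the generic SCSI tables.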
 | 
| 6400.7 | Confusion here... Sorry! | KAOFS::D_ORMAECHEA | Denis Ormaechea... Montreal MCS | Fri Feb 21 1997 09:46 | 26 | 
|  |     
    Roger,
    
    	Let me apologize for the confusion here. With the cut and paste
    I did from my document, my intention was to show you that I had two
    kinds of error strings out of DILX on the first instance code line:
    
    ASC/Q: 80/00 & 17/07.
    
    	The two units that you see in the report are actually the same
    physical drive, but in a different configuration during the
    troubleshooting step. To answer your question about the hardware
    change, the unit that I was troubleshooting had a solid problem even
    after moving the unit around and putting it back the way it was. I
    put the original unit back in the SBB because I had it with me,
    and the ETA for the new EZ32-VW is March 3. The original unit failed
    on Feb-13, but may only have an intermittent problem (with the same
    symptoms), so I think that this unit is not reliable.
    
    	In conclusion, I think that I had 3 bad units in a row!!!!
    
    Regards,
    
    Denis
    
    
 | 
| 6400.8 | I hope not. | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Fri Feb 21 1997 09:50 | 7 | 
|  | 
Three bad is REALLY BAD luck or I'm about to get a LOT busier ;^)
I am going on vacation next week (yes, even I do take vacations ;^) but if you
want me to look at the bad unit, send me mail offline.
roger.
 |