| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 6736.1 | from the drive side... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Fri May 30 1997 13:49 | 11 | 
|  | 
No idea why you have 1 less cylinder reported.
The drive does not use any type of alternate cylinder for revectoring. It
has spare sectors pre-mapped at the factory on each track and each head that it
uses. If the drive has major media/head problems and uses up all of these
sectors (that are NOT included in the capacity of the device), any subsequent
reassign results in a Sense Key of 04, ASC of 19 (Defect List Error). Not an
Illegal Request as your log indicated.
Roger.
 | 
| 6736.5 | SCSI Timeout is an Issue for heavy IO | MSAM03::RAHMAN |  | Sun Jun 01 1997 02:48 | 152 | 
|  |     
    Hi Roger,
    
    That is exactly the case: the block is not bad in the sense that you
    could detect it during formatting at manufacturing. It is an ugly block,
    i.e. bad because of difficulties that arise when attempting to read the
    block.
    
    I believe the situation that I am encountering is similar to the 2 notes
    I attached below. Please analyse this situation. If there is no logical
    explanation, then the customer has the right to change Digital's
    hardware.
    I have checked /usr/include/sys/disklabel.h for the definition of
    alternate sector and alternate cylinder, and it seems that it is not
    used in /etc/disktab. Please verify the Rz29-va (is it a Seagate
    Barracuda?) and whether the UNIX driver fails to comply with the SCSI
    commands from Seagate.
    
    MCS engineers have verified the "suspected" disk is OK at the local
    Digital office!!
    
    I would say it is because of heavy IO that the driver marks it as bad,
    and the alternate tracks are running out because of so many "UGLY"
    blocks.
    
    Please look into this matter more seriously. If you need info please ask
    for it.
    I am very interested in solving this matter once and for all. Otherwise,
    tomorrow I walk into the customer and sell a different vendor's box.
    
    rahman ibrahim@MSA
    SSU Malaysia.
    
    
    Topic #132: ``Bad RCT causes an err on BBR?''
    
    I believe the terms "Good" block and "Bad" block in the RCT should be
    clearly understood. The term "bad" generally implies unreadable. If the
    block is deemed bad at the factory (PBN entry in the FCT) or the
    Formatter "detects" the block as bad, then it will format the header
    with header code "11", marking it unusable. If the block header is still
    "00" (Good LBN) but difficulties arise attempting to read the block
    (continued uncorrectable ECC, smashed header, etc.) the block is again
    deemed bad. Alternate copies of the relative block will be accessed in
    the RCT during BBR or revector operations.
    
    There is, however, a condition I like to call "ugly". This is a block
    that is not bad but contains "bad data" with good ECC, EDC, etc.
    Alternate copies of these types of blocks WILL NOT BE ACCESSED under
    normal circumstances.
    
    Example:
    
    K.SDI fails and "forgets the HOST/RCT boundary" and writes a data
    pattern into the first few blocks of the RCT during periodics, for
    example. This corrupts the first copy of the RCT control block. The data
    happens to get written with good ECC, EDC. This could have a variety of
    effects during host mount of that disk.
    
    Continuing on, problems arise and the Field Engineer determines the
    K.SDI is bad and replaces it. Good! The disk is still corrupt but the
    symptoms may not be obvious. If the corruption "clobbered" word 4 in the
    RCT (BBR control word), the symptoms appear during each attempt to
    ONLINE the disk (VMS Mount, for example). If the P1 or P2 flags happen
    to be set, the system will attempt to finish a BBR that never really
    started. If the replaced LBN address field gets filled with this
    erroneous pattern, the HSC may attempt a BBR to a "non-existent" LBN and
    crash the HSC every time a mount is attempted. If undefined bits get set
    in the control word, the HSC will "data safety write-protect" the disk
    every time it is mounted. The list is endless, esp. if the descriptor
    blocks become affected.
    
    The point is this: if blocks in the RCT get written with bad data but
    good ECC, then alternate copies of the blocks are NOT ACCESSED because
    the block is considered "good" (a better term is readable, not
    necessarily good).
    
    I can produce these symptoms manually, and they do happen in the field,
    fortunately infrequently (I hope). We had two occasions of K.SDI failure
    in our lab (CSSE lab) that produced these very same "subtle" but serious
    problems. I saved the printout for one and use it during my seminar (DSA
    troubleshooting) to teach FEs how to deal with logical failures usually
    resulting from hardware failures.
    
    Rule of thumb: if you have experienced any hardware problem that could
    affect the R/W data path to the disk (controller, SDI, disk
    electronics), you may have experienced corruption on the media, which
    stays around "after" the HW is resolved. I call it logical recovery.
    
    Mark Himes
    CX/CSSE
    
    Topic #5752: ``command timeout issue''
    
    Looks like HSJ01$DUA62 and HSJ04$DUA702 are suffering Command
    Timeouts; what rev firmware are they running? (If it's running
    V007, upgrade to 0016...if it's running 0014, then it should be OK).
    
    I've included a blitz that Roger Patenaude put out in relation to
    Command Timeouts.
    
    BTW, you should REALLY upgrade HSOF to V2.7 and SWEAT to X2.7
    
    Copyright (c) Digital Equipment Corporation 1995. All rights reserved.
    
    +---------------------------+TM
    |   |   |   |   |   |   |   |
    | d | i | g | i | t | a | l |              TIME DEPENDENT CASE
    |   |   |   |   |   |   |   |
    +---------------------------+
    
    TITLE: What are SCSI Command Timeouts Errors?
    
    AUTHOR: Roger Patenaude                   DATE: August 16, 1995
    DTN: 237-3705                             TD #: 1904
    ENET: BABAGI::Patenaude                   CROSS REFERENCE #'s:
    DEPT: Storage External Products           (PRISM/TIME/CLD#'s)
          Continuation Engineering
    
    INTENDED AUDIENCE: All                    PRIORITY LEVEL:  2
    (U.S./EUROPE/GIA)                         (1=TIME CRITICAL,
                                               2=NON-TIME CRITICAL)
    =====================================================================
    PROBLEM:
    --------
    The purpose of this Blitz is to give you some insight as to what a
    SCSI "Command Timeout" error is. I've kept this very generic as more
    of an informational Blitz for a change.
    
    These errors are telling you that a specific "command" did not
    complete in a specified period of time. This can be caused by
    multiple sources and in most all cases can be recovered by the host
    system by reissuing the failed command. Some of the reasons for
    "Command Timeouts" are:
    
    1) The SCSI bus is too busy. The SCSI bus priority is designed using
       the drive's ID in arbitration with no regard for how many times
       the device wins the bus. So, if you have a bus with the highest
       priority device doing a VERY heavy workload ("hogging" the bus),
       then other devices on the bus will not be able to arbitrate and
       win the bus. These devices will then have commands outstanding
       that they cannot complete. The host will then log a "command
       timeout" error and sometimes follow it with a bus reset.
    
    2) The host issued a command to a drive that took too long to
       complete. This could be due to a broken device, but more commonly
       the device is doing a long command and does not have time to
       answer the host. Normal convention is that the host will only ask
       "how things are proceeding" (as in the case where you issued a
       rewind to a tape drive and are waiting for it to become ready)
       via a Test Unit Ready command, but if data type (read/write)
       commands are continually issued to the unit, the first command
       cannot be completed and may time out.
    
    3) Operating system driver issues. The drivers may not be allowing
       reasonable enough time for the commands to complete. A case in
       point: VMS recently increased the command timeout values in
       MKDRIVER (TAPE) and DKDRIVER (DISK) (from 3 seconds to 10 in MK).
       This was because 3 was just too aggressive on a busy bus, and
       command timeouts and bus resets were occurring under heavy load.
    
    4) Device issues. The drive may not have enough horsepower to
       complete the commands it accepted in a reasonable amount of time.
       OR, the drive may not be working on commands it has accepted
       because it is too busy. RZ28B's running version 003 code are one
       such case: the drive will optimize its seeks by working commands
       that are in the local area of the heads. One side effect is that
       a command may time out if it was not in the local area where the
       drive is spending all its time, thus not getting serviced.
       RZ28B's running 006 do not have this issue.
    
    RESOLUTION/WORKAROUND:
    -----------------------
    For the most part these are just events and should be left alone.
    
    In the rare case where this is disruptive due to resets occurring,
    review the four points above and see how they fit into your
    environment. You may need to split heavily loaded devices between
    multiple busses, or you may need new firmware or maybe move a
    device off to another bus.
    
    ADDITIONAL COMMENTS:
    --------------------
    None.
    
    ****  DIGITAL INTERNAL USE ONLY ****
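
Point 1 of the blitz (fixed-priority arbitration starving lower IDs) can be sketched with a toy model. This is an illustration only: the tick-based bus, the `simulate` function, the queue depths and the 10-tick timeout are all assumptions made for the example, and real SCSI behaviour (disconnect/reconnect, tagged queueing) is not modelled.

```python
def simulate(pending, ticks, timeout):
    """Fixed-priority arbitration: each tick the highest SCSI ID with work wins.

    pending maps SCSI ID -> number of bus tenures the device still needs.
    Returns the set of IDs that waited more than `timeout` ticks in a row.
    """
    pending = dict(pending)
    waited = {dev: 0 for dev in pending}
    timed_out = set()
    for _ in range(ticks):
        contenders = [dev for dev, left in pending.items() if left > 0]
        if not contenders:
            break
        winner = max(contenders)          # no fairness: highest ID always wins
        pending[winner] -= 1
        for dev in contenders:
            if dev == winner:
                waited[dev] = 0
            else:
                waited[dev] += 1
                if waited[dev] > timeout:
                    timed_out.add(dev)
    return timed_out

# ID 6 has far more work queued than the others can afford to wait for.
print(simulate({6: 50, 3: 2, 1: 2}, ticks=100, timeout=10))
# With the busy device moved to another bus, nobody is starved.
print(simulate({3: 2, 1: 2}, ticks=100, timeout=10))
```

The second call mirrors the workaround in the blitz: splitting the heavily loaded device onto its own bus removes the starvation without touching the timeout value.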
    
    
 | 
| 6736.6 | Man you are ALL over the place... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Mon Jun 02 1997 11:51 | 45 | 
|  | 
>    That is exactly the case: the block is not bad in the sense that you
>    could detect it during formatting at manufacturing. It is an ugly block,
>    i.e. bad because of difficulties that arise when attempting to read the
>    block.
Exactly WHAT case????? You got a failure in the errorlog that said;
----- CAM STRING -----
                                        ILLEGAL REQUEST - Illegal request or 
                                         _CDB parameter 
The drive also returned status that said it got an invalid request!
How are you equating that with a note about DSDF / RCT / FCT information that
was written about SDI devices (RA81, RA82, RA90, etc...) and a note about
command timeouts???????
>    Please verify the Rz29-va (is it a Seagate
>    Barracuda?) and whether the UNIX driver fails to comply with the SCSI
>    commands from Seagate.
 
It is a Seagate drive and YOU can dig through UNIX drivers. Not I. 
>    MCS engineers have verified the "suspected" disk is OK at the local
>    Digital office!!
>
So it's probably not the drive ;^)
>    Please look into this matter more seriously. If you need info please ask
>    for it.
NOTES IS NOT AN ESCALATION PATH!!!!! You need to look at this more seriously and
follow proper escalation to get this looked at. Have you tried any local sales
and service support folk? (Don't answer, rhetorical question)
>    I am very interested in solving this matter once and for all. Otherwise,
>    tomorrow I walk into the customer and sell a different vendor's box.
UNBELIEVABLE!!!! You have what most likely is a SOFTWARE problem and you are
about to condemn our hardware. Unbelievable is all I can say. Glad I only have
250 shares of DEC stock as of today with this mindset.
roger.
 | 
| 6736.7 | Help is needed...... | MSAM03::RAHMAN |  | Mon Jun 02 1997 20:09 | 8 | 
|  |     Thanks for your response to the problem. Oops! Sorry, this is not the
    ESCALATION path. I will be more careful next time. However, thanks for
    your time in looking into my problem. 
    
    I will escalate this problem to our support people.
    
    Rahman
 | 
| 6736.8 | Roger is right: escalate it | SUBSYS::BROWN | SCSI and DSSI advice given cheerfully | Tue Jun 03 1997 07:19 | 19 | 
|  |     I don't think it's clear whether this is a software problem or a 
    configuration error.  The SCSI sense data is 05/21/00, which means
    the software attempted to read a block beyond the drive's capacity.
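
For anyone not used to the key/ASC/ASCQ triplet notation, a minimal decoding sketch follows. The lookup tables are deliberately tiny and cover only the codes mentioned in this note string; the function name is hypothetical and the text labels follow the SCSI-2 code tables rather than any particular driver's output.

```python
# Illustrative only: a tiny subset of SCSI sense codes seen in this note string.
SENSE_KEYS = {0x04: "HARDWARE ERROR", 0x05: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x19, 0x00): "DEFECT LIST ERROR",
    (0x21, 0x00): "LOGICAL BLOCK ADDRESS OUT OF RANGE",
}

def describe_sense(triplet):
    """Decode a 'key/asc/ascq' string of hex bytes, e.g. '05/21/00'."""
    key, asc, ascq = (int(part, 16) for part in triplet.split("/"))
    key_text = SENSE_KEYS.get(key, "unknown sense key")
    asc_text = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ")
    return f"{key_text}: {asc_text}"

print(describe_sense("05/21/00"))  # what the error log showed
print(describe_sense("04/19/00"))  # what exhausted spare sectors would produce
```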
    
    Now, we know the capacity after the error was smaller than the capacity 
    before the error.  We know the blocks being read (16 blocks, starting at 
    0x7fd4ac) were within the drive's capacity before the error, and
    outside the capacity after the error.  We don't know when the capacity
    changed, or who changed it.
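
The before/after claim can be checked with simple arithmetic. The sketch below assumes the two capacity figures are the disklabel sector counts quoted elsewhere in this string (8378028 before, 8377528 after); the function name is made up for the example.

```python
def read_in_range(start_lba, block_count, capacity_blocks):
    """True if blocks start_lba .. start_lba + block_count - 1 all exist."""
    return start_lba >= 0 and start_lba + block_count <= capacity_blocks

START_LBA = 0x7FD4AC      # 8377516 decimal, from the error log
BLOCKS = 16               # transfer length of the failing read
OLD_CAPACITY = 8378028    # sectors claimed by the original disklabel (assumed)
NEW_CAPACITY = 8377528    # sectors after the disklabel was rewritten (assumed)

print(read_in_range(START_LBA, BLOCKS, OLD_CAPACITY))  # True
print(read_in_range(START_LBA, BLOCKS, NEW_CAPACITY))  # False
```

With the smaller capacity the read ends at block 8377531 while the last valid block is 8377527, so the final four blocks of the transfer fall off the end.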
    
    The obvious candidates are:
    - the Informix software
    - the HSZ40 controller
    - a bus reset, causing the drive to return to the most recently saved
    	capacity
    
    It may take a fair amount of time and engineering support to find the
    cause.  Please escalate, so the right people can be identified and 
    assigned.
 | 
| 6736.9 | notes collision | WRKSYS::HOUSE | Kenny House, Workstations Engineering | Tue Jun 03 1997 07:23 | 26 | 
|  |     So far as I can tell, there are two issues in the basenote.
    
    (1)	The error log is quite explicit about the HSZ40's complaining about
        an out-of-range logical block address used by a READ(10) command. 
        The LBA requested was 8377516(decimal), although the number of
        sectors claimed in the disklabel was 8378028(decimal).
    
    (2)	Writing over the disklabel changed the geometry, so that the number
        of sectors is now 8377528(decimal).  Note that the flags now have
        "dynamic_geometry" set, too.
    
    The whole concept of a simple sector/head/track geometry is an
    industry-wide falsehood.  Zoned drives (with different number of
    sectors per track) and RAID volumes, for example, do not have this
    structure.  It would be nice, however, if all logical blocks on this
    "geometry" were addressable -- this does not seem to be the case in (1)
    above.
    
    Do SAP or Informix bypass the normal file structure to get to the raw
    drive?  Are they likely to be writing the disklabel?
    
    There is no indication of a "retry exhausted" error or "SCSI timeout"
    in the information presented in this note string to date.  Nor is there
    clear evidence of a hardware problem.
    
    -- Kenny House
 | 
| 6736.10 |  | SSDEVO::ROLLOW | Dr. File System's Home for Wayward Inodes. | Tue Jun 03 1997 09:05 | 13 | 
|  | 	Many database class applications on UNIX use the raw device;
	it avoids any issues of whether the file system buffers the
	data (sync, fsync or not) and it avoids a buffer copy.  If
	you remember that disk reads and writes have to be multiples
	of the sector size it is also easy, using the same system calls
	as reading and writing files.
	Since Digital UNIX disklabels have been around for a few years
	most vendors that use raw disks have either figured out where
	the label is and don't use it, or require the user to partition
	the disk to protect the label.  If this is the same disklabel
	that got posted to the DIGITAL_UNIX conference this morning,
	that's what those 32 sectors are in the A partition.
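
A minimal sketch of the "same system calls, sector-sized multiples" point, for a POSIX system. It assumes a 512-byte sector and uses a temporary file as a stand-in for a raw device, so it can run without touching a real disk; `read_sectors` is a hypothetical helper, not part of any vendor's code.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size; query the real device in practice

def read_sectors(path, first_sector, sector_count):
    """Read whole sectors with the same system calls used for ordinary files."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, sector_count * SECTOR_SIZE, first_sector * SECTOR_SIZE)
    finally:
        os.close(fd)

# Stand-in for a raw device: a temporary file holding four numbered sectors.
with tempfile.NamedTemporaryFile(delete=False) as image:
    for n in range(4):
        image.write(bytes([n]) * SECTOR_SIZE)
    image_path = image.name

data = read_sectors(image_path, first_sector=2, sector_count=1)
print(len(data), data[0])
os.unlink(image_path)
```

An application that starts its own data at a fixed sector offset, or that is given a partition that excludes the label area, never overwrites the disklabel.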
 | 
| 6736.11 | Not broken H/W | SMURF::KNIGHT | Fred Knight | Wed Jun 04 1997 15:06 | 19 | 
|  | What most likely happened, is that some user labeled
this device BEFORE it was put into the HSZ40 (note that
there is NO dynamic geometry in the first disklabel).
Then, after installing it into the HSZ40, they just started
to use it (with the WRONG disklabel).  After the error, they
put a NEW disklabel (now a correct one) on the media (now
note that dynamic geometry IS set).  And magically, it now
works!
The only other option is the HSZ40 firmware bug that has
been BLITZed about conditions when the firmware would change
the size of a volume (not common, but still possible).
In both cases, NOTHING is broken in the H/W.  If it's case
1, then educate your customer, if case 2, use the documented
firmware workaround.
	Fred Knight
 | 
| 6736.12 | Hmm, did somebody INIT SAVE_CONFIG? | SSDEVO::JACKSON | Jim Jackson | Wed Jun 04 1997 17:46 | 25 | 
|  | Sure, we've seen this type of error a bunch when folks got careless about
reusing disks.  Here's a recipe for the problem:
	1) Have a direct-connected SCSI disk.  Put a filesystem on to it.
	2) Move the disk to an HSZ40
	3) INIT the disk from the HSZ40 console
	4) ADD UNIT
At this point, the host sees a disk that has a valid filesystem on it.  The
only problem is that the last few blocks have been lopped off by the HSZ40
to contain its metadata.
One of the rules we have in our lab is if you INIT it on the HSZ, then you
have to put a new filesystem on it (VMS INIT, Unix ??).  Our documentation
has stated for eons that you should assume that an HSZ INIT destroys the
user data on the disk.
disklabel value	8378028
new value	8377528
-----------------------
difference	    500
500 blocks is exactly the number of blocks consumed by SAVE_CONFIG.  So, in
your case, it would appear that you had a JBOD with a filesystem on it, the
disk got an INIT SAVE_CONFIG, and a new filesystem was not put in place.
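
The same comparison can be written as a quick sanity check to run before trusting an old label on a disk that has been moved behind a controller. The 500-block figure is the one quoted in this note, and `stale_label` is a hypothetical helper name.

```python
SAVE_CONFIG_BLOCKS = 500   # figure quoted in this note

def stale_label(label_sectors, device_sectors):
    """Return how many sectors the label claims beyond the device's current size."""
    return max(0, label_sectors - device_sectors)

excess = stale_label(label_sectors=8378028, device_sectors=8377528)
print(excess, excess == SAVE_CONFIG_BLOCKS)
```

Any non-zero result means the label (and probably the filesystem) predates the last change in reported capacity and should be recreated.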
 |