| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 483.1 | well, let me try | DECWET::EVANS | NSR Engineering | Thu Mar 13 1997 11:35 | 14 | 
|  | when NFS goes out to lunch, there's not much *any* application (such as
 NetWorker) can do. Especially when the filesystem is mounted hard.
NetWorker passes an RPC message from NSR-server to NSR-client (nsrexecd)
 which then tries to fstat each filesystem to gather info about "local"
 systems to back up. It's here that fstat hits the NFS mountpoint, and if
 NFS is gone, the fstat system call just does not return.
I see 2 points of failure: the RPC system call, and the fstat system call -
 both rely upon the network.
Thus, this is a system-level issue (NFS, Unix), not really a NetWorker one.
Did you try to restart your network???
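
A minimal sketch (Python, not NetWorker code) of one way a caller can avoid being blocked by such a call: run the stat in a child process and stop waiting after a deadline. The helper name `probe_path` and the 5-second default are assumptions made for illustration. On a hard mount the child may stay stuck in the kernel even after the kill, so this only protects the caller.

```python
import subprocess
import sys


def probe_path(path, timeout_s=5.0):
    """Return True if a stat of `path` succeeds within timeout_s seconds.

    The stat runs in a child process, so a hung NFS server blocks the
    child rather than the caller. (Hypothetical helper, for illustration.)
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat returned without error.
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hard mount may ignore
        # the signal until the server comes back. We deliberately do not
        # wait for it again.
        child.kill()
        return False
```

Usage: `probe_path("/")` should come back quickly on a healthy system, while a path under a dead NFS mount would be reported as unreachable once the deadline passes.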
 | 
| 483.2 | NFS problem on the NSR server | EVTAI1::POUSSARD |  | Fri Mar 14 1997 01:30 | 8 | 
|  |     The problem here is that the NFS problem occurred on the NSR server,
    not on the NSR clients, and the savegroups which started at 20:00 had
    nothing to do with the NSR server filesystems.
    
               
    	Gilles.
    
    
 | 
| 483.3 | Could Networker look only at its target disks? | SANITY::LEMONS | And we thank you for your support. | Thu Apr 03 1997 07:10 | 60 | 
|  |     Hi
    
    May I re-open this nfs discussion?  Last night, we had backups on 6
    clients time-out and fail, because these 6 clients all had the same nfs
    disk mounted.  NOTE: none of the clients have this nfs disk, or any nfs
    disk, listed as a partition for NetWorker to back up.  And yet, the
    NetWorker backups hung.  Why?
    
    When I enter just 'df' on one of the systems on which backups timed-out
    and failed, I see:
    biggun-23: df
    NFS2 fsstat failed for server cadsys : RPC: Timed out
    ^c
    Then, I tried this command, which specifically excludes nfs disks:
    biggun-24: df -t nonfs
    Filesystem                512-blocks        Used   Available Capacity Mounted on
    root_domain#root              199040      121624       63104    66%   /
    /proc                              0           0           0   100%   /proc
    usr_domain#usr               2347072     1512124      785344    66%   /usr
    var_domain#var               1564352      155288     1394752    11%   /var
    iss_work_domain#iss_work     4110480      197330     3867840     5%   /biggun/iss_work
    proj8_domain#proj8           4110480     2123910     1974256    52%   /biggun/proj8
    proj9_domain#proj9           4110480       92428     4006128     3%   /biggun/proj9
    proj10_domain#proj10         4110480          32     4085296     1%   /biggun/proj10
    proj11_domain#proj11         4110480     3692640      396576    91%   /biggun/proj11
    proj12_domain#proj12         4110480      523910     3560448    13%   /biggun/proj12
    alt_root_domain#root          199040       78546      106512    43%   /alt_root
    alt_usr_domain#usr           2347072          32     2303472     1%   /alt_usr
    alt_var_domain#var           1564352          32     1551568     1%   /alt_var
    biggun-25:
    
    When NetWorker interrogates the disks mounted on the client, does it:
    1. attempt to list all mounted disks
    2. attempt to list all non-NFS mounted disks
    3. attempt to list only the disks it has been told to back up?
    
    It appears that option #1 is done, where option #3 should be done, and
    option #2 would at least work.
    
    As listing all mounted disks provides no benefit that I can see, I
    view NetWorker's attempt to do so as a bug.
    
    Thoughts?
    
    Thanks!
    tl
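
    To make the three options concrete, here is a minimal sketch in Python.
    It is not NetWorker code. The function name `filesystems_to_probe`, the
    (mount point, filesystem type) input format, and the sample mount table
    are all assumptions for illustration; it also assumes the filesystem
    type can be read from the mount table without touching the mount itself.

```python
def filesystems_to_probe(mounts, savesets, nfs_types=("nfs", "nfs2", "nfs3")):
    """Pick the mount points worth checking before a backup.

    mounts   -- list of (mount_point, fs_type) pairs from the mount table
    savesets -- ["All"], or an explicit list of mount points to back up
    """
    # Option 2: never consider NFS mounts at all.
    local = [mp for mp, fs_type in mounts if fs_type.lower() not in nfs_types]
    if savesets == ["All"]:
        return local
    # Option 3: only the filesystems that were explicitly requested.
    wanted = set(savesets)
    return [mp for mp in local if mp in wanted]


# Hypothetical mount table, loosely modelled on the df output above.
MOUNTS = [
    ("/", "advfs"),
    ("/usr", "advfs"),
    ("/cadsys", "nfs"),
    ("/biggun/proj8", "advfs"),
]
```

    With an explicit list such as `["/usr", "/biggun/proj8"]`, the NFS mount
    is never examined; with `["All"]`, every local filesystem is returned.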
 | 
| 483.4 | try #2 | DECWET::EVANS | NSR Engineering | Thu Apr 03 1997 09:10 | 11 | 
|  | NetWorker passes an RPC message from server to client... which client?
 All the clients in the savegroup. How does it figure out which routing to
 use? Through system calls that use BIND, which rely on the same network
 stuff as NFS.
NetWorker relies upon system calls to resolve hostnames. If those system
 calls result in NFS usage, then you're still stuck in NFS-land.
 Hence the server-side behaviour.
This is base Legato code, not Digital porting changes, ergo, we need to
 file an enhancement request to Legato.
 | 
| 483.5 |  | SANITY::LEMONS | And we thank you for your support. | Thu Apr 03 1997 10:04 | 9 | 
|  |     Hi
    
    The client is NetWorker for Digital UNIX V4.2B.
    
    Your reply mentions BIND, and system calls that resolve hostnames.
    Could I take a step back, and ask why NetWorker attempts to get a list
    of all disks on the system?  That, to me, seems like the problem.
    
    tl
 | 
| 483.6 | check for mount points is important to NetWorker correctness | DECWET::CARRUTHERS | Life gets easier when you realize you can't have everything. | Thu Apr 03 1997 10:23 | 7 | 
|  | and stat/fstat calls are the standard way to determine if any file is a mount
point.  As Bruce mentioned in .1, this is a system-level (UNIX, NFS) issue.
{Remember, not all mount points have to be listed in /etc/fstab.
Many is the time I have mounted large remote file systems on my desktop at
/mnt and left them mounted for days.  I sure am glad NetWorker knows
not to back up those file systems through my desktop.}
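
For readers unfamiliar with the technique, a minimal sketch in Python of the usual stat-based mount-point test: compare the device number of a directory with that of its parent. This shows the general UNIX idiom, not NetWorker's actual code, and the name `is_mount_point` is made up for the example. Note that the check itself has to stat the path, which is exactly the call that hangs when an NFS server is unreachable.

```python
import os


def is_mount_point(path):
    """Return True if `path` looks like the root of a mounted filesystem."""
    try:
        here = os.lstat(path)
        parent = os.lstat(os.path.join(path, os.pardir))
    except OSError:
        # Missing or unreadable paths are treated as "not a mount point".
        return False
    if here.st_dev != parent.st_dev:
        # Different device numbers: a filesystem boundary was crossed.
        return True
    # Same device and same inode: the path is its own parent, i.e. "/".
    return here.st_ino == parent.st_ino
```

Python's standard library offers the same test as `os.path.ismount`.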
 | 
| 483.7 | soft option | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Thu Apr 03 1997 11:54 | 9 | 
|  |     Hi tl (t?)
    
    You can also change the fstab file such that the NFS filesystem(s) are
    mounted with the "soft" option. Then, after a reasonable number of
    timeouts and retries, the fstat/stat system calls give up with an
    error instead of hanging indefinitely...
    
    Manu.
    
 | 
| 483.8 |  | SANITY::LEMONS | And we thank you for your support. | Fri Apr 04 1997 09:06 | 44 | 
|  |     Thanks for this discussion.  I still think I'm missing the point.  I
    understand that NetWorker relies on UNIX and its add-ons (like nfs) to
    access the disks that it backs up.  If UNIX can't access the disk, then
    NetWorker can't either.  I'm certainly okay with that.
    
    My concern is that I don't want NetWorker backups to fail on a client,
    when it can't access one of the disks.  I want NetWorker to do whatever
    work it can.  I don't understand nfs very well, but I do know that we
    use nfs 'soft' mounts, as in:
    
    /usr@cadsrv:/server_usr:ro:0:0:nfs:bg,soft,intr,timeo=12,retrans=5,retry=10:
    
    When a new NetWorker client is created, Saveset has a default value
    of 'All'.  So, NetWorker would have to find the list of all the disks
    on the system, and back up each one.  Right?
    
    But we don't do that; we explicitly list each disk/partition we want to
    save.  So there is no need for the (apparent) full-system list of disks
    that NetWorker tries to obtain.
    
    I feel that, if the list of Savesets is not 'All', then NetWorker
    should NOT attempt to list all disks, but should check the status of
    the disks/partitions listed in the Saveset field ONLY.  That would step
    completely around this NFS problem, as we heed NetWorker's suggestion,
    and do not back up any NFS-mounted disks.
    
    What I don't completely understand is why NetWorker times out after 33
    minutes.  My read of the man pages for the mount parameters in
    /etc/fstab is that the NFS disk access should time out after 6 seconds.
    Any thoughts on that?
    
    Thanks!
    tl
    
    [from the ULTRIX V4.3 'man 8nfs mount' man page:]
    retrans=n     Set number of NFS operation retransmissions (not the
    mount) to n. The retrans= option applies after the mount has succeeded.
    
    retry=n       Set number of mount failure retries to n. The retry=
    option applies to the mount command, itself.
    
    timeo=n       Set NFS timeout to n tenths of a second.
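
    The 6-second figure can be checked with a little arithmetic. The sketch
    below (Python; the helper name `nfs_wait_tenths` is made up) assumes
    that `retrans` counts the total number of transmissions, and it also
    shows what happens if the client doubles the interval after each
    timeout. Whether this client backs off that way is an assumption, not
    something the man page excerpt states.

```python
def nfs_wait_tenths(timeo, retrans, backoff=1):
    """Total wait, in tenths of a second, before a soft-mounted NFS
    operation gives up.

    timeo   -- initial timeout in tenths of a second (mount option timeo=)
    retrans -- number of transmissions attempted (mount option retrans=)
    backoff -- factor applied to the interval after each timeout
    """
    total, interval = 0, timeo
    for _ in range(retrans):
        total += interval
        interval *= backoff
    return total


# Values from the fstab entry above: timeo=12, retrans=5.
fixed_seconds = nfs_wait_tenths(12, 5) / 10               # 1.2 s x 5 = 6.0 s
doubled_seconds = nfs_wait_tenths(12, 5, backoff=2) / 10  # 1.2+2.4+4.8+9.6+19.2 = 37.2 s
```

    Either way, a single operation gives up in well under a minute.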
    
 | 
| 483.9 |  | DECWET::FARLEE | Insufficient Virtual um...er.... | Fri Apr 04 1997 09:52 | 20 | 
|  | Terry,
I agree with you that the behavior you suggest is reasonable, and
what "should happen".  I will try to walk through the code when I get 
a chance to find out what is really happening, but it won't be for a week or so.
Can you tell me if the client times out during the probe, or partway
through a save?  That would distinguish between the two possibilities that
I can see:
1) Regardless of the "savesets" field, we check every mounted filesystem
	at "probe" time when we're trying to figure out what to save.
	If this is happening, we'll fix it.
2) During the saving of a filesystem, we stat each directory that
	we walk into.  If that directory happens to be the mountpoint
	for an NFS filesystem, we hang.  Not sure what we could do 
	about this one.
Kevin
 | 
| 483.10 |  | KAHLUA::LEMONS | And we thank you for your support. | Fri Apr 04 1997 10:19 | 41 | 
|  |     Hi Kevin
    
    Thanks for validating my suggestion, and for offering to walk the code
    at a later date.
    
    Here are some lines from the /nsr/logs/messages file.  Please let me
    know if they don't answer your question.
    
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr  3 02:21:12 1997.
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 last message repeated 10 times
    Apr  3 05:38:34 robot1 crsupp:
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ has been inactive for 32 minutes since Thu Apr  3 03:25:51 1997.
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:34 robot1 crsupp: * cadpxa.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 last message repeated 7 times
    Apr  3 05:38:34 robot1 crsupp:
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ asavegrp: authtype nsrexec
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ has been inactive for 30 minutes since Thu Apr  3 01:15:07 1997.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:/ is being abandoned by asavegrp.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 crsupp: * cadtls.hlo.dec.com:probe abandoned.
    Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/aloe_build: Connection timed out
    Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat /cadsys/tsc: Connection timed out
    
    Thanks!
    tl
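
    A minimal sketch (Python) of sorting such log lines by the phase in
    which each client failed, probe or save. The regular expression is
    inferred only from the sample lines above, and the function name
    `classify_failures` is made up; other NetWorker message formats may
    well need different patterns.

```python
import re
from collections import defaultdict

# Matches "* <client>:<rest of message>" as seen in the sample log lines.
LINE_RE = re.compile(r"\*\s+(?P<client>[^:\s]+):(?P<rest>.*)$")


def classify_failures(lines):
    """Map each client name to the set of phases ("probe", "save") that failed."""
    phases = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        rest = match.group("rest")
        if rest.startswith("probe abandoned"):
            phases[match.group("client")].add("probe")
        elif "save: cannot stat" in rest:
            phases[match.group("client")].add("save")
    return dict(phases)


# Three lines taken from the log above, with wrapped lines rejoined.
SAMPLE = [
    "Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:/ is being abandoned by asavegrp.",
    "Apr  3 05:38:33 robot1 crsupp: * cadosf.hlo.dec.com:probe abandoned.",
    "Apr  3 05:38:34 robot1 crsupp: * cadsrv.hlo.dec.com:/ save: cannot stat "
    "/cadsys/aloe_build: Connection timed out",
]
```

    On these lines, cadosf is reported as failing in the probe phase and
    cadsrv in the save phase.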
 |