| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 2353.1 | No answers, just suggestions. | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Thu Mar 04 1993 09:49 | 33 | 
|  |     
    
    Sunil,
    
    What exactly are the last few messages in OAFC$SERVER.LOG ?  These
    should indicate if the server process terminated via some 'normal'
    reason, or whether something a little more unusual is going on.
    
    For instance, when ALL-IN-1 is shut down (and hence the server), the
    following messages are written to the log file prior to the server
    process stopping :
    
    3-MAR-1993 16:59:21.37  Server: TRON::"73="  
    Error: %MCC-E-ALERT_TERMREQ, thread termination requested  
    Message: CsiCacheBlockAstService; Error from mcc_astevent_receive
    
    3-MAR-1993 16:59:26.13  Server: TRON::"73="  
    Error: %MCC-E-ALERT_TERMREQ, thread termination requested  
    Message: SrvTimeoutSysMan; receive alert to terminate thread
    
    
    Are you running housekeeping procedures which shut down ALL-IN-1, with
    the problem occurring because things are not being started up again
    properly afterwards ?
    
    If there are no hints or clues in the log file, I think you need to find
    out when the server dies, and whether there's any consistency. Usually the
    log file will indicate if the server is unhappy.
    
    
    
    Kevin.
    
    
 | 
| 2353.2 | multiple object 73's? | CHRLIE::HUSTON |  | Thu Mar 04 1993 14:41 | 23 | 
|  |     
    Sunil,
    
    As Kevin said, the server should not just "die". If it is being
    shut down nicely by someone, there will be several log messages
    in oafc$server.log about thread termination requested. If these are
    there, someone is telling the FCS to shut down.
    
    If there is nothing there, other than startup messages, then my
    guess is that someone is either doing a stop/id=FCS_PID or 
    another possibility, not sure how this would work, is if someone else
    is starting something up as DECnet object 73, either another server
    or some other application. Not sure what the effects of this would
    be, but having multiple applications up with the same obj number is
    bad.
    
    If you can get some sort of guess as to when the process goes away,
    it would help. Turn tracing on just before that and see what happens.
    
    Sorry we can't give you more to go on.
    
    --Bob
    
 | 
| 2353.3 | More info | BUSHIE::SETHI | Man from Downunder | Fri Mar 05 1993 00:09 | 36 | 
|  |     Hi Bob and Kevin,
    Having looked at the server log and your example there does seem to be
    a difference.  The users were unable to access their shared drawers at
    13:30 yesterday and here is part of the log:
    3-MAR-1993 06:29:39.30  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete
    3-MAR-1993 22:57:24.14  Server: AUTC01::"73="  Error: %DSL-W-SHUT,
    Network shut down  Message: Shutting Down server, network failure.
     
    4-MAR-1993 10:04:04.47  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete
    4-MAR-1993 13:29:38.13  Server: AUTC01::"73="  Message: Startup for
    File Cabinet Server V1.0 complete
    The server was started at 4-MAR-1993 10:04:04.47 and in between it died
    and the customer restarted it at 4-MAR-1993 13:29:38.13.  No error
    messages are in the logfile to point to the reason for the failure. Please 
    note the customer reboots his system every night at 11:00 pm.  
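
The pattern described above (two startup messages with nothing logged between them) can be checked for mechanically. A minimal Python sketch, assuming the log keeps the layout quoted in this note; the function names `parse_entries` and `silent_exits` are hypothetical and not part of the FCS:

```python
import re
from datetime import datetime

MONTHS = {m: i for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# VMS-style timestamp at the start of a log entry, e.g. "3-MAR-1993 06:29:39.30"
STAMP = re.compile(
    r"^\s*(\d{1,2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{2})\s+(.*)$")

def parse_entries(text):
    """Group lines into (timestamp, message) entries.
    Lines without a timestamp are treated as continuations."""
    entries = []
    for line in text.splitlines():
        m = STAMP.match(line)
        if m:
            day, mon, year, hh, mm, ss, cc, rest = m.groups()
            when = datetime(int(year), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss), int(cc) * 10000)
            entries.append([when, rest.strip()])
        elif entries and line.strip():
            entries[-1][1] += " " + line.strip()
    return [(when, msg) for when, msg in entries]

def silent_exits(entries):
    """Return (start, restart) pairs where one startup directly follows
    another, i.e. the server went away without logging anything."""
    gaps = []
    for (t1, m1), (t2, m2) in zip(entries, entries[1:]):
        if "Startup for" in m1 and "Startup for" in m2:
            gaps.append((t1, t2))
    return gaps

# Log excerpt copied from this note.
LOG = '''
3-MAR-1993 06:29:39.30  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete
3-MAR-1993 22:57:24.14  Server: AUTC01::"73="  Error: %DSL-W-SHUT,
Network shut down  Message: Shutting Down server, network failure.

4-MAR-1993 10:04:04.47  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete
4-MAR-1993 13:29:38.13  Server: AUTC01::"73="  Message: Startup for
File Cabinet Server V1.0 complete
'''

for start, restart in silent_exits(parse_entries(LOG)):
    print(f"started {start}, gone without a message before restart at {restart}")
```

Run over several days of log, the same check would give a list of windows in which the server disappeared, which is a starting point for seeing whether it happens around the same time each day.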
    
    I have asked the customer to enable accounting so that I can get
    extra information.  I have copied the logfile to RIPPER::Q30178.LOG_2;
    it may have something in there that I just did not pick up.  Hopefully
    either the server trace or the accounting will pick up something.
    
    Finally the customer has assured me that they do not have other
    applications running on the system therefore object 73 is not being
    used for anything else.
    
    Thanks for your advice, will keep you posted,
    
    Sunil
    
 | 
| 2353.4 | I'll look at the log | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Fri Mar 05 1993 09:22 | 25 | 
|  |     
    Sunil,
    
    As you said, the system is rebooted at 11pm, so that explains the 
    "Error: %DSL-W-SHUT,Network shut down" message. As the system is about
    to go away the server shuts itself down.
    
    So it appears that the problem occurs between the last two startup
    messages. As nothing else has been logged the server certainly did not
    die from natural causes, at least it doesn't seem that way. Even if
    someone is doing something to seriously upset the server, some form of
    message would appear in the log.
    
    When I get time I'll take a look at the log you have provided. The next
    step is probably to see if the server seems to go away around the same
    time each day.
    
    At the moment, the only way I can see this happening is if someone did
    a STOP PROC/ID of the process.
    
    
     
    Kevin.
    
    
 | 
| 2353.5 | STOP/ID writes messages to the log file | SCOTTC::MARSHALL | Spitfire Drivers Do It Topless | Fri Mar 05 1993 10:15 | 7 | 
|  | Re: STOP/ID
When I do that, several "thread termination" messages get written to the log
file.  So it doesn't look like anyone's doing that (unless they also lock the
log file first to stop the server writing to it! :-)
Scott
 | 
| 2353.6 | You won't always get "thread termination" | IOSG::STANDAGE | Oink...Oink...Mooooooooooooooooooooooooooooooooo | Fri Mar 05 1993 11:24 | 11 | 
|  |     
    Scott,
    
    This isn't always the case; it very much depends on what is happening
    on the system at the time. I just did this on a test machine (server
    state "HIB") - and no thread termination messages were produced.
   
    
    Kevin.
    
    
 | 
| 2353.7 | run the server in the foreground | CHRLIE::HUSTON |  | Fri Mar 05 1993 15:37 | 48 | 
|  |     
    I can think of 2 ways to have the server go away with no message:
    
    1) stop/id -- I have never seen it log a message, the process is
    stopped immediately so it won't have enough time to write a message.
    This is usually how we stop servers during our testing.
    
    2) The server itself access violated.  The server runs as two layers,
    the bottom layer does about 98% of the work and any access violation
    at this level will be written to the log file via a condition handler.
    The upper level does all the DASL and DECnet interaction; it has no
    condition handler and runs at AST level. These routines are called
    by DASL in response to certain DASL events such as receiving a
    DASL message. Unfortunately, since the server runs as a detached
    process, if this layer access violates the process will silently go
    away.  A problem at this layer could be in either the server or
    DASL. Do you know what version of DASL they are using? The FCS
    ships with V2.0. I know that there is a V2.2; we have not tested
    against it, and theoretically it should work due to backwards
    compatibility, but who knows, maybe there is a problem.
    
    What can you do next?
    
    Start the server in the foreground, not through ALL-IN-1.  Do the
    following:
    
    $ A1FCS :== $sys$system:oafc$server.exe
    $ A1FCS your_configuration_file.dat
    
    To get your config file name, go to the MS menu and do an R on the
    server; it will show you the config file.
    
    Note that when you start the server up like this, the server is running
    in the context of the process you do the command from. Your best choice
    for this is to log into the OAFC$SERVER account (made during
    installation); you may have to mess around in the UAF record to
    allow logins since the account is installed as DISUSER'd.  If this is
    not do-able, the next best choice is the ALLIN1 account or SYSTEM;
    either should have suitable privs and quotas to run the server.
    
    When you do this, if the server access violates at the top level, you
    will see the access violation on the screen, please save it and either
    send it to me or post it here.
    
    Thanks
    
    --Bob
    
 | 
| 2353.10 | Changed some sysuaf parameters and monitoring | BUSHIE::SETHI | Man from Downunder | Fri Mar 12 1993 05:45 | 40 | 
|  |     Hi All,
    The customer had the problem reoccur yet again and we had accounting
    enabled but the customer forgot to turn on tracing (makes me feel
    grumpy 8*{).
    The accounting file did not have a record for the process, nor did the
    OAFC$SERVER.LOG file. I also did an analyze/error/include=bugcheck and
    found nothing.
    
    I then audited the OAFC$SERVER account and the SYSTEM account and found
    the following:
    
    mod OAFC$SERVER/BIOlm=50/DIOlm=50/astlm=100/TQElm=50/enqlm=300
    
    In other words :-) these quotas were 5 times below what I changed them
    to.  The SYSTEM account did not have the OA$MANAGER identifier; I don't
    know if it is required, but I granted it as per my system.
    
    I asked the customer to reboot the system and he did so during the
    lunch hour.  So far he has not reported any problems and it seems that
    this is the first time after a reboot he has not had any minor or major
    problems.  I will monitor the system and report back any findings.
    
    One thing though: why has the accounting file not got an entry for the
    process starting and stopping ?  Accounting was enabled before ALL-IN-1
    was started.  
    
    One last question Bob ;-),
    
    What is DASL ? How do I find out what version the customer has
    installed ?
    
    >$ A1FCS :== $sys$system:oafc$server.exe
    >$ A1FCS your_configuration_file.dat
    
    I did all of this; no stack dumps etc.
    
    Regards,
         
    Sunil
 | 
| 2353.11 | DASL = DECNet i/f; Care with Trace file size... | CHRLIE::HUSTON |  | Fri Mar 12 1993 13:37 | 28 | 
    DASL is Distributed Service Application Layer. It is a protocol that
    lies on top of DECnet; the FCS uses it for all its DECnet work. It removes
    us from needing to make DECnet calls.  DASL is not shipped as a product;
    if a shipping product needs it (like the FCS) then it is up to that
    product to supply DASL. We include V2.0 in the kits so they have at
    least V2.0.
    
    Ok, if this never stack dumped, did it simply go away?  You said the 
    server went away again, was there no message at this terminal?
    Running the server in this manner simply runs the server as the
    foreground process rather than as a detached process. If you run
    the server in this way and it access violates outside the scope of
    the condition handler, then you would see the access violation. If
    the process simply died, not sure how, then what you would probably 
    see is the startup message, then a '$' saying you were done and back
    at DCL.
    
    Before you do this, please go into ALL-IN-1 and stop the server that
    ALL-IN-1 starts, else all kinds of fun things happen.
    
    Also, if you cannot narrow down a time or circumstance that the server
    goes away on, I do not recommend turning tracing on. Each trace record
    is 1024 bytes and each request to the server takes an ABSOLUTE MINIMUM
    of 2 trace records or 2048 bytes. Most events take more than 2 trace 
    records. So running the FCS with tracing on all the time is rather
    disk intensive.
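
To put rough numbers on that, here is a minimal Python sketch of the lower bound on trace file growth. The record size and the two-record minimum come from the figures above; the request count in the example is purely hypothetical, and real usage will be higher since most events take more than two records:

```python
TRACE_RECORD_BYTES = 1024      # size of one trace record (figure quoted above)
MIN_RECORDS_PER_REQUEST = 2    # absolute minimum records per server request
DISK_BLOCK_BYTES = 512         # one VMS disk block

def min_trace_bytes(requests):
    """Lower bound on trace file growth, in bytes, for a number of requests."""
    return requests * MIN_RECORDS_PER_REQUEST * TRACE_RECORD_BYTES

def min_trace_blocks(requests):
    """Same lower bound expressed in 512-byte disk blocks (rounded up)."""
    return -(-min_trace_bytes(requests) // DISK_BLOCK_BYTES)

# Hypothetical load: 10,000 requests while tracing is on.
requests = 10_000
print(min_trace_bytes(requests), "bytes,", min_trace_blocks(requests), "blocks")
# 10,000 requests -> at least 20,480,000 bytes (40,000 blocks)
```

This is why it pays to narrow down when the server goes away before enabling tracing: the bound scales linearly with the number of requests served while tracing is on.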
    
    --Bob
 | 
| 2353.12 |  | BUSHIE::SETHI | Man from Downunder | Tue Mar 16 1993 04:32 | 19 | 
|  |     G'day All,
    
    The problem has been solved.  Basically it was a bit of this and a bit
    of that :-).
    
    The problem was caused by an in-house process killing job running on a
    batch queue.  Aaaaahhhhh !!!! I had asked the customer many a time if a
    stop/id= was being done on the process and he said "No".
    
    The lesson of this hair pulling story is:
    
    1. Never trust a customer when he says no to the obvious question
    2. Show system does not always show process killers, especially when their
       process names have not been set.
    3. Process killers can run on batch queues
    
    Thanks to all of you for your help,
    
    Sunil
 |