| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 8992.1 | you're on the right track... | SMURF::PETERT | rigidly defined areas of doubt and uncertainty | Fri Feb 28 1997 08:41 | 20 | 
|  |     Your assumptions of 1) different .so files, and 3) not enough file
    space, are both valid.  Though on 3) you should get some sort of
    warning message from the system that the device is full.  As for
    2) it seems to me I've gone through this before and never come
    up with a decent answer.  Certainly one way would be to replace
    the .so files on your system with theirs, but that is not what 
    you are looking for.  Checking the man page for the loader(5),
    it appears the best bet would be to set the environment variable
    _RLD_LIST with an explicit list of libraries, which will get
    loaded in that order.  There is also a mention of LD_LIBRARY_PATH
    which might work, but it seems to me that this is the one that 
    didn't work the last time I looked at this.  You should peruse 
    the loader man page and see if it helps out.  
    
    But if the core is corrupt, nothing is going to help.  If they can
    get a traceback on their system, before you take a look at it,
    then the library mismatch is likely the problem.
    
    PeterT
    
 | 
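The two loader settings mentioned in .1 can be sketched as below. This is a minimal sketch under stated assumptions, not a tested recipe: the directory /tmp/customer_libs and the choice of libc.so are hypothetical, and the DEFAULT entry in _RLD_LIST is assumed to stand for the program's normal library list, which should be confirmed against loader(5) on your release.

```shell
# Hypothetical directory holding copies of the customer's shared libraries.
CUST_LIBS="${CUST_LIBS:-/tmp/customer_libs}"

# Option 1: put the customer's directory at the front of the library
# search path, keeping whatever was already there.
LD_LIBRARY_PATH="$CUST_LIBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

# Option 2: name the libraries explicitly, in load order.
# ASSUMPTION: libc.so is only an example; DEFAULT is assumed to stand for
# the program's normal library list (check loader(5)).
_RLD_LIST="$CUST_LIBS/libc.so:DEFAULT"
export _RLD_LIST

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "_RLD_LIST=$_RLD_LIST"
```

Either variable has to be set in the shell that starts the debugger, before the core file is opened.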
| 8992.2 | Turn off the setuid bit | WTFN::SCALES | Despair is appropriate and inevitable. | Fri Feb 28 1997 09:28 | 14 | 
|  | .1> Checking the man page for the loader(5), it appears the best bet would be
.1> to set the environment variable _RLD_LIST with an explicit list of
.1> libraries, which will get loaded in that order.  There is also a mention
.1> of LD_LIBRARY_PATH which might work
The key is whether the debugger image has the setuid bit set -- if it does,
then the loader will ignore LD_LIBRARY_PATH, and it probably skips _RLD_LIST,
too, for the same reason.  (We use LD_LIBRARY_PATH, and we hit this problem
with Ladebug.)  However, if you turn off the setuid bit, things should work
well (provided you remember to set up LD_LIBRARY_PATH ;-).  By turning off
the setuid bit, you lose only the ability to do remote debugging, I think.
				Webb
 | 
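The setuid check described in .2 might look like the sketch below. The install path /usr/bin/ladebug is an assumption, and has_setuid is a hypothetical helper name; the script only reports the state of the bit and prints the chmod command rather than running it, since changing the bit needs root.

```shell
# Report whether a file has the setuid bit set (POSIX "test -u").
has_setuid() {
    [ -u "$1" ]
}

# ASSUMPTION: the debugger's install path; adjust for your system.
DEBUGGER="${DEBUGGER:-/usr/bin/ladebug}"

if [ ! -e "$DEBUGGER" ]; then
    echo "$DEBUGGER not found"
elif has_setuid "$DEBUGGER"; then
    echo "$DEBUGGER is setuid; the loader will ignore LD_LIBRARY_PATH for it"
    echo "to clear the bit (as root): chmod u-s $DEBUGGER"
else
    echo "$DEBUGGER is not setuid"
fi
```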
| 8992.3 | Thanks, | EDSCLU::KELLY |  | Fri Feb 28 1997 12:44 | 5 | 
|  | Thanks for your help and input.  This is what we were looking for.
Regards,
Jim Kelly
 | 
| 8992.4 |  | SMURF::DENHAM | Digital UNIX Kernel | Fri Feb 28 1997 14:13 | 7 | 
|  |     There are also a couple of other possibilities. The memory the core
    file recorded may well be corrupt. In other words, if the stack is
    corrupted or the code took a wild jump somewhere, you can see
    similar unhelpful tracebacks. 
    
    Also, before V4.0, having more than 15 or so threads in a process
    that dumps core would corrupt the core file.
 | 
| 8992.5 |  | DCEIDL::BUTENHOF | Dave Butenhof, DECthreads | Mon Mar 03 1997 07:33 | 18 | 
|  | .2: The key is whether the debugger image has the setuid bit set -- if it 
.2: does, then the loader will ignore LD_LIBRARY_PATH, and it probably skips 
.2: _RLD_LIST, too, for the same reason.
This is only a problem when you're trying to affect the shared libraries USED
BY THE DEBUGGER. It's not a problem when you only want to affect the shared
libraries used by the program that you're debugging. (Unless, of course, it,
too, has setuid [or setgid] set.)
DECthreads runs into the problem Webb mentions because, with Digital UNIX
4.0, ladebug links against a library to facilitate debugging threaded
programs, and that library must match exactly with the version of DECthreads
used by the program you're debugging. Since we tend to debug against versions
of the thread library more recent than that installed on the system, we also
need ladebug to use the matching libpthreaddebug.so image -- which means
shutting off ladebug's setuid.
	/dave
 | 
| 8992.6 | help! | EDSCLU::WANG |  | Fri Mar 28 1997 09:13 | 16 | 
|  |     Hi,
    
            We received a couple more core dumps from the same customer. We
    also asked them to get the dbx output of "where", "tlist", and "tstack"
    on the system (V3.2C) where the core was generated. We got the same
    results as in the base note .0. We really cannot determine what caused
    the core from the dbx information. They checked that the disk was big
    enough for the core and that the user was root.
    
            Why did all the core files get corrupted? Is there anything
    that the customer should be aware of, so that the next time the process
    dies it will generate a useful core dump?
    
    Thanks,                                 
    Danqing
    
 | 
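For reference, capturing the three dbx commands named in .6 to a file could be sketched as follows. The program name lu62_server comes from later in the thread; the core file name, the output file name, and the idea that dbx will take its commands on standard input are all assumptions to verify on the customer's system.

```shell
PROG="${PROG:-lu62_server}"   # program name taken from the thread
CORE="${CORE:-core}"          # ASSUMPTION: core file name
OUT="${OUT:-dbx_trace.txt}"   # hypothetical output file

# The three dbx commands quoted in .6, followed by quit.
DBX_CMDS='where
tlist
tstack
quit'

if command -v dbx >/dev/null 2>&1 && [ -f "$PROG" ] && [ -f "$CORE" ]; then
    # ASSUMPTION: dbx accepts its commands on standard input.
    printf '%s\n' "$DBX_CMDS" | dbx "$PROG" "$CORE" > "$OUT" 2>&1
    echo "traceback saved in $OUT"
else
    echo "dbx, $PROG, or $CORE not available; nothing captured"
fi
```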
| 8992.7 | Stack overwritten? | QUARRY::petert | rigidly defined areas of doubt and uncertainty | Fri Mar 28 1997 09:40 | 13 | 
|  | Hmmm, are you saying the customer can't do the trace backs either?
If so, it seems likely that the core file is getting corrupted, but
I'm not sure why.  One possibility is that the program is crashing 
because it has a memory bug someplace and it is overwriting its
own stack.  In this case, all the information that a debugger 
looks for to determine the cause of the crash has been overwritten
with data which is meaningless to the debugger.  This can well
be the cause of the error messages you see as the debugger starts
up.  In this case, you might have more luck analyzing the program 
with some of the atom tools.  3rd (or is it third?) comes to mind,
but I don't know as much about it as some others might.
PeterT
 | 
| 8992.8 |  | EDSCLU::WANG |  | Fri Mar 28 1997 10:48 | 7 | 
|  |     Yes, the customer can't do the trace backs either. I've never run atom
    tools before. Would you mind showing me how? I did "atom -tool third
    lu62_server" then it created lu62_server.third*. Where do I go from
    here? 
    
    Thanks for your help
    
 | 
| 8992.9 |  | SMURF::DENHAM | Digital UNIX Kernel | Fri Mar 28 1997 11:39 | 7 | 
|  |     So, how many threads in the application? More than 15 or
    so? Then you need a kernel patch to get good core files.
    
    I'm looking in the support pool sources and I'm not seeing
    the patch to kern_sig.o to fix this. I gave the support
    team a fix for this problem months ago, but it never seems
    to have made it into the pools. Again! Argh.
 | 
| 8992.10 |  | EDSCLU::WANG |  | Fri Mar 28 1997 11:55 | 7 | 
|  |     Yes, we have more than 15 threads. 
    
    So if the customer installs this new kernel patch, they may get good
    core files? Would you please let us know where we can get this new
    patch? 
    
    Thanks a lot.
 | 
| 8992.11 | Looking for more info | EDSCLU::GARROD | IBM Interconnect Engineering | Fri Mar 28 1997 14:51 | 12 | 
|  |     Re .9
    
    Could you please give more information on this fix? For example, which
    versions of UNIX does it apply to? What is the patch kit ident? Is
    it incorporated after a certain version of UNIX?
    
    Also, could you give more info on how we can identify whether we're being
    hit by the lack of this patch when we receive core files?
    
    Thanks,
    
    
 | 
| 8992.12 |  | QUARRY::petert | rigidly defined areas of doubt and uncertainty | Fri Mar 28 1997 15:00 | 11 | 
|  | My knowledge of third is limited.  I know most of what I do from reading
the third(5) man page.  After producing the xxx.third file, set LD_LIBRARY_PATH
to point to the current directory (third should have produced a libc.so.third
file too, unless your application is unshared, which would seem unusual for
a threaded application.) and then run the program.  It will produce a 
xxx.3log file and you can preuse that for information on where you might have
memory problems and uninitialized data, etc...
But if it's the problem with the # of threads, this info may be moot.
PeterT
 | 
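Put together, the third steps from .8 and .12 might be scripted as in this sketch. The atom command line and the .third and .3log file names are taken from the thread; everything else (running from the directory that holds the program, using cat to view the log) is an assumption, and the script does nothing if atom or the program is missing.

```shell
APP="${APP:-lu62_server}"     # program name from .8
INSTRUMENTED="$APP.third"     # produced by atom, per .8
LOG="$APP.3log"               # log file name, per .12

if command -v atom >/dev/null 2>&1 && [ -x "./$APP" ]; then
    # Instrument the program (command as given in .8).
    atom -tool third "$APP"

    # Let the loader find the instrumented libc.so.third in this directory.
    LD_LIBRARY_PATH=".${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH

    # Run the instrumented program, then read the log it leaves behind.
    "./$INSTRUMENTED"
    cat "$LOG"
else
    echo "atom or ./$APP not available; nothing run"
fi
```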
| 8992.13 |  | SMURF::DENHAM | Digital UNIX Kernel | Fri Mar 28 1997 17:15 | 18 | 
|  |     Re. .11.
    
    The problem applies to all releases before V4.0. There is *NO* patch
    for the problem. A test patch was generated for a customer in Hong
    Kong, and I foolishly assumed that this would get turned into
    a V3.2-based patch. Bzzzzt. Wrong. I'm investigating what happened
    there. In the meantime, send mail to [email protected] and
    ask for the test patch. Tell him I sent you. ;^)
    
    Identifying the problem isn't too hard, if you know the application
    well enough to know that it generally uses more than 15-16 threads
    at some point. Basically, the core file is corrupt. It will give
    some tracebacks but the stacks may be pretty bizarre, and it
    won't show all the stacks by a long shot.
    
    If we had the darn patch, this would be academic. You'd apply it and
    then get on with life or eliminate the problem as irrelevant.
    But I'm stating the obvious (out of frustration).
 |
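The rule of thumb in .13 (a release before V4.0 plus more than 15 or so threads) can be written down as a small check. This is only a sketch: core_is_suspect is a hypothetical helper, the release string is assumed to look like "V3.2C", and the cutoff of 15 is an assumption picked from the "15-16" range given in the thread.

```shell
# Return success if a core file from this system is suspect under the
# rule of thumb in .13: release before V4.0 and more than ~15 threads.
# $1 = OS release string (ASSUMPTION: looks like "V3.2C")
# $2 = peak thread count of the application
core_is_suspect() {
    case "$1" in
        V[0-3].*) ;;          # pre-V4.0 release, keep checking
        *) return 1 ;;        # V4.0 or later, or unrecognized format
    esac
    [ "$2" -gt 15 ]           # ASSUMPTION: 15 as the cutoff
}

if core_is_suspect "V3.2C" 20; then
    echo "core file may be corrupt: pre-V4.0 with more than 15 threads"
fi
```

A "not suspect" answer does not prove the core file is good; it only rules out this one cause.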