| Title: | DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1) | 
| Notice: | Welcome to the Digital UNIX Conference | 
| Moderator: | SMURF::DENHAM | 
|  | 
| Created: | Thu Mar 16 1995 | 
| Last Modified: | Fri Jun 06 1997 | 
| Last Successful Update: | Fri Jun 06 1997 | 
| Number of topics: | 10068 | 
| Total number of notes: | 35879 | 
9580.0. "A set of very odd problems" by TAMARA::OSM4S::Neumann (Stan Neumann) Tue Apr 22 1997 15:21
I need some suggestions on what could be causing a customer's
problems, and what I should have them do to diagnose it.
The customer just installed a new system (from the hardware up) because
of problems with the previous system.  They are running UNIX 4.0B,
and the only application running on this system is MailWorks for UNIX.
They have set new-wire-method to zero.
They are seeing a wide variety of problems that don't feel like they
could be caused by the application:
One process reports that it cannot allocate space (in a C++ new
function). Now, they have 500 megabytes of physical memory (so much that
free space typically runs 22K pages), plenty of swap space (2.5
gigabytes, although that is probably moot - there is essentially no
paging).  The process size is under 100 megabytes. maxusers is 2048,
vm-mapentries is 1024, and vm-vpagemax is 131072. The machine typically
runs 500-700 threads total (as determined by the r and w columns of
vmstat - we have two heavily threaded processes). Often after this error
is reported, the pthread_create begins to report that it cannot create a
thread - error 22, which points to the attribute parameter - however,
that parameter is allocated as a static, and has been used to create
hundreds of threads at this point. (Now, I can understand that it is
possible that the application could have stepped on this parameter, but
I don't see how the application could prevent the allocation of memory.)
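One way to narrow down the pthread_create failure is to log the return code at every call site. What follows is a minimal sketch, not MailWorks code: `describe_pthread_error`, `worker`, and `create_checked` are hypothetical names made up for illustration. It relies on the POSIX convention that pthread_create returns the error number directly (22 is EINVAL on most UNIX systems) rather than setting errno.

```cpp
#include <pthread.h>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn a pthread return code into a readable message.
std::string describe_pthread_error(int rc) {
    if (rc == 0) return "ok";
    std::string msg = "pthread error " + std::to_string(rc) +
                      " (" + std::strerror(rc) + ")";
    if (rc == EINVAL)
        msg += ": attribute object rejected as invalid";
    else if (rc == EAGAIN)
        msg += ": insufficient resources to create another thread";
    return msg;
}

// Trivial thread body, only here so the sketch is complete.
extern "C" void* worker(void*) { return nullptr; }

// Create and join one thread with the given attributes.
// Returns the pthread_create code so the caller can log it.
int create_checked(const pthread_attr_t* attr) {
    pthread_t tid;
    int rc = pthread_create(&tid, attr, worker, nullptr);
    if (rc == 0) pthread_join(tid, nullptr);
    return rc;
}
```

Logging the message the first time a non-zero code appears, together with the time of the earlier allocation failure, would show whether the attribute object went bad before or after memory ran short.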
In a separate problem, the routine gethostbyname generated a
segmentation fault three times - is there any way that the application
could have been causing this?  What kinds of conditions could lead to
this? (We've actually seen this on two different customers' systems -
one running 4.0B, and one running 3.2C).  On this customer's system,
they are using a TCP alias for a remote node they access frequently -
could that affect it?
And, to move to a third problem,
we encountered one incident in which the load average on the machine was
140; vmstat showed approximately 140 runnable threads, and the number of
context switches was 22K/second (it is normally 300-500/second). One
process was using about 50% of the CPU (and that number is abnormal for
that process). Now, this may be an application problem, but I could use
some help guessing what could cause this.  The obvious answer is that we
have a lot of threads that wake up, check for the availability of some
resource, then yield when that resource is not available - however, we
don't do yields - besides, wouldn't that yield a much smaller runnable
thread count?  It is quite possible that we would have many threads
blocked waiting on a mutex, but we don't *think* we've seen this kind of
symptom from that condition.  Any other ideas what to look for?
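For comparison, here is a sketch of the blocking pattern that the "wake up, check, give up" guess would be a departure from. `ResourcePool` is a hypothetical class, not taken from MailWorks. Threads waiting in pthread_cond_wait are blocked rather than runnable, so they should not inflate the run queue; a polling loop, or a pthread_cond_broadcast that wakes every waiter for a single freed resource, would.

```cpp
#include <pthread.h>

// Hypothetical counted resource pool guarded by a mutex and a condition variable.
class ResourcePool {
public:
    explicit ResourcePool(int count) : available_(count) {
        pthread_mutex_init(&lock_, nullptr);
        pthread_cond_init(&freed_, nullptr);
    }
    ~ResourcePool() {
        pthread_cond_destroy(&freed_);
        pthread_mutex_destroy(&lock_);
    }
    ResourcePool(const ResourcePool&) = delete;
    ResourcePool& operator=(const ResourcePool&) = delete;

    // Block (not spin) until a resource is free, then take it.
    void acquire() {
        pthread_mutex_lock(&lock_);
        while (available_ == 0)               // loop guards against spurious wakeups
            pthread_cond_wait(&freed_, &lock_);
        --available_;
        pthread_mutex_unlock(&lock_);
    }

    // Return a resource and wake exactly one waiter.
    void release() {
        pthread_mutex_lock(&lock_);
        ++available_;
        pthread_cond_signal(&freed_);
        pthread_mutex_unlock(&lock_);
    }

    // Snapshot of the free count, taken under the lock.
    int available() {
        pthread_mutex_lock(&lock_);
        int n = available_;
        pthread_mutex_unlock(&lock_);
        return n;
    }

private:
    pthread_mutex_t lock_;
    pthread_cond_t freed_;
    int available_;
};
```

If the application's waits already look like this, the runnable count is more likely to come from somewhere other than resource polling.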
And finally, one of the MailWorks processes occasionally generates an
illegal instruction, with a very short stack trace, and what appears to
be a random location on the top of the stack.  We have never seen an
illegal instruction generated before (in spite of problems corrupting
memory).  Is this something the application could be doing?  Could the
application be causing this by stepping on random memory?  (And if so,
is it simply a matter of remarkable luck that we haven't run into it
before?)
Is it possible that the four problems are related in some way?
If no one has any ideas what could be causing these problems, can
anyone suggest tests that we can run or things to look at to narrow
the problem down?
-Stan Neumann
  (MailWorks for UNIX engineering)
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 9580.1 | Memory corruptor and possibly clean-up problems. | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Apr 22 1997 19:19 | 37 | 
|  | .0> One process reports that it cannot allocate space (in a C++ new
.0> function).
This one is probably vm-vpagemax.  (Having threads makes that parameter a
limitation on how much memory you can dynamically allocate; and, remember,
threads' stacks are dynamically allocated.)
.0> Often after this error
.0> is reported, the pthread_create begins to report that it cannot create a
.0> thread - error 22, which points to the attribute parameter - however,
.0> that parameter is allocated as a static, and has been used to create
.0> hundreds of threads at this point.
This one's probably a random memory corruptor, likely resulting from improperly
handling the allocation failure.  (You aren't using the System V environment,
are you?)
.0> the number of context switches was 22K/second (it is normally 
.0> 300-500/second)
This could result from a memory corruption as well (as could the gethostbyname()
problems).  If you nail the thread library's internal synchronization, this is
exactly what would happen.
.0> the MailWorks processes occasionally generates an illegal instruction
This one is almost certainly a memory corruption -- of the saved PC of a thread,
presumably on its stack, somewhere.
I suggest you check your code for use of uninitialized local pointer variables
(as well as examining what your code does when dynamic allocations fail --
conceivably that's the gethostbyname() problem, depending on whether it
allocates dynamic memory for threads and whether it deals with failures).
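A minimal sketch of the kind of allocation-failure handling to look for. It assumes the code can use the standard non-throwing form of new; `make_buffer` and `copy_message` are hypothetical names, not anything from the application. The point is that a failed allocation is reported to the caller and nothing is ever written through the null pointer.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <utility>

// Hypothetical helper: returns an empty pointer on failure instead of throwing.
std::unique_ptr<char[]> make_buffer(std::size_t bytes) {
    return std::unique_ptr<char[]>(new (std::nothrow) char[bytes]);
}

// Hypothetical caller: copies a message into a fresh buffer.
// Returns false, leaving `out` untouched, if the allocation fails.
bool copy_message(const char* text, std::unique_ptr<char[]>& out) {
    const std::size_t len = std::strlen(text) + 1;   // include the terminator
    std::unique_ptr<char[]> buf = make_buffer(len);
    if (!buf) return false;                          // never write through null
    std::memcpy(buf.get(), text, len);
    out = std::move(buf);
    return true;
}
```

Any call site that skips the equivalent of the `if (!buf)` check is a candidate for the corruption described above.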
				Webb
 | 
| 9580.2 |  | KITCHE::schott | Eric R. Schott USG Product Management | Tue Apr 22 1997 20:02 | 6 | 
|  | Having some sys_check output might also help...What is maxusers
set to?
  http://www-unix.zk3.dec.com/tuning/tools/sys_check/sys_check.html
 | 
| 9580.3 | RE: 9580.2 | TAMARA::NEUMAN::Neumann | Stan Neumann | Wed Apr 23 1997 10:09 | 14 | 
|  | > maxusers is 2048, vm-mapentries is 1024, and vm-vpagemax is 131072. 
I'm rather dubious about blaming vm-vpagemax - for this customer,
the process in question was only around 100 megabytes in virtual
size.  We've had another customer grow that image to 1.2 gigabytes
with the same vm-vpagemax.  (Now, that customer was running
on UNIX 3.2C - has something changed with 4.0 that would
require a higher vm-vpagemax?)
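For scale, a back-of-envelope conversion. It rests on two assumptions that this thread does not confirm: 8 KB pages (the Alpha page size) and vm-vpagemax being a plain count of pages. The names `kPageBytes`, `pages_to_bytes`, and `bytes_to_pages` are made up for the sketch.

```cpp
#include <cstdint>

// Assumption: 8 KB pages (Alpha). Not stated anywhere in this thread.
constexpr std::uint64_t kPageBytes = 8 * 1024;

// Convert a page count to bytes.
constexpr std::uint64_t pages_to_bytes(std::uint64_t pages) {
    return pages * kPageBytes;
}

// Convert a byte count to pages, rounding up to a whole page.
constexpr std::uint64_t bytes_to_pages(std::uint64_t bytes) {
    return (bytes + kPageBytes - 1) / kPageBytes;
}
```

Under those assumptions, `pages_to_bytes(131072)` is 1 GiB and a 100 MB image is 12800 pages, roughly a tenth of the limit. A 1.2 gigabyte image would not fit under a simple size cap of that kind, so the parameter presumably counts something narrower than total image size.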
On the illegal instruction - is it sheer unadulterated luck
that we've never seen this one before?  (I know we've
been guilty of random memory corruption before.)
-Stan
 | 
| 9580.4 | Luck, timing, whatever...  :-) | WTFN::SCALES | Despair is appropriate and inevitable. | Wed Apr 23 1997 11:46 | 22 | 
|  | .3> We've had another customer grow that image to 1.2 gigabytes with the same
.3> vm-vpagemax.
It depends on -how- the image was grown...
.3> that customer was running on UNIX 3.2C - has something changed with 4.0
.3> that would require a higher vm-vpagemax?
We have had some feedback that large threads applications on V4 require a
higher vm-vpagemax than on V3, but I don't think we've been able to confirm
or explain it.
.3> On the illegal instruction - is it sheer unadulterated luck that we've
.3> never seen this one before?
Prob'ly.  In order to get this, you've got to corrupt a saved PC.  They are
either in specific places on the stack or in a spot in the thread's control
block.  I guess the saved FP/SP usually gets nailed first, which results in a
SEGV before the bad PC gets executed....
					Webb
 | 
| 9580.5 | RE: 9580.1 | TAMARA::NEUMAN::Neumann | Stan Neumann | Fri Apr 25 1997 11:47 | 18 | 
|  | > .0> One process reports that it cannot allocate space (in a C++ new
> .0> function).
> 
> This one is probably vm-vpagemax.  (Having threads makes that parameter a
> limitation on how much memory you can dynamically allocate; and, remember,
> threads' stacks are dynamically allocated.)
We had vpagemax set to 131072, we tried to bump it to
262000 (based on the comment in .4), but that didn't
seem to "take": sysconfig -q vm still shows it at
131072 - is that the maximum?
Given that, I'm still struggling to explain the inability
to allocate space (to recap, they've got lots of real memory,
lots of swap, and the image size is 100 meg, well below
any limits).  Maxusers is 2048, and vm-mapentries is 1024.
-Stan
 |