| Title: | NAS Message Queuing Bus |
| Notice: | KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10 |
| Moderator: | PAMSRC::MARCUS EN |
| Created: | Wed Feb 27 1991 |
| Last Modified: | Thu Jun 05 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 2898 |
| Total number of notes: | 12363 |
We have run into a weird problem at my customer's site.
Setup:
local group: # 1104, v3.2A ECO 1, on HP-UX 10.20 with ServiceGuard
this group is the initiator for all xgroup connects,
and has xgroup verification turned ON (the group is
using a ServiceGuard "floating IP")
remote groups: 5 groups running v3.0C on Solaris - 700, 710, ..., 740
(these 5 groups have verification turned on)
bunch of other groups running v3.0B on HP
one other group running v3.0A on OpenVMS/Alpha
Problem: All 5 7xx groups have "ld, link receiver for group 710 from
group 1104 is running" followed immediately by "ld, caught
signal 11" for that pid.
After this, we start getting log file entries about a duplicate
link receiver for group 1104. However, the processes whose
pids are shown in the dmqmonc link detail screen for any of
these links do not exist (obvious, once you find the
"caught signal 11" entry in the log).
The group 1104 log shows repeated "ld, operation failed to complete"
entries for ALL link senders EXCEPT the 7xx group senders, for
which the only log file entries are the initial "sender
running" entries. (Note: the causes of the other links
not coming up were later diagnosed and temporarily worked
around by turning xgroup verification off.)
The problem can only be cleared by dropping all 5 7xx groups and
their attached production server applications. Because of
this, we will be getting an urgent demand for a root cause
analysis of how this condition occurred, since it caused an
unscheduled production outage in a non-stop application.
Any clues as to why the signal 11 happened on all 5 groups?
Is this possibly due to some known problem with v3.0B/C? If so, is this
fixed by v3.2A ECO 1?
I could not find any mention in the v3.1 or v3.2x release notes.
Is there some way to recover this without bouncing the group?
(log files group1104.log and group710.log are at WHOS01""::)
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 2793.1 | | XHOST::SJZ | Kick Butt In Your Face Messaging ! | Sat Mar 08 1997 14:00 | 15 |
try turning off xgroup verification. make certain the
configurations have adequate min and max group numbers.
try setting the min group number to 1.
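just to illustrate the min/max point - a rough sketch, not actual DmQ
code, assuming the group lookup only covers the configured group-number
range: a remote group whose number falls outside that range will never
be found in the local address database, no matter what the xgroup table
says.

```c
/* Illustration only, not DmQ source.  Assumes the address-database
 * lookup rejects any group number outside the configured [min, max]
 * range before it even scans the table. */
#include <stdio.h>

struct group_entry { int number; const char *host; };

static const struct group_entry table[] = {
    { 700,  "sol700" },
    { 710,  "sol710" },
    { 1104, "hpprod" },   /* the ServiceGuard group */
};

static const struct group_entry *lookup(int group, int min, int max)
{
    if (group < min || group > max)     /* outside configured range */
        return NULL;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].number == group)
            return &table[i];
    return NULL;
}

int main(void)
{
    /* with max set too low, group 1104 is "not found" even though
     * it is listed in the table */
    printf("max=999:  %s\n", lookup(1104, 1, 999)  ? "found" : "NOT FOUND");
    printf("max=2000: %s\n", lookup(1104, 1, 2000) ? "found" : "NOT FOUND");
    return 0;
}
```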
as for our "getting an urgent demand for a root cause
analysis", they can demand whatever they want.
and though i am not the problem management person, the
response should be to upgrade to V3.2A ECO 1 all around.
V3.0C is not even a real release and is fully two versions
back (3 if you consider V3.2A to be a major release, which
it really was even though we called it a MUP).
_sjz.
| 2793.2 | link driver questions | WHOS01::ELKIND | Steve Elkind, Digital SI @WHO | Wed Mar 12 1997 18:03 | 29 |
We were able to get this same situation to occur regularly on a
cross-group link. By turning xgroup verify off, we found that there
was a name mismatch - "xyz.foo.bar.com" in the DmQ v3.0C
(non-initiator) side's xgroup table and DNS, and "xyz" in the v3.2A
(initiator) side's xgroup entry for itself. On the 3.0C side we were
getting "host xyz port xxxxx not found in local address database",
which we were not getting before. Removing the domain name from the
v3.0C table entry seemed to fix the problem.
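To make the mismatch concrete - this is only an illustration of the
kind of literal name comparison I am assuming the verification step
does, not actual DmQ source: a strict string compare of the name the
initiator presents against the name in the local table fails as soon
as one side is fully qualified and the other is not, even though both
names resolve to the same host.

```c
/* Rough sketch, assuming a literal string match between the host name
 * the initiator presents and the name stored in the local xgroup
 * (address database) entry.  "xyz" vs. "xyz.foo.bar.com" then fails
 * unless the comparison stops at the first label. */
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Compare only the first label, so "xyz" matches "xyz.foo.bar.com". */
static int short_name_match(const char *a, const char *b)
{
    size_t la = strcspn(a, ".");
    size_t lb = strcspn(b, ".");
    return la == lb && strncasecmp(a, b, la) == 0;
}

int main(void)
{
    const char *presented = "xyz";              /* from the v3.2A initiator  */
    const char *table     = "xyz.foo.bar.com";  /* in the v3.0C xgroup table */

    printf("literal match:     %s\n",
           strcasecmp(presented, table) == 0 ? "ok" : "NOT FOUND");
    printf("first-label match: %s\n",
           short_name_match(presented, table) ? "ok" : "NOT FOUND");
    return 0;
}
```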
However, other users have said "why now? we've been working
successfully up to now as things were". I tried to recreate
the problem using various combinations of 3.0C, 3.0B, 3.2A (non-ECO1
and ECO1), host names unqualified/qualified, floating IPs and
non-floating IPs, etc., and could not recreate it (although I
am running the non-initiator end on HP-UX instead of Solaris). Is it
possible we are the victims of some sort of race condition in the 3.0C
link driver?
I also tried to recreate the dmqld segfault condition reported as fixed in
the 3.2A ECO1 release notes, with no success. I have noticed on the 3.2A
non-ECO1 non-initiator side that the symptoms of an unconfigured group
connection cycled back and forth on every other connect attempt: one
set just contained "lost connection" and "exiting" messages, while
the other set also contained "unconfigured connection" (or
something like that, the log is elsewhere now) and "spi, semaphore
operation failed" as well. Was the problem that was fixed dependent on
particular conditions?
Thanks.