| Title: | SCHEDULER |
| Notice: | Welcome to the Scheduler Conference on node HUMANE |
| Moderator: | RUMOR::FALEK |
| Created: | Sat Mar 20 1993 |
| Last Modified: | Tue Jun 03 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 1240 |
| Total number of notes: | 5017 |
We've got a large Scheduler customer here with a weird problem.
They have a mixed-architecture cluster of three VAXes and three Alphas
(the Alphas are all 8400s) running various versions of VMS 6.2.
This customer currently has 2.1B-1 running on all six nodes and they
have load balancing enabled. They have 2000 (yes, two thousand!)
Scheduler jobs, all of which are batch jobs. Since the weekend, when
they rebooted several of the systems for regular maintenance, they
have been unable to get Scheduler working.
All the NSCHED processes run, but they seem to get "stuck" and stop
processing anything. Even with debug turned on, nothing is reported
to the log file initially. Killing the default NSCHED process gets the
new default node to execute some more jobs, and then it too gets stuck
and the cycle repeats. It looks as if Scheduler is having problems
communicating with the other nodes in its own cluster.
To make matters worse, VSS.DAT ended up corrupted and was recreated
from a backup (with a new DEPENDENCY.DAT). The additional problem is
that Scheduler seems unable to cope with loading 2000 new jobs; it
consistently gets stuck loading the 1000th job, although no process
quotas are being exhausted.
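(In case it helps, the quotas can be checked with something like this - the PID below is made up; use whatever SHOW SYSTEM reports for NSCHED.)

```
$ SHOW SYSTEM                        ! find NSCHED's PID and current state
$ SHOW PROCESS/QUOTAS/ID=2040011A    ! made-up PID - see how much ENQLM/ASTLM/BYTLM is left
```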
Anyone got any suggestions?
Tonight we're:
1. Loading 2.1B-9
2. Performing a cluster reboot
3. Rebuilding the database but only loading 20 jobs - if that works,
this will be increased to 500 jobs.
Any other suggestions would be appreciated!!
Thanks,
Tony Parsons,
Sydney CSC
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1179.1 | exactly 1000 jobs ? | RUMOR::FALEK | ex-TU58 King | Tue Nov 05 1996 13:40 | 30 |
Did this used to work with this many batch-mode jobs submitted to the queues
simultaneously, and has it only broken now that you've rebooted some nodes?
If that's the case, I'd suspect the ENQLM quota for the NSCHED process.
Make it really big (it doesn't cost anything extra if it's not used, since
it's just a limit). Also BYTLM, though that's less likely, and ASTLM.
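Depending on how the site starts Scheduler, the NSCHED detached process usually picks up its quotas either from the startup procedure or from the UAF record of the account it runs under. If it's the UAF, a rough sketch would be the following - the account name and the values here are placeholders only, so check which username actually owns the NSCHED process first:

```
$! NSCHED$ACCOUNT and the quota values below are placeholders only.
$ SET DEFAULT SYS$SYSTEM
$ RUN AUTHORIZE
UAF> MODIFY NSCHED$ACCOUNT /ENQLM=4000 /ASTLM=2000 /BYTLM=200000
UAF> EXIT
$! The new quotas only take effect when the NSCHED process is re-created.
```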
Is it consistently hanging at exactly 1000 pending jobs submitted to
the batch queues? That is VERY suspicious. I wonder if a resource
name used for locks associated with batch-mode jobs has a length limit
(3 ASCII characters for the job number as part of the name?) based on the
bad assumption that there would never be more than 999 jobs simultaneously
in the queues.
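One way to test that theory would be to look at the actual lock resource names NSCHED is using and see how the job number is encoded in them. A rough SDA sketch - the PID is made up, and the idea that the job number is visible in the resource name is only my guess:

```
$ ANALYZE/SYSTEM
SDA> SET PROCESS/ID=2040011A      ! made-up PID of a stuck NSCHED
SDA> SHOW PROCESS/LOCKS           ! locks this process holds, with their resource names
SDA> SHOW RESOURCE/ALL            ! all resources - look for names that change around job 1000
SDA> EXIT
```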
What state is the NSCHED process in on the "default" node? Does it
have any ASTs pending?
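Something along these lines should answer both questions (again, the PID is just an example):

```
$ SHOW SYSTEM                     ! NSCHED's state: HIB or LEF is normal, MWAIT/RWxxx suggests a resource or quota wait
$ ANALYZE/SYSTEM
SDA> SET PROCESS/ID=2040011A      ! made-up PID of the default node's NSCHED
SDA> SHOW PROCESS                 ! the PCB display includes the process state and AST information
SDA> EXIT
```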
Are all the NSCHEDs hung? For example, if you have a detached-mode job
that is restricted to a node other than the default node, can you manually
do $ SCHED RUN job, and does it start?
I would guess that the problem is not due to the mixed-architecture
nature of the cluster. If it always seems to get into
trouble submitting job number 1000 to the queues, it could be
a problem (a bug) with lock resource names; but if it used to work with
2000 jobs and only broke now on the cluster reboot, it is probably
a quota-related problem and not a resource-name-length bug.
I don't have access to the source code, so I can't check.