UMBC High Performance Computing Facility
Scheduling policy on tara
Introduction
This page documents the scheduling policy implementing the
current usage policy.
This policy is different from past policies in some significant ways,
including the incorporation of SLURM concepts like QOS and
fair-share priority, but we believe that it will be natural to use.
It is designed to concretely implement the philosophical underpinnings
of the usage policy.
The scheduling policy and
usage policy are designed together
to help support the productivity goals of the cluster, including:
Throughput - handle as many jobs as possible from our users.
Utilization - don't leave processors idling if work is available.
Responsiveness - if you submit a job that will take X hours to run,
ideally it shouldn't have to wait more than X hours to start.
Give priority to faculty who have contributed to HPCF, but support
the work of community users as much as possible.
All users are expected to read this page and the
usage policy
carefully; by using the cluster, you agree to abide by the policies stated
in these pages.
Fundamentals
A partition represents a group of nodes in the cluster.
There are two partitions:
Partition | Description | Walltime limits
develop | There are two nodes in the develop partition, n1 and n2. This partition is dedicated to code under development. Jobs of up to 16 cores may be tested, but run time is supposed to be negligible. | 5 min default, 30 min max
batch | The majority of the compute nodes on tara are allocated to this partition. There are 82 nodes: n3, ..., n84. Jobs running on these nodes are considered "production" runs; users should have a high degree of confidence that bugs have been worked out. | ---
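In the submission script, the partition is selected with the --partition
directive. For instance, a quick test on the develop partition (subject
to its 30 minute maximum) might use directives along the following lines;
this is a minimal sketch with placeholder values:

# Hypothetical test run: 1 node on the develop partition for at most 10 minutes.
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --time=00:10:00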
When a user runs a job, time is charged to their PI group for
accounting purposes, as explained in the
usage policy.
The historical accounting data is used in scheduling to influence
priority, which determines which job will run next when there are
multiple queued jobs. Note that priority does not affect jobs which are
already running. On tara we have implemented a fair-share policy.
PIs who have used more than their allocation in their recent history will
have a reduced priority, while PIs who have used less than their allocation
will have an increased priority.
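If you are curious how fair-share is affecting your jobs, SLURM's sprio
and sshare utilities can display the priority factors of pending jobs and
the accumulated usage by account. This is a sketch; whether these commands
are enabled for regular users on tara is an assumption:

sprio -l     # per-job breakdown of priority factors, including fair-share
sshare -a    # shares and recent usage by account and user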
Following the terminology of the SLURM scheduler,
queues are referred to as QOS's,
short for Quality of Service. Several QOS's are
available, which are designed to handle different kinds of jobs.
Every job is run under a particular QOS, chosen by the user in the
submission script, and is subject to that QOS's resource limitations.
The following bullets introduce the QOS's from a motivational viewpoint
(an example submission script follows the list, and the specific limits
are given in the table further below):
short -
Designed for very short jobs, which may require many nodes but will
not take very long - on the order of several minutes to a lunch break.
normal (default) -
Designed for average length jobs, which may require
a significant number of nodes.
We consider average length to be on the order of a lunch break to half a
workday. This is the default QOS if you do not specify one
explicitly in your sbatch job submission script.
medium -
Designed for medium length jobs, which we consider to be on the order
of half a workday up to an overnight run, but which require only a few nodes.
long -
Designed for long jobs, which we consider to be on the order
of overnight to several days. Any user (community or contribution) may use
this QOS, but only 2 nodes can be in use by any one user at a time.
long_contrib -
This QOS is similar to the long QOS, but access is limited to
contribution PI groups. There is no limit to the number of nodes in use by any
single user at a time.
Conflicts in usage are expected to be infrequent, and
will be resolved between the affected PIs and the HPCF Point of Contact.
support -
The support QOS is designed for critical jobs run by HPCF support personnel.
It has minimal restrictions (time limits, node limits), and the highest
possible priority. It is intended for special circumstances and not for
everyday use.
To use the support QOS, you must have access to the support account,
and must specify --account=support in your batch script.
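As an example, a submission script for a production run under the medium
QOS might contain directives along the following lines. This is a minimal
sketch: the job name, node count, walltime, account name pi_name, and
executable are all placeholders; see the how to run page for complete scripts.

#!/bin/bash
# Hypothetical production job: 2 nodes for up to 12 hours under the medium QOS.
#SBATCH --job-name=myjob
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=medium
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=12:00:00
#SBATCH --account=pi_name

srun ./my_program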
The specific definitions of the QOS's are given in the following table:
QOS | Wall time limit per job | CPU time limit per job | Total node limit for the QOS | Node limit per user
short | 1 hour | 512 hours | --- | ---
normal (default) | 4 hours | 512 hours | --- | ---
medium | 24 hours | --- | 32 | ---
long | 5 days | --- | 32 | 2
long_contrib | 5 days | --- | 32 | ---
support | 5 days | --- | --- | ---
Note (6/1/2011): At this time, the "2 node limit per user" in the long QOS is not in effect.
This constraint will be implemented as soon as possible,
but for now the limits for the long QOS
are "4 jobs per user" and "2 nodes per job".
where
CPU time limit is the maximum CPU time allowed for a single job,
measured as the product of walltime and number of cores requested by
the job. Here, "CPU time" is actually a misnomer; please note the
definition given in the previous sentence; we use the term following
the SLURM documentation.
Total node limit is the number of nodes that may be in use at any
given time, across all jobs in the given QOS.
Node limit per user is the number of nodes that may be in use at any
given time by a particular user in the given QOS.
Notice that 2 nodes contain a total of 16 cores on tara at present;
thus this limit permits the running of several jobs,
as long as their total number of cores adds up to no more than 16.
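The limits as actually configured in SLURM can be listed with the
sacctmgr utility, assuming it is accessible to regular users on tara:

sacctmgr show qos    # list the QOS definitions and their limits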
A general guideline is to choose the QOS with the minimum resources
needed to run your job. This is considered good user behavior, which
responsible users should follow. But there is also a direct benefit to
the user: backfilling. Backfilling is a feature of the scheduler that
allows your job to run ahead of higher priority jobs if your job's
estimated run time is shorter than the estimated wait time for those
jobs to start. A very responsible user (or one who really wishes to
take advantage of backfilling) can set specific walltime and memory
limits for their job, based on their estimates, as in the sketch below.
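The sketch below shows how such estimates might be expressed in the
submission script; the values are placeholders, and --mem-per-cpu is
interpreted in MB:

# Hypothetical estimates: at most 2.5 hours of walltime and 512 MB of memory per core.
#SBATCH --time=02:30:00
#SBATCH --mem-per-cpu=512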
Note that QOS's work the same way across both the develop and batch
partitions. QOS limits (e.g. number of nodes and walltime) are applied in
conjunction with any partition limits. In the case of the develop partition
we suggest setting the QOS to the default "normal", or simply leaving it blank.
The CPU time limit per job enforced in some of the QOS's constrains
the combinations of node count and wall time that a job can request,
so that the total resource use in terms of CPU time is limited.
For this purpose, it is considered equivalent to use
64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, etc.
Thus, with 8 cores per node, the CPU time limit
(walltime times number of cores) is 64 nodes times 8 cores times 1 hour,
equal to 512 hours of CPU time.
For demonstration, the following table lists some sample
combinations of number of nodes and time limit per job,
all of which are equivalent to 512 hours of CPU time.
Number of nodes | Cores per node | Total number of cores | Wall time (hours) | CPU time (hours)
64 | 8 | 512 | 1 | 512
32 | 8 | 256 | 2 | 512
16 | 8 | 128 | 4 | 512
8 | 8 | 64 | 8 | 512
4 | 8 | 32 | 16 | 512
2 | 8 | 16 | 32 | 512
1 | 8 | 8 | 64 | 512
1 | 4 | 4 | 128 | 512
1 | 2 | 2 | 256 | 512
1 | 1 | 1 | 512 | 512
Here, the number of nodes in the first column and the
cores per node in the second column are multiplied to give
the total numbers of cores in the third column;
this is the quantity that enters into the SLURM definition
of CPU time.
Thus, the fourth column shows which wall times in hours
yield 512 hours of CPU time in the fifth column.
The numbers in this table also explain how the wall time limits
for the QOS's short, normal, and medium were chosen,
namely to accommodate the equivalence of the job that uses
64 nodes for 1 hour.
Specifically, the wall time limit of each QOS together with the
CPU time limit per job limits each job to
64 nodes in the short QOS, to 16 nodes in the normal QOS, and
to 4 nodes in the medium QOS.
These choices ensure that only short (1 hour) jobs can
use many nodes in the cluster, while only jobs
with relatively few nodes can take a long time (16 hours).
Notice in any case that the wall time limits of a QOS
also apply, for instance, 5 days for the long QOS;
this is 120 hours and thus jobs with the parameters of the last three rows of
the above table would not complete even in the long QOS.
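After a job completes, the CPU time it was charged can be checked with
SLURM's sacct utility. This is a sketch; <jobid> is a placeholder, and
the available fields depend on the accounting configuration:

sacct -j <jobid> --format=JobID,Elapsed,NCPUS,CPUTime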
How to submit jobs
Use of the batch system is discussed in detail on the
how to run page.
Failure modes
In this section we list some commonly encountered failure modes and
the kind of behavior you will observe when experiencing them.
A community user tries to run in the long_contrib queue.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job has invalid qos
[araim1@tara-fe1 ~]$
Additionally, no slurm.out or slurm.err output is generated
You attempt to use more than two nodes in the long QOS
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
You attempt to use more than 30 nodes total in the long_contrib QOS
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
You attempt to use 30 nodes total in the long_contrib QOS, but not
all nodes are available at the time of submission
Suppose we first submitted a 2 node job, and then a 30 node job
[araim1@tara-fe1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
4278 batch users01 araim1 PD 0:00 30 normal (AssociationResourceLimit)
4277 batch users01 araim1 R 2:54 2 normal n[7-8]
[araim1@tara-fe1 ~]$
The 30 node job remains queued with reason "AssociationResourceLimit", until
all 30 nodes of long_contrib become available.
Your job reaches a maximum walltime limit.
The job is killed with a message in the stderr output
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n1]: error: *** JOB 59545 CANCELLED AT 2011-05-20T08:10:52 DUE TO TIME
LIMIT ***
[araim1@tara-fe1 ~]$
Your job violates the maximum CPU time limit of 512 hours.
The job is killed with a message in the stderr output
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n3]: *** JOB 4254 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
slurmd[n3]: *** STEP 4254.0 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
[araim1@tara-fe1 ~]$
Run a job with a walltime limit too high for the QOS / partition.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
Try to charge time to a PI who doesn't exist, or who you don't work for.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Invalid account specified
[araim1@tara-fe1 ~]$
Try to use a QOS that doesn't exist.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job has invalid qos
[araim1@tara-fe1 ~]$
Try to use a partition that doesn't exist.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Invalid partition name specified
[araim1@tara-fe1 ~]$
Try to use more than 8 processes per node.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
[araim1@tara-fe1 ~]$
Try to use more nodes than available in a partition.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Node count specification invalid
[araim1@tara-fe1 ~]$
Invalid syntax in SLURM batch script
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: unrecognized option `--ndoes=2'
sbatch: error: Try "sbatch --help" for more information
[araim1@tara-fe1 ~]$
Memory limit exceeded by your program
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n1]: error: Job 60204 exceeded 10240 KB memory limit, being killed
slurmd[n1]: error: *** JOB 60204 CANCELLED AT 2011-05-27T19:34:34 ***
[araim1@tara-fe1 ~]$
You've set the memory limit too high for the available memory
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
[araim1@tara-fe1 ~]$
You try to use more than 30 minutes of walltime in the develop partition,
by setting the --time flag
The job will be stuck in the pending state, with reason "PartitionTimeLimit"
[araim1@tara-fe1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
62280 develop SNOW araim1 PD 0:00 1 normal (PartitionTimeLimit)
[araim1@tara-fe1 ~]$