UMBC High Performance Computing Facility
Scheduling policy on tara
Introduction
This page documents the scheduling policy implementing the
current usage policy.
This policy is different from past policies in some significant ways,
including the incorporation of SLURM concepts like QOS and
fair-share priority, but we believe that it will be natural to use.
It is designed to concretely implement the philosophical underpinnings
of the usage policy.
The scheduling policy and
usage policy are designed together
to help support the productivity goals of the cluster, including:
Throughput - handle as many jobs as possible from our users.
Utilization - don't leave processors idling if work is available.
Responsiveness - if you submit a job that will take X hours to run,
ideally it shouldn't have to wait more than X hours to start.
Give priority to faculty who have contributed to HPCF, but support
the work of community users as much as possible.
All users are expected to read this page and the
usage policy
carefully; by using the cluster, you agree to abide by the policies stated
in these pages.
Fundamentals
A partition represents a group of nodes in the cluster.
There are two partitions:
Partition | Description | Walltime limits
develop | There are two nodes in the develop partition, n1 and n2. This partition is dedicated to code under development. Jobs of up to 16 cores may be tested, but run time is supposed to be negligible. | 5 min default, 30 min max
batch | The majority of the compute nodes on tara are allocated to this partition. There are 82 nodes: n3, ..., n84. Jobs running on these nodes are considered "production" runs; users should have a high degree of confidence that bugs have been worked out. | ---
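In the submission script, the partition is selected with the --partition
directive. For instance, a quick test on the develop partition (subject
to its 30 minute maximum) might use directives along the following lines;
this is a minimal sketch with placeholder values:

# Hypothetical test run: 1 node on the develop partition for at most 10 minutes.
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --time=00:10:00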
When a user runs a job, time is charged to their PI group for
accounting purposes, as explained in the
usage policy.
The historical accounting data is used in scheduling to influence
priority, which determines which job will run next when there are
multiple queued jobs. Note that priority does not affect jobs which are
already running. On tara we have implemented a fair-share policy.
PIs who have used more than their allocation in their recent history will
have a reduced priority, while PIs who have used less than their allocation
will have an increased priority.
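If you are curious how fair-share is affecting your jobs, SLURM's sprio
and sshare utilities can display the priority factors of pending jobs and
the accumulated usage by account. This is a sketch; whether these commands
are enabled for regular users on tara is an assumption:

sprio -l     # per-job breakdown of priority factors, including fair-share
sshare -a    # shares and recent usage by account and user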
Following the terminology of the SLURM scheduler,
queues are referred to as QOS's,
short for Quality of Service. Several QOS's are
available, which are designed to handle different kinds of jobs.
Every job is run under a particular QOS, chosen by the user in the
submission script, and is subject to that QOS's resource limitations.
The following bullets introduce the QOS's from a motivational viewpoint
(an example submission script follows the list, and the specific limits
are given in the table further below):
short -
Designed for very short jobs, which may require many nodes but will
not take very long - on the order of several minutes to a lunch break.
normal (default) -
Designed for average length jobs, which may require
a significant number of nodes.
We consider average length to be on the order of a lunch break to half a
workday. This is the default QOS if you do not specify one
explicitly in your sbatch job submission script.
medium -
Designed for medium length jobs, which we consider to be on the order
of half a workday up to an overnight run, but which require only a few nodes.
long -
Designed for long jobs, which we consider to be on the order
of overnight to several days. Any user (community or contribution) may use
this QOS, but only 2 nodes can be in use by any one user at a time.
long_contrib -
This QOS is similar to the long QOS, but access is limited to
contribution PI groups. There is no limit to the number of nodes in use by any
single user at a time.
Conflicts in usage are expected to be infrequent, and
will be resolved between the affected PIs and the HPCF Point of Contact.
support -
The support QOS is designed for critical jobs run by HPCF support personnel.
It has minimal restrictions (time limits, node limits), and the highest
possible priority. It is intended for special circumstances and not for
everyday use.
To use the support QOS, you must have access to the support account,
and must specify --account=support in your batch script.
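As an example, a submission script for a production run under the medium
QOS might contain directives along the following lines. This is a minimal
sketch: the job name, node count, walltime, account name pi_name, and
executable are all placeholders; see the how to run page for complete scripts.

#!/bin/bash
# Hypothetical production job: 2 nodes for up to 12 hours under the medium QOS.
#SBATCH --job-name=myjob
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=medium
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=12:00:00
#SBATCH --account=pi_name

srun ./my_program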
The specific definitions of the QOS's are given in the following table:
QOS | Wall time limit per job | CPU time limit per job | Total node limit for the QOS | Node limit per user
short | 1 hour | 512 hours | --- | ---
normal (default) | 4 hours | 512 hours | --- | ---
medium | 24 hours | --- | 32 | ---
long | 5 days | --- | 32 | 2
long_contrib | 5 days | --- | 32 | ---
support | 5 days | --- | --- | ---
Note (6/1/2011): At this time, the "2 node limit per user" in the long QOS is not in effect.
This constraint will be implemented as soon as possible,
but for now the limits for the long QOS
are "4 jobs per user" and "2 nodes per job".
where
CPU time limit is the maximum CPU time allowed for a single job,
measured as the product of walltime and number of cores requested by
the job. Here, "CPU time" is actually a misnomer; please note the
definition given in the previous sentence; we use the term following
the SLURM documentation.
Total node limit is the number of nodes that may be in use at any
given time, across all jobs in the given QOS.
Node limit per user is the number of nodes that may be in use at any
given time by a particular user in the given QOS.
Notice that 2 nodes contain a total of 16 cores on tara at present;
thus this limit permits the running of several jobs,
as long as their total number of cores adds up to no more than 16.
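The limits as actually configured in SLURM can be listed with the
sacctmgr utility, assuming it is accessible to regular users on tara:

sacctmgr show qos    # list the QOS definitions and their limits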
A general guideline is to choose the QOS with the minimum resources
needed to run your job. This is considered good user behavior, which
responsible users should follow. But there is also a direct benefit to
the user: backfilling. Backfilling is a feature of the scheduler that
allows your job to run ahead of higher priority jobs if your job's
estimated run time is shorter than the estimated wait time for those
jobs to start. A very responsible user (or one who really wishes to
take advantage of backfilling) can set specific walltime and memory
limits for their job, based on their estimates, as in the sketch below.
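The sketch below shows how such estimates might be expressed in the
submission script; the values are placeholders, and --mem-per-cpu is
interpreted in MB:

# Hypothetical estimates: at most 2.5 hours of walltime and 512 MB of memory per core.
#SBATCH --time=02:30:00
#SBATCH --mem-per-cpu=512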
Note that QOS's work the same way across both the develop and batch
partitions. QOS limits (e.g. number of nodes and walltime) are applied in
conjunction with any partition limits. In the case of the develop partition
we suggest setting the QOS to the default "normal", or simply leaving it blank.
The CPU time limit per job enforced in some of the QOS's constrains
the combinations of node count and wall time that a job can request,
so that the total resource use in terms of CPU time is limited.
For this purpose, it is considered equivalent to use
64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, etc.
Thus, with 8 cores per node, the CPU time limit
(walltime times number of cores) is 64 nodes times 8 cores times 1 hour,
equal to 512 hours of CPU time.
For demonstration, the following table lists some sample
combinations of number of nodes and time limit per job,
all of which are equivalent to 512 hours of CPU time.
Number of nodes | Cores per node | Total number of cores | Wall time (hours) | CPU time (hours)
64 | 8 | 512 | 1 | 512
32 | 8 | 256 | 2 | 512
16 | 8 | 128 | 4 | 512
8 | 8 | 64 | 8 | 512
4 | 8 | 32 | 16 | 512
2 | 8 | 16 | 32 | 512
1 | 8 | 8 | 64 | 512
1 | 4 | 4 | 128 | 512
1 | 2 | 2 | 256 | 512
1 | 1 | 1 | 512 | 512
Here, the number of nodes in the first column and the
cores per node in the second column are multiplied to give
the total numbers of cores in the third column;
this is the quantity that enters into the SLURM definition
of CPU time.
Thus, the fourth column shows which wall times in hours
yield 512 hours of CPU time in the fifth column.
The numbers in this table also explain how the wall time limits
for the QOS's short, normal, and medium were chosen,
namely to accommodate the equivalence of the job that uses
64 nodes for 1 hour.
Specifically, the wall time limit of each QOS together with the
CPU time limit per job limits each job to
64 nodes in the short QOS, to 16 nodes in the normal QOS, and
to 4 nodes in the medium QOS.
These choices ensure that only short (1 hour) jobs can
use many nodes in the cluster, while only jobs
with relatively few nodes can take a long time (16 hours).
Notice in any case that the wall time limits of a QOS
also apply, for instance, 5 days for the long QOS;
this is 120 hours and thus jobs with the parameters of the last three rows of
the above table would not complete even in the long QOS.
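After a job completes, the CPU time it was charged can be checked with
SLURM's sacct utility. This is a sketch; <jobid> is a placeholder, and
the available fields depend on the accounting configuration:

sacct -j <jobid> --format=JobID,Elapsed,NCPUS,CPUTime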
How to submit jobs
Use of the batch system is discussed in detail on the
how to run page.
Failure modes
In this section we list some commonly encountered failure modes and
the kind of behavior you will observe when experiencing them.
A community user tries to run in the long_contrib queue.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job has invalid qos
[araim1@tara-fe1 ~]$
Additionally, no slurm.out or slurm.err output is generated
You attempt to use more than two nodes in the long QOS
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
You attempt to use more than 30 nodes total in the long_contrib QOS
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
You attempt to use 30 nodes total in the long_contrib QOS, but not
all nodes are available at the time of submission
Suppose we first submitted a 2 node job, and then a 30 node job
[araim1@tara-fe1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
4278 batch users01 araim1 PD 0:00 30 normal (AssociationResourceLimit)
4277 batch users01 araim1 R 2:54 2 normal n[7-8]
[araim1@tara-fe1 ~]$
The 30 node job remains queued with reason "AssociationResourceLimit", until
all 30 nodes of long_contrib become available.
Your job reaches a maximum walltime limit.
The job is killed with a message in the stderr output
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n1]: error: *** JOB 59545 CANCELLED AT 2011-05-20T08:10:52 DUE TO TIME
LIMIT ***
[araim1@tara-fe1 ~]$
Your job violates the maximum CPU time limit of 512 hours.
The job is killed with a message in the stderr output
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n3]: *** JOB 4254 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
slurmd[n3]: *** STEP 4254.0 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
[araim1@tara-fe1 ~]$
Run a job with a walltime limit too high for the QOS / partition.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job violates accounting policy
(job submit limit, user's size and/or time limits)
[araim1@tara-fe1 ~]$
Try to charge time to a PI who doesn't exist, or who you don't work for.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Invalid account specified
[araim1@tara-fe1 ~]$
Try to use a QOS that doesn't exist.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Job has invalid qos
[araim1@tara-fe1 ~]$
Try to use a partition that doesn't exist.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Invalid partition name specified
[araim1@tara-fe1 ~]$
Try to use more than 8 processes per node.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
[araim1@tara-fe1 ~]$
Try to use more nodes than available in a partition.
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Node count specification invalid
[araim1@tara-fe1 ~]$
Invalid syntax in SLURM batch script
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: unrecognized option `--ndoes=2'
sbatch: error: Try "sbatch --help" for more information
[araim1@tara-fe1 ~]$
Memory limit exceeded by your program
[araim1@tara-fe1 ~]$ cat slurm.err
slurmd[n1]: error: Job 60204 exceeded 10240 KB memory limit, being killed
slurmd[n1]: error: *** JOB 60204 CANCELLED AT 2011-05-27T19:34:34 ***
[araim1@tara-fe1 ~]$
You've set the memory limit too high for the available memory
[araim1@tara-fe1 ~]$ sbatch run.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
[araim1@tara-fe1 ~]$
You try to use more than 30 minutes of walltime in the develop partition,
by setting the --time flag
The job will be stuck in the pending state, with reason "PartitionTimeLimit"
[araim1@tara-fe1 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
62280 develop SNOW araim1 PD 0:00 1 normal (PartitionTimeLimit)
[araim1@tara-fe1 ~]$