UMBC High Performance Computing Facility
Scheduling policy on tara

Introduction

This page documents the scheduling policy that implements the current usage policy. This policy differs from past policies in some significant ways, including the incorporation of SLURM concepts such as QOS and fair-share priority, but we believe it will be natural to use. It is designed to implement the philosophical underpinnings of the usage policy concretely, and the scheduling policy and usage policy are designed together to support the productivity goals of the cluster. All users are expected to read this page and the usage policy carefully; by using the cluster, you agree to abide by the policies stated in these pages.

Fundamentals

A partition represents a group of nodes in the cluster. There are two partitions:
Partition Description Walltime limits
develop There are two nodes in the develop partition, n1 and n2. This partition is dedicated to code under development. Jobs of up to 16 cores may be tested, but run times should be negligible. 5 min default, 30 min max
batch The majority of the compute nodes on tara are allocated to this partition. There are 82 nodes: n3, ..., n84. Jobs running on these nodes are considered "production" runs; users should have a high degree of confidence that bugs have been worked out. ---
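
The partitions and the current state of their nodes can be listed with SLURM's sinfo command (output omitted here, since it changes with the state of the cluster):

    [araim1@tara-fe1 ~]$ sinfo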

When a user runs a job, time is charged to their PI group for accounting purposes, as explained in the usage policy. The historical accounting data is used in scheduling to influence priority, which determines which job will run next when multiple jobs are queued. Note that priority does not affect jobs which are already running. On tara we have implemented a fair-share policy: PIs who have used more than their allocation in their recent history will have a reduced priority, while PIs who have used less than their allocation will have an increased priority.

Following the terminology of the SLURM scheduler, queues are referred to as QOS's (short for Quality of Service). Several QOS's are available, designed to handle different kinds of jobs. Every job runs under a particular QOS, chosen by the user in the submission script, and is subject to that QOS's resource limitations.
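
For illustration, the QOS is selected with the --qos option in the submission script. The sketch below is only an outline, with a placeholder program name; see the how to run page for complete details:

    #!/bin/bash
    #SBATCH --job-name=qos_example
    #SBATCH --partition=batch          # production partition
    #SBATCH --qos=medium               # requested QOS; "normal" is used if this line is omitted
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    srun ./my_program                  # placeholder executable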

Each QOS is designed to handle a particular kind of job. The specific definitions for the QOS's are given in the following table:
QOS Wall time limit per job CPU time limit per job Total node limit for the QOS Node limit per user
short 1 hour 512 hours --- ---
normal (default) 4 hours 512 hours --- ---
medium 24 hours --- 32 ---
long 5 days --- 32 2
long_contrib 5 days --- 32 ---
support 5 days --- --- ---
Note (6/1/2011): At this time, the "2 node limit per user" in the long QOS is not in effect. This constraint will be implemented as soon as possible, but for now the limits for the long QOS are "4 jobs per user" and "2 nodes per job".
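
The limits actually configured on the system can be inspected at any time with SLURM's sacctmgr command, for example (output omitted here, since the fields displayed depend on the SLURM version and configuration):

    [araim1@tara-fe1 ~]$ sacctmgr show qos
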
A general guideline is to choose the QOS with the minimum resources needed to run your job. This is good user behavior, which responsible users should follow, but it also brings a direct benefit to the user: backfilling. Backfilling is a feature of the scheduler that allows your job to run ahead of higher-priority jobs if your job's estimated run time is shorter than the estimated wait time for those jobs to start. A very responsible user (or one who really wishes to take advantage of backfilling) can set specific walltime and memory limits for their job, based on their own estimates, as in the sketch below.
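
For instance (the numbers here are made up for illustration), a job expected to finish within 30 minutes and to use at most about 900 MB of memory per node could declare those estimates explicitly in its submission script:

    #SBATCH --time=00:30:00            # estimated walltime; the job is cancelled if it runs longer
    #SBATCH --mem=900                  # estimated memory per node, in MB

The tighter these estimates, the easier it is for the scheduler to fit the job into otherwise idle time slots.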

Note that QOS's work the same way across both the develop and batch partitions. QOS limits (e.g. number of nodes and walltime) are applied in conjunction with any partition limits. In the case of the develop partition we suggest setting the QOS to the default "normal", or simply leaving it blank, as in the sketch that follows.
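
As a sketch (again with a placeholder program name), a quick test on the develop partition might look as follows; no --qos line is given, so the default "normal" applies:

    #!/bin/bash
    #SBATCH --job-name=dev_test
    #SBATCH --partition=develop        # development nodes n1 and n2
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=00:10:00            # stays within the 30 min partition maximum
    srun ./my_test_program             # placeholder executable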

The CPU time limit per job, enforced in some of the QOS's, constrains the combination of node count and walltime that a job can request so that its total use of CPU time is bounded. For this purpose, it is considered equivalent to use 64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, and so on. With 8 cores per node, the CPU time (walltime times number of cores) of the first case is 64 nodes times 8 cores times 1 hour, which equals 512 hours of CPU time. For demonstration, the following table lists some sample combinations of numbers of nodes and time limits per job, all of which are equivalent to 512 hours of CPU time.

Number of nodes Cores per node Total number of cores Wall time (hours) CPU time (hours)
64 8 512 1 512
32 8 256 2 512
16 8 128 4 512
8 8 64 8 512
4 8 32 16 512
2 8 16 32 512
1 8 8 64 512
1 4 4 128 512
1 2 2 256 512
1 1 1 512 512
Here, the number of nodes in the first column and the cores per node in the second column are multiplied to give the total number of cores in the third column; this is the quantity that enters into the SLURM definition of CPU time. The fourth column then shows which wall times, in hours, yield the 512 hours of CPU time in the fifth column. The numbers in this table also explain how the wall time limits for the QOS's short, normal, and medium were chosen, namely to accommodate the equivalent of the job that uses 64 nodes for 1 hour. Specifically, the wall time limit of each QOS together with the CPU time limit per job restricts each job to 64 nodes in the short QOS, 16 nodes in the normal QOS, and 4 nodes in the medium QOS. These choices ensure that only short (1 hour) jobs can use many nodes in the cluster, while only jobs with relatively few nodes can run for a long time (16 hours). Notice that in any case the wall time limit of the QOS also applies; for instance, the long QOS allows 5 days, which is 120 hours, so jobs with the parameters of the last three rows of the above table would not complete even in the long QOS.
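
As a quick sanity check before submitting, the CPU time of a proposed job can be computed with a few lines of shell arithmetic; in this sketch the numbers are placeholders to be replaced with your own job parameters:

    # nodes * cores_per_node * walltime_hours = CPU time in hours
    NODES=16
    CORES_PER_NODE=8
    WALLTIME_HOURS=4
    CPU_HOURS=$(( NODES * CORES_PER_NODE * WALLTIME_HOURS ))
    echo "CPU time: ${CPU_HOURS} hours"   # prints 512, the per-job limit in the short and normal QOS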

How to submit jobs

Use of the batch system is discussed in detail on the how to run page.

Failure modes

In this section we give some commonly encountered failure modes and the behavior you will observe when experiencing them.
  1. A community user tries to run in the long_contrib QOS.
    [araim1@tara-fe1 ~]$ sbatch run.slurm
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@tara-fe1 ~]$
    
    Additionally, no slurm.out or slurm.err output is generated
  2. You attempt to use more than two nodes in the long QOS
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@tara-fe1 ~]$
    
  3. You attempt to use more than 30 nodes total in the long_contrib QOS
    [araim1@slurm-dev ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@tara-fe1 ~]$
    
  4. You attempt to use 30 nodes total in the long_contrib QOS, but not all nodes are available at the time of submission

    Suppose we first submitted a 2 node job, and then a 30 node job

    [araim1@tara-fe1 ~]$ squeue
      JOBID PARTITION     NAME     USER  ST       TIME  NODES QOS    NODELIST(REASON)
       4278     batch  users01   araim1  PD       0:00     30 normal (AssociationResourceLimit)
       4277     batch  users01   araim1   R       2:54      2 normal n[7-8]
    [araim1@tara-fe1 ~]$
    
    The 30 node job remains queued with reason "AssociationResourceLimit", until all 30 nodes of long_contrib become available.
  5. Your job reaches a maximum walltime limit.

    The job is killed with a message in the stderr output

    [araim1@tara-fe1 ~]$ cat slurm.err
    slurmd[n1]: error: *** JOB 59545 CANCELLED AT 2011-05-20T08:10:52 DUE TO TIME
    LIMIT ***
    [araim1@tara-fe1 ~]$
    
  6. Your job violates the 512 hours maximum CPU time limit.

    The job is killed with a message in the stderr output

    [araim1@tara-fe1 ~]$ cat slurm.err 
    slurmd[n3]: *** JOB 4254 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    slurmd[n3]: *** STEP 4254.0 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    [araim1@tara-fe1 ~]$ 
    
  7. You run a job with a walltime limit that is too high for the QOS or partition.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@tara-fe1 ~]$
    
  8. Try to charge time to a PI who doesn't exist, or who you don't work for.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid account specified
    [araim1@tara-fe1 ~]$
    
  9. Try to use a QOS that doesn't exist.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@tara-fe1 ~]$ 
    
  10. Try to use a partition that doesn't exist.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid partition name specified
    [araim1@tara-fe1 ~]$ 
    
  11. Try to use more than 8 processes per node.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@tara-fe1 ~]$
    
  12. Try to use more nodes than available in a partition.
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Node count specification invalid
    [araim1@tara-fe1 ~]$ 
    
  13. Invalid syntax in SLURM batch script
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: unrecognized option `--ndoes=2'
    sbatch: error: Try "sbatch --help" for more information
    [araim1@tara-fe1 ~]$ 
    
  14. Memory limit exceeded by your program
    [araim1@tara-fe1 ~]$ cat slurm.err 
    slurmd[n1]: error: Job 60204 exceeded 10240 KB memory limit, being killed
    slurmd[n1]: error: *** JOB 60204 CANCELLED AT 2011-05-27T19:34:34 ***
    [araim1@tara-fe1 ~]$ 
    
  15. You've set the memory limit too high for the available memory
    [araim1@tara-fe1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@tara-fe1 ~]$ 
    
  16. You try to use more than 30 minutes of walltime in the develop partition, by setting the --time flag

    The job will be stuck in the pending state, with reason "PartitionTimeLimit"

    [araim1@tara-fe1 ~]$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES QOS     NODELIST(REASON)
      62280   develop     SNOW   araim1 PD       0:00      1 normal  (PartitionTimeLimit)
    [araim1@tara-fe1 ~]$