UMBC High Performance Computing Facility
Scheduling rules on maya

Introduction

This page documents the scheduling rules that implement the current usage rules. These rules differ from past rules in some significant ways, including the incorporation of SLURM concepts such as QOS and fair-share priority, but we believe they will be natural to use. They are designed to implement the philosophical underpinnings of the usage rules concretely, and the scheduling rules and usage rules together are designed to support the productivity goals of the cluster. All users are expected to read this page and the usage rules carefully; by using the cluster, you agree to abide by the rules stated in these pages.

Fundamentals

A partition represents a group of nodes in the cluster. The following partitions are available:
Partition Description Walltime limits
develop There are six nodes in the develop partition: n1, n2, n70, n112, n156, and n196. This partition is dedicated to code under development. Jobs using many cores may be tested, but run times are expected to be negligible. 5 min default, 30 min max
develop-mic There is one node with two Intel Phi cards in this develop partition: maya-usr2. This partition is dedicated to Intel Phi code under development. Jobs using many cores may be tested, but run times are expected to be negligible. 5 min default, 30 min max
batch The majority of the compute nodes on maya are allocated to this partition. There are 229 nodes: n3, ..., n69, n71, ..., n111, n113, ..., n153, n157, ..., n195, n197, ..., n237. Jobs running on these nodes are considered "production" runs; users should have a high degree of confidence that bugs have been worked out. 5 day maximum
prod A large subset of the compute nodes on maya is allocated to this partition. There are 162 nodes: n71, ..., n111, n113, ..., n153, n157, ..., n195, n197, ..., n237. Jobs running on these nodes are considered "long production" runs. Contributing members can use this partition in conjunction with the long_contrib QOS for runs of less than 45 days. The long_prod QOS should be used with this partition. 45 day maximum
mic The nodes with two Intel Phi cards each are allocated to this partition. There are 18 such nodes: n34, ..., n51. Jobs running on these nodes are considered "production" runs; users should have a high degree of confidence that bugs have been worked out. 5 day maximum
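
You can check the current partition definitions and their time limits directly from a login node. The commands below are standard SLURM commands; the username is a placeholder and the exact output depends on the cluster's current configuration:

    [username@maya-usr1 ~]$ sinfo --summarize
    [username@maya-usr1 ~]$ scontrol show partition develop

The first command prints one line per partition, including its time limit and node counts; the second shows the full settings of a single partition.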

When a user runs a job, time is charged to their PI group for accounting purposes, as explained in the usage rules. The historical accounting data is used in scheduling to influence priority, to determine which job will run next if there are multiple queued jobs. Note that priority does not affect jobs which are already running. On maya we have implemented fair-share rules. PIs who have used more than their allocation in their recent history will have a reduced priority, while PIs who have used less than their allocation will have an increased priority.
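
To see how fair-share is affecting your group, you can use the standard SLURM reporting commands sketched below; their availability and output format depend on the scheduler configuration, and the username is a placeholder:

    [username@maya-usr1 ~]$ sshare
    [username@maya-usr1 ~]$ sprio -l

Here sshare reports the usage and effective shares of your accounts, and sprio lists the priority factors (including fair-share) of pending jobs.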

Following the terminology of the SLURM scheduler, queues are referred to as QOS's, short for Quality of Service. Several QOS's are available, designed to handle different kinds of jobs. Every job runs under a particular QOS, chosen by the user in the submission script, and is subject to that QOS's resource limits.
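
The authoritative list of QOS's and their limits lives in the SLURM accounting database and can be displayed with the standard sacctmgr command; the exact columns shown depend on the SLURM version installed:

    [username@maya-usr1 ~]$ sacctmgr show qos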

The specific definitions for the QOS's are given in the following table:
QOS                Wall time limit per job   CPU time limit per job   Total cores limit for the QOS   Cores limit per user
short              1 hour                    1024 hours               ---                             ---
normal (default)   4 hours                   1024 hours               ---                             256
medium             24 hours                  1024 hours               1536                            256
long               5 days                    ---                      256                             16
long_contrib       5 days                    ---                      768                             128
long_prod          45 days                   ---                      64                              ---
support            ---                       ---                      ---                             ---
A general guideline is to choose the QOS with the minimum resources needed to run your job. This is good user behavior that responsible users should follow, but it also benefits you directly through backfilling. Backfilling is a feature of the scheduler that allows your job to run ahead of higher-priority jobs, provided that your job's estimated run time is shorter than the estimated wait time for those jobs to start. A very responsible user (or one who really wishes to take advantage of backfilling) can set specific walltime and memory limits for their job, based on their own estimates.

Note that QOS's work the same way across all partitions including develop and batch. QOS limits (e.g. number of cores and walltime) are applied in conjunction with any partition limits. In the case of the develop partition we suggest setting the QOS to the default "normal", or simply leaving it blank.
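
As an illustration of these guidelines, the resource-related directives of a submission script for a small production run might look like the sketch below. The job name, account (PI group), memory estimate, and executable are placeholders that you must replace with your own values; see the how to run page for complete, tested examples.

    #!/bin/bash
    #SBATCH --job-name=example_job      # placeholder job name
    #SBATCH --output=slurm.out
    #SBATCH --error=slurm.err
    #SBATCH --account=pi_group          # placeholder: the PI group to charge
    #SBATCH --partition=batch           # production partition
    #SBATCH --qos=normal                # smallest QOS that accommodates the job
    #SBATCH --nodes=4                   # 4 nodes ...
    #SBATCH --ntasks-per-node=16        # ... x 16 cores per node = 64 cores
    #SBATCH --time=02:00:00             # realistic walltime estimate helps backfilling
    #SBATCH --mem-per-cpu=1000          # memory estimate per core in MB (placeholder)

    srun ./my_program                   # placeholder executable

This request uses 64 cores for at most 2 hours, i.e. 128 hours of CPU time, which is well within the normal QOS limits of 256 cores per user and 1024 CPU hours per job.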

The CPU time limit per job enforced in some of the QOS's constrains the combination of node count and walltime that a job can request, so that the total CPU time used by the job is limited. For this purpose, it is considered equivalent to use 64 nodes for 1 hour, 32 nodes for 2 hours, 16 nodes for 4 hours, and so on. With 16 cores per node, the CPU time (walltime times number of cores) of a 64-node, 1-hour job is 64 nodes times 16 cores times 1 hour, or 1024 hours of CPU time. For demonstration, the following table lists some sample combinations of node counts and time limits per job, all of which are equivalent to 1024 hours of CPU time. Note that when using --exclusive in a submission script, all cores on the allocated nodes are counted in the CPU time calculation.

Number of nodes   Cores per node   Total number of cores   Wall time (hours)   CPU time (hours)
64                16               1024                    1                   1024
32                16               512                     2                   1024
16                16               256                     4                   1024
8                 16               128                     8                   1024
4                 16               64                      16                  1024
2                 16               32                      32                  1024
1                 16               16                      64                  1024
1                 8                8                       128                 1024
1                 4                4                       256                 1024
1                 2                2                       512                 1024
1                 1                1                       1024                1024
Here, the number of nodes in the first column and the cores per node in the second column are multiplied to give the total number of cores in the third column; this is the quantity that enters into the SLURM definition of CPU time. The fourth column then shows which wall times yield the 1024 hours of CPU time in the fifth column. The numbers in this table also explain how the wall time limits for the QOS's short, normal, and medium were chosen, namely to accommodate the equivalent of a job that uses 64 nodes for 1 hour. Specifically, the wall time limit of each QOS together with the CPU time limit per job restricts a job to at most 64 nodes in the short QOS, 16 nodes in the normal QOS, and 4 nodes in the medium QOS. These choices ensure that only short (1 hour) jobs can use many nodes of the cluster, while only jobs with relatively few nodes can take a long time (16 hours). Note that the wall time limit of a QOS also applies in its own right: for instance, the long QOS allows 5 days, which is 120 hours, so jobs with the parameters of the last four rows of the above table would not complete even in the long QOS.
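
For example, the third row of the table corresponds to a job that uses 16 nodes with 16 cores each for 4 hours. Illustrative directives for such a request (assuming the node and core counts of the table) would be:

    #SBATCH --qos=normal                # 4 hour walltime limit, 1024 CPU hour limit
    #SBATCH --nodes=16                  # 16 nodes ...
    #SBATCH --ntasks-per-node=16        # ... x 16 cores per node = 256 cores
    #SBATCH --time=04:00:00             # 256 cores x 4 hours = 1024 CPU hours, right at the limit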

How to submit jobs

Use of the batch system is discussed in detail on the how to run page.

Failure modes

In this section we list some commonly encountered failure modes and the behavior you will observe in each case.
  1. A community user tries to run in the long_contrib QOS.
    [araim1@maya-usr1 ~]$ sbatch run.slurm
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@maya-usr1 ~]$
    
    Additionally, no slurm.out or slurm.err output is generated
  2. You attempt to use more than two nodes in the long QOS
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  3. You attempt to use more than 30 nodes total in the long_contrib QOS
    [araim1@slurm-dev ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  4. You attempt to use 30 nodes total in the long_contrib QOS, but not all nodes are available at the time of submission

    Suppose we first submitted a 2 node job, and then a 30 node job

    [araim1@maya-usr1 ~]$ squeue
      JOBID PARTITION     NAME     USER  ST       TIME  NODES QOS    NODELIST(REASON)
       4278     batch  users01   araim1  PD       0:00     30 normal (AssociationResourceLimit)
       4277     batch  users01   araim1   R       2:54      2 normal n[7-8]
    [araim1@maya-usr1 ~]$
    
    The 30 node job remains queued with reason "AssociationResourceLimit", until all 30 nodes of long_contrib become available.
  5. Your job reaches a maximum walltime limit.

    The job is killed with a message in the stderr output

    [araim1@maya-usr1 ~]$ cat slurm.err
    slurmd[n1]: error: *** JOB 59545 CANCELLED AT 2011-05-20T08:10:52 DUE TO TIME
    LIMIT ***
    [araim1@maya-usr1 ~]$
    
  6. Your job exceeds the maximum CPU time limit of its QOS.

    The job is killed with a message in the stderr output

    [araim1@maya-usr1 ~]$ cat slurm.err 
    slurmd[n3]: *** JOB 4254 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    slurmd[n3]: *** STEP 4254.0 CANCELLED AT 2011-05-27T19:42:14 DUE TO TIME LIMIT ***
    [araim1@maya-usr1 ~]$ 
    
  7. You run a job with a walltime limit that is too high for the QOS or partition.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job violates accounting policy
    (job submit limit, user's size and/or time limits)
    [araim1@maya-usr1 ~]$
    
  8. Try to charge time to a PI who doesn't exist, or for whom you don't work.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid account specified
    [araim1@maya-usr1 ~]$
    
  9. Try to use a QOS that doesn't exist.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Job has invalid qos
    [araim1@maya-usr1 ~]$ 
    
  10. Try to use a partition that doesn't exist.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Invalid partition name specified
    [araim1@maya-usr1 ~]$ 
    
  11. Try to use more processes per node than are available.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@maya-usr1 ~]$
    
  12. Try to use more nodes than available in a partition.
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Node count specification invalid
    [araim1@maya-usr1 ~]$ 
    
  13. Invalid syntax in SLURM batch script
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: unrecognized option `--ndoes=2'
    sbatch: error: Try "sbatch --help" for more information
    [araim1@maya-usr1 ~]$ 
    
  14. Memory limit exceeded by your program
    [araim1@maya-usr1 ~]$ cat slurm.err 
    slurmd[n1]: error: Job 60204 exceeded 10240 KB memory limit, being killed
    slurmd[n1]: error: *** JOB 60204 CANCELLED AT 2011-05-27T19:34:34 ***
    [araim1@maya-usr1 ~]$ 
    
  15. You've set the memory limit too high for the available memory
    [araim1@maya-usr1 ~]$ sbatch run.slurm 
    sbatch: error: Batch job submission failed: Requested node configuration is not available
    [araim1@maya-usr1 ~]$ 
    
  16. You try to use more than 30 minutes of walltime in the develop partition by setting the --time flag

    The job will be stuck in the pending state, with reason "PartitionTimeLimit"

    [araim1@maya-usr1 ~]$ squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES QOS     NODELIST(REASON)
      62280   develop     SNOW   araim1 PD       0:00      1 normal  (PartitionTimeLimit)
    [araim1@maya-usr1 ~]$