UMBC High Performance Computing Facility
 
How to run programs on tara
 
Introduction
Running a program on tara is a bit different from running one on a standard
workstation. When we log into the cluster, we are interacting with the front end
node. But we would like our programs to run on the compute nodes, which is
where the real computing power of the cluster is. We will walk through
the processes of running serial and parallel code on the cluster, and then
later discuss some of the finer details. This page uses the code examples
from the compilation tutorial.
Please download and compile those examples, so you can follow along.
Resource-intensive jobs (long running, high memory demand, etc.) should 
be run on the compute nodes of the cluster. You cannot execute jobs 
directly on the compute nodes yourself; you must request the cluster's 
batch system do it on your behalf. To use the batch system, you will
submit a special script which contains instructions to execute your
job on the compute nodes. When submitting your job, you specify one
of the available queues. Your job will wait in the queue until it 
is "next in line", and free processors on the compute nodes become 
available. Which job is "next in line" is determined by the scheduling 
policy of the cluster. Once a job is started, it continues until it
either completes or reaches its time limit, in which case it is 
terminated by the system.
The batch system used on tara is called SLURM, which is short for
Simple Linux Utility for Resource Management.
Users transitioning from the cluster hpc should be aware that SLURM behaves 
a bit differently from PBS, and the scripting is a little different too. Unfortunately,
this means you will need to rewrite your batch scripts. However, many of the
confusing points of PBS, such as requesting the number of nodes and
tasks per node, are simplified in SLURM.
 
Queues on tara
There are several different queues to which users can submit jobs. Make sure
to note that "queues" are also called "partitions" in SLURM terminology. You'll
see the term "partition" used when interacting with the system.
| Queue | Priority | Node Limits | Default  Node Sharing | Default/Max Time Limit | Notes | 
| develop | N/A | two nodes total, no limits per user or per job | Shared | 5 minutes / 30 minutes | This queue is for debugging and testing code before submitting it
to the other queues. Do not run programs on the other queues until
you have verified that they run without error and without problems
in this queue. Two nodes with InfiniBand connection are available
for this queue, so you should be able to test all aspects of code
including message passing across and between nodes.
This queue uses dedicated hardware and does not compete for
resources with the other queues. | 
| long_term | 5 | four nodes total, one node per user at a time | Shared | 4 hours / 5 days | This queue is intended for jobs that require very few nodes, but
might run for a long time.
The scheduler gives this queue very low priority, so nodes will only be
available to it at times of low system load. | 
| serial | 4 | all compute nodes available, one core per job, no limit on jobs per user | Shared | 4 hours / 23 hours | This queue is for serial jobs only. The scheduler will take jobs from
it periodically to fill in the "gaps" in time and space between parallel jobs.
Its priority is lower than that of the default queue,
which ensures that serial jobs do not prevent parallel jobs from running.
Jobs in this queue will not have exclusive access to nodes; i.e. other jobs
may run concurrently on different cores of the same node. | 
| parallel | 3 | all compute nodes available, no limits per user or per job | Exclusive | 4 hours / 23 hours | This is the standard queue for all users and for parallel code.
All 82 compute nodes are available for scheduling, subject to the limits
established by the usage policy.
The actual number of nodes that you can get depends on the system load,
so you should monitor this (see squeue information below) before submitting jobs.
Each job gets exclusive use of nodes through this queue, i.e., one job's
nodes are not shared with jobs from other users;
this makes this queue inappropriate for
typical serial or software package jobs.
The default walltime limit is intended to prevent run-away jobs from
clogging the queue.
Note that the scheduling software uses your requested walltime to
schedule jobs into "gaps" in space and time between other parallel jobs,
effectively prioritizing shorter jobs. | 
| queues for paying research groups | 2 | limited to number of nodes purchased by group,
no limits per user or per job | TBD | no time limit | Each research group that purchased compute nodes will receive their
own queue. If your group
has purchased N nodes, you will be able to run jobs in this queue using up 
to 8*N cores. The cores will be distributed among the nodes based on your
submission script. Jobs in this queue are given high priority by the scheduler,
which implements the priority access established in the usage policy.
When system load permits, you can run additional jobs through the
default queue.
Note that submitting all your jobs to the queue of your research group
is also a form of good user behavior, because it leaves the
public nodes available for all other users.
NOTE (4/27/2010): These queues do not exist yet, and may not
represent the final solution for giving priority of cluster resources to 
paying research groups. | 
| perform | 1 | all compute nodes available, no limits per user or per job | Shared | 4 hours / no time limit | This is the highest priority queue and is reserved
for admin use, testing purposes, performance studies, and special arrangements.
(It has a very different meaning than the high_priority queue
on hpc; users do not have access to this queue by default.) | 
Interacting with the Batch System
 
There are several basic commands you'll need to know to submit jobs, 
cancel them, and check their status. These are:
sbatch - submit a job to the batch queue system
squeue - check the current jobs in the batch queue system
sinfo - view the current status of the queues
scancel - cancel a job
Check here for more detailed 
information about job monitoring.
scancel
The first command we will mention is scancel. If you've submitted a job 
that you no longer want, you should be a responsible user and kill it. 
This will prevent resources from being wasted, and allow other users' 
jobs to run. Jobs can be killed while they are pending (waiting to run), 
or while they are actually running. To remove a job from the queue or 
to cancel a running job cleanly, use the scancel command with the 
identifier of the job to be deleted. For instance:
[araim1@tara-fe1 hello_serial]$ scancel 636
[araim1@tara-fe1 hello_serial]$
 
 
The job identifier can be obtained from the job listing from squeue (see 
below) or you might have noted it from the response of the call to sbatch, 
when you originally submitted the job (also below). Try "man scancel" for 
more information.
sbatch
Now that we know how to cancel a job, we will see how to submit one. You 
can use the sbatch command to submit a script to the batch queue system.
[araim1@tara-fe1 hello_serial]$ sbatch run.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_serial]$ 
 
 
In this example run.slurm is the script we are sending to the batch queue 
system. We will see shortly how to formulate such a script. Notice that sbatch
returns a job identifier. We can use this to kill the job later if necessary, or 
to check its status. For more information, check the man page by running "man sbatch".
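As a small sketch of how the job identifier might be captured for later use with squeue or scancel, the snippet below extracts the number from a response line. The sample line is copied from the transcript above, the variable names are illustrative, and capturing the response with "$(sbatch run.slurm 2>&1)" is an assumption about how sbatch prints its message on your system.

```shell
# Sample response, copied from the sbatch transcript above. On the cluster it
# could be captured with: response=$(sbatch run.slurm 2>&1)   (assumption)
response="sbatch: Submitted batch job 2626"

# Strip everything up to and including the last space, leaving the job ID.
jobid="${response##* }"
echo "Job ID: $jobid"
```

The value stored in jobid can then be passed on, for instance as "scancel $jobid".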
squeue
You can use the squeue command to check the status of jobs in the batch queue 
system. Here's an example of the basic usage:
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2564  parallel   MPI_DG   araim1  PD       0:00     32 (Resources)
   2626  parallel fmMle_no   araim1   R       0:02      4 n[9-12]
   2578 performan   MPI_DG   araim1   R 1-02:40:36      1 n31
   2579 performan   MPI_DG   araim1   R 1-02:40:36      2 n[1-2]
   2580 performan   MPI_DG   araim1   R 1-02:40:36      1 n31
   2615 performan     bash   aaronk   R    2:41:51      4 n[3-6]
 
 
The most interesting column is the one titled ST for "status". It shows 
what a job is doing at this point in time. The state "PD" indicates 
that the job has been queued. When free processor cores become available
and this job is "next in line", it will change to the "R" state and
begin executing. You may also see a job with status "CG" which means it's
completing, and about to exit the batch system. Other statuses are possible
too; see the man page for squeue. Once a job has exited the batch queue 
system, it will no longer show up in the squeue display.
We can also see several other pieces of useful information. The TIME
column shows the current walltime used by the job. For example, job
2578 has been running for 1 day, 2 hours, 40 minutes, and 36 seconds. The
nodelist column shows which compute nodes have been assigned to the job.
For job 2626, nodes n9, n10, n11, and n12 are being used. However for
job 2564, we can see that it's pending because it's waiting on resources.
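When the listing is long, it can help to filter it. The following sketch picks out the IDs of pending jobs from a few rows copied from the listing above; on the cluster, the output of squeue itself would be piped into awk instead of the sample text. The variable names are illustrative only.

```shell
# A few rows copied from the squeue listing above, used as sample input.
listing='  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2564  parallel   MPI_DG   araim1  PD       0:00     32 (Resources)
   2626  parallel fmMle_no   araim1   R       0:02      4 n[9-12]
   2615 performan     bash   aaronk   R    2:41:51      4 n[3-6]'

# Skip the header line and print the job ID (column 1) of every job whose
# status (column 5) is PD, i.e. pending.
pending=$(printf '%s\n' "$listing" | awk 'NR > 1 && $5 == "PD" { print $1 }')
echo "Pending jobs: $pending"
```

squeue also has its own filtering options, such as "-u username" to restrict the listing to one user; see the man page for details.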
sinfo
The sinfo command also shows the current status of the batch system, but from 
the point of view of the queues. Here is an example 
[araim1@tara-fe1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
develop*     up      31:00      2   idle n[33-34]
long_term    up 5-01:00:00      1  alloc n8
long_term    up 5-01:00:00     31   idle n[1-7,9-32]
serial       up   23:30:00      1  alloc n8
serial       up   23:30:00     31   idle n[1-7,9-32]
parallel     up   23:30:00      1  alloc n8
parallel     up   23:30:00     31   idle n[1-7,9-32]
performan    up   infinite      1  alloc n8
performan    up   infinite     31   idle n[1-7,9-32]
[araim1@tara-fe1 ~]$ 
 
 
In this scenario, the two develop nodes are idle, node n8 is busy running
a job, and the remaining compute nodes are idle.
Running Serial Jobs
 
This section assumes you've already compiled the serial hello world 
example. Now we'll see how to run it several different ways.
Test runs on the front-end node
 The most obvious way to run the program is 
on the front end node, which we normally log into.
[araim1@tara-fe1 hello_serial]$ ./hello_serial 
Hello world from tara-fe1.rs.umbc.edu
[araim1@tara-fe1 hello_serial]$
 
 
We can see the reported hostname which confirms that the program ran 
on the front end node.  Jobs should usually be run on the
front end node only for testing purposes. The purpose of the front end node is to 
develop code and submit jobs to the compute nodes. Everyone who uses 
tara must interact with the front end node, so slowing it down will affect 
all users. Therefore, the usage policy prohibits the use of the front end node
for running jobs. One exception to this rule is graphical post-processing
of results that can only be done interactively in some software packages,
for instance, COMSOL Multiphysics.
(Our "hello world" example here uses very little memory and runs 
very quickly and is run on the front end node exactly for testing purposes
as part of this tutorial.)
Test runs on the develop queue 
Let's submit our job to the testing queue, since we just finished 
creating it and we're not completely sure that it works. The following 
script will accomplish this. Save it to your account alongside the 
"hello_serial" executable.
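The script is not reproduced in this copy of the page. The following is a minimal sketch of what it might contain, assuming it mirrors the parallel scripts shown later on this page without the "srun" launcher or the node and task counts. The job name hello_serial is taken from the option table below, and the file name from the sbatch command that follows. It is written as a here-document so that it can be pasted directly at the shell prompt.

```shell
# Create the batch script for the develop queue (sketch, see assumptions above).
cat > run-testing.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
./hello_serial
EOF
```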
Now we're ready to submit our job to the scheduler. To accomplish this, 
use the sbatch command as follows
[araim1@tara-fe1 hello_serial]$ sbatch run-testing.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_serial]$
 
 
If the submission was successful, the sbatch command returns a job 
identifier to us. We can use this to check the status of the job 
(squeue), or delete it (scancel) if necessary. This job should run very 
quickly if there are processors available, but we can try to check its 
status in the batch queue system. The following command shows that our 
job is not in the system - it has already completed.
[araim1@tara-fe1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
[araim1@tara-fe1 hello_serial]$
 
 
We should have obtained two output files. The file slurm.err contains 
stderr output from our program. If slurm.err isn't empty, check the 
contents carefully as something may have gone wrong. The file slurm.out 
contains our stdout output; it should contain the hello world message 
from our program.
[araim1@tara-fe1 hello_serial]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_serial]$ cat slurm.err 
[araim1@tara-fe1 hello_serial]$ cat slurm.out
Hello world from n1
[araim1@tara-fe1 hello_serial]$ 
 
 
Notice that the hostname is no longer that of the front end node, but that of one of the 
test nodes. We've successfully used one of the compute nodes to run our 
job. The develop queue limits jobs to five minutes by default, measured in 
"walltime", which is just the elapsed run time. After 
your job has reached this time limit, it is stopped by the scheduler. 
This is done to ensure that everyone has a fair chance to use the 
cluster.
Note that with SLURM, the stdout and stderr files (slurm.out and slurm.err) will
be written as your job executes. This is different from PBS, which was used on hpc,
where stdout/stderr files did not exist until the job completed.
The stdout and stderr mechanisms in the batch system are not intended for 
large amounts of output. If your program writes out more than a few KB of 
output, consider using file I/O to write to logs or data files.
Production runs on the serial queue
Once our job has been tested and we're confident that it's working 
correctly, we can run it in the serial queue. Now the walltime 
limit for our job will be raised from five minutes to four hours (by default). 
There are also many more compute nodes available in this queue, so we probably 
won't have to wait long to find a free processor. Start by creating 
the following script.
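The script is missing from this copy of the page. A minimal sketch follows, assuming the serial script consists of the usual job name, output, error, and partition lines followed by the executable; the file name is taken from the sbatch command below.

```shell
# Create the batch script for the serial queue (sketch, see assumptions above).
cat > run-serial.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=serial
./hello_serial
EOF
```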
The only change from the testing queue script is the name of the queue 
to use. To submit our job to the scheduler, we issue the command
[araim1@tara-fe1 hello_serial]$ sbatch run-serial.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_serial]$
 
 
We can check the job's status, but in this case it has already completed
[araim1@tara-fe1 hello_serial]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
[araim1@tara-fe1 hello_serial]$
 
 
This time our stdout output file indicates that our job has run on one 
of the primary compute nodes, rather than a develop node
[araim1@tara-fe1 hello_serial]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_serial]$ cat slurm.err 
[araim1@tara-fe1 hello_serial]$ cat slurm.out
Hello world from n3
[araim1@tara-fe1 hello_serial]$ 
 
 
When using the production queues, you'll be sharing resources with other
researchers. So keep your duties as a responsible user in mind, which are
described in this tutorial and in the 
usage policy.
Running Parallel Jobs
This section assumes that you've successfully 
compiled the parallel hello world 
example. Now we'll see how to run this program on the cluster.
Test runs on the develop queue
Example 1: Single process 
First we will run the hello_parallel program as a single process. This 
will appear very similar to the serial job case. The difference is that
now we are using the MPI-enabled executable hello_parallel, rather than
the plain hello_serial executable.
Create the following script in the same directory as the hello_parallel
program. Notice the addition of the "srun" command before the executable,
which is used to launch MPI-enabled programs. We've also added "--nodes=1"
and "--ntasks-per-node=1" to specify what kind of resources we'll need for
our parallel program.
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun ./hello_parallel
 
Download: 
../code-2010/hello_parallel/mvapich2-np1.slurm
 
Now submit the script
[araim1@tara-fe1 hello_parallel]$ sbatch mvapich2-np1.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_parallel]$
 
 
Checking the output after the job has completed, we can see that exactly 
one process has run and reported back.
[araim1@tara-fe1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_parallel]$ cat slurm.err 
[araim1@tara-fe1 hello_parallel]$ cat slurm.out
Hello world from process 000 out of 001, processor name n1
[araim1@tara-fe1 hello_parallel]$ 
 
 
Example 2: One node, two processes
Next we will run the job on two processes of the same node. This is an 
important test to ensure that our code will function in parallel. We want to
be especially careful that the communications work correctly, and that processes
don't hang. We modify the single process script and set "--ntasks-per-node=2".
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
srun ./hello_parallel
 
Download: 
../code-2010/hello_parallel/mvapich2-ppn2.slurm
 
Submit the script to the batch queue system
[araim1@tara-fe1 hello_parallel]$ sbatch mvapich2-ppn2.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_parallel]$
 
 
Now observe that two processes have run and reported in. Both were 
located on the same node as we expected.
[araim1@tara-fe1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_parallel]$ cat slurm.err 
[araim1@tara-fe1 hello_parallel]$ cat slurm.out 
Hello world from process 000 out of 002, processor name n1
Hello world from process 001 out of 002, processor name n1
[araim1@tara-fe1 hello_parallel]$ 
 
 
Example 3: Two nodes, one process per node
Now let's try to use two different nodes, but only one process on each node.
This will exercise our program's use of the high performance network, which 
didn't come into the picture when a single node was used.
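The script for this example is not reproduced in this copy of the page. A minimal sketch, assuming it differs from the Example 2 script only in the node and task counts, with the file name taken from the sbatch command below:

```shell
# Two nodes, one MPI process on each (sketch, see assumptions above).
cat > mvapich2-nodes2-ppn1.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
srun ./hello_parallel
EOF
```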
Submit the script to the batch queue system
[araim1@tara-fe1 hello_parallel]$ sbatch mvapich2-nodes2-ppn1.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_parallel]$
 
 
Notice that again we have two processes, but this time they have distinct
processor names.
[araim1@tara-fe1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_parallel]$ cat slurm.err 
[araim1@tara-fe1 hello_parallel]$ cat slurm.out
Hello world from process 000 out of 002, processor name n1
Hello world from process 001 out of 002, processor name n2
[araim1@tara-fe1 hello_parallel]$
 
 
Example 4: Two nodes, eight processes per node
To illustrate the use of more processes, let's try a job that uses two nodes,
eight processes on each node. This is still possible on the develop queue.
Therefore it is possible to run small performance studies which are completely 
restricted to the develop queue. Use the following batch script
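The script is missing from this copy of the page. A minimal sketch, assuming it follows the same pattern as the earlier examples with only the node and task counts changed, and with the file name taken from the sbatch command below:

```shell
# Two nodes, eight MPI processes on each, 16 in total (sketch, see assumptions above).
cat > mvapich2-nodes2-ppn8.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
srun ./hello_parallel
EOF
```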
Submit the script to the batch system
[araim1@tara-fe1 hello_parallel]$ sbatch mvapich2-nodes2-ppn8.slurm
sbatch: Submitted batch job 2626
[araim1@tara-fe1 hello_parallel]$
 
 
Now observe the output. Notice that the processes have reported back in a 
non-deterministic order, and there are eight per node if you count them.
[araim1@tara-fe1 hello_parallel]$ ls slurm.*
slurm.err  slurm.out
[araim1@tara-fe1 hello_parallel]$ cat slurm.err 
[araim1@tara-fe1 hello_parallel]$ cat slurm.out
Hello world from process 002 out of 016, processor name n1
Hello world from process 011 out of 016, processor name n2
Hello world from process 014 out of 016, processor name n2
Hello world from process 006 out of 016, processor name n1
Hello world from process 010 out of 016, processor name n2
Hello world from process 007 out of 016, processor name n1
Hello world from process 001 out of 016, processor name n1
Hello world from process 015 out of 016, processor name n2
Hello world from process 000 out of 016, processor name n1
Hello world from process 008 out of 016, processor name n2
Hello world from process 003 out of 016, processor name n1
Hello world from process 012 out of 016, processor name n2
Hello world from process 005 out of 016, processor name n1
Hello world from process 013 out of 016, processor name n2
Hello world from process 004 out of 016, processor name n1
Hello world from process 009 out of 016, processor name n2
[araim1@tara-fe1 hello_parallel]$
 
 
Production runs on the parallel queue
Now we've tested our program in several important configurations in the
develop queue. We know that it performs well, and processes do not hang.
We may now want to solve larger problems which are more time consuming,
or perhaps we may wish to use more processes. We can promote our code to
"production", by simply changing "--partition=develop" to 
"--partition=parallel". 
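One way to make this change without editing the file by hand is with sed. The sketch below recreates the Example 1 script so that it is self-contained, then writes a production copy. The output file name mvapich2-np1-parallel.slurm is hypothetical. Note that running it as written overwrites any existing mvapich2-np1.slurm in the current directory.

```shell
# Recreate the develop-queue script from Example 1 (self-contained sketch).
cat > mvapich2-np1.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun ./hello_parallel
EOF

# Write a copy that targets the parallel queue; the file name is hypothetical.
sed 's/--partition=develop/--partition=parallel/' mvapich2-np1.slurm \
    > mvapich2-np1-parallel.slurm

# Show the line that changed.
grep -- '--partition' mvapich2-np1-parallel.slurm
```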
Some details about the batch system
A SLURM batch script is a special kind of shell script. As we've seen, it contains
information about the job like its name, expected walltime, etc. It also
contains the procedure to actually run the job. Read on for some important 
details about SLURM scripting, as well as a few other features that we didn't 
mention yet.
Parts of a SLURM script
Here is a quick reference for the options discussed on this page.
| : (colon) | Indicates a commented-out line that should be ignored by the scheduler. | 
| #SBATCH | Indicates a special line that should be interpreted by the scheduler. | 
| srun ./hello_parallel | This is a special command used to execute MPI programs. The command uses 
directions from SLURM to assign your job to the scheduled nodes. | 
| --job-name=hello_serial | This sets the name of the job, which is the name that shows up in the "NAME" column
in squeue's output. The name has no significance to the scheduler, but helps 
make the display more convenient to read. | 
| --output=slurm.out --error=slurm.err
 | This tells SLURM where it should send your job's output stream and error stream, respectively.
If you would like to prevent either of these streams from being written, set the file name to /dev/null | 
| --partition=parallel | Set the queue (aka partition) in which your job will run. See above on
this page for the list of queues and their descriptions. | 
| --nodes=4 | Request four nodes. If you don't request all processor cores on a
node, the queue you've chosen determines whether you might share that node with
other users. | 
| --ntasks-per-node=8 | Request eight tasks to be run on each node. The number of tasks may not
exceed the number of processor cores on the node. | 
| --ntasks=11 | Request 11 tasks for your job. | 
| --time=1-12:30:00 | This option sets the maximum amount of time SLURM will allow your job to 
run before it is automatically killed. In the example shown, we have requested
1 day, 12 hours, 30 minutes, and 0 seconds. Several other formats are accepted
such as "HH:MM:SS" (assuming less than a day). | 
| --mail-type=type | SLURM can email you when your job reaches certain states. Set type to one
of: BEGIN to be notified when your job starts, END when it ends, FAIL
if it fails to run, or ALL for all of the above. See the example below. | 
| --mail-user=email@umbc.edu | Specify the recipient of notification emails (see the example below). | 
| --mem-per-cpu=MB | Specify a memory limit, in MB, for each process of your job. The default is 
2994 MB. | 
| --mem=MB | Specify a memory limit, in MB, for each node of your job. By default, 
the per-core limit is used instead. | 
| --exclusive | Specify that you need exclusive access to nodes for your job. This is the
opposite of "--share". The default behavior depends on the queue you use. | 
| --share | Specify that your job may share nodes with other jobs. This is the
opposite of "--exclusive". | 
| --begin=2010-01-20T01:30:00 | Tell the scheduler not to attempt to run the job until the given time
has passed. | 
| --dependency=afterany:15030:15031 | Tell the scheduler not to run the job until the jobs with IDs 15030 and 15031 have
terminated, whether or not they succeeded. | 
| --account=pi_name | Tell the scheduler to charge this job to pi_name. | 
Job scheduling issues
Don't leave large-scale jobs enqueued during weekdays. 
Suppose you have a job that requires all the nodes on the cluster. If you 
submit this job to the scheduler, it will remain enqueued until all the nodes
become available. Usually the scheduler will allow smaller jobs to run first,
but sometimes it will enqueue them behind yours. The latter causes a problem 
during peak usage hours, because it clogs up the queue and diminishes the 
overall throughput of the cluster. The best times to submit these large-scale 
types of jobs are nights and weekends.
Make sure the scheduler is in control of your programs.
Avoid spawning processes or running background jobs from your code, as the
scheduler can lose track of them. Programs running outside of the 
scheduler hinder its ability to allocate new jobs to free processors.
Avoid jobs that run for many days uninterrupted.
This is a note for users with access to the unlimited time queues. It's
best for the overall productivity of the cluster to design jobs that
are "medium" in size - they should be large enough to accomplish 
a significant amount of work, but small enough so that resources
are not tied up for too long.
Consider saving your progress. 
This is related to the issue above. Running your entire computation at 
once might be impractical or infeasible. Besides the fact that very long 
jobs can make scheduling new jobs difficult, it can also be very 
inconvenient for you if they fail after running for several days (due to 
a hardware issue for example). It may be possible to design your code to 
save its progress occasionally, so that it won't need to restart from 
the beginning if there are any issues.
Estimate memory usage before running your job.
As discussed later in this page, jobs on tara have a default memory limit
which you can raise or lower as needed. If your job uses more than this
limit, it will be killed by the scheduler. Requesting the maximum available
memory for every job is not good user behavior, though, because it can
lower the overall productivity of the system. Imagine you're running a serial
job which requires half the available memory. Specifying this in your 
submission script will allow the scheduler to use the remaining 7 cores
and 12 GB on the node for other users' jobs.
The best strategy is to estimate how much memory you will need, and specify 
a reasonable upper bound in your submission script. Two 
suggestions to do this estimation are (1) calculate the sizes of the main
objects in your code, and (2) run a small-scale version of the problem 
and observe the actual memory used. For more information on how to 
observe memory usage, see
checking memory usage.
Estimate walltime usage before running your job.
Giving an accurate estimate of your job's necessary walltime is very helpful
for the scheduler, and the overall productivity of the cluster. See below
for more information, and an example.
Use fewer nodes and more processes per node.
Performance studies have demonstrated that running several processes per
node generally gives performance comparable to running the same number of
processes with only one per node, which requires more nodes. Using the minimum number of nodes required for
your job benefits the overall productivity of the cluster. See the 
technical report HPCF-2010-2.
Consider whether you need exclusive access to your nodes, or if 
they can be shared with others.
If your job isn't using all 8 cores on a node, it might be possible for another
job to make use of the free cores. Allowing your nodes to be shared helps to
improve the overall productivity of the cluster. The downside however is that
other jobs might interfere with the performance of your job. If your job uses
a small amount of memory and it's not a performance study (for example), sharing
is probably a good option. To see how to control exclusive vs. shared, see
the examples below.
Specifying a time limit
Most of the queues on tara have two walltime limits - a default limit and a
maximum limit. Let's consider the parallel queue for example. If you do not 
specify a "--time=" statement in your batch script, the scheduler will use 
the default four hour time limit. If your job runs over this limit, it will 
be terminated. You may request a larger time limit, up to 23 hours maximum. 
However, it benefits you and other cluster users to give a good estimate, and 
not just request the maximum.
Suppose the system currently has 14 free nodes, and there are two jobs in the
parallel queue waiting to run. Suppose also that no additional nodes will become
free in the next few hours. The first queued job ("job #1") requires 16 nodes, 
and the second job ("job #2") requires only 2 nodes. Since job #1 was queued 
first, job #2 would normally need to wait behind it. However, if the scheduler
sees that job #2 would complete in the time that job #1 would be waiting, it
can allow job #2 to skip ahead.
Here is an example where we've specified a time limit of 10 hours,
15 minutes, and 0 seconds. Notice that we've started with the batch script
from Running Parallel Jobs, Example 1 
and added a single "--time=" statement.
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=parallel
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=10:15:00
srun ./hello_parallel
 
Download: 
../code-2010/hello_parallel/mvapich2-np1-walltime.slurm
 
Email notifications for job status changes
You can ask the scheduler to email you when certain events related to your 
job occur, namely:
When the job starts running
When the job exits normally
If the job is aborted
As an example of how to use this feature, let's ask the scheduler to email
us on all three events when running the hello_serial program. Let's start
with the batch script developed earlier
and add the options "--mail-type=ALL" and "--mail-user=username@domain.edu",
where "username@domain.edu" is replaced by your actual email address.
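The resulting script is not reproduced in this copy of the page. A minimal sketch, assuming the serial hello world script from earlier targets the develop queue; the file name run-email.slurm is hypothetical, and the address is a placeholder to be replaced with your own.

```shell
# Serial job with email notifications on start, end, and failure (sketch).
cat > run-email.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@domain.edu
./hello_serial
EOF
```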
After submitting this script, we can check our email and 
should find messages like the following.
From: Simple Linux Utility for Resource Management <slurm@tara-mgt.rs.umbc.edu>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Began
To: username@domain.edu
 
 
From: Simple Linux Utility for Resource Management <slurm@tara-mgt.rs.umbc.edu>
Date: Thu, Jan 14, 2010 at 10:53 AM
Subject: SLURM Job_id=2655 Name=hello_serial Ended
To: username@domain.edu
 
 
Because hello_serial is such a trivial program, the start and end emails
appear to have been sent simultaneously. For a more substantial program
the waiting time could be significant, both for your job to start and for
it to run to completion. In this case email notifications could be useful 
to you.
Controlling exclusive vs. shared access to nodes 
Certain queues do not give you exclusive access to the nodes you're assigned 
by the scheduler, by default. For example, if you submit a job to the serial
queue, it may run on a node with other users' jobs. Other queues like the 
parallel queue have the opposite default behavior, giving exclusive access
to any nodes assigned to you. You can override the default behavior of the 
queue using the "--exclusive" and "--share" options.
Here's an example where we reserve an entire node in the serial queue.
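The script for this case is missing from this copy of the page. A minimal sketch, assuming the serial hello world job from earlier; the file name run-exclusive.slurm is hypothetical.

```shell
# Serial job that asks for a whole node to itself (sketch, see assumptions above).
cat > run-exclusive.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=serial
#SBATCH --exclusive
./hello_serial
EOF
```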
In this example, we permit sharing of our nodes in the parallel queue.
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=parallel
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --share
srun ./hello_parallel
 
Download: 
../code-2010/hello_parallel/mvapich2-share.slurm
 
Overriding the default exclusive/shared behavior of the queue should not be
done arbitrarily. Before using these options in a job, make sure you've
thought it through. This applies especially to the "--exclusive" option, which
could have a negative effect on the overall productivity of the cluster.
The memory limit
Jobs on tara are limited to a maximum of 23,954 MB per node out of the total 
24 GB system memory. A small amount is reserved for the operating system,
to protect against jobs overwhelming the nodes. By default, jobs are limited 
to 2994 MB per core, based on the number of cores you have requested. 
If your job goes over the memory limit, it will be killed by the batch system.
The memory limit may be specified per core or per node.
To set the limit per core, simply add a line to your submission script as follows:
#SBATCH --mem-per-cpu=4500
 
 
where 4500 represents a number of MB. Similarly, to set the limit per node you 
can use this instead.
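The per-node form is assumed here to be the "--mem" option named later in this section. The value 4500 is again a number of MB and is only illustrative.

```shell
# Limit the job to 4500 MB on each node it is assigned
#SBATCH --mem=4500
```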
In the serial case, the two options are equivalent. 
For most parallel situations it is probably more natural to use the
per core limit, given that the scheduler has some freedom to 
assign processes to nodes for you. 
If your job is killed because it has exceeded its memory limit,
you will receive an error similar to the following in your stderr output.
Notice that the effective limit is reported in the error.
slurmd[n1]: error: Job 13902 exceeded 3065856 KB memory limit, being killed
slurmd[n1]: error: *** JOB 13902 CANCELLED AT 2010-04-22T17:21:40 ***
srun: forcing job termination
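As a quick check, the figure in this error message can be converted to MB with shell arithmetic. The conversion below assumes 1 MB = 1024 KB; under that assumption the reported value works out to 2994 MB, the default per-core limit mentioned above.

```shell
# Limit reported by slurmd in the error message, in KB
limit_kb=3065856
# Convert to MB, assuming 1 MB = 1024 KB
limit_mb=$(( limit_kb / 1024 ))
echo "Effective limit: ${limit_mb} MB"
```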
 
 
Note that the memory limit can be useful in conducting performance studies.
If your code runs out of physical memory and begins to use swap space, the
performance will be severely degraded. For a performance study, this
may be considered an invalid result and you may want to try a smaller problem,
use more nodes, etc. One way to protect against this is to
reserve entire nodes (as discussed elsewhere on this page) and set the 
memory limit to less than 23 GB per node. This is about the maximum you can
use before swapping starts to occur. The batch system will then kill your job
if its memory use approaches the point where swapping would begin.
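For a performance study, the two settings described above could be combined as in this sketch. The value 23000 MB is an arbitrary illustrative choice below 23 GB, not a recommended setting.

```shell
# Reserve whole nodes so no other jobs compete for memory
#SBATCH --exclusive
# Per-node memory limit in MB (illustrative value below 23 GB)
#SBATCH --mem=23000
```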
5/13/2010: 
Note that when using MVAPICH2, if your job has exclusive access to 
its assigned nodes (by virtue of the queue you've used - for example
the parallel queue, or by the "--exclusive" flag), it will have access 
to the maximum available memory. This is not the case
with OpenMPI. We hope to obtain a version of SLURM that will support this
feature consistently. To avoid confusion in the meantime, we recommend 
using the "--mem" and "--mem-per-cpu" options as the preferred method
of controlling the memory limit.
Requesting an arbitrary number of tasks
So far on this page we've requested some number of nodes, and some number of tasks
per node. But what if our application requires a number of tasks like 11, which
cannot be split evenly among a set of nodes? That is, unless we use one process
per node, which isn't a very efficient use of those nodes.
We can split our 11 processes among as few as two nodes, using the following
script. Notice that we don't specify anything else, like how many nodes to use.
The scheduler will figure this out for us, and most likely use the minimum number
of nodes (two) to accommodate our tasks.
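A sketch of such a script is shown here, assuming the develop queue and the same file names as the other hello_parallel examples. Only the total number of tasks is requested.

```shell
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
# Request 11 tasks in total; the node count is left to the scheduler
#SBATCH --ntasks=11
srun ./hello_parallel
```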
Running this yields the following output in slurm.out
[araim1@tara-fe1 hello_parallel]$ cat slurm.out
Hello world from process 009 out of 011, processor name n2
Hello world from process 008 out of 011, processor name n2
Hello world from process 000 out of 011, processor name n1
Hello world from process 001 out of 011, processor name n1
Hello world from process 002 out of 011, processor name n1
Hello world from process 003 out of 011, processor name n1
Hello world from process 004 out of 011, processor name n1
Hello world from process 005 out of 011, processor name n1
Hello world from process 006 out of 011, processor name n2
Hello world from process 010 out of 011, processor name n2
Hello world from process 007 out of 011, processor name n2
[araim1@tara-fe1 hello_parallel]$
 
 
Now suppose we want to limit the number of tasks per node to 2. This can be
accomplished with the following batch script. 
#!/bin/bash
#SBATCH --job-name=hello_parallel
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=parallel
#SBATCH --ntasks=11
#SBATCH --ntasks-per-node=2
srun ./hello_parallel
 
Download: 
../code-2010/hello_parallel/mvapich2-n11-npn2.slurm
 
Notice that we needed to move out of the develop queue to demonstrate this
scenario. Now we've specified --ntasks-per-node=2 at the top of the script,
in addition to --ntasks=11.
[araim1@tara-fe1 hello_parallel]$ sort slurm.out
Hello world from process 000 out of 011, processor name n1
Hello world from process 001 out of 011, processor name n1
Hello world from process 002 out of 011, processor name n2
Hello world from process 003 out of 011, processor name n2
Hello world from process 004 out of 011, processor name n3
Hello world from process 005 out of 011, processor name n3
Hello world from process 006 out of 011, processor name n4
Hello world from process 007 out of 011, processor name n4
Hello world from process 008 out of 011, processor name n5
Hello world from process 009 out of 011, processor name n5
Hello world from process 010 out of 011, processor name n6
[araim1@tara-fe1 hello_parallel]$
 
 
where we've sorted the output to make it easier to read.
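The node count in this run follows from a ceiling division of the task count by the number of tasks placed per node. The sketch below reproduces the six nodes seen above for 11 tasks at 2 tasks per node; the variable names are illustrative only.

```shell
ntasks=11
tasks_per_node=2
# Ceiling division: smallest node count that can hold all tasks
nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
echo "$ntasks tasks at $tasks_per_node per node need $nodes nodes"
```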
It's also possible to use the "--ntasks" and "--nodes" options together,
to specify the number of tasks and nodes, but leave the number of tasks
per node up to the scheduler. See "man sbatch" for more information about
these options.
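For example, the following lines (with an illustrative node count of 3) would ask for 11 tasks spread over 3 nodes, leaving the per-node placement to the scheduler.

```shell
# Total number of tasks
#SBATCH --ntasks=11
# Number of nodes (illustrative value)
#SBATCH --nodes=3
```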
Setting a 'begin' time
You can tell the scheduler to wait until a specified time before attempting 
to run your job. This is useful, for example, if your job requires many nodes.
Being a conscientious user, you may want your job to wait until late at night
to run. By adding the following to your batch script, you can have the scheduler
wait until 1:30am on 2010-01-20, for example.
#SBATCH --begin=2010-01-20T01:30:00
 
 
You can also specify a relative time.
#SBATCH --begin=now+1hour
 
 
See "man sbatch" for more information.
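The timestamp can also be computed rather than typed by hand. This sketch assumes GNU date is available and uses a hypothetical script name, run.slurm. It builds a "--begin" value for 1:30am tomorrow and prints the corresponding sbatch command instead of running it.

```shell
# Build a timestamp for 1:30am tomorrow in the format used by --begin
# (assumes GNU date, which supports the -d option)
begin=$(date -d 'tomorrow 01:30' +%Y-%m-%dT%H:%M:%S)
# run.slurm is a hypothetical batch script name
echo "sbatch --begin=${begin} run.slurm"
```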
Dependencies
You may want a job to wait until another one starts or finishes. This can be
useful if one job's input depends on the other's output. It can also be useful
to ensure that you're not running too many jobs at once. For example, suppose
we want our job to wait until the jobs with IDs 15030 and 15031 complete. This
can be accomplished by adding the following to our batch script.
#SBATCH --dependency=afterany:15030:15031
 
 
See "man sbatch" for more information.
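When the first job is submitted from the same shell session, its ID can be taken from the message sbatch prints rather than copied by hand. The sketch below works on a sample of that message so it can be tried anywhere; first.slurm and second.slurm are hypothetical script names.

```shell
# Sample of the line sbatch prints on success; on the cluster one would use
#   submit_msg=$(sbatch first.slurm)
submit_msg="Submitted batch job 15030"
# The job ID is the last word of the message
job_id=${submit_msg##* }
dep_opt="--dependency=afterany:${job_id}"
# Print the command that would submit the dependent job
echo "sbatch ${dep_opt} second.slurm"
```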
Requeue-ability of your jobs
By default it's assumed that your job can be restarted if a node fails, or
if the cluster is about to be brought offline for maintenance. For many jobs
this is a safe assumption, but sometimes it may not be. 
For example suppose your job appends to an existing data file as it runs.
Suppose it runs partially, but then is restarted and then runs to completion.
The output will then be incorrect, and it may not be easy for you to
recognize. An easy way to avoid this situation is to make sure output files
are newly created on each run.
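One way to do this in a shell-driven job is to truncate the output file at the start of every run before appending to it. The sketch below simulates a restart by calling a stand-in function twice; run_job and results.txt are hypothetical names.

```shell
outfile=results.txt
run_job() {
    # Start from an empty file on every run, so a restart cannot
    # leave behind lines from a partial earlier run
    : > "$outfile"
    echo "result line" >> "$outfile"
}
run_job   # first (interrupted) attempt
run_job   # restarted attempt
# Count the lines that survive the simulated restart
lines=$(wc -l < "$outfile" | tr -d ' ')
echo "lines in $outfile: $lines"
```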
Another way to avoid problems is to specify the following option in your
submission script. This will prevent the scheduler from automatically 
restarting your job if any system failures occur.
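The option in question is assumed here to be sbatch's "--no-requeue" flag.

```shell
# Do not restart this job automatically after a system failure
#SBATCH --no-requeue
```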
For very long-running jobs, you might also want to consider designing them 
to save their progress occasionally. See
job scheduling issues for more information.
Using scratch storage
Temporary scratch storage is available when you run a job on the compute nodes. The storage
is local to each node. You can find the name of your scratch directory in the environment
variable "$JOB_SCRATCH_DIR" which is provided by SLURM. Here is an example of how it may be 
accessed by your batch script.
#!/bin/bash
#SBATCH --job-name=test_scratch
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
echo $JOB_SCRATCH_DIR
echo "Contents of my scratch file" > $JOB_SCRATCH_DIR/testfile
ls -l $JOB_SCRATCH_DIR
cat $JOB_SCRATCH_DIR/testfile
 
Download: 
../code-2010/test_scratch/test.slurm
 
Submitting this script should yield something like the following
[araim1@tara-fe1 test_scratch]$ cat slurm.out 
/scratch/22922
total 4
-rw-rw---- 1 araim1 pi_nagaraj 28 Jun  7 19:37 testfile
Contents of my scratch file
[araim1@tara-fe1 test_scratch]$ 
 
 
You can of course also access $JOB_SCRATCH_DIR from C, MATLAB, or any other language or
package. Remember that the files only exist for the duration of your job, so make sure
to copy anything you want to keep to a separate location, before your job exits.
Check here
for more information about scratch and other storage areas.
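A common pattern is to do the work in scratch and copy the results out as the last step of the script. The sketch below assumes SLURM's SLURM_SUBMIT_DIR variable as the destination, and falls back to a temporary directory and the current directory so it can be tried outside a job. The file name result.dat is hypothetical.

```shell
# Inside a job these are set for us; the fallbacks allow a dry run elsewhere
scratch=${JOB_SCRATCH_DIR:-$(mktemp -d)}
dest=${SLURM_SUBMIT_DIR:-$PWD}
# Do the work in node-local scratch (stand-in for a real computation)
echo "Contents of my scratch file" > "$scratch/result.dat"
# Copy what we want to keep before the job exits
cp "$scratch/result.dat" "$dest/"
ls -l "$dest/result.dat"
```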
Charging computing time to a PI
This section applies to you if you are a member of multiple research groups on tara. When
you run a job on tara, the resources you've used (e.g. computing time) are 
"charged" to your PI. This simply means that there is a record of your
group's use of the cluster. Our goal is to make sure everyone has access to
their fair share of resources, especially the PIs who have paid for nodes.
You have a "primary" account which your jobs are charged to by default. To 
see this, try checking one of your jobs as follows (suppose our job has ID 25097)
[araim1@tara-fe1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_nagaraj QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   TimeLimit=04:00:00 Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   SubmitTime=2010-06-30T00:14:24 EligibleTime=2010-06-30T00:14:24
   StartTime=2010-06-30T00:14:24 EndTime=2010-06-30T04:14:24
   SuspendTime=None SecsPreSuspend=0
...
[araim1@tara-fe1 ~]$
 
 
Notice the "Account=pi_nagaraj" field - in this example, this is our default
account. Suppose we are also working for another PI "pi_gobbert". When running
jobs for that group, it's only fair that we charge the computing resources to
that group instead. To accomplish that, we may add the "--account" option
to our batch scripts.
#SBATCH --account=pi_gobbert
 
 
Note that if you specify an invalid name for the account (a group that does not
exist, or which you do not belong to), the scheduler will silently revert to 
your default account. You can quickly check the Account field in the scontrol output
to make sure the option worked.
[araim1@tara-fe1 ~]$ scontrol show job 25097
JobId=25097 Name=fmMle_MPI
   UserId=araim1(28398) GroupId=pi_nagaraj(1057)
   Priority=4294798165 Account=pi_gobbert QOS=normal
...
[araim1@tara-fe1 ~]$
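The same check can be scripted. This sketch extracts the Account field from a sample line of the output above; on the cluster the line would come from scontrol itself, as noted in the comment.

```shell
# Sample line; on the cluster one would use
#   line=$(scontrol show job 25097 | grep 'Account=')
line="   Priority=4294798165 Account=pi_gobbert QOS=normal"
# Keep only the value of the Account field
account=$(echo "$line" | grep -o 'Account=[^ ]*' | cut -d= -f2)
echo "Job is charged to: $account"
```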
 
 
Interactive jobs
Normally interactive programs should be run on the front end node (i.e. not using
the scheduler). If you need to run them on the compute nodes for some reason, 
contact our
HPCF Point of Contact
as requested in the usage policies.
Jobs stuck in the "PD" state
Your job may become stuck in the "PD" state either if there are not enough 
nodes available to run your job, or the cluster's scheduler has decided 
to run other jobs before yours. 
A job cannot be run until there are enough free processor cores/nodes to
meet its requirement. To illustrate, if somebody submits a job that 
uses all of the cluster nodes for twelve hours, nobody else can run 
any jobs until that large job finishes. If you are trying to run a 
sixteen node job, and there is a set of jobs running which leaves
fewer than 16 nodes available, then your job must wait.
When there are a sufficient number of processors/nodes available,
the scheduler must decide which job to run next. The decision
is based on several factors:
The number of nodes your job uses. A job that takes up the entire 
cluster will not run very soon. Use the options mentioned earlier to set 
the number of nodes your job uses.
The maximum length of time that your job claims it will take to run.
As mentioned earlier, give a walltime estimate to give the scheduler an 
idea of how long this will be. Smaller jobs may be allowed to run
ahead of larger ones. If you do not give an estimate, the scheduler
will assume a default, which is based on the queue you've submitted to.
The job priority. This depends on when you submitted your job
(generally first-in-first-out (FIFO) is used) and which queue you use. If you 
use the perform queue, your job will probably run before jobs in 
the serial queue. 
It is also possible that someone else's job has gotten stuck, or that there 
is another problem on the cluster. If you suspect that may be the case, run 
squeue. If there are many jobs whose state ("ST" column) is "R" or "PD" then 
there are probably no problems on the cluster - there are just a lot 
of jobs taking up nodes. If a job has been in the "R" state for most 
of the day, or if you see jobs that are in states other than "PD" or "R" 
for more than a few seconds, then something is wrong. If this is the case,
or if you notice any other strange behavior,
contact us.
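To get a quick overview of job states, the squeue output can be tallied. The sketch below uses sample data in place of the output of "squeue -h -o %t" (no header, state column only), so the counts it prints are illustrative.

```shell
# Sample state codes; on the cluster one would use
#   states=$(squeue -h -o %t)
states="R
R
PD
R
PD"
# Count how many jobs are in each state and join the counts on one line
summary=$(printf '%s\n' "$states" | sort | uniq -c | awk '{print $2 "=" $1}' | paste -sd' ' -)
echo "$summary"
```

A tally dominated by "R" and "PD" suggests a busy but healthy cluster, while other state codes that persist are worth reporting.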