UMBC High Performance Computing Facility
How to run OpenMP programs on tara

Introduction

On this page we'll see how to run OpenMP programs on the cluster. Before proceeding, make sure you've read the How To Run tutorial.

OpenMP is a parallel programming model for shared memory systems. In this model, the user creates worker threads which are coordinated by a master thread. The user marks sections of code as parallel using special compiler directives. The nodes on tara do not share memory, so OpenMP by itself cannot be used for jobs that need multiple cluster nodes, but it can be used to utilize multiple cores on a single node. For this reason, we recommend MPI as the more general programming model. (For multi-node jobs, hybrid programs using both MPI and OpenMP are also possible, but we won't get into that at this time.)

OpenMP is available for several programming languages, including C and FORTRAN.

Hello World example C

Let's start with a simple Hello World program written in C (taken from an example at Purdue):
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
   int nthreads, thread_id;

   #pragma omp parallel private(nthreads, thread_id)
   {
      thread_id = omp_get_thread_num();
      printf("Thread %d says: Hello World\n", thread_id);

      if (thread_id == 0)
      {
         nthreads = omp_get_num_threads();
         printf("Thread %d reports: the number of threads are %d\n", 
            thread_id, nthreads);
      }
   }
   return 0;
}


Download: ../code-2010/hello_openmp_c/hello_openmp.c
Here is the batch script we will use to launch it:
#!/bin/bash
#SBATCH --job-name=OMP_hello
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

export OMP_NUM_THREADS=8
./hello_openmp

Download: ../code-2010/hello_openmp_c/run.slurm
Notice that the environment variable OMP_NUM_THREADS is set to 8; this controls how many OpenMP threads the job will use. Setting it to a higher number will generally not improve performance, since there are only 8 cores on each node. If you don't require 8 threads, decrease "--ntasks-per-node" accordingly, so that OMP_NUM_THREADS and --ntasks-per-node match.

Another important point: if we change "--nodes" to 2, the job will be duplicated on two nodes, not parallelized across them as we would probably want. We therefore recommend leaving --nodes=1.

Now we will compile and launch the job:

[araim1@tara-fe1 hello_openmp_c]$ ls
hello_openmp.c   run.slurm
[araim1@tara-fe1 hello_openmp_c]$ gcc -fopenmp hello_openmp.c -o hello_openmp -lm
[araim1@tara-fe1 hello_openmp_c]$ sbatch run.slurm 
Submitted batch job 37532
[araim1@tara-fe1 hello_openmp_c]$ cat slurm.out 
Thread 1 says: Hello World
Thread 5 says: Hello World
Thread 6 says: Hello World
Thread 2 says: Hello World
Thread 7 says: Hello World
Thread 0 says: Hello World
Thread 3 says: Hello World
Thread 0 reports: the number of threads are 8
Thread 4 says: Hello World
[araim1@tara-fe1 hello_openmp_c]$ 

Hello World example FORTRAN

Now let's see a similar program in FORTRAN. Begin by downloading the hello world FORTRAN example, hello_open_mp.f90. Then grab the following batch script, which is the same as the one for the C code above except for the name of the executable:
#!/bin/bash
#SBATCH --job-name=OMP_hello
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

export OMP_NUM_THREADS=8
./hello_open_mp

Download: ../code-2010/hello_openmp_f90/run.slurm
Now we can compile and run the code in the same way as in the C example, using gfortran in place of gcc:
[araim1@tara-fe1 hello_openmp_f90]$ gfortran -fopenmp hello_open_mp.f90 -o hello_open_mp
[araim1@tara-fe1 hello_openmp_f90]$ sbatch run.slurm 
Submitted batch job 37537
[araim1@tara-fe1 hello_openmp_f90]$ cat slurm.out 
 
HELLO_OPEN_MP
  FORTRAN90/OpenMP version
  The number of processors available =        8
  The number of threads available    =        8
 
  OUTSIDE the parallel region.
 
  HELLO from process        0
 
  Going INSIDE the parallel region:
 
  HELLO from process        0
  HELLO from process        4
  HELLO from process        5
  HELLO from process        3
  HELLO from process        6
  HELLO from process        2
  HELLO from process        7
  HELLO from process        1
 
  Back OUTSIDE the parallel region.
 
HELLO_OPEN_MP
  Normal end of execution.
 
  Elapsed wall clock time =   0.131280E-01
[araim1@tara-fe1 hello_openmp_f90]$