UMBC High Performance Computing Facility
How to run programs on the Intel Phi on maya
Introduction
This webpage discusses how to run programs on the Intel Xeon Phi coprocessor.
An Intel Phi packages 60 cores into a single coprocessor; each core supports four
hardware threads, giving 240 logical cores per card. The cores are connected
to each other and to the coprocessor's onboard memory through a bidirectional ring bus.
Each core is x86 compatible and is capable of running its own instruction
stream. This x86 compatibility allows the programmer to use familiar frameworks
such as MPI and OpenMP when developing code for the Phi. The Phi 5110P, the
model installed in maya, has 8 GB of onboard memory.
There are three main modes of running programs on the Intel Phi:
Native Mode, where the program is run directly on the Phi. In this
mode it is possible to run serial jobs as well as parallel jobs using MPI and/or
OpenMP.
Offload Mode, where the program is run on the CPU and segments of
the code are moved (or offloaded) to the Phi.
Symmetric Mode, where the program is run both on the CPU and directly on
the Phi concurrently.
Note (12/02/14): At this time symmetric mode is not working.
To follow along, ensure you are logged into maya-usr2 and have the Intel
MPI and MIC modules loaded.
This webpage is based on
slides provided by
Colfax International during a
Developer Boot Camp on July 15, 2014 sponsored by Intel.
For more information about programming on the Phi, see the
Intel Developer Zone website.
Native Mode on Phi
When compiling in native mode you must load the module intel-mpi/mic
and unload the modules intel/compiler/64 and intel-mpi/64, so that you have
the following modules loaded.
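The switch can be made with module commands along these lines (a sketch only; the exact module names and default versions available on maya may differ from those shown, so check module avail first):
[khsa1@maya-usr2 hello_phi]$ module unload intel/compiler/64 intel-mpi/64
[khsa1@maya-usr2 hello_phi]$ module load intel-mpi/mic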
[khsa1@maya-usr2 hello_phi]$ module list
Currently Loaded Modulefiles:
1) dot 8) intel/mic/runtime/3.3
2) matlab/r2014a 9) default-environment
3) comsol/4.4 10) intel-mpi/mic/4.1.3/049
4) gcc/4.8.2
5) slurm/14.03.6
6) texlive/2014
7) intel/mic/sdk/3.3
Example: Serial Code
Let's try to compile this simple Hello World program:
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
  printf("Hello world! I have %ld logical cores.\n",
         sysconf(_SC_NPROCESSORS_ONLN));
}
Download:
../code/hello_phi/hello_native.c
We compile with the Intel compiler, using the -mmic flag so that the compiler knows
that this code will be run natively on the Phi.
[khsa1@maya-usr2 hello_phi]$ icc hello_native.c -o hello_native -mmic
We submit the job using the following slurm script so that it will run on a
Phi.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=mic
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=miccard
mpiexec.hydra ./hello_native
Download:
../code/hello_phi/run-native.slurm
When the job runs, it is allocated a Phi and executed there.
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello world! I have 240 logical cores.
Since this is a serial job that uses only one core of the sixty available
on the Phi, it is possible to run several of these jobs on one Phi at the
same time.
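For example, one way to do this is to launch several independent copies of the serial executable within a single job by changing the last line of run-native.slurm (a sketch only; the process count of 4 is arbitrary, and each copy runs as a separate serial process on the Phi):
mpiexec.hydra -n 4 ./hello_native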
Example: Code with OpenMP
Now we will look at how to run a program with OpenMP natively on the Phi.
We will compile the following Hello World program:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(int argc, char *argv[]) {
  #pragma omp parallel
  {
    printf("Hello from thread %03d of %03d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
}
Download:
../code/hello_phi/hello_openmp.c
We again compile with the Intel compiler, using the -mmic flag so that the compiler
knows that this code will be run natively on the Phi. We also use the -openmp flag
so that the compiler knows that this code uses OpenMP.
[khsa1@maya-usr2 hello_phi]$ icc -mmic -openmp -o hello_openmp hello_openmp.c
After compilation, we run the following slurm script to run the job on a Phi node.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=mic
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=miccard
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cm/shared/apps/intel/composer_xe/current/compiler/lib/mic
export MIC_OMP_NUM_THREADS=8
mpiexec.hydra ./hello_openmp
Download:
../code/hello_phi/run-openmp.slurm
The program will put the output in the file slurm.out:
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello from thread 000 of 008
Hello from thread 001 of 008
Hello from thread 002 of 008
Hello from thread 003 of 008
Hello from thread 004 of 008
Hello from thread 005 of 008
Hello from thread 006 of 008
Hello from thread 007 of 008
Example: Code with MPI
Now we will look at how to run a program which uses MPI natively on the Phi.
We will compile the following Hello World program:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int id, np;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int processor_name_len;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Get_processor_name(processor_name, &processor_name_len);
  printf("Hello world from process %03d out of %03d, processor name %s\n",
         id, np, processor_name);
  MPI_Finalize();
  return 0;
}
Download:
../code/hello_phi/hello_mpi.c
We compile with the Intel MPI compiler wrapper mpiicc, again using the -mmic flag
so that the compiler knows that this code will be run natively on the Phi.
[khsa1@maya-usr2 hello_phi]$ mpiicc -mmic -o hello_mpi hello_mpi.c
After compilation, we run the following slurm script to run the job on
a Phi node.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=mic
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=miccard
unset I_MPI_FABRICS
mpiexec.hydra -n 8 ./hello_mpi
Download:
../code/hello_phi/run-mpi.slurm
The program will put the output in the file slurm.out:
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello world from process 001 out of 008, processor name n51-mic0
Hello world from process 002 out of 008, processor name n51-mic0
Hello world from process 003 out of 008, processor name n51-mic0
Hello world from process 005 out of 008, processor name n51-mic0
Hello world from process 006 out of 008, processor name n51-mic0
Hello world from process 000 out of 008, processor name n51-mic0
Hello world from process 004 out of 008, processor name n51-mic0
Hello world from process 007 out of 008, processor name n51-mic0
Offload Mode on the Phi
When compiling in offload mode, the default-environment module can remain loaded, so the module list looks like the following.
[khsa1@maya-usr2 hello_phi]$ module list
Currently Loaded Modulefiles:
1) dot 9) intel/mic/sdk/3.3
2) matlab/r2014a 10) intel/mic/runtime/3.3
3) comsol/4.4 11) default-environment
4) gcc/4.8.2
5) slurm/14.03.6
6) intel/compiler/64/14.0/2013_sp1.3.174
7) intel-mpi/64/4.1.3/049
8) texlive/2014
Now we will run programs which use offloading to run part of
the code on the CPU and part of the code on the Phi.
Example: Offload Mode with a Single Phi
The program below will first print a message from the CPU. It will then offload to
the Phi, print a message, and return to the CPU.
The code inside of the #pragma offload block is offloaded to the Phi. Note that if
offloading fails, the code is run entirely on the CPU. We check whether the
offload was successful by testing whether __MIC__ is defined.
#include <stdio.h>
#include "offload.h"
int main(int argc, char *argv[]) {
  printf("Hello World from CPU!\n");

  #pragma offload target(mic)
  {
#ifdef __MIC__
    printf("Hello World from Phi!\n");
#else
    printf("Hello world from CPU (offload to Phi failed).\n");
#endif
    fflush(0);
  }
}
Download:
../code/hello_phi/hello_offload.c
Since this code is not run natively on the Phi we do not need to add the -mmic flag.
[khsa1@maya-usr2 hello_phi]$ icc hello_offload.c -o hello_offload
After compilation, we run the following slurm script to run the job on a Phi node.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=mic_5110p
srun ./hello_offload
Download:
../code/hello_phi/run-offload.slurm
The program will put the output in the file slurm.out:
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello World from CPU!
Hello World from Phi!
Example: Offload Mode with Multiple Phis
Now we will run a program which uses offloading to run part of the code
on the CPU and part of the code on the two Phis connected to the node.
This program uses MPI on the CPU and OpenMP on the Phi. Since by default the MPI
processes are distributed in a round-robin fashion across the two CPUs, with
process 0 on CPU 0, we can split the work between the two Phis by letting the
even ranks offload to mic0 and the odd ranks offload to mic1.
To do this, we specify the micid in the offload pragma statement.
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
  int id, np, namelen, idleft, idright, micid;
  char name[128];
  MPI_Comm comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  comm = MPI_COMM_WORLD;

  if (id % 2 == 0)
    micid = 0;
  else
    micid = 1;

  gethostname(name, sizeof name);
  printf("Hello World from rank %d on %s!\n", id, name);

  #pragma offload target(mic : micid)
  {
    gethostname(name, sizeof name);
    printf("Hello World from rank %d on %s\n", id, name);
  }

  MPI_Finalize();
}
Download:
../code/hello_phi/hello_multioffload.c
Since this code uses MPI we must compile with mpiicc, and we add the -openmp
flag because OpenMP is used on the Phis.
[khsa1@maya-usr2 hello_phi]$ mpiicc -openmp hello_multioffload.c -o hello_multioffload
After compilation, we run the following slurm script to run the job on a Phi
node.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --constraint=mic_5110p
srun ./hello_multioffload
Download:
../code/hello_phi/run-multioffload.slurm
The program will put the output in the file slurm.out:
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello World from rank 0 on n34!
Hello World from rank 1 on n34!
Hello World from rank 1 on n34-mic1
Hello World from rank 0 on n34-mic0
Example: Offload Mode with Data Transfer and Offloading Functions
In this example we demonstrate how to write and compile a more advanced
code with offloading to the Phi, showing how to transfer data onto the Phi
and how to create functions that are used only on the Phi.
The code we will look at performs an axpby operation (z = a*x + b*y) on the
Phi and transfers the resulting vector back to the CPU.
Before we look at the code, it is important
to note that scope-local scalars and known-size arrays are offloaded
automatically; dynamically allocated data, however, must be explicitly
transferred between the CPU and Phi.
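As a minimal illustration of this rule (a sketch only, separate from the main.c example below; the function and variable names here are hypothetical):
#include <stdlib.h>

void transfer_rules(int n) {
  int i;
  double a = 2.0;         /* scope-local scalar: transferred automatically */
  double stack_array[8];  /* known-size array: transferred automatically */
  double *heap_array = (double *) calloc(n, sizeof(double));

  for (i = 0; i < 8; i++)
    stack_array[i] = 1.0;

  /* Only the dynamically allocated array needs an explicit clause. */
  #pragma offload target(mic) inout(heap_array : length(n))
  {
    for (i = 0; i < n; i++)
      heap_array[i] = a * stack_array[i % 8];
  }

  free(heap_array);
}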
The file main.c demonstrates two methods
to perform this computation on the Phi. In the first method we add the
inout command to the offload pragma statement. This command should contain
all variables that are used inside of the offload region that are not already
automatically offloaded, i.e. all dynamically allocated variables. In our case
this would be the vectors x, y, and z1. The variables are then followed by a
colon and the command length(n), where n is the length of the vectors. This
command will transfer the data to the Phi, call the function axpby1 on the Phi,
and transfer the vectors x, y, and z1 along with any data that was offloaded
automatically back to the CPU.
The second method is to first transfer all data to the Phi, then call a
function which contains the offload regions inside of it that will perform the
operation, and finally transfer the resulting vectors back to the CPU.
To first transfer the data to the Phi we use the in command in the offload
pragma. Just as before, we transfer in the x, y, and z2 arrays that were
dynamically allocated and specify the length of these vectors. However, now
we add ALLOC after the colon. As we can see on the first line of main.c,
the ALLOC macro allocates these vectors on the Phi but does not free
them upon exiting the offload region. The function axpby2, which is defined
in axpby.c, is called on the CPU but the resulting vector z2 remains on the
Phi. So, we use another offload pragma with the out command to transfer
the z2 vector back to the CPU. This out command uses the FREE macro after the
colon so that the associated memory is freed on the Phi. Since we will no
longer need the Phi but the vectors x and y are still on the Phi, the nocopy
command is used for the vectors x and y and the memory associated with these
vectors is also freed. This method is useful if you have a function that
must be called several times and requires operations on both the CPU and
the Phi, since it avoids the repeated data transfers between the CPU and
the Phi that the first method would incur.
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

#include <stdio.h>
#include <stdlib.h>
#include "axpby.h"

int main(int argc, char *argv[]) {
  double *x, *y, *z1, *z2;
  double a, b;
  int i, n;

  n = 8;
  a = 2, b = 1;
  x  = (double *) calloc(n, sizeof(double));
  y  = (double *) calloc(n, sizeof(double));
  z1 = (double *) calloc(n, sizeof(double));
  z2 = (double *) calloc(n, sizeof(double));
  for (i = 0; i < n; i++) {
    x[i] = (double) i;
    y[i] = (double) i + 1;
  }

  for (i = 0; i < n; i++)
    printf("x[%d]=%f\n", i, x[i]);
  printf("\n");
  for (i = 0; i < n; i++)
    printf("y[%d]=%f\n", i, y[i]);
  printf("\n");

  /* Method 1: transfer x, y, z1 in and out around a single offloaded call. */
  #pragma offload target(mic) inout(x,y,z1 : length(n))
  {
    axpby1(z1, a, x, b, y, n);
  }

  /* Method 2: allocate x, y, z2 on the Phi and transfer them in, ... */
  #pragma offload target(mic) in(x, y, z2 : length(n) ALLOC)
  {}
  /* ... compute inside axpby2, which reuses the data already on the Phi, ... */
  axpby2(z2, a, x, b, y, n);
  /* ... then copy z2 back and free all three arrays on the Phi. */
  #pragma offload target(mic) out(z2 : length(n) FREE) \
                              nocopy(x,y : length(n) FREE)
  {}

  for (i = 0; i < n; i++)
    printf("z1[%d] = %f\n", i, z1[i]);
  printf("\n");
  for (i = 0; i < n; i++)
    printf("z2[%d] = %f\n", i, z2[i]);

  free(x);
  free(y);
  free(z1);
  free(z2);
}
Download:
../code/phi/main.c
Now we will look at the axpby.c and axpby.h files. Since the function axpby1
is called only inside an offload region, it must be marked with the specifier
__attribute__((target(mic))) so that it is compiled for the Phi. The function
axpby2 does not carry this specifier, so it cannot be called from within an
offload region. If this function needs to perform an operation on the Phi, it
must contain an offload region of its own. Since we already offloaded z, x, and
y to the Phi, we can reuse these vectors in an offload region inside the
function. We use the in command in the offload pragma, but instead of providing
the real length of these vectors we simply provide a length of 0 and use the
REUSE macro so that no data is transferred between the CPU and Phi.
#include "axpby.h"
__attribute__ ((target(mic))) void axpby1(double *z, double a, double *x, double b, double *y, int n) {
#pragma omp parallel
{
int i;
#pragma omp for
for (i = 0; i < n; i++)
z[i] = a*x[i] + b*y[i];
}
}
void axpby2(double *z, double a, double *x, double b, double *y, int n) {
#pragma offload target(mic) in(x, y, z : length(0) REUSE)
{
#pragma omp parallel
{
int i;
#pragma omp for
for (i = 0; i < n; i++)
z[i] = a*x[i] + b*y[i];
}
}
}
Download:
../code/phi/axpby.c
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include "offload.h"
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)
__attribute__ ((target(mic))) void axpby1(double *z, double a, double *x, double b, double *y, int n);
void axpby2(double *z, double a, double *x, double b, double *y, int n);
Download:
../code/phi/axpby.h
The file Makefile is used to compile and link all files.
axpby: main.o axpby.o
	icc -openmp main.o axpby.o -o axpby

main.o: main.c
	icc -openmp -c -o main.o main.c

axpby.o: axpby.c
	icc -openmp -c axpby.c -o axpby.o

clean:
	-rm -f *.o axpby
Download:
../code/phi/Makefile
We run the following slurm script to run the job on a Phi node. Note that we
add the line export OFFLOAD_REPORT=2, which causes offload details to be printed
to standard output. This is useful for debugging and for determining how much time
is spent transferring data between the CPU and the Phi. The OFFLOAD_REPORT
environment variable can be set to 0, 1, 2, or 3 for varying levels of verbosity.
#!/bin/bash
#SBATCH --job-name=axpby
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=mic_5110p
export OFFLOAD_REPORT=2
srun ./axpby
Download:
../code/phi/run.slurm
The program will result in the following output in the slurm.out file. As we
can see, both methods produce the correct result. Looking at the offload
report, we observe that the second method transfers less data back from the
Phi and spends far less time in the offload regions.
[khsa1@maya-usr2 phi]$ cat slurm.out
x[0]=0.000000
x[1]=1.000000
x[2]=2.000000
x[3]=3.000000
x[4]=4.000000
x[5]=5.000000
x[6]=6.000000
x[7]=7.000000
y[0]=1.000000
y[1]=2.000000
y[2]=3.000000
y[3]=4.000000
y[4]=5.000000
y[5]=6.000000
y[6]=7.000000
y[7]=8.000000
[Offload] [MIC 0] [File] main.c
[Offload] [MIC 0] [Line] 32
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 0.796314(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 212 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.248634(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 192 (bytes)
[Offload] [MIC 0] [File] main.c
[Offload] [MIC 0] [Line] 37
[Offload] [MIC 0] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 0.004586(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data] 192 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time] 0.000066(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data] 24 (bytes)
[Offload] [MIC 0] [File] axpby.c
[Offload] [MIC 0] [Line] 14
[Offload] [MIC 0] [Tag] Tag 2
[Offload] [HOST] [Tag 2] [CPU Time] 0.000511(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data] 44 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time] 0.000334(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data] 20 (bytes)
[Offload] [MIC 0] [File] main.c
[Offload] [MIC 0] [Line] 42
[Offload] [MIC 0] [Tag] Tag 3
[Offload] [HOST] [Tag 3] [CPU Time] 0.011541(seconds)
[Offload] [MIC 0] [Tag 3] [CPU->MIC Data] 48 (bytes)
[Offload] [MIC 0] [Tag 3] [MIC Time] 0.000060(seconds)
[Offload] [MIC 0] [Tag 3] [MIC->CPU Data] 64 (bytes)
z1[0] = 1.000000
z1[1] = 4.000000
z1[2] = 7.000000
z1[3] = 10.000000
z1[4] = 13.000000
z1[5] = 16.000000
z1[6] = 19.000000
z1[7] = 22.000000
z2[0] = 1.000000
z2[1] = 4.000000
z2[2] = 7.000000
z2[3] = 10.000000
z2[4] = 13.000000
z2[5] = 16.000000
z2[6] = 19.000000
z2[7] = 22.000000
Example: Symmetric Mode with CPU and Phi
Finally, we will explain how to code in symmetric mode. In this mode we have
one MPI world across both CPUs and Phis working concurrently. In the following
example we will have 8 MPI processes on the CPU and 8 MPI processes on the
Phi.
We will again use hello_mpi.c, but it will now be run on both the CPU and the Phi:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  int id, np;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int processor_name_len;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Get_processor_name(processor_name, &processor_name_len);
  printf("Hello world from process %03d out of %03d, processor name %s\n",
         id, np, processor_name);
  MPI_Finalize();
  return 0;
}
Download:
../code/hello_phi/hello_mpi.c
Since this code will be run on both a CPU and a Phi two executables need to be
compiled:
[khsa1@maya-usr2 hello_phi]$ mpiicc -mmic hello_mpi.c -o hello_mpi.MIC
[khsa1@maya-usr2 hello_phi]$ mpiicc hello_mpi.c -o hello_mpi.XEON
The hello_mpi.XEON executable is compiled to run on the CPU and hello_mpi.MIC
to run on the Phi. The following slurm script launches both in a single MPI run.
#!/bin/bash
#SBATCH --job-name=hello_phi
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --constraint=mic_5110p
mpirun -host mic0 -n 8 ./hello_mpi.MIC : -host localhost -n 8 ./hello_mpi.XEON
Download:
../code/hello_phi/run-symmetric.slurm
This program will result in the following output in the slurm.out file:
[khsa1@maya-usr2 hello_phi]$ cat slurm.out
Hello world from process 008 out of 016, processor name n34
Hello world from process 009 out of 016, processor name n34
Hello world from process 010 out of 016, processor name n34
Hello world from process 011 out of 016, processor name n34
Hello world from process 012 out of 016, processor name n34
Hello world from process 013 out of 016, processor name n34
Hello world from process 014 out of 016, processor name n34
Hello world from process 015 out of 016, processor name n34
Hello world from process 000 out of 016, processor name n34-mic0
Hello world from process 001 out of 016, processor name n34-mic0
Hello world from process 002 out of 016, processor name n34-mic0
Hello world from process 003 out of 016, processor name n34-mic0
Hello world from process 004 out of 016, processor name n34-mic0
Hello world from process 005 out of 016, processor name n34-mic0
Hello world from process 006 out of 016, processor name n34-mic0
Hello world from process 007 out of 016, processor name n34-mic0