How to run CUDA programs on maya
This webpage discusses how to run programs that use the GPU on maya 2013.
The NVIDIA K20 is a powerful general purpose graphics processing unit
(GPGPU) with 2496 computational cores, designed for efficient
double-precision calculation. GPU accelerated computing has become popular in
recent years because a GPU can outperform a general purpose CPU on the
computationally intensive portions of a code.
The NVIDIA K20 GPU has 5 GB of onboard memory.
Before proceeding, make sure you've read the
How To Run tutorial first.
To follow along, ensure you are logged into maya-usr1 and have the CUDA modules loaded.
[hu6@maya-usr1 ~]$ module list
Currently Loaded Modulefiles:
1) dot 7) intel-mpi/64/4.1.3/049
2) matlab/r2014a 8) texlive/2014
3) comsol/4.4 9) default-environment
4) gcc/4.8.2 10) cuda65/blas/6.5.14
5) slurm/14.03.6 11) cuda65/toolkit/6.5.14
6) intel/compiler/64/14.0/2013_sp1.3.174
Example - Hello World from GPU
In CUDA terminology, the CPU and the system's memory are referred to
as the host, and the GPU and its memory are referred to as the device.
The figure below shows how threads are grouped into blocks,
and blocks into grids.
Threads are grouped into thread blocks: one-, two-, or three-dimensional arrays of
threads that interact with each other via shared memory and synchronization points.
A program (kernel) is executed over a grid of thread blocks.
One grid is executed at a time. Like the grid, each block can be one-, two-, or
three-dimensional in form.
This layout is a legacy of the GPU's origins in graphics, where data such as
images and volumes is naturally indexed in two or three dimensions.
It gives considerable flexibility in launching kernels on differently shaped data structures.
However, there are still limitations: for example, a block can contain at most
1024 threads, regardless of its dimensions.
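As an illustration of the block and grid layout described above, here is a minimal sketch of a two-dimensional launch. It is not part of the original tutorial: the kernel name fill2D, the helper runFill2D, and the sizes WIDTH and HEIGHT are made up for this example. Each block holds 16 x 16 = 256 threads, which stays below the 1024-thread limit, and the grid is sized so that the blocks cover the whole array.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define WIDTH  48   // hypothetical array width
#define HEIGHT 32   // hypothetical array height

// Each thread handles one (x, y) cell and stores its row-major position.
__global__ void fill2D(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height)                    // skip threads outside the array
        out[y * width + x] = y * width + x;
}

// Returns the number of wrong cells (0 means success), or -1 on a CUDA error.
int runFill2D(void)
{
    static int host[WIDTH * HEIGHT];
    int *dev = NULL;
    size_t bytes = WIDTH * HEIGHT * sizeof(int);
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess)
        return -1;

    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((WIDTH  + block.x - 1) / block.x,
              (HEIGHT + block.y - 1) / block.y);     // 3 x 2 blocks
    fill2D<<<grid, block>>>(dev, WIDTH, HEIGHT);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaError_t err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    int wrong = 0;
    for (int i = 0; i < WIDTH * HEIGHT; i++)
        if (host[i] != i)
            wrong++;
    return wrong;
}

int main(void)
{
    int wrong = runFill2D();
    if (wrong < 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("%d of %d cells are wrong\n", wrong, WIDTH * HEIGHT);
    return wrong == 0 ? 0 : 1;
}
```

Requesting a block of, say, 64 x 32 threads instead would exceed the per-block limit, and the launch would fail.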
Let's start with a Hello World program that runs on the GPU.
#include <stdio.h>
#define NUM_BLOCKS 2
#define BLOCK_WIDTH 16
// kernel: executed on the GPU once by every thread
__global__ void hello()
{
    printf("Hello world! I'm thread %d in block %d\n", threadIdx.x, blockIdx.x);
}
int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();
    // force the printf()s to flush
    cudaDeviceSynchronize();
    printf("That's all!\n");
    return 0;
}
We can compile with NVIDIA's nvcc compiler. Normally, the compiler does not allow device code to call the host function printf;
however, this is possible if we compile with the flag -arch=sm_20 or higher. The GPU on maya is a K20, which supports sm_35.
[hu6@maya-usr1 test04_hello]$ nvcc -arch=sm_35 -o hello
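The value given to -arch should match the compute capability of the GPU the code runs on. As a minimal sketch, not taken from the original page, the program below uses the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties to print the compute capability, the memory size, and the per-block thread limit of each visible device. The function name printDeviceInfo is hypothetical, and the program has to run on a node that has a GPU.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Prints basic properties of every visible GPU.
// Returns the number of devices found, or -1 if the runtime reports an error.
int printDeviceInfo(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return -1;
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("  global memory:         %.1f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    }
    return count;
}

int main(void)
{
    return printDeviceInfo() < 0 ? 1 : 0;
}
```

A reported compute capability of 3.5 corresponds to the flag -arch=sm_35.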
We submit the job with the following slurm file, so that it runs on a node that has a GPU.
#!/bin/bash
#SBATCH --job-name=gpu_hl
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --exclusive
#SBATCH --gres=gpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --constraint=hpcf2013
./hello
Upon execution, the host (CPU) launches a kernel on the device (GPU) that prints Hello World together with the corresponding
block id and thread id.
[hu6@maya-usr1 test04_hello]$ cat slurm.out
Hello world! I'm thread 0 in block 0
Hello world! I'm thread 1 in block 0
Hello world! I'm thread 2 in block 0
Hello world! I'm thread 3 in block 0
Hello world! I'm thread 4 in block 0
Hello world! I'm thread 5 in block 0
Hello world! I'm thread 6 in block 0
Hello world! I'm thread 7 in block 0
Hello world! I'm thread 8 in block 0
Hello world! I'm thread 9 in block 0
Hello world! I'm thread 10 in block 0
Hello world! I'm thread 11 in block 0
Hello world! I'm thread 12 in block 0
Hello world! I'm thread 13 in block 0
Hello world! I'm thread 14 in block 0
Hello world! I'm thread 15 in block 0
Hello world! I'm thread 0 in block 1
Hello world! I'm thread 1 in block 1
Hello world! I'm thread 2 in block 1
Hello world! I'm thread 3 in block 1
Hello world! I'm thread 4 in block 1
Hello world! I'm thread 5 in block 1
Hello world! I'm thread 6 in block 1
Hello world! I'm thread 7 in block 1
Hello world! I'm thread 8 in block 1
Hello world! I'm thread 9 in block 1
Hello world! I'm thread 10 in block 1
Hello world! I'm thread 11 in block 1
Hello world! I'm thread 12 in block 1
Hello world! I'm thread 13 in block 1
Hello world! I'm thread 14 in block 1
Hello world! I'm thread 15 in block 1
That's all!
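The output above shows that threadIdx.x restarts at 0 in every block. To give each thread a unique position in a one-dimensional array, the two indices are commonly combined as blockIdx.x * blockDim.x + threadIdx.x. The sketch below is not part of the original example: the kernel name fillIndex and the helper runFillIndex are made up for illustration. It reuses the launch configuration of 2 blocks of 16 threads.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define NUM_BLOCKS 2
#define BLOCK_WIDTH 16
#define N (NUM_BLOCKS * BLOCK_WIDTH)

// Each thread writes its own global index into out[].
__global__ void fillIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard in case more threads than elements are launched
        out[i] = i;
}

// Returns the number of wrong elements (0 means success), or -1 on a CUDA error.
int runFillIndex(void)
{
    int host[N] = {0};
    int *dev = NULL;
    if (cudaMalloc((void**)&dev, N * sizeof(int)) != cudaSuccess)
        return -1;

    fillIndex<<<NUM_BLOCKS, BLOCK_WIDTH>>>(dev, N);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaError_t err = cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return -1;

    int wrong = 0;
    for (int i = 0; i < N; i++)
        if (host[i] != i)
            wrong++;
    return wrong;
}

int main(void)
{
    int wrong = runFillIndex();
    if (wrong < 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("%d of %d elements are wrong\n", wrong, N);
    return wrong == 0 ? 0 : 1;
}
```

With this indexing, thread 3 of block 1 writes element 1 * 16 + 3 = 19.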
Example - Compile Host only and Device only program
Now we'll try to run a slightly more complicated program, which consists of several files.
The program below launches a kernel and calls a function.
#include <stdio.h>
#include <stdlib.h>
#include "kernel.h"
#include "hostOnly.h"
int main( void ) {
    int c;
    int *dev_c;
    // allocate one integer in device memory
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    // launch the kernel with one block of one thread
    add<<<1,1>>>( 20, 14, dev_c );
    // copy the result from the device back to the host
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "20 + 14 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}
The header file kernel.h declares the kernel that runs on the GPU.
#ifndef _KERNEL_H_
#define _KERNEL_H_
__global__ void add( int a, int b, int *c );
#endif
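The definition of add is not reproduced on this page. The single-file sketch below shows what it could look like, assuming the kernel simply stores a + b in the device integer that c points to. It also checks the return codes of the CUDA runtime calls, which the program above ignores. The CHECK macro and the function addOnDevice are hypothetical helpers introduced for this sketch only.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Assumed kernel body: store a + b in the device integer pointed to by c.
__global__ void add( int a, int b, int *c )
{
    *c = a + b;
}

// Adds two integers on the GPU and returns the result to the host.
int addOnDevice( int a, int b )
{
    int c = 0;
    int *dev_c = NULL;
    CHECK( cudaMalloc( (void**)&dev_c, sizeof(int) ) );
    add<<<1,1>>>( a, b, dev_c );
    CHECK( cudaGetLastError() );   // reports errors from the kernel launch itself
    CHECK( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) );
    CHECK( cudaFree( dev_c ) );
    return c;
}

int main( void )
{
    printf( "20 + 14 = %d\n", addOnDevice( 20, 14 ) );
    return 0;
}
```

In the multi-file layout used here, only the kernel definition would go into its own source file, with main and the host-only code kept separate.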
The following file contains code that runs only on the CPU.
#include <stdio.h>
#include <stdlib.h>
#include "hostOnly.h"
void hostOnly()
{
    printf("Host only function goes here.\n");
}
The file Makefile is used to compile and link all CUDA files.
TEST: kernel.o main.o hostOnly.o
nvcc main.o kernel.o hostOnly.o -o TEST
nvcc -c -o hostOnly.o
nvcc -c -o main.o
nvcc -c -o kernel.o
rm TEST *.o
After successful compilation, we need the following slurm file to launch the job on a GPU node.
#!/bin/bash
#SBATCH --job-name=gpu_HK
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=batch
#SBATCH --qos=short
#SBATCH --exclusive
#SBATCH --gres=gpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --constraint=hpcf2013
./TEST
The program writes its output to the file slurm.out.
For more information about programming CUDA, see the
NVIDIA Programming Guide website.