General Purpose GPU Programming

During the past 20 years there has been a steadily increasing demand for high-end visual applications. Video games, movies, and medical imaging, for example, all demand real-time graphics processing. As a result, graphics processing units (GPUs) must perform a high volume of arithmetic operations quickly and efficiently. GPGPU stands for General-Purpose computation on Graphics Processing Units: GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications. Advances in GPU chip architectures have opened the gateway to high-definition video rendering and real-time MRI imaging, as well as bringing digital characters to life on screen.

In the summer of 2011, while working as a research assistant for Dr. Kaori Tanaka, I began benchmarking the performance of GPUs for scientific calculations. The algorithms were developed using NVIDIA's CUDA parallel computing platform, and the calculations were performed on WestGrid's Checkers cluster.

The algorithm I tested is an implementation of the Riccati equations; it requires numerically integrating ordinary differential equations and iterating the results to convergence. The Numerical Recipes routine odeint is used to solve the differential equations, and a typical calculation may call this routine on the order of 100 million times. Figure 1 illustrates that the GPU implementation of this algorithm is roughly 10x faster than the same calculation on a conventional CPU.
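The speedup comes from the fact that the many odeint calls are independent of one another, so each can be assigned to its own GPU thread. The kernel below is a minimal sketch of that parallelization pattern, not the actual research code: the right-hand side deriv, the kernel name integrate_rk4, and the use of a fixed-step RK4 in place of the adaptive odeint stepper are all illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical right-hand side dy/dt = f(t, y); a simple stand-in for
// the actual Riccati equations used in the research code.
__device__ double deriv(double t, double y) {
    return -t * y;
}

// One thread integrates one independent ODE instance. A fixed-step RK4
// is used here for brevity; the real calculation used the adaptive
// Numerical Recipes routine odeint.
__global__ void integrate_rk4(const double* y0, double* yout, int n,
                              double t0, double t1, int nsteps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double h = (t1 - t0) / nsteps;
    double t = t0;
    double y = y0[i];
    for (int s = 0; s < nsteps; ++s) {
        double k1 = deriv(t,           y);
        double k2 = deriv(t + 0.5 * h, y + 0.5 * h * k1);
        double k3 = deriv(t + 0.5 * h, y + 0.5 * h * k2);
        double k4 = deriv(t + h,       y + h * k3);
        y += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
        t += h;
    }
    yout[i] = y;
}

int main() {
    const int n = 1 << 20;  // one million independent integrations
    double *d_y0, *d_yout;
    cudaMalloc(&d_y0, n * sizeof(double));
    cudaMalloc(&d_yout, n * sizeof(double));
    cudaMemset(d_y0, 0, n * sizeof(double));  // zero initial conditions for the sketch
    // ... copy real initial conditions into d_y0 ...

    int block = 256;
    int grid = (n + block - 1) / block;
    integrate_rk4<<<grid, block>>>(d_y0, d_yout, n, 0.0, 1.0, 1000);
    cudaDeviceSynchronize();

    cudaFree(d_y0);
    cudaFree(d_yout);
    return 0;
}
```

Because every thread runs the same integration on different data, the problem maps cleanly onto the GPU's SIMT execution model, which is where the roughly 10x speedup over a single CPU comes from.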

DSYEV is a LAPACK routine that computes the eigenvalues, and optionally the eigenvectors, of a double-precision symmetric matrix. I also investigated how the GPU implementation of LAPACK routines scales relative to conventional CPU calculations. Figure 2 shows how the GPU routine (CULA DSYEV) performs relative to the corresponding serial and parallel versions: the CULA package runs roughly 5x faster than the serial version and is competitive with the 6-8 processor MPI versions. This is significant for a few reasons:

  1. CULA is very easy to use. The only difference between the serial and GPU versions of my code is about four lines (see the first sketch after this list), so if you already have software built on the serial LAPACK library, switching to the GPU version is straightforward.
  2. The CULA library can be called from C or from Fortran, so old-school programmers do not need to learn a new language.
  3. The alternative is ScaLAPACK, which can be hard to implement depending on the application: one has to initialize process grids, distribute matrices across processes, and gather the results afterwards (see the second sketch below). Because of the added communication time, adding more processors does not always improve performance unless the matrix is sufficiently large.
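To illustrate the first point, here is a hedged sketch of what that handful of changed lines looks like. The dsyev_ prototype is the standard Fortran LAPACK symbol; the culaInitialize/culaDsyev/culaShutdown calls follow the CULA Dense interface as I used it, but treat the exact signatures as assumptions if you are working from a different CULA release.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef USE_CULA
#include <cula_lapack.h>   /* GPU path: link against CULA Dense */
#endif

/* Standard Fortran LAPACK symbol for the serial CPU path. */
extern void dsyev_(const char* jobz, const char* uplo, const int* n,
                   double* a, const int* lda, double* w,
                   double* work, const int* lwork, int* info);

int main(void) {
    const int n = 1000;
    double* a = calloc((size_t)n * n, sizeof(double));
    double* w = calloc((size_t)n, sizeof(double));
    /* ... fill `a` with a symmetric matrix in column-major order ... */

#ifdef USE_CULA
    /* GPU version: the only lines that differ from the serial code. */
    culaInitialize();
    culaDsyev('V', 'U', n, a, n, w);   /* no workspace arrays needed */
    culaShutdown();
#else
    /* Serial LAPACK version, with the usual workspace query. */
    int lwork = -1, info = 0;
    double wkopt;
    dsyev_("V", "U", &n, a, &n, w, &wkopt, &lwork, &info);
    lwork = (int)wkopt;
    double* work = malloc((size_t)lwork * sizeof(double));
    dsyev_("V", "U", &n, a, &n, w, work, &lwork, &info);
    free(work);
#endif

    free(a);
    free(w);
    return 0;
}
```

The preprocessor guard makes the point concrete: flipping USE_CULA on or off toggles between the CPU and GPU eigensolvers without touching the rest of the program.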

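For contrast, the third point is easiest to appreciate by looking at the setup ScaLAPACK requires before a single eigenvalue is computed. The sketch below shows the standard BLACS grid initialization and descriptor setup ahead of a pdsyev_ call; the block size nb and the 2x3 grid shape are illustrative choices, exact prototypes can vary slightly between distributions, and the matrix distribution itself is elided.

```c
#include <mpi.h>
#include <stdlib.h>

/* Standard BLACS / ScaLAPACK symbols. */
extern void Cblacs_pinfo(int* mypnum, int* nprocs);
extern void Cblacs_get(int context, int request, int* value);
extern void Cblacs_gridinit(int* context, char* order, int nprow, int npcol);
extern void Cblacs_gridinfo(int context, int* nprow, int* npcol,
                            int* myrow, int* mycol);
extern int  numroc_(const int* n, const int* nb, const int* iproc,
                    const int* isrcproc, const int* nprocs);
extern void descinit_(int* desc, const int* m, const int* n,
                      const int* mb, const int* nb,
                      const int* irsrc, const int* icsrc,
                      const int* ictxt, const int* lld, int* info);
extern void pdsyev_(const char* jobz, const char* uplo, const int* n,
                    double* a, const int* ia, const int* ja, const int* desca,
                    double* w,
                    double* z, const int* iz, const int* jz, const int* descz,
                    double* work, const int* lwork, int* info);

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    /* 1. Initialize the BLACS process grid (here 2x3 = 6 processes). */
    int myrank, nprocs, context;
    Cblacs_pinfo(&myrank, &nprocs);
    Cblacs_get(-1, 0, &context);
    int nprow = 2, npcol = 3;
    Cblacs_gridinit(&context, "Row", nprow, npcol);
    int myrow, mycol;
    Cblacs_gridinfo(context, &nprow, &npcol, &myrow, &mycol);

    /* 2. Build array descriptors for the block-cyclic distribution. */
    const int n = 1000, nb = 64, izero = 0, ione = 1;
    int locrows = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int loccols = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld = locrows > 1 ? locrows : 1;
    int desca[9], descz[9], info;
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &context, &lld, &info);
    descinit_(descz, &n, &n, &nb, &nb, &izero, &izero, &context, &lld, &info);

    /* 3. Allocate and fill the *local* pieces of the global matrix. */
    double* a = calloc((size_t)locrows * loccols, sizeof(double));
    double* z = calloc((size_t)locrows * loccols, sizeof(double));
    double* w = calloc((size_t)n, sizeof(double));
    /* ... distribute the global matrix into each process's local block ... */

    /* 4. Workspace query, then the actual parallel eigensolve. */
    int lwork = -1;
    double wkopt;
    pdsyev_("V", "U", &n, a, &ione, &ione, desca, w,
            z, &ione, &ione, descz, &wkopt, &lwork, &info);
    lwork = (int)wkopt;
    double* work = malloc((size_t)lwork * sizeof(double));
    pdsyev_("V", "U", &n, a, &ione, &ione, desca, w,
            z, &ione, &ione, descz, work, &lwork, &info);

    MPI_Finalize();
    return 0;
}
```

Compared with the handful of changed lines in the CULA version above, every step here (grid setup, descriptors, local allocation, distribution) is extra code and extra communication, which is why adding processes only pays off for sufficiently large matrices.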
Figure 1: GPU vs. CPU performance of the Riccati algorithm. The GPU implementation shows a significant performance enhancement over the CPU version, with a speedup of roughly 10x.

Figure 2: Scaling properties of the GPU implementation of LAPACK routines (CULA) vs. the LAPACK and ScaLAPACK packages. The CULA package performs roughly 5x faster than the corresponding serial version and is competitive with the 6-8 processor MPI versions.