Hierarchical checkerboard decomposition of Ising lattice

High performance computing on GPUs

Over the last few years it has been realized that the vast computational power of graphics processing units (GPUs) can be harnessed for purposes other than video games. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of GPU architectures compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general-purpose computing, the problem at hand must be arranged to profit from the inherent parallelism and the hierarchical structure of memory accesses. Investigating the possible advantages of GPU- over CPU-based systems for Monte Carlo simulations of classical spin models of statistical mechanics, we find that only an explicit tailoring of the algorithms to the specific architecture makes it possible to harvest the computational power of GPU systems. A number of examples have been investigated, ranging from Metropolis simulations of ferromagnetic Ising models, through continuous Heisenberg and disordered spin-glass systems, to parallel-tempering simulations. Significant speed-ups by factors of up to 1000 compared to serial CPU code, as well as to previous GPU implementations, are observed.

GPU Lectures at IMPRS school

The International Max Planck Graduate School for "Dynamical Processes in Atoms, Molecules and Solids" organised a winter school on GPU computing in October and November 2012 in Wroclaw. My own lectures on GPU architecture and computer simulations on graphics processing units can be found below:

Simulation Code

We have recently developed a simulation code within the NVIDIA CUDA framework for a number of spin models, with algorithms ranging from single-spin-flip Metropolis through parallel tempering to cluster algorithms. The results are discussed in the following publications:

  • M. Weigel, Simulating spin models on GPU, Comput. Phys. Commun. 182, 1833 (2011). [PDF]
  • M. Weigel, Performance potential for simulating spin models on GPU, J. Comput. Phys. 231, 3064 (2012). [PDF]

The code for single-spin-flip Metropolis simulations of the ferromagnetic Ising model on a two-dimensional lattice with periodic boundary conditions is available here. This software is distributed under the GNU General Public License, version 2 (GPLv2).

Version 1.2: download here

  • optimized tile load code for better coalescence (significant improvement without multi-hit updates)
  • re-write of collaborative reduction code for energy calculation

Version 1.1: download here

  • optimized CPU code for cache alignment
  • optimized GPU code to minimize branch divergence
  • use texture memory for Boltzmann weights
  • do not waste shared memory for random number generators

Version 1.0: download here

Compilation and linking instructions: using the CUDA 2.3 toolkit and gcc-4.3.2, I achieved the best results compiling the code with the following flags (replace lib64 with lib32 on a 32-bit system):

nvcc -arch sm_13 --compiler-options -fno-strict-aliasing,-O3,-march=native,-msse4,-mfpmath=sse,-funroll-loops,-finline-functions -I. -I<cuda_install_dir>/include -I<cuda_sdk_dir>/C/common/inc/ -DUNIX -O3 -c ising.cu
g++ -o ./ising ising.o -L<cuda_install_dir>/lib64 -lcudart -L<cuda_sdk_dir>/C/lib/ -lcutil

For comparison with your setup, here are the benchmark results for the parameters chosen in the source files (Tesla C1060 GPU, Intel Q9650 CPU @ 3.0 GHz):

  • v1.2: 0.078 ns (GPU) vs. 7.811 ns (CPU) per spin flip
  • v1.1: 0.079 ns (GPU) vs. 7.811 ns (CPU) per spin flip
  • v1.0: 0.109 ns (GPU) vs. 12.005 ns (CPU) per spin flip