What if you would like to leverage your GPU to massively parallelize computations, but you can't do that trivially using a machine learning framework like TensorFlow?

You can do that very easily using Numba and CUDA Python! This post is a basic practical guide on how to use them for GPU-accelerated computing.

I originally followed this NVIDIA guide to set up CUDA Python for GPU-accelerated computing. I had just acquired a CUDA-ready graphics card and wanted to leverage it to render my complex-number fractals more quickly, because I like exploring them. NVIDIA’s tutorial actually shows you how to use CUDA Python to render the Mandelbrot set specifically. I recommend that you read it and their documentation if you would like more details.

Steps to leverage your CUDA-ready GPU for accelerated computing with Python:

1. Install the CUDA toolkit

https://developer.nvidia.com/cuda-downloads

The version of CUDA that I used, and that this post is based on, is 11.0.

2. Install numba.

If you use pip for package management:

pip install numba

The version of numba that I used, and that this post is based on, is 0.48.0.
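
If you want to check which version of Numba you have installed, you can do so from Python:

import numba
print(numba.__version__)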

3. Import CUDA in your Python code file

from numba import cuda
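
At this point you can also run a quick sanity check to confirm that Numba can actually see your GPU. This is optional; a minimal check looks like this:

from numba import cuda

# prints the CUDA devices that Numba detects
cuda.detect()

# True if a CUDA-capable GPU is available
print(cuda.is_available())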

4. Create your CUDA kernel function

The CUDA kernel function is executed in parallel by many threads on your GPU; each thread runs the function once.

@cuda.jit
def your_cuda_kernel_function(your_func_params, array_to_be_filled):  
    # some useful things you might want access to:
    
    # position of this thread within the grid
    thread_pos_x, thread_pos_y = cuda.grid(2)
    
    # total number of threads in each grid dimension
    total_threads_x = cuda.gridDim.x * cuda.blockDim.x
    total_threads_y = cuda.gridDim.y * cuda.blockDim.y
    
    # below, add the code that you want to parallelize

This function:

  • Can only use the subset of Python and NumPy features that Numba supports inside CUDA kernels
  • Cannot return a value. Instead, you must pass in the values (in array form) to be filled by the function
  • Must be decorated with @cuda.jit
  • You can use cuda.grid(), cuda.gridDim, cuda.blockDim and cuda.threadIdx to get the thread information needed to determine which elements of the passed-in array(s) the function should fill (see the example sketch after this list)
    • Note that cuda.grid(2) is the same as (cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x, cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y)
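
As an example, here is a minimal sketch of a kernel written along these lines. It does not compute a fractal; it just fills each pixel of an RGB image with a simple gradient derived from the thread's position (the function name and the gradient are purely illustrative):

from numba import cuda

@cuda.jit
def fill_gradient_kernel(image):
    # coordinates of this thread within the whole grid
    x, y = cuda.grid(2)

    # guard against threads that fall outside the image bounds
    if x < image.shape[1] and y < image.shape[0]:
        # fill this thread's pixel based on its position
        image[y, x, 0] = x % 256
        image[y, x, 1] = y % 256
        image[y, x, 2] = (x + y) % 256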

5. Call your CUDA kernel function

import numpy as np
from numba import cuda

image_to_be_filled = np.zeros((res_y, res_x, 3), dtype=np.uint8)  # RGB image

# num threads per block
block_dim = (32, 32)

# calculate num blocks per grid, so that each image pixel has its own thread
# assumes that res_x and res_y are divisible by block_dim
# (see the sketch after the notes below for the non-divisible case)
grid_dim = (res_x // block_dim[0], res_y // block_dim[1])

# copy the image to the GPU
device_image = cuda.to_device(image_to_be_filled)

# launch the kernel on the device array (not on the host array)
your_cuda_kernel_function[grid_dim, block_dim](your_func_params, device_image)

# copy the result back into the host array
device_image.copy_to_host(image_to_be_filled)

# now "image_to_be_filled" has been filled by your kernel function

  • The kernel function must be passed in values to be filled; it cannot return a value, as mentioned earlier.
  • For each array x that you want the kernel function to fill, send it to the GPU with device_x = cuda.to_device(x), pass device_x (not x) to the kernel, and afterwards copy the result back to the host with device_x.copy_to_host(x).
  • Your kernel function will run on your GPU on a grid of blocks of threads. You must specify those grid and block dimensions (in square brackets) when calling your kernel function.
    • Block dimension: the number of threads per block. When generating 2D fractals, I tried block dimensions of (16, 1), (32, 1) and (32, 32) (maximum size, the product of the dimensions, is 1024). (16, 1) performed noticeably worse than the others while (32, 1) and (32, 32) performed about the same.
    • Grid dimension: the number of blocks in each dimension. When generating 2D fractals of dimensions H x W, I’ve tried various values for grid dimension so that the total number of threads is H/8 x W/8, H/4 x W/4 and H x W. Total thread count of H/8 x W/8 performed noticeably worse, while the others performed about the same.
    • Note that for our image, the total number of threads in the x and y dimensions are block_dim[0] * grid_dim[0] and block_dim[1] * grid_dim[1], respectively.
    • Note: I haven’t yet done sufficient reading about blocks and grids to determine optimal block and grid dimensions in general. My values were just chosen based on some empirical results.
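
About the divisibility assumption in the snippet above: if res_x and res_y are not exact multiples of the block dimensions, a common approach is to round the number of blocks up and add a bounds check inside the kernel (like the one in the example kernel sketch in step 4), so that the extra threads simply do nothing. A minimal sketch, with made-up example values:

import math

res_x, res_y = 1000, 750  # example resolution, not a multiple of 32
block_dim = (32, 32)

# round up so that every pixel is covered by at least one thread;
# threads that land outside the image are filtered out by the
# bounds check inside the kernel
grid_dim = (math.ceil(res_x / block_dim[0]), math.ceil(res_y / block_dim[1]))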

My results

Here are the results of my CUDA-accelerated implementation that renders the “Burning Ship” fractal (the coloring is based on a custom mapping specified by a color patterns file).

An interesting zoom that reminds me of DNA/chromosomes, from my implementation of the "Burning Ship" fractal

Comparison of rendering times

For a rough idea of the difference in magnitude of computation times when rendered serially versus in parallel with CUDA, here are my rendering times for the above image (on my personal CPU and GPU).

Without CUDA (serially on CPU): 298 seconds
With CUDA (in parallel on GPU): 0.071 seconds
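
If you want to make a similar comparison yourself, keep in mind that CUDA kernel launches are asynchronous and that the first call to a kernel also includes Numba's JIT compilation time. Here is a minimal sketch of how the GPU timing could be taken, reusing the names from the snippets above:

import time
from numba import cuda

# warm-up launch: the first call also triggers Numba's JIT compilation
your_cuda_kernel_function[grid_dim, block_dim](your_func_params, device_image)
cuda.synchronize()

# timed launch
start = time.perf_counter()
your_cuda_kernel_function[grid_dim, block_dim](your_func_params, device_image)
cuda.synchronize()  # launches are asynchronous; wait for the GPU to finish
print(f"Rendered in {time.perf_counter() - start:.3f} seconds")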

As you can see, it rendered over 4000 times as fast with CUDA! So if you have computations such as fractal rendering that you want to massively parallelize, you may want to consider this relatively easy way to do it.