Custom kernels#

In this notebook we will show how to write a custom kernel in basic CUDA code.


import cupy as cp
from cupyx.scipy.ndimage import gaussian_filter
import numpy as np
from skimage.io import imread
import stackview
image = imread("../../data/blobs.tif")
cp_image = cp.asarray(image[1:], dtype=np.float32)
image1 = gaussian_filter(cp_image, sigma=3)
image2 = gaussian_filter(cp_image, sigma=7)

CuPy implements a couple of simplifications that make it easy to write custom kernels. For example, if we plan to execute the very same operation on every pixel of an image, an ElementwiseKernel does the job.

squared_difference = cp.ElementwiseKernel(
    'T x, T y',
    'T z',
    'z = (x - y) * (x - y)',
    'squared_difference')

After defining such a kernel, you can call it like any other Python function. You just need to make sure the inputs are CuPy arrays.

sqdiff = squared_difference(image1, image2)
stackview.insight(sqdiff)
shape: (253, 256)
dtype: float32
size: 253.0 kB
min: 3.0791853e-08
max: 7801.2026
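As a sanity check, the operation the kernel body performs, `z = (x - y) * (x - y)`, can be reproduced on the CPU with plain NumPy. This is a minimal sketch using small hand-picked arrays in place of the filtered images above:

```python
import numpy as np

# CPU reference for the squared_difference kernel body:
# z = (x - y) * (x - y), applied element by element
def squared_difference_cpu(x, y):
    return (x - y) * (x - y)

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([3.0, 2.0, 0.0], dtype=np.float32)

print(squared_difference_cpu(a, b))  # [4. 0. 9.]
```

Comparing such a CPU reference against the GPU result (after `cp.asnumpy`) is a common way to verify that a custom kernel computes what you intended.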

For completeness, we view the documentation.

cp.ElementwiseKernel?
Init signature: cp.ElementwiseKernel(self, /, *args, **kwargs)
Docstring:     
ElementwiseKernel(in_params, out_params, operation, name=u'kernel', reduce_dims=True, preamble=u'', no_return=False, return_tuple=False, **kwargs)
User-defined elementwise kernel.

    This class can be used to define an elementwise kernel with or without
    broadcasting.

    The kernel is compiled at an invocation of the
    :meth:`~ElementwiseKernel.__call__` method,
    which is cached for each device.
    The compiled binary is also cached into a file under the
    ``$HOME/.cupy/kernel_cache/`` directory with a hashed file name. The cached
    binary is reused by other processes.

    Args:
        in_params (str): Input argument list.
        out_params (str): Output argument list.
        operation (str): The body in the loop written in CUDA-C/C++.
        name (str): Name of the kernel function. It should be set for
            readability of the performance profiling.
        reduce_dims (bool): If ``False``, the shapes of array arguments are
            kept within the kernel invocation. The shapes are reduced
            (i.e., the arrays are reshaped without copy to the minimum
            dimension) by default. It may make the kernel fast by reducing the
            index calculations.
        options (tuple): Compile options passed to NVRTC. For details, see
            https://docs.nvidia.com/cuda/nvrtc/index.html#group__options.
        preamble (str): Fragment of the CUDA-C/C++ code that is inserted at the
            top of the cu file.
        no_return (bool): If ``True``, __call__ returns ``None``.
        return_tuple (bool): If ``True``, __call__ always returns tuple of
            array even if single value is returned.
        loop_prep (str): Fragment of the CUDA-C/C++ code that is inserted at
            the top of the kernel function definition and above the ``for``
            loop.
        after_loop (str): Fragment of the CUDA-C/C++ code that is inserted at
            the bottom of the kernel function definition.

    
File:           c:\users\haase\mambaforge\envs\cupy39_1\lib\site-packages\cupy\_core\_kernel.cp39-win_amd64.pyd
Type:           type
Subclasses:
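The docstring notes that an ElementwiseKernel can be defined "with or without broadcasting". As a CPU analogy (NumPy only, no GPU required), the kernel body is applied after the input arrays are broadcast against each other, just like NumPy's own elementwise operators:

```python
import numpy as np

# Broadcasting analogy: inputs of shape (3, 1) and (1, 4) are
# broadcast to a common shape (3, 4) before the elementwise
# body 'z = (x - y) * (x - y)' is applied to each element pair.
x = np.arange(3, dtype=np.float32).reshape(3, 1)  # shape (3, 1)
y = np.arange(4, dtype=np.float32).reshape(1, 4)  # shape (1, 4)

z = (x - y) * (x - y)
print(z.shape)  # (3, 4)
```

An ElementwiseKernel called with CuPy arrays of these shapes would likewise produce a (3, 4) result.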