Custom kernel execution

Custom kernel execution#

The library clesperanto contains a function execute which is the method that is used to execute a kernel code on the GPU. If we look at its signature, it takes the following inputs:

anchor: a reference starting path
opencl_kernel_filename: an OpenCL kernel file which will be loaded
kernel_name: the name of the kernel function to be executed inside the kernel file (usually the same as the filename)
global_size: the working space of the kernel, usually the size of the image to be processed BUT not always
parameters: a dict of parameters as {key, variable} to be passed to the kernel function

import pyclesperanto_prototype as cle
import numpy as np
cle.select_device('TX')  # TODO: change to your GPU

<NVIDIA GeForce RTX 2080 SUPER on Platform: NVIDIA CUDA (1 refs)>

cle.execute?

Signature:
cle.execute(
    anchor,
    opencl_kernel_filename: str,
    kernel_name: str,
    global_size,
    parameters,
    prog=None,
    constants=None,
    image_size_independent_kernel_compilation: bool = None,
    device=None,
)
Docstring:
Call opencl kernels (.cl files)

Parameters
----------
anchor: str
        Enter __file__ when calling this method and the corresponding open.cl
        file lies in the same folder as the python file calling it.
opencl_kernel_filename: str
    Filename of the open.cl file to be called
kernel_name: str
    kernel method inside the open.cl file to be called
    most clij/clesperanto kernel functions have the same name as the file they are in
global_size: list(int)
    global_size according to OpenCL definition (usually shape of the destination image).
parameters: dict(str, any), optional
    dictionary containing parameters. Take care: They must be of the
    right type and in the right order as specified in the open.cl file.
constants: dict(str, any), optional
    dictionary with names/values which will be added to the define
    statements. They are necessary, e.g. to create arrays of a given
    maximum size in OpenCL as variable array lengths are not supported.
image_size_independent_kernel_compilation: bool, optional
    if set to true, the kernel can handle images of different size and
    is a bit slower. If set to false, it can handle only images of a
    specific size and is a bit faster

See Also
--------
https://github.com/clij/clij-clearcl/blob/master/src/main/java/net/haesleinhuepf/clij/clearcl/util/CLKernelExecutor.java
File:      ~/Libraries/miniconda3/envs/proto/lib/python3.11/site-packages/pyclesperanto_prototype/_tier0/_execute.py
Type:      function

If we get the maximum_z_projection.cl file from the repository, we can execute it using the following code.

First, as always, we need some input data:

z,y,x = 100, 512, 512
array = np.random.random((z,y,x)).astype(np.float32)

From there, we will need first to push the data to the GPU, create an output data object, defined the function parameters, and finally execute the kernel.

Before we run this cell, let’s look at the kernel code itself to understand what it does, and why we need to define these parameters as we do.

// maximum_z_projection.cl

// the sample define the GPU behaviour when accessing pixels outside of the image
__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

// the kernel function itself, with the parameters. The order of the parameters is important.
__kernel void maximum_z_projection(
    IMAGE_src_TYPE  src,            // the input image, named 'src' 
    IMAGE_dst_TYPE  dst             // the output image, named 'dst'
) 
{
  const int x = get_global_id(0);  // the x coordinate of the current pixel, provided by the GPU thread
  const int y = get_global_id(1);  // the y coordinate of the current pixel, provided by the GPU thread

  IMAGE_src_PIXEL_TYPE max = 0;
  for (int z = 0; z < GET_IMAGE_DEPTH(src); ++z)  // loop over all z-axis range of the image
  {
    // read the pixel value at the current position (x,y,z) from the input image 'src'
    const IMAGE_src_PIXEL_TYPE value = READ_IMAGE(src, sampler, POS_src_INSTANCE(x,y,z,0)).x;  

    // conditional statement to find the maximum value
    if (value > max || z == 0) {  
      max = value;
    }
  }

  // write the pixel value at the current position (x,y,z) of the output image 'dst'
  WRITE_IMAGE(dst, POS_dst_INSTANCE(x,y,0,0), CONVERT_dst_PIXEL_TYPE(max));
}

# Prepare the input and output memory
input_arg = cle.push(array)
output_arg = cle.create_like((y,x)) # the z dimension is 1 because we will project into 2D along the z axis

# build the dictionary of parameters of the kernel
parameters = {'src': input_arg, 'dst': output_arg} # the key (e.g. 'src', 'dst') is the name of the variable in the kernel code
                                                   # the order of the parameters is the same as in the kernel code

# define the path and name of the kernel file to run
opencl_kernel_filename = 'maximum_z_projection.cl' # the name of the file
kernel_name = 'maximum_z_projection'               # the name of the kernel function in the kernel file

# define the working range of GPU
working_range = output_arg.shape

The working range is the most abstract parameter to understand. One way to see it is, if we could associate 1 pixel with 1 thread, the working range would be the number of threads we would need to execute the process. Here, we are projecting along the z-axis, hence in an ideal world, we would have 1 thread per pixel (x,y). We do not need to cover the z dimension. The working range is then the number of pixels in the x and y dimensions which here is the same as the output image shape.

Let’s run the execute method now

cle.execute("__file__",opencl_kernel_filename, kernel_name, working_range, parameters)

4 warnings generated.

<pyclesperanto_prototype._tier0._program.OCLProgram at 0x7f8fcd872450>

Notice that we do not return an output. Here the GPU reads the input src and saves the results in the dst data, which correspond to our output_arg variable.

All we have to do to read it now is to pull it back to the CPU:

projected = cle.pull(output_arg)
projected.shape

(512, 512)

Last step, let’s check that what we did is actually correct, by comparing it to the same process but on the CPU.

cpu_projected = array.max(axis=0)
cpu_projected.shape

assert np.array_equal(projected, cpu_projected)

Let’s compare to the actual implementation of the maximum_z_projection kernel in the library:

from .._tier0 import execute
from .._tier0 import create_2d_yx
from .._tier0 import plugin_function
from .._tier0 import Image

@plugin_function(output_creator=create_2d_yx, categories=['projection', 'in assistant'])
def maximum_z_projection(source :Image, destination_max :Image = None) -> Image:
    """Determines the maximum intensity projection of an image along Z. 
    
    Parameters
    ----------
    source : Image
    destination_max : Image, optional
    
    Returns
    -------
    destination_max
    
    Examples
    --------
    >>> import pyclesperanto_prototype as cle
    >>> cle.maximum_z_projection(source, destination_max)
    
    References
    ----------
    .. [1] https://clij.github.io/clij2-docs/reference_maximumZProjection
    """


    parameters = {
        "dst_max":destination_max,
        "src":source,
    }

    execute(__file__, '../clij-opencl-kernels/kernels/maximum_z_projection_x.cl', 'maximum_z_projection', destination_max.shape, parameters)

    return destination_max

To know a bit more about the CLIJ’s custom OpenCL, as well as look at the existing list of Kernel Operations, please refer to the clEsperanto’s Kernel Repository

You are welcome to contribute with your own kernels, and we will be happy to help you with that!