
In this notebook, we run a few operations on the GPU and on the CPU and compare their runtimes to benchmark the performance of the given GPU.

import pyclesperanto_prototype as cle
import numpy as np
import timeit
from functools import partial
from skimage.io import imread, imshow
import matplotlib.pyplot as plt
cle.select_device('TX')  # TODO: change to your GPU
<NVIDIA GeForce RTX 2080 SUPER on Platform: NVIDIA CUDA (1 refs)>
warm_up_iter = 1
eval_iter = 3

Gaussian blur

Let’s import the necessary functions and set up the common input parameters.

from skimage.filters import gaussian

# gaussian sigma to run on
sigma = 5
# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
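
As a quick sanity check of the size computation, the same arithmetic can be done by hand. This is a minimal sketch that only assumes 4 bytes per voxel (float32) and decimal megabytes (1 MB = 1,000,000 bytes), as in the cell above; the variable names are illustrative.

```python
import math

shape = (100, 1000, 1000)   # same shape as the test image
bytes_per_voxel = 4         # float32 stores each value in 4 bytes

# number of voxels times bytes per voxel, converted to decimal megabytes
size_mb = math.prod(shape) * bytes_per_voxel / 1_000_000
print(size_mb)  # 400.0
```

For a NumPy array, `array.nbytes` gives the same byte count as `array.size * array.itemsize`.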

We then prepare a minimal function containing the code we want to benchmark. In this case, we want to measure the time it takes to execute a Gaussian blur on an image.

def cle_gaussian(arr, sigma):
    cle.gaussian_blur(arr, sigma_x=sigma, sigma_y=sigma, sigma_z=sigma)

def ski_gaussian(arr, sigma):
    gaussian(arr, sigma)

We can then benchmark the functions. Here we use timeit, a module from the Python standard library. Each function is first run warm_up_iter times without being measured, then timed over eval_iter runs.

# GPU evaluation
partial_function = partial(cle_gaussian, gpu_array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.25477212097030133 s
# CPU evaluation
partial_function = partial(ski_gaussian, array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 14.080915375961922 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x55.2686664550841 times faster on GPU than on CPU.
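
The warm-up and measurement steps are repeated for every function we time, so they could be wrapped in a small helper. The sketch below is one possible way to do it; `benchmark` is a hypothetical name, not part of pyclesperanto or timeit. Note that `timeit.timeit` returns the total time for all `number` runs, so the helper divides by the number of runs to report a per-call time.

```python
import timeit
from functools import partial

def benchmark(func, *args, warm_up_iter=1, eval_iter=3):
    """Return the mean time in seconds of a single call to func(*args)."""
    timed = partial(func, *args)
    # warm-up runs are executed but their time is discarded
    timeit.timeit(timed, number=warm_up_iter)
    # timeit.timeit returns the total time of all `number` runs
    total_s = timeit.timeit(timed, number=eval_iter)
    return total_s / eval_iter

# Stand-alone usage example on a standard-library function
per_call_s = benchmark(sorted, list(range(100_000)))
print(f"sorted: {per_call_s:.6f} s per call")

# In this notebook it could be used as (hypothetical usage):
# gpu_in_s = benchmark(cle_gaussian, gpu_array, sigma)
# cpu_in_s = benchmark(ski_gaussian, array, sigma)
```

The CPU/GPU ratio is unchanged by this division as long as both timings use the same number of runs.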

Otsu Threshold

We can look at the execution time of other operations in the same way. Otsu thresholding is another interesting case because part of the algorithm cannot be distributed on the GPU. This means that, even if we get a speedup, it will not be as large as for operations that are better suited to parallelization.
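
To illustrate why part of Otsu's method is inherently sequential, here is a minimal pure-Python sketch of the threshold search on an already computed histogram. It is a generic textbook formulation, not pyclesperanto's or scikit-image's implementation, and the function name is hypothetical. Building the histogram parallelizes well; the scan over the bins below reuses running totals from one step to the next.

```python
def otsu_threshold_from_histogram(counts, bin_centers):
    """Return the bin center that maximizes the between-class variance."""
    total = sum(counts)
    total_sum = sum(c * v for c, v in zip(counts, bin_centers))
    best_threshold, best_variance = bin_centers[0], -1.0
    weight_bg = 0   # number of pixels at or below the candidate threshold
    sum_bg = 0.0    # sum of their intensities
    # sequential scan: each step depends on the running totals of the previous one
    for count, value in zip(counts[:-1], bin_centers[:-1]):
        weight_bg += count
        sum_bg += count * value
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue
        mean_bg = sum_bg / weight_bg
        mean_fg = (total_sum - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, value
    return best_threshold

# Two well-separated intensity groups: low values (0, 1) and high values (4, 5)
threshold = otsu_threshold_from_histogram([5, 5, 0, 0, 5, 5], [0, 1, 2, 3, 4, 5])
print(threshold)  # 1
```

Pixels with a value strictly greater than the returned threshold are treated as foreground, matching the `arr > threshold` convention used in this notebook.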

from skimage.filters import threshold_otsu

# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000

We define the functions to evaluate:

def cle_otsu(arr):
    cle.threshold_otsu(arr)

def ski_otsu(arr):
    arr > threshold_otsu(arr)

We run both timers for GPU and CPU, and compare the results

# GPU evaluation
partial_function = partial(cle_otsu, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.1980779010336846 s
# CPU evaluation
partial_function = partial(ski_otsu, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 2.864950605086051 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x14.46375688623056 times faster on GPU than on CPU.
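
A smaller speedup for an operation with a non-parallel part is what Amdahl's law predicts: the overall speedup is bounded by the fraction of the runtime that stays sequential. The sketch below evaluates the formula for a few fractions. These fractions and the 50x factor are hypothetical values chosen for illustration; they were not measured in this notebook.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only `parallel_fraction` of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)

# Hypothetical values for illustration only (not measured here)
for fraction in (0.5, 0.9, 0.99):
    overall = amdahl_speedup(fraction, 50.0)
    print(f"{fraction:.0%} of the runtime accelerated 50x -> x{overall:.1f} overall")
```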


Benchmarking a single operation is easy, but it does not reflect a real-world application. Let’s first try to mimic a processing pipeline with a basic set of operations: Gaussian blur, thresholding, and labeling.

Here, we do not want an image of random values, so we define a small function that generates a simple image with randomly distributed blobs.

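# Generate a binary, blob-like test image
def create_test_image(shape, nb_points):
    radius = 10
    # random coordinates; note that every axis is scaled by shape[-1]
    pointlist = np.random.random([3, nb_points]) * shape[-1]
    image = cle.create(shape)
    # mark one labelled voxel per point, then grow each one into a sphere
    cle.pointlist_to_labelled_spots(pointlist, image)
    blobs = cle.maximum_sphere(image, radius_x=radius, radius_y=radius, radius_z=radius)
    # binarize: every labelled voxel becomes foreground
    binary_blobs = cle.greater_constant(blobs, constant=0)
    return cle.pull(binary_blobs).astype(np.float32)

from skimage.measure import label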

# create a test image
array = create_test_image((100,1000,1000), 500)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
print(array.size, array.itemsize, array_mb)
100000000 4 400.0
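
In create_test_image, all three coordinates are scaled by the last dimension of the shape. For a non-cubic shape such as (100, 1000, 1000), this likely places many points beyond the shortest axis. A possible variant that scales each axis by its own extent is sketched below. It assumes that the point list stores x, y, z in rows 0, 1, 2 while the shape is ordered (z, y, x); the function name is hypothetical, and using it would change the test image and therefore the timings reported here.

```python
import numpy as np

def random_points_in_volume(shape, nb_points, seed=None):
    """Return a (3, nb_points) array of x, y, z coordinates inside a (z, y, x) shape."""
    rng = np.random.default_rng(seed)
    # Assumption: rows are x, y, z, so the (z, y, x) shape is reversed
    extent_xyz = np.asarray(shape[::-1], dtype=float).reshape(3, 1)
    # uniform values in [0, 1) scaled per axis by broadcasting
    return rng.random((3, nb_points)) * extent_xyz

points = random_points_in_volume((100, 1000, 1000), 500, seed=0)
print(points.shape, points.max(axis=1))
```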

We can then define our mini-pipeline to evaluate

def cle_pipeline(arr):
    blurred = cle.gaussian_blur(arr, sigma_x=3, sigma_y=3, sigma_z=3)
    binary = cle.threshold_otsu(blurred)
    labels = cle.connected_components_labeling_box(binary)

def ski_pipeline(arr):
    blurred = gaussian(arr, sigma=3)
    binary = blurred>threshold_otsu(blurred)
    labels = label(binary)

And run the benchmark:

# GPU evaluation
partial_function = partial(cle_pipeline, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.6984281110344455 s
# CPU evaluation
partial_function = partial(ski_pipeline, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 14.165793296997435 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x20.282392809213256 times faster on GPU than on CPU.
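
To compare the three benchmarks at a glance, the timings printed above can be collected and formatted with fewer digits. The values below are copied from the outputs of this notebook; each one is the total time for eval_iter = 3 runs, and the dictionary and helper names are illustrative.

```python
# Timings in seconds copied from the outputs above: (CPU, GPU)
timings = {
    "Gaussian blur": (14.080915375961922, 0.25477212097030133),
    "Otsu threshold": (2.864950605086051, 0.1980779010336846),
    "Pipeline": (14.165793296997435, 0.6984281110344455),
}

def speedup(cpu_s, gpu_s):
    """Return how many times faster the GPU run was than the CPU run."""
    return cpu_s / gpu_s

for name, (cpu_s, gpu_s) in timings.items():
    factor = speedup(cpu_s, gpu_s)
    print(f"{name:<15} CPU {cpu_s:6.2f} s   GPU {gpu_s:5.2f} s   x{factor:.1f}")
```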