Benchmarking#

In this notebook we run the same operations on the GPU (with pyclesperanto) and on the CPU (with scikit-image) and compare their runtimes to benchmark the performance of the selected GPU.

import pyclesperanto_prototype as cle
import numpy as np
import timeit
from functools import partial
from skimage.io import imread, imshow
import matplotlib.pyplot as plt
cle.select_device('TX')  # TODO: change to your GPU
cle.set_wait_for_kernel_finish(True)
<NVIDIA GeForce RTX 2080 SUPER on Platform: NVIDIA CUDA (1 refs)>
warm_up_iter = 1
eval_iter = 3

Gaussian blur#

Let’s import the necessary functions and set up the common input parameters.

from skimage.filters import gaussian

# gaussian sigma to run on
sigma = 5
# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
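
As a quick sanity check on the size computation above, the same arithmetic can be reproduced from the shape alone. This is a minimal sketch in plain Python: the helper name image_size_mb is hypothetical, and it assumes 4 bytes per element (float32) and decimal megabytes (1 MB = 1,000,000 bytes), matching the division by 1000000 used above.

```python
import math

def image_size_mb(shape, bytes_per_element=4):
    """Size of a dense array in decimal megabytes (hypothetical helper)."""
    # Number of elements is the product of all dimensions.
    n_elements = math.prod(shape)
    # 4 bytes per element is an assumption that holds for float32 data.
    return n_elements * bytes_per_element / 1_000_000

size_mb = image_size_mb((100, 1000, 1000))
print(size_mb)  # 400.0
```

For a NumPy array the same byte count is also available directly as array.nbytes.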

We then prepare a minimal function containing the code we want to benchmark. In this case, we want to measure the time it takes to execute a Gaussian blur on an image.

def cle_gaussian(arr, sigma):
    cle.gaussian_blur(arr, sigma_x=sigma, sigma_y=sigma, sigma_z=sigma)

def ski_gaussian(arr, sigma):
    gaussian(arr, sigma)

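We can then benchmark each function with the timeit module from Python's standard library. Note that timeit.timeit returns the total time for all executions requested through its number argument, so each timing printed below covers eval_iter (here 3) runs rather than a single run.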

# GPU evaluation
partial_function = partial(cle_gaussian, gpu_array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.25477212097030133 s
# CPU evaluation
partial_function = partial(ski_gaussian, array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 14.080915375961922 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x55.2686664550841 times faster on GPU than on CPU.
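
The warm-up and evaluation steps used above can be wrapped into one reusable function. The following is only a sketch under stated assumptions: the name benchmark and the stand-in workload sum_of_squares are hypothetical and not part of the notebook, and the helper reports the average time per call instead of the total. The pure-Python workload is there so the example runs without a GPU.

```python
import timeit
from functools import partial

def benchmark(func, *args, warm_up_iter=1, eval_iter=3):
    """Return the average time per call in seconds (hypothetical helper)."""
    timed = partial(func, *args)
    # Warm-up runs are executed, but their timing is discarded.
    timeit.timeit(timed, number=warm_up_iter)
    # timeit.timeit returns the total time for all `number` calls.
    total_s = timeit.timeit(timed, number=eval_iter)
    return total_s / eval_iter

# Stand-in workload (assumption): replace with cle_gaussian or ski_gaussian.
def sum_of_squares(n):
    return sum(i * i for i in range(n))

per_call_s = benchmark(sum_of_squares, 100_000)
print(f"{per_call_s:.6f} s per call")
```

In the notebook's setting, a call such as benchmark(cle_gaussian, gpu_array, sigma) would play the same role as the partial and timeit lines of the GPU evaluation.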

Otsu Threshold#

We can benchmark other operations in the same way. Otsu thresholding is an interesting case because part of the algorithm cannot be parallelized on the GPU. We can still expect a speed-up, but a smaller one than for operations that are better suited to parallelization, such as the Gaussian blur.
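
One way to reason about this limit is Amdahl's law, a general model that is used here only as an illustration and is not part of the original benchmark. The sketch below assumes that a fraction of the runtime is accelerated by a fixed factor while the rest stays serial. The fractions and the factor of 50 are hypothetical values, not measurements of the Otsu implementation.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)

# Hypothetical numbers: the parallel part runs 50x faster on the GPU.
for fraction in (1.0, 0.99, 0.95):
    speedup = amdahl_speedup(fraction, 50.0)
    print(f"parallel fraction {fraction:.2f} -> overall speedup {speedup:.1f}x")
```

Under these assumed numbers, even a small serial share lowers the overall speedup noticeably, which matches the qualitative expectation described above.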

from skimage.filters import threshold_otsu

# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000

We define the functions to evaluate:

def cle_otsu(arr):
    cle.threshold_otsu(arr)

def ski_otsu(arr):
    arr > threshold_otsu(arr)

We run both timers, for GPU and CPU, and compare the results.

# GPU evaluation
partial_function = partial(cle_otsu, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.1980779010336846 s
# CPU evaluation
partial_function = partial(ski_otsu, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 2.864950605086051 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x14.46375688623056 times faster on GPU than on CPU.

Mini-Pipeline#

Benchmarking single operations is easy, but it does not reflect a real-world application. As a first step, let’s mimic a processing pipeline with a basic set of operations: Gaussian blur, thresholding, and labeling.

Here we do not want an image of random values, so we wrote a small function that generates a simple image with randomly distributed blobs.

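# Make a blobs-like image
def create_test_image(shape, nb_points):
    # random point coordinates, one column per point
    # note: all three coordinates are scaled by shape[-1], so for a
    # non-cubic shape some points may fall outside the image
    pointlist = np.random.random([3, nb_points]) * shape[-1]
    image = cle.create(shape)
    # write each point into the image as a labelled spot
    cle.pointlist_to_labelled_spots(pointlist, image)
    # grow the spots into blobs with a radius of 10 pixels
    blobs = cle.maximum_sphere(image, radius_x=10, radius_y=10, radius_z=10)
    # binarize: every non-zero pixel becomes foreground
    binary_blobs = cle.greater_constant(blobs, constant=0)
    return cle.pull(binary_blobs).astype(np.float32)

from skimage.measure import label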

# create a test image
array = create_test_image((100,1000,1000), 500)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
print(array.size, array.itemsize, array_mb)
100000000 4 400.0

We can then define the mini-pipeline to evaluate:

def cle_pipeline(arr):
    blurred = cle.gaussian_blur(arr, sigma_x=3, sigma_y=3, sigma_z=3)
    binary = cle.threshold_otsu(blurred)
    labels = cle.connected_components_labeling_box(binary)

def ski_pipeline(arr):
    blurred = gaussian(arr, sigma=3)
    binary = blurred>threshold_otsu(blurred)
    labels = label(binary)

And run the benchmark:

# GPU evaluation
partial_function = partial(cle_pipeline, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 0.6984281110344455 s
# CPU evaluation
partial_function = partial(ski_pipeline, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} Mb ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 Mb ... 14.165793296997435 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x20.282392809213256 times faster on GPU than on CPU.