Benchmarking#
In this notebook, we run a few image processing operations on the GPU and on the CPU and compare their runtimes to benchmark the performance of a given GPU.
import pyclesperanto_prototype as cle
import numpy as np
import timeit
from functools import partial
from skimage.io import imread, imshow
import matplotlib.pyplot as plt
cle.select_device('TX') # TODO: change to your GPU
cle.set_wait_for_kernel_finish(True)
<NVIDIA GeForce RTX 2080 SUPER on Platform: NVIDIA CUDA (1 refs)>
warm_up_iter = 1
eval_iter = 3
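The call to `cle.set_wait_for_kernel_finish(True)` matters for benchmarking: GPU kernels are typically launched asynchronously, so a call can return before the computation has finished, and a timer would then measure only the launch. The sketch below illustrates this pitfall with a CPU-side analogy based on a thread pool from the standard library. `slow_operation` and its 0.2 s delay are made up for illustration and say nothing about how pyclesperanto works internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_operation():
    """Stand-in for a GPU kernel (hypothetical): just sleeps for 0.2 s."""
    time.sleep(0.2)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Timing only the submission: the call returns almost immediately
    start = time.perf_counter()
    future = pool.submit(slow_operation)
    launch_only_s = time.perf_counter() - start

    # Timing until the result is available: this includes the actual work
    result = future.result()
    launch_and_wait_s = time.perf_counter() - start

print(f"launch only:     {launch_only_s:.4f} s")
print(f"launch and wait: {launch_and_wait_s:.4f} s")
```

Only the second measurement reflects the cost of the operation, which is why the benchmarks in this notebook wait for each kernel to finish.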
Gaussian blur#
Let’s import the necessary functions and set up common input parameters.
from skimage.filters import gaussian
# gaussian sigma to run on
sigma = 5
# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
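As a quick sanity check of the size computation above, the same arithmetic can be done without creating the array. This is a minimal sketch using only the standard library; it assumes the shape and the 32-bit float type used in this notebook, and adds the size in MiB (powers of 2) for comparison.

```python
import math
import struct

shape = (100, 1000, 1000)          # same shape as the test image
itemsize = struct.calcsize("=f")   # bytes per 32-bit float
n_bytes = math.prod(shape) * itemsize

size_mb = n_bytes / 1000**2        # megabytes (powers of 10), as in the notebook
size_mib = n_bytes / 1024**2       # mebibytes (powers of 2), for comparison

print(f"{size_mb} MB, {size_mib:.1f} MiB")
```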
We then prepare a minimal function containing the code we want to benchmark. In this case, we want to measure the time it takes to execute a Gaussian blur on an image.
def cle_gaussian(arr, sigma):
    cle.gaussian_blur(arr, sigma_x=sigma, sigma_y=sigma, sigma_z=sigma)

def ski_gaussian(arr, sigma):
    gaussian(arr, sigma)
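We can then run the benchmark on the functions to evaluate. Here we use timeit, a module from the Python standard library. Note that timeit.timeit returns the total time for all executions requested with number, so the timings reported below cover eval_iter runs, not a single one.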
# GPU evaluation
partial_function = partial(cle_gaussian, gpu_array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 0.25477212097030133 s
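Because `timeit.timeit` reports the total over all timed runs, the figure above can be converted into a time per run and a rough throughput. The sketch below simply redoes that arithmetic with the values printed in this notebook; the throughput is a derived, approximate figure and not something the notebook measures directly.

```python
eval_iter = 3                     # number of timed runs, as set earlier
array_mb = 400.0                  # image size in MB, from the output above
gpu_in_s = 0.25477212097030133    # total GPU time in seconds, from the output above

per_run_s = gpu_in_s / eval_iter
throughput_mb_per_s = array_mb / per_run_s

print(f"{per_run_s:.4f} s per run, about {throughput_mb_per_s:.0f} MB/s")
```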
# CPU evaluation
partial_function = partial(ski_gaussian, array, sigma)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 14.080915375961922 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x55.2686664550841 times faster on GPU than on CPU.
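The warm-up, measure, and compare pattern used in this section can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `benchmark` and `busy_work` are hypothetical names, and the helper returns the mean time per call instead of the total.

```python
import timeit


def benchmark(func, warm_up_iter=1, eval_iter=3):
    """Return the mean time per call of `func` in seconds (hypothetical helper)."""
    # Warm-up runs are executed but their timing is discarded
    timeit.timeit(func, number=warm_up_iter)
    # timeit.timeit returns the total time for `number` calls
    total_s = timeit.timeit(func, number=eval_iter)
    return total_s / eval_iter


def busy_work():
    """Stand-in workload (hypothetical); replace with the function to benchmark."""
    return sum(i * i for i in range(100_000))


per_call_s = benchmark(busy_work)
print(f"{per_call_s:.6f} s per call")
```

With the names defined in this notebook, the GPU case would read `benchmark(partial(cle_gaussian, gpu_array, sigma))`, and the speedup is the ratio of the CPU and GPU results as before.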
Otsu Threshold#
We can look at the execution time of other operations in the same way. Otsu thresholding is another interesting case because part of the algorithm cannot be parallelized on the GPU. This means that, even though we get a speedup, it will not be as large as for operations that are better suited to parallelization.
from skimage.filters import threshold_otsu
# create a test image
array = np.random.random([100, 1000, 1000]).astype(np.float32)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
We define the functions to evaluate:
def cle_otsu(arr):
    cle.threshold_otsu(arr)

def ski_otsu(arr):
    arr > threshold_otsu(arr)
We run both timers for GPU and CPU, and compare the results
# GPU evaluation
partial_function = partial(cle_otsu, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 0.1980779010336846 s
# CPU evaluation
partial_function = partial(ski_otsu, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 2.864950605086051 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x14.46375688623056 times faster on GPU than on CPU.
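The effect of a non-parallelizable part on the overall speedup is commonly described by Amdahl's law: if a fraction p of the work is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below evaluates this formula for a few values. The parallel fractions are made up for illustration and were not measured for Otsu thresholding; the factor 55 is only borrowed from the Gaussian blur result as a plausible order of magnitude.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only `parallel_fraction` of the work is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Illustrative values only: none of these fractions were measured
for fraction in (1.0, 0.99, 0.95, 0.90):
    overall = amdahl_speedup(fraction, parallel_speedup=55.0)
    print(f"{fraction:.0%} parallel -> x{overall:.1f} overall")
```

Even a small serial share reduces the achievable speedup noticeably, which is consistent with the lower factor observed here compared to the Gaussian blur.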
Mini-Pipeline#
Benchmarking single operations is easy, but it does not reflect real-world use. Let’s first try to mimic a processing pipeline with a basic set of operations: Gaussian blur, thresholding, and labeling.
Here, we do not want an image of random values, so we wrote a small function that generates a simple image with randomly distributed blobs.
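# Make a blob-like image
def create_test_image(shape, nb_points):
    # random point coordinates; note that all three rows are scaled by shape[-1]
    pointlist = np.random.random([3, nb_points]) * shape[-1]
    image = cle.create(shape)
    # write one labelled spot per point into the image
    cle.pointlist_to_labelled_spots(pointlist, image)
    # dilate every spot with a spherical footprint of radius 10
    blobs = cle.maximum_sphere(image, radius_x=10, radius_y=10, radius_z=10)
    # binarize: every non-zero pixel belongs to a blob
    binary_blobs = cle.greater_constant(blobs, constant=0)
    return cle.pull(binary_blobs).astype(np.float32)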
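In `create_test_image`, every random coordinate is multiplied by `shape[-1]`, so for a non-cubic shape such as `(100, 1000, 1000)` some points can fall outside the shorter axis. A possible variant, sketched here with the standard library only, scales each coordinate by the length of its own axis. It assumes that the shape is given as (z, y, x) and that the rows of the point list are ordered x, y, z; `random_points_in_volume` is a hypothetical helper and not part of pyclesperanto.

```python
import random


def random_points_in_volume(shape, nb_points, seed=None):
    """Return 3 rows (x, y, z) of random coordinates inside a (z, y, x) volume."""
    rng = random.Random(seed)
    # Assumption: shape is (z, y, x), while the rows are ordered x, y, z
    extents_xyz = shape[::-1]
    return [
        [rng.random() * extent for _ in range(nb_points)]
        for extent in extents_xyz
    ]


points = random_points_in_volume((100, 1000, 1000), nb_points=500, seed=0)
x_coords, y_coords, z_coords = points
print(len(points), len(x_coords), max(z_coords) < 100)
```

Using such a point list would change the number of blobs that end up inside the image, so timings obtained with it are not directly comparable to the ones reported in this notebook.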
from skimage.measure import label
# create a test image
array = create_test_image((100,1000,1000), 500)
gpu_array = cle.push(array)
# compute the size of the image in MB
array_mb = array.size * array.itemsize / 1000000
print(array.size, array.itemsize, array_mb)
100000000 4 400.0
We can then define our mini-pipeline to evaluate
def cle_pipeline(arr):
    blurred = cle.gaussian_blur(arr, sigma_x=3, sigma_y=3, sigma_z=3)
    binary = cle.threshold_otsu(blurred)
    labels = cle.connected_components_labeling_box(binary)

def ski_pipeline(arr):
    blurred = gaussian(arr, sigma=3)
    binary = blurred > threshold_otsu(blurred)
    labels = label(binary)
And run the benchmark:
# GPU evaluation
partial_function = partial(cle_pipeline, gpu_array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
gpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {gpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 0.6984281110344455 s
# CPU evaluation
partial_function = partial(ski_pipeline, array)
_ = timeit.timeit(partial_function, number=warm_up_iter)
cpu_in_s = timeit.timeit(partial_function, number=eval_iter)
print(f"Processing {array.shape} of {array_mb} MB ... {cpu_in_s} s")
Processing (100, 1000, 1000) of 400.0 MB ... 14.165793296997435 s
print(f"We are x{cpu_in_s / gpu_in_s} times faster on GPU than on CPU.")
We are x20.282392809213256 times faster on GPU than on CPU.
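A single number for the whole pipeline does not show which step dominates the runtime. One way to find out is to time each stage separately while passing its output to the next one. The following is a minimal sketch of that idea: `time_stages` is a hypothetical helper, and the three stages are toy stand-ins working on a plain list, not the image operations used above.

```python
import time


def time_stages(stages, data):
    """Run `stages` in order, feeding each output to the next; return seconds per stage."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical stand-in stages operating on a plain list of numbers
stages = [
    ("smooth", lambda values: [(a + b) / 2 for a, b in zip(values, values[1:])]),
    ("threshold", lambda values: [v > 0.5 for v in values]),
    ("count", lambda flags: sum(flags)),
]

result, timings = time_stages(stages, [0.0, 1.0, 1.0, 0.0, 1.0])
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

On the GPU, per-stage timings like these are only meaningful when each call waits for its kernel to finish, as configured at the top of this notebook.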