napari integration#

The napari plugin napari-cupy-image-processing allows designing image-processing workflows that use cupy under the hood, without writing code. The user can then use the napari-assistant to generate Python code corresponding to a workflow that was set up interactively. While this is convenient, the generated code typically needs to be tweaked to achieve optimal performance: memory transfer between CPU and GPU should be minimized. This notebook demonstrates how.

import napari_cupy_image_processing as ncupy

import cupy as cp
import numpy as np
from skimage.io import imread
from timeit import timeit
import stackview
image = imread("../../data/blobs.tif")

If you call a napari-cupy operation and pass a numpy image, the image is internally converted to a cupy array, sent to GPU memory, processed, and converted back to a numpy-like array afterwards. You can view the result in Jupyter immediately.

blurred = ncupy.gaussian_filter(image, sigma=5)

isinstance(blurred, np.ndarray)
True
blurred
n-cupy made image
shape: (254, 256)
dtype: float64
size: 508.0 kB
min: 36.21353885091685
max: 237.5686234472664
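
The stackview library imported above can also be used to display such an array together with a similar summary explicitly, e.g. via stackview.insight (assuming a recent stackview version that provides this function):

stackview.insight(blurred)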

This back-and-forth conversion between numpy and cupy arrays is suboptimal in terms of performance. It is preferable to convert the image to a cupy array once, run potentially many processing steps while keeping the image in GPU memory, and only convert it back at the end.

cp_image = cp.asarray(image)

isinstance(cp_image, np.ndarray)
False

Thus, if you pass a cupy array to the function, you also receive a cupy array back. No conversion to numpy is performed in that case.

cp_blurred = ncupy.gaussian_filter(cp_image, sigma=5)

isinstance(cp_blurred, np.ndarray)
False
type(cp_blurred)
cupy.ndarray
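
Following the pattern described above, the result can be transferred back to the host once all GPU-side processing steps are finished. This is a minimal sketch using cp.asnumpy; the variable name result is just illustrative:

# transfer the final result from GPU memory back to a numpy array on the host
result = cp.asnumpy(cp_blurred)

# the result is a plain numpy array again
isinstance(result, np.ndarray)   # -> True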

Exercise#

Speed up the following image processing workflow. Make sure the image remains in GPU memory.

image = np.random.random((4000, 4000))
%%timeit
blurred = ncupy.gaussian_filter(image, sigma=5)
binary = blurred > 0.5
labels = ncupy.label(binary)
296 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Run the slow and the optimized code again with different image sizes, e.g. 0.1 MB, 1 MB, 10 MB, … What is the limit below which the optimization becomes negligible?
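
For reference, one possible optimized variant is sketched below: the image is transferred to the GPU once and all intermediate results stay in GPU memory. Actual timings depend on your hardware, so none are given here; converting the labels back with cp.asnumpy would add one final transfer if needed.

cp_image = cp.asarray(image)   # transfer the image to GPU memory once

%%timeit
# all steps operate on cupy arrays and keep the data in GPU memory
blurred = ncupy.gaussian_filter(cp_image, sigma=5)
binary = blurred > 0.5
labels = ncupy.label(binary)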