Data exploration#

For this workshop, we rely on a simple dataset from the 2018 Data Science Bowl. See this page for more details.

from data import get_dsb2018_train_files, get_dsb2018_validation_files, get_dsb2018_test_files, fill_label_holes, quantile_normalization
from tifffile import imread

import matplotlib.pyplot as plt
import numpy as np

Getting lists of input and label files#

The data required to run the notebooks is located at /projects/p_scads_trainings/BIAS/dsb2018 and has to be copied into your clone of this repository (which should reside in your home directory after clicking the above link to launch JupyterHub).

To get the data:

Create a directory named data at the top level of this repo (i.e. at the same level as the *.ipynb notebook files and this README).

mkdir data

Copy the data to the freshly created directory using

cp -r /projects/p_scads_trainings/BIAS/dsb2018 $PWD/data/

As a fallback, the data can be downloaded as a zip file from the StarDist GitHub repository.
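Before running the cells below, it can help to verify that the copy step succeeded. The following is a minimal sketch: it only assumes the data/dsb2018 location created in the steps above and that the images are stored as .tif files. The helper name count_tif_files is hypothetical and not part of this repository.

```python
from pathlib import Path


def count_tif_files(root="data/dsb2018"):
    """Return the number of .tif files below root, or None if root is missing.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    root = Path(root)
    if not root.is_dir():
        return None
    # rglob searches all subdirectories, so no particular layout is assumed
    return sum(1 for _ in root.rglob("*.tif"))


n_files = count_tif_files()
if n_files is None:
    print("data/dsb2018 not found, please repeat the copy step above")
else:
    print("found", n_files, ".tif files")
```

If the function reports a missing directory or zero files, check that the notebook is started from the top level of the repository.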

# let's loop through the dataset and check how many samples we have
for name, getter_fn in zip(["train", "val", "test"], [get_dsb2018_train_files, get_dsb2018_validation_files, get_dsb2018_test_files]):
    X, y = getter_fn()
    print(name, len(X), len(y))

After the loop, X and y still hold the result of the last iteration, i.e. the test set. Both variables should now contain lists of paths to individual .tif files.

X[:3]

y[:3]

Looking at a single sample of the training data#

Xtrain, ytrain = get_dsb2018_train_files()

sidx = 0 #selecting the first image in the lists
image_file, label_file = Xtrain[sidx], ytrain[sidx]
image, label = imread(image_file), imread(label_file)
label_filled = fill_label_holes(label) # some masks have holes, let's fill them

print(type(image))
print(type(label))

for name, sample in zip(["image", "label"], [image, label]):
    print(name, sample.dtype, sample.shape, sample.min(), sample.max())

The loaded images are 8-bit greyscale images. The labels, however, are encoded as 16-bit images.
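A wider integer type for the labels is plausible if every object in a mask carries its own integer ID, with 0 reserved for the background. This is an assumption about the encoding that you should verify on the loaded label. A minimal sketch on a synthetic array, so that it runs without the dataset:

```python
import numpy as np

# synthetic instance mask: 0 = background, 1 and 2 = two separate objects
toy_label = np.zeros((6, 6), dtype=np.uint16)
toy_label[1:3, 1:3] = 1
toy_label[4:6, 3:6] = 2

ids, pixel_counts = np.unique(toy_label, return_counts=True)
object_ids = ids[ids != 0]  # drop the background ID

print("object IDs:", object_ids)
print("number of objects:", len(object_ids))
print("pixels per ID (incl. background):",
      dict(zip(ids.tolist(), pixel_counts.tolist())))
```

Applying the same np.unique call to label from the cell above shows how many objects the first training sample contains.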

plt.subplot(131)
plt.imshow(image, cmap="gray")

plt.subplot(132)
plt.imshow(label)

plt.subplot(133)
plt.imshow(label_filled)

Convert the instance label to a binary segmentation mask#

As we intend to demonstrate the usage of PyTorch, we simplify our problem from instance segmentation (one ID per object) to semantic segmentation (foreground versus background).

label_binary = np.zeros_like(label_filled)
label_binary[label_filled != 0] = 1

plt.imshow(label_binary, cmap="gray")
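To make the effect of this simplification concrete, here is a small sketch on a synthetic mask (not taken from the dataset). It shows that two touching objects can no longer be told apart once the mask is binarized.

```python
import numpy as np

# synthetic instance mask with two touching objects (IDs 1 and 2)
toy_instances = np.array(
    [[0, 0, 0, 0],
     [1, 1, 2, 2],
     [1, 1, 2, 2],
     [0, 0, 0, 0]], dtype=np.uint16)

# gives the same result as the zeros_like/assignment approach above
toy_binary = (toy_instances != 0).astype(toy_instances.dtype)

print(np.unique(toy_instances))  # instance IDs including the background
print(np.unique(toy_binary))     # only background and foreground remain
```

The boolean comparison is a compact alternative to the two-line version used above; both keep the dtype of the input mask.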

Normalization of the raw image#

As neural networks tend to be easier to train when input values are small, we should normalize the pixel intensities from the uint8 range of [0, 255] to floating point values closer to [0, 1].

In the code below, we use a technique that sets the lower boundary of the normalization range to the 1st percentile of the pixel intensities and the upper boundary to the 99.8th percentile. Unlike a plain min/max scaling, this is not thrown off by a few extremely dark or bright pixels, and it has proven to be very robust in practice. We adopted it from StarDist (see the stardist/stardist repository on GitHub).

# similar normalization as shown in stardist (https://github.com/stardist/stardist/blob/master/examples/2D/2_training.ipynb)
image_normalized_noclip = quantile_normalization(
    image,
    quantile_low=0.01,
    quantile_high=0.998,
    clip=False)[0]

image_normalized_clip = quantile_normalization(
    image,
    quantile_low=0.01,
    quantile_high=0.998,
    clip=True)[0]

print("image intensity range before normalisation")
print(image.min(), image.max())

print("image intensity range after normalisation without clipping")
print(image_normalized_noclip.min(), image_normalized_noclip.max())

print("image intensity range after normalisation with clipping")
print(image_normalized_clip.min(), image_normalized_clip.max())
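The implementation of quantile_normalization lives in the data module and is not shown here. Purely as an illustration of the idea, the following sketch uses np.percentile. The function name percentile_normalize, the eps parameter and the exact formula are assumptions, not the workshop's code.

```python
import numpy as np


def percentile_normalize(x, p_low=1.0, p_high=99.8, clip=False, eps=1e-20):
    """Map the p_low-th percentile of x to 0 and the p_high-th percentile to 1.

    Hypothetical helper for illustration; eps guards against division by zero.
    """
    x = x.astype(np.float32)
    lo, hi = np.percentile(x, [p_low, p_high])
    out = (x - lo) / (hi - lo + eps)
    if clip:
        # values outside the percentile range are cut off at 0 and 1
        out = np.clip(out, 0.0, 1.0)
    return out


# random toy image so that the sketch runs without the dataset
rng = np.random.default_rng(0)
toy_image = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)

for clip in (False, True):
    result = percentile_normalize(toy_image, clip=clip)
    print("clip =", clip, "min =", result.min(), "max =", result.max())
```

Without clipping, pixels below the lower percentile end up slightly below 0 and pixels above the upper percentile slightly above 1; with clipping, all values are forced into [0, 1].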

plt.subplot(131)
_ = plt.hist(image.flatten(), density=True)

plt.subplot(132)
_ = plt.hist(image_normalized_noclip.flatten(), density=True)

plt.subplot(133)
_ = plt.hist(image_normalized_clip.flatten(), density=True)

plt.tight_layout()

from torchvision import transforms

# a convenient transform from torchvision converts the np.array
# to a torch.Tensor and adds a leading channel axis: (H, W) -> (1, H, W)
label_torch = transforms.ToTensor()(label_binary.astype(np.float32))

# when using code that expects numpy objects, we have to cast back again
plt.imshow(label_torch.numpy()[0], cmap="gray")

Exploring the image resolutions of the training data#

# let's read in all training images
X = list(map(imread, Xtrain))

X[1].shape, type(X[1].shape)

shapes = [tuple(x.shape) for x in X]

# you will find many different shapes in the training data
shapes

# let's see the shapes we find
unique_shapes = set(shapes)
unique_shapes

from collections import Counter

# count how often each shape occurs
counts = dict(Counter(shapes))

counts

Exercise: A homogeneous dataset?#

If the shapes differ, what else is different? Explore the training data set further and find out:

Are all images encoded the same way?
Are all label masks encoded the same way?

Once done, approach the person next to you and discuss how you would proceed with such a diverse data set in practice.
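One possible starting point for the exercise, as a sketch rather than a solution: tabulate the dtype and the number of dimensions with collections.Counter. The helper name summarize_arrays is hypothetical, and the toy arrays below merely stand in for the loaded images.

```python
from collections import Counter

import numpy as np


def summarize_arrays(arrays):
    """Count how often each (dtype, number of dimensions) combination occurs."""
    return Counter((str(a.dtype), a.ndim) for a in arrays)


# toy stand-ins for loaded images; replace with the list X from above
toy_arrays = [
    np.zeros((4, 4), dtype=np.uint8),
    np.zeros((8, 6), dtype=np.uint8),
    np.zeros((4, 4, 3), dtype=np.uint8),
]

for (dtype, ndim), n in summarize_arrays(toy_arrays).items():
    print(dtype, ndim, n)
```

The same helper can be applied to the label masks once they have been read with imread.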