Scientific plotting with seaborn#

In this tutorial, we will take some of the measurements from the course and visualize them nicely using the seaborn Python library. Seaborn has the advantage that it works directly with the DataFrame format from the pandas library - a convenient way of handling tabular data in Python.

For context: You may be used to plotting in Python with Matplotlib. Under the hood, seaborn is just matplotlib - but it wraps matplotlib's functionality to make complex visualizations much easier to produce.

Let’s load all the packages we need here:

import seaborn as sns
from pathlib import Path
import pandas as pd

Loading the data#

As a first step, we need to load the measurements exported from napari. Alternatively, you can download some from the course repository. For this, we compile a list of all .csv files to take into account.

root = './measurements'

file_paths = [file_path for file_path in Path(root).iterdir() if file_path.suffix == '.csv']
file_paths
[WindowsPath('measurements/17P1_POS0006_D_1UL_features.csv'),
 WindowsPath('measurements/17P1_POS0007_D_1UL_features.csv'),
 WindowsPath('measurements/17P1_POS0011_D_1UL_features.csv'),
 WindowsPath('measurements/20P1_POS0005_D_1UL_features.csv'),
 WindowsPath('measurements/20P1_POS0007_D_1UL_features.csv'),
 WindowsPath('measurements/20P1_POS0008_D_1UL_features.csv'),
 WindowsPath('measurements/20P1_POS0010_D_1UL_features.csv'),
 WindowsPath('measurements/A9_p5d_features.csv'),
 WindowsPath('measurements/A9_p7d_features.csv'),
 WindowsPath('measurements/A9_p9d_features.csv')]
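An equivalent, slightly more compact way to collect the files is `Path.glob`, which filters by pattern directly. The folder and file names below are made up for illustration only:

```python
from pathlib import Path
import tempfile

# Build a throwaway folder with hypothetical file names, just for illustration
root = Path(tempfile.mkdtemp())
for name in ["b_features.csv", "a_features.csv", "notes.txt"]:
    (root / name).touch()

# Path.glob filters by pattern, so no suffix check is needed;
# sorted() makes the order reproducible across operating systems.
file_paths = sorted(root.glob("*.csv"))
print([p.name for p in file_paths])  # → ['a_features.csv', 'b_features.csv']
```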

Next, we load them all into pandas dataframes. For this, we go through the files in a for-loop, load each one and then concatenate all the tabular data into a single table. Let's look at a single dataframe first to see what we are dealing with. The df.head(5) command shows us the first 5 rows of the table so that we don't clutter our output with tons of tabular data:

first_file = file_paths[0] # Get the first file path
df = pd.read_csv(first_file) # Load the first dataframe
df.head(5) # Display the first 5 rows of the dataframe
label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min
0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0
1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0
2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0
3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0
4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0

Let’s load everything now!

Hint: We may want to keep track of which dataset each measurement came from. To do so, we simply add another column to each dataframe that indicates from which image the respective measurement comes.

big_df = pd.DataFrame() # create empty dataframe

for file_path in file_paths: # iterate over all files
    df = pd.read_csv(file_path)  #load dataframe
    df['sample'] = file_path.stem  # add column with sample name
    big_df = pd.concat([big_df, df], axis=0)  # append table to big_df

big_df = big_df.reset_index(drop=True)  # reset index
big_df
label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min sample
0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0 17P1_POS0006_D_1UL_features
1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0 17P1_POS0006_D_1UL_features
2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0 17P1_POS0006_D_1UL_features
3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0 17P1_POS0006_D_1UL_features
4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0 17P1_POS0006_D_1UL_features
... ... ... ... ... ... ... ... ... ... ... ...
919 144 200.0 52.384776 18.530911 14.190467 0.643111 200.0 78.305000 142.0 28.0 A9_p9d_features
920 145 21.0 20.207107 12.476841 1.973087 0.987417 21.0 50.523810 65.0 39.0 A9_p9d_features
921 146 250.0 78.284271 32.580598 11.991364 0.929805 250.0 74.804000 151.0 31.0 A9_p9d_features
922 147 23.0 18.242641 9.710753 3.102526 0.947588 23.0 53.478261 69.0 40.0 A9_p9d_features
923 148 56.0 31.692388 14.232087 5.310151 0.927787 56.0 59.982143 85.0 37.0 A9_p9d_features

924 rows × 11 columns
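As a side note, growing a dataframe with pd.concat inside the loop copies all accumulated rows on every iteration. A common, slightly faster pattern is to collect the per-file tables in a list and concatenate once at the end. A minimal sketch with two hypothetical in-memory tables standing in for the loaded files:

```python
import pandas as pd

# Hypothetical small tables standing in for the per-file dataframes
df_a = pd.DataFrame({"label": [1, 2], "area": [29.0, 32.0]})
df_b = pd.DataFrame({"label": [1, 2, 3], "area": [316.0, 258.0, 59.0]})

tables = []
for name, df in [("sample_a", df_a), ("sample_b", df_b)]:
    df = df.copy()
    df["sample"] = name      # add column with sample name
    tables.append(df)        # collect instead of concatenating each time

# A single pd.concat call at the end; ignore_index=True replaces the
# separate reset_index(drop=True) step.
big = pd.concat(tables, ignore_index=True)
print(big.shape)  # → (5, 3)
```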

From the data here we can see that there are three different sample types - A9, 17P1 and 20P1. Let's add a column to the dataframe that indicates which experiment the respective image belongs to. We can do this by splitting the sample name at the underscores, taking the first part of the name and adding it as a new column to the dataframe.

big_df['experiment'] = big_df['sample'].apply(lambda x: x.split('_')[0]) # Split the sample name by underscore and take the first part as the experiment name
big_df
label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min sample experiment
0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0 17P1_POS0006_D_1UL_features 17P1
1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0 17P1_POS0006_D_1UL_features 17P1
2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0 17P1_POS0006_D_1UL_features 17P1
3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0 17P1_POS0006_D_1UL_features 17P1
4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0 17P1_POS0006_D_1UL_features 17P1
... ... ... ... ... ... ... ... ... ... ... ... ...
919 144 200.0 52.384776 18.530911 14.190467 0.643111 200.0 78.305000 142.0 28.0 A9_p9d_features A9
920 145 21.0 20.207107 12.476841 1.973087 0.987417 21.0 50.523810 65.0 39.0 A9_p9d_features A9
921 146 250.0 78.284271 32.580598 11.991364 0.929805 250.0 74.804000 151.0 31.0 A9_p9d_features A9
922 147 23.0 18.242641 9.710753 3.102526 0.947588 23.0 53.478261 69.0 40.0 A9_p9d_features A9
923 148 56.0 31.692388 14.232087 5.310151 0.927787 56.0 59.982143 85.0 37.0 A9_p9d_features A9

924 rows × 12 columns
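The same split can also be written without apply/lambda, using pandas' vectorized string methods. A small sketch on a hypothetical series of sample names:

```python
import pandas as pd

samples = pd.Series([
    "17P1_POS0006_D_1UL_features",
    "20P1_POS0005_D_1UL_features",
    "A9_p5d_features",
])

# .str.split works element-wise; .str[0] picks the first part of each split
experiment = samples.str.split("_").str[0]
print(experiment.tolist())  # → ['17P1', '20P1', 'A9']
```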

Plotting: Distributions#

Now for the actual plotting: Let’s try it with a histogram of the nuclei sizes first. The seaborn syntax is very simple: we pass the measurements table big_df directly to the plotting function (sns.histplot) and tell seaborn which variable (column) to take into account for the histogram.

We store each plot in a numbered variable (fig1, fig2, …) so that we can save it later on.

fig1 = sns.histplot(data=big_df, x='area_filled', bins=20)
[figure: histogram of area_filled]

Seaborn also offers to turn this into a smoothed distribution estimate (a kernel-density estimation):

fig2 = sns.histplot(data=big_df, x='area_filled', bins=100, kde=True)
[figure: histogram of area_filled with KDE overlay]

But where seaborn really shines is in comparing different groups in a dataset (i.e., categorical variables). If we want to compare directly how the nuclei sizes differ between the different conditions (i.e., images), we can simply pass sample to seaborn as the hue parameter:

fig3 = sns.histplot(data=big_df, x='area_filled', hue='sample', kde=True)
[figure: histograms of area_filled colored by sample]

Let's replace sample as the category with the experiment column:

fig4 = sns.histplot(data=big_df, x='area_filled', hue='experiment', kde=True)
[figure: histograms of area_filled colored by experiment]

If you only want the smoothed density estimates rather than the bars, consider using the kdeplot function instead of histplot:

fig5 = sns.kdeplot(data=big_df, x='area_filled', hue='experiment')
[figure: KDE curves of area_filled by experiment]

Plotting: Scatters#

For a more granular insight into the individual data points, let's take one step back and draw a good old scatterplot of two of the above variables against each other:

fig6 = sns.scatterplot(data=big_df, x='area_filled', y='eccentricity')
[figure: scatterplot of area_filled vs. eccentricity]

Again, we can simply pass a categorical variable as the hue parameter to highlight the different experiments:

fig7 = sns.scatterplot(data=big_df, x='area_filled', y='eccentricity', hue='experiment')
[figure: scatterplot of area_filled vs. eccentricity colored by experiment]

For a smaller number of features, you may want to plot everything against everything - something that would be tedious to do in plain matplotlib. We drop the label column, though; it holds no relevant biological information:

fig8 = sns.pairplot(data=big_df.drop(columns=['label']), hue='experiment')
[figure: pairplot of all feature columns colored by experiment]

Saving the plots#

Finally, we can save the plots to a file. Note that sns.pairplot returns a PairGrid object, which brings its own savefig method; we can also specify the resolution of the image in dpi (dots per inch). The command below saves the pairplot to a file called pairplot.png with a resolution of 300 dpi in the current working directory:

fig8.savefig('pairplot.png', dpi=300)
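The axes-level seaborn functions used earlier (sns.histplot, sns.scatterplot, …) return a matplotlib Axes rather than a figure, so to save one of those plots you first fetch the parent figure with get_figure(). A minimal sketch of that pattern, using a pandas histogram (which also returns an Axes) so it runs without the measurement data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, runs without a display
import pandas as pd

df = pd.DataFrame({"area_filled": [29.0, 32.0, 316.0, 258.0, 59.0]})
ax = df["area_filled"].plot.hist(bins=5)  # returns a matplotlib Axes
ax.get_figure().savefig("histogram.png", dpi=300)  # save the parent figure
```

The same `ax.get_figure().savefig(...)` call works on fig1 through fig7 above.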

Exercise#

  • Replace the area_filled variable with intensity_mean (or another column) in the code above and see how the plots change.

  • Try out other seaborn functions to visualize data distributions split by categories, like sns.catplot and sns.jointplot, as shown in the seaborn documentation.

Package versions used to run this notebook#

import watermark
%load_ext watermark
%watermark
%watermark --iversions
Last updated: 2025-06-30T16:42:38.967258+02:00

Python implementation: CPython
Python version       : 3.10.0
IPython version      : 8.37.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark: 2.5.0
seaborn  : 0.13.2
pandas   : 2.3.0