Scientific plotting with seaborn#

In this tutorial, we will take some of the measurements obtained in the course and visualize them using the seaborn Python library. Seaborn has the advantage that it interacts very nicely with the DataFrame format from the Python pandas library, a convenient way of handling tabular data in Python.

For context: you may be used to using matplotlib for plotting in Python. In a way, seaborn is just matplotlib, but it expands matplotlib's functionality and makes more complex visualizations much easier to create.

Let’s load all the packages we need here:

import seaborn as sns
import os
import pandas as pd

Loading the data#

As a first step, we need to load the measurements from napari. Alternatively, you can download some from the course repository (TODO: Add link). For this, we need to compile a list of all the .csv files we want to take into account.

root = r'./measurements'

files = os.listdir(root)
files
['17P1_POS0006_D_1UL_labels.csv',
 '17P1_POS0007_D_1UL_labels.csv',
 '17P1_POS0011_D_1UL_labels.csv',
 '20P1_POS0005_D_1UL_labels.csv',
 '20P1_POS0007_D_1UL_labels.csv',
 '20P1_POS0008_D_1UL_labels.csv',
 '20P1_POS0010_D_1UL_labels.csv',
 'A9_p5d_labels.csv',
 'A9_p7d_labels.csv',
 'A9_p9d_labels.csv']
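The folder above contains only csv files, so os.listdir is sufficient here. If your measurement folder might also hold other files, a small sketch using Python's glob module (the csv_files name is just illustrative) keeps only the .csv files:

from glob import glob

# Keep only the .csv files, ignoring anything else in the folder
csv_files = sorted(os.path.basename(path) for path in glob(os.path.join(root, '*.csv')))
csv_files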

Next, we load them all into pandas dataframes. For this, we go through the data in a for-loop, load each file and then concatenate all the tabular data into a single table. Let’s look into a single dataframe first to see what we are dealing with. The df.head(5) command shows us the first 5 rows of the table so that we don’t clutter our output with tons of tabular data:

first_file = os.path.join(root, files[0])
df = pd.read_csv(first_file)
df.head(5)
Unnamed: 0 label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min index
0 0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0 1
1 1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0 2
2 2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0 3
3 3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0 4
4 4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0 5

We notice that there are a few irrelevant columns. The index column is a remnant of the analysis in napari, so let's drop it:

df.drop(columns=['index']).head(5)
Unnamed: 0 label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min
0 0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0
1 1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0
2 2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0
3 3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0
4 4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0

The Unnamed: 0 column is essentially just the row index written into the csv file; pandas loads it by default but doesn't know where to put it. We can tell pandas to use this column as the row index upon import; the quick check below shows the effect on the first file.
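As a minimal sketch (using index_col=0, the positional equivalent of naming the column), re-reading the first file shows that the extra column simply becomes the row index:

# Re-read the first file, letting the first (unnamed) column become the row index
pd.read_csv(first_file, index_col=0).head(5)

With that sorted out, let's load everything!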

Hint: We may want to keep track of which dataset each measurement came from. To do so, we simply add another column to each dataframe that indicates the image from which the respective measurement was taken.

big_df = pd.DataFrame()

for file in files:
    full_filename = os.path.join(root, file)
    df = pd.read_csv(full_filename, index_col='Unnamed: 0')  #load dataframe
    df = df.drop(columns=['index'])  # drop irrelevant index column
    df['sample'] = file  # add column with sample name

    big_df = pd.concat([big_df, df], axis=0)  # append table to big_df
big_df
label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min sample
0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0 17P1_POS0006_D_1UL_labels.csv
1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0 17P1_POS0006_D_1UL_labels.csv
2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0 17P1_POS0006_D_1UL_labels.csv
3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0 17P1_POS0006_D_1UL_labels.csv
4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0 17P1_POS0006_D_1UL_labels.csv
... ... ... ... ... ... ... ... ... ... ... ...
102 103 509.0 91.597980 34.431846 19.886244 0.816353 509.0 59.245580 118.0 31.0 A9_p9d_labels.csv
103 104 87.0 41.106602 18.743849 6.228837 0.943169 87.0 72.103448 138.0 25.0 A9_p9d_labels.csv
104 105 129.0 46.727922 20.222045 8.532153 0.906632 129.0 80.813953 126.0 26.0 A9_p9d_labels.csv
105 106 196.0 54.142136 20.214850 12.912128 0.769419 196.0 80.387755 171.0 21.0 A9_p9d_labels.csv
106 107 30.0 21.071068 9.779252 4.102249 0.907762 30.0 62.700000 83.0 37.0 A9_p9d_labels.csv

883 rows × 11 columns
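As a side note: calling pd.concat inside the loop copies the growing table on every iteration. For ten small files this is perfectly fine; an equivalent and slightly more idiomatic sketch collects the per-file dataframes in a list first and concatenates them once:

# Collect the per-file dataframes, then concatenate them in one go
dfs = []
for file in files:
    df = pd.read_csv(os.path.join(root, file), index_col='Unnamed: 0')
    df = df.drop(columns=['index'])
    df['sample'] = file
    dfs.append(df)

big_df = pd.concat(dfs, axis=0)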

From the file names we can see that there are three different sample types - A9, 17P1 and 20P1. Let's add a column to the dataframe that indicates from which type the respective image was taken:

big_df['experiment'] = big_df['sample'].apply(lambda x: x.split('_')[0])
big_df.head()
label area_filled perimeter axis_major_length axis_minor_length eccentricity area intensity_mean intensity_max intensity_min sample experiment
0 1 29.0 18.863961 7.853437 4.935022 0.777898 29.0 58.724138 67.0 49.0 17P1_POS0006_D_1UL_labels.csv 17P1
1 2 32.0 24.242641 12.953143 3.174186 0.969510 32.0 73.343750 118.0 40.0 17P1_POS0006_D_1UL_labels.csv 17P1
2 3 316.0 65.112698 22.947857 17.642785 0.639465 316.0 95.332278 248.0 22.0 17P1_POS0006_D_1UL_labels.csv 17P1
3 4 258.0 60.284271 22.477335 14.716400 0.755870 258.0 113.585271 204.0 19.0 17P1_POS0006_D_1UL_labels.csv 17P1
4 5 59.0 32.278175 15.198125 5.328297 0.936529 59.0 80.881356 162.0 35.0 17P1_POS0006_D_1UL_labels.csv 17P1
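To double-check that the split worked as intended, we can count how many measurements each experiment contributes (value_counts is standard pandas; the exact numbers depend on your data):

# Number of measured nuclei per experiment
big_df['experiment'].value_counts()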

Plotting: Distributions#

Now for the actual plotting: Let’s try it with a histogram of the nuclei sizes first. The seaborn syntax is very simple: We pass the measurements table big_df directly to the plotting function (sns.histplot) and tell seaborn which variable to take into account for the histogram.

sns.histplot(data=big_df, x='area_filled', bins=100)
<Axes: xlabel='area_filled', ylabel='Count'>
[Figure: histogram of area_filled]

Seaborn also offers to overlay a smoothed distribution estimate (a kernel density estimate) on top of the histogram:

sns.histplot(data=big_df, x='area_filled', bins=100, kde=True)
<Axes: xlabel='area_filled', ylabel='Count'>
[Figure: histogram of area_filled with kernel density estimate]

But where seaborn really shines is when it comes to comparing different groups in a dataset (i.e., categorical variables). If we want to compare directly how the nuclei sizes differ between the different conditions (i.e., images), we can simply pass the grouping column to seaborn as the hue parameter:

sns.histplot(data=big_df, x='area_filled', hue='sample', kde=True)
<Axes: xlabel='area_filled', ylabel='Count'>
[Figure: histograms of area_filled per sample, with kernel density estimates]

Let's use the experiment column instead of sample as the category:

sns.histplot(data=big_df, x='area_filled', hue='experiment', kde=True)
<Axes: xlabel='area_filled', ylabel='Count'>
[Figure: histograms of area_filled per experiment, with kernel density estimates]

If you only want the smoothed density estimates rather than the bars, consider using the kdeplot function rather than the histogram:

sns.kdeplot(data=big_df, x='area_filled', hue='sample')
<Axes: xlabel='area_filled', ylabel='Density'>
[Figure: kernel density estimates of area_filled per sample]

Plotting: Scatters#

For a more granular insight into the individual data points, let's take one step back and draw a good old scatter plot of two of the above variables against each other:

sns.scatterplot(data=big_df, x='area_filled', y='eccentricity')
<Axes: xlabel='area_filled', ylabel='eccentricity'>
[Figure: scatter plot of eccentricity vs. area_filled]

Again, we can simply pass a categorical variable as the hue parameter to highlight the different samples. Let's look at a different property:

sns.scatterplot(data=big_df, x='intensity_mean', y='eccentricity', hue='experiment')
<Axes: xlabel='intensity_mean', ylabel='eccentricity'>
[Figure: scatter plot of eccentricity vs. intensity_mean, colored by experiment]

For a smaller number of features, you may want to plot everything against everything, something that would be quite tedious to do in plain matplotlib. We drop the label column, though, as it holds no information about any relevant biology:

sns.pairplot(data=big_df.drop(columns=['label']), hue='experiment')
<seaborn.axisgrid.PairGrid at 0x1fecb2a4c10>
[Figure: pair plot of all features, colored by experiment]
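If the full grid is too crowded, sns.pairplot also accepts a vars argument to restrict the grid to a subset of columns; the selection below is just an example, pick whatever is relevant for your question:

# Pair plot restricted to an example subset of features
sns.pairplot(
    data=big_df,
    vars=['area_filled', 'eccentricity', 'intensity_mean'],  # example subset
    hue='experiment',
)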