Using Seaborn to Plot Distributions#

Sources and inspiration:

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

# !pip install watermark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Brief Data Exploration with pandas#

We can work with one of seaborn's example datasets, Penguins.

penguins = sns.load_dataset("penguins")

The pandas package can give us a quick overview of the data.

penguins.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 342.000000 342.000000 342.000000 342.000000
mean 43.921930 17.151170 200.915205 4201.754386
std 5.459584 1.974793 14.061714 801.954536
min 32.100000 13.100000 172.000000 2700.000000
25% 39.225000 15.600000 190.000000 3550.000000
50% 44.450000 17.300000 197.000000 4050.000000
75% 48.500000 18.700000 213.000000 4750.000000
max 59.600000 21.500000 231.000000 6300.000000

Since describe() summarizes only numeric columns by default, we also need to look at the first few rows and list the unique strings in the text columns.

penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
penguins["species"].unique()
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
penguins["island"].unique()
array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)
penguins["sex"].unique()
array(['Male', 'Female', nan], dtype=object)
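
As a complement to calling unique() column by column, the text columns can also be summarized in one step. The sketch below runs on a small hand-made table (hypothetical values, not the real penguins data) so that it works without downloading anything; the name `toy` is only illustrative.

```python
import pandas as pd

# Hypothetical mini-table standing in for the penguins data
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "sex": ["Male", "Female", None, "Female"],
        "body_mass_g": [3750.0, 3800.0, 5000.0, 3700.0],
    }
)

# Summarize every non-numeric column: count, unique, top, freq
text_summary = toy.describe(exclude=["number"])
print(text_summary)

# Count each category and keep missing values visible
sex_counts = toy["sex"].value_counts(dropna=False)
print(sex_counts)
```

In the notebook itself, the same two calls could be applied to `penguins` instead of `toy`.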

Let’s take a look at missing data (NaNs) with the isnull() method.

penguins.isnull().sum()
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

We will drop every row that contains a missing value with the dropna() method.

penguins_cleaned = penguins.dropna()
penguins_cleaned.isnull().sum()
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
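
Calling dropna() without arguments removes a row as soon as any column is missing. If only some columns matter for a plot, the `subset` parameter keeps more rows. A minimal sketch on a hypothetical table (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 1 lacks mass and sex, row 3 lacks only sex
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, np.nan, 3450.0, 5000.0, 5200.0],
        "sex": ["Male", np.nan, "Female", np.nan, "Male"],
    }
)

# Drop a row if ANY column is missing (the default behaviour)
strict = toy.dropna()

# Drop a row only if the measurement itself is missing
lenient = toy.dropna(subset=["body_mass_g"])

print(len(toy), len(strict), len(lenient))  # 5 3 4
```

Which variant is appropriate depends on the analysis; this notebook uses the strict version so that every plot works on the same rows.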

Of course, we can do a quick visualisation with pandas, but it is better suited to exploring the data than to sharing the resulting charts.

species_counts = penguins_cleaned["species"].value_counts()
species_counts.plot(kind="pie", autopct="%.2f%%")
<Axes: ylabel='count'>
../_images/c89bffd3dd80f1669e648613ec992fc69c206e0618d5bafc0de9730c767e26aa.png

One of my favourite things about seaborn is that its documentation includes an Example Gallery from which you can simply copy and paste the code for charts. But there is a catch! What is missing?

Preparing charts for any occasion#

Setting the Theme and Color Palette#

We can use the set_theme() function, which changes the global defaults for all plots that use the matplotlib system. This lets us set the style and color palette once for the rest of the notebook.

You can explore more in Controlling figure aesthetics or Choosing color palettes.

We can control the figure size and some axes parameters by using plt.subplots from matplotlib, which gives us access to the figure and the axes objects.

# Apply the theme
sns.set_theme(style="ticks", palette="colorblind")
# sns.set_theme(style="white")
# sns.set_theme(style="dark")

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(5, 5))
../_images/c84811f6a84807840bb1ab33d58902062e221a12c961d2b067fd8b1c8f923243.png
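
The global theme can also be overridden for a single figure. The sketch below is one possible way to do that with the `sns.axes_style` context manager; the title and axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Global defaults for every later plot
sns.set_theme(style="ticks", palette="colorblind")

# Temporary style that applies only to axes created inside the block
with sns.axes_style("whitegrid"):
    figure, axes = plt.subplots(figsize=(6, 3))

# The axes object gives direct access to titles and labels
axes.set(title="Empty canvas", xlabel="x value", ylabel="y value")

width, height = figure.get_size_inches()
print(width, height)  # 6.0 3.0
```

Figures created after the `with` block fall back to the theme chosen in set_theme().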

Axes-level plots#

Seaborn has two hierarchy levels of plotting functions: figure-level and axes-level.

A figure-level function, like relplot, generates the full figure and arranges the axes inside it depending on the provided parameters. An axes-level function generates a plot that is drawn into a given axes object. This allows having different types of plots in the same figure, because we can put each plot in a different axes object.

image.png

Pivot tables and Heat maps#

As shown before, we can generate heatmaps from a pivoted table and from a correlation matrix.

# penguins_cleaned.groupby(['species', 'island'])['body_mass_g'].aggregate('mean').unstack()
pivot_table = penguins_cleaned.pivot_table("body_mass_g", index=["island", "species"]).unstack()

pivot_table.head()
body_mass_g
species Adelie Chinstrap Gentoo
island
Biscoe 3709.659091 NaN 5092.436975
Dream 3701.363636 3733.088235 NaN
Torgersen 3708.510638 NaN NaN
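
The commented-out groupby line and the pivot_table call describe the same aggregation. The sketch below checks this on a hypothetical table (invented masses). It also passes `columns="species"` directly, which is an alternative to unstack() that yields single-level column labels.

```python
import pandas as pd

# Hypothetical measurements, two islands and three species
toy = pd.DataFrame(
    {
        "island": ["Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap", "Chinstrap"],
        "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 3600.0, 3700.0, 3900.0],
    }
)

# Mean body mass per island (rows) and species (columns)
via_pivot = toy.pivot_table(
    values="body_mass_g", index="island", columns="species", aggfunc="mean"
)

# The same table built with groupby and unstack
via_groupby = toy.groupby(["island", "species"])["body_mass_g"].mean().unstack()

print(via_pivot)
```

Combinations that never occur in the data, such as Gentoo on Dream in this toy table, appear as NaN in both results.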

Below we create an empty canvas with instances of a figure object (figure) and an axes object (axes).

We make a heatmap with the heatmap function and draw it on those axes by passing ax=axes.

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(5, 10))

# Draw the heatmap
sns.heatmap(
    data=pivot_table/1000,
    center=6,
    square=True,
    linewidths=.5, cbar_kws={"shrink": .5},
    ax = axes,
    #annot=True,
    #cmap="coolwarm", #Spectral)
)

axes.set(title="Mean weight of penguins, kg")
axes.set_xticklabels(penguins_cleaned.species.unique())
[Text(0.5, 0, 'Adelie'), Text(1.5, 0, 'Chinstrap'), Text(2.5, 0, 'Gentoo')]
../_images/8bd5e17c3605ee1609e4a00b46ca4b035f2ed441f3368d7b8a060b6a0144ab2d.png

Some charts have a fixed shape. Here, square=True forces every cell to be square, so if you set a figsize with a different aspect ratio, the chart keeps its size and shape and the rest of the image stays empty. Look at the saved output of this figure.

figure.savefig('heatmap_uneven_PNG.png', dpi=300)
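
The other option mentioned at the start of this section, a heatmap of a correlation matrix, could look like the sketch below. The data is synthetic (the column names are placeholders); with the penguins data one would presumably start from `penguins_cleaned.corr(numeric_only=True)`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: body_mass depends on flipper_length, noise does not
rng = np.random.default_rng(seed=1)
length = rng.normal(loc=200, scale=14, size=100)
toy = pd.DataFrame(
    {
        "flipper_length": length,
        "body_mass": 20 * length + rng.normal(scale=150, size=100),
        "noise": rng.normal(size=100),
    }
)

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()

# A square figure fits the square matrix, so no empty margin appears
figure, axes = plt.subplots(figsize=(5, 5))
sns.heatmap(
    data=corr,
    vmin=-1, vmax=1, center=0,   # symmetric color scale around zero
    cmap="coolwarm",
    annot=True,                  # write the coefficient into each cell
    square=True,
    ax=axes,
)
axes.set(title="Correlation matrix (synthetic data)")
```

Fixing `vmin`, `vmax` and `center` keeps the color scale comparable between different correlation heatmaps.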

Scatter plot#

The scatter plot is used to display the relationship between two variables. Let’s see the scatter plot of bill (culmen) length and depth by penguin species.

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(10, 5))

# Make a scatterplot
sns.scatterplot(
    data=penguins_cleaned,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    ax = axes
    )
# Give the plot a title
plt.title("Bill Length vs Bill Depth", size=20, color="red") #matplotlib way to define title

# Improve the legend
sns.move_legend(
    obj=axes,
    loc="lower right",
    ncol=3,
    frameon=True,
    columnspacing=1,
    handletextpad=0
)
../_images/e66b14e401f8e786a9419450737df66aad7f5d130dcf1dccfe99468e54b62b3d.png
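
plt.title acts on whichever axes is currently active. When an axes object is already at hand, the title can be set on it directly, which is less ambiguous in figures with several subplots. A small sketch with made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up points in two groups
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [2, 1, 4, 3], "group": ["a", "a", "b", "b"]}
)

figure, axes = plt.subplots(figsize=(6, 3))
sns.scatterplot(data=toy, x="x", y="y", hue="group", ax=axes)

# Same effect as plt.title(...), but tied explicitly to this axes object
axes.set_title("x vs y", fontsize=20, color="red")
axes.set(xlabel="x (arbitrary units)", ylabel="y (arbitrary units)")
```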

Histogram#

A histogram shows the distribution of one or more variables by counting the observations that fall into each bin. Now let’s see the histogram of the bill length using the histplot function.

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(10, 5))

sns.histplot(
    data = penguins_cleaned,
    x = "bill_length_mm",
    binwidth=1,
    ax = axes
    )
plt.title("Bill Length", size=20, color="red")
Text(0.5, 1.0, 'Bill Length')
../_images/602fad041d662361ddbabb6261988c8813911e5ece09dce8aee297308e9e400b.png

We can easily display histograms of subsets in different colors with the same function. We just have to pass some additional parameters, such as assigning the ‘species’ column of the dataframe to the hue parameter.

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(10, 5))

sns.histplot(
    data = penguins_cleaned,
    x = "bill_length_mm",
    binwidth = 1,
    hue = "species",
    kde = True,
    ax = axes
    )
axes.set(title="Bill Length")

sns.move_legend(
    axes, "upper center",
    bbox_to_anchor=(.5, 1),
    ncol=3,
    title=None,
    frameon=False,
)
../_images/984ba55fa760ee5385406c23cb43b728447799649cff34cce036d9ec1068eebc.png

Bar plot#

A bar plot represents an estimate of the central tendency (by default the mean) for a numeric variable with the height of each rectangle. Let’s see the bar plot showing the bill lengths of penguin species, split by sex.

# Set up the matplotlib figure
figure, axes = plt.subplots(figsize=(10, 5))

sns.barplot(
    data = penguins_cleaned, 
    x = "species", 
    y = "bill_length_mm", 
    hue = "sex",
    ax = axes
    )
axes.set(title="Bill Length for 3 Penguin Species by Sex")
[Text(0.5, 1.0, 'Bill Length for 3 Penguin Species by Sex')]
../_images/591ffeec8c5e7d9f47b667edefd4a8917d5998360de07d57d8770a92db283bf7.png

Box plot#

The box plot is used to compare the distribution of numerical data between levels of a categorical variable. Let’s see the distribution of flipper length by species.

figure, axes = plt.subplots(figsize=(10, 5))

sns.boxplot(
    data =penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    ax = axes
    )
axes.set(title="Flipper Length for 3 Penguin Species")
[Text(0.5, 1.0, 'Flipper Length for 3 Penguin Species')]
../_images/a1fbc2262c4bec3f05b9887e0e1a47658a7a7f498e00daaacd7958c9fba5d487.png

You can use the hue parameter to see a boxplot of flipper lengths of species by sex.

figure, axes = plt.subplots(figsize=(8, 5))

sns.boxplot(
    data = penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    hue="sex",
    ax = axes)
axes.set(title="Flipper Length for 3 Penguin Species by Sex")
[Text(0.5, 1.0, 'Flipper Length for 3 Penguin Species by Sex')]
../_images/e57e3649d30661ac1ec514160df29610828a8aae733f1ba8a0942aa64ba0eb92.png
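
The elements of a box plot correspond to a few summary numbers. The sketch below computes them with pandas for a made-up series, assuming the common convention (also the default in seaborn and matplotlib) that whiskers reach at most 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Box edges and center line
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these limits are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)   # 3.25 5.5 7.75
print(list(outliers))   # [100]
```

Everything else about the sample, for example whether it has one peak or two, is not visible in these numbers.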

But does the box plot represent the distribution in the best way? Source

BoxViolin.gif

Violin plot#

violin-graph.png

  1. Violin graphs are similar to box plots but provide additional benefits.

  • Box plots are widely used to show medians, ranges, and variability in different groups.

  • However, box plots can be misleading: the shape of the data’s distribution can change while the summary statistics stay the same.

  • Violin plots carry the information of a box plot and also reveal such changes in the shape of the distribution.

  2. Violin graphs combine the advantages of density plots with improved readability.

  • The shape of a violin plot comes from the data’s density plot, which is turned sideways and mirrored on both sides of the box plot.

  • The thicker part of the violin represents values that occur more frequently, while the thinner part represents values that occur less frequently.

  • Violin plots are preferred over density plots when there are many groups, as overlapping density plots can become difficult to interpret.

  3. Violin graphs are visually intuitive and attractive.

  • Different groups are compared by placing their violin plots side by side, avoiding confusion about colors or patterns.

  • Violin plots are easy to read: in seaborn’s default style, the median is shown as a dot, the interquartile range as a thick bar, and the whiskers as a thin line reaching the most extreme points within 1.5 times the interquartile range.

  • The shape of the violin displays the frequencies of values, making it visually informative and appealing.

  4. Violin graphs are non-parametric and suitable for various data types.

  • Unlike bar graphs with means and error bars, violin plots reflect all data points rather than a single summary value.

  • They can be drawn for small samples, although the smoothed density should then be read with care, and they compare a quantitative variable across qualitative groups.

  • Violin plots do not require data to follow a normal distribution, making them a versatile visualization tool.
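
The first point in the list can be illustrated with two artificial samples that were constructed to share minimum, quartiles and maximum. The sample names `even` and `clustered` are invented for this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Sample A: evenly spread values
even = np.linspace(0, 10, 101)

# Sample B: two tight clusters, built to share min, quartiles and max with A
clustered = np.concatenate(
    [[0.0], np.linspace(2.0, 3.0, 49), [5.0], np.linspace(7.0, 8.0, 49), [10.0]]
)

toy = pd.DataFrame(
    {
        "sample": ["even"] * len(even) + ["clustered"] * len(clustered),
        "value": np.concatenate([even, clustered]),
    }
)

# Both rows show the same five-number summary
summary = toy.groupby("sample")["value"].quantile([0, 0.25, 0.5, 0.75, 1]).unstack()
print(summary)

# Same data drawn twice: box plot on the left, violin plot on the right
figure, (left, right) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
sns.boxplot(data=toy, x="sample", y="value", ax=left)
sns.violinplot(data=toy, x="sample", y="value", ax=right)
left.set(title="Box plot")
right.set(title="Violin plot")
```

The two boxes should look the same, while the violin of the clustered sample shows two bulges around 2.5 and 7.5.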

figure, axes = plt.subplots(figsize=(8, 5))

sns.violinplot(
    data = penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    ax = axes
    )
<Axes: xlabel='species', ylabel='flipper_length_mm'>
../_images/c4889e5c98032a23616da8f8b1e88c0d693ca219804d389b12cca6288a2e2cc6.png
figure, axes = plt.subplots(figsize=(8, 5))

sns.violinplot(
    data = penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    hue = "sex",
    ax = axes
    )
<Axes: xlabel='species', ylabel='flipper_length_mm'>
../_images/edecda7fb7a4b1072a71b490f438f3e18aed553876580129dcd177c738e8bf04.png

It’s also possible to “split” the violins when the hue parameter has only two levels, which can allow for a more efficient use of space:

figure, axes = plt.subplots(figsize=(8, 5))

sns.violinplot(
    data = penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    hue = "sex",
    ax = axes,
    split = True,
    )
<Axes: xlabel='species', ylabel='flipper_length_mm'>
../_images/a8ea54c50d94a003e3aee0aa627b98cfb24a744e46a6367316699bcd4ec7619d.png

It can also be useful to combine swarmplot() or stripplot() with a box plot or violin plot to show each observation along with a summary of the distribution:

figure, axes = plt.subplots(figsize=(8, 5))

sns.violinplot(
    data = penguins_cleaned,
    x = "species",
    y = "flipper_length_mm",
    ax = axes,
    )

sns.swarmplot(
    data=penguins_cleaned,
    x="species",
    y="flipper_length_mm",
    color="k",
    size=3,
    ax=axes)
<Axes: xlabel='species', ylabel='flipper_length_mm'>
../_images/f86d8f07afffe55a55ec4d83ab729bc392f2fd0741238561aaf60597671ca674.png
figure.savefig("swarm_test.png")
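
The other combination mentioned above, stripplot() on top of a box plot, can be sketched in the same way. The data is synthetic and the group labels are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic values for three groups of 30 observations each
rng = np.random.default_rng(seed=3)
toy = pd.DataFrame(
    {
        "group": np.repeat(["a", "b", "c"], 30),
        "value": np.concatenate(
            [rng.normal(190, 6, 30), rng.normal(196, 7, 30), rng.normal(217, 6, 30)]
        ),
    }
)

figure, axes = plt.subplots(figsize=(8, 4))

# Summary layer: boxes without outlier markers, since every point is drawn anyway
sns.boxplot(data=toy, x="group", y="value", showfliers=False, color="lightgray", ax=axes)

# Observation layer: one jittered dot per row
sns.stripplot(data=toy, x="group", y="value", color="k", size=3, jitter=0.2, ax=axes)

# Count the dots that were drawn
n_points = sum(len(collection.get_offsets()) for collection in axes.collections)
print(n_points)
```

stripplot() places points with random jitter and is fast; swarmplot() arranges them so that they do not overlap, which can become slow or crowded for large samples.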

Learn more about plotting categorical data here.

Other figure-level plots#

Joint Plot#

To have a joint distribution of two variables with the marginal distributions on the sides, we can use jointplot.

sns.jointplot(
    data=penguins_cleaned,
    x="flipper_length_mm",
    y="bill_depth_mm"
    )
<seaborn.axisgrid.JointGrid at 0x1f4493a6130>
../_images/a3231026f76bd70789027546df44128cb26dfc2c84537524f41b3cde2c288fb6.png

As expected, it is possible to separate groups by passing a categorical property to the hue argument. This also affects the marginal distributions, turning them from histograms into KDE plots.

sns.jointplot(
    data=penguins_cleaned,
    x="flipper_length_mm",
    y="bill_depth_mm",
    hue='species',
    height=5)
<seaborn.axisgrid.JointGrid at 0x1f44ac65fa0>
../_images/36026b9781adb3b09fd72152f1ab82b31489d4b1cd6c824bc4e4033f2652e6d0.png

Pair plot#

You can use the pairplot function to see the pairwise relations of the variables. It is perfect for exploring your dataset quickly. This function creates cross-plots of each pair of numeric variables in the dataset. Let’s see the pairs of numerical variables according to penguin species in the dataset.

figure = sns.pairplot(
    data=penguins_cleaned,
    hue = "species",
    height=3
    )
../_images/b051d964dd8d2fc7cf182853a172867a85985cff0912bbb13f36faf26d316418.png
figure.savefig('pairplot_empty_PNG.png', dpi=300)

If you get an empty image, you need to close the previous plots with matplotlib.pyplot.close().

import matplotlib.pyplot as plt
plt.close()  # closes the current figure

And it should work.

figure = sns.pairplot(
    data = penguins_cleaned,
    hue = "species",
    height=3
    )
../_images/b051d964dd8d2fc7cf182853a172867a85985cff0912bbb13f36faf26d316418.png
figure.savefig('pairplot_working_PNG.png', dpi=300)

Exercise#

Use the following images and the corresponding data

from skimage.io import imread
image1 = imread("../../data/BBBC007_batch/20P1_POS0010_D_1UL.tif")
image2 = imread("../../data/BBBC007_batch/20P1_POS0007_D_1UL.tif")
df = pd.read_csv("../../data/BBBC007_analysis.csv")
data = df[ (df['file_name'] == '20P1_POS0010_D_1UL') | (df['file_name'] == '20P1_POS0007_D_1UL')]
data.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name
0 139 96.546763 17.504104 10.292770 1.700621 20P1_POS0010_D_1UL
1 360 86.613889 35.746808 14.983124 2.385805 20P1_POS0010_D_1UL
2 43 91.488372 12.967884 4.351573 2.980045 20P1_POS0010_D_1UL
3 140 73.742857 18.940508 10.314404 1.836316 20P1_POS0010_D_1UL
4 144 89.375000 13.639308 13.458532 1.013432 20P1_POS0010_D_1UL
  1. Create a figure with two rows and two columns of subplots.

  2. Display the two images loaded above in the first row of subplots.

  3. Display graphs of the corresponding data in the second row. Choose any axes-level function you like.


from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,matplotlib,scipy"))
Last updated: 2023-08-25T16:39:55.148366+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark : 2.4.3
numpy     : 1.23.5
pandas    : 2.0.3
seaborn   : 0.12.2
matplotlib: 3.7.2
scipy     : 1.11.2