Sources and inspiration:

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

# !pip install seaborn
# !pip install scikit_posthocs
# !pip install watermark
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import mannwhitneyu, normaltest


Many libraries are available in Python to clean, analyze, and plot data. Python also has robust statistical packages which are used by thousands of other projects.

We will work with the penguins dataset from seaborn.

penguins = sns.load_dataset("penguins")
penguins_cleaned = penguins.dropna()
penguins_cleaned.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male

Exploring the Data#

We already prepared the penguin dataset with penguins_cleaned = penguins.dropna(), but we should double-check for missing values:

penguins_cleaned.isnull().sum()
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
penguins_cleaned.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male

There are 3 species of penguins. We can access their names by applying the .unique() method to the ‘species’ column, which returns the unique values in that column:

penguins_cleaned['species'].unique()
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

We can get the average of a measurement split by categories with the .pivot_table method.

penguins_cleaned.pivot_table('bill_length_mm', index='sex', columns='species', aggfunc='mean')
species Adelie Chinstrap Gentoo
Female 37.257534 46.573529 45.563793
Male 40.390411 51.094118 49.473770
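As a cross-check, the same table can be built with groupby followed by unstack. This is a minimal sketch on a tiny made-up frame (the values are invented for illustration, not the real penguin data):

```python
import pandas as pd

# A tiny invented frame standing in for the penguins data
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Female", "Male", "Female", "Male"],
    "bill_length_mm": [37.3, 40.4, 45.6, 49.5],
})

# pivot_table with a single aggfunc is equivalent to groupby + unstack
pivot = df.pivot_table("bill_length_mm", index="sex", columns="species", aggfunc="mean")
grouped = df.groupby(["sex", "species"])["bill_length_mm"].mean().unstack()
print(pivot)
```

Both produce a sex-by-species table of means; pivot_table is just a more declarative spelling of the same aggregation.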

Splitting data into sub-groups#

Furthermore, we can prepare a subset for each penguin species by feeding a different boolean mask to the dataframe.

Adelie_values =  penguins_cleaned[penguins_cleaned['species']=='Adelie']
Chinstrap_values = penguins_cleaned[penguins_cleaned['species']=='Chinstrap']
Gentoo_values = penguins_cleaned[penguins_cleaned['species']=='Gentoo']

Let’s explore the values for each species:

Adelie_values.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 146.000000 146.000000 146.000000 146.000000
mean 38.823973 18.347260 190.102740 3706.164384
std 2.662597 1.219338 6.521825 458.620135
min 32.100000 15.500000 172.000000 2850.000000
25% 36.725000 17.500000 186.000000 3362.500000
50% 38.850000 18.400000 190.000000 3700.000000
75% 40.775000 19.000000 195.000000 4000.000000
max 46.000000 21.500000 210.000000 4775.000000

Chinstrap_values.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 68.000000 68.000000 68.000000 68.000000
mean 48.833824 18.420588 195.823529 3733.088235
std 3.339256 1.135395 7.131894 384.335081
min 40.900000 16.400000 178.000000 2700.000000
25% 46.350000 17.500000 191.000000 3487.500000
50% 49.550000 18.450000 196.000000 3700.000000
75% 51.075000 19.400000 201.000000 3950.000000
max 58.000000 20.800000 212.000000 4800.000000

Gentoo_values.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 119.000000 119.000000 119.000000 119.000000
mean 47.568067 14.996639 217.235294 5092.436975
std 3.106116 0.985998 6.585431 501.476154
min 40.900000 13.100000 203.000000 3950.000000
25% 45.350000 14.200000 212.000000 4700.000000
50% 47.400000 15.000000 216.000000 5050.000000
75% 49.600000 15.750000 221.500000 5500.000000
max 59.600000 17.300000 231.000000 6300.000000

Applying Statistical Tests#

Normality Test#

normaltest() tests whether a sample differs from a normal distribution.

This function tests the null hypothesis that a sample comes from a normal distribution.

The p-value ranges between 0 and 1; a small value means the sample is unlikely to have come from a normal distribution.

from scipy.stats import normaltest
print("Adelie: ", normaltest(Adelie_values['bill_length_mm']).pvalue)
print("Chinstrap: ", normaltest(Chinstrap_values['bill_length_mm']).pvalue)
print("Gentoo: ", normaltest(Gentoo_values['bill_length_mm']).pvalue)
Adelie:  0.7046667395852243
Chinstrap:  0.9143668075479967
Gentoo:  0.002785628232779262

Traditionally, in statistics, we need a p-value of less than 0.05 to reject the null hypothesis. In this case, 2 out of 3 species have a p-value > 0.05. Because those p-values are greater than 0.05, we cannot reject the null hypothesis, so we have no evidence that those two samples deviate from normality.

But what about the last?

Aren’t we forgetting something? Do we know enough about the data?


# import matplotlib.pyplot as plt
# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))

# sns.histplot(x = "bill_length_mm", data = penguins_cleaned, binwidth=1, hue="species", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look weird
# ax.set(title="Bill Length")

# sns.move_legend(
#     ax, "upper center",
#     bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )

# f.savefig('penguins_species_bill-length_PNG.png', dpi=300)

A normal distribution is symmetric about the mean.
A normal distribution also has a specific width for a given height.

If you double the height, the width scales proportionally. However, you could imagine stretching a bell curve out in weird ways without changing its symmetry. You could have a sharp, pointy distribution, or a fat, boxy one.
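Those departures from the bell shape can be quantified: skewness measures asymmetry and kurtosis measures how pointy or boxy the tails are (normaltest itself combines both). A minimal sketch on synthetic data (the samples below are generated for illustration, not taken from the penguin dataset):

```python
import numpy as np
from scipy.stats import skew, kurtosis, normaltest

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=45, scale=3, size=500)   # roughly bell-shaped
skewed_sample = rng.exponential(scale=3, size=500)      # strongly right-skewed

# For a normal sample, skewness ~ 0 and excess kurtosis ~ 0
print("normal: skew=%.2f kurtosis=%.2f p=%.3f" % (
    skew(normal_sample), kurtosis(normal_sample), normaltest(normal_sample).pvalue))
print("skewed: skew=%.2f kurtosis=%.2f p=%.3g" % (
    skew(skewed_sample), kurtosis(skewed_sample), normaltest(skewed_sample).pvalue))
```

The skewed sample gets a large skewness value and a tiny normaltest p-value, while the normal sample stays close to zero on both measures.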


# import matplotlib.pyplot as plt

# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))

# sns.histplot(x = "bill_length_mm", data = Gentoo_values, binwidth=1, hue="sex", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look weird
# ax.set(title="Bill Length of Gentoo")

# sns.move_legend(
#     ax, "upper center",
#     bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )

# f.savefig('penguins_gentoo_bill-length_PNG.png', dpi=300)

Gentoo_values_male = Gentoo_values[Gentoo_values['sex']=='Male']
Gentoo_values_female = Gentoo_values[Gentoo_values['sex']=='Female']

print("Gentoo male: ", normaltest(Gentoo_values_male['bill_length_mm']).pvalue)
print("Gentoo female: ", normaltest(Gentoo_values_female['bill_length_mm']).pvalue)
Gentoo male:  0.0005453287292381343
Gentoo female:  0.9029679515828937

Luckily, we are able to test and compare sets of values even if they do not come from a normal distribution.

Comparing 2 groups#

Parametric Tests#


ttest_ind() calculates the T-test for the means of two independent samples of scores.

This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

There is a catch! We need to have either the same number of samples or the same variance. By default, ttest_ind() performs a standard independent two-sample test that assumes equal population variances. With equal_var=False, it performs Welch’s t-test, which does not assume equal population variance.

Note Using Student’s original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student’s original t-test is highly robust to the presence of unequal variances. Welch’s t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.
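To see why this matters, here is a sketch on synthetic data with deliberately unequal variances and sample sizes (the samples are invented for illustration): Student’s test (equal_var=True) and Welch’s test (equal_var=False) can give noticeably different p-values.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=200)   # large group, small variance
b = rng.normal(loc=0.4, scale=3.0, size=20)    # small group, large variance

student = ttest_ind(a, b, equal_var=True)    # assumes equal variances
welch = ttest_ind(a, b, equal_var=False)     # Welch's t-test

print("Student p =", student.pvalue)
print("Welch   p =", welch.pvalue)
```

The two p-values diverge because Welch’s version weights each group by its own variance and uses an adjusted number of degrees of freedom.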

Perform Levene test for equal variances.

The Levene test tests the null hypothesis that all input samples are from populations with equal variances.

from scipy.stats import ttest_ind, levene
# pvalues with scipy:
stat_results = [levene(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'])]

print("Adelie vs Chinstrap, variance: ", stat_results[0])

pvalues = [result.pvalue for result in stat_results]
Adelie vs Chinstrap, variance:  LeveneResult(statistic=4.529733833453024, pvalue=0.03446512682844295)

With p-value < 0.05, we reject the null hypothesis that all input samples come from populations with equal variances. So we need Welch’s t-test.

Null hypothesis is that 2 independent samples have identical average (expected) values.

# pvalues with scipy:
stat_results_ACh = [ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=False)]

print("Adelie vs Chinstrap, mean: ", stat_results_ACh[0])

pvalues = [result.pvalue for result in stat_results_ACh]
Adelie vs Chinstrap, mean:  TtestResult(statistic=-21.712498056635937, pvalue=3.1490764303457434e-41, df=108.17221912082128)

With p-value < 0.05, we reject the null hypothesis. Now, how can we plot this?


Mann-Whitney U Test#
from scipy.stats import mannwhitneyu

# pvalues with scipy:
stat_results_GFM = [mannwhitneyu(Gentoo_values_male['bill_length_mm'], Gentoo_values_female['bill_length_mm'], alternative="two-sided"),]

print("Gentoo male vs Gentoo female, bill_length_mm: ", stat_results_GFM[0])

pvalues = [result.pvalue for result in stat_results_GFM]
Gentoo male vs Gentoo female, bill_length_mm:  MannwhitneyuResult(statistic=3125.5, pvalue=5.58594267834405e-13)

Comparing more than 2 groups#

Analysis of Variance#

Now we have three samples, so a t-test is not appropriate. If we state the null hypothesis that there is no difference between the samples, we should apply a one-way ANOVA.

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. It is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

from scipy.stats import f_oneway

stat_results_f_oneway = f_oneway(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])
stat_results_f_oneway
F_onewayResult(statistic=397.29943741282835, pvalue=1.3809842053150027e-88)

Kruskal–Wallis test#

The Kruskal–Wallis test is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups.

from scipy.stats import kruskal

stat_results_kruskal = kruskal(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])
stat_results_kruskal
KruskalResult(statistic=236.8992355590763, pvalue=3.6139705965512625e-52)

These tests consider all groups at once, but we usually want to know which group makes the difference! For that we need pairwise comparisons, and with them, multiple testing correction.

Multiple testing correction#

For ANOVA there is Tukey’s HSD test; for non-parametric tests, there is Dunn’s test.

Tukey’s HSD test#

Tukey’s HSD is a pairwise post-hoc test for ANOVA that determines whether there is a difference between the means of all possible pairs using a studentized range distribution. This method tests every possible pair of groups.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey expects long-format data: one value column and one group column

data = penguins_cleaned[['species','bill_length_mm']]
# perform multiple pairwise comparison (Tukey HSD)
stat_results_pairwise_tukeyhsd = pairwise_tukeyhsd(endog=data['bill_length_mm'], groups=data['species'], alpha=0.05)
print(stat_results_pairwise_tukeyhsd)
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
  group1    group2  meandiff p-adj   lower   upper  reject
   Adelie Chinstrap  10.0099    0.0  8.9828 11.0369   True
   Adelie    Gentoo   8.7441    0.0  7.8801  9.6081   True
Chinstrap    Gentoo  -1.2658 0.0148 -2.3292 -0.2023   True
Dunn’s test#

If the results of a Kruskal-Wallis test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different.

# !pip install scikit_posthocs
data=[Adelie_values['bill_length_mm'].to_numpy(), Chinstrap_values['bill_length_mm'].to_numpy(), Gentoo_values['bill_length_mm'].to_numpy()]
# perform Dunn's test using a Bonferroni correction for the p-values
import scikit_posthocs as sp
sp.posthoc_dunn(data, p_adjust='bonferroni')
1 2 3
1 1.000000e+00 4.623888e-36 1.030275e-37
2 4.623888e-36 1.000000e+00 2.697766e-01
3 1.030275e-37 2.697766e-01 1.000000e+00
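As a sanity check, the Bonferroni idea behind p_adjust can be sketched by hand: run each pairwise Mann-Whitney test, then multiply the raw p-values by the number of comparisons (capping at 1). The arrays below are synthetic placeholders, not the real per-species columns:

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
groups = {
    "A": rng.normal(39, 3, 100),   # stand-ins for per-group measurements
    "B": rng.normal(49, 3, 100),
    "C": rng.normal(48, 3, 100),
}

pairs = list(combinations(groups, 2))
raw = {pair: mannwhitneyu(groups[pair[0]], groups[pair[1]],
                          alternative="two-sided").pvalue for pair in pairs}

# Bonferroni: multiply each raw p-value by the number of tests, cap at 1
adjusted = {pair: min(p * len(pairs), 1.0) for pair, p in raw.items()}
for pair in pairs:
    print(pair, "raw=%.3g adjusted=%.3g" % (raw[pair], adjusted[pair]))
```

Each adjusted p-value is at least as large as the raw one, which is exactly why clearly separated groups stay significant while borderline pairs may not survive the correction.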

(Bonus preview) Use statannotations to apply scipy test#

Finally, statannotations can take care of most of the steps required to run the test by calling scipy.stats directly and annotating the plot. The available options are:

  • Mann-Whitney

  • t-test (independent and paired)

  • Welch’s t-test

  • Levene test

  • Wilcoxon test

  • Kruskal-Wallis test

We will cover how to use a test that is not one of those already interfaced in statannotations. If you are curious, you can also take a look at the usage notebook in the project repository.

(figure: annotated violin plot with statistical comparisons)

from watermark import watermark
watermark(iversions=True, globals_=globals())
Last updated: 2023-08-25T10:49:29.031581+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark      : 2.4.3
numpy          : 1.23.5
pandas         : 2.0.3
seaborn        : 0.12.2
matplotlib     : 3.7.2
scipy          : 1.11.2
statannotations: 0.4.4