Statistics#
Sources and inspiration:
https://pandas.pydata.org/docs/user_guide/visualization.html
https://levelup.gitconnected.com/statistics-on-seaborn-plots-with-statannotations-2bfce0394c00
If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.
# !pip install seaborn
# !pip install scikit_posthocs
# !pip install watermark
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import mannwhitneyu, normaltest
Introduction#
Many libraries are available in Python to clean, analyze, and plot data. Python also has robust statistical packages that are used by thousands of other projects.
We will work with the penguins dataset from seaborn.
penguins = sns.load_dataset("penguins")
penguins_cleaned = penguins.dropna()
penguins_cleaned.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
Exploring the Data#
We already prepared the penguin dataset with penguins_cleaned = penguins.dropna(), but we should double-check.
penguins_cleaned.isnull().sum()
species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
dtype: int64
penguins_cleaned.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
There are 3 species of penguins. We can access their names by applying the .unique() method on the 'species' column; it returns the unique values in that column.
penguins_cleaned['species'].unique()
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
We can get the average of a measurement, split by categories, with the .pivot_table() method.
penguins_cleaned.pivot_table('bill_length_mm', index='sex', columns='species', aggfunc='mean')
| sex \ species | Adelie | Chinstrap | Gentoo |
|---|---|---|---|
| Female | 37.257534 | 46.573529 | 45.563793 |
| Male | 40.390411 | 51.094118 | 49.473770 |
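As an aside, the same summary can be built without pivot_table; a minimal equivalent sketch using groupby on the cleaned DataFrame from above:
# Mean bill length per (sex, species), with species unstacked into columns
# to match the pivot_table layout shown above
penguins_cleaned.groupby(['sex', 'species'])['bill_length_mm'].mean().unstack('species')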
Splitting data into sub-groups#
Furthermore, we can also prepare a subset for each penguin species by applying a different boolean mask to the DataFrame.
Adelie_values = penguins_cleaned[penguins_cleaned['species']=='Adelie']
Chinstrap_values = penguins_cleaned[penguins_cleaned['species']=='Chinstrap']
Gentoo_values = penguins_cleaned[penguins_cleaned['species']=='Gentoo']
Let's explore the values of each species.
Adelie_values.describe()
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 146.000000 | 146.000000 | 146.000000 | 146.000000 |
| mean | 38.823973 | 18.347260 | 190.102740 | 3706.164384 |
| std | 2.662597 | 1.219338 | 6.521825 | 458.620135 |
| min | 32.100000 | 15.500000 | 172.000000 | 2850.000000 |
| 25% | 36.725000 | 17.500000 | 186.000000 | 3362.500000 |
| 50% | 38.850000 | 18.400000 | 190.000000 | 3700.000000 |
| 75% | 40.775000 | 19.000000 | 195.000000 | 4000.000000 |
| max | 46.000000 | 21.500000 | 210.000000 | 4775.000000 |
Chinstrap_values.describe()
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 68.000000 | 68.000000 | 68.000000 | 68.000000 |
| mean | 48.833824 | 18.420588 | 195.823529 | 3733.088235 |
| std | 3.339256 | 1.135395 | 7.131894 | 384.335081 |
| min | 40.900000 | 16.400000 | 178.000000 | 2700.000000 |
| 25% | 46.350000 | 17.500000 | 191.000000 | 3487.500000 |
| 50% | 49.550000 | 18.450000 | 196.000000 | 3700.000000 |
| 75% | 51.075000 | 19.400000 | 201.000000 | 3950.000000 |
| max | 58.000000 | 20.800000 | 212.000000 | 4800.000000 |
Gentoo_values.describe()
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 119.000000 | 119.000000 | 119.000000 | 119.000000 |
| mean | 47.568067 | 14.996639 | 217.235294 | 5092.436975 |
| std | 3.106116 | 0.985998 | 6.585431 | 501.476154 |
| min | 40.900000 | 13.100000 | 203.000000 | 3950.000000 |
| 25% | 45.350000 | 14.200000 | 212.000000 | 4700.000000 |
| 50% | 47.400000 | 15.000000 | 216.000000 | 5050.000000 |
| 75% | 49.600000 | 15.750000 | 221.500000 | 5500.000000 |
| max | 59.600000 | 17.300000 | 231.000000 | 6300.000000 |
Applying Statistical Tests#
Normality Test#
normaltest() tests whether a sample differs from a normal distribution.
This function tests the null hypothesis that the sample comes from a normal distribution.
The p-value ranges between 0 and 1.
from scipy.stats import normaltest
print("Adelie: ", normaltest(Adelie_values['bill_length_mm']).pvalue)
print("Chinstrap: ", normaltest(Chinstrap_values['bill_length_mm']).pvalue)
print("Gentoo: ", normaltest(Gentoo_values['bill_length_mm']).pvalue)
Adelie: 0.7046667395852243
Chinstrap: 0.9143668075479967
Gentoo: 0.002785628232779262
Traditionally, in statistics, we need a p-value of less than 0.05 to reject the null hypothesis. In this case, 2 out of 3 species have a p-value > 0.05; for those two we cannot reject the null hypothesis, so we have no evidence that their bill-length distributions deviate from normality.
But what about the last one?
Aren't we forgetting something? Do we know enough about the data?
# import matplotlib.pyplot as plt
# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))
# sns.histplot(x = "bill_length_mm", data = penguins_cleaned, binwidth=1, hue="species", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look weird
# ax.set(title="Bill Length")
# sns.move_legend(
# ax, "upper center",
# bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )
# f.savefig('penguins_species_bill-length_PNG.png', dpi=300)
A normal distribution is symmetric about the mean.
A normal distribution also has a fixed relationship between its height and its width: both are set by the standard deviation, so if you halve the width, the peak doubles in height.
However, you could imagine stretching a bell-shaped curve in ways that keep its symmetry but change its shape: a sharp, pointy distribution with heavy tails, or a flat, boxy one. Skewness measures the asymmetry, and kurtosis measures how pointy or boxy the curve is.
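In fact, normaltest() is D'Agostino and Pearson's omnibus test, which combines exactly these two shape checks. A minimal sketch looking at skewness and kurtosis directly for the Gentoo subset defined above:
from scipy.stats import skew, kurtosis, skewtest, kurtosistest

bill = Gentoo_values['bill_length_mm']
print("skewness:", skew(bill))             # ~0 for a symmetric sample
print("excess kurtosis:", kurtosis(bill))  # ~0 for a normal distribution (Fisher definition)
print("skewtest p-value:", skewtest(bill).pvalue)
print("kurtosistest p-value:", kurtosistest(bill).pvalue)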
# import matplotlib.pyplot as plt
# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))
# sns.histplot(x = "bill_length_mm", data = Gentoo_values, binwidth=1, hue="sex", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look weird
# ax.set(title="Bill Length of Gentoo")
# sns.move_legend(
# ax, "upper center",
# bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )
# f.savefig('penguins_gentoo_bill-length_PNG.png', dpi=300)
Gentoo_values_male = Gentoo_values[Gentoo_values.sex == 'Male']
Gentoo_values_female = Gentoo_values[Gentoo_values.sex == 'Female']
print("Gentoo male: ", normaltest(Gentoo_values_male['bill_length_mm']).pvalue)
print("Gentoo female: ", normaltest(Gentoo_values_female['bill_length_mm']).pvalue)
Gentoo male: 0.0005453287292381343
Gentoo female: 0.9029679515828937
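As an extra visual check (not part of the tests above), a Q-Q plot compares the ordered sample values against theoretical normal quantiles; a minimal sketch with scipy.stats.probplot for the male Gentoo sample:
import matplotlib.pyplot as plt
from scipy.stats import probplot

f, ax = plt.subplots(figsize=(5, 5))
# Points falling on a straight line indicate an approximately normal sample
probplot(Gentoo_values_male['bill_length_mm'], dist="norm", plot=ax)
ax.set(title="Q-Q plot: bill length of male Gentoo penguins")
plt.show()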
Luckily, we are able to test and compare sets of values even if they do not come from a normal distribution.
Comparing 2 groups#
Parametric Tests#
T-Test#
ttest_ind() calculates the T-test for the means of two independent samples of scores.
This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. By default, this test assumes that the populations have identical variances.
There is a catch! We need to have either the same number of samples or the same variance. The equal_var argument of ttest_ind() controls which variant is run: if True (the default), it performs a standard independent two-sample test that assumes equal population variances; if False, it performs Welch's t-test, which does not assume equal population variance.
Note: Using Student's original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances. Welch's t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.
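To make the difference concrete, here is a minimal sketch (using the per-species subsets defined above) that runs both variants side by side; below, we first check the variances and then pick the appropriate call.
from scipy.stats import ttest_ind

# Student's t-test (default): assumes equal population variances
student = ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=True)
# Welch's t-test: does not assume equal population variances
welch = ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=False)
print("Student's t-test p-value:", student.pvalue)
print("Welch's t-test p-value:  ", welch.pvalue)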
Perform the Levene test for equal variances.
The Levene test tests the null hypothesis that all input samples are from populations with equal variances.
from scipy.stats import ttest_ind, levene
# pvalues with scipy:
stat_results = [levene(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'])]
print("Adelie vs Chinstrap, variance: ", stat_results[0])
pvalues = [result.pvalue for result in stat_results]
Adelie vs Chinstrap, variance: LeveneResult(statistic=4.529733833453024, pvalue=0.03446512682844295)
With a p-value < 0.05, we reject the null hypothesis that all input samples come from populations with equal variances, so we need Welch's t-test.
The null hypothesis now is that the 2 independent samples have identical average (expected) values.
# pvalues with scipy:
stat_results_ACh = [ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=False)]
print("Adelie vs Chinstrap, mean: ", stat_results_ACh[0])
pvalues = [result.pvalue for result in stat_results_ACh]
Adelie vs Chinstrap, mean: TtestResult(statistic=-21.712498056635937, pvalue=3.1490764303457434e-41, df=108.17221912082128)
With a p-value < 0.05, we reject the null hypothesis. Now, how can we plot this?
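One simple option, before the statannotations preview at the end, is to draw the two groups with seaborn and write the p-value on the axes ourselves; a minimal sketch reusing the Welch result stored in stat_results_ACh (the bracket coordinates are hand-picked for this plot):
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(6, 4))
two_species = penguins_cleaned[penguins_cleaned["species"].isin(["Adelie", "Chinstrap"])]
sns.boxplot(x="species", y="bill_length_mm", data=two_species, ax=ax)
# Draw a bracket between the two boxes and print the Welch p-value above it
x1, x2, y = 0, 1, two_species["bill_length_mm"].max() + 1
ax.plot([x1, x1, x2, x2], [y, y + 1, y + 1, y], color="black")
ax.text((x1 + x2) / 2, y + 1.5, f"Welch's t-test p = {stat_results_ACh[0].pvalue:.2e}", ha="center")
ax.set(title="Bill length: Adelie vs Chinstrap")
plt.show()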
Non-parametric Tests#
Mann-Whitney U Test#
Since the bill lengths of male Gentoo penguins failed the normality test, we compare the male and female Gentoo samples with the Mann–Whitney U test, a non-parametric test of the null hypothesis that the two independent samples come from the same distribution.
from scipy.stats import mannwhitneyu
# pvalues with scipy:
stat_results_GFM = [mannwhitneyu(Gentoo_values_male['bill_length_mm'], Gentoo_values_female['bill_length_mm'], alternative="two-sided"),]
print("Gentoo male vs Gentoo female, bill_length_mm: ", stat_results_GFM[0])
pvalues = [result.pvalue for result in stat_results_GFM]
Gentoo male vs Gentoo female, bill_length_mm: MannwhitneyuResult(statistic=3125.5, pvalue=5.58594267834405e-13)
Comparing more than 2 groups#
Analysis of Variance#
Now we have three samples, so a t-test is not appropriate. If we state the null hypothesis that there is no difference between the samples, we should apply a one-way ANOVA.
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. In other words, the ANOVA is used to test the difference between two or more means.
from scipy.stats import f_oneway
stat_results_f_oneway = f_oneway(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])
stat_results_f_oneway
F_onewayResult(statistic=397.29943741282835, pvalue=1.3809842053150027e-88)
Kruskal–Wallis test#
The Kruskal–Wallis test is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups.
from scipy.stats import kruskal
stat_results_kruskal = kruskal(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])
stat_results_kruskal
KruskalResult(statistic=236.8992355590763, pvalue=3.6139705965512625e-52)
We are looking at all comparisons at the same time, but we usually want to know which group makes the difference! For the pairwise follow-up we need to include a multiple testing correction.
Multiple testing correction#
For ANOVA there is Tukey's HSD test; for non-parametric tests, there is Dunn's test.
Tukey#
Tukey's HSD is a pairwise post-hoc test for ANOVA: using a studentized range distribution, it checks whether the means differ for every possible pair of groups.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Tukey HSD needs the data in long format: one column of values and one column of group labels
data = penguins_cleaned[['species','bill_length_mm']]
# perform multiple pairwise comparison (Tukey HSD)
stat_results_pairwise_tukeyhsd = pairwise_tukeyhsd(endog=data['bill_length_mm'], groups=data['species'], alpha=0.05)
print(stat_results_pairwise_tukeyhsd)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==========================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------------
Adelie Chinstrap 10.0099 0.0 8.9828 11.0369 True
Adelie Gentoo 8.7441 0.0 7.8801 9.6081 True
Chinstrap Gentoo -1.2658 0.0148 -2.3292 -0.2023 True
----------------------------------------------------------
Dunn’s test#
If the results of a Kruskal-Wallis test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different.
# !pip install scikit_posthocs
data = [Adelie_values['bill_length_mm'].to_numpy(), Chinstrap_values['bill_length_mm'].to_numpy(), Gentoo_values['bill_length_mm'].to_numpy()]
# Perform Dunn's test using a Bonferroni correction for the p-values
import scikit_posthocs as sp
# With a list of arrays as input, the groups are labelled 1, 2, 3 in the order passed above
sp.posthoc_dunn(data, p_adjust='bonferroni', group_col='species', val_col='bill_length_mm')
| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 1.000000e+00 | 4.623888e-36 | 1.030275e-37 |
| 2 | 4.623888e-36 | 1.000000e+00 | 2.697766e-01 |
| 3 | 1.030275e-37 | 2.697766e-01 | 1.000000e+00 |
(Bonus preview) Use statannotations to apply a scipy test#
Finally, statannotations can take care of most of the steps required to run a test, by calling scipy.stats directly, and annotate the plot.
The available options are:
Mann-Whitney
t-test (independent and paired)
Welch’s t-test
Levene test
Wilcoxon test
Kruskal-Wallis test
We will cover how to use a test that is not one of those already interfaced in statannotations.
If you are curious, you can also take a look at the usage notebook in the project repository.
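As a preview of the built-in usage, here is a minimal sketch of the typical pattern (assuming the Annotator API of statannotations 0.4; the pairs and the test name are just illustrative choices):
import matplotlib.pyplot as plt
from statannotations.Annotator import Annotator

f, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(x="species", y="bill_length_mm", data=penguins_cleaned, ax=ax)

# Pairs of groups to compare and annotate on the plot
pairs = [("Adelie", "Chinstrap"), ("Adelie", "Gentoo"), ("Chinstrap", "Gentoo")]

annotator = Annotator(ax, pairs, data=penguins_cleaned, x="species", y="bill_length_mm")
# "Mann-Whitney" is one of the scipy-backed tests listed above
annotator.configure(test="Mann-Whitney", text_format="star", loc="inside")
annotator.apply_and_annotate()
plt.show()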
from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,matplotlib,scipy,statannotations"))
Last updated: 2023-08-25T10:49:29.031581+02:00
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.14.0
Compiler : MSC v.1929 64 bit (AMD64)
OS : Windows
Release : 10
Machine : AMD64
Processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores : 16
Architecture: 64bit
watermark : 2.4.3
numpy : 1.23.5
pandas : 2.0.3
seaborn : 0.12.2
matplotlib : 3.7.2
scipy : 1.11.2
statannotations: 0.4.4