Statistics

Statistics#

Sources and inspiration:

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

# !pip install seaborn
# !pip scikit_posthocs
# !pip install watermark

import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import mannwhitneyu, normaltest

Introduction#

Many libraries are available in Python to clean, analyze, and plot data. Python also has robust statistical packages which are used by thousands of other projects.text kurzívou

We will work with the penguins dataset from seaborn.

penguins = sns.load_dataset("penguins")
penguins_cleaned = penguins.dropna()
penguins_cleaned.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male

Exploring the Data#

We already prepared the penguin dataset with penguins_cleaned = penguins.dropna(), but we should double check.

penguins_cleaned.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

penguins_cleaned.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male

There are 3 species of penquins. We can access their names by applying the .unique method on the ‘species’ column. It returns the unique values in that column.

penguins_cleaned['species'].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

We can get average measurements of a property split by categories with the .pivot_table.

penguins_cleaned.pivot_table('bill_length_mm', index='sex', columns='species', aggfunc='mean')

species	Adelie	Chinstrap	Gentoo
sex
Female	37.257534	46.573529	45.563793
Male	40.390411	51.094118	49.473770

Splitting data into sub-groups#

Furhtermore, we can also prepare subsets for each penguin species by feeding a different binary mask to the dataframe.

Adelie_values =  penguins_cleaned[penguins_cleaned['species']=='Adelie']
Chinstrap_values = penguins_cleaned[penguins_cleaned['species']=='Chinstrap']
Gentoo_values = penguins_cleaned[penguins_cleaned['species']=='Gentoo']

Lets explore values of each species.

Adelie_values.describe()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
count	146.000000	146.000000	146.000000	146.000000
mean	38.823973	18.347260	190.102740	3706.164384
std	2.662597	1.219338	6.521825	458.620135
min	32.100000	15.500000	172.000000	2850.000000
25%	36.725000	17.500000	186.000000	3362.500000
50%	38.850000	18.400000	190.000000	3700.000000
75%	40.775000	19.000000	195.000000	4000.000000
max	46.000000	21.500000	210.000000	4775.000000

Chinstrap_values.describe()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
count	68.000000	68.000000	68.000000	68.000000
mean	48.833824	18.420588	195.823529	3733.088235
std	3.339256	1.135395	7.131894	384.335081
min	40.900000	16.400000	178.000000	2700.000000
25%	46.350000	17.500000	191.000000	3487.500000
50%	49.550000	18.450000	196.000000	3700.000000
75%	51.075000	19.400000	201.000000	3950.000000
max	58.000000	20.800000	212.000000	4800.000000

Gentoo_values.describe()

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
count	119.000000	119.000000	119.000000	119.000000
mean	47.568067	14.996639	217.235294	5092.436975
std	3.106116	0.985998	6.585431	501.476154
min	40.900000	13.100000	203.000000	3950.000000
25%	45.350000	14.200000	212.000000	4700.000000
50%	47.400000	15.000000	216.000000	5050.000000
75%	49.600000	15.750000	221.500000	5500.000000
max	59.600000	17.300000	231.000000	6300.000000

Applying Statistics tests#

Normality Test#

normaltest() test whether a sample differs from a normal distribution.

This function tests the null hypothesis that a sample comes from a normal distribution.

The p-value range is between 0 and 1,

from scipy.stats import normaltest
print("Adelie: ", normaltest(Adelie_values['bill_length_mm']).pvalue)
print("Chinstrap: ", normaltest(Chinstrap_values['bill_length_mm']).pvalue)
print("Gentoo: ", normaltest(Gentoo_values['bill_length_mm']).pvalue)

Adelie:  0.7046667395852243
Chinstrap:  0.9143668075479967
Gentoo:  0.002785628232779262

Traditionally, in statistics, we need a p-value of less than 0.05 to reject the null hypothesis. In this case, the 2 out of 3 species have p-value > 0.05. Because our p value is greater than 0.05, we cannot reject the null hypothesis. Therefore, we have not proven that the 2 data sets are different from normality.

But what about the last?

Aren’t we forgeting something? Do we know enough about the data?

# import matplotlib.pyplot as plt
# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))

# sns.histplot(x = "bill_length_mm", data = penguins_cleaned, binwidth=1, hue="species", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look wierd
# ax.set(title="Bill Length")

# sns.move_legend(
#     ax, "upper center",
#     bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )

# f.savefig('penguins_species_bill-length_PNG.png', dpi=300)

A normal distribution is symmetric about the mean.
A normal distribution also has a specific width for a given height.

If you double the height, the width scales proportionally. However, you could imagine stretching a bell curve out in weird ways without changing its symmetry. You could have a sharp, pointy distribution, or a fat, boxy one.

# import matplotlib.pyplot as plt

# # Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(10, 5))

# sns.histplot(x = "bill_length_mm", data = Gentoo_values, binwidth=1, hue="sex", kde=True)
# # plt.title("Bill Length", size=20, color="red") # would look wierd
# ax.set(title="Bill Length of Gentoo")

# sns.move_legend(
#     ax, "upper center",
#     bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,
# )

# f.savefig('penguins_gentoo_bill-length_PNG.png', dpi=300)

Gentoo_values_male=Gentoo_values[Gentoo_values.sex=='Male']
Gentoo_values_female=Gentoo_values[Gentoo_values.sex=='Female']

print("Gentoo male: ", normaltest(Gentoo_values_male['bill_length_mm']).pvalue)
print("Gentoo female: ", normaltest(Gentoo_values_female['bill_length_mm']).pvalue)

Gentoo male:  0.0005453287292381343
Gentoo female:  0.9029679515828937

Luckily we are able to test and compare sets of values, even if they do not come from Normal Distribution

Comparing 2 groups#

Parametric Tests#

T-Test#

ttest_ind() calculates the T-test for the means of two independent samples of scores.

This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

There is a catch! We need to have either same number of samples, or same variance. ttest_ind() can perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance.

Note Using Student’s original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student’s original t-test is highly robust to the presence of unequal variances. Welch’s t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.

Perform Levene test for equal variances.

The Levene test tests the null hypothesis that all input samples are from populations with equal variances.

from scipy.stats import ttest_ind, levene

# pvalues with scipy:
stat_results = [levene(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'])]

print("Adelie vs Chinstrap, variance: ", stat_results[0])

pvalues = [result.pvalue for result in stat_results]

Adelie vs Chinstrap, variance:  LeveneResult(statistic=4.529733833453024, pvalue=0.03446512682844295)

With p-value < 0.05 , we reject the null hypothesis that all input samples are from populations with equal variances. So we need Welch’s t-test

Null hypothesis is that 2 independent samples have identical average (expected) values.

# pvalues with scipy:
stat_results_ACh = [ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=False)]

print("Adelie vs Chinstrap, mean: ", stat_results_ACh[0])

pvalues = [result.pvalue for result in stat_results_ACh]

Adelie vs Chinstrap, mean:  TtestResult(statistic=-21.712498056635937, pvalue=3.1490764303457434e-41, df=108.17221912082128)

With p-value < 0.05 , we reject the null hypothesis. Now how we can plot this?

Non-parametric#

Mann-Whitney U Test#

from scipy.stats import mannwhitneyu

# pvalues with scipy:
stat_results_GFM = [mannwhitneyu(Gentoo_values_male['bill_length_mm'], Gentoo_values_female['bill_length_mm'], alternative="two-sided"),]

print("Gentoo male vs Gentoo female, bill_length_mm: ", stat_results_GFM[0])

pvalues = [result.pvalue for result in stat_results_GFM]

Gentoo male vs Gentoo female, bill_length_mm:  MannwhitneyuResult(statistic=3125.5, pvalue=5.58594267834405e-13)

Comparing more than 2 groups#

Analysis of Variance#

Now we have three samples, so a t-test is actually not appropriate. If we state the 0-Hypothesis that there is no difference between samples, we should apply a one-way ANOVA.

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. In other words, the ANOVA is used to test the difference between two or more means.

from scipy.stats import f_oneway

stat_results_f_oneway = f_oneway(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])

stat_results_f_oneway

F_onewayResult(statistic=397.29943741282835, pvalue=1.3809842053150027e-88)

Kruskal–Wallis test#

The Kruskal–Wallis test is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups.

from scipy.stats import kruskal

stat_results_kruskal = kruskal(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])

stat_results_kruskal

KruskalResult(statistic=236.8992355590763, pvalue=3.6139705965512625e-52)

We are looking at all comparisons at the same time. But we usually want to know which one makes the difference! But here we need to include Multiple testing correction!

Multiple testing correction#

For ANOVA there is Tukey. For non-parametric tests, there is Dunn’s test.

Tukey#

This test uses pairwise post-hoc testing for ANOVA to determine whether there is a difference between the mean of all possible pairs using a studentized range distribution. This method tests every possible pair of all groups.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# For Tukey the dataframe needs to be melted

data = penguins_cleaned[['species','bill_length_mm']]

# perform multiple pairwise comparison (Tukey HSD)
stat_results_pairwise_tukeyhsd = pairwise_tukeyhsd(endog=data['bill_length_mm'], groups=data['species'], alpha=0.05)

print(stat_results_pairwise_tukeyhsd)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
==========================================================
  group1    group2  meandiff p-adj   lower   upper  reject
----------------------------------------------------------
   Adelie Chinstrap  10.0099    0.0  8.9828 11.0369   True
   Adelie    Gentoo   8.7441    0.0  7.8801  9.6081   True
Chinstrap    Gentoo  -1.2658 0.0148 -2.3292 -0.2023   True
----------------------------------------------------------

Dunn’s test#

If the results of a Kruskal-Wallis test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different.

# !pip install scikit_posthocs

data=[Adelie_values['bill_length_mm'].to_numpy(), Chinstrap_values['bill_length_mm'].to_numpy(), Gentoo_values['bill_length_mm'].to_numpy()]

#perform Dunn's test using a Bonferonni correction for the p-values
import scikit_posthocs as sp
sp.posthoc_dunn(data, p_adjust = 'bonferroni', group_col='species', val_col='bill_length_mm')

	1	2	3
1	1.000000e+00	4.623888e-36	1.030275e-37
2	4.623888e-36	1.000000e+00	2.697766e-01
3	1.030275e-37	2.697766e-01	1.000000e+00

(Bonus preview) Use statannotations to apply scipy test#

Finally, statannotations can take care of most of the steps required to run the test by calling scipy.stats directly and annotate the plot. The available options are

Mann-Whitney
t-test (independent and paired)
Welch’s t-test
Levene test
Wilcoxon test
Kruskal-Wallis test

We will cover how to use a test that is not one of those already interfaced in statannotations. If you are curious, you can also take a look at the usage notebook in the project repository.

violin_annotated_PNG (1).png

from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,matplotlib,scipy,statannotations"))

Last updated: 2023-08-25T10:49:29.031581+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark      : 2.4.3
numpy          : 1.23.5
pandas         : 2.0.3
seaborn        : 0.12.2
matplotlib     : 3.7.2
scipy          : 1.11.2
statannotations: 0.4.4