PCA (Principal Component Analysis)#

Notebook adapted from Anna Poetsch (source) under CC-BY-4.0

Source material:
Tutorial: https://umap-learn.readthedocs.io/en/latest/
Paper: https://arxiv.org/abs/1802.03426

Packages (if not available, pip install):

import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Load data:

penguins = pd.read_csv("https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv")
penguins = penguins.dropna()

lter_penguins.png

culmen_depth.png

Show data:

penguins.head()
   species_short     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0         Adelie  Torgersen              39.1             18.7              181.0       3750.0    MALE
1         Adelie  Torgersen              39.5             17.4              186.0       3800.0  FEMALE
2         Adelie  Torgersen              40.3             18.0              195.0       3250.0  FEMALE
4         Adelie  Torgersen              36.7             19.3              193.0       3450.0  FEMALE
5         Adelie  Torgersen              39.3             20.6              190.0       3650.0    MALE

penguins = penguins.dropna()
penguins.species_short.value_counts()

sns.pairplot(penguins, hue='species_short')
<seaborn.axisgrid.PairGrid at 0x7fae38a024f0>
../_images/03_PCA_10_1.png

Data scaling#

Extract the four numeric measurement columns as a NumPy array:

penguin_data = penguins[
    [
        "culmen_length_mm",
        "culmen_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ]
].values
import umap.umap_ as umap  # install with 'pip install umap-learn'
np.random.seed(42)  # fix the seed, because UMAP has a random component
reducer = umap.UMAP()
scaled_penguin_data = StandardScaler().fit_transform(penguin_data)
scaled_penguin_data
array([[-0.89765322,  0.78348666, -1.42952144, -0.57122888],
       [-0.82429023,  0.12189602, -1.07240838, -0.50901123],
       [-0.67756427,  0.42724555, -0.42960487, -1.19340546],
       ...,
       [ 1.17485108, -0.74326098,  1.49880565,  1.91747742],
       [ 0.22113229, -1.20128527,  0.78457953,  1.23308319],
       [ 1.08314735, -0.53969463,  0.85600214,  1.48195382]])
scaled_df = pd.DataFrame(scaled_penguin_data)

sns.pairplot(scaled_df)
<seaborn.axisgrid.PairGrid at 0x7fae38da70a0>
../_images/03_PCA_17_1.png
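
As a minimal sketch of what the scaling step does, the snippet below standardizes each column by hand: subtract the column mean and divide by the column standard deviation. The toy matrix is an assumption for illustration (two columns borrowed from the first rows shown earlier, culmen length and body mass); it is not the full dataset.

```python
import numpy as np

# Toy matrix (illustrative): 4 samples, 2 features on very different scales
X = np.array([
    [39.1, 3750.0],
    [39.5, 3800.0],
    [40.3, 3250.0],
    [36.7, 3450.0],
])

# Column-wise z-score: subtract the mean, divide by the population std (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step, a feature measured in grams would dominate one measured in millimetres simply because its numbers are larger.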

PCA#

Principal component analysis (PCA) is a linear dimensionality reduction technique, while UMAP is non-linear. Another popular non-linear technique is t-SNE.
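
To make "linear" concrete, here is a rough sketch of PCA written directly with NumPy: center the data, take a singular value decomposition, and project onto the resulting axes. The toy data and all variable names are assumptions for illustration; this is one common way to compute PCA, not necessarily how any particular library implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 strongly correlated features plus noise
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

# Center the data; PCA is defined on mean-centered columns
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# The projection is a plain matrix product, which is what makes PCA linear
scores = Xc @ Vt.T

# Variance explained by each component, as a fraction of the total
explained_variance = S**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```

Because the three toy features are driven by a single latent variable, nearly all of the variance ends up in the first component.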

from sklearn.decomposition import PCA
pca = PCA(n_components=4)
pen_pca = pca.fit_transform(scaled_penguin_data)

Print the fraction of variance explained by each component and plot the first two components, colored by species:

print(
    "explained variance ratio: %s"
    % str(pca.explained_variance_ratio_)
)
plt.scatter(
    pen_pca[:, 0],
    pen_pca[:, 1],
    c=[sns.color_palette()[x] for x in penguins.species_short.map({"Adelie":0, "Chinstrap":1, "Gentoo":2})])
plt.show()
explained variance ratio: [0.68641678 0.19448404 0.09215558 0.02694359]
../_images/03_PCA_22_1.png

PCA separates the species to some extent in the first two components, which together explain about 88% of the variance, but the separation is not clean.
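
As a small worked example, the cumulative explained variance can be computed from the ratios printed above. The numbers are copied from that output; the variable names are illustrative.

```python
import numpy as np

# Explained variance ratios as printed above
ratios = np.array([0.68641678, 0.19448404, 0.09215558, 0.02694359])

# Running total: how much variance the first k components capture together
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} of variance, {c:.1%} cumulative")
```

The first two components account for roughly 88% of the variance, so the 2D scatter plot discards about 12% of the variation in the scaled measurements.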