Principal Feature Analysis#
Ryan Savill, July 8th 2021
Background#
Principal feature analysis (PFA) is a method for selecting the subset of features that describes most of the variability in a dataset. It is based on this paper by Y. Lu et al., published at ACM Multimedia in 2007. Most of the code for the implementation below was provided in a Stack Exchange answer and is greatly appreciated!
This sounds oddly similar to principal component analysis (PCA), which is no coincidence, as the two methods are closely intertwined. PCA does something similar, but instead of choosing a subset of features that explains the variability up to a certain threshold, it chooses a subset of principal components. This is not always optimal, and there are valid reasons for working with the original features rather than the principal components, since the principal components are always a transformed version of the data.
If you have not heard of PCA before, I would advise you to check out this video. If you (like me just a few days ago) understand the principle of PCA but not the math behind it, I'd advise you to take some time to grasp the concepts. I found a great series of videos that explains all the necessary parts of the puzzle that is PCA, ending with PCA (finally!). Each video is only about 10 minutes long and explained quite well, even if your math lectures were a few years ago! Once you have grasped these concepts, the following explanation should be a breeze.
Principal feature analysis takes advantage of some of the transformations in PCA to find out which features explain the variance. The first steps in PFA are to compute the covariance matrix, find its orthonormal eigenvectors and sort them by their eigenvalues, which gives a matrix A. This is in effect what PCA does as well, since the principal components are the eigenvectors of the covariance matrix. In scikit-learn we can access them via the '.components_' attribute of a fitted PCA object.
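Before applying this to real data, here is a minimal sanity-check sketch on randomly generated toy data (names like 'X_demo' and 'pca_demo' are only illustrative) showing that the rows of '.components_' really are the orthonormal eigenvectors of the covariance matrix of the standard-scaled data, sorted by descending eigenvalue:
# a minimal sanity check on toy data, not part of the analysis below
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 4)))

pca_demo = PCA().fit(X_demo)

# eigendecomposition of the covariance matrix, sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(eigenvalues)[::-1]

# the sorted eigenvectors match pca.components_ up to sign,
# which is arbitrary for eigenvectors
print(np.allclose(np.abs(eigenvectors[:, order].T), np.abs(pca_demo.components_)))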
# importing everything we might need and more:
from sklearn import decomposition
import tribolium_clustering as tc
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
# function for labelling eigenvectors so the output is readable:
def pca_comp_inf_dataframe(pca_object, regprops_df, n_components):
    features = regprops_df.keys()
    component_names = ['PC{}'.format(i+1) for i in range(n_components)]
    comps = pca_object.components_
    df_comps = pd.DataFrame(comps, index=component_names, columns=features)
    return df_comps
# loading our test dataset
location_prefix = 'data/folder/'
timepoint = 9
regionpropspath = location_prefix + 'Master Thesis//First Coding Tries//regionprops_all_timepoints_lund//regprops t{}.csv'.format(timepoint)
regprops = tc.readcsv_as_cl_input(regionpropspath)
# redefinition of our dataset
X = regprops
# standard scaling the dataset is a prerequisite for PCA
sc = StandardScaler()
X = sc.fit_transform(X)
# fitting our PCA
pca = PCA().fit(X)
# getting our orthonormal eigenvectors:
A = pca.components_.T
# labelling that matrix for ease of use
A_readable = pca_comp_inf_dataframe(pca,regprops,len(regprops.keys())).transpose()
A_readable
No Predictions in Regionprops of data/folder/Master Thesis/First Coding Tries/regionprops_all_timepoints_lund/regprops t9.csv
feature | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
area | 0.317564 | 0.035871 | 0.001643 | -0.060553 | -0.099831 | 0.060197 | 0.121308 | -0.070996 | -0.059213 | -0.200579 | 0.358877 | 0.496485 | 0.533148 | 0.376162 | 0.126583 | -0.007435 |
centroid-0 | 0.031897 | 0.394748 | -0.671168 | 0.066552 | -0.171701 | -0.245463 | -0.296170 | -0.423315 | -0.143398 | 0.086719 | 0.030823 | -0.041805 | -0.005947 | 0.007349 | 0.026808 | 0.005131 |
centroid-1 | 0.008123 | 0.382147 | 0.488175 | 0.230320 | -0.613868 | -0.143166 | -0.356966 | 0.192595 | 0.011508 | -0.013926 | 0.001356 | 0.008315 | -0.007246 | 0.000669 | -0.015394 | -0.000195 |
centroid-2 | 0.240592 | -0.076633 | -0.435728 | 0.075588 | -0.140601 | -0.327701 | 0.114531 | 0.634164 | 0.393689 | -0.174994 | -0.084351 | 0.026689 | -0.018534 | -0.022879 | -0.062223 | -0.007310 |
feret_diameter_max | 0.289769 | -0.012282 | -0.032635 | 0.181363 | -0.160207 | 0.380691 | 0.084649 | -0.211788 | 0.439432 | 0.320636 | -0.163147 | 0.056981 | -0.363715 | 0.448382 | 0.007851 | -0.012093 |
major_axis_length | 0.301739 | 0.002370 | -0.021720 | 0.113154 | -0.135457 | 0.346191 | 0.039834 | -0.195065 | 0.315525 | 0.010066 | 0.049534 | -0.247254 | 0.419644 | -0.608955 | -0.086439 | 0.013967 |
minor_axis_length | 0.308058 | 0.048526 | 0.032858 | -0.149205 | -0.075049 | 0.130833 | 0.027726 | -0.170614 | -0.094954 | -0.740310 | 0.122889 | -0.086970 | -0.489703 | -0.077289 | -0.034946 | -0.005721 |
solidity | 0.139750 | 0.213562 | 0.179860 | -0.784037 | 0.109756 | -0.245965 | -0.162139 | -0.093897 | 0.399242 | 0.142184 | -0.016268 | -0.005970 | 0.007210 | -0.012330 | 0.005534 | -0.001295 |
mean_intensity | -0.298859 | -0.031721 | -0.044582 | -0.053140 | -0.290490 | -0.081776 | 0.366782 | -0.070965 | 0.147539 | 0.210029 | 0.577876 | 0.280034 | -0.316568 | -0.304428 | 0.055622 | 0.001809 |
max_intensity | -0.223891 | 0.445127 | -0.015072 | -0.078562 | -0.095429 | 0.088448 | 0.504917 | -0.064260 | -0.004226 | -0.155993 | -0.573306 | 0.308846 | 0.074610 | -0.118217 | -0.036991 | 0.002788 |
min_intensity | -0.248677 | -0.277709 | -0.003527 | -0.151868 | -0.443942 | -0.187537 | 0.304523 | -0.223142 | 0.067566 | -0.152249 | -0.039673 | -0.525101 | 0.223898 | 0.328584 | -0.005177 | -0.002429 |
image_stdev | -0.124373 | 0.597732 | -0.026935 | 0.059831 | 0.275986 | 0.222842 | 0.207255 | 0.258350 | 0.080221 | -0.016470 | 0.376158 | -0.432514 | 0.043550 | 0.222652 | 0.025638 | -0.004268 |
avg distance of 2 closest points | 0.312403 | 0.040115 | 0.025366 | -0.113510 | -0.083929 | -0.065002 | 0.223752 | 0.060992 | -0.355553 | 0.261325 | 0.057316 | -0.050857 | -0.037707 | 0.038347 | -0.716577 | 0.324435 |
avg distance of 3 closest points | 0.315714 | 0.031638 | 0.028109 | -0.084834 | -0.084732 | -0.064296 | 0.212046 | 0.076844 | -0.308744 | 0.228016 | -0.034887 | -0.121603 | -0.049914 | -0.068178 | 0.120108 | -0.802665 |
avg distance of 4 closest points | 0.316944 | 0.023361 | 0.033518 | -0.059729 | -0.078356 | -0.068524 | 0.204538 | 0.092240 | -0.253916 | 0.188049 | -0.088972 | -0.156060 | -0.072032 | -0.097133 | 0.660690 | 0.499889 |
touching neighbor count | -0.190434 | -0.078438 | -0.284396 | -0.437245 | -0.339335 | 0.594522 | -0.243295 | 0.334427 | -0.201843 | 0.052019 | -0.024418 | 0.025692 | 0.012726 | 0.013417 | 0.028237 | 0.006079 |
Now we have a massive matrix of our eigenvectors sorted by their eigenvalues. However, since every feature has equal influence on the eigenvectors, we need to take a subset of matrix A, called A_q. Here q is the number of eigenvectors we take into account, and its value determines how much variance we can explain with this subset of the principal components. We can also turn the question around: set a threshold for how much variance should at least be explained and then find the corresponding subset A_q of matrix A:
# this shows us how much variance each principal component explains
explained_variance = pca.explained_variance_ratio_
print('Explained Variances')
print(explained_variance)
print('')
# this shows us how much variance the subset up to the i'th index explains
cumulative_expl_var = [sum(explained_variance[:i+1]) for i in range(len(explained_variance))]
print('Cumulative Explained Variance')
print(cumulative_expl_var)
print('')
# now we iterate through the cumulative explained variance until we reach a threshold that we define
q_variance_limit = 0.9
for i, j in enumerate(cumulative_expl_var):
    if j >= q_variance_limit:
        q = i
        break
print('The Subset will include the first {} Eigenvectors'.format(q))
Explained Variances
[5.89549241e-01 1.24817549e-01 7.44073363e-02 6.06447683e-02
5.05242728e-02 3.80817718e-02 2.29718719e-02 1.43923774e-02
1.16341513e-02 5.35585488e-03 2.50361397e-03 1.82220555e-03
1.48239356e-03 1.06126637e-03 6.85699035e-04 6.56265602e-05]
Cumulative Explained Variance
[0.5895492409109632, 0.7143667903013323, 0.7887741265865403, 0.8494188949293884, 0.8999431677426732, 0.9380249394988094, 0.9609968114170524, 0.9753891887742816, 0.9870233400706391, 0.992379194950718, 0.9948828089223744, 0.9967050144743802, 0.9981874080386469, 0.9992486744047498, 0.9999343734397964, 0.9999999999999999]
The Subset will include the first 5 Eigenvectors
With our q defined we can get the subset matrix of eigenvectors A_q:
A_q = A[:,:q]
A_q_readable = pd.DataFrame(A_q,index = regprops.keys(), columns = ['PC{}'.format(i+1) for i in range(q)])
A_q_readable
feature | PC1 | PC2 | PC3 | PC4 | PC5
---|---|---|---|---|---
area | 0.317564 | 0.035871 | 0.001643 | -0.060553 | -0.099831 |
centroid-0 | 0.031897 | 0.394748 | -0.671168 | 0.066552 | -0.171701 |
centroid-1 | 0.008123 | 0.382147 | 0.488175 | 0.230320 | -0.613868 |
centroid-2 | 0.240592 | -0.076633 | -0.435728 | 0.075588 | -0.140601 |
feret_diameter_max | 0.289769 | -0.012282 | -0.032635 | 0.181363 | -0.160207 |
major_axis_length | 0.301739 | 0.002370 | -0.021720 | 0.113154 | -0.135457 |
minor_axis_length | 0.308058 | 0.048526 | 0.032858 | -0.149205 | -0.075049 |
solidity | 0.139750 | 0.213562 | 0.179860 | -0.784037 | 0.109756 |
mean_intensity | -0.298859 | -0.031721 | -0.044582 | -0.053140 | -0.290490 |
max_intensity | -0.223891 | 0.445127 | -0.015072 | -0.078562 | -0.095429 |
min_intensity | -0.248677 | -0.277709 | -0.003527 | -0.151868 | -0.443942 |
image_stdev | -0.124373 | 0.597732 | -0.026935 | 0.059831 | 0.275986 |
avg distance of 2 closest points | 0.312403 | 0.040115 | 0.025366 | -0.113510 | -0.083929 |
avg distance of 3 closest points | 0.315714 | 0.031638 | 0.028109 | -0.084834 | -0.084732 |
avg distance of 4 closest points | 0.316944 | 0.023361 | 0.033518 | -0.059729 | -0.078356 |
touching neighbor count | -0.190434 | -0.078438 | -0.284396 | -0.437245 | -0.339335 |
If you have read the paper or are following along with it, we have just completed step 3 of the algorithm it describes. The next step is to cluster the rows of the matrix A_q with k-means clustering. Essentially, we are trying to find clusters of features: features that end up in the same cluster in this matrix have similar influences on the principal components, so they are correlated and do not tell us much individually. For each such cluster we can therefore just choose the feature closest to the cluster centre as a principal feature.
We can see that the choice of the number of clusters determines how many features we choose. Y. Lu et al. recommend choosing the number of clusters p such that p >= q, because when p = q we cannot be completely sure that the amount of variance we chose is actually explained by the selected features. Here we simply opt for a slightly larger p (an arbitrary choice of 2 more than q), but there is room for improvement in terms of an automated choice, or it could be exposed as a parameter later.
Since K-means-clustering is implemented well in scikit-learn this procedure is just a couple of lines:
kmeans = KMeans(n_clusters= q + 2, random_state= 42).fit(A_q)
clusters = kmeans.predict(A_q)
cluster_centers = kmeans.cluster_centers_
clusters_readable = pd.DataFrame(clusters, index = regprops.keys(), columns = ['Cluster'])
clusters_readable
feature | Cluster
---|---
area | 2 |
centroid-0 | 5 |
centroid-1 | 6 |
centroid-2 | 1 |
feret_diameter_max | 2 |
major_axis_length | 2 |
minor_axis_length | 2 |
solidity | 4 |
mean_intensity | 0 |
max_intensity | 3 |
min_intensity | 0 |
image_stdev | 3 |
avg distance of 2 closest points | 2 |
avg distance of 3 closest points | 2 |
avg distance of 4 closest points | 2 |
touching neighbor count | 0 |
What can we see here? Each feature is now part of a cluster (numbered 0 - 6, since we have chosen 7 clusters). We can also see that quite a few features ended up in the same cluster (cluster 2), while some features have a cluster to themselves. The latter do not correlate (much, at least) with any other feature and will definitely be chosen by our feature analysis. In the clusters containing two or more features, one feature has to be chosen, and this is decided by the distance of each feature from its cluster centre. So let's determine the distances from the cluster centres for each feature:
# initialising a dictionary to keep the distances in:
dists = defaultdict(list)

# iterating through each feature (i) and its cluster (c)
for i, c in enumerate(clusters):
    # measuring the distance between each feature in A_q (A_q[i, :])
    # and its cluster center (cluster_centers[c, :])
    dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
    # each cluster is a key in the dictionary and the distance of each feature
    # belonging to that cluster is saved under this key as a tuple (feature_idx, distance)
    dists[c].append((i, dist))
dists
defaultdict(list,
{2: [(0, 0.039527541499107074),
(4, 0.22202861661398224),
(5, 0.1470198671974888),
(6, 0.13180725994276707),
(12, 0.09346378648078914),
(13, 0.06610212866152891),
(14, 0.04944098356286365)],
5: [(1, 0.0)],
6: [(2, 0.0)],
1: [(3, 0.0)],
4: [(7, 0.0)],
0: [(8, 0.217150845350435),
(10, 0.2117088702247021),
(15, 0.29315857787851873)],
3: [(9, 0.21819421392343466), (11, 0.21819421392343452)]})
In our dists dictionary, each cluster has a list of tuples associated with it. Each tuple contains (feature index, distance to cluster center). We can see that for the clusters with only one tuple associated with them (5, 6, 1 and 4), the feature has distance 0, since the cluster center IS the feature. For all other clusters we still have to determine which feature is closest to the cluster center. Once the features have been determined, we can transform our original dataset to only include these features:
# here we are getting the indices of the features with the smallest distances
# to the cluster centers, essentially choosing which features to keep
indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
# transforming the input data to only include the features we just chose
features_ = X[:, indices_]
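To see which features were actually selected, we can map these indices back to the feature names, as sketched below (the variable name 'selected_features' is only illustrative):
# look up the names of the selected principal features in the original data frame
selected_features = [regprops.keys().tolist()[i] for i in indices_]
print(selected_features)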
Implementation#
Since data science in Python is mainly done with scikit-learn, we'll stick to its conventions and create a class instead of a function for PFA. As stated above, this is a slightly modified version of the code provided by Ulf Aslak in this stackexchange thread. Thanks for your kind code donation!
class PFA(object):
    def __init__(self, diff_n_features=2, q=None, explained_var=0.95):
        self.q = q
        self.diff_n_features = diff_n_features
        self.explained_var = explained_var

    def fit(self, X):
        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA().fit(X)

        # if q was not set explicitly, determine it from the explained variance threshold
        q = self.q
        if not q:
            explained_variance = pca.explained_variance_ratio_
            cumulative_expl_var = [sum(explained_variance[:i+1]) for i in range(len(explained_variance))]
            for i, j in enumerate(cumulative_expl_var):
                if j >= self.explained_var:
                    q = i
                    break

        A_q = pca.components_.T[:, :q]

        # the number of clusters is the number of features we will keep
        clusternumber = min([q + self.diff_n_features, X.shape[1]])

        kmeans = KMeans(n_clusters=clusternumber).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # for each cluster, keep the feature closest to its cluster center
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

    def fit_transform(self, X):
        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA().fit(X)

        q = self.q
        if not q:
            explained_variance = pca.explained_variance_ratio_
            cumulative_expl_var = [sum(explained_variance[:i+1]) for i in range(len(explained_variance))]
            for i, j in enumerate(cumulative_expl_var):
                if j >= self.explained_var:
                    q = i
                    break

        A_q = pca.components_.T[:, :q]

        clusternumber = min([q + self.diff_n_features, X.shape[1]])

        kmeans = KMeans(n_clusters=clusternumber).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]
        return X[:, self.indices_]

    def transform(self, X):
        return X[:, self.indices_]
# Testing
pfa = PFA(diff_n_features=1, explained_var= 0.95)
pfa.fit_transform(regprops)
featurekeys = [regprops.keys().tolist()[i] for i in pfa.indices_]
featurekeys
['area',
'centroid-0',
'centroid-1',
'solidity',
'mean_intensity',
'image_stdev',
'touching neighbor count']
This implementation modifies the provided code in that it tries to force you to choose a subset of eigenvectors. If you don't specify the subset manually, it determines one that explains 95% of the variance, or whatever amount of variance you set. As described in the paper, the number of clusters (and with that the number of selected features) should be larger than the number of eigenvectors in the subset. This is implemented by not setting the number of features directly, but only how many more features than eigenvectors should be chosen. Apart from this, a transform function was added as well as a fit_transform function, similar to other scikit-learn estimators.
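As a small usage sketch (again with the regprops data frame from above; 'pfa_fixed' is only an illustrative name), q can also be fixed directly instead of relying on the variance threshold:
# fixing the number of eigenvectors directly instead of using a variance threshold
pfa_fixed = PFA(q=5, diff_n_features=2)
pfa_fixed.fit(regprops)

# indices_ holds the column indices of the selected features,
# features_ holds the standard-scaled data restricted to those columns
print([regprops.keys().tolist()[i] for i in pfa_fixed.indices_])
print(pfa_fixed.features_.shape)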