Principal Feature Analysis#

Ryan Savill, July 8th 2021

Background#

Principal feature analysis (PFA) is a method for selecting a subset of features that describes most of the variability in a dataset. It is based on this paper by Y. Lu et al., published at ACM Multimedia in 2007. Most of the code for the implementation below was provided in a stackexchange comment section and is greatly appreciated!

This sounds oddly similar to principal component analysis (PCA), which is no coincidence, as the two methods are closely intertwined. PCA does something similar, but instead of choosing a subset of the original features that explains the variability up to a certain threshold, it chooses a subset of principal components. This is not always optimal: since the principal components are a transformed version of the data, there are valid reasons to work with the original features instead.

If you have not heard of PCA before, I would advise you to check out this video. If you (like me just a few days ago) understand the principle of PCA but not the math behind it, I'd advise you to take some time to grasp the concepts. I found a great series of videos that explains all the necessary pieces of the puzzle that is PCA. Each video is only about 10 minutes long and well explained, even if your math lectures were a few years ago!

Once you have grasped these concepts this explanation should be a breeze!

Principal feature analysis takes advantage of some of the transformations in PCA to find out which features explain the variance. The first steps in PFA are to compute the covariance matrix of the data, find its orthonormal eigenvectors and sort them by their eigenvalues in descending order, giving a matrix A whose columns are the sorted eigenvectors. This is in effect what PCA does as well, since the principal components are the eigenvectors of the covariance matrix; in scikit-learn we can access them via the '.components_' attribute.
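
As a quick aside, here is a minimal sketch on synthetic data (X_demo and pca_demo are made-up names for illustration) showing that the columns of A obtained from an eigendecomposition of the covariance matrix match scikit-learn's '.components_' up to sign flips:

# minimal check on synthetic data: pca.components_ contains the eigenvectors
# of the covariance matrix, sorted by descending eigenvalue
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 4)))

# eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(eigenvalues)[::-1]      # sort by descending eigenvalue
A = eigenvectors[:, order]                 # columns = sorted eigenvectors

# the same matrix from scikit-learn (rows = eigenvectors), up to sign flips
pca_demo = PCA().fit(X_demo)
print(np.allclose(np.abs(A), np.abs(pca_demo.components_.T)))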

# importing everything we might need and more:
from sklearn import decomposition
import tribolium_clustering as tc
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# function for labelling the eigenvector matrix so it is readable:
def pca_comp_inf_dataframe(pca_object, regprops_df, n_components):
    features = regprops_df.keys()
    component_names = ['PC{}'.format(i+1) for i in range(n_components)]
    comps = pca_object.components_
    df_comps = pd.DataFrame(comps, index=component_names, columns=features)

    return df_comps

# loading our test dataset
location_prefix = 'data/folder/'

timepoint = 9
regionpropspath = location_prefix + 'Master Thesis//First Coding Tries//regionprops_all_timepoints_lund//regprops t{}.csv'.format(timepoint)
regprops = tc.readcsv_as_cl_input(regionpropspath)

# redefinition of our dataset
X = regprops

# standardscaling the dataset is a prerequisite for PCA
sc = StandardScaler()
X = sc.fit_transform(X)

# fitting our PCA
pca = PCA().fit(X)

# getting our orthonormal eigenvectors:
A = pca.components_.T

# labelling that matrix for ease of use
A_readable = pca_comp_inf_dataframe(pca,regprops,len(regprops.keys())).transpose()

A_readable
No Predictions in Regionprops of data/folder/Master Thesis/First Coding Tries/regionprops_all_timepoints_lund/regprops t9.csv
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
area 0.317564 0.035871 0.001643 -0.060553 -0.099831 0.060197 0.121308 -0.070996 -0.059213 -0.200579 0.358877 0.496485 0.533148 0.376162 0.126583 -0.007435
centroid-0 0.031897 0.394748 -0.671168 0.066552 -0.171701 -0.245463 -0.296170 -0.423315 -0.143398 0.086719 0.030823 -0.041805 -0.005947 0.007349 0.026808 0.005131
centroid-1 0.008123 0.382147 0.488175 0.230320 -0.613868 -0.143166 -0.356966 0.192595 0.011508 -0.013926 0.001356 0.008315 -0.007246 0.000669 -0.015394 -0.000195
centroid-2 0.240592 -0.076633 -0.435728 0.075588 -0.140601 -0.327701 0.114531 0.634164 0.393689 -0.174994 -0.084351 0.026689 -0.018534 -0.022879 -0.062223 -0.007310
feret_diameter_max 0.289769 -0.012282 -0.032635 0.181363 -0.160207 0.380691 0.084649 -0.211788 0.439432 0.320636 -0.163147 0.056981 -0.363715 0.448382 0.007851 -0.012093
major_axis_length 0.301739 0.002370 -0.021720 0.113154 -0.135457 0.346191 0.039834 -0.195065 0.315525 0.010066 0.049534 -0.247254 0.419644 -0.608955 -0.086439 0.013967
minor_axis_length 0.308058 0.048526 0.032858 -0.149205 -0.075049 0.130833 0.027726 -0.170614 -0.094954 -0.740310 0.122889 -0.086970 -0.489703 -0.077289 -0.034946 -0.005721
solidity 0.139750 0.213562 0.179860 -0.784037 0.109756 -0.245965 -0.162139 -0.093897 0.399242 0.142184 -0.016268 -0.005970 0.007210 -0.012330 0.005534 -0.001295
mean_intensity -0.298859 -0.031721 -0.044582 -0.053140 -0.290490 -0.081776 0.366782 -0.070965 0.147539 0.210029 0.577876 0.280034 -0.316568 -0.304428 0.055622 0.001809
max_intensity -0.223891 0.445127 -0.015072 -0.078562 -0.095429 0.088448 0.504917 -0.064260 -0.004226 -0.155993 -0.573306 0.308846 0.074610 -0.118217 -0.036991 0.002788
min_intensity -0.248677 -0.277709 -0.003527 -0.151868 -0.443942 -0.187537 0.304523 -0.223142 0.067566 -0.152249 -0.039673 -0.525101 0.223898 0.328584 -0.005177 -0.002429
image_stdev -0.124373 0.597732 -0.026935 0.059831 0.275986 0.222842 0.207255 0.258350 0.080221 -0.016470 0.376158 -0.432514 0.043550 0.222652 0.025638 -0.004268
avg distance of 2 closest points 0.312403 0.040115 0.025366 -0.113510 -0.083929 -0.065002 0.223752 0.060992 -0.355553 0.261325 0.057316 -0.050857 -0.037707 0.038347 -0.716577 0.324435
avg distance of 3 closest points 0.315714 0.031638 0.028109 -0.084834 -0.084732 -0.064296 0.212046 0.076844 -0.308744 0.228016 -0.034887 -0.121603 -0.049914 -0.068178 0.120108 -0.802665
avg distance of 4 closest points 0.316944 0.023361 0.033518 -0.059729 -0.078356 -0.068524 0.204538 0.092240 -0.253916 0.188049 -0.088972 -0.156060 -0.072032 -0.097133 0.660690 0.499889
touching neighbor count -0.190434 -0.078438 -0.284396 -0.437245 -0.339335 0.594522 -0.243295 0.334427 -0.201843 0.052019 -0.024418 0.025692 0.012726 0.013417 0.028237 0.006079

Now we have a massive matrix of our eigenvectors sorted by their eigenvalues. If we kept all of them, every feature would carry the same overall weight in A (the rows of the full, orthogonal matrix all have unit length), so we take a subset of matrix A called A_q. Here q is the number of eigenvectors we take into account, and the value of q determines how much variance we can explain with this subset of the principal components. We can turn the question around: set a threshold for how much variance should at least be explained, and then find the corresponding subset A_q of matrix A:

# this shows us how much variance each principal component explains
explained_variance = pca.explained_variance_ratio_
print('Explained Variances')
print(explained_variance)
print('')

# this shows us how much variance the subset up to the i'th index explains
cumulative_expl_var = [sum(explained_variance[:i+1]) for i in range(len(explained_variance))]
print('Cumulative Explained Variance')
print(cumulative_expl_var)
print('')

# now we iterate through the cumulative explained variance until we reach a threshold that we define
q_variance_limit = 0.9

for i,j in enumerate(cumulative_expl_var):
    if j >= q_variance_limit:
        q = i
        break
print('The Subset will include the first {} Eigenvectors'.format(q))
Explained Variances
[5.89549241e-01 1.24817549e-01 7.44073363e-02 6.06447683e-02
 5.05242728e-02 3.80817718e-02 2.29718719e-02 1.43923774e-02
 1.16341513e-02 5.35585488e-03 2.50361397e-03 1.82220555e-03
 1.48239356e-03 1.06126637e-03 6.85699035e-04 6.56265602e-05]

Cumulative Explained Variance
[0.5895492409109632, 0.7143667903013323, 0.7887741265865403, 0.8494188949293884, 0.8999431677426732, 0.9380249394988094, 0.9609968114170524, 0.9753891887742816, 0.9870233400706391, 0.992379194950718, 0.9948828089223744, 0.9967050144743802, 0.9981874080386469, 0.9992486744047498, 0.9999343734397964, 0.9999999999999999]

The Subset will include the first 5 Eigenvectors
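
As a side note, the same q can be computed more compactly with numpy (a sketch reusing the pca object and q_variance_limit defined above):

# cumulative explained variance of the sorted principal components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# index of the first value that reaches the threshold (same q as the loop above)
q = int(np.argmax(cumulative >= q_variance_limit))
print(q)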

With our q defined we can get the subset matrix of eigenvectors A_q:

A_q = A[:,:q]

A_q_readable = pd.DataFrame(A_q,index = regprops.keys(), columns = ['PC{}'.format(i+1) for i in range(q)])

A_q_readable
PC1 PC2 PC3 PC4 PC5
area 0.317564 0.035871 0.001643 -0.060553 -0.099831
centroid-0 0.031897 0.394748 -0.671168 0.066552 -0.171701
centroid-1 0.008123 0.382147 0.488175 0.230320 -0.613868
centroid-2 0.240592 -0.076633 -0.435728 0.075588 -0.140601
feret_diameter_max 0.289769 -0.012282 -0.032635 0.181363 -0.160207
major_axis_length 0.301739 0.002370 -0.021720 0.113154 -0.135457
minor_axis_length 0.308058 0.048526 0.032858 -0.149205 -0.075049
solidity 0.139750 0.213562 0.179860 -0.784037 0.109756
mean_intensity -0.298859 -0.031721 -0.044582 -0.053140 -0.290490
max_intensity -0.223891 0.445127 -0.015072 -0.078562 -0.095429
min_intensity -0.248677 -0.277709 -0.003527 -0.151868 -0.443942
image_stdev -0.124373 0.597732 -0.026935 0.059831 0.275986
avg distance of 2 closest points 0.312403 0.040115 0.025366 -0.113510 -0.083929
avg distance of 3 closest points 0.315714 0.031638 0.028109 -0.084834 -0.084732
avg distance of 4 closest points 0.316944 0.023361 0.033518 -0.059729 -0.078356
touching neighbor count -0.190434 -0.078438 -0.284396 -0.437245 -0.339335

If you have read the paper or are following along with it, we have just completed step 3 of the algorithm. Next, we perform K-means clustering on the rows of the matrix A_q. Essentially, we are looking for clusters of features: features that end up close together in this matrix have similar influences on the principal components, are therefore correlated, and individually don't tell us much. From each cluster we can then simply pick the feature closest to the cluster centre as a principal feature.

The choice of the cluster number therefore determines how many features we select. Y. Lu et al. recommend choosing the number of clusters p such that p >= q, because with p = q we cannot be completely sure that the chosen amount of variance is actually covered. Here we simply opt for a slightly larger p (an arbitrary choice of 2 more than q); there is room for improvement in terms of an automated choice, or it could be exposed as a parameter later on.

Since K-means clustering is well implemented in scikit-learn, this procedure takes just a couple of lines:

kmeans = KMeans(n_clusters= q + 2, random_state= 42).fit(A_q)
clusters = kmeans.predict(A_q)
cluster_centers = kmeans.cluster_centers_

clusters_readable = pd.DataFrame(clusters, index = regprops.keys(), columns = ['Cluster'])
clusters_readable
Cluster
area 2
centroid-0 5
centroid-1 6
centroid-2 1
feret_diameter_max 2
major_axis_length 2
minor_axis_length 2
solidity 4
mean_intensity 0
max_intensity 3
min_intensity 0
image_stdev 3
avg distance of 2 closest points 2
avg distance of 3 closest points 2
avg distance of 4 closest points 2
touching neighbor count 0
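
To see how many features fall into each cluster, we can count the cluster labels (a quick aside using the clusters_readable table from above):

# number of features per cluster
print(clusters_readable['Cluster'].value_counts())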

What can we see here? Each feature is now part of a cluster (numbered 0 - 6, since we have chosen 7 clusters). We can also see that quite a few features share the same cluster (cluster 2), while some features sit alone in their own cluster. These lone features do not correlate (much, at least) with any other feature and will definitely be selected by our feature analysis. For the clusters containing two or more features, one feature has to be chosen per cluster, and this is decided by the distance of each feature to its cluster centre. So let's determine the distances from the cluster centres for each feature:

# initialising a dictionary to keep the distances in:
dists = defaultdict(list)

# iterating through each feature (i) and its cluster (c)
for i, c in enumerate(clusters):
    
    # measuring the distance of each feature in A_q ([A_q[i, :]])
    # and the cluster centers [cluster_centers[c, :]]
    dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
    
    # each cluster is a key in the dictionary and the distance of each feature
    # belonging to that cluster is saved under this key as a tuple (feature_idx, distance)
    dists[c].append((i, dist))

dists
defaultdict(list,
            {2: [(0, 0.039527541499107074),
              (4, 0.22202861661398224),
              (5, 0.1470198671974888),
              (6, 0.13180725994276707),
              (12, 0.09346378648078914),
              (13, 0.06610212866152891),
              (14, 0.04944098356286365)],
             5: [(1, 0.0)],
             6: [(2, 0.0)],
             1: [(3, 0.0)],
             4: [(7, 0.0)],
             0: [(8, 0.217150845350435),
              (10, 0.2117088702247021),
              (15, 0.29315857787851873)],
             3: [(9, 0.21819421392343466), (11, 0.21819421392343452)]})

In our dists dictionary each cluster has a list of tuples associated with it, where each tuple contains (feature index, distance to cluster centre). We can see that for the clusters with only one tuple (5, 6, 1 and 4) the feature has distance 0, since the cluster centre IS the feature. For all other clusters we still have to determine which feature is closest to the cluster centre. Once these features have been determined, we can transform our original dataset to include only them:

# here we are getting the indices of the features with the smallest distances 
# to the cluster centers, essentially choosing which features to keep
indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]

# transforming the input data to only include the features we just chose
features_ = X[:, indices_]
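
To see which features were actually selected in this walkthrough, we can map the chosen indices back to the column names (reusing the regprops table and indices_ from above):

# names of the selected principal features
featurekeys_walkthrough = [regprops.keys().tolist()[i] for i in indices_]
print(featurekeys_walkthrough)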

Implementation#

Since data science in Python largely revolves around scikit-learn, we'll stick to its conventions and implement PFA as a class rather than a function. As stated above, this is a slightly modified version of the code provided by Ulf Aslak in this stackexchange thread. Thanks for your kind code donation!

class PFA(object):
    def __init__(self, diff_n_features = 2, q=None, explained_var = 0.95):
        self.q = q
        self.diff_n_features = diff_n_features
        self.explained_var = explained_var

    def fit(self, X):
        # standard scaling is a prerequisite for PCA
        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA().fit(X)

        # if q was not set manually, determine it from the explained variance threshold
        q = self.q
        if q is None:
            explained_variance = pca.explained_variance_ratio_
            cumulative_expl_var = [sum(explained_variance[:i+1]) for i in range(len(explained_variance))]
            for i, j in enumerate(cumulative_expl_var):
                if j >= self.explained_var:
                    q = i
                    break
            if q is None:
                # threshold never reached: fall back to using all components
                q = X.shape[1]

        # subset of the sorted eigenvectors
        A_q = pca.components_.T[:, :q]

        # number of clusters (= number of selected features), capped at the number of features
        clusternumber = min([q + self.diff_n_features, X.shape[1]])

        kmeans = KMeans(n_clusters=clusternumber).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # distance of every feature to its cluster centre
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        # per cluster, keep the feature closest to the centre
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

    def fit_transform(self, X):
        # fitting and transforming reuse the same computation
        self.fit(X)
        return self.features_

    def transform(self, X):
        return X[:, self.indices_]

# Testing
pfa = PFA(diff_n_features=1, explained_var= 0.95)
pfa.fit_transform(regprops)

featurekeys = [regprops.keys().tolist()[i] for i in pfa.indices_]
featurekeys
['area',
 'centroid-0',
 'centroid-1',
 'solidity',
 'mean_intensity',
 'image_stdev',
 'touching neighbor count']

This modification of the provided code tries to force you to choose a subset: if you don't specify q manually, a subset is chosen that explains 95% of the variance, or whatever amount of variance you set. As described in the paper, the number of clusters (and with that the number of selected features) should be larger than the number of eigenvectors kept; this is implemented by not setting the number of features directly, but only how many more features than eigenvectors should be chosen. Apart from this, a transform function was added as well as a fit_transform function, similar to other scikit-learn estimators.
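
As a small usage sketch (reusing the regprops table loaded above; q = 5 is just an arbitrary example value), the subset size can also be fixed manually instead of relying on the variance threshold:

# fixing the number of eigenvectors manually instead of using explained_var
pfa_manual = PFA(q=5, diff_n_features=2)
pfa_manual.fit(regprops)

# indices and names of the selected features
print(pfa_manual.indices_)
print([regprops.keys().tolist()[i] for i in pfa_manual.indices_])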