PCA

Understanding Principal Component Analysis
ML
PCA
Covariance
Author

François de Ryckel

Published

October 6, 2024

Modified

October 6, 2024

Principal Component Analysis (PCA) is a widely used method for reducing the dimensionality of a dataset and for de-correlating its variables. It can also be used to weigh the importance of variables. PCA transforms the original variables into a new set of uncorrelated variables called principal components.

Note

According to the Hughes phenomenon, if the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of a machine learning model first increases, but beyond a certain point it tends to decrease.

The Curse of Dimensionality

PCA takes the data and looks for a direction (say, a vector l) such that the variance of the points projected onto l is maximal.
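To make the idea of a direction of maximum variance concrete, here is a minimal sketch in Python. It assumes the five-row score table used in Example 1 of this post; the helper name `projected_variance` is hypothetical and introduced only for illustration. It compares the variance of the data projected onto each original axis with the variance along the leading eigenvector of the covariance matrix.

```python
import numpy as np

# Scores used in Example 1 (rows = students; columns = english, math, art).
X = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

# Centre each column so that every projection has zero mean.
Xc = X - X.mean(axis=0)

def projected_variance(direction):
    """Sample variance of the centred data projected onto a direction."""
    unit = direction / np.linalg.norm(direction)
    return np.var(Xc @ unit, ddof=1)

# Variance along each of the three original axes.
axis_variances = [projected_variance(e) for e in np.eye(3)]

# eigh returns eigenvalues in ascending order, so the last column
# is the eigenvector with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
best_direction = eigenvectors[:, -1]
best_variance = projected_variance(best_direction)

print(axis_variances)
print(best_variance)
```

The variance along the leading eigenvector should be at least as large as the variance along any of the original axes, which is the property PCA exploits.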

This is unsupervised learning, so we do not need labels for the dataset.

Let’s take an example without labels.

Example 1

In our very basic fictitious example, we have 3 variables.

library(dplyr)

df <- tibble(english = c(90, 90, 60, 60, 30), 
             math = c(60, 90, 60, 60, 30), 
             art = c(90, 30, 60, 90, 30))

df
# A tibble: 5 × 3
  english  math   art
    <dbl> <dbl> <dbl>
1      90    60    90
2      90    90    30
3      60    60    60
4      60    60    90
5      30    30    30
import pandas as pd

df_py = pd.DataFrame({'english': [90, 90, 60, 60, 30], 
                      'math': [60, 90, 60, 60, 30], 
                      'art': [90, 30, 60, 90, 30]})

df_py
   english  math  art
0       90    60   90
1       90    90   30
2       60    60   60
3       60    60   90
4       30    30   30

Step 1: find the mean of each variable

df |> summarise(across(everything(), mean))
# A tibble: 1 × 3
  english  math   art
    <dbl> <dbl> <dbl>
1      66    60    60
# or another way
colMeans(as.matrix(df))
english    math     art 
     66      60      60 
df_py.mean()
english    66.0
math       60.0
art        60.0
dtype: float64

Step 2: Compute the Covariance matrix of the whole dataset

As a reminder, the covariance between 2 variables $X$ and $Y$ is

$$cov(X, Y) = \frac{1}{n-1} \cdot \sum_{i=1}^{n} \left( (x_i - \bar{x}) (y_i - \bar{y}) \right)$$

So let’s show the covariance of English and Math.

  • Mean of english $= 66$
  • Mean of math $= 60$

$$\frac{(90 - 66) \cdot (60-60) + (90 - 66) \cdot (90-60) + (60 - 66) \cdot (60-60) + (60 - 66) \cdot (60-60) + (30 - 66) \cdot (30-60)}{4}$$

$$= \frac{24 \cdot 0 + 24 \cdot 30 + (-6) \cdot 0 + (-6) \cdot 0 + (-36) \cdot (-30)}{4}$$

$$= \frac{0 + 720 + 0 + 0 + 1080}{4} = \frac{1800}{4} = 450$$

cov(df)
        english math art
english     630  450 225
math        450  450   0
art         225    0 900
# or using matrix
cov(as.matrix(df))
        english math art
english     630  450 225
math        450  450   0
art         225    0 900
df_py.cov()
         english   math    art
english    630.0  450.0  225.0
math       450.0  450.0    0.0
art        225.0    0.0  900.0
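The covariance formula can also be applied by hand, which is a useful check on the built-in functions. The sketch below uses plain Python; the helper name `sample_cov` is hypothetical and introduced only for illustration.

```python
english = [90, 90, 60, 60, 30]
math_scores = [60, 90, 60, 60, 30]
art = [90, 30, 60, 90, 30]

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

cov_english_math = sample_cov(english, math_scores)
cov_math_art = sample_cov(math_scores, art)
var_english = sample_cov(english, english)  # covariance of a variable with itself is its variance

print(cov_english_math)  # 450.0
print(cov_math_art)      # 0.0
print(var_english)       # 630.0
```

These three values match the corresponding entries of the covariance matrices printed by R and pandas.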

Using matrices, another way to compute the covariance matrix is the following:

$$\frac{1}{n-1} \left( \textbf{X} - \bar{X} \right)^T \cdot \left( \textbf{X} - \bar{X} \right)$$
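As a minimal sketch, the matrix form can be written in NumPy as follows. It assumes the data matrix has one observation per row, as in the data frame of this example, and that the column means are subtracted from every row.

```python
import numpy as np

# Rows = students; columns = english, math, art.
X = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

n = X.shape[0]

# Subtract the column means from every row (broadcasting).
X_centered = X - X.mean(axis=0)

# Covariance matrix: (X - mean)^T (X - mean) / (n - 1).
cov_matrix = X_centered.T @ X_centered / (n - 1)

print(cov_matrix)
```

The result should agree with `np.cov(X, rowvar=False)`, which uses the same `n - 1` denominator by default.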

Remember, the positive covariance between math and english indicates that the two subjects vary in the same direction. The zero covariance between math and art indicates that there is no linear relationship between the art and math scores.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix

Recall that an eigenvector $v$ of a matrix $\textbf{A}$ and its eigenvalue $\lambda$ satisfy the following relationships:

$$\textbf{A} \cdot v = \lambda \cdot v$$

$$\left( \textbf{A} - \lambda \textbf{I} \right) v = 0$$

$$det\left( \textbf{A} - \lambda \textbf{I} \right) = 0$$

eigen(as.matrix(cov(df)))
eigen() decomposition
$values
[1] 1137.58744  786.38798   56.02458

$vectors
           [,1]       [,2]       [,3]
[1,] -0.6558023 -0.3859988  0.6487899
[2,] -0.4291978 -0.5163664 -0.7410499
[3,] -0.6210577  0.7644414 -0.1729644
import numpy as np

df_py_cov_mat = df_py.cov()

eigenvalues, eigenvectors = np.linalg.eig(df_py_cov_mat)
print(eigenvalues)
[  56.02457535 1137.5874413   786.38798335]
print(eigenvectors)
[[ 0.6487899  -0.65580225 -0.3859988 ]
 [-0.74104991 -0.4291978  -0.51636642]
 [-0.17296443 -0.62105769  0.7644414 ]]
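The defining relationships can be checked numerically. The sketch below is an illustration rather than part of the original workflow: it uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix, and verifies for each eigenpair that the residual of the eigenvector equation and the determinant of the shifted matrix are both close to zero.

```python
import numpy as np

# Covariance matrix computed in Step 2.
A = np.array([[630.0, 450.0, 225.0],
              [450.0, 450.0,   0.0],
              [225.0,   0.0, 900.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)

residuals = []
determinants = []
for k in range(len(eigenvalues)):
    lam = eigenvalues[k]
    v = eigenvectors[:, k]          # eigenvectors are stored as columns
    # Norm of A v - lambda v, which should vanish for an eigenpair.
    residuals.append(np.linalg.norm(A @ v - lam * v))
    # Determinant of A - lambda I, which should vanish at an eigenvalue.
    determinants.append(np.linalg.det(A - lam * np.eye(3)))

print(residuals)
print(determinants)
```

Both lists should contain values that are zero up to floating-point rounding.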

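In the R output, the first eigenvalue is the highest (its eigenvector explains the most variance), the second one is the second highest and the third one is last. This is not serendipity: R's eigen() returns the eigenvalues sorted in decreasing order. NumPy's np.linalg.eig makes no such guarantee, as the Python output above shows, so there the eigenvalues and their eigenvectors have to be sorted explicitly.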
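Since NumPy's `np.linalg.eig` does not promise any particular ordering of the eigenvalues, a common approach is to sort the eigenpairs explicitly. A minimal sketch, using the covariance matrix from Step 2:

```python
import numpy as np

A = np.array([[630.0, 450.0, 225.0],
              [450.0, 450.0,   0.0],
              [225.0,   0.0, 900.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Indices that order the eigenvalues from largest to smallest.
order = np.argsort(eigenvalues)[::-1]

eigenvalues_sorted = eigenvalues[order]
# Reorder the columns so each eigenvector stays paired with its eigenvalue.
eigenvectors_sorted = eigenvectors[:, order]

print(eigenvalues_sorted)
print(eigenvectors_sorted)
```

After sorting, the columns should match the R output up to sign, since an eigenvector is only defined up to a scalar factor.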

This is a 3D space in which each eigenvector is orthogonal to the others. More generally, because a covariance matrix is symmetric, its eigenvectors in an N-dimensional space can always be chosen to be mutually orthogonal.
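Orthogonality is easy to verify: if the unit-length eigenvectors are stored as the columns of a matrix, the product of its transpose with itself should be the identity matrix. A short sketch, assuming the covariance matrix from Step 2 (the names `W` and `gram` are chosen here for illustration):

```python
import numpy as np

A = np.array([[630.0, 450.0, 225.0],
              [450.0, 450.0,   0.0],
              [225.0,   0.0, 900.0]])

# eigh is designed for symmetric matrices and returns orthonormal eigenvectors.
_, W = np.linalg.eigh(A)

# Entry (i, j) is the dot product of eigenvector i with eigenvector j.
gram = W.T @ W

print(np.round(gram, 10))
```

Off-diagonal entries close to zero confirm that the eigenvectors are pairwise orthogonal, and diagonal entries close to one confirm that each has unit length.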

To find the percentage of variance explained by the eigenvalue $\lambda_k$ (where $k$ is one of the dimensions), we divide it by the sum of all $n$ eigenvalues:

$$\frac{\lambda_k}{\sum_{i=1}^{n} \lambda_i}$$
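Applied to the three eigenvalues found in Step 3, the ratio can be computed with a few lines of plain Python. The values are copied from the R output above; the variable names are chosen here for illustration.

```python
# Eigenvalues from Step 3, in decreasing order.
eigenvalues = [1137.58744, 786.38798, 56.02458]

total_variance = sum(eigenvalues)

# Share of the total variance carried by each principal component.
explained = [lam / total_variance for lam in eigenvalues]

# Running total, useful for deciding how many components to keep.
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

print([round(share, 3) for share in explained])
print([round(total, 3) for total in cumulative])
```

The first two components together account for roughly 97% of the total variance, so little information would be lost by dropping the third.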

Step 4: Compute the new data frame based on the principal components.

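To transform the data into the new subspace, we use: $\textbf{W}^T \cdot X$

  • $X$ is our initial data matrix (our df in the above steps). For the product to be defined, $X$ must hold one observation per column; with one observation per row, as in df, the equivalent product is $X \cdot \textbf{W}$.
  • $\textbf{W}^T$ is the transpose of the eigenvector matrix.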
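The projection step can be sketched as follows. Two assumptions are made here that the text does not state explicitly: the data are centred before projecting, as is standard in PCA, and the eigenvectors are sorted by decreasing eigenvalue. With one observation per row, the scores are obtained by multiplying the centred data by the eigenvector matrix.

```python
import numpy as np

# Rows = students; columns = english, math, art.
X = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

# Assumption: centre the data before projecting.
X_centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Sort eigenpairs by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
W = eigenvectors[:, order]

# Row layout: scores = X W. This equals (W^T X^T)^T in the column layout.
scores = X_centered @ W

print(np.round(scores, 3))
print(np.round(np.cov(scores, rowvar=False), 3))
```

The covariance matrix of the scores should be diagonal, with the eigenvalues on the diagonal: the new variables are uncorrelated, which is the de-correlation property mentioned at the start of the post.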

Another Example.

x_mat <- matrix(NA, nrow = 10, ncol = 2)
x_mat[, 1] <- c(14.3, 12.2, 13.7, 12.0, 13.4, 14.3, 13.0, 14.8, 11.1, 14.3)
x_mat[, 2] <- c(7.5, 5.5, 6.7, 5.1, 5.2, 6.3, 7.6, 7.3, 5.3, 7.2)

Find the covariance matrix.
Remember we d