Understanding Principal Component Analysis

Categories: ML, PCA, Covariance

Author: François de Ryckel
Published: October 6, 2024
Modified: October 6, 2024

Principal Component Analysis (PCA) is a widely used method to reduce the dimensionality of a dataset and to de-correlate its variables. It can also be used to weight the importance of variables. PCA transforms the original variables into a new set of variables called principal components.

PCA takes the data and tries to find a direction (say, a vector $l$) such that the variance of the points projected onto $l$ is maximal.
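To make that idea concrete, here is a small sketch (not from the original walk-through) on a made-up two-variable data set: the variance of the projections changes with the direction, and the first principal component returned by prcomp() is the direction where that variance is largest.

set.seed(42)
x <- rnorm(200)
toy <- cbind(x, 0.8 * x + rnorm(200, sd = 0.3))   # two correlated, made-up variables

proj_var <- function(data, angle) {
  l <- c(cos(angle), sin(angle))     # unit vector pointing in that direction
  var(as.vector(data %*% l))         # variance of the points projected onto l
}

# variance of the projections for a few candidate directions
sapply(c(0, pi/6, pi/4, pi/3), proj_var, data = toy)

# the first principal component is the direction that maximises this variance
prcomp(toy)$rotation[, 1]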

PCA is unsupervised learning, so we don't need labels for the data set.

Let's take an example without labels.

Example 1

In our very basic fictitious example, we have 3 variables.

library(dplyr)

df <- tibble(english = c(90, 90, 60, 60, 30), 
             math = c(60, 90, 60, 60, 30), 
             art = c(90, 30, 60, 90, 30))

df
# A tibble: 5 × 3
  english  math   art
    <dbl> <dbl> <dbl>
1      90    60    90
2      90    90    30
3      60    60    60
4      60    60    90
5      30    30    30
import pandas as pd

df_py = pd.DataFrame({'english': [90, 90, 60, 60, 30], 
                      'math': [60, 90, 60, 60, 30], 
                      'art': [90, 30, 60, 90, 30]})

df_py
   english  math  art
0       90    60   90
1       90    90   30
2       60    60   60
3       60    60   90
4       30    30   30

Step 1: Find the mean of each variable

df |> summarise(across(everything(), mean))
# A tibble: 1 × 3
  english  math   art
    <dbl> <dbl> <dbl>
1      66    60    60
# or another way
colMeans(as.matrix(df))
english    math     art 
     66      60      60 
df_py.mean()
english    66.0
math       60.0
art        60.0
dtype: float64

Step 2: Compute the Covariance matrix of the whole dataset

As a reminder, we find the covariance between two variables $X$ and $Y$ as $cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y})$

So let’s show the covariance of English and Math.

  • Mean of english $= 66$
  • Mean of math $= 60$

$$cov(english, math) = \frac{(90 - 66)(60-60) + (90 - 66)(90-60) + (60 - 66)(60-60) + (60 - 66)(60-60) + (30 - 66)(30-60)}{4}$$

$$= \frac{24 \cdot 0 + 24 \cdot 30 + (-6) \cdot 0 + (-6) \cdot 0 + (-36) \cdot (-30)}{4}$$

$$= \frac{0 + 720 + 0 + 0 + 1080}{4} = \frac{1800}{4} = 450$$
cov(df)
        english math art
english     630  450 225
math        450  450   0
art         225    0 900
# or using matrix
cov(as.matrix(df))
        english math art
english     630  450 225
math        450  450   0
art         225    0 900
df_py.cov()
         english   math    art
english    630.0  450.0  225.0
math       450.0  450.0    0.0
art        225.0    0.0  900.0

Using matrices, another way to compute the covariance matrix is the following:

$$\frac{1}{n-1} \left( \mathbf{X} - \bar{X} \right)^T \left( \mathbf{X} - \bar{X} \right)$$
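As a quick check (a sketch, not part of the original walk-through), that matrix formula reproduces cov(df) in R:

# centre each column, then apply the matrix formula above
X <- as.matrix(df)
X_centered <- sweep(X, 2, colMeans(X))
t(X_centered) %*% X_centered / (nrow(X) - 1)
# this returns the same matrix as cov(df) above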

Remember, the positive covariance between math and english indicates that both subjects covary in the same direction. And the null covariance between math and art indicates that there is no linear relationship between the art and math scores.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix

Recall that an eigenvector $v$ and its eigenvalue $\lambda$ satisfy the following relationships:

$$\mathbf{A} \cdot v = \lambda \cdot v$$
$$\left( \mathbf{A} - \lambda \mathbf{I} \right) v = 0$$
$$\det\left( \mathbf{A} - \lambda \mathbf{I} \right) = 0$$

eigen(as.matrix(cov(df)))
eigen() decomposition
$values
[1] 1137.58744  786.38798   56.02458

$vectors
           [,1]       [,2]       [,3]
[1,] -0.6558023 -0.3859988  0.6487899
[2,] -0.4291978 -0.5163664 -0.7410499
[3,] -0.6210577  0.7644414 -0.1729644
import numpy as np

df_py_cov_mat = df_py.cov()

eigenvalues, eigenvectors = np.linalg.eig(df_py_cov_mat)
print(eigenvalues)
[  56.02457535 1137.5874413   786.38798335]
print(eigenvectors)
[[ 0.6487899  -0.65580225 -0.3859988 ]
 [-0.74104991 -0.4291978  -0.51636642]
 [-0.17296443 -0.62105769  0.7644414 ]]

Note that R's eigen() returns the eigenvalues in decreasing order, so the first eigenvector explains the most variance, the second one the second most, and so on. NumPy's np.linalg.eig() gives no such guarantee, as the Python output above shows.

We are in a 3D space here, and each eigenvector is orthogonal to the others. In an N-dimensional space, the eigenvectors of the (symmetric) covariance matrix are mutually orthogonal.
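A quick sanity check in R (a sketch) confirms both properties on our covariance matrix:

A <- cov(as.matrix(df))
e <- eigen(A)
# A %*% v equals lambda * v for the first eigen pair (up to numerical noise)
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]
# the eigenvectors are orthonormal, so this is the identity matrix
round(t(e$vectors) %*% e$vectors, 10)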

To find the percentage of variance explained by the eigenvalue $\lambda_k$ (where $k$ indexes one of the dimensions), we compute:

$$\frac{\lambda_k}{\sum_{i=1}^{n} \lambda_i}$$
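In R, that ratio is a one-liner on the eigenvalues (a quick sketch):

eig <- eigen(cov(df))
# proportion of variance explained by each component
eig$values / sum(eig$values)
# roughly 0.57, 0.40 and 0.03: the first two components capture about 97% of the variance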

Step 4: Compute the new data frame based on the principal components.

To project our data onto the new subspace, we compute $\mathbf{W}^T \cdot X$ (a worked sketch follows the list below), where:

  • $X$ is our initial data matrix (our df in the steps above)
  • $\mathbf{W}^T$ is the transpose of the eigenvector matrix
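Here is a minimal sketch in R, assuming we centre the data first; with observations stored as rows, $\mathbf{W}^T X^T$ is just the transpose of X_centered %*% W. As a check, prcomp() returns the same scores, up to the sign of each component.

# project the centred data onto the eigenvectors (the columns of W)
X_centered <- scale(as.matrix(df), center = TRUE, scale = FALSE)
W <- eigen(cov(df))$vectors
scores <- X_centered %*% W
scores

# prcomp() performs the same projection (columns may differ in sign)
prcomp(df)$x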

Another Example

x_mat <- matrix(NA, nrow = 10, ncol = 2)
x_mat[, 1] <- c(14.3, 12.2, 13.7, 12.0, 13.4, 14.3, 13.0, 14.8, 11.1, 14.3)
x_mat[, 2] <- c(7.5, 5.5, 6.7, 5.1, 5.2, 6.3, 7.6, 7.3, 5.3, 7.2)

Find the covariance matrix.
Remember we d