Principle Component Analysis
Origin
This notebook was adapted from amueller's notebook, "1 - PCA". Here is the link to his repository https://github.com/amueller/tutorial_ml_gkbionics.git .
This notebook provides examples for eigenvalues and eigenvectors in LaTeX and python.
PCA
PCA can always be used to simplify the data with high dimensions (larger than 2) into 2-dimensional data by eliminating the least influntial features on the data. However, we should know the elimination of data makes the independent variable less interpretable. Before we start to deal with the PCA, we need to first learn how PCA utilizes eigenvectors to gain a diagonalization covariance matrix.
Eigenvectors
Eigenvectors and eigenvalues are the main tools used by PCA to obtain a diagnolization covariance matrix. The eigenvector is a vector whos direction will not be affected by the linear transformation, hence eigenvectors represents the direction of largest variance of data while the eigenvalue decides the magnitude of this variance in those directions.
Here we using a simple (2x2) matrix $A$ to explain it. $$ A = \begin{bmatrix} 1 & 4 \\ 3 & 2 \end{bmatrix} $$
# importing class
import sympy as sp
import numpy as np
import numpy.linalg as lg
A = np.matrix([[1,4],[3,2]])
In general, the eigenvector $v$ of a matrix $A$ is the vector where the following holds: $$ Av = \lambda v $$ for which $\lambda$ stands for the eigenvalue such that linear transformation on $v$ can be defined by $\lambda$
Also, we can solve the equation by:
$$
Av - \lambda v = 0
v(A-\lambda I) = 0
$$
While $I$ is the identity matrix of A
In this case, if $v$ is none-zero vector than $Det(A - \lambda I) = 0$, since it cannot be invertible, and we can solve $v$ for $A$ depends on this relationship.
$$
I = \begin{bmatrix}
1 & 0
0 & 1
\end{bmatrix}
$$
def solveLambda(A = A,Lambda = sp.symbols("Lambda", real = True) ):
I = AA.I
I = np.around(I, decimals =0)
return (A - LambdaI)
Lambda = sp.symbols("Lambda", real = True)
B = solveLambda(A = A, Lambda = Lambda)
B
3 & 2 - \lambda \end{bmatrix}
$$
To solve the $\lambda$ we can use the function solve in sympy or calculating.
function = Lambda*2 - 3Lambda - 10
answer = sp.solve(function, Lambda)
answer
In this case, $\lambda_1 = -2$ and $\lambda_2 = 5$, and we can figure out the eigenvectors in two cases.
For $\lambda_1 = -2$
identity = np.identity(len(A))
eigenvectors_1 = A - answer[0]*identity
eigenvectors_1
Based on the matrix we can infer the eigenvector can be
$$
v_1 = \begin{bmatrix}
-4
3\end{bmatrix}
$$
For $\lambda = 5$
eigenvectors_2 = A - answer[1]*identity
eigenvectors_2
Based on the matrix we can infer the eigenvector can be
$$
v_2 = \begin{bmatrix}
1
1\end{bmatrix}
$$
All in all, the covariance matrix $A’$ now can be:
$$
A’ = v * A
$$
Such that we can obtain the matrix $V$
$$
V = \begin{bmatrix}
-4 & 1
3 & 1
\end{bmatrix}
$$
where $A’ = V^{-1} A V$ for the diagnalization:
V = np.matrix([[-4,1],[3,1]])
diagnalization = V.I A V
diagnalization
Hence, the diagonalization covariance matrix is
$$
\begin{bmatrix}
-2 & 0
0 & 5
\end{bmatrix}
$$
Luckily, PCA can do all of this by applyng the function pca.fit_transform(x)
and np.cov()
import numpy as np
import matplotlib.pyplot as plt
Cov = np.array([[2.9, -2.2], [-2.2, 6.5]])
X = np.random.multivariate_normal([1,2], Cov, size=200)
X
np.set_printoptions(4, suppress=True) # show only four decimals
print (X[:10,:]) # print the first 10 rows of X (from 0 to 9)
We round the whole data for only 4 decimals.
However, there is no obvious relationship based on this 2-dimensional data, hence we plot it.
plt.figure(figsize=(4,4))
plt.scatter(X[:,0], X[:,1], c= "b", edgecolor = "black")
plt.axis('equal') # equal scaling on both axis;
We can have a look at the actual covariance matrix,as well:
print (np.cov(X,rowvar=False))
Running PCA
We would now like to analyze the directions in which the data varies most. For that, we
- place the point cloud in the center (0,0) and
- rotate it, such that the direction with most variance is parallel to the x-axis.
Both steps can be done using PCA, which is conveniently available in sklearn.
We start by loading the PCA class from the sklearn package and creating an instance of the class:
from sklearn.decomposition import PCA
pca = PCA()
Now, pca
is an object which has a function pca.fit_transform(x)
which performs both steps from above to its argument x
, and returns the centered and rotated version of x
.
X_pca = pca.fit_transform(X)
pca.components_
pca.mean_
plt.figure(figsize=(4,4))
plt.scatter(X_pca[:,0], X_pca[:,1],c = "b", edgecolor = "black")
plt.axis('equal');
The covariances between different axes should be zero now. We can double-check by having a look at the non-diagonal entries of the covariance matrix:
print (np.cov(X_pca, rowvar=False))
High-Dimensional Data
Our small example above was very easy, since we could get insight into the data by simply plotting it. This approach will not work once you have more than 3 dimensions, Let’s use the famous iris dataset, which has the following 4 dimensions:
- Sepal Length
- Sepal Width
- Pedal Length
- Pedal Width
from io import open
data = open('bezdekIris.data', 'r').readlines()
iris_HD = np.matrix([np.array(val.split(',')[:4]).astype(float) for val in data[:-1]])
Lets look at the data again. First, the raw data:
print (iris_HD[:10])
Since each dimension has different scale in the Iris Database, we can use StandardScaler
to standard the unit of all dimension onto unit scale.
from sklearn.preprocessing import StandardScaler
iris_HD = StandardScaler().fit_transform(iris_HD)
iris_HD
We can also try plot a few two-dimensional projections, with combinations of 2 features at a time:
colorClass = [val.split(',')[-1].replace('\n', '') for val in data[:-1]]
for i in range(len(colorClass)):
val = colorClass[i]
if val == 'Iris-setosa':
colorClass[i] ='r'
elif val == 'Iris-versicolor':
colorClass[i] ='b'
elif val == 'Iris-virginica':
colorClass[i] ='g'
plt.figure(figsize=(8,8))
for i in range(0,4):
for j in range(0,4):
plt.subplot(4, 4, i * 4 + j + 1)
plt.scatter(iris_HD[:,i].tolist(), iris_HD[:,j].tolist(),c = colorClass, edgecolors = "black")
plt.axis('equal')
plt.gca().set_aspect('equal')
It is not easy to see that this is still a two-dimensional dataset!
However, if we now do PCA on it, you’ll see that the last two dimensions do not matter at all:
pca = PCA()
X_HE = pca.fit_transform(iris_HD)
print (X_HE[:10,:])
By looking at the data after PCA, it is easy to see the value of last two dimension, especially the last one, is pretty small such that the data can be considered as still only two-dimensional. To prove this we can use the code PCA(0.95)
to told PCA choose the least number of PCA components such that 95% of the data can be kept.
Lets give a try on it!
pca = PCA(0.95)
X_95 = pca.fit_transform(iris_HD)
print (X_95[:10,:])
We can see that PCA eliminate the last two dimension cause they are redundant under our requirment. Let’s plot the two dimension
plt.figure(figsize=(4,4))
plt.scatter(X_HE[:,0], X_HE[:,1], c = colorClass, edgecolor = "black")
plt.axis('equal')
plt.gca().set_aspect('equal')
We can have a look on the relationship between each dimention from following plots.
plt.figure(figsize=(8,8))
for i in range(4):
for j in range(4):
plt.subplot(4, 4, i * 4 + j + 1)
plt.scatter(X_HE[:,i], X_HE[:,j], c = colorClass, edgecolor = "black")
plt.gca().set_xlim(-40,40)
plt.gca().set_ylim(-40,40)
plt.axis('equal')
plt.gca().set_aspect('equal')
It is easy to see that the correlation between other dimensions (other than first two) was ambiguous and highly concentrated in either horizontal or vertical line. This fact suggests that there are large difference between the dimension we select so that the weak dimension cant change too much on the shape of graph.
pca = PCA(1) # only keep one dimension!
X_E = pca.fit_transform(iris_HD)
print (X_E[:10,:])
Now lets plot the reconstructed data and compare to the original data D. We plot the original data in red, and the reconstruction with only one dimension in blue:
X_reconstructed = pca.inverse_transform(X_E)
plt.figure(figsize=(8,8))
for i in range(4):
for j in range(4):
plt.subplot(4, 4, i * 4 + j + 1)
plt.scatter(iris_HD[:,i].tolist(), iris_HD[:,j].tolist(),c=colorClass, edgecolor = "black")
plt.scatter(X_reconstructed[:,i], X_reconstructed[:,j],c='purple', edgecolor = "black")
plt.axis('equal')
pca=PCA(6)
DATA = np.dot(X,np.random.uniform(0.2,3,(2,6))(np.random.randint(0,2,(2,6))2-1))
DATA
## Answer:
pca=PCA(6)
DATA = np.dot(X,np.random.uniform(0.2,3,(2,6))(np.random.randint(0,2,(2,6))2-1))
DATA2 = pca.fit_transform(DATA)
plt.figure(figsize=(4,4))
plt.scatter(DATA2[:,0], DATA2[:,1], c = "b", edgecolor = "black")
plt.axis('equal')
plt.gca().set_aspect('equal')
Contributers: Linghao Dong, Josh Beck, Jose Figueroa, Yuvraj Chopra