Principal component analysis
This work by Jephian Lin is licensed under a Creative Commons Attribution 4.0 International License.
$\newcommand{\trans}{^\top} \newcommand{\adj}{^{\rm adj}} \newcommand{\cof}{^{\rm cof}} \newcommand{\inp}[2]{\left\langle#1,#2\right\rangle} \newcommand{\dunion}{\mathbin{\dot\cup}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\bone}{\mathbf{1}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bd}{\mathbf{d}} \newcommand{\be}{\mathbf{e}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bq}{\mathbf{q}} \newcommand{\br}{\mathbf{r}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\tr}{\operatorname{tr}} \newcommand{\nul}{\operatorname{null}} \newcommand{\rank}{\operatorname{rank}} %\newcommand{\ker}{\operatorname{ker}} \newcommand{\range}{\operatorname{range}} \newcommand{\Col}{\operatorname{Col}} \newcommand{\Row}{\operatorname{Row}} \newcommand{\spec}{\operatorname{spec}} \newcommand{\vspan}{\operatorname{span}} \newcommand{\Vol}{\operatorname{Vol}} \newcommand{\sgn}{\operatorname{sgn}} \newcommand{\idmap}{\operatorname{id}} \newcommand{\am}{\operatorname{am}} \newcommand{\gm}{\operatorname{gm}} \newcommand{\mult}{\operatorname{mult}} \newcommand{\iner}{\operatorname{iner}}$
from lingeo import random_int_list
Let $X$ be an $N\times d$ data matrix whose sample vectors (rows) are centered at $\bzero\in\mathbb{R}^d$.
Let $C = \frac{1}{N-1}X\trans X$ be the covariance matrix.
Let $\bv$ be a unit vector in $\mathbb{R}^d$.
One may project all data points onto the direction of $\bv$.
Indeed, the projected data is $\bu = X\bv$.
Thus, the mean of $\bu$ is $\frac{1}{N}\inp{\bu}{\bone} = \frac{1}{N}\bone\trans X\bv = \frac{1}{N}\bzero\trans\bv = 0$,
and the variance of $\bu$ is $\frac{1}{N-1}\inp{\bu}{\bu} = \frac{1}{N-1}\bv\trans X\trans X\bv = \bv\trans C\bv$.
It is natural to consider the optimization problem:
maximize $\bv\trans C\bv$,
subject to $\|\bv\| = 1$.
Let $\lambda_1\geq \cdots \geq \lambda_d$ be the eigenvalues of $C$ and $\{\bv_1,\ldots,\bv_d\}$ the corresponding orthonormal eigenbasis.
According to the Rayleigh quotient theorem, the maximum value is $\lambda_1$, attained when $\bv = \bv_1$.
Therefore, $\bv_1$ is called the first principal component of the data,
and one may continue to take $\bv_2,\bv_3,\ldots$ as the next principal components.
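As a quick numerical sanity check (a minimal sketch with randomly generated data; none of the variable names below come from the notebook), one can verify that the variance of the projected data equals $\bv\trans C\bv$ and that no unit vector gives a larger value than the top eigenvector:
```python
import numpy as np

np.random.seed(1)
X = np.random.randn(100, 3)
X = X - X.mean(axis=0)                 # center the columns
N = X.shape[0]
C = X.T.dot(X) / (N - 1)               # covariance matrix

vals, vecs = np.linalg.eigh(C)         # eigenvalues in increasing order
lam1, v1 = vals[-1], vecs[:, -1]       # largest eigenvalue and its eigenvector

u = X.dot(v1)                          # data projected onto v1
print(np.isclose(u.var(ddof=1), v1.dot(C).dot(v1)))   # variance equals v^T C v
print(np.isclose(v1.dot(C).dot(v1), lam1))            # and equals lambda_1

w = np.random.randn(3)
w = w / np.linalg.norm(w)              # a random unit vector
print(w.dot(C).dot(w) <= lam1 + 1e-12) # Rayleigh quotient bound
```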
Input: a dataset represented by an $N\times d$ matrix $X$ and the desired number $k$ of principal components
Output: the principal components represented by the column vectors of a matrix $P$
A few points to emphasize:
- `scipy.linalg.eigh`
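A minimal sketch of this procedure is given below; it mirrors the sample code later in this notebook, and the function name pca as well as the choice of numpy.linalg.eigh are illustrative rather than prescribed.
```python
import numpy as np

def pca(X, k):
    """Return the first k principal components of X as the columns of P."""
    X0 = X - X.mean(axis=0)            # center the data
    N = X0.shape[0]
    C = X0.T.dot(X0) / (N - 1)         # covariance matrix
    vals, vecs = np.linalg.eigh(C)     # eigenvalues in increasing order
    vals = vals[::-1]                  # reorder from largest to smallest
    vecs = vecs[:, ::-1]
    return vecs[:, :k]                 # d x k matrix P
```
Applied to the $20\times 2$ data generated below, pca(X, 1) should return the direction of the red arrow, possibly with the opposite sign.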
Run the code below.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)
print_ans = False
mu = np.random.randint(-3,4, (2,))
cov = np.array([[1,1.9],
                [1.9,4]])
X = np.random.multivariate_normal(mu, cov, (20,))
mu = X.mean(axis=0)
X0 = X - mu                      # center the data
# u,s,vh = np.linalg.svd(X0)
C = (1 / (19)) * X0.T.dot(X0)    # covariance matrix, N - 1 = 19
vals,vecs = np.linalg.eigh(C)    # eigenvalues in increasing order
vals = vals[::-1]                # reorder from largest to smallest
vecs = vecs[:,::-1]
P = vecs[:,:1]                   # first principal component
plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*mu, c="red")
plt.arrow(*mu, *(vecs[:,0]), head_width=0.3, color="red")
xs = X0.dot(P)                   # project the centered data onto the first principal component
ys = np.zeros_like(xs)
plt.scatter(xs, ys)
pretty_print(LatexExpr("Q ="))
print(vecs)
if print_ans:
    print("red point: center")
    print("red arrow: first principal component")
    print("orange points: projection of blue points onto the red arrow")
    print("red arrow = first column of Q")
Suppose the blue points are the given data.
Try changing the seed and explain the meanings of the red point, the red arrow, and the orange points.
Let the rows of $X_0$ be the blue data points after centering,
and let $C = \frac{1}{N-1}X_0\trans X_0$ be their covariance matrix.
Suppose $C = QDQ\trans$ is a spectral decomposition of $C$, where the diagonal entries of $D$ are arranged from largest to smallest.
Explain the relation between the red arrow and $Q$.
Let $X$ be a centered $N\times d$ data matrix
and $C = \frac{1}{N-1}X\trans X$ its covariance matrix.
Let $\bv$ be a unit vector in $\mathbb{R}^d$.
Explain why $\bu = X\bv$ records the projections of all sample points of $X$ onto the direction of $\bv$,
and show that the variance of $\bu$ is $\bv\trans C\bv$.
Run the code below.
import numpy as np
X = np.random.randn(3,4)
u,s,vh = np.linalg.svd(X)                # singular value decomposition of X
vals,vecs = np.linalg.eigh(X.T.dot(X))   # eigenvalues/eigenvectors of X^T X
vals = vals[::-1]
vecs = vecs[:,::-1]
print("V =")
print(vh.T)
print("Q =")
print(vecs)
print("singular values:", s)
print("eigenvalues:", vals)
Explain why the two printed matrices are the same
(except that some columns may differ by a sign).
Explain the relation between the singular values and the eigenvalues in the output.

Explain the purpose of
```python
vals = vals[::-1]
vecs = vecs[:,::-1]
```

Sometimes we only need the largest few eigenvalues,
so computing all of them is an unnecessary waste of time.
Look up how to use linalg.eigh in SciPy
to find only the largest few eigenvalues.
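One possible approach (a sketch only; the subset_by_index keyword of scipy.linalg.eigh is available in newer SciPy versions, and the matrix below is just a random example) is to request only the eigenvalues with the largest indices:
```python
import numpy as np
from scipy import linalg

np.random.seed(0)
A = np.random.randn(100, 100)
C = A.T.dot(A)                         # a symmetric (positive semidefinite) matrix

d = C.shape[0]
k = 3                                  # how many of the largest eigenvalues we want
# eigh returns eigenvalues in increasing order, so the k largest ones
# have indices d-k, ..., d-1
vals, vecs = linalg.eigh(C, subset_by_index=[d - k, d - 1])
vals = vals[::-1]                      # reorder from largest to smallest
vecs = vecs[:, ::-1]
print(vals)
```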
For a given $k$,
$$ p = \frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i} $$
is called the explained variance ratio.
When this ratio is high enough, the projected data already captures the important features of the original data.
Observe the figure below, whose horizontal axis is $k$ and whose vertical axis is the explained variance ratio.
Decide how large $k$ should be, and explain why.
import numpy as np
import matplotlib.pyplot as plt
X = np.genfromtxt('hidden_text.csv', delimiter=',')
print("shape of X =", X.shape)
X = X - X.mean(axis=0)
u,s,vh = np.linalg.svd(X)
vals = s**2     # proportional to the eigenvalues of C (the 1/(N-1) factor cancels in the ratio)
plt.plot(np.arange(1,101), vals.cumsum() / vals.sum())     # explained variance ratio for k = 1, ..., 100
plt.gca().set_xlim(0,10)
Find a covariance matrix cov
such that the data generated by np.random.multivariate_normal
is roughly an ellipse, and:
import numpy as np
import matplotlib.pyplot as plt
mu = np.array([0,0])
cov = np.array([
    [1,0],
    [0,1]
])
X = np.random.multivariate_normal(mu, cov, (1000,))
plt.axis('equal')
plt.scatter(*X.T)