Principal component analysis

This work by Jephian Lin is licensed under a Creative Commons Attribution 4.0 International License.

$\newcommand{\trans}{^\top} \newcommand{\adj}{^{\rm adj}} \newcommand{\cof}{^{\rm cof}} \newcommand{\inp}[2]{\left\langle#1,#2\right\rangle} \newcommand{\dunion}{\mathbin{\dot\cup}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\bone}{\mathbf{1}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bd}{\mathbf{d}} \newcommand{\be}{\mathbf{e}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bq}{\mathbf{q}} \newcommand{\br}{\mathbf{r}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\tr}{\operatorname{tr}} \newcommand{\nul}{\operatorname{null}} \newcommand{\rank}{\operatorname{rank}} %\newcommand{\ker}{\operatorname{ker}} \newcommand{\range}{\operatorname{range}} \newcommand{\Col}{\operatorname{Col}} \newcommand{\Row}{\operatorname{Row}} \newcommand{\spec}{\operatorname{spec}} \newcommand{\vspan}{\operatorname{span}} \newcommand{\Vol}{\operatorname{Vol}} \newcommand{\sgn}{\operatorname{sgn}} \newcommand{\idmap}{\operatorname{id}} \newcommand{\am}{\operatorname{am}} \newcommand{\gm}{\operatorname{gm}} \newcommand{\mult}{\operatorname{mult}} \newcommand{\iner}{\operatorname{iner}}$

In [ ]:
from lingeo import random_int_list

Main idea

Let $X$ be an $N\times d$ data matrix whose sample vectors (rows) are centered at $\bzero\in\mathbb{R}^d$.
Let $C = \frac{1}{N-1}X\trans X$ be the covariance matrix.

Let $\bv$ be a unit vector in $\mathbb{R}^d$.
One may project all data points onto the direction of $\bv$.
Indeed, the projected data is $\bu = X\bv$.
Since the data is centered, $\bone\trans X = \bzero\trans$, so the mean of $\bu$ is $\frac{1}{N}\inp{\bu}{\bone} = \frac{1}{N}\bone\trans X\bv = 0$,
and the variance of $\bu$ is $\frac{1}{N-1}\inp{\bu}{\bu} = \frac{1}{N-1}\bv\trans X\trans X\bv = \bv\trans C\bv$.
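
These identities can be verified numerically. Below is a minimal sketch on randomly generated data; the sample size $N = 100$, the dimension $d = 3$, and the seed are arbitrary choices.

```python
import numpy as np

np.random.seed(42)
N, d = 100, 3

# generate a random data matrix and center its columns
X = np.random.randn(N, d)
X = X - X.mean(axis=0)
C = X.T.dot(X) / (N - 1)          # covariance matrix

# project the data onto a random unit vector v
v = np.random.randn(d)
v = v / np.linalg.norm(v)
u = X.dot(v)

print(np.isclose(u.mean(), 0))                       # True: the projected data has mean 0
print(np.isclose(u.var(ddof=1), v.dot(C).dot(v)))    # True: Var(u) = v^T C v
```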

It is natural to consider the optimization problem:

maximize $\bv\trans C\bv$,
subject to $\|\bv\| = 1$.

Let $\lambda_1\geq \cdots \geq \lambda_d$ be the eigenvalues of $C$ and $\{\bv_1,\ldots,\bv_d\}$ the corresponding orthonormal eigenbasis.
According to the Rayleigh quotient theorem, the maximum value is $\lambda_1$, attained at $\bv = \bv_1$.

Therefore, $\bv_1$ is called the first principal component of the data,
and one may continue to take $\bv_2,\bv_3,\ldots$ as the next principal components.
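
The claim that $\bv_1$ maximizes the projected variance can also be checked numerically. The sketch below compares $\bv\trans C\bv$ over many random unit vectors with the largest eigenvalue $\lambda_1$; the sizes and the seed are arbitrary choices.

```python
import numpy as np

np.random.seed(1)
N, d = 100, 3
X = np.random.randn(N, d)
X = X - X.mean(axis=0)
C = X.T.dot(X) / (N - 1)

# eigenvalues in descending order with the orthonormal eigenvectors as columns
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]

# v^T C v for 1000 random unit vectors never exceeds the largest eigenvalue
vs = np.random.randn(1000, d)
vs = vs / np.linalg.norm(vs, axis=1, keepdims=True)
quads = np.einsum('ij,jk,ik->i', vs, C, vs)
print(quads.max() <= vals[0] + 1e-12)                          # True
print(np.isclose(vecs[:,0].dot(C).dot(vecs[:,0]), vals[0]))    # True: the bound is attained at v1
```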

Algorithm (principal component analysis)

Input: a dataset represented by an $N\times d$ matrix $X$ and the desired number $k$ of principal components
Output: the principal components represented by the column vectors of a matrix $P$

  1. Let $J$ be the $N\times N$ all-ones matrix.
    Define $X_0 = (I - \frac{1}{N}J)X$.
  2. Calculate the covariance matrix $C= \frac{1}{N-1}X_0\trans X_0$.
  3. Find a diagonal matrix $D$, whose diagonal entries are $\lambda_1 \geq \cdots \geq \lambda_d$, and an orthogonal matrix $Q$ such that $C = QDQ\trans$.
  4. Let $P$ be the $d\times k$ matrix formed by taking the first $k$ columns of $Q$.
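
For reference, the four steps can be implemented directly with NumPy. The following is only a minimal sketch (the function name `pca` and the example data are our own choices), not an optimized routine.

```python
import numpy as np

def pca(X, k):
    """Return the d x k matrix P whose columns are the first k principal components."""
    N, d = X.shape
    # Step 1: center the data; subtracting the column means is the same as multiplying by (I - J/N)
    X0 = X - X.mean(axis=0)
    # Step 2: the covariance matrix
    C = X0.T.dot(X0) / (N - 1)
    # Step 3: spectral decomposition; eigh returns ascending eigenvalues, so reverse the order
    vals, vecs = np.linalg.eigh(C)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Step 4: keep the first k columns of Q
    return vecs[:, :k]

# a small usage example on random data
X = np.random.randn(50, 4)
P = pca(X, 2)
print(P.shape)    # (4, 2)
```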

A few points to emphasize:

  • The $Q$ matrix is the same as the $V$ in the algorithm in 606.
  • The eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d$ are the variances when the data are projected onto the columns of $Q$.

Side stories

  • explained variance ratio
  • scipy.linalg.eigh

Experiments

Exercise 1

Run the code below.
In [ ]:
### code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(0)
print_ans = False

mu = np.random.randint(-3,4, (2,))
cov = np.array([[1,1.9],
                [1.9,4]])
X = np.random.multivariate_normal(mu, cov, (20,))
mu = X.mean(axis=0)
X0 = X - mu
# u,s,vh = np.linalg.svd(X0)
C = (1 / (19)) * X0.T.dot(X0)   # covariance matrix of the centered data (N - 1 = 19)
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1]
P = vecs[:,:1]   # the first principal component, as a 2x1 matrix

plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*mu, c="red")
plt.arrow(*mu, *(vecs[:,0]), head_width=0.3, color="red")

# project the centered data onto the first principal component and shift back to the mean
proj = mu + X0.dot(P).dot(P.T)
plt.scatter(*proj.T)

pretty_print(LatexExpr("Q ="))
print(vecs)

if print_ans:
    print("red point: center")
    print("red arrow: first principal component")
    print("orange points: projection of blue points onto the red arrow")
    print("red arrow = first column of Q")
Exercise 1(a)

Let the blue points be the given dataset. Try different values of `seed` and explain the meaning of the red point, the red arrow, and the orange points.
Exercise 1(b)

Let $X_0$ be the matrix whose rows record the blue data points centered at their mean, and let $C = \frac{1}{N-1}X_0\trans X_0$ be the covariance matrix. Suppose $C = QDQ\trans$ is the spectral decomposition of $C$, where the diagonal entries of $D$ are arranged from the largest to the smallest. Explain the relation between the red arrow and $Q$.

Exercises

Exercise 2

Let $X$ be an $N\times d$ data matrix whose rows are centered at their mean, and let $C = \frac{1}{N-1}X\trans X$ be the covariance matrix. Fix a unit vector $\bv$ in $\mathbb{R}^d$. Show that the entries of $\bu = X\bv$ are the (signed) lengths of the projections of the data points onto $\bv$, and show that the variance of $\bu$ is $\bv\trans C\bv$.
Exercise 3

Run the code below.
In [ ]:
import numpy as np

X = np.random.randn(3,4)
u,s,vh = np.linalg.svd(X) 
vals,vecs = np.linalg.eigh(X.T.dot(X))
vals = vals[::-1]
vecs = vecs[:,::-1]
print("V =")
print(vh.T)
print("Q =")
print(vecs)
print("singular values:", s)
print("eigenvalues:", vals)
Exercise 3(a)

Explain why the two output matrices are the same (except that some of the columns are negated).
Exercise 3(b)

Explain the relation between the singular values and the eigenvalues in the output.
Exercise 3(c)

Explain the purpose of

```python
vals = vals[::-1]
vecs = vecs[:,::-1]
```

in the code above.
Exercise 4

Sometimes we only need the few largest eigenvalues, so computing all of them is an unnecessary waste of time. Search online and find out how to compute only the largest few eigenvalues through `linalg.eigh` in SciPy.
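
One possible approach is sketched below, assuming a reasonably recent SciPy (roughly 1.5 or later), where `scipy.linalg.eigh` accepts the `subset_by_index` keyword; check the SciPy documentation of your version for the exact interface.

```python
import numpy as np
from scipy import linalg

np.random.seed(0)
d, k = 10, 3
A = np.random.randn(d, d)
C = A.T.dot(A)    # a symmetric (positive semidefinite) matrix

# ask eigh for the k largest eigenvalues only: indices d-k, ..., d-1 in ascending order
vals, vecs = linalg.eigh(C, subset_by_index=[d - k, d - 1])
vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder from the largest to the smallest
print(vals)
```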
Exercise 5

For a given $k$,

$$ p = \frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i} $$

is called the **explained variance ratio**. When this ratio is high enough, the projected data keeps most of the important features of the original data.

Observe the figure below, where the $x$-axis is $k$ and the $y$-axis is the explained variance ratio. Determine which $k$ should be chosen and give your reasons.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

X = np.genfromtxt('hidden_text.csv', delimiter=',')
print("shape of X =", X.shape)
X = X - X.mean(axis=0)
u,s,vh = np.linalg.svd(X)
vals = s**2    # squared singular values: the eigenvalues of X.T X, proportional to the lambda_i

# cumulative explained variance ratio for k = 1, ..., 100
plt.plot(np.arange(1,101), vals.cumsum() / vals.sum())
plt.gca().set_xlim(0,10)
Exercise 6

Find a covariance matrix `cov` such that the data generated by `np.random.multivariate_normal` roughly fills an ellipse such that:

  • the major axis points in the $45^\circ$ direction and the minor axis in the $135^\circ$ direction;
  • the major axis is about twice as long as the minor axis.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

mu = np.array([0,0])
cov = np.array([
    [1,0],
    [0,1]
])
X = np.random.multivariate_normal(mu, cov, (1000,))
plt.axis('equal')
plt.scatter(*X.T)