體驗主成份分析¶

Understanding principal component analysis

Creative Commons License
This work by Jephian Lin is licensed under a Creative Commons Attribution 4.0 International License.

$\newcommand{\trans}{^\top} \newcommand{\adj}{^{\rm adj}} \newcommand{\cof}{^{\rm cof}} \newcommand{\inp}[2]{\left\langle#1,#2\right\rangle} \newcommand{\dunion}{\mathbin{\dot\cup}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\bone}{\mathbf{1}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bd}{\mathbf{d}} \newcommand{\be}{\mathbf{e}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bq}{\mathbf{q}} \newcommand{\br}{\mathbf{r}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\tr}{\operatorname{tr}} \newcommand{\nul}{\operatorname{null}} \newcommand{\rank}{\operatorname{rank}} %\newcommand{\ker}{\operatorname{ker}} \newcommand{\range}{\operatorname{range}} \newcommand{\Col}{\operatorname{Col}} \newcommand{\Row}{\operatorname{Row}} \newcommand{\spec}{\operatorname{spec}} \newcommand{\vspan}{\operatorname{span}} \newcommand{\Vol}{\operatorname{Vol}} \newcommand{\sgn}{\operatorname{sgn}} \newcommand{\idmap}{\operatorname{id}} \newcommand{\am}{\operatorname{am}} \newcommand{\gm}{\operatorname{gm}} \newcommand{\mult}{\operatorname{mult}} \newcommand{\iner}{\operatorname{iner}}$

In [ ]:
from lingeo import random_int_list

Main idea¶

In statistics, data is often stored as a matrix (a two-dimensional table) such that
each row represents a sample, while
each column represents a feature.
For example, the rows can be the students and the columns can be their scores in different subjects.

Let $X$ be an $N\times d$ matrix for some data and let $\bx_1,\ldots,\bx_N\in\mathbb{R}^d$ be its rows.
Then

$$ \mu = \frac{1}{N} \sum_{i = 1}^N \bx_i $$

is the center of these data points.

One may center the data by creating a new matrix $X_0$ whose rows are $\bx_1 - \mu, \ldots, \bx_N - \mu$.
Let $J$ be the $N\times N$ all-ones matrix.
Then one way to describe this is $X_0 = (I - \frac{1}{N}J)X$.
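As a sanity check, one may verify in numpy that multiplying by $I - \frac{1}{N}J$ has the same effect as subtracting the mean $\mu$ from every row. Below is a minimal sketch with randomly generated data.

In [ ]:
### code
import numpy as np

N, d = 5, 2
X = np.random.randn(N, d)

### center by subtracting the mean row mu
mu = X.mean(axis=0)
X0 = X - mu

### center by the matrix identity X0 = (I - J/N) X
J = np.ones((N, N))
X0_alt = (np.eye(N) - J / N).dot(X)

print(np.allclose(X0, X0_alt))  # True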

Principal component analysis (or PCA) is
a method for finding the most important directions of the data.
These directions (vectors) are called the principal components of the data.

Algorithm (principal component analysis)¶

Input: a dataset represented by an $N\times d$ matrix $X$ and the desired number $k$ of principal components
Output: the principal components represented by the column vectors of a matrix $P$

  1. Let $J$ be the $N\times N$ all-ones matrix.
    Define $X_0 = (I - \frac{1}{N}J)X$.
  2. Let $U\Sigma V\trans$ be the singular value decomposition of $X_0$.
  3. Let $P$ be the $d\times k$ matrix obtained from the first $k$ columns of $V$.

Once the principal components are found,
one may project all data points onto the subspace spanned by the principal components.
This can be done through the $N\times k$ matrix $Y = X_0P$,
whose rows are the new data points containing the essential information of the original data.
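
In numpy, the whole algorithm takes only a few lines. Below is a minimal sketch following the three steps above, with randomly generated data standing in for a real dataset.

In [ ]:
### code
import numpy as np

N, d, k = 20, 3, 2
X = np.random.randn(N, d)  # a stand-in for real data

### step 1: center the data
J = np.ones((N, N))
X0 = (np.eye(N) - J / N).dot(X)  # same as X - X.mean(axis=0)

### step 2: singular value decomposition X0 = U Sigma V^T
u, s, vh = np.linalg.svd(X0)

### step 3: the first k columns of V are the principal components
P = vh.T[:, :k]  # d x k

### project the data onto the principal components
Y = X0.dot(P)  # N x k
print(P.shape, Y.shape)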

Side stories¶

  • hidden words

Experiments¶

Exercise 1¶

執行以下程式碼。

Run the code below.
In [ ]:
### code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(0)
print_ans = False

mu = np.random.randint(-3,4, (2,))
cov = np.array([[1,1.9],
                [1.9,4]])
X = np.random.multivariate_normal(mu, cov, (20,))
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:1]

plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*mu, c="red")
plt.arrow(*mu, *(P[:,0]), head_width=0.3, color="red")

xs = X0.dot(P)
ys = np.zeros_like(xs)
plt.scatter(xs, ys)

pretty_print(LatexExpr(r"V^\top ="))
print(vh)

if print_ans:
    print("red point: center")
    print("red arrow: first principal component")
    print("orange points: projection of blue points onto the red arrow")
    print("red arrow = first row of VT = first column of V")
Exercise 1(a)¶

若藍點為給定的資料。
改變不同的 seed,說明紅點、紅箭頭、橘點的意思。

Let the blue points be the data. Try different values of `seed` and explain the meanings of the red point, the red arrow, and the orange points.
Exercise 1(b)¶

令 $X_0$ 的列向量為將藍點資料置中過後的資料點。
若 $X_0 = U\Sigma V\trans$ 為其奇異值分解。
說明紅箭頭與 $V\trans$ 的關係。

Let $X_0$ be the matrix whose rows are the blue data points centered at their mean. If $X_0 = U\Sigma V\trans$ is its singular value decomposition, describe the relation between the red arrow and $V\trans$.

Exercises¶

Exercise 2¶

令 $X$ 為一 $N\times d$ 矩陣,而其列向量為一筆資料。
令 $J$ 為 $N\times N$ 的全一矩陣。

Let $X$ be an $N\times d$ matrix whose rows are the data points. Let $J$ be the $N\times N$ all-ones matrix.
Exercise 2(a)¶

說明 $\frac{1}{N}JX$ 每列的意義。

Describe the meaning of the rows of $\frac{1}{N}JX$.
Exercise 2(b)¶

說明 $(I - \frac{1}{N}J)X$ 每列的意義。

Describe the meaning of the rows of $(I - \frac{1}{N}J)X$.
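
To experiment with the two products numerically before answering, one may run a small random example such as the sketch below.

In [ ]:
### code
import numpy as np

N, d = 5, 2
X = np.random.randint(-3, 4, (N, d))
J = np.ones((N, N))

print(J.dot(X) / N)  # rows of (1/N) J X
print((np.eye(N) - J / N).dot(X))  # rows of (I - (1/N) J) X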
Exercise 3¶

執行以下程式碼。

Run the code below.
In [ ]:
### code

import numpy as np
np.random.seed(0)
X = np.random.randint(-3, 4, (5,2))
mu = X.mean(axis=0)
X0 = X - mu
print("X =\n", X)
print("mu =\n", mu)
print("X0 =\n", X0)
Exercise 3(a)¶

解釋 mu = X.mean(axis=0) 的意義。

Describe the meaning of `mu = X.mean(axis=0)`.
Exercise 3(b)¶

解釋 X0 = X - mu 的意義。

Describe the meaning of `X0 = X - mu`.
Exercise 4¶

執行以下程式碼。

Run the code below.
In [ ]:
### code

import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

np.random.seed(0)

mu = np.random.randint(-3,4, (3,))
cov = np.array([[3, -0.9, -1.9],
                [-0.9, 11, -9.9], 
                [-1.9, -9.9, 12]])
X = np.random.multivariate_normal(mu, cov, (100,))
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]

ax = plt.axes(projection='3d')
ax.set_xlim(-5,5)
ax.set_ylim(-5,5)
ax.set_zlim(-5,5)

ax.scatter(*X.T)
ax.scatter(*mu, c="red")
ax.quiver(*(mu[:,np.newaxis] + np.zeros((2,))), *P, color="red")
In [ ]:
%matplotlib inline
Y = X0.dot(P)
plt.axis('equal')
plt.scatter(*Y.T)
Exercise 4(a)¶

說明兩段程式碼產出的圖的關係。

Describe the relation between the two graphs in the output.
Exercise 4(b)¶

使用 X.shape 來觀察一個陣列的形狀、
並逐句解釋下方程式碼所做的事情。

mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]
You may use `X.shape` to check the shape of an array. Then explain the meaning of each line in the code below.

```python
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]
```
Exercise 5¶

挑戰
以下程式碼讀進一筆資料。
這筆資料實際上是貼在高維度中的一個二維平面,
且它在這個平面上排出一個一個英文字。

請找出這個英文字是什麼。

**Challenge** The following code loads some data from a file. The data in fact lies on a $2$-dimensional plane in a higher dimensional space, and on this plane the data points spell out an English word. Find out what the word is.
In [ ]:
import numpy as np

X = np.genfromtxt('hidden_text.csv', delimiter=',')
print("shape of X =", X.shape)
Exercise 6¶

令 $X_0$ 及 $P$ 為主成份分析演算法中的矩陣。
說明 $X_0P$ 與 $X_0PP\trans$ 的列向量所代表的意義。

Let $X_0$ and $P$ be the matrices in the principal component analysis algorithm. Explain the meanings of the rows of $X_0P$ and of $X_0PP\trans$.
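
To get a feeling for the two matrices before answering, one may compare their shapes on a small random example, as in the sketch below.

In [ ]:
### code
import numpy as np

N, d, k = 20, 3, 2
X0 = np.random.randn(N, d)
X0 = X0 - X0.mean(axis=0)  # centered data
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:k]

print(X0.dot(P).shape)  # shape of X0 P
print(X0.dot(P).dot(P.T).shape)  # shape of X0 P P^T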