Spectral embedding and clustering
This work by Jephian Lin is licensed under a Creative Commons Attribution 4.0 International License.
$\newcommand{\trans}{^\top} \newcommand{\adj}{^{\rm adj}} \newcommand{\cof}{^{\rm cof}} \newcommand{\inp}[2]{\left\langle#1,#2\right\rangle} \newcommand{\dunion}{\mathbin{\dot\cup}} \newcommand{\bzero}{\mathbf{0}} \newcommand{\bone}{\mathbf{1}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bd}{\mathbf{d}} \newcommand{\be}{\mathbf{e}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bp}{\mathbf{p}} \newcommand{\bq}{\mathbf{q}} \newcommand{\br}{\mathbf{r}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\tr}{\operatorname{tr}} \newcommand{\nul}{\operatorname{null}} \newcommand{\rank}{\operatorname{rank}} %\newcommand{\ker}{\operatorname{ker}} \newcommand{\range}{\operatorname{range}} \newcommand{\Col}{\operatorname{Col}} \newcommand{\Row}{\operatorname{Row}} \newcommand{\spec}{\operatorname{spec}} \newcommand{\vspan}{\operatorname{span}} \newcommand{\Vol}{\operatorname{Vol}} \newcommand{\sgn}{\operatorname{sgn}} \newcommand{\idmap}{\operatorname{id}} \newcommand{\am}{\operatorname{am}} \newcommand{\gm}{\operatorname{gm}} \newcommand{\mult}{\operatorname{mult}} \newcommand{\iner}{\operatorname{iner}}$
Sometimes a graph can have some "clusters" such that the vertices in each cluster are strongly related, while vertices from different clusters are weakly related.
The Laplacian matrix is good at find these clusters.
Let $G$ be a graph on $n$ vertices and $L$ its Laplacian matrix.
Let $\lambda_1 \leq \cdots \leq \lambda_n$ be the eigenvalues.
We knew that $\nul(L)$ is the lagest $k$ such that $\lambda_k = 0$, and it is also the number of connected components of $G$.
Similarly, we may pick a threshold $\epsilon > 0$ such that eigenvalues with $\lambda < \epsilon$ is considered as zeroish eigenvalues.
Then the largest $k$ such that $\lambda_k < \epsilon$ can be interpreted as the number of clusters.
Let $\{\bv_1,\ldots, \bv_n\}$ be the corresponding orthonormal eigenbasis.
We knew that $\bv_1 = \bone$.
The unit eigenvector $\bv_2$ corresponding to $\lambda_2$ is called the Fiedler vector.
One may partition a graph by its Fiedler vector as follows:
Let $G$ be a graph and $W$ a subset of vertices.
Then the induced subgraph of $G$ on $W$ is denoted by $G[W]$ and is the graph on the vertex set $W$ and with all edges in $G$ whose both endpoints are in $W$.
Let $G$ be a connected graph and $L$ its Laplacian matrix.
Assume the multiplicity of the second smallest eigenvalue $\lambda_2$ is $1$.
Let $\bv=(v_0,\ldots,v_{n-1})$ be the eigenvector of $L$ with respect to $\lambda_2$.
Let $N_+ = \{ i : (\bv_2)_i > 0\}$.
Let $N_- = \{ i : (\bv_2)_i < 0\}$.
Let $N_0 = \{ i : (\bv_2)_i = 0\}$.
Then $G[N_+]$ and $G[N_0\cup N_-]$ are connected subgraphs of $G$.
From this point of view, the Fiedler vector assigns a number to each vertex, so it can be viewed as a feature vector.
Indeed, one may use $\bv_3, \bv_4, \ldots$ as new feature vectors.
This lead to the idea called the spectral embedding of a graph.
Input: a graph $G$ on $n$ vertices, and the desired dimension $k$
Output: an $n\times k$ data matrix $Y$
The importance of spectral embedding is to provide a way to transform a graph data into a data matrix.
Thus, one may use clustering algorithms, such as $k$-means, to cluster the vertices of the graph.
The combination of spectral embedding and $k$-means is called the spectral clustering .
Finally, all the theories in 614 and 615 works for weighted graphs, where each edge $\{i,j\}$ is assigned with a positive weight $w_{ij}$.
The weighted Laplacian matrix of a weighted graph is having $-w_{ij}$ on the edges and
on the $i,i$-entry.
執行以下程式碼。
Run the code below.### code
set_random_seed(0)
print_ans = False
k = choice([2,3,4])
sizes = [choice([4,5,6]) for i in range(k)]
n = sum(sizes)
psum = 0
cuts = [0]
for s in sizes:
psum += s
cuts.append(psum)
g = graphs.PathGraph(n)
g.set_pos(None)
for a,b in zip(cuts[:-1], cuts[1:]):
for i in range(a, b - 1):
for j in range(i + 1, b):
g.add_edge(i,j)
g.show(figsize=(5,5), title="$G$")
L = g.laplacian_matrix()
eigs = L.eigenvalues()
eigs.sort()
print("eigenvalues:", eigs)
if print_ans:
print("I would guess G has %s clusters."%k)
觀察畫出來的圖,你認為 $G$ 有幾個群?
Observe the given graphs. Based on your intuition, how many clusters are there in $G$?觀察 $L(G)$ 的特徵值,你認為 $G$ 有幾群?
Observe the eigenvalues of $L(G)$. Based on your intuition, how many clusters are there in $G$?逐行解釋以下程式碼。
Explain the code below line by line.### code
import numpy as np
from scipy.linalg import eigh
g = graphs.PathGraph(11)
L = np.array(g.laplacian_matrix())
eigs,vecs = eigh(L, eigvals=(1,1))
fvec = vecs.T[0]
eps = 0.00001
Np = [i for i in g.vertices() if fvec[i] > eps]
Nm = [i for i in g.vertices() if fvec[i] < -eps]
Nz = [i for i in g.vertices() if abs(fvec[i]) < eps]
g.show(vertex_colors={"red": Np, "green": Nm, "white": Nz})
逐行解釋以下程式碼。
Explain the code below line by line.### code
import numpy as np
from scipy.linalg import eigh
import matplotlib.pyplot as plt
%matplotlib inline
g = graphs.CycleGraph(10)
n = g.order()
L = np.array(g.laplacian_matrix())
vals,vecs = eigh(L, subset_by_index=(1,2))
x,y = vecs.T
plt.axis("equal")
### plot points
plt.scatter(x, y, s=50, zorder=3)
### add vertex labels
for i in range(n):
plt.annotate(i, (x[i], y[i]), zorder=4)
### add lines
for i,j in g.edges(labels=False):
plt.plot([x[i],x[j]], [y[i],y[j]], 'c')
逐行解釋以下程式碼。
Explain the code below line by line.### code
import numpy as np
from scipy.linalg import eigh
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
%matplotlib notebook
g = graphs.CubeGraph(3)
g.relabel()
n = g.order()
L = np.array(g.laplacian_matrix())
vals,vecs = eigh(L, subset_by_index=(1,3))
x,y,z = vecs.T
ax = plt.axes(projection='3d')
ax.set_xlim(-1,1)
ax.set_ylim(-1,1)
ax.set_zlim(-1,1)
### plot points
ax.scatter(x, y, z, s=50, zorder=3)
### add vertex labels
for i in range(n):
ax.text(x[i], y[i], z[i], i, zorder=4)
### add lines
for i,j in g.edges(labels=False):
ax.plot([x[i],x[j]], [y[i],y[j]], [z[i],z[j]], 'c')
令 $G$ 為一 $n$ 個點的圖、$L = L(G)$。
令 $Y$ 為一 $n\times k$ 矩陣,其各列為 $\bx_1, \ldots, \bx_n\in\mathbb{R}^k$。
說明
$$ \tr(Y\trans LY) = \sum_{\{i,j\}\in E(G)} \|\bx_i - \bx_j\|^2. $$ Show that $$ \tr(Y\trans LY) = \sum_{\{i,j\}\in E(G)} \|\bx_i - \bx_j\|^2. $$說明 $\bone\trans Y = \bzero$ 表示 $\bx_1, \ldots, \bx_n$ 的中心點在原點。
Explain why $\bone\trans Y = \bzero$ means $\bx_1, \ldots, \bx_n$ are centered at the origin.說明 $Y\trans Y = I$ 表示 $Y$ 的各特徵向量(行向量)
彼此互相垂直、而且各自的變異數均為 $1$。
解釋拉普拉斯嵌入法做出來的 $Y$,就是以下最佳化問題的解:
minimize $\tr(Y\trans LY) = \sum_{\{i,j\}\in E(G)} \|\bx_i - \bx_j\|^2$,
subject to $\bone\trans Y = \bzero$, $Y\trans Y = I$
先執行(讀入)筆記最下方的所有函數,並執行以下程式碼。
查找資源解釋什麼是 $k$-means 分群演算法。
### code
%matplotlib inline
X = make_blobs()
y_new, centers = k_means(X, 3)
plt.scatter(*X.T, c=y_new)
各種分群演算法都可以拿來將圖片的像素分群,整體的效果等同在將圖像分割。
先執行(讀入)筆記最下方的所有函數,並執行以下程式碼。
逐行解釋以下程式碼。
### original image
import numpy as np
from PIL import Image
r = 10
img = Image.open('incrediville-side.jpg')
x,y = img.size
img = img.resize((x // r, y // r))
img
### image segmentation by the k-means clustering algorithm
import matplotlib.pyplot as plt
k = 3
arr = np.array(img, dtype=float) / 256
m,n,r = arr.shape
N = m * n
arr = arr.reshape(N, r)
y_new, centers = k_means(arr, k)
fig,axs = plt.subplots(1, 3, figsize=np.array([3*n,m])/100)
for i in range(k):
axs[i].axis('off')
axs[i].imshow((y_new == i).reshape(m,n), cmap='binary')
import numpy as np
import matplotlib.pyplot as plt
def make_blobs(N=150, k=3, d=2, seed=None):
"""
Input:
N: an integer, number of samples
k: an integer, number of blobs
d: an integer, dimension of the space
Output:
a dataset X of shape (N, d)
"""
np.random.seed(seed)
X = np.random.randn(N,d)
blob_size = N // k
centers = np.random.randn(k, d) * 3
for i in range(k):
left = blob_size * i
right = blob_size * (i+1) if i != k-1 else N
X[left:right] += centers[i]
return X
def dist_mtx(X, Y=None):
"""Return the distance matrix between rows of X and rows of Y
Input:
X: an array of shape (N,d)
Y: an array of shape (M,d)
if None, Y = X
Output:
the matrix [d_ij] where d_ij is the distance between
the i-th row of X and the j-th row of Y
"""
if isinstance(Y, np.ndarray):
pass
elif Y == None:
Y = X.copy()
else:
raise TypeError("Y should be a NumPy array or None")
X_col = X[:, np.newaxis, :]
Y_row = Y[np.newaxis, :, :]
diff = X_col - Y_row
dist = np.sqrt(np.sum(diff**2, axis=-1))
return dist
def k_means(X, k, init="random"):
"""k-means clustering algorithm
Input:
X: an array of shape (N,d)
rows for samples and columns for features
k: number of clusters
init: "random" or an array of shape (k,d)
if "random", k points are chosen randomly from X as the initial cluster centers
if an array, the array is used as the initial cluster centers
Output:
(y_new, centers)
y_new: an array of shape (N,)
that records the labels in (0, ..., k-1) of each sample
centers: an array of shape (k,d)
that records the cluster centers
Example:
mu = np.array([3,3])
cov = np.eye(2)
X = np.vstack([np.random.multivariate_normal(mu, cov, 100),
np.random.multivariate_normal(-mu, cov, 100)])
y_new,centers = k_means(X, 2)
"""
N,d = X.shape
### initialize y and center
if isinstance(init, np.ndarray):
centers = init.copy()
elif init == "random":
inds = np.random.choice(np.arange(N), k, replace=False)
centers = X[inds, :]
else:
raise TypeError("init can only be a NumPy array or 'random'")
dist = dist_mtx(X, centers)
y_new = dist.argmin(axis=1)
while True:
### compute the new centers
for i in range(k):
mask = (y_new == i)
centers[i] = X[mask].mean(axis=0)
### generate the new y_new
dist = dist_mtx(X, centers)
y_last = y_new.copy()
y_new = dist.argmin(axis=1)
if np.all(y_last == y_new):
break
return y_new, centers