PCA Notes

1. Dimensionality Reduction

  • Maps high-dimensional data into a lower-dimensional space
  • The key is to minimize the loss of information
  • Minimizing information loss usually means minimizing the MSE (mean squared error) of the reconstruction (see the sketch after this list)
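
A minimal sketch of that idea (my own illustration, not from the original notes; the toy data and variable names are assumptions): project 2-D data onto one principal component, map it back, and measure the reconstruction MSE.

```python
# Illustrative sketch (not from the original notes): information loss measured
# as the reconstruction MSE after reducing 2-D data to 1-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)

pca = PCA(n_components=1)
X_toy_low = pca.fit_transform(X_toy)          # 2-D -> 1-D projection
X_toy_hat = pca.inverse_transform(X_toy_low)  # map back to 2-D

mse = np.mean((X_toy - X_toy_hat) ** 2)       # information loss as MSE
print(mse)
```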

2. Principal Component Analysis (PCA)

  • Related concepts: data matrix X, covariance matrix $X^TX$, SVD, sample covariance, correlation coefficient
  • Uses: dimensionality reduction, compression
    • feature extraction, noise removal
  • Limitation: only effective when the structure is linear (see the sketch after this list)
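
A rough illustration of that limitation (my own sketch, not from the original notes): for points lying on a circle, a single linear principal component keeps only about half of the variance; kernel methods such as sklearn's KernelPCA are a common workaround for nonlinear structure.

```python
# Illustrative sketch (assumed toy data, not from the original notes):
# PCA on points lying on the unit circle. The structure is nonlinear,
# so one linear component explains only ~50% of the variance.
import numpy as np
from sklearn.decomposition import PCA

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X_circle = np.c_[np.cos(theta), np.sin(theta)]   # points on the unit circle

pca = PCA(n_components=1).fit(X_circle)
print(pca.explained_variance_ratio_)             # roughly [0.5]
```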

3. PCA vs SVD

  • Variance $\mathrm{var}(X) = E[(X - E[X])^2]$
    • Sample variance $s^2 = \dfrac{\sum(x - \bar{x})^2}{n-1}$
  • Covariance $\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
  • Covariance matrix $C = \dfrac{X^TX}{n-1}$ (for mean-centered X)
    • Eigendecomposition: $Av = \lambda v$
    • PCA: $C = V \Lambda V^{-1}$
    • SVD: with $X = U \Sigma V^T$, $C = \dfrac{V \Sigma U^T U \Sigma V^T}{n-1} = V \dfrac{\Sigma^2}{n-1} V^T$ (a small numerical check follows this list)
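
A small numerical check of the relationship above (my sketch, assuming the data matrix is mean-centered; the random data is made up): the eigenvalues of C should equal $\Sigma^2/(n-1)$ from the SVD of X, and the eigenvectors should match the right singular vectors up to sign.

```python
# Sketch (assumption: toy mean-centered data): eigendecomposition of the
# covariance matrix vs SVD of the data matrix should give the same spectrum.
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
X_demo = X_demo - X_demo.mean(axis=0)          # mean-center

n = X_demo.shape[0]
C_demo = X_demo.T @ X_demo / (n - 1)           # covariance matrix

eig_vals, eig_vecs = np.linalg.eig(C_demo)     # PCA: C = V Λ V^{-1}
U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)

print(np.sort(eig_vals))                       # eigenvalues of C
print(np.sort(S**2 / (n - 1)))                 # singular values squared / (n-1)
# The columns of eig_vecs and the rows of Vt agree up to sign.
```
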
# Covariance: sample 100 points from a correlated bivariate normal
import numpy as np
import pandas as pd

mu = [0, 0]; r = 0.4
Sigma = np.array([[1, r],
                  [r, 1]])
data = np.random.multivariate_normal(mu, Sigma, 100)
pd.DataFrame(data, columns=['x', 'y'])
            x         y
0   -0.616981  0.072565
..        ...       ...
70  -0.535418  0.400531
71   0.780422 -0.023953

100 rows × 2 columns

# Scatter plot of the sampled data
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0], data[:,1])
plt.show()

[figure: scatter plot of the sampled bivariate normal data]

# Covariance computed by hand (dividing by n gives the biased estimate;
# use n-1 for the sample covariance)
x = data[:,0]; y = data[:,1]
x_bar = x.mean(); y_bar = y.mean()
((x - x_bar) * (y - y_bar)).sum() / len(x)
0.3966239389360993
# Covariance matrix (this works here because mu = [0, 0], so the data is
# already roughly centered; strictly, mean-center first and divide by n-1)
np.dot(data.T, data)/len(data)
array([[0.8724964 , 0.39755701], [0.39755701, 1.10857547]])
# PCA example: first two features of the first 10 iris samples
from sklearn.datasets import load_iris
iris = load_iris()
N = 10
X = iris.data[:N, :2]
print(X, '\n', iris.feature_names[:2])
[[5.1 3.5] [4.9 3. ] [4.7 3.2] [4.6 3.1] [5. 3.6] [5.4 3.9] [4.6 3.4] [5. 3.4] [4.4 2.9] [4.9 3.1]] ['sepal length (cm)', 'sepal width (cm)']
# Scatter plot of the raw data, labeling each point with its index
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline
n = list(range(len(X)))
f, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:,0], X[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (X[:,0][i]+0.01, X[:,1][i]+0.01))

[figure: scatter plot of the first two iris features with index labels]

# Mean-center the data and plot it with a fitted line showing the main direction
import matplotlib.pyplot as plt
%matplotlib inline
X_mean_centering = X - X.mean(axis=0)
n = list(range(len(X)))
f, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X_mean_centering[:,0], X_mean_centering[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (X_mean_centering[:,0][i]+0.01, X_mean_centering[:,1][i]+0.01))

coef = np.polyfit(X_mean_centering[:,0], X_mean_centering[:,1], 1)
poly1d_fn = np.poly1d(coef)
ax.plot(X_mean_centering[:,0], poly1d_fn(X_mean_centering[:,0]))
plt.show()

[figure: mean-centered data with the fitted line]

# ver1: covariance matrix computed by hand
X_mean_centering = X - X.mean(axis=0)
np.dot(X_mean_centering.T, X_mean_centering)/(X.shape[0]-1)
array([[0.08488889, 0.07044444], [0.07044444, 0.09433333]])
# ver2: covariance matrix via np.cov (expects variables in rows, hence X.T)
C = np.cov(X.T)
C
array([[0.08488889, 0.07044444], [0.07044444, 0.09433333]])
# Eigendecomposition of the covariance matrix
eigen_vals, eigen_vecs = np.linalg.eig(C)
eigen_vecs.shape, X.shape
((2, 2), (10, 2))
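
As a small check (not in the original notebook), the mean-centered data can be projected onto the eigenvector with the largest eigenvalue; up to a sign flip this should reproduce the 1-D coordinates that sklearn's PCA computes below.

```python
# Check (not in the original notebook; uses eigen_vals, eigen_vecs and
# X_mean_centering from the cells above): project onto the leading eigenvector.
top = eigen_vals.argmax()              # index of the largest eigenvalue
w = eigen_vecs[:, top]                 # corresponding unit eigenvector
x_low_manual = X_mean_centering @ w    # 1-D coordinates along that axis
print(x_low_manual)                    # matches sklearn's x_low up to sign
```
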
# sklearn PCA: reduce to one component, then reconstruct
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
x_low = pca.fit_transform(X_mean_centering)   # PCA also centers internally, so X would work too
x_inverse = pca.inverse_transform(x_low)      # map the 1-D coordinates back to 2-D
x_low
array([[ 0.30270263], [-0.1990931 ], [-0.18962889], [-0.33097106], [ 0.30743473], [ 0.79976625], [-0.11185966], [ 0.16136046], [-0.61365539], [-0.12605597]])
# Dimensionality reduction: 2 -> 1, plotted on a single axis
f, ax = plt.subplots(figsize=(10, 5))
plt.scatter(x_low[:,0], [1]*len(x_low[:,0]))
for i, txt in enumerate(n):
    ax.annotate(txt, (x_low[:,0][i]+0.01, 1+0.0001))
ax.axhline(1)
plt.show()

[figure: 1-D projection of the samples onto the first principal component]

# Noise removal: the reconstruction keeps only the first principal direction
f, ax = plt.subplots(figsize=(10, 5))
plt.scatter(x_inverse[:,0], x_inverse[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (x_inverse[:,0][i]+0.01, x_inverse[:,1][i]+0.0001))
plt.show()

[figure: data reconstructed back into 2-D from the 1-D projection]
