PCA Notes

1. Dimensionality Reduction

  • Maps high-dimensional data into a lower-dimensional space
  • The key is to minimize the loss of information
  • Minimizing information loss usually means minimizing the MSE (mean squared error) of the reconstruction (see the sketch after this list)
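
A minimal sketch of that idea (my own illustration, not from the original notes; the toy data and variable names are assumptions): project 2-D data onto one principal component, map it back, and measure the reconstruction MSE.

```python
# Illustrative sketch (not from the original notes): information loss measured
# as the reconstruction MSE after reducing 2-D data to 1-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)

pca = PCA(n_components=1)
X_toy_low = pca.fit_transform(X_toy)          # 2-D -> 1-D projection
X_toy_hat = pca.inverse_transform(X_toy_low)  # map back to 2-D

mse = np.mean((X_toy - X_toy_hat) ** 2)       # information loss as MSE
print(mse)
```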

2. Principal Component Analysis (PCA)

  • Related concepts: data matrix X, covariance matrix $X^TX$, SVD, sample covariance, correlation coefficient
  • Uses: dimensionality reduction, compression
    • feature extraction, noise removal
  • Limitation: only effective when the structure is linear (see the sketch after this list)
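
A rough illustration of that limitation (my own sketch, not from the original notes): for points lying on a circle, a single linear principal component keeps only about half of the variance; kernel methods such as sklearn's KernelPCA are a common workaround for nonlinear structure.

```python
# Illustrative sketch (assumed toy data, not from the original notes):
# PCA on points lying on the unit circle. The structure is nonlinear,
# so one linear component explains only ~50% of the variance.
import numpy as np
from sklearn.decomposition import PCA

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X_circle = np.c_[np.cos(theta), np.sin(theta)]   # points on the unit circle

pca = PCA(n_components=1).fit(X_circle)
print(pca.explained_variance_ratio_)             # roughly [0.5]
```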

3. PCA vs SVD

  • Variance $\mathrm{var}(X) = E[(X - E[X])^2]$
    • Sample variance $s^2 = \dfrac{\sum(x - \bar{x})^2}{n-1}$
  • Covariance $\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
  • Covariance matrix $C = \dfrac{X^TX}{n-1}$ (for mean-centered X)
    • Eigendecomposition: $Av = \lambda v$
    • PCA: $C = V \Lambda V^{-1}$
    • SVD: with $X = U \Sigma V^T$, $C = \dfrac{V \Sigma U^T U \Sigma V^T}{n-1} = V \dfrac{\Sigma^2}{n-1} V^T$ (a small numerical check follows this list)
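
A small numerical check of the relationship above (my sketch, assuming the data matrix is mean-centered; the random data is made up): the eigenvalues of C should equal $\Sigma^2/(n-1)$ from the SVD of X, and the eigenvectors should match the right singular vectors up to sign.

```python
# Sketch (assumption: toy mean-centered data): eigendecomposition of the
# covariance matrix vs SVD of the data matrix should give the same spectrum.
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
X_demo = X_demo - X_demo.mean(axis=0)          # mean-center

n = X_demo.shape[0]
C_demo = X_demo.T @ X_demo / (n - 1)           # covariance matrix

eig_vals, eig_vecs = np.linalg.eig(C_demo)     # PCA: C = V Λ V^{-1}
U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)

print(np.sort(eig_vals))                       # eigenvalues of C
print(np.sort(S**2 / (n - 1)))                 # singular values squared / (n-1)
# The columns of eig_vecs and the rows of Vt agree up to sign.
```
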
# Covariance: sample 100 points from a correlated bivariate normal
import numpy as np
import pandas as pd

mu = [0, 0]; r = 0.4
Sigma = np.array([[1, r],
                  [r, 1]])
data = np.random.multivariate_normal(mu, Sigma, 100)
pd.DataFrame(data, columns=['x', 'y'])
            x         y
0   -0.616981  0.072565
..        ...       ...
70  -0.535418  0.400531
71   0.780422 -0.023953

100 rows × 2 columns

# Scatter plot of the sampled data
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0], data[:,1])
plt.show()

[figure: scatter plot of the sampled bivariate normal data]

# Covariance computed by hand (dividing by n gives the biased estimate;
# use n-1 for the sample covariance)
x = data[:,0]; y = data[:,1]
x_bar = x.mean(); y_bar = y.mean()
((x - x_bar) * (y - y_bar)).sum() / len(x)
0.3966239389360993
# Covariance matrix (this works here because mu = [0, 0], so the data is
# already roughly centered; strictly, mean-center first and divide by n-1)
np.dot(data.T, data)/len(data)
array([[0.8724964 , 0.39755701], [0.39755701, 1.10857547]])
# PCA example: first two features of the first 10 iris samples
from sklearn.datasets import load_iris
iris = load_iris()
N = 10
X = iris.data[:N, :2]
print(X, '\n', iris.feature_names[:2])
[[5.1 3.5] [4.9 3. ] [4.7 3.2] [4.6 3.1] [5. 3.6] [5.4 3.9] [4.6 3.4] [5. 3.4] [4.4 2.9] [4.9 3.1]] ['sepal length (cm)', 'sepal width (cm)']
# Scatter plot of the raw data, labeling each point with its index
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline
n = list(range(len(X)))
f, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X[:,0], X[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (X[:,0][i]+0.01, X[:,1][i]+0.01))

[figure: scatter plot of the first two iris features with index labels]

# Mean-center the data and plot it with a fitted line showing the main direction
import matplotlib.pyplot as plt
%matplotlib inline
X_mean_centering = X - X.mean(axis=0)
n = list(range(len(X)))
f, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X_mean_centering[:,0], X_mean_centering[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (X_mean_centering[:,0][i]+0.01, X_mean_centering[:,1][i]+0.01))

coef = np.polyfit(X_mean_centering[:,0], X_mean_centering[:,1], 1)
poly1d_fn = np.poly1d(coef)
ax.plot(X_mean_centering[:,0], poly1d_fn(X_mean_centering[:,0]))
plt.show()

[figure: mean-centered data with the fitted line]

# ver1: covariance matrix computed by hand
X_mean_centering = X - X.mean(axis=0)
np.dot(X_mean_centering.T, X_mean_centering)/(X.shape[0]-1)
array([[0.08488889, 0.07044444], [0.07044444, 0.09433333]])
# ver2: covariance matrix via np.cov (expects variables in rows, hence X.T)
C = np.cov(X.T)
C
array([[0.08488889, 0.07044444], [0.07044444, 0.09433333]])
# Eigendecomposition of the covariance matrix
eigen_vals, eigen_vecs = np.linalg.eig(C)
eigen_vecs.shape, X.shape
((2, 2), (10, 2))
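
As a small check (not in the original notebook), the mean-centered data can be projected onto the eigenvector with the largest eigenvalue; up to a sign flip this should reproduce the 1-D coordinates that sklearn's PCA computes below.

```python
# Check (not in the original notebook; uses eigen_vals, eigen_vecs and
# X_mean_centering from the cells above): project onto the leading eigenvector.
top = eigen_vals.argmax()              # index of the largest eigenvalue
w = eigen_vecs[:, top]                 # corresponding unit eigenvector
x_low_manual = X_mean_centering @ w    # 1-D coordinates along that axis
print(x_low_manual)                    # matches sklearn's x_low up to sign
```
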
# sklearn PCA: reduce to one component, then reconstruct
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
x_low = pca.fit_transform(X_mean_centering)   # PCA also centers internally, so X would work too
x_inverse = pca.inverse_transform(x_low)      # map the 1-D coordinates back to 2-D
x_low
array([[ 0.30270263], [-0.1990931 ], [-0.18962889], [-0.33097106], [ 0.30743473], [ 0.79976625], [-0.11185966], [ 0.16136046], [-0.61365539], [-0.12605597]])
# Dimensionality reduction: 2 -> 1, plotted on a single axis
f, ax = plt.subplots(figsize=(10, 5))
plt.scatter(x_low[:,0], [1]*len(x_low[:,0]))
for i, txt in enumerate(n):
    ax.annotate(txt, (x_low[:,0][i]+0.01, 1+0.0001))
ax.axhline(1)
plt.show()

[figure: 1-D projection of the samples onto the first principal component]

# Noise removal: the reconstruction keeps only the first principal direction
f, ax = plt.subplots(figsize=(10, 5))
plt.scatter(x_inverse[:,0], x_inverse[:,1])
for i, txt in enumerate(n):
    ax.annotate(txt, (x_inverse[:,0][i]+0.01, x_inverse[:,1][i]+0.0001))
plt.show()

[figure: data reconstructed back into 2-D from the 1-D projection]
