[Math] Singular Value Decomposition, SVD

특이값 분해(Singular Value Decomposition, SVD)

  • 활용: 1) 차원감소 2) 압축 3) 잠재변수 분석 4) PCA보다 계산 속도 빠름
  • $U: \text{left singular vector}, orthogonal, m \times m, \text{eigen-vector of } AA^T$
  • $V: \text{right singular vector}, orthogonal, n \times n, \text{eigen-vector of } A^TA$
  • $\text{The diagonal entries of } \Sigma: \text{singular values of A} A$
    • $\text{singular-value}^2$ = eigen-value

예제) 영화 평점

  • A ~ E 영화 5편(column)에 관객 7명(index)이 평점 부여
  • 영화는 SF, Romance 두가지 부류
  • 관객들은 SF, Romance에 대한 기호가 확실

1) SVD

1
2
3
4
5
6
7
8
A = np.array([[1,1,1,0,0],
[3,3,3,0,0],
[4,4,4,0,0],
[5,5,5,0,0],
[0,2,0,4,4],
[0,0,0,5,5],
[0,1,0,2,2]])
A
array([[1, 1, 1, 0, 0],
       [3, 3, 3, 0, 0],
       [4, 4, 4, 0, 0],
       [5, 5, 5, 0, 0],
       [0, 2, 0, 4, 4],
       [0, 0, 0, 5, 5],
       [0, 1, 0, 2, 2]])
1
2
3
4
5
6
7
8
idx1 = ['SF'] * 4
idx2 = ['Romance'] * 3
idx = idx1 + idx2
idx_ = [idx + " lover" + str(i) for i, idx in enumerate(idx, 1)]

col = list('ABCDE')
col_ = ["movie_" + idx for idx in col]
col_
['movie_A', 'movie_B', 'movie_C', 'movie_D', 'movie_E']
1
2
mov = pd.DataFrame(A,columns=col_, index=idx_)
mov
movie_A movie_B movie_C movie_D movie_E
SF lover1 1 1 1 0 0
SF lover2 3 3 3 0 0
SF lover3 4 4 4 0 0
SF lover4 5 5 5 0 0
Romance lover5 0 2 0 4 4
Romance lover6 0 0 0 5 5
Romance lover7 0 1 0 2 2
1
2
3
left_s, v, right_s = np.linalg.svd(A)
print(" Original Matrix A: {} \n Left_SV: {} \n Right_SV: {} \n Singualr Values: {}"\
.format(A.shape, left_s.shape, right_s.shape, v.shape))
 Original Matrix A: (7, 5)
 Left_SV: (7, 7)
 Right_SV: (5, 5)
 Singualr Values: (5,)

2) Number of Singular Values

  • 주성분1, 2가 data의 94% 설명
  • 영화 5편을 2개 부류로 나눌 수 있음
1
2
3
4
SVD = pd.DataFrame(v, columns=['Singular values'], index=np.arange(1,6))
SVD['Portion'] = round(SVD / SVD.sum(), 3)
SVD['Portion_cum'] = SVD['Portion'].cumsum()
SVD
Singular values Portion Portion_cum
1 1.248101e+01 0.535 0.535
2 9.508614e+00 0.407 0.942
3 1.345560e+00 0.058 1.000
4 3.046427e-16 0.000 1.000
5 0.000000e+00 0.000 1.000
1
2
3
4
5
6
7
plt.plot(SVD['Portion_cum'])
plt.scatter(2, 0.942, marker='o', c='red', s=100)
plt.xticks(np.arange(1, 6))
plt.xlabel('Num of PC')
plt.ylabel('Cumulative Portion')
plt.grid(False)
plt.show()

png

3) Compared with Original data

1
2
# PC: 2
left_s[:,:2]
array([[-0.13759913, -0.02361145],
       [-0.41279738, -0.07083435],
       [-0.5503965 , -0.09444581],
       [-0.68799563, -0.11805726],
       [-0.15277509,  0.59110096],
       [-0.07221651,  0.73131186],
       [-0.07638754,  0.29555048]])
1
np.diag(v)[:2, :2]
array([[12.48101469,  0.        ],
       [ 0.        ,  9.50861406]])
1
right_s[:2, :]
array([[-0.56225841, -0.5928599 , -0.56225841, -0.09013354, -0.09013354],
       [-0.12664138,  0.02877058, -0.12664138,  0.69537622,  0.69537622]])

3.1) retored by PCA2

  • if Original Matrix is too sparse, this restored Matrix would be used
1
2
3
col_pc2 = [x + '_reversed' for x in col_]
restore_PC2 = pd.DataFrame(np.linalg.multi_dot([left_s[:,:2], np.diag(v)[:2, :2], right_s[:2, :]]), columns=col_pc2, index=idx_,)
restore_PC2
movie_A_reversed movie_B_reversed movie_C_reversed movie_D_reversed movie_E_reversed
SF lover1 0.994042 1.011704 0.994042 -0.001327 -0.001327
SF lover2 2.982126 3.035113 2.982126 -0.003982 -0.003982
SF lover3 3.976168 4.046818 3.976168 -0.005309 -0.005309
SF lover4 4.970210 5.058522 4.970210 -0.006636 -0.006636
Romance lover5 0.360313 1.292165 0.360313 4.080263 4.080263
Romance lover6 -0.373851 0.734429 -0.373851 4.916721 4.916721
Romance lover7 0.180157 0.646082 0.180157 2.040132 2.040132

3.2) Comparing Original_data with Restored_PC2

  • Similar with Original data
1
2
3
df = pd.concat([mov, restore_PC2], axis=1)
df.columns=col_ + col_pc2
df
movie_A movie_B movie_C movie_D movie_E movie_A_reversed movie_B_reversed movie_C_reversed movie_D_reversed movie_E_reversed
SF lover1 1 1 1 0 0 0.994042 1.011704 0.994042 -0.001327 -0.001327
SF lover2 3 3 3 0 0 2.982126 3.035113 2.982126 -0.003982 -0.003982
SF lover3 4 4 4 0 0 3.976168 4.046818 3.976168 -0.005309 -0.005309
SF lover4 5 5 5 0 0 4.970210 5.058522 4.970210 -0.006636 -0.006636
Romance lover5 0 2 0 4 4 0.360313 1.292165 0.360313 4.080263 4.080263
Romance lover6 0 0 0 5 5 -0.373851 0.734429 -0.373851 4.916721 4.916721
Romance lover7 0 1 0 2 2 0.180157 0.646082 0.180157 2.040132 2.040132

4) applying PC1, PC2

  • U(left singular vectors M): user to PCA similarity M
1
2
SVD2 = pd.DataFrame(left_s[:,:2] ,columns=['PC1', 'PC2',], index=idx_)
SVD2
PC1 PC2
SF lover1 -0.137599 -0.023611
SF lover2 -0.412797 -0.070834
SF lover3 -0.550397 -0.094446
SF lover4 -0.687996 -0.118057
Romance lover5 -0.152775 0.591101
Romance lover6 -0.072217 0.731312
Romance lover7 -0.076388 0.295550
1
2
3
4
# 비교: user
df2 = pd.concat([mov, SVD2], axis=1)
df2.columns = col_ + list(SVD2.columns)
df2
movie_A movie_B movie_C movie_D movie_E PC1 PC2
SF lover1 1 1 1 0 0 -0.137599 -0.023611
SF lover2 3 3 3 0 0 -0.412797 -0.070834
SF lover3 4 4 4 0 0 -0.550397 -0.094446
SF lover4 5 5 5 0 0 -0.687996 -0.118057
Romance lover5 0 2 0 4 4 -0.152775 0.591101
Romance lover6 0 0 0 5 5 -0.072217 0.731312
Romance lover7 0 1 0 2 2 -0.076388 0.295550

5) applying PC1, PC2

  • movie to PC
    • movie_A heavily corresponds to PC
    • movie_B heavily corresponds to PC
    • $\dots$
1
2
3
col_pca = [x + "_PC2" for x in col_]
df3 = pd.DataFrame(right_s[:2,:], index=['PC1', 'PC2'], columns=col_pca)
df3
movie_A_PC2 movie_B_PC2 movie_C_PC2 movie_D_PC2 movie_E_PC2
PC1 -0.562258 -0.592860 -0.562258 -0.090134 -0.090134
PC2 -0.126641 0.028771 -0.126641 0.695376 0.695376

4) Wrap-up

1
2
# user to PC
SVD2
PC1 PC2
SF lover1 -0.137599 -0.023611
SF lover2 -0.412797 -0.070834
SF lover3 -0.550397 -0.094446
SF lover4 -0.687996 -0.118057
Romance lover5 -0.152775 0.591101
Romance lover6 -0.072217 0.731312
Romance lover7 -0.076388 0.295550
1
2
# movie to PC
df3
movie_A_PC2 movie_B_PC2 movie_C_PC2 movie_D_PC2 movie_E_PC2
PC1 -0.562258 -0.592860 -0.562258 -0.090134 -0.090134
PC2 -0.126641 0.028771 -0.126641 0.695376 0.695376

reference

질문

  • 왜 고유값이 설명력의 ‘양’을 대변하는지
< !-- add by yurixu 替换Google的jquery并且添加判断逻辑 -->