[NLP] Transformer implementation notes

NLP

Encoder part

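Sequential data
- e.g. [나, 는, 학, 교, 에, 간, 다] -> [1, 2, 3, 4, 5, 6, 7] (the characters of "나는 학교에 간다", "I go to school", each mapped to an integer id)
- Fix a sequence length and pad shorter sequences up to it
- If a sequence is longer than the sequence length, either drop it or truncate it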
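The padding and truncation rule above can be sketched as follows. This is a minimal illustration, not a particular tokenizer's API: the helper name `pad_or_truncate` and the padding id `0` are assumptions.

```python
PAD_ID = 0  # assumed padding id; real tokenizers define their own


def pad_or_truncate(token_ids, seq_len, pad_id=PAD_ID):
    """Return a list of exactly seq_len ids."""
    if len(token_ids) >= seq_len:
        # Too long (or exact): keep only the first seq_len ids.
        return token_ids[:seq_len]
    # Too short: append padding ids until the length matches.
    return token_ids + [pad_id] * (seq_len - len(token_ids))


ids = [1, 2, 3, 4, 5, 6, 7]            # the example sequence from the notes
padded = pad_or_truncate(ids, 10)      # [1, 2, 3, 4, 5, 6, 7, 0, 0, 0]
truncated = pad_or_truncate(ids, 5)    # [1, 2, 3, 4, 5]
```

Padded positions usually also need a padding mask so that attention ignores them; that step is not shown here.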

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}_k\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

Why scale?
- $d_k$ is the dimension of the key vectors.
- When the depth ($d_k$) is large, the entries of $QK^T$ grow large in magnitude.
- “The dot-product attention is scaled by a factor of square root of the depth. This is done because for large values of depth, the dot product grows large in magnitude pushing the softmax function where it has small gradients resulting in a very hard softmax.”
  (https://www.tensorflow.org/tutorials/text/transformer)
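Detail: a dot product is the product of the two vectors' magnitudes and the cosine of the angle between them
- $\cos \theta = \dfrac{A \cdot B}{|A||B|}$
- Dividing by $\sqrt{d_k}$, which depends only on the key dimension, keeps the scores from growing too large (https://physics.stackexchange.com/questions/252086/dot-product-approaches-zero-as-the-magnitude-of-the-vectors-increase)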
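A small numerical check of the scaling argument, as a sketch under one assumption: query and key entries are independent standard normal values, so a dot product over $d_k$ entries has standard deviation of about $\sqrt{d_k}$. The sample sizes and variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


d_k = 512

# 2000 independent query/key pairs and their dot products.
q = rng.standard_normal((2000, d_k))
k = rng.standard_normal((2000, d_k))
dots = (q * k).sum(axis=1)

raw_std = dots.std()                        # roughly sqrt(d_k)
scaled_std = (dots / np.sqrt(d_k)).std()    # roughly 1

# One query scored against a handful of keys.
query = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))
logits = keys @ query

hard = softmax(logits)                  # unscaled: close to one-hot
soft = softmax(logits / np.sqrt(d_k))   # scaled: weight is spread out
```

Comparing `hard.max()` with `soft.max()` shows the "very hard softmax" from the quote: without scaling, almost all of the weight lands on a single key.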

score matrix = $QK^T$, shapes: [seq_len, hidden_dim] * [hidden_dim, seq_len]
score matrix *= $\dfrac{1}{\sqrt{d_k}}$ (here $d_k$ = hidden_dim)
score matrix = softmax(score matrix)
- shape: [seq_len, seq_len]
Final output of attention head #1
- score matrix * value matrix = [seq_len, seq_len] * [seq_len, hidden_dim]
  - shape: [seq_len, hidden_dim]
- Multiplying the scores by the value vectors weights each value by how relevant its key is to the query
- For example, entry (row=1, col=1) of the output matrix is the sum, over keys 1, …, n, of the score between query 1 (row=1) and each key times feature 1 (col=1) of that key's value vector.
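The steps above for a single head can be sketched in NumPy. The function name `scaled_dot_product_attention` and the random inputs are assumptions for illustration. In a real layer, Q, K and V come from learned projections of the input.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """q, k, v: [seq_len, hidden_dim]. Returns (output, weights)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # [seq_len, seq_len]
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # output: [seq_len, hidden_dim]


rng = np.random.default_rng(0)
seq_len, hidden_dim = 7, 16
q = rng.standard_normal((seq_len, hidden_dim))
k = rng.standard_normal((seq_len, hidden_dim))
v = rng.standard_normal((seq_len, hidden_dim))

out, weights = scaled_dot_product_attention(q, k, v)

# Entry (row=1, col=1) in the notes is index [0, 0] here: the scores of
# query 1 against every key, times feature 1 of each value vector, summed.
manual = sum(weights[0, j] * v[j, 0] for j in range(seq_len))
```

`manual` matches `out[0, 0]`, which is the worked example from the last bullet written out as an explicit sum.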

Concatenate the outputs of all attention heads
- shape: [seq_len, n_head * hidden_dim]
Encoder output
- Matrix-multiply the concatenation above (the multi-head attention output, shape: [seq_len, n_head * hidden_dim]) by $W_0$ (shape: [n_head * hidden_dim, hidden_n]) -> this gives the same shape as the original input
- shape: [seq_len, hidden_n] = [seq_len, n_head * hidden_dim] * [n_head * hidden_dim, hidden_n]
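A shape-only sketch of the concatenation and output projection. The per-head outputs are random stand-ins rather than real attention results, and the sizes (4 heads, `hidden_n = 64`) are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_head, hidden_dim = 7, 4, 16
hidden_n = 64  # assumed model width, i.e. the width of the original input

# Stand-ins for the output of each attention head: [seq_len, hidden_dim].
head_outputs = [rng.standard_normal((seq_len, hidden_dim)) for _ in range(n_head)]

# Concatenate along the feature axis: [seq_len, n_head * hidden_dim].
concat = np.concatenate(head_outputs, axis=-1)

# Output projection W_0: [n_head * hidden_dim, hidden_n].
W_0 = rng.standard_normal((n_head * hidden_dim, hidden_n))
projected = concat @ W_0  # [seq_len, hidden_n]
```

Because `projected` has the same shape as the layer input, it can be added back to that input in a residual connection.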

The decoder's 1st attention layer is self-attention
- A triangular mask prevents each position from seeing the words that follow it
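One common way to apply the triangular mask is to overwrite the masked scores with a large negative number before the softmax. This is a sketch under that assumption: the constant `-1e9` and the helper name `causal_mask` are illustrative choices.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(seq_len):
    # True strictly above the diagonal: position i must not see j > i.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)


rng = np.random.default_rng(0)
seq_len, hidden_dim = 5, 8
q = rng.standard_normal((seq_len, hidden_dim))
k = rng.standard_normal((seq_len, hidden_dim))

scores = q @ k.T / np.sqrt(hidden_dim)                  # [seq_len, seq_len]
masked = np.where(causal_mask(seq_len), -1e9, scores)   # hide future positions
weights = softmax(masked)                               # lower triangular
```

After the softmax, every weight above the diagonal is zero, so the first position attends only to itself.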
The 2nd attention layer (encoder-decoder attention layer) uses the encoder's output as key and value
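A shape sketch of encoder-decoder attention: the query comes from the decoder, while key and value come from the encoder output. The projection matrices `W_q`, `W_k`, `W_v`, the lengths and the random inputs are all assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
src_len, tgt_len, hidden_dim = 7, 4, 16

encoder_output = rng.standard_normal((src_len, hidden_dim))
decoder_state = rng.standard_normal((tgt_len, hidden_dim))  # after masked self-attention

# Hypothetical projection weights, one per role.
W_q, W_k, W_v = (rng.standard_normal((hidden_dim, hidden_dim)) for _ in range(3))

q = decoder_state @ W_q    # [tgt_len, hidden_dim], from the decoder
k = encoder_output @ W_k   # [src_len, hidden_dim], from the encoder
v = encoder_output @ W_v   # [src_len, hidden_dim], from the encoder

weights = softmax(q @ k.T / np.sqrt(hidden_dim))  # [tgt_len, src_len]
context = weights @ v                             # [tgt_len, hidden_dim]
```

The score matrix is no longer square here: each target position gets one row of weights over all source positions.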
A feed-forward layer is applied to the final output
- $ReLU(xW_1 + b_1) W_2 + b_2$
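The feed-forward formula translates directly to NumPy. A minimal sketch: the inner width `d_ff = 256` and the random weights are assumed values, not taken from the notes.

```python
import numpy as np


def feed_forward(x, W1, b1, W2, b2):
    """ReLU(x W1 + b1) W2 + b2, applied to every position independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU
    return hidden @ W2 + b2


rng = np.random.default_rng(0)
seq_len, hidden_n, d_ff = 7, 64, 256  # d_ff: assumed inner width

x = rng.standard_normal((seq_len, hidden_n))
W1 = rng.standard_normal((hidden_n, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, hidden_n))
b2 = np.zeros(hidden_n)

y = feed_forward(x, W1, b1, W2, b2)  # [seq_len, hidden_n]
```

Since the same weights are applied to each row, running a single position through `feed_forward` gives the same result as the matching row of `y`.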