[DeepLearning] DL Cheat Sheet

1. sigmoid with binary cross entropy back propagation

  • Reference
  • In back propagation, passing through the cross entropy and softmax layers simplifies the gradient to y - t
  • Graph nodes

    • forward: x -> sigmoid -> y -> binary cross entropy error -> Loss
    • backward: $\dfrac{\partial L}{\partial x} = \dfrac{\partial L}{\partial y} \dfrac{\partial y}{\partial x} = \dfrac{\partial L}{\partial y} y(1-y) = y - t$ <- sigmoid <- $\dfrac{\partial L}{\partial y}=-\dfrac{t}{y} + \dfrac{1-t}{1-y}$ <- binary cross entropy error <- $\dfrac{\partial L}{\partial L} = 1$
  • In the softmax case

    • a -> softmax -> y -> cross entropy error -> Loss
    • y - t <- softmax <- $-\dfrac{t}{y}$ <- cross entropy error <- $\dfrac{\partial L}{\partial L} = 1$
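  • A minimal NumPy sketch of this simplification for the sigmoid / binary cross entropy pair, assuming a scalar input x and target t; the chained backward pass collapses to y - t:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, t = 0.5, 1.0                                     # example input and target (assumed values)
    y = sigmoid(x)                                      # forward: x -> sigmoid -> y
    loss = -(t * np.log(y) + (1 - t) * np.log(1 - y))   # binary cross entropy error

    dL_dy = -t / y + (1 - t) / (1 - y)                  # backward through the loss
    dL_dx = dL_dy * y * (1 - y)                         # backward through sigmoid
    print(np.isclose(dL_dx, y - t))                     # True: the chain collapses to y - t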

2. What a - a.max() is doing in the softmax function

  • Since a becomes the exponent of e ($e^a$), large values of a make the whole expression blow up (overflow). To prevent this, shift the scores so that the maximum value sits at 0.
  • $\dfrac{e^{x_i-m}}{\Sigma e^{x_j-m}} = \dfrac{e^{x_i}/e^m}{\Sigma e^{x_j}/e^m} = \dfrac{e^{x_i}}{\Sigma e^{x_j}}$ where m=max(x)
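  • A small sketch of the shift, assuming a 1-D score vector a; np.exp(a) alone would overflow here, while the shifted version returns the same probabilities:

    import numpy as np

    def softmax(a):
        shifted = a - a.max()          # the max becomes 0, so exp() never overflows
        e = np.exp(shifted)
        return e / e.sum()

    a = np.array([1000.0, 1001.0, 1002.0])
    print(softmax(a))                  # well-defined despite the huge scores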

3. sigmoid vs softmax

  • Both are output activation functions: sigmoid for a single binary (or per-label) probability, softmax for a distribution over mutually exclusive classes

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Flatten, Dense

    model = Sequential()
    model.add(Flatten(input_shape=(28, 28)))    # example input shape
    model.add(Dense(1, activation='sigmoid'))   # binary classification output
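  • For comparison, a minimal sketch of the softmax counterpart, reusing the imports above (the 10-class count and input shape are assumed for illustration):

    multi = Sequential()
    multi.add(Flatten(input_shape=(28, 28)))
    multi.add(Dense(10, activation='softmax'))  # one probability per class, summing to 1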

4. W initialization

  • np.sqrt(num_of_data): divide the random initial weights by this factor, where num_of_data is typically the number of inputs feeding each unit (fan-in), as in Xavier initialization
  • This sort of initialization keeps the weight matrix neither much bigger than 1 nor much smaller than 1, so the gradients neither explode nor vanish.
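  • A minimal NumPy sketch, assuming a fully connected layer with n_in inputs and n_out outputs:

    import numpy as np

    n_in, n_out = 784, 128                              # assumed layer sizes
    W = np.random.randn(n_in, n_out) / np.sqrt(n_in)    # scale random weights by 1/sqrt(fan-in)
    b = np.zeros(n_out)
    print(W.std())                                      # roughly 1/sqrt(784) ≈ 0.036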

5. multilabel cross entropy cost function

  • key: $-\log{\hat{y}_i}$ where $0 \le \hat{y}_i \le 1$
    • the curve decreases as $\hat{y}_i$ goes from 0 to 1 (x-intercept at 1)
    • if the prediction is correct ($\hat{y}_i = 1$ for the true class), the cost is 0
  • element-wise multiply the one-hot target with $-\log{\hat{y}}$, then sum (see the sketch below)
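  • A short NumPy sketch, assuming a one-hot target t and predicted probabilities y_hat:

    import numpy as np

    t = np.array([0.0, 1.0, 0.0])          # one-hot target (true class is index 1)
    y_hat = np.array([0.2, 0.7, 0.1])      # predicted probabilities (assumed)
    cost = -np.sum(t * np.log(y_hat))      # element-wise multiply keeps only -log(0.7)
    print(cost)                            # ≈ 0.357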

6. log softmax

  • Reference
  • log softmax gives better gradient descent behavior (and numerical stability) than plain softmax
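  • A minimal NumPy sketch of log softmax via the log-sum-exp trick, assuming a 1-D score vector a:

    import numpy as np

    def log_softmax(a):
        shifted = a - a.max()                            # same max-shift as in softmax
        return shifted - np.log(np.exp(shifted).sum())

    a = np.array([1.0, 2.0, 3.0])
    print(log_softmax(a))                                # equals np.log(softmax(a)), computed stably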

7. cross entropy vs multi-class cross entropy

  • In the multi-class case: the cost from the negative (incorrect) classes is ignored (see the quote below)
    "this is a key feature of multiclass logloss, it rewards/penalises probabilities of correct classes only. The value is independent of how the remaining probability is split between incorrect classes."

8. RNN

  • Terms that were confusing
    • number of steps: when the RNN is unfolded, the number of RNN cells = the number of time steps
  • folding/unfolding RNN: it is called this because an RNN is essentially a "single" layer; the same h_w, h_i, and bias are applied at every time step, and BPTT simply updates them across that many time steps.
  • Example of batching the data (see the sketch below)
    • training on a corpus (length = 1000) with time step = 10
    • the time step means the sequence is truncated into chunks of 10 and BPTT is run within each chunk
    • input data when batch size = 2
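  • A rough NumPy sketch of one possible way to cut such batches, assuming a toy corpus of 1000 token ids, time step 10, and batch size 2:

    import numpy as np

    corpus = np.arange(1000)                        # toy corpus of token ids
    time_step, batch_size = 10, 2
    per_stream = len(corpus) // batch_size          # 500 tokens per parallel stream
    streams = corpus[:per_stream * batch_size].reshape(batch_size, per_stream)

    first_batch = streams[:, :time_step]            # shape (batch_size, time_step)
    print(first_batch)
    # [[  0   1   2   3   4   5   6   7   8   9]
    #  [500 501 502 503 504 505 506 507 508 509]]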

9. NLP stemming vs lemmatization

  • link1: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
  • link2: https://datascience.stackexchange.com/questions/49712/lemmatization-vs-stemming
  • stemming: the result of chopping off suffixes etc., e.g. studies -> studi
    • "Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes."
  • lemmatization: reduce the word to its dictionary base form, e.g. studies -> study
  • So which performs better? Usually lemmatization.
    • stemming is a way to raise recall, i.e. it increases the number of candidate matches (while lemmatization increases precision)
    • "Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma."
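  • A short NLTK sketch of the studies example (assumes the nltk package is installed and its WordNet data can be downloaded):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download('wordnet', quiet=True)             # the lemmatizer needs the WordNet corpus
    print(PorterStemmer().stem('studies'))           # -> 'studi'
    print(WordNetLemmatizer().lemmatize('studies'))  # -> 'study'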

10. Language Model(LM)

11. Domain Adaptation vs Transfer learning

12. Feature based vs fine-tuning based approach

  • 참고: https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html
  • Both are special cases of Domain Adaptation / Transfer learning
  • Fine-tuning: use the pretrained weights as the initial weights and train on domain-specific data. e.g. GPT, BERT
    • Benefits: faster training; effective when the available dataset is small
    • Usage: freeze certain layers so the initial weights (from the pretrained model) are not updated, or unfreeze them and update (see the sketch below)
  • Feature-based: use the pretrained representations as additional features. e.g. ELMo
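  • A minimal Keras sketch of the freeze/unfreeze idea; the tiny network below is only a stand-in for a pretrained encoder, and the layer sizes are assumed for illustration:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential([
        Dense(128, activation='relu', input_shape=(64,)),  # "pretrained" layers (weights would come from a checkpoint)
        Dense(128, activation='relu'),
        Dense(1, activation='sigmoid'),                    # new task-specific head
    ])

    for layer in model.layers[:-1]:
        layer.trainable = False        # freeze: keep the pretrained weights fixed while training the head
    # later, optionally unfreeze and fine-tune everything at a low learning rate:
    # for layer in model.layers[:-1]:
    #     layer.trainable = True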

shuffle dataset

: TBD

learning rate decay

: TBD
