Okapi BM25: TF-IDF를 개선한 Ranking Function

NLP

Okapi BM25

참고: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
BM25는 ranking function으로서, query가 주어졌을 때, query와 document 간 연관의 정도를 추정함.

$W_{i, d} = tf_{i, d} \times \log{\dfrac{N}{df_i}}$

$W_{i, d}$은 doc_d의 word_i
$tf_{i, d}$는 doc_d에서 word_i의 frequency 값
$df_i$는 word_i의 df, document frequency 값
TF-IDF로 doc를 벡터화하면, row는 doc1, doc2, …, col은 term1, term2, … value는 TF-IDF score
기존 TF-IDF의 문제점
- 1) TF의 경우 freq에 따라 지속 증가 -> square root(tf) 사용하기도 함 -> BM25는 이를 더 개선
- 2) 해당 doc의 길이가 길수록 Term freq가 높을 가능성이 있음(해당 문제는 doc_len으로 Term frequency를 나눠줌으로써 normalize하여 어느정도 해결가능)
BM25는 특히 TF score를 개선

$score(query, doc_d) = \Sigma_{i=1}^n (IDF_i \dfrac{n_i(k_1 + 1)}{n_i+k_i(1-b+b \dfrac{\text{len of doc_d}}{\text{total doc avg len}})})$

상수 k 추가: 일정 frequency 이상은 saturate
- ((k + 1) * sqrt(tf_score)) / (k + sqrt(tf_score))
sqrt(tf): red line vs BM25_tf: blue line
- BM25_tf은 일정 tf_score이상은 빠르게 saturate

변수 L추가: 해당 텍스트 길이로만 tf를 normalize하는 것이 아니라, 전체 텍스트들의 길이를 반영. L은 전체 텍스트들의 평균 길이의 배수
- L = doc_d / avg(total len of target docs)
L과 k를 모두 반영한 tf_score
- ((k + 1) * sqrt(tf_score)) / (k * (1.0 - b + b * L) + sqrt(tf_score))
hyper param은 k1, b는 각각 2.0, 0.75가 자주 사용됨
k1=2, b=0.75 일 때, L=0.5, 2, 5(각각 전체 텍스트 길이의 n배)에 대한 비교
- L이 커질수록(평균 대비 text길이가 길수록) tf_score 천천히 saturate

IDF part는 log시 smoothing 1
- log ( num of texts / term_i freq through texts + 1)

scoring: query 입력 -> BM25 value합산
- eg. query: [outdoor sports]라고 했다면
  - Doc1: [best selling outdoor sports wear]에서 outdoor, sports의 BM25 tf_idf 값을 합산
  - Doc2: [best outdoor wear]에서 outdoor의 BM25 tf_idf 값을 합산
  - Doc1, Doc2의 score(relevance) 비교

검색 엔진 text similarity score에 사용됨. (IR, Information Retrieval modeling)
- 관련slide: https://www.slideshare.net/MinsubYim/ir-modeling)
Doc1(query)와 가장 유사한 문장을 찾을 때 활용
드물지만 text cluster에 활용 가능(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3060097/pdf/pone.0018029.pdf)
- 해당 논문에서는 PubMed’s own related article approach (PMRA) 방법 다음으로 BM25를 꼽음
특히 TF-IDF의 내적을 통한 cosin sim방법과 가장 큰 차이점은,
“1) it limited the word set to those that occur in less than 0.99% of the documents instead of using the full word-document matrix, and
2) it used the BM25 similarity approach in place of the standard tf-idf. The effect of the first change (truncating the word distribution) was to remove a large amount of noise from the solution space. The effect of the second change (BM25) was to use a superior similarity approach, as has been established in the literature. Combined, these two changes had an enormous positive effect on the accuracy of the cluster solution.”