
Alternatives to Classic BM25-IDF based on a New Information Theoretical Framework

Abstract

The IDF (Inverse Document Frequency) term weighting method is a classic treatment of a term’s significance in information retrieval and text analytics. IDF can be derived from the information-theoretic Kullback-Leibler (KL) divergence and has given rise to competitive methods such as TF*IDF and Okapi BM25, which is the default scoring function of Elasticsearch. We developed a new information metric called DLITE and derived from it an alternative to IDF, namely iDL, for term weighting and scoring in ranked information retrieval. In a series of experiments we conducted on multiple benchmark TREC collections, iDL methods consistently outperformed BM25, a very competitive baseline, for ad hoc retrieval. We outline the theoretical properties of DLITE that support the effectiveness of iDL. As a general information measure, we expect DLITE to be applicable in many other areas of big-data analytics where further research will be valuable.
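
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of classic Okapi BM25 scoring with its IDF component. It is not the paper's method and is not taken from the paper: the function names `bm25_idf` and `bm25_scores`, the toy documents, and the whitespace tokenization are illustrative assumptions. The IDF uses the smoothed, non-negative variant found in Lucene-based engines, and `k1=1.2`, `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_idf(num_docs, doc_freq):
    """Smoothed BM25 IDF (assumed variant): rarer terms get larger weights."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    num_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / num_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += bm25_idf(num_docs, doc_freq[term]) * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "information retrieval with term weighting".split(),
    "big data analytics".split(),
    "term weighting for big data retrieval".split(),
]
scores = bm25_scores("term weighting".split(), docs)
```

In this sketch the second document shares no terms with the query and scores zero, while the first document outranks the third because it matches the same terms in a shorter text. An alternative weighting such as iDL would take the place of `bm25_idf`; its definition is given in the paper and is not reproduced here.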

Publication
In IEEE Big Data 2022


Big Data Osaka, Japan
Weimao Ke
Associate Professor & Assoc Dept Head for Grad Affairs

My research interests include information retrieval, distributed machine learning, big data, and the notion of information.
