KL Divergence Classification with Ankusa on Hadoop/HBase
I recently posted a description of a new text classification project called ankusa. I decided to add a new classification method alongside the naive Bayes classifier to provide an alternative means of differentiation. I’ve used it before for determining semantic distance between different categories of text and thought it could be useful here, especially under the right conditions.
The method uses Kullback–Leibler divergence to measure the difference between the probability distribution of each class of text and that of the text to classify. KL divergence is not a true distance metric in that it does not satisfy the triangle inequality (nor is it symmetric), but it can still be quite useful for applications like text classification. It can be slightly faster than naive Bayes on a large corpus because it doesn’t have to calculate prior probabilities, only likelihoods. The implementation uses Laplace smoothing, just like the Bayes classifier.
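To make the idea concrete, here is a minimal, self-contained Ruby sketch of KL-divergence classification with Laplace smoothing. It is illustrative only, not ankusa’s actual implementation: the class name `KLClassifier`, the in-memory hashes, and the tokenizer are all assumptions for the example, standing in for ankusa’s HBase-backed storage. It scores a document against each class by summing, over the document’s tokens only, p·log(p/q), where p is the document’s empirical token probability and q is the Laplace-smoothed class likelihood; no class priors are involved.

```ruby
# Toy KL-divergence classifier with Laplace smoothing.
# Illustrative sketch only; not ankusa's code or API.
class KLClassifier
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => token => count
    @totals = Hash.new(0)                            # class => total token count
    @vocab  = {}                                     # global vocabulary
  end

  def train(klass, text)
    tokenize(text).each do |token|
      @counts[klass][token] += 1
      @totals[klass] += 1
      @vocab[token] = true
    end
  end

  # Returns the class whose smoothed token distribution has the smallest
  # KL divergence from the document's empirical token distribution.
  def classify(text)
    tokens = tokenize(text)
    doc = Hash.new(0)
    tokens.each { |t| doc[t] += 1 }
    n = tokens.length.to_f
    @counts.keys.min_by do |klass|
      doc.sum do |token, count|
        p = count / n                          # P(token | document)
        q = (@counts[klass][token] + 1.0) /
            (@totals[klass] + @vocab.size)     # Laplace-smoothed P(token | class)
        p * Math.log(p / q)
      end
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/\w+/)
  end
end

c = KLClassifier.new
c.train(:spam, "buy cheap pills now")
c.train(:ham, "meeting agenda for tomorrow")
c.classify("cheap pills now")   # => :spam
```

Because the sum runs only over tokens that appear in the document, the cost per class scales with the document length rather than the class vocabulary, which is where the speed advantage over computing full posteriors comes from.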
The one caveat to its use, however, is that without a long enough test string (i.e., the text you are trying to classify), your results may be less accurate than those of the naive Bayes classifier.
KL divergence classifier usage: