Naive Bayes Classification in Ruby using Hadoop and HBase
One of the problems I’ve run into recently at work is that we have quite a bit of text that needs to be classified. My first thought was to use one of the simplest classification methods, a Naive Bayes classifier, but I couldn’t find an implementation that could handle many terabytes of data. Most Ruby implementations, like the classifier gem, are simplistic; the classifier gem, for instance, isn’t a true Naive Bayes implementation because it ignores prior probabilities. So I decided to write a better Naive Bayes implementation (one that uses a Laplacian smoother, for example) that could also handle many terabytes of corpus data.
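To make the two criticisms above concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Ruby that keeps class priors and applies Laplace (add-one) smoothing. The class and method names are illustrative only; this is not ankusa's code or API.

```ruby
# Minimal multinomial Naive Bayes with class priors and Laplace smoothing.
# Illustrative only -- not ankusa's implementation.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => word => count
    @doc_counts  = Hash.new(0)                            # class => training docs
    @vocab       = {}                                     # global vocabulary
  end

  def train(klass, text)
    @doc_counts[klass] += 1
    tokenize(text).each do |word|
      @word_counts[klass][word] += 1
      @vocab[word] = true
    end
  end

  # Hash of class => unnormalized log P(class | text).
  def log_likelihoods(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.map do |klass|
      # Prior probability -- the term a "bayes" classifier without
      # priors silently drops.
      score = Math.log(@doc_counts[klass] / total_docs)
      class_total = @word_counts[klass].values.sum
      words.each do |word|
        # Laplace smoothing: +1 on every count so an unseen word
        # never zeroes out the whole product.
        score += Math.log((@word_counts[klass][word] + 1.0) /
                          (class_total + @vocab.size))
      end
      [klass, score]
    end.to_h
  end

  def classify(text)
    log_likelihoods(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end
end
```

With two tiny training documents, `TinyBayes.new` picks the expected class: train `:spam` on "buy cheap pills now" and `:ham` on "let us meet for lunch", and `classify "cheap pills"` returns `:spam`.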
I spent today implementing the classifier, and have released the code as the ankusa gem. Unlike other classifiers written in Ruby, ankusa has a fairly abstract storage class that can easily be implemented for other storage backends; the two that ship with the gem provide HBase storage and in-memory storage.
To use the gem:
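A usage sketch along these lines, training against HBase-backed storage (the class and method names follow ankusa's README at the time and may have changed since; the `'localhost'` host and the `:spam`/`:good` labels are placeholders):

```ruby
require 'rubygems'
require 'ankusa'
require 'ankusa/hbase_storage'

# Point the storage at your HBase cluster ('localhost' is a placeholder).
storage = Ankusa::HBaseStorage.new 'localhost'
classifier = Ankusa::Classifier.new storage

# Train with a class label symbol and some text.
classifier.train :spam, "buy cheap pills now"
classifier.train :good, "thanks for sending the meeting notes"

# classify returns the most likely class label;
# classifications returns a Hash of class => probability.
puts classifier.classify "cheap pills"
puts classifier.classifications("cheap pills").inspect

storage.close
```

Swapping `Ankusa::HBaseStorage` for the in-memory storage class lets you run the same code without a cluster.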
The classifier does return probabilities when you use the classifications method (unlike the classifier gem, which only returns log likelihoods). Additionally, the classifier places no limit on corpus size (HBase can handle petabytes of data, depending on your cluster size), so realistically your training set can be as large as you need it to be.