I've been playing around with various classification algorithms lately, so I wrote a really simplified discrete naive Bayes classifier in Python. There's no emphasis on sample correction; simplicity was key here, but it still works quite well.
from operator import itemgetter
from collections import defaultdict

class BayesClassifier:
    def __init__(self):
        self.total_count = 0                 # Observations of individual attributes
        self.class_count = defaultdict(int)  # Observations of cls
        self.attrs_count = defaultdict(int)  # Observations of (cls, attr)
        self.correction = 0.0001             # Prevent multiplication by 0.0

    def train(self, cls, attrs):
        ''' Add observation of 'attrs' as being an instance of 'cls' '''
        self.class_count[cls] += 1
        for attr in attrs:
            self.attrs_count[(cls, attr)] += 1
            self.total_count += 1

    def rate(self, cls, attrs):
        ''' Return probability rating of 'attrs' being an instance of 'cls' '''
        result = float(self.class_count[cls]) / self.total_count
        for attr in attrs:
            result *= self.attrs_count.get((cls, attr), self.correction)
        return result / pow(self.total_count, len(attrs))

    def classify(self, attrs):
        ''' Return most likely class that 'attrs' belongs to '''
        rated_classes = [(self.rate(cls, attrs), cls) for cls in self.class_count]
        rated_classes.sort(key=itemgetter(0), reverse=True)
        return rated_classes[0][1]
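To give a feel for the interface, here's a quick usage sketch. The class is repeated so the snippet runs on its own, and the spam/ham training data is made up purely for illustration:

```python
from operator import itemgetter
from collections import defaultdict

class BayesClassifier:
    # Same classifier as above, repeated so this snippet is standalone
    def __init__(self):
        self.total_count = 0
        self.class_count = defaultdict(int)
        self.attrs_count = defaultdict(int)
        self.correction = 0.0001

    def train(self, cls, attrs):
        self.class_count[cls] += 1
        for attr in attrs:
            self.attrs_count[(cls, attr)] += 1
            self.total_count += 1

    def rate(self, cls, attrs):
        result = float(self.class_count[cls]) / self.total_count
        for attr in attrs:
            result *= self.attrs_count.get((cls, attr), self.correction)
        return result / pow(self.total_count, len(attrs))

    def classify(self, attrs):
        rated = [(self.rate(cls, attrs), cls) for cls in self.class_count]
        rated.sort(key=itemgetter(0), reverse=True)
        return rated[0][1]

# Tiny made-up training set: two spam observations, one ham
bayes = BayesClassifier()
bayes.train('spam', ['cheap', 'pills', 'online'])
bayes.train('spam', ['cheap', 'watches', 'online'])
bayes.train('ham', ['meeting', 'agenda', 'monday'])

print(bayes.classify(['cheap', 'online']))    # -> spam
print(bayes.classify(['meeting', 'monday']))  # -> ham
```

Note how the tiny `correction` constant stands in for unseen (cls, attr) pairs: 'cheap' was never observed for 'ham', so the ham rating collapses toward zero instead of multiplying by exactly 0.0.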
Playing around with it, I used various spam/not-spam training sets and various categorical training sets. Attributes can be labeled by the user instead of being plain "bag of words" lists by tagging the values in the attrs list, such as ['weekday:wed', 'weather:sunny', 'humidity:high']. Likewise, positional attributes can easily be tagged with their index: ['0:this', '1:works', '2:well']. It's trivial to write a function that turns lists, objects, dicts, and data models into such tagged attribute lists.

But playing with the algorithm in its "bag of words" form, I thought it would be neat to see how it does with authorship detection. Using an approach similar to spam/not-spam, I trained it to classify quotes by author based on word and punctuation probabilities. In this example it parses brainyquote.com to train from the first pages of quotes by a given author, then tests the classifier with known quotes that weren't included on the first page. In a real-world scenario you'd want to train it with a much larger corpus, but in this case it works fairly well.
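As an aside, the kind of tagging helper mentioned above might look like this (a sketch; the helper name is mine, and the keys are sorted just to make the output deterministic):

```python
def tag_attrs(record):
    # Hypothetical helper: flatten a dict into 'key:value' attribute strings,
    # suitable for passing to BayesClassifier.train() / classify()
    return ['%s:%s' % (key, value) for key, value in sorted(record.items())]

print(tag_attrs({'weekday': 'wed', 'weather': 'sunny', 'humidity': 'high'}))
# -> ['humidity:high', 'weather:sunny', 'weekday:wed']
```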
Here it is learning to classify between Richard Dawkins, George W Bush, and Charles Dickens :) (yes, I chose them for word contrast)
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def getwords(text):
    ''' Split text into words and useful punctuation tokens '''
    import string
    text = text.replace("'", '')
    # Remove useless punctuation
    for c in string.punctuation:
        if c not in '.,;?!':
            text = text.replace(c, ' ')
    # Keep useful punctuation as tagged tokens
    for c in '.,;?!':
        text = text.replace(c, ' punctuation:%s ' % c)
    text = text.lower()
    return [str(word) for word in text.split() if len(word) > 3]

def getquotes(author):
    ''' Return list of quotes by author from brainyquote.com '''
    base_url = 'http://www.brainyquote.com/quotes/authors/%s/%s.html'
    soup = BeautifulSoup(urlopen(base_url % (author[0], author)).read())
    td = soup.find('td', {'align': 'left', 'valign': 'top', 'width': 440})
    quotes = []
    for quote in td.findAll('span', {'class': 'body'}):
        quotes.append(quote.string)
    return quotes

bayes = BayesClassifier()

# Train bayes with quotes by author
for author in ['richard_dawkins', 'george_w_bush', 'charles_dickens']:
    for quote in getquotes(author):
        bayes.train(author, getwords(quote))

test_data = [
    ['Bush',
     "Government does not create wealth. The major role for the government is to create an environment where people take risks to expand the job rate in the United States."],
    ['Dawkins',
     "There may be fairies at the bottom of the garden. There is no evidence for it, but you can't prove that there aren't any, so shouldn't we be agnostic with respect to fairies?"],
    ['Dickens',
     "I have known a vast quantity of nonsense talked about bad men not looking you in the face. Don't trust that conventional idea. Dishonesty will stare honesty out of countenance any day in the week, if there is anything to be got by it."],
]

# Test bayes with untrained quotes
for author, quote in test_data:
    guess = bayes.classify(getwords(quote))
    print 'Classified as %s, should be %s' % (guess, author)
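For a feel of what the classifier actually sees, here's the tokenizer run on a short sentence. The function is repeated verbatim so the snippet is standalone (and note that the len > 3 filter quietly drops short words like "the" and "in"):

```python
import string

def getwords(text):
    ''' Split text into words and useful punctuation tokens '''
    text = text.replace("'", '')
    # Remove useless punctuation
    for c in string.punctuation:
        if c not in '.,;?!':
            text = text.replace(c, ' ')
    # Keep useful punctuation as tagged tokens
    for c in '.,;?!':
        text = text.replace(c, ' punctuation:%s ' % c)
    text = text.lower()
    return [str(word) for word in text.split() if len(word) > 3]

print(getwords("Don't trust that conventional idea."))
# -> ['dont', 'trust', 'that', 'conventional', 'idea', 'punctuation:.']
```

The sentence-ending period survives as its own 'punctuation:.' token, which is exactly the kind of stylistic signal that helps separate Dickens's long, comma-heavy sentences from Bush's shorter ones.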