
Naive Bayes (and author detection)

I've been playing around with various classification algorithms lately, so I wrote a really simplified discrete Naive Bayes classifier in Python. There's no emphasis on sample correction since simplicity was the key here, but it still works quite well.

from operator import itemgetter
from collections import defaultdict

class BayesClassifier:

    def __init__(self):
        self.total_count = 0                  # Observations of individual attributes
        self.class_count = defaultdict(int)   # Observations of cls
        self.attrs_count = defaultdict(int)   # Observations of (cls, attrs)
        self.correction = 0.0001              # Prevent multiplication by 0.0

    def train(self, cls, attrs):
        ''' Add observation of 'attrs' as being an instance of 'cls' '''
        self.class_count[cls] += 1
        for attr in attrs:
            self.attrs_count[(cls, attr)] += 1
            self.total_count += 1

    def rate(self, cls, attrs):
        ''' Return probability rating of 'attrs' being an instance of 'cls' '''
        result = float(self.class_count[cls]) / self.total_count
        for attr in attrs:
            result *= self.attrs_count.get((cls, attr), self.correction)
        return result / pow(self.total_count, len(attrs))

    def classify(self, attrs):
        ''' Return most likely class that 'attrs' belongs to '''
        rated_classes = [(self.rate(cls, attrs), cls) for cls in self.class_count]
        rated_classes.sort(key=itemgetter(0), reverse=True)
        return rated_classes[0][1]
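The rating above multiplies many small numbers, which can underflow for long attribute lists, and it relies on a fixed correction constant. A common alternative, which is not what this post uses, is to sum logarithms and apply additive (Laplace) smoothing per class. Here is a minimal sketch of that idea; the `LogBayesClassifier` name, the `alpha` parameter, and the toy spam/ham data are all made up for illustration.

```python
import math
from collections import defaultdict


class LogBayesClassifier:
    """Hypothetical variant: Naive Bayes scored in log space with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Smoothing strength (assumed default)
        self.class_count = defaultdict(int)        # Observations of cls
        self.attr_count = defaultdict(int)         # Observations of (cls, attr)
        self.class_attr_total = defaultdict(int)   # Attribute observations per cls
        self.vocabulary = set()                    # Every attribute seen in training

    def train(self, cls, attrs):
        self.class_count[cls] += 1
        for attr in attrs:
            self.attr_count[(cls, attr)] += 1
            self.class_attr_total[cls] += 1
            self.vocabulary.add(attr)

    def rate(self, cls, attrs):
        """Return log P(cls) + sum of log P(attr | cls)."""
        total_docs = sum(self.class_count.values())
        score = math.log(self.class_count[cls]) - math.log(total_docs)
        denominator = self.class_attr_total[cls] + self.alpha * len(self.vocabulary)
        for attr in attrs:
            numerator = self.attr_count.get((cls, attr), 0) + self.alpha
            score += math.log(numerator / denominator)
        return score

    def classify(self, attrs):
        # Highest log score wins; only classes seen in training are considered
        return max(self.class_count, key=lambda cls: self.rate(cls, attrs))


# Toy data, invented for this example
clf = LogBayesClassifier()
clf.train('spam', ['cheap', 'pills', 'offer'])
clf.train('spam', ['cheap', 'offer', 'now'])
clf.train('ham', ['meeting', 'notes', 'attached'])
clf.train('ham', ['lunch', 'meeting', 'tomorrow'])
guess = clf.classify(['cheap', 'offer'])
```

Because each attribute count is divided by the totals for its own class, a class with more training text does not automatically get a higher rating.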

Playing around with it, I used various spam/not-spam training sets and various categorical training sets. Instead of plain "bag of words" lists, attributes can be labeled by tagging the values in the attrs list, such as ['weekday:wed', 'weather:sunny', 'humidity:high']. Likewise, positional attributes can easily be tagged with their index: ['0:this', '1:works', '2:well']. It's trivial to write a function that turns lists, objects, dicts, or data models into such tagged attribute lists.
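As a rough illustration of that tagging idea, here are two small helpers. The names `tag_dict` and `tag_positions` are hypothetical and not part of the classifier above; sorting the dict keys is just an assumption to keep the output stable.

```python
def tag_dict(record):
    """Turn {'weekday': 'wed'} into ['weekday:wed'], sorted by key for stable output."""
    return ['%s:%s' % (key, record[key]) for key in sorted(record)]


def tag_positions(items):
    """Turn ['this', 'works'] into ['0:this', '1:works']."""
    return ['%d:%s' % (index, item) for index, item in enumerate(items)]


weather = tag_dict({'weekday': 'wed', 'weather': 'sunny', 'humidity': 'high'})
# weather == ['humidity:high', 'weather:sunny', 'weekday:wed']
words = tag_positions(['this', 'works', 'well'])
# words == ['0:this', '1:works', '2:well']
```

Either list can be passed straight to train() or classify() as the attrs argument.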

But playing with the algorithm in its "bag of words" form, I thought it would be neat to see how it does with authorship detection. Using an approach similar to spam/not-spam, I trained it to classify quotes by author based on word and punctuation probabilities. In this example it trains on the quotes parsed from the first page of quotes by a given author, then tests the classifier with known quotes that weren't included on that first page. In a real-world scenario you'd want to train it with a much larger corpus, but in this case it works fairly well.

Here it is learning to distinguish between Richard Dawkins, George W Bush, and Charles Dickens :) (yes, I chose them for word contrast)
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def getwords(text):
    ''' Split text into words and useful punctuation tokens '''
    import string
    text = text.replace("'", '')
    # Remove useless punctuation
    for c in string.punctuation:
        if c not in '.,;?!':
            text = text.replace(c, ' ')
    # Keep useful punctuation
    for c in '.,;?!':
        text = text.replace(c, ' punctuation:%s ' % c)
    text = text.lower()
    return [str(word) for word in text.split() if len(word) > 3]

def getquotes(author):
    ''' Return list of quotes by author from '''
    base_url = ''
    soup = BeautifulSoup(urlopen(base_url % (author[0], author)).read())
    td = soup.find('td', {'align': 'left', 'valign': 'top', 'width': 440})
    quotes = []
    for quote in td.findAll('span', {'class': 'body'}):
        # Loop body is missing from the source; collecting each span's text is assumed
        quotes.append(''.join(quote.findAll(text=True)))
    return quotes

bayes = BayesClassifier()

# Train bayes with quotes by author
for author in ['richard_dawkins', 'george_w_bush', 'charles_dickens']:
    for quote in getquotes(author):
        bayes.train(author, getwords(quote))

# Author labels were lost from the source; restored here to match the training keys
test_data = [
    ['george_w_bush', "Government does not create wealth. The major role for the government is to create an environment where people take risks to expand the job rate in the United States."],
    ['richard_dawkins', "There may be fairies at the bottom of the garden. There is no evidence for it, but you can't prove that there aren't any, so shouldn't we be agnostic with respect to fairies?"],
    ['charles_dickens', "I have known a vast quantity of nonsense talked about bad men not looking you in the face. Don't trust that conventional idea. Dishonesty will stare honesty out of countenance any day in the week, if there is anything to be got by it."],
]

# Test bayes with untrained quotes
for author, quote in test_data:
    guess = bayes.classify(getwords(quote))
    print 'Classified as %s, should be %s' % (guess, author)
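Printing each guess is fine for three quotes, but with a larger held-out set it helps to summarize the results. Below is a minimal sketch of one way to do that. The `evaluate` helper and the `toy_classify` stand-in are hypothetical names, not part of the code above, and the sample data is invented. No measured results for the author classifier are implied.

```python
from collections import defaultdict


def evaluate(classify, labeled_samples):
    """Return (accuracy, per-class precision) for a list of (label, attrs) pairs."""
    correct = 0
    predicted = defaultdict(int)         # Times each class was guessed
    predicted_right = defaultdict(int)   # Times that guess was correct
    for label, attrs in labeled_samples:
        guess = classify(attrs)
        predicted[guess] += 1
        if guess == label:
            correct += 1
            predicted_right[guess] += 1
    accuracy = float(correct) / len(labeled_samples)
    precision = dict((cls, float(predicted_right[cls]) / predicted[cls])
                     for cls in predicted)
    return accuracy, precision


# Stand-in classifier for illustration only: guesses from the first attribute
def toy_classify(attrs):
    return 'a' if attrs[0] == 'x' else 'b'


samples = [('a', ['x']), ('a', ['y']), ('b', ['y']), ('b', ['y'])]
accuracy, precision = evaluate(toy_classify, samples)
```

With the classifier from this post, the same helper could be called as evaluate(bayes.classify, [(author, getwords(quote)) for author, quote in test_data]).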


JustGlowing said…
It's a nice project. What about the precision?
Anonymous said…
It appears to me that the length of a feature (word) has a big impact on the rating (e.g. longer words with the same frequencies among classes get a higher rating). I guess this should not be so in "return result / pow(self.total_count, len(attrs))".
