Naive Bayes (and author detection)

I've been playing around with various classification algorithms lately, so I wrote a really simplified discrete Naive Bayes classifier in Python. It does little in the way of sample correction beyond substituting a small constant for unseen attributes, since simplicity was the goal, but it still works quite well.

from operator import itemgetter
from collections import defaultdict

class BayesClassifier:

    def __init__(self):
        self.total_count = 0 # Observations of individual attributes
        self.class_count = defaultdict(int) # Observations of cls
        self.attrs_count = defaultdict(int) # Observations of (cls, attrs)
        self.correction = 0.0001 # Prevent multiplication by 0.0

    def train(self, cls, attrs):
        ''' Add observation of 'attrs' as being an instance of 'cls' '''
        self.class_count[cls] += 1
        for attr in attrs:
            self.attrs_count[(cls, attr)] += 1
            self.total_count += 1

    def rate(self, cls, attrs):
        ''' Return probability rating of 'attrs' being an instance of 'cls' '''
        result = float(self.class_count[cls]) / self.total_count
        for attr in attrs:
            result *= self.attrs_count.get((cls, attr), self.correction)
        return result / pow(self.total_count, len(attrs))

    def classify(self, attrs):
        ''' Return most likely class that 'attrs' belongs to '''
        rated_classes = [(self.rate(cls, attrs), cls) for cls in self.class_count]
        rated_classes.sort(key=itemgetter(0), reverse=True)
        return rated_classes[0][1]
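To make the scoring concrete, here is a small sketch that trains the classifier on four made-up messages and rates a new one. The training data and the 'message' list are invented purely for illustration, and the class is repeated in condensed form only so the snippet runs on its own. It assumes total_count is incremented once per attribute, as its comment suggests; the ranking comes out the same either way, because every class is divided by the same total.

```python
from operator import itemgetter
from collections import defaultdict

class BayesClassifier:
    ''' Condensed version of the classifier so this snippet is standalone '''

    def __init__(self):
        self.total_count = 0
        self.class_count = defaultdict(int)
        self.attrs_count = defaultdict(int)
        self.correction = 0.0001

    def train(self, cls, attrs):
        self.class_count[cls] += 1
        for attr in attrs:
            self.attrs_count[(cls, attr)] += 1
            self.total_count += 1  # Assumption: counted once per attribute

    def rate(self, cls, attrs):
        result = float(self.class_count[cls]) / self.total_count
        for attr in attrs:
            result *= self.attrs_count.get((cls, attr), self.correction)
        return result / pow(self.total_count, len(attrs))

    def classify(self, attrs):
        rated_classes = [(self.rate(cls, attrs), cls) for cls in self.class_count]
        rated_classes.sort(key=itemgetter(0), reverse=True)
        return rated_classes[0][1]

# Hypothetical toy training data, not taken from any real corpus
bayes = BayesClassifier()
bayes.train('spam', ['cheap', 'pills', 'online'])
bayes.train('spam', ['cheap', 'watches'])
bayes.train('ham', ['lunch', 'tomorrow'])
bayes.train('ham', ['meeting', 'notes', 'online'])

message = ['cheap', 'online']
spam_score = bayes.rate('spam', message)
ham_score = bayes.rate('ham', message)
label = bayes.classify(message)
print(label, spam_score, ham_score)
```

Because 'cheap' was never seen under 'ham', its count falls back to the 0.0001 correction, so the ham score ends up several orders of magnitude below the spam score and the message is classified as spam.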


Playing around with it, I used various spam/not-spam training sets and various categorical training sets. Attributes don't have to be plain "bag of words" lists: the user can label them by tagging the values in the attrs list, such as ['weekday:wed', 'weather:sunny', 'humidity:high']. Likewise, positional attributes can easily be tagged with their index: ['0:this', '1:works', '2:well']. It's trivial to write a function that turns lists, objects, dicts, or data models into such tagged attribute lists.
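As a rough sketch of such a function: the helper below (tag_attrs is a hypothetical name, not part of the classifier) tags dict values with their key and sequence items with their index. It assumes Python 3.7 or later, where dicts keep insertion order.

```python
def tag_attrs(data):
    ''' Turn a dict or a sequence into a list of 'tag:value' strings.

    Hypothetical helper: dict keys become the tags, while items of any
    other iterable are tagged with their position.
    '''
    if isinstance(data, dict):
        pairs = data.items()
    else:
        pairs = enumerate(data)
    return ['%s:%s' % (tag, value) for tag, value in pairs]

labeled = tag_attrs({'weekday': 'wed', 'weather': 'sunny', 'humidity': 'high'})
positional = tag_attrs(['this', 'works', 'well'])
print(labeled)     # ['weekday:wed', 'weather:sunny', 'humidity:high']
print(positional)  # ['0:this', '1:works', '2:well']
```

For a plain object, passing vars(obj) to the same helper tags each instance attribute with its name.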

But playing with the algorithm in its "bag of words" form, I thought it would be neat to see how it does with authorship detection. Using an approach similar to spam/not-spam, I trained it to classify quotes by author based on word and punctuation probabilities. In this example it parses brainyquote.com to train on the first page of quotes by a given author, then tests the classifier with known quotes that weren't included on that first page. In a real-world scenario you'd want to train it on a much larger corpus, but in this case it works fairly well.

Here it is learning to distinguish between Richard Dawkins, George W. Bush, and Charles Dickens :) (yes, I chose them for word contrast)

import string
from urllib.request import urlopen

# Requires the third-party beautifulsoup4 package
from bs4 import BeautifulSoup

def getwords(text):
    ''' Split text into words and useful punctuation tokens '''
    text = text.replace("'", '')
    # Remove useless punctuation
    for c in string.punctuation:
        if c not in '.,;?!':
            text = text.replace(c, ' ')
    # Keep useful punctuation as tagged tokens
    for c in '.,;?!':
        text = text.replace(c, ' punctuation:%s ' % c)
    text = text.lower()
    # Drop words of three characters or fewer
    return [word for word in text.split() if len(word) > 3]

def getquotes(author):
    ''' Return list of quotes by author from brainyquote.com '''
    base_url = 'http://www.brainyquote.com/quotes/authors/%s/%s.html'
    html = urlopen(base_url % (author[0], author)).read()
    soup = BeautifulSoup(html, 'html.parser')
    # These selectors match the page layout at the time of writing
    td = soup.find('td', {'align': 'left', 'valign': 'top', 'width': '440'})
    quotes = []
    for quote in td.find_all('span', {'class': 'body'}):
        if quote.string:  # Skip spans that don't hold a single text node
            quotes.append(quote.string)
    return quotes

bayes = BayesClassifier()

# Train bayes with quotes by author
for author in ['richard_dawkins', 'george_w_bush', 'charles_dickens']:
    for quote in getquotes(author):
        bayes.train(author, getwords(quote))

test_data = [
    ['Bush',
     "Government does not create wealth. The major role for the government is to create an environment where people take risks to expand the job rate in the United States."],
    ['Dawkins',
     "There may be fairies at the bottom of the garden. There is no evidence for it, but you can't prove that there aren't any, so shouldn't we be agnostic with respect to fairies?"],
    ['Dickens',
     "I have known a vast quantity of nonsense talked about bad men not looking you in the face. Don't trust that conventional idea. Dishonesty will stare honesty out of countenance any day in the week, if there is anything to be got by it."],
]

# Test bayes with untrained quotes
for author, quote in test_data:
    guess = bayes.classify(getwords(quote))
    print('Classified as %s, should be %s' % (guess, author))
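To see what the classifier actually receives, here is a standalone sketch of the tokenizer run on two fragments of the test quotes. It repeats getwords so it can run without the scraping code, and spells the punctuation tag as 'punctuation:'; the variable names 'statement' and 'question' are just for illustration.

```python
import string

def getwords(text):
    ''' Split text into words and useful punctuation tokens '''
    text = text.replace("'", '')
    # Replace punctuation we don't care about with spaces
    for c in string.punctuation:
        if c not in '.,;?!':
            text = text.replace(c, ' ')
    # Turn the punctuation we keep into tagged tokens
    for c in '.,;?!':
        text = text.replace(c, ' punctuation:%s ' % c)
    # Lowercase, split on whitespace, drop words of three characters or fewer
    return [word for word in text.lower().split() if len(word) > 3]

statement = getwords("Government does not create wealth.")
question = getwords("so shouldn't we be agnostic with respect to fairies?")
print(statement)  # ['government', 'does', 'create', 'wealth', 'punctuation:.']
print(question)   # ['shouldnt', 'agnostic', 'with', 'respect', 'fairies', 'punctuation:?']
```

Short words such as 'not', 'so' and 'we' are dropped by the length filter, while the sentence-ending punctuation survives as its own attribute, which is what lets punctuation habits count toward an author's score.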
