Friday, November 6, 2015

Naive Bayes Classification

The previous post was about Bayes' theorem, so now we'll talk about a use for it in machine learning: Naive Bayes Classification.

Let's say that you're making a program which given the content of a book, will tell you how likely it is that you will like it. In order to do so, it needs to know the contents of books that you like and books that you don't like. Parsing and understanding the content of a book is crazy hard, so you opt for a simpler strategy: You base the decision on the words used in the book. Books that you like use certain words that you like whilst books that you don't like use words that you don't like.

So you come up with a vocabulary of words (perhaps only a small set of words need to be considered) and you count the number of times each word appears a book you like and in a book you don't like. Let's say you end up with a table like this:

% of books that include word
Word\ClassLike bookHate book

This means that 90% of books that you like contain the word "fairy", which is another way of saying that a book with the word "fairy" has a 90% chance of being a good book.

Now we have a new book and we want to know if we're likely to like it or not. So we check which words it contains and find the following:

The probability that you'll like the book given that it contains these words is found by calculating
P(Like book | magic=yes, fairy=yes, car=no, gun=yes)

Naive Bayes Classification works by first using Baye's theorem on the above conditional probability:
P(magic=yes, fairy=yes, car=no, gun=yes | Like book) P(Like book) / P(magic=yes, fairy=yes, car=no, gun=yes)

Now that the list of AND conditions (has magic and fairy and...) is at the front of the conditional, we can use the Naive Bayes Assumption and assume that the occurrence of each term is independent from all the other terms. If we assume this, we can simplify the probability by decomposing the ANDed conditions into separate probabilities multiplied together as follows:
P(magic=yes|Like book) P(fairy=yes|Like book) P(car=no|Like book) P(gun=yes|Like book) P(Like book) / (P(magic=yes) P(fairy=yes) P(car=no) P(gun=yes))

Now we can use the table at the top to find P(word|Like book), the probability P(Like book) is the percentage of books that you like (from those used to construct the table), and P(word) is the probability that a book contains the given word (from the books used to construct the table). These percentages are easy to obtain.

The problem is that one of our percentages is a zero, P(gun=yes | Like book). Because of this, when it is multiplied by the other probabilities, the result will be zero. The solution is to disallow zero probabilities by assuming that just because a word does not occur in the books you like, doesn't mean that it will never occur. It might be that there is a very tiny probability that it will occur, but that you don't have enough books to find it. In these situations, we need to smooth our probabilities using Laplace Smoothing by adding 1 to every count.

Naive Bayes Classification can be used to find the most likely class a list of yes/no answers belongs to (such as whether the book contains the given words), but this is just the simplest type of Naive Bayes Classification known as Bernoulli Naive Bayes, so called because it assumes a Bernoulli distribution in the probabilities (a Bernoulli distribution is when there are only 2 possible outcomes from an event with one outcome having a probability of "p" and the other "p-1"). It can also be used on a list of frequencies of the terms by using a Multinomial Naive Bayes or on a list of numbers with a decimal point (such as the weight of the book) using Gaussian Naive Bayes.


  1. Hello! I`m having a problem hero. You say that
    >which is another way of saying that a book with the word "fairy" has a 90% chance of being a good book.
    So P(like | fairy = yes) = 90%
    While at the bottom you say that
    >Now we can use the table at the top to find P(word|Like book)
    So my question is : what is the difference between P(like | fairy = yes) and P(fairy = yes | like). I know the definition of conditional probability etc. What i mean is i can`t find the difference in how you distinguish those 2.

    1. You said that you know what they mean in a theoretical sense, so I'll explain their difference in a practical sense.

      The difference is that P(like book | fairy=yes) can be found by counting how many books contain the word "fairy", whereas P(fairy=yes | like book) can be found by counting the number of books that you like. You also need to know how many of the books that you like contain "fairy", but that's a similarity not a difference.

      This might make you think that there is no point in changing one into the other since they can both be easily calculated. The problem is when you don't care only about the word "fairy" but also the word "gun". In this case P(like book | fairy=yes, gun=yes) is found by counting the number of books that contain both "fairy" and "gun", P(like book | fairy=yes, gun=no) is found by counting the number of books that contain "fairy" but not "gun", etc. whereas P(fairy=yes, gun=yes | like book) and all the other varieties are still found by just counting the number of books you like.

      Of course you still need to know how many of those books you like contain or don't contain "fairy" and "gun", but by using the Naive Bayes assumption we can simplify this to knowing how many books that you like contain "fairy" and how many books that you like contain "gun", rather than how many contain both "fairy" and "gun", how many contain "fairy" but not "gun", etc.