- Implement the naive normal Bayesian classifier
- Generate a plot, depicting the classifier in action
The Bayesian approach to the classification problem is quite different from what we've studied before. This approach is based on an assumption that if the probability distribution of classes is known, then we can easily derive an optimal classification algorithm.
However, most commonly the probability distribution is not known and has to be restored based on the input data.
Abstractly, naive Bayes is a conditional probability model: given an instance to be classified - some vector consisting of n features, it assigns to this instance conditional probabilities for each of possible classes (i.e. how likely is the class to be C_k given vector x?).
The problem of this formulation is that it works slow on large datasets. To mitigate that, we can decompose the probability above using Bayes' theorem:
We mostly care about the numerator only, since denominator doesn't depend on nether vector x nor class C_k and therefore is effectively a constant. The numerator is equivalent to the joint probability model .
This can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:
The "naive" assumption means that we consider all features in x to be independent. Under this assumption, .
This means that under the above independence assumptions, the conditional distribution over the class variable C is:
To make a classifier, we just have to pick a class with max probability:
Here's the code:
getPriorProbabilities <- function (dset, column) {
dsetLength <- dim(dset)[1]
return (table(dset[, column]) / dsetLength)
}
likelihood <- function(feature, expectation, deviation) {
return ((1 / (deviation * sqrt(2 * pi))) * exp(-((feature - expectation) ^ 2) / (2 * deviation ^ 2)))
}
# returns a probability of a given point to be in a given class
getProbability <- function(dset, classesColumn, className, point) {
classesCount <- 0
featuresCount <- length(point)
dsetLength <- dim(dset)[1]
prior <- getPriorProbabilities(dset, classesColumn)[className]
lambda <- 1 # rep(1, classesCount)
expectations <- array(dim = featuresCount)
for (i in 1:featuresCount) {
expectations[i] <- mean(dset[which(dset[, classesColumn] == className), i])
}
matr <- matrix(0, featuresCount, featuresCount)
for (i in 1:featuresCount) {
matr[i, ] <- sum((dset[which(dset[, classesColumn] == className), i] - expectations[i]) ^ 2) / dsetLength
}
deviations <- array(dim = featuresCount)
for (i in 1:featuresCount) {
deviations[i] <- matr[i, i]
}
logOfLikelihood <- sum(log(likelihood(point, expectations, deviations)))
return (log(lambda * prior) + logOfLikelihood)
}
classify <- function(point) {
classes <- unique(dset[, 3])
classesCount <- length(classes)
scores <- array(dim = classesCount)
for (i in 1:classesCount) {
scores[i] <- getProbability(dset, 3, classes[i], point)
}
return(classes[which.max(scores)])
}
Now let's build a classification map to see how it all works. We'll use a dataset of two classes generated by multivariate normal distribution with params listed below:
mu1 <- c(0, 0)
mu2 <- c(4, 4)
sig1 <- matrix(c(2, 0.9, 0.9, 2), 2, 2)
sig2 <- matrix(c(0.5, 0, 0, 2), 2, 2)
Errors on this dataset: 16 (3,2%).