Introduction
Welcome back, everyone, to this 4th blog post in our Machine Learning Algorithms Series! Today, we are diving into the Naive Bayes algorithm, a fundamental tool in the machine learning toolkit. In this blog, we will implement the Naive Bayes algorithm from scratch in Python. By the end of this blog, you'll have a clear understanding of how Naive Bayes works, how to implement it and when to use it.
Understanding Naive Bayes
Naive Bayes is a simple but powerful probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions between features. It scales well to large datasets and has proven effective in text categorization, spam detection, and sentiment analysis, among other tasks.
Types of Naive Bayes Algorithms
Gaussian Naive Bayes: Assumes that the continuous values associated with each feature follow a Gaussian (normal) distribution.
Multinomial Naive Bayes: Used for discrete counts. It is widely used for text classification issues in which data can be represented as word vector counts.
Bernoulli Naive Bayes: Uses the Bernoulli Distribution and requires features to have only binary values (i.e., 0 or 1).
In this blog, we will code Gaussian Naive Bayes from scratch. You can also find this code on my GitHub.
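For orientation only, here is a minimal sketch of how these three variants look in scikit-learn; the random data below is purely illustrative, and the rest of this blog builds the Gaussian version by hand:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)              # binary class labels

X_cont = rng.normal(size=(100, 4))            # continuous features -> Gaussian NB
X_counts = rng.integers(0, 5, size=(100, 4))  # non-negative counts -> Multinomial NB
X_bin = (X_counts > 0).astype(int)            # binary features -> Bernoulli NB

print(GaussianNB().fit(X_cont, y).score(X_cont, y))
print(MultinomialNB().fit(X_counts, y).score(X_counts, y))
print(BernoulliNB().fit(X_bin, y).score(X_bin, y))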
Mathematical Intuition
Naive Bayes is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to the event. The theorem is expressed as:
$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$
But where does this theorem come from? Let's find out!
Joint Probability
First, we start with the definition of joint probability for two events A and B:
$$P(A\cap B)=P(A)\cdot P(B\mid A)$$
This equation states that the probability of both A and B occurring is equal to the probability of A occurring times the probability of B occurring given that A has already occurred.
Similarly, we can write:
$$P(A\cap B)=P(B)\cdot P(A\mid B)$$
This states that the probability of both A and B occurring is also equal to the probability of B occurring times the probability of A occurring given that B has occurred.
Since both expressions represent the joint probability, we can set them equal to each other:
$$P(A)\cdot P(B\mid A)=P(B)\cdot P(A\mid B)$$
Rearranging to Bayes' Theorem
$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$
This equation tells us how to update the probability of event A occurring given that event B has occurred, based on the probability of B given A, the prior probability of A, and the prior probability of B.
Where:
P(A|B) is the posterior probability of A given the predictor B.
P(B|A) is the likelihood, which is the probability of predictor B given class A.
P(A) is the prior probability of class A.
P(B) is the prior probability of predictor B.
Prior Probability is the initial probability of an event before any new evidence is considered. It represents our initial belief about the event.
Posterior Probability is the updated probability of an event after considering new evidence. It reflects our revised belief about the event based on the new information.
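As a quick illustration with made-up numbers (chosen only for this example): suppose 1% of all emails are spam, the word "offer" appears in 80% of spam emails, and it appears in 10% of emails overall. Bayes' theorem then updates our belief as follows:

# Hypothetical numbers, purely for illustration
p_spam = 0.01               # prior P(A): an email is spam
p_offer_given_spam = 0.80   # likelihood P(B|A): "offer" appears given spam
p_offer = 0.10              # evidence P(B): "offer" appears in any email

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(p_spam_given_offer)   # approx 0.08

Seeing the word "offer" raises the probability of spam from the 1% prior to an 8% posterior.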
Mathematical Intuition Behind Naive Bayes
The Naive Bayes algorithm uses Bayes' Theorem under the "naive" assumption that all features are conditionally independent given the class label. This simplifies the computation of the posterior probability P(y|X) for a class y and input features X:
$$P(y\mid X)= \frac{P(X\mid y)\,P(y)}{P(X)}$$
Since P(X) is constant for all classes, we can ignore it for the purpose of classification:
$$P(y\mid X)\propto P(X\mid y)\cdot P(y)$$
Using the naive assumption, we can further simplify P(X|y):
$$P(X\mid y)=\prod_{i=1}^{n} P(x_i\mid y)$$
So, the final formula becomes:
$$P(y\mid X)\propto P(y)\cdot\prod_{i=1}^{n} P(x_i\mid y)$$
Using Logarithms for Calculation
To prevent numerical underflow and simplify multiplication of many small probabilities, we take the logarithm of the probabilities:
$$\log P(y\mid X)\propto \log P(y)+\sum_{i=1}^{n} \log P(x_i\mid y)$$
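A tiny sketch of why this matters: multiplying a few hundred small per-feature likelihoods underflows to zero in floating point, while summing their logarithms stays perfectly usable for comparing classes.

import numpy as np

probs = np.full(500, 1e-3)     # 500 small per-feature likelihoods
print(np.prod(probs))          # 0.0 -> numerical underflow
print(np.sum(np.log(probs)))   # approx -3453.9, still fine for comparison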
Probability Density Function (PDF)
In the implementation of Gaussian Naive Bayes, we use the Probability Density Function (PDF) to compute P(x_i|y), which represents the likelihood of feature x_i given class y. The PDF for Gaussian Naive Bayes is defined as:
$$f(x_i\mid y)=\frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
Where:
μ_y is the mean of feature x_i for class y.
σ_y² is the variance of feature x_i for class y.
This function computes the likelihood that x_i belongs to class y, assuming the feature values are normally distributed.
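As a quick sanity check (a standalone sketch, not part of the class we build below), the formula can be evaluated directly with NumPy and compared against scipy.stats.norm.pdf; the values of x, mu, and var here are arbitrary:

import numpy as np
from scipy.stats import norm

x, mu, var = 1.5, 1.0, 0.25   # arbitrary illustrative values

# Gaussian PDF written out from the formula above
pdf_manual = np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
pdf_scipy = norm.pdf(x, loc=mu, scale=np.sqrt(var))

print(pdf_manual, pdf_scipy)  # both approx 0.4839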
Implementing Naive Bayes in Python
Let's build the Naive Bayes algorithm from scratch in Python. For our illustration, we'll use a synthetic dataset.
Importing Libraries and Generating Data
First, we import the necessary libraries and generate a synthetic classification dataset for demonstration purposes:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
Defining the Naive Bayes Class
We create a Naive_Bayes class that will handle the training and prediction processes:
class Naive_Bayes:
    def __init__(self):
        pass
Fit Method
The fit method calculates the mean, variance, and prior probabilities for each class:
def fit(self, X, y):
    self.x_train = X
    self.y_train = y
    self.n_samples, self.n_features = self.x_train.shape  # Store the number of samples and features
    self._classes = np.unique(y)  # Store all the classes in _classes
    self.n_classes = len(self._classes)

    # Initialize mean, variance, and prior probabilities, initially filled with 0.
    self._mean = np.zeros((self.n_classes, self.n_features))
    self._var = np.zeros((self.n_classes, self.n_features))
    self._priors = np.zeros(self.n_classes)

    for i, c in enumerate(self._classes):
        xc = X[y == c]  # Select only the rows whose label equals the current class.
        self._mean[i, :] = xc.mean(axis=0)  # Per-feature mean for the current class.
        self._var[i, :] = xc.var(axis=0)  # Per-feature variance for the current class.
        self._priors[i] = xc.shape[0] / float(self.n_samples)  # Number of samples in the current class divided by the total number of samples.
Predict Method
The predict method returns the predicted class labels for the input data by passing each data point to the _predict function separately and returning the resulting list of predictions.
def predict(self, X):
    y_pred = [self._predict(x) for x in X]  # Classify each sample individually
    return y_pred
Probability Density Function (PDF)
The _pdf function computes the Gaussian (normal) probability density for each feature of a sample, given a class, using that class's stored means and variances.
def _pdf(self, class_index, x):
    mean = self._mean[class_index]
    var = self._var[class_index]
    numerator = np.exp(-((x - mean) ** 2) / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator  # Gaussian PDF, evaluated element-wise for every feature
Predict Single Instance
The _predict function calculates the log prior and log likelihood for each class, combines them into the log posterior probability, and returns the class with the highest posterior.
def _predict(self, x):
    posteriors = []
    for i, c in enumerate(self._classes):
        prior = np.log(self._priors[i])  # log P(y)
        posterior = np.sum(np.log(self._pdf(i, x)))  # sum of log P(x_i | y)
        posterior = posterior + prior
        posteriors.append(posterior)
    return self._classes[np.argmax(posteriors)]  # Class with the highest log posterior
- We use logarithms here because multiplying many small probabilities can underflow to a value too small to represent accurately; taking logs turns the product into a numerically stable sum.
Accuracy Function
The accuracy function calculates the accuracy of the predictions:
def accuracy(y_true, pred):
    acc = np.sum(y_true == pred) / len(y_true)  # Fraction of correct predictions
    return acc
Main Function
Finally, we test our Naive Bayes classifier:
if __name__ == "__main__":
    X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=44)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=44)

    classifier = Naive_Bayes()
    classifier.fit(X_train, y_train)
    preds = classifier.predict(X_test)

    acc = accuracy(y_test, preds)
    print(f"Accuracy: {acc}")
Output
Common Misconceptions about Naive Bayes
Assumption of Feature Independence: Despite its "naive" assumption that features are independent given the class, Naive Bayes often works well even when features are correlated. Correlated features can degrade its performance, but in practice it is more robust to violations of this assumption than is often believed.
Handling of Continuous Data: While Gaussian Naive Bayes assumes features are normally distributed, it can still work reasonably well with non-normal distributions. However, deviations from normality can impact its accuracy.
Zero Probability Issue: In cases where a feature value in the test set hasn't been seen in the training set (resulting in a zero probability), Naive Bayes assigns zero probability to that class. Techniques like Laplace smoothing or adding pseudocounts can mitigate this issue (see the sketch after this list).
Performance on Small Datasets: Naive Bayes might not generalize as well on smaller datasets as it does on larger ones, especially if the data is sparse or highly complex.
Sensitive to Input Data Quality: Naive Bayes can be sensitive to noisy features. Feature selection or preprocessing techniques are often necessary to improve its performance.
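Here is a minimal sketch of the Laplace (add-one) smoothing mentioned in the zero-probability point above, applied to a single categorical feature; the counts and category names are hypothetical:

# Hypothetical counts of a categorical feature's values within one class
counts = {"red": 12, "green": 3, "blue": 0}   # "blue" was never seen for this class

alpha = 1                     # Laplace smoothing parameter (add-one)
total = sum(counts.values())
n_categories = len(counts)

# Without smoothing, P(blue | class) = 0 and zeroes out the whole product
p_unsmoothed = {k: v / total for k, v in counts.items()}

# With smoothing, every category gets a small non-zero probability
p_smoothed = {k: (v + alpha) / (total + alpha * n_categories) for k, v in counts.items()}

print(p_unsmoothed["blue"], p_smoothed["blue"])   # 0.0 vs approx 0.056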
When to Apply Naive Bayes: Key Points to Consider
Text Classification: It excels in tasks involving text data, such as spam filtering, sentiment analysis, and categorization, where features (e.g., word frequencies) can be treated independently (a short sketch follows this list).
Large Feature Spaces: It performs well when dealing with high-dimensional feature spaces, especially when the number of features is large compared to the number of instances.
Real-time Prediction: Due to its computational simplicity and efficiency, Naive Bayes is suitable for applications requiring real-time predictions, such as recommendation systems and chatbots.
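A tiny, self-contained sketch of the text-classification use case flagged above, using scikit-learn's CountVectorizer and MultinomialNB on a made-up toy corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus, purely for illustration
docs = ["win a free prize now", "meeting at noon tomorrow",
        "free offer claim prize", "project update and meeting notes"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)   # word-count features

model = MultinomialNB()
model.fit(X_counts, labels)

print(model.predict(vectorizer.transform(["claim your free prize"])))  # likely [1]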
Advantages of Naive Bayes
Simple and Easy to Implement: It's straightforward to understand and implement, making it ideal for quick deployment.
Efficient Training: Training Naive Bayes models is typically fast because it involves computing simple probabilities.
Handles Large Feature Spaces: It performs well with high-dimensional data, making it suitable for datasets with many features.
Effective with Categorical Data: It naturally handles categorical features and is robust to noise from irrelevant features.
Good Performance with Small Datasets: Naive Bayes can generalize well even with limited training data, making it useful when dataset sizes are small.
Disadvantages of Naive Bayes
Strong Assumption of Feature Independence: The "naive" assumption that features are independent given the class label may not hold true in real-world datasets, which can impact its accuracy.
Sensitive to Skewed Data: It can be sensitive to imbalanced datasets or data distributions that do not match the assumed distribution (e.g., Gaussian for Gaussian Naive Bayes).
Zero Probability Issue: When a categorical variable has a category in the test set that was not observed in the training set, Naive Bayes assigns it zero probability. Techniques like Laplace smoothing can help mitigate this issue.
Needs Sufficient Data: While Naive Bayes can perform well with small datasets, it requires sufficient data to estimate probabilities accurately, especially in cases where features are interdependent.
Conclusion
I hope this blog post has been helpful and encourages you to explore and experiment further with the Naive Bayes algorithm. If you find this blog useful, please leave a like and a follow. You can also check out my other blogs on machine learning algorithms. I have been posting these blogs in a series, and I hope you like them.