Introduction
Multinomial Naive Bayes is one of the most effective and widely used algorithms for text classification. It’s particularly well suited to problems with discrete features like word counts, making it ideal for document classification, spam detection, and sentiment analysis.
While more powerful models exist for text classification, Multinomial Naive Bayes is simple to set up and delivers reliably high accuracy, making it a great baseline against which to compare other text classifiers.
Important: This is supervised learning, which requires training a model on labeled data before it can classify new documents.
Outline:
- Introduction
- Statistics
- Practical Walkthrough
- Why “Multinomial”?
- Common Applications
- Advanced Extensions
- Future Work
- Summary
Core Concept
The algorithm is based on Bayes’ theorem with a “naive” assumption that features (words) are conditionally independent given the class. Despite this strong independence assumption, it performs well in practice for text classification. We can look at this as a way of relaxing the problem, making the solution more generalizable.
Statistics
Let’s take a peek at the mathematics behind the algorithm.
Joint Probability
These two notations are used interchangeably:

$$P(A \cap B) = P(A, B)$$

This yields the probability that both A and B happen together.
Conditional Probability
If A and B are any events and P(B) ≠ 0, the conditional probability of A given B is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This tells us the probability that event A happens, given that we already know event B has happened. We calculate it by taking the probability that both A and B happen together, and dividing by the probability that B happens at all. It’s like asking “out of all the times B occurs, what fraction of those times does A also occur?”
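A quick illustration with a fair six-sided die: let A be “roll a 2” and B be “roll an even number”. Then:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}$$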
Independence of Events
Events A and B are independent if:

$$P(A \cap B) = P(A)\,P(B)$$

This also means:

$$P(A \mid B) = P(A) \quad \text{and} \quad P(B \mid A) = P(B)$$
Two events are independent if knowing about one doesn’t change the probability of the other. For example, flipping a coin and rolling a die are independent - the coin result doesn’t affect the die result. When events are independent, the probability of both happening is just the product of their individual probabilities.
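Using those same examples, a small worked case: the chance of getting heads and rolling a six is just the product of the two probabilities:

$$P(\text{heads} \cap \text{six}) = P(\text{heads}) \cdot P(\text{six}) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$$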
Conditional Independence
A and B are conditionally independent given that we know C:

$$P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$$
Naive Bayes Assumption
Key Assumption: Given the class $c$, all features are conditionally independent:

$$P(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$
The “naive” assumption is that once we know the class (like whether an email is spam or not), all the features (like specific words in the email) become independent of each other. This is often not true in reality, but it simplifies calculations enormously and often works well in practice.
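As an illustrative spam example, the assumption lets the joint probability of the words factor into per-word terms:

$$P(\text{free}, \text{money}, \text{click} \mid \text{spam}) = P(\text{free} \mid \text{spam}) \cdot P(\text{money} \mid \text{spam}) \cdot P(\text{click} \mid \text{spam})$$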
Bayes’ Theorem
If B₁, B₂, …, Bₙ are mutually exclusive events (only one can happen) and one of them must occur, then:

$$P(B_r \mid A) = \frac{P(A \mid B_r)\,P(B_r)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$$
Bayes’ theorem helps us “flip” conditional probabilities. If we know how likely A is given each possible cause B₁, B₂, etc., and we know how likely each cause is, then when we observe A happening, we can figure out which cause was most likely responsible.
The numerator calculates the probability that cause Bᵣ led to observing A. The denominator sums up all the ways A could have happened (through any of the possible causes), creating a total probability. So we’re asking: “Given that A happened, what’s the probability it was caused by Bᵣ?”
We know $P(A \mid B_i)$ for each of our possible causes $B_i$. In the spam setting:
- We know: P(word|spam) and P(word|not_spam) from training data
- We want: P(spam|document) when we see a new document
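As a tiny worked example with made-up numbers: suppose $P(\text{spam}) = 0.6$, $P(\text{free} \mid \text{spam}) = 0.2$, and $P(\text{free} \mid \text{not\_spam}) = 0.05$. Seeing the word “free”, Bayes’ theorem gives:

$$P(\text{spam} \mid \text{free}) = \frac{0.2 \cdot 0.6}{0.2 \cdot 0.6 + 0.05 \cdot 0.4} = \frac{0.12}{0.14} \approx 0.86$$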
Multinomial Naive Bayes
For a document with features x₁, x₂, …, xₙ, the probability of class c is:

$$P(c \mid x_1, \ldots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \ldots, x_n)}$$

Since the denominator is the same for all classes, we can classify by finding:

$$\hat{c} = \underset{c}{\arg\max} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The arg max evaluates this score for each class $c$ and returns the class with the highest value, i.e. the most probable class given the features.
Parameter Estimation
- Prior probability: $P(c) = \dfrac{N_c}{N}$, the fraction of training documents labeled with class $c$.
- Likelihood: $P(w_i \mid c) = \dfrac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \text{count}(w, c) + |V|}$, the smoothed frequency of word $w_i$ among all the words in documents of class $c$.
Multinomial Naive Bayes is used for text classification where we count how often words appear. For each possible class (like “spam” or “not spam”), we:
- Calculate the prior probability - how often does this class appear in our training data?
- For each word in the document, calculate how likely that word is given the class
- Multiply all these probabilities together
- Pick the class with the highest final probability
The “+1” in the likelihood calculation is our smoothing - it prevents us from getting zero probabilities for words we haven’t seen before in a particular class.
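As a quick hypothetical: if “free” appears 4 times across all spam documents, the spam documents contain 50 words in total, and the vocabulary has 42 unique words, the smoothed likelihood is:

$$P(\text{free} \mid \text{spam}) = \frac{4 + 1}{50 + 1 \cdot 42} = \frac{5}{92} \approx 0.054$$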
Data Science Terminology
Term | Definition |
---|---|
Feature | A numerical representation of some component of the data. Common data-to-feature techniques include one-hot encoding, label encoding, binning, and normalization. In text classification, features are typically words, word counts, tokens, or derived attributes like term frequency-inverse document frequency (TF-IDF) scores that the algorithm uses to make predictions. |
Vocab | Short for vocabulary - the complete set of unique words/tokens that appear in the training dataset. It defines all possible features the model can recognize. Typically refers to entire dataset vocab, but can contextually mean a specific doc’s vocab. |
Word | A meaningful unit of language, often separated by spaces in text. Examples: “hello”, “world” |
Token | The basic unit after text preprocessing - what remains after tokenization. Can be words, subwords, punctuation, or other linguistic units depending on the tokenization strategy. |
Document | A single text sample or instance in the dataset. Could be an email, tweet, article, review, or any coherent piece of text that needs to be classified. |
Label | The correct answer or target value assigned to each document during training. Also called ground truth. Examples: “spam”, “positive sentiment”, “sports category”. Contextually, it can also refer to a predicted label, denoted as a variable with a hat, e.g. $\hat{y}$. |
Class | A category or group that documents can be classified into. The set of all possible labels. Examples: {spam, ham}, {positive, negative, neutral}, {sports, politics, tech} |
Practical Walkthrough
First implemented manually in Python, then using scikit-learn’s MultinomialNB implementation.
Problem: Classify an email as spam or not spam based on its text body.
Manual Implementation
# spam_classifier.py
# uv init; uv add scikit-learn numpy; uv run python spam_classifier.py
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# --- 1. Pre-classified data and preparation
# Data
documents = [
'click here to enter the contest for free, you could win $1,000,000 in prizes',
'buy one, get one free for a limited time at our store',
'Hello, I would like my leaking roof to be repaired.', # not_spam
'free money click now limited time offer',
'meeting scheduled for tomorrow at 3pm in conference room', # not_spam
'congratulations you won free cash prizes click here',
]
labels = [1, 1, 0, 1, 0, 1] # 1 for spam, 0 for not_spam
# Convert to feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents) # 6x42 matrix of int
y = np.array(labels) # 6x1 matrix of int
print("Feature matrix shape:", X.shape)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Vocabulary length:", len(vectorizer.get_feature_names_out()))
print("1st document:", X[0])
# Feature matrix shape: (6, 42)
# Vocabulary: ['000' '3pm' 'at' 'be' 'buy' 'cash' 'click' 'conference' 'congratulations'
# 'contest' 'could' 'enter' 'for' 'free' 'get' 'hello' 'here' 'in'
# 'leaking' 'like' 'limited' 'meeting' 'money' 'my' 'now' 'offer' 'one'
# 'our' 'prizes' 'repaired' 'roof' 'room' 'scheduled' 'store' 'the' 'time'
# 'to' 'tomorrow' 'win' 'won' 'would' 'you']
# Vocabulary length: 42
# 1st document: <Compressed Sparse Row sparse matrix of dtype 'int64'
# with 14 stored elements and shape (1, 42)>
# Coords Values
# (0, 6) 1
# (0, 16) 1
# (0, 36) 1
# (0, 11) 1
# (0, 34) 1
# (0, 9) 1
# (0, 12) 1
# (0, 13) 1
# (0, 41) 1
# (0, 10) 1
# (0, 38) 1
# (0, 0) 2
# (0, 17) 1
# (0, 28) 1
# --- 2. Priors
def calculate_priors(y: 'N_d x 1'):
unique, counts = np.unique(y, return_counts=True)
priors = counts / len(y) # 2x1 matrix of float since we have 2 classes
return priors, unique
priors, unique_classes = calculate_priors(y)
class_names = {0: 'not_spam', 1: 'spam'}
for class_int, prior in zip(unique_classes, priors):
print(f"P({class_names[class_int]}) = {prior:.3f}")
# P(not_spam) = 0.333
# P(spam) = 0.667
# --- 3. Likelihoods
def calculate_likelihood(X: 'N_d x N_feats', y: 'N_d x 1', vectorizer, alpha=1):
vocab = vectorizer.get_feature_names_out()
vocab_size = len(vocab)
classes = np.unique(y)
likelihoods = {}
for class_int in classes:
# Get documents for this class
class_mask = y == class_int
X_class = X[class_mask]
# Sum word counts across all documents in this class
word_counts_class = np.array(X_class.sum(axis=0)).flatten()
total_words_in_class = word_counts_class.sum()
likelihoods[class_int] = {}
for i, word in enumerate(vocab):
count = word_counts_class[i]
# P(word|class) = (count + alpha) / (total_words + alpha * vocab_size)
likelihood = (count + alpha) / (total_words_in_class + alpha * vocab_size)
likelihoods[class_int][word] = likelihood
return likelihoods
# Calculate likelihoods using our X and y
likelihoods = calculate_likelihood(X, y, vectorizer)
print(f"P('free'|spam) = {likelihoods[1]['free']:.3f}")
print(f"P('meeting'|not_spam) = {likelihoods[0]['meeting']:.3f}")
# P('free'|spam) = 0.060
# P('meeting'|not_spam) = 0.033
# --- 4. Tying it together to predict
def predict_document(document: str, priors, likelihoods, vectorizer, unique_classes):
# Tokenize document
X_doc = vectorizer.transform([document])
vocab = vectorizer.get_feature_names_out()
word_counts = X_doc.toarray()[0]
class_scores = {}
for class_int in unique_classes:
# Start with prior probability, log to avoid underflow
score = np.log(priors[class_int])
# Multiply likelihoods for each word
for i, count in enumerate(word_counts):
if count > 0: # Only consider words that appear
word = vocab[i]
if word in likelihoods[class_int]:
# Add log probability count times
score += count * np.log(likelihoods[class_int][word])
class_scores[class_int] = score
# Return class with highest score
predicted_class = max(class_scores, key=class_scores.get)
return predicted_class, class_scores
# Test prediction
test_doc = "free money click now"
prediction, scores = predict_document(test_doc, priors, likelihoods, vectorizer, unique_classes)
print(f"\nDocument: '{test_doc}'")
print(f"Predicted class: {class_names[prediction]} ({prediction})")
for class_int, score in scores.items():
print(f"Log score for {class_names[class_int]}: {score:.3f}")
# Document: 'free money click now'
# Predicted class: spam (1)
# Log score for not_spam: -17.476
# Log score for spam: -13.699
The docs and features (vocab words) are fit inside `X`. Labels are stored in `y`.
Note: Each document is represented as a row, and each vocab word is represented as a column. As the name `CountVectorizer` suggests, it counts the occurrences of each vocab word/token in each document. Printing it, we can see `000` occurs twice in the first doc, which lines up with expectations.
Using libraries to go from clean raw data to features is important: they handle standard operating procedures such as removing stop words, stemming, and tokenizing in a performant manner.
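As a minimal sketch of that kind of preprocessing with scikit-learn alone (assuming its built-in English stop-word list is sufficient for the task; stemming would need an extra library such as NLTK and is not shown):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'click here to enter the contest for free, you could win $1,000,000 in prizes',
    'meeting scheduled for tomorrow at 3pm in conference room',
]

# CountVectorizer lowercases and tokenizes for us; stop_words='english'
# drops common filler words like 'the', 'for', 'to' before counting.
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```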
Regarding `P('free'|spam) = 0.060` and `P('meeting'|not_spam) = 0.033`: these values appear small, but each is just one factor in the final probability calculation.
After writing and annotating this throughout, I realized that next time I can write a notebook and then compile it to HTML instead.
Big-O
`O(c * d)` runtime, where c is the number of classes and d is the average length of a document. In the creation of the data-feature matrix X, it’s assumed longer documents take longer to process than the `O(c * f)` predict and likelihood functions, where f is the vocabulary size.
`O(c * f)` space for the likelihood table.
Using MNB Implemented in Scikit-Learn
In the real world, we’ll use a library to implement the algo for us.
# spam_clf_scikit_learn.py
# uv init; uv add scikit-learn numpy; uv run python spam_clf_scikit_learn.py
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Data
documents = [
'click here to enter the contest for free, you could win $1,000,000 in prizes',
'buy one, get one free for a limited time at our store',
'Hello, I would like my leaking roof to be repaired.', # not_spam
'free money click now limited time offer',
'meeting scheduled for tomorrow at 3pm in conference room', # not_spam
'congratulations you won free cash prizes click here',
]
labels = [1, 1, 0, 1, 0, 1] # 1 for spam, 0 for not_spam
# Convert to feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents) # 6x42 matrix of int
y = np.array(labels) # 6x1 matrix of int
# Train model
nb_model = MultinomialNB(alpha=1.0) # alpha=1.0 is Laplace smoothing
nb_model.fit(X, y)
# Access learned parameters
print("Class priors:", nb_model.class_log_prior_)
print("Feature log probabilities shape:", nb_model.feature_log_prob_.shape)
# Predict new documents
new_docs = ['free money now', 'meeting tomorrow']
X_new = vectorizer.transform(new_docs)
predictions = nb_model.predict(X_new)
probabilities = nb_model.predict_proba(X_new)
for doc, pred, prob in zip(new_docs, predictions, probabilities):
print(f"'{doc}' -> {pred} (prob: {prob.max():.3f})")
# Class priors: [-1.09861229 -0.40546511]
# Feature log probabilities shape: (2, 42)
# 'free money now' -> 1 (prob: 0.938)
# 'meeting tomorrow' -> 0 (prob: 0.793)
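The six-document dataset above is far too small for a meaningful evaluation, but as a rough sketch of how a held-out test split might be scored on a larger labeled dataset (appending to spam_clf_scikit_learn.py so `documents` and `labels` are already defined; `train_test_split` and `accuracy_score` are standard scikit-learn utilities):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Split the raw documents first so the vectorizer only learns the training vocab
docs_train, docs_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.33, random_state=42
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)  # fit vocabulary on training docs only
X_test = vectorizer.transform(docs_test)        # map test docs onto that vocabulary

model = MultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```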
Why “Multinomial”?
The “multinomial” part refers to how the algorithm models document generation using a multinomial probability distribution - like rolling a loaded vocabulary die multiple times.
Multinomial vs Other Naive Bayes Variants
Variant | Features | Best For | Example |
---|---|---|---|
Multinomial NB | Word counts | Longer documents | “excellent” appears 3 times → stronger signal |
Bernoulli NB | Word presence/absence | Short documents | “excellent” present or not → binary signal |
Gaussian NB | Continuous values | Numerical data | Temperature, age, income measurements |
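A rough, hypothetical sketch of the count-vs-presence difference on a tiny dataset; `BernoulliNB` thresholds counts to presence/absence via its `binarize` parameter, while `MultinomialNB` keeps the raw counts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = [
    'free money click now limited time offer',              # spam
    'meeting scheduled for tomorrow at 3pm',                 # not spam
    'congratulations you won free cash prizes click here',   # spam
    'Hello, I would like my leaking roof to be repaired.',   # not spam
]
y = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)

mnb = MultinomialNB().fit(X, y)            # models word counts
bnb = BernoulliNB(binarize=0.0).fit(X, y)  # models word presence/absence

test = vec.transform(['free cash prizes click now'])
print("MultinomialNB:", mnb.predict(test))
print("BernoulliNB:  ", bnb.predict(test))
```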
Common Applications
Application | Description |
---|---|
Spam Detection | Filter unwanted emails |
Sentiment Analysis | Classify positive/negative opinions |
Topic Classification | Categorize documents by subject |
Language Detection | Identify document language |
News Categorization | Sort articles by category |
Advanced Extensions
Complement Naive Bayes: Particularly effective for imbalanced datasets; it estimates the probability that a document does NOT belong to a class. A minimal sketch follows these extensions.
Multivariate Bernoulli Naive Bayes: Uses binary features (word presence/absence) instead of counts and sometimes performs better for short documents.
Hierarchical Classification: Apply multinomial NB in a tree structure for hierarchical categorization tasks.
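A minimal sketch of swapping in scikit-learn’s `ComplementNB` on a made-up imbalanced dataset; it exposes the same fit/predict interface as `MultinomialNB`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

# Imbalanced toy data: four "not spam" documents and a single "spam" one
docs = [
    'project update attached for review',
    'lunch at noon tomorrow?',
    'quarterly report draft is ready',
    'reminder: dentist appointment on friday',
    'win free cash prizes click now',  # spam
]
y = [0, 0, 0, 0, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# ComplementNB estimates statistics from the complement of each class,
# which tends to be more stable when class counts are heavily skewed.
clf = ComplementNB(alpha=1.0).fit(X, y)
print(clf.predict(vec.transform(['free prizes click here'])))
```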
Modern Context: While deep learning models often achieve higher accuracy, Multinomial Naive Bayes remains valuable for its simplicity, speed, and effectiveness as a baseline. It’s often used in production systems where interpretability and low latency are priorities.
Future Work
- Run this on my scraped datasets after some supervised manual labeling
- articles, emails, others
- Deeper dive into practical examples of where and why accuracy sucks
- Analysis on understanding if and why MNB is the wrong choice for your problem
- Alternative solutions for more bespoke use cases
Summary
You learn Bayes’ Theorem in one of your first stats courses, and you can get started with some simple code using Python and scikit-learn. The algorithm has become the go-to baseline for text classification.
Ideal Scenarios
- Text classification with discrete word counts
- When you need fast training and prediction
- Limited computational resources
- Baseline model for comparison
- When interpretability is important
- Multi-class classification problems
- Very small to moderately sized datasets
When to Avoid MNB
- Word order and context are crucial
- Feature interactions are important
- Phrases need weight in prediction
- Semantic meaning is important