
Multinomial Naive Bayes for Text Classification

Published at 01:58 AM

Author: jtara1

Introduction

Multinomial Naive Bayes is one of the most effective and widely-used algorithms for text classification tasks. It’s particularly well-suited for problems involving discrete features like word counts, making it ideal for document classification, spam detection, and sentiment analysis.

While there are better solutions for text classification, Multinomial Naive Bayes is simple to set up and provides reliably high classification accuracy, making it a great baseline against which to compare other text classifiers.

Important: This is supervised learning which requires training a model on labeled data before it can classify new documents.

Core Concept

The algorithm is based on Bayes’ theorem with a “naive” assumption that features (words) are conditionally independent given the class. Despite this strong independence assumption, it performs well in practice for text classification. We can look at this as a way of relaxing the problem, making the solution more generalizable.

Statistics

Let’s take a peek at just the mathematics behind the algorithm.

Joint Probability

These two notations are used interchangeably.

P(A \cap B) = P(A, B)

This yields the probability that both A and B happen together.

Conditional Probability

If A and B are any events and P(B) ≠ 0, the conditional probability of A given B is:

P(A|B) = \frac{P(A \cap B)}{P(B)}

This tells us the probability that event A happens, given that we already know event B has happened. We calculate it by taking the probability that both A and B happen together, and dividing by the probability that B happens at all. It’s like asking “out of all the times B occurs, what fraction of those times does A also occur?”
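
As a quick worked example with a fair six-sided die, let A be "the roll is a 2" and B be "the roll is even":

P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}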

Independence of Events

Events A and B are independent if: P(A|B) = P(A) \text{ and } P(B|A) = P(B)

This also means: P(A \cap B) = P(A) \times P(B)

Two events are independent if knowing about one doesn’t change the probability of the other. For example, flipping a coin and rolling a die are independent - the coin result doesn’t affect the die result. When events are independent, the probability of both happening is just the product of their individual probabilities.

Conditional Independence

P(A \cap B | C) = P(A|C) \times P(B|C)

A and B are independent given that we know C.

Naive Bayes Assumption

Key Assumption: Given the class c, all features are conditionally independent:

P(x_1, x_2, ..., x_n | c) = P(x_1|c) \times P(x_2|c) \times ... \times P(x_n|c)

The “naive” assumption is that once we know the class (like whether an email is spam or not), all the features (like specific words in the email) become independent of each other. This is often not true in reality, but it simplifies calculations enormously and often works well in practice.
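
For instance, for a two-word document the assumption lets the class-conditional probability factor into per-word probabilities:

P(\text{free}, \text{money} | \text{spam}) = P(\text{free} | \text{spam}) \times P(\text{money} | \text{spam})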

Bayes’ Theorem

If B₁, B₂, …, Bₙ are mutually exclusive events (only one can happen) and one of them must occur, then:

P(B_r|A) = \frac{P(B_r) \times P(A|B_r)}{\sum_{i=1}^{n} P(B_i) \times P(A|B_i)} \quad \text{for } r = 1, 2, \ldots, n

Bayes’ theorem helps us “flip” conditional probabilities. If we know how likely A is given each possible cause B₁, B₂, etc., and we know how likely each cause is, then when we observe A happening, we can figure out which cause was most likely responsible.

The numerator calculates the probability that cause Bᵣ led to observing A. The denominator sums up all the ways A could have happened (through any of the possible causes), creating a total probability. So we’re asking: “Given that A happened, what’s the probability it was caused by Bᵣ?”

We know P(A|B_i) for any of our P(B_i).

We know: P(word|spam), P(word|not_spam) from training data

We want: P(spam|document) when we see a new document
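
As a tiny illustration with made-up numbers (not the training data used later): suppose P(spam) = P(not_spam) = 0.5, P(free|spam) = 0.06, and P(free|not_spam) = 0.01. For a one-word document containing only "free", Bayes' theorem gives:

P(\text{spam} | \text{free}) = \frac{0.5 \times 0.06}{0.5 \times 0.06 + 0.5 \times 0.01} = \frac{0.03}{0.035} \approx 0.86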

Multinomial Naive Bayes

For a document with features x₁, x₂, …, xₙ, the probability of class c is:

P(c|x_1, x_2, ..., x_n) = \frac{P(c) \times \prod_{i=1}^{n} P(x_i|c)}{P(x_1, x_2, ..., x_n)}

Since the denominator is the same for all classes, we can classify by finding:

\arg\max_c P(c) \times \prod_{i=1}^{n} P(x_i|c)

The argmax evaluates the expression for each class c and returns the class that produces the highest value (the most probable class given the features).
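
In practice (and in the code below), this product is computed in log space to avoid numerical underflow, which turns it into a sum:

\arg\max_c \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i|c) \right]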

Parameter Estimation

Multinomial Naive Bayes is used for text classification where we count how often words appear. For each possible class (like “spam” or “not spam”), we:

  1. Calculate the prior probability - how often does this class appear in our training data?
  2. For each word in the document, calculate how likely that word is given the class
  3. Multiply all these probabilities together
  4. Pick the class with the highest final probability

The “+1” in the likelihood calculation is our smoothing - it prevents us from getting zero probabilities for words we haven’t seen before in a particular class.
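
Written out, the add-one (Laplace) smoothed likelihood used in the implementation below is:

P(x_i | c) = \frac{\text{count}(x_i, c) + 1}{\text{total words in class } c + |V|}

where count(x_i, c) is how many times word x_i appears across all training documents of class c and |V| is the vocabulary size.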

Data Science Terminology

| Term | Definition |
| --- | --- |
| Feature | Numerical representation of data describing a component of it. Some data -> feature algos include one-hot encoding, label encoding, binning, or normalization. In text classification, features are typically words, word counts, tokens, or derived attributes like term frequency-inverse document frequency (TF-IDF) scores that the algorithm uses to make predictions. |
| Vocab | Short for vocabulary - the complete set of unique words/tokens that appear in the training dataset. It defines all possible features the model can recognize. Typically refers to the entire dataset vocab, but can contextually mean a specific doc's vocab. |
| Word | A meaningful unit of language, often separated by spaces in text. Examples: "hello", "world" |
| Token | The basic unit after text preprocessing - what remains after tokenization. Can be words, subwords, punctuation, or other linguistic units depending on the tokenization strategy. |
| Document | A single text sample or instance in the dataset. Could be an email, tweet, article, review, or any coherent piece of text that needs to be classified. |
| Label | The correct answer or target value assigned to each document during training. Also called ground truth. Examples: "spam", "positive sentiment", "sports category". Contextually, it could refer to the predicted label, denoted with a hat, e.g. \hat{l} |
| Class | A category or group that documents can be classified into. The set of all possible labels. Examples: {spam, ham}, {positive, negative, neutral}, {sports, politics, tech} |

Practical Walkthrough

We'll implement it manually in Python first, then use scikit-learn's MultinomialNB (MNB) implementation.

Problem: Spam email classification based on the email's text body.

Manual Implementation

# spam_classifier.py
# uv init; uv add scikit-learn numpy; uv run python spam_classifier.py
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# --- 1. Pre-classified data and preparation
# Data
documents = [
   'click here to enter the contest for free, you could win $1,000,000 in prizes',
   'buy one, get one free for a limited time at our store',
   'Hello, I would like my leaking roof to be repaired.',  # not_spam
   'free money click now limited time offer',
   'meeting scheduled for tomorrow at 3pm in conference room',  # not_spam
   'congratulations you won free cash prizes click here',
]

labels = [1, 1, 0, 1, 0, 1]  # 1 for spam, 0 for not_spam

# Convert to feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # 6x42 matrix of int 
y = np.array(labels)  # 6x1 matrix of int

print("Feature matrix shape:", X.shape)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Vocabulary length:", len(vectorizer.get_feature_names_out()))
print("1st document:", X[0])
# Feature matrix shape: (6, 42)
# Vocabulary: ['000' '3pm' 'at' 'be' 'buy' 'cash' 'click' 'conference' 'congratulations'
#              'contest' 'could' 'enter' 'for' 'free' 'get' 'hello' 'here' 'in'
#              'leaking' 'like' 'limited' 'meeting' 'money' 'my' 'now' 'offer' 'one'
#              'our' 'prizes' 'repaired' 'roof' 'room' 'scheduled' 'store' 'the' 'time'
#              'to' 'tomorrow' 'win' 'won' 'would' 'you']
# Vocabulary length: 42
# 1st document: <Compressed Sparse Row sparse matrix of dtype 'int64'
# with 14 stored elements and shape (1, 42)>
# Coords        Values
# (0, 6)        1
# (0, 16)       1
# (0, 36)       1
# (0, 11)       1
# (0, 34)       1
# (0, 9)        1
# (0, 12)       1
# (0, 13)       1
# (0, 41)       1
# (0, 10)       1
# (0, 38)       1
# (0, 0)        2
# (0, 17)       1
# (0, 28)       1

# --- 2. Priors
def calculate_priors(y: 'N_d x 1'):
    unique, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)  # 2x1 matrix of float since we have 2 classes
    return priors, unique

priors, unique_classes = calculate_priors(y)
class_names = {0: 'not_spam', 1: 'spam'}
for class_int, prior in zip(unique_classes, priors):
   print(f"P({class_names[class_int]}) = {prior:.3f}")
# P(not_spam) = 0.333
# P(spam) = 0.667

# --- 3. Likelihoods
def calculate_likelihood(X: 'N_d x N_feats', y: 'N_d x 1', vectorizer, alpha=1):
    vocab = vectorizer.get_feature_names_out()
    vocab_size = len(vocab)
    classes = np.unique(y)
    
    likelihoods = {}
    
    for class_int in classes:
        # Get documents for this class
        class_mask = y == class_int
        X_class = X[class_mask]
        
        # Sum word counts across all documents in this class
        word_counts_class = np.array(X_class.sum(axis=0)).flatten()
        total_words_in_class = word_counts_class.sum()
        
        likelihoods[class_int] = {}
        
        for i, word in enumerate(vocab):
            count = word_counts_class[i]
            # P(word|class) = (count + alpha) / (total_words + alpha * vocab_size)
            likelihood = (count + alpha) / (total_words_in_class + alpha * vocab_size)
            likelihoods[class_int][word] = likelihood
    
    return likelihoods

# Calculate likelihoods using our X and y
likelihoods = calculate_likelihood(X, y, vectorizer)

print(f"P('free'|spam) = {likelihoods[1]['free']:.3f}")
print(f"P('meeting'|not_spam) = {likelihoods[0]['meeting']:.3f}")
# P('free'|spam) = 0.060
# P('meeting'|not_spam) = 0.033

# --- 4. Tying it together to predict
def predict_document(document: str, priors, likelihoods, vectorizer, unique_classes):
    # Tokenize document
    X_doc = vectorizer.transform([document])
    vocab = vectorizer.get_feature_names_out()
    word_counts = X_doc.toarray()[0]
    
    class_scores = {}
    
    for class_int in unique_classes:
        # Start with prior probability, log to avoid underflow
        score = np.log(priors[class_int])
        
        # Add log-likelihoods for each word (equivalent to multiplying probabilities)
        for i, count in enumerate(word_counts):
            if count > 0:  # Only consider words that appear
                word = vocab[i]
                if word in likelihoods[class_int]:
                    # Add log probability count times
                    score += count * np.log(likelihoods[class_int][word])
        
        class_scores[class_int] = score
    
    # Return class with highest score
    predicted_class = max(class_scores, key=class_scores.get)
    return predicted_class, class_scores

# Test prediction
test_doc = "free money click now"
prediction, scores = predict_document(test_doc, priors, likelihoods, vectorizer, unique_classes)

print(f"\nDocument: '{test_doc}'")
print(f"Predicted class: {class_names[prediction]} ({prediction})")
for class_int, score in scores.items():
    print(f"Log score for {class_names[class_int]}: {score:.3f}")

# Document: 'free money click now'
# Predicted class: spam (1)
# Log score for not_spam: -17.476
# Log score for spam: -13.699

The docs and features (vocab words) are encoded in X; labels are stored in y.

Note: Each document is represented as a row, and each vocab word is represented as a column. As the name CountVectorizer suggests, it’s counting the occurrences of each vocab word/token in each document. Printing it, we can see 000 occurs twice in the first doc which lines up with expectations.

Using libraries to go from clean raw data to features is important: they handle standard preprocessing steps such as removing stop words, stemming, and tokenizing in a performant manner.
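
As a small illustration of those preprocessing knobs (the settings here are just examples and are not what the classifier above uses), CountVectorizer can lowercase, drop English stop words, and add n-grams at vectorization time; stemming would need an external library such as NLTK:

# preprocessing_sketch.py
# minimal sketch of common CountVectorizer preprocessing options;
# illustrative settings only, not what spam_classifier.py uses
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'free money click now limited time offer',
    'meeting scheduled for tomorrow at 3pm in conference room',
]

example_vectorizer = CountVectorizer(
    lowercase=True,        # normalize case before tokenizing
    stop_words='english',  # drop common English stop words like "for", "at", "in"
    ngram_range=(1, 2),    # include unigrams and bigrams as features
    min_df=1,              # keep tokens that appear in at least 1 document
)
X_example = example_vectorizer.fit_transform(docs)
print(example_vectorizer.get_feature_names_out())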

Regarding # P('free'|spam) = 0.060 and # P('meeting'|not_spam) = 0.033: these values appear small, but each is only one factor in the final probability calculation.

After writing and annotating this throughout, I realized that next time I can write a notebook and compile it to HTML instead.

Big-O

O(c * d) runtime, where c is the number of classes and d is the average length of a document; the assumption is that building the document-feature matrix, X, which scales with document length, costs more than the O(c * f) work of the likelihood and predict functions, where f is the number of features (vocab size).

O(c * f) space, since we store one likelihood per class per feature.

Using MNB Implemented in Scikit-Learn

In the real world, we’ll use a library to implement the algo for us.

# spam_clf_scikit_learn.py
# uv init; uv add scikit-learn numpy; uv run python spam_clf_scikit_learn.py
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Data
documents = [
   'click here to enter the contest for free, you could win $1,000,000 in prizes',
   'buy one, get one free for a limited time at our store',
   'Hello, I would like my leaking roof to be repaired.',  # not_spam
   'free money click now limited time offer',
   'meeting scheduled for tomorrow at 3pm in conference room',  # not_spam
   'congratulations you won free cash prizes click here',
]

labels = [1, 1, 0, 1, 0, 1]  # 1 for spam, 0 for not_spam

# Convert to feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # 6x42 matrix of int 
y = np.array(labels)  # 6x1 matrix of int

# Train model
nb_model = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
nb_model.fit(X, y)

# Access learned parameters
print("Class priors:", nb_model.class_log_prior_)
print("Feature log probabilities shape:", nb_model.feature_log_prob_.shape)

# Predict new documents
new_docs = ['free money now', 'meeting tomorrow']
X_new = vectorizer.transform(new_docs)
predictions = nb_model.predict(X_new)
probabilities = nb_model.predict_proba(X_new)

for doc, pred, prob in zip(new_docs, predictions, probabilities):
   print(f"'{doc}' -> {pred} (prob: {prob.max():.3f})")

# Class priors: [-1.09861229 -0.40546511]
# Feature log probabilities shape: (2, 42)
# 'free money now' -> 1 (prob: 0.938)
# 'meeting tomorrow' -> 0 (prob: 0.793)
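
In a real project, the vectorizer and classifier are usually chained in a scikit-learn Pipeline so the exact same preprocessing runs at train and predict time. A minimal sketch, reusing documents and labels from the script above:

# pipeline_sketch.py
# minimal sketch: chain the vectorizer and classifier into one estimator;
# assumes `documents` and `labels` from spam_clf_scikit_learn.py above
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

spam_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
spam_clf.fit(documents, labels)  # fits the vectorizer, then the model

# raw text goes straight in; vectorization happens inside the pipeline
print(spam_clf.predict(['free money now', 'meeting tomorrow']))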

Why “Multinomial”?

The “multinomial” part refers to how the algorithm models document generation using a multinomial probability distribution - like rolling a loaded vocabulary die multiple times.
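
Concretely, if x_i is the number of times word w_i appears in a document, the multinomial model gives the class-conditional likelihood:

P(x_1, \ldots, x_n | c) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_{i=1}^{n} P(w_i | c)^{x_i}

The multinomial coefficient out front is the same for every class, so it drops out of the argmax and we are left with the product of per-word probabilities used above.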

Multinomial vs Other Naive Bayes Variants

| Variant | Features | Best For | Example |
| --- | --- | --- | --- |
| Multinomial NB | Word counts | Longer documents | "excellent" appears 3 times → stronger signal |
| Bernoulli NB | Word presence/absence | Short documents | "excellent" present or not → binary signal |
| Gaussian NB | Continuous values | Numerical data | Temperature, age, income measurements |
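
For comparison, the Bernoulli variant is a small change in scikit-learn: binarize the features and swap in BernoulliNB. A minimal sketch, reusing documents and labels from the scripts above:

# bernoulli_sketch.py
# Bernoulli NB uses word presence/absence instead of counts;
# assumes `documents` and `labels` from the scripts above
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

binary_vectorizer = CountVectorizer(binary=True)  # 1 if a word appears, else 0
X_bin = binary_vectorizer.fit_transform(documents)

bernoulli_model = BernoulliNB(alpha=1.0)
bernoulli_model.fit(X_bin, labels)
print(bernoulli_model.predict(binary_vectorizer.transform(['free money now'])))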

Common Applications

| Application | Description |
| --- | --- |
| Spam Detection | Filter unwanted emails |
| Sentiment Analysis | Classify positive/negative opinions |
| Topic Classification | Categorize documents by subject |
| Language Detection | Identify document language |
| News Categorization | Sort articles by category |

Advanced Extensions

Complement Naive Bayes: Particularly effective for imbalanced datasets; it estimates the probability that a document does NOT belong to a class.

Multivariate Bernoulli Naive Bayes: Uses binary features (word presence/absence) instead of counts, sometimes performs better for short documents.

Hierarchical Classification: Apply multinomial NB in a tree structure for hierarchical categorization tasks.
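
Of these, Complement Naive Bayes ships with scikit-learn as ComplementNB, so trying it is a one-line swap. A minimal sketch, reusing X, y, and vectorizer from the scripts above:

# complement_nb_sketch.py
# Complement NB is aimed at imbalanced datasets;
# assumes `X`, `y`, and `vectorizer` from the scripts above
from sklearn.naive_bayes import ComplementNB

cnb_model = ComplementNB(alpha=1.0)
cnb_model.fit(X, y)
print(cnb_model.predict(vectorizer.transform(['free money now'])))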

Modern Context: While deep learning models often achieve higher accuracy, Multinomial Naive Bayes remains valuable for its simplicity, speed, and effectiveness as a baseline. It’s often used in production systems where interpretability and low latency are priorities.

Future Work

Summary

You learn Bayes' Theorem in one of your first stats courses, and you can get started with some simple code using Python and scikit-learn. The algorithm has become the go-to baseline for text classification.

Ideal Scenarios

When to Avoid MNB