Naive Bayes classifier for sentiment analysis from scratch

Building a Naive Bayes classifier for sentiment analysis from scratch takes seven steps. Here’s a simplified step-by-step guide using a dummy dataset.

1. Prepare Dataset

  • Gather a small set of sentences (texts).
  • Label each as ‘positive’ or ‘negative’.

2. Tokenize Text

  • Break texts into individual words (tokens).

3. Clean and Normalize Data

  • Convert to lowercase.
  • Remove punctuation and special characters.

4. Create Word Frequencies

  • Count how often each word appears in each class (positive/negative).

5. Calculate Probabilities

  • Compute the probability of each word given a class.
  • Use Laplace smoothing to avoid zero probabilities.
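
The smoothed estimate from this step can be written out explicitly. The symbols here are notation introduced for illustration, not taken from the text above: count(w, c) is how often word w appears in class c, N_c is the total number of tokens in class c, |V| is the number of distinct words in the training data, and α is the smoothing constant (α = 1 gives Laplace smoothing).

```latex
P(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{N_c + \alpha \, |V|}
```

Adding α|V| to the denominator keeps the smoothed probabilities for a class summing to 1 across the vocabulary.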

6. Classify New Text

  • For a new text, tokenize and clean it the same way as the training texts.
  • For each class, multiply the class prior by the probabilities of the text’s tokens given that class.
  • Assign the class with the higher score.
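
Multiplying many small probabilities can underflow to zero on longer texts, so a common variant sums logarithms instead. The following is a minimal sketch of that idea, not a definitive implementation: `log_score`, `pos_probs`, and `neg_probs` are hypothetical names, and the probability values are illustrative smoothed estimates rather than output from a real training run.

```python
import math

def log_score(tokens, word_probs, prior):
    """Return log P(class) plus the sum of log P(token | class) for known tokens."""
    score = math.log(prior)
    for token in tokens:
        if token in word_probs:  # skip words never seen in training
            score += math.log(word_probs[token])
    return score

# Illustrative smoothed probabilities per class (assumed values)
pos_probs = {"i": 2 / 18, "love": 2 / 18, "this": 3 / 18}
neg_probs = {"i": 2 / 18, "love": 1 / 18, "this": 3 / 18}

tokens = ["i", "love", "this", "movie"]
pos = log_score(tokens, pos_probs, prior=0.5)
neg = log_score(tokens, neg_probs, prior=0.5)
label = "positive" if pos > neg else "negative"
print(label)  # positive
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products.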

7. Evaluate Classifier

  • Test with a separate set of labeled texts.
  • Calculate accuracy as the percentage of correctly classified texts.
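
The accuracy calculation can be sketched as a small helper. Everything in this block is an assumption made for illustration: `accuracy` is a hypothetical function, `keyword_classifier` is a stand-in rule (not Naive Bayes) so the block runs on its own, and the held-out sentences are made up.

```python
def accuracy(classifier, labeled_texts):
    """Fraction of (text, label) pairs for which classifier(text) == label."""
    if not labeled_texts:
        raise ValueError("labeled_texts must not be empty")
    correct = sum(1 for text, label in labeled_texts if classifier(text) == label)
    return correct / len(labeled_texts)

# Stand-in classifier for illustration only: a keyword rule, not Naive Bayes
def keyword_classifier(text):
    lowered = text.lower()
    return "positive" if "love" in lowered or "great" in lowered else "negative"

# Made-up held-out examples, kept separate from any training data
held_out = [
    ("I love this phone", "positive"),
    ("This is a bad phone", "negative"),
    ("What a great phone", "positive"),
    ("Not great at all", "negative"),  # the keyword rule gets this one wrong
]
print(f"Accuracy: {accuracy(keyword_classifier, held_out):.0%}")  # Accuracy: 75%
```

Any function that maps a text to a label can be passed in as `classifier`, including a wrapper around a trained Naive Bayes model.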

Here’s a simplified example:

Dataset:

  • “I love this product” (Positive)
  • “I hate this product” (Negative)
  • “This is a great product” (Positive)
  • “This is a bad product” (Negative)

Tokenization and Cleaning:

  • [“i”, “love”, “this”, “product”]
  • [“i”, “hate”, “this”, “product”]
  • [“this”, “is”, “a”, “great”, “product”]
  • [“this”, “is”, “a”, “bad”, “product”]

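Word Frequencies:

  • Positive: {“i”: 1, “love”: 1, “this”: 2, “product”: 2, “is”: 1, “a”: 1, “great”: 1}
  • Negative: {“i”: 1, “hate”: 1, “this”: 2, “product”: 2, “is”: 1, “a”: 1, “bad”: 1}

Probabilities:

  • P(“love” | Positive) = (1+1) / (9+9) = 1/9 with Laplace smoothing: the positive class has 9 tokens in total, and the vocabulary has 9 distinct words.

Classification:

  • For “I love this movie”, calculate P(Positive | Text) and P(Negative | Text). “movie” never appears in training, so it is skipped; “love” then tips the result toward Positive.

Evaluation:

  • Test with new texts and calculate the accuracy.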

This is a basic outline. In a real-world scenario, the dataset would be much larger, and additional preprocessing steps like removing stop words or using stemming might be necessary.
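
Stop-word removal, one of the extra preprocessing steps mentioned above, might look like the following minimal sketch. The `STOP_WORDS` set and the `tokenize_without_stop_words` name are assumptions for illustration; real stop-word lists are much longer.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"i", "this", "is", "a", "the", "and"}

def tokenize_without_stop_words(text):
    """Lowercase, replace punctuation with spaces, split, and drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]

print(tokenize_without_stop_words("This is a great product!"))  # ['great', 'product']
```

Dropping words such as “this” and “is” leaves the sentiment-bearing words to carry more of the score.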

Python code snippet to implement a basic Naive Bayes classifier for sentiment analysis

# Import necessary libraries
import re
from collections import defaultdict

# Dummy dataset
data = [
    ("I love this product", "positive"),
    ("I hate this product", "negative"),
    ("This is a great product", "positive"),
    ("This is a bad product", "negative"),
]

# Tokenize and clean the text
def tokenize(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\W+', ' ', text) # Remove punctuation
    tokens = text.split() # Split into tokens
    return tokens

# Count word frequencies
def count_words(data):
    word_counts = defaultdict(lambda: {'positive': 0, 'negative': 0})
    for text, sentiment in data:
        tokens = tokenize(text)
        for token in tokens:
            word_counts[token][sentiment] += 1
    return word_counts

# Calculate word probabilities with Laplace (add-one) smoothing
def word_probabilities(word_counts, total_pos, total_neg, smoothing=1):
    vocab_size = len(word_counts) # Number of distinct words seen in training
    probabilities = defaultdict(dict)
    for word in word_counts:
        # The denominator adds smoothing once per vocabulary word
        probabilities[word]['positive'] = \
          (word_counts[word]['positive'] + smoothing) / (total_pos + smoothing * vocab_size)
        probabilities[word]['negative'] = \
          (word_counts[word]['negative'] + smoothing) / (total_neg + smoothing * vocab_size)
    return probabilities

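# Classify a new text
def classify(text, word_probs):
    text_tokens = tokenize(text)
    # Class priors are omitted because the dummy dataset is balanced (2 vs. 2)
    pos_prob = neg_prob = 1
    for token in text_tokens:
        if token in word_probs: # Skip words never seen in training
            pos_prob *= word_probs[token]['positive']
            neg_prob *= word_probs[token]['negative']
    # Ties (e.g. no known words in the text) fall back to 'negative'
    return 'positive' if pos_prob > neg_prob else 'negative'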

# Training the classifier
word_counts = count_words(data)
total_pos = total_neg = 0
for sentiment_counts in word_counts.values():
    total_pos += sentiment_counts['positive']
    total_neg += sentiment_counts['negative']

word_probs = word_probabilities(word_counts, total_pos, total_neg)

# Test the classifier
test_text = "I love this movie" # "movie" is not in the training data
print(f"Classification: {classify(test_text, word_probs)}") # Classification: positive