Simplified step-by-step procedure for creating a Naive Bayes classifier for sentiment analysis without using any machine learning package

Creating a Naive Bayes classifier for sentiment analysis from scratch involves several key steps. Here’s a simplified step-by-step guide using a dummy dataset.

1. Prepare Dataset

Gather a small set of sentences (texts).
Label each as ‘positive’ or ‘negative’.

2. Tokenize Text

Break texts into individual words (tokens).

3. Clean and Normalize Data

Convert to lowercase.
Remove punctuation and special characters.

4. Create Word Frequencies

Count how often each word appears in each class (positive/negative).

5. Calculate Probabilities

Compute the probability of each word given a class.
Use Laplace smoothing to avoid zero probabilities.
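
The smoothed estimate in this step can be written as below. This is a sketch of the usual add-α formula, and the symbols are notation introduced here for illustration: count(w, c) is the number of times word w appears in class c, N_c is the total number of tokens in class c, |V| is the number of distinct words in the training data, and α is the smoothing constant (α = 1 gives Laplace smoothing).

```latex
P(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{N_c + \alpha \, |V|}
```

Summed over the whole vocabulary, the numerators add up to N_c + α|V|, so the estimates for each class still sum to 1.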

6. Classify New Text

For a new text, break it into tokens.
For each class, multiply the class prior by the probability of each token given that class.
Assign the class with the higher probability.
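
Multiplying many small probabilities can underflow on longer texts, so a common variant sums logarithms instead. The sketch below illustrates that idea under stated assumptions: `log_score` is a hypothetical helper, the probability values are made up for illustration, and words not seen in training are simply skipped.

```python
import math

def log_score(tokens, word_probs, prior):
    """Return log(prior) plus the sum of log P(token | class) for known tokens."""
    score = math.log(prior)
    for token in tokens:
        if token in word_probs:  # assumption: unseen words are skipped
            score += math.log(word_probs[token])
    return score

# Hypothetical smoothed probabilities for two classes (illustrative values only)
pos_probs = {"love": 0.2, "this": 0.3}
neg_probs = {"love": 0.1, "this": 0.3}

tokens = ["i", "love", "this"]
pos = log_score(tokens, pos_probs, prior=0.5)
neg = log_score(tokens, neg_probs, prior=0.5)

# The class with the larger log score is also the class with the larger product
label = "positive" if pos > neg else "negative"
print(label)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products.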

7. Evaluate Classifier

Test with a separate set of labeled texts.
Calculate accuracy as the percentage of correctly classified texts.
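
The accuracy calculation can be sketched as follows. The `accuracy` helper and the sample predictions are hypothetical and only illustrate the arithmetic; in practice the predictions would come from the classifier.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical predictions and true labels for four test texts
predicted = ["positive", "negative", "positive", "positive"]
actual = ["positive", "negative", "negative", "positive"]
print(f"Accuracy: {accuracy(predicted, actual):.0%}")
```

Three of the four hypothetical predictions match, so this reports an accuracy of 75%.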

Here’s a simplified example:

Dataset

“I love this product” (Positive)
“I hate this product” (Negative)
“This is a great product” (Positive)
“This is a bad product” (Negative)

Tokenization and Cleaning

[“i”, “love”, “this”, “product”]
[“i”, “hate”, “this”, “product”]
[“this”, “is”, “a”, “great”, “product”]
[“this”, “is”, “a”, “bad”, “product”]

Word Frequencies

Positive: {“i”: 1, “love”: 1, “this”: 2, “product”: 2, “is”: 1, “a”: 1, “great”: 1}
Negative: {“i”: 1, “hate”: 1, “this”: 2, “product”: 2, “is”: 1, “a”: 1, “bad”: 1}

Probabilities

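P(“love” | Positive) = (1+1) / (9+9) = 1/9 with Laplace smoothing: “love” occurs once in the positive texts, the positive class contains 9 tokens in total, and the vocabulary has 9 distinct words.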
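
To check the arithmetic, the smoothed probabilities can be computed exactly with fractions. This is a minimal sketch: the `smoothed` helper is hypothetical, and the token lists are the cleaned example sentences grouped by class. It counts 9 tokens in each class and 9 distinct words overall.

```python
from collections import Counter
from fractions import Fraction

# Tokens of the example sentences, grouped by class
positive_tokens = ["i", "love", "this", "product",
                   "this", "is", "a", "great", "product"]
negative_tokens = ["i", "hate", "this", "product",
                   "this", "is", "a", "bad", "product"]

pos_counts = Counter(positive_tokens)
neg_counts = Counter(negative_tokens)
vocabulary = set(pos_counts) | set(neg_counts)  # distinct words in both classes

def smoothed(word, counts, alpha=1):
    """Add-alpha estimate of P(word | class) as an exact fraction."""
    total = sum(counts.values())  # total tokens in the class
    return Fraction(counts[word] + alpha, total + alpha * len(vocabulary))

print(smoothed("love", pos_counts))  # 1/9
print(smoothed("love", neg_counts))  # 1/18
```

Without smoothing, “love” would have probability zero in the negative class and would wipe out the whole product for any text containing it.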

Classification

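For “I love this movie”, calculate P(Positive | Text) and P(Negative | Text). The word “movie” never appears in the training data, so it contributes equally to both classes and does not change the outcome.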
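
As a worked sketch of this step, assume equal priors of 1/2 (two positive and two negative training texts), Laplace smoothing with 9 tokens per class and a 9-word vocabulary, and that the unseen word “movie” is skipped. These are assumptions made for illustration. The remaining tokens are “i”, “love”, and “this”:

```latex
\begin{aligned}
P(\text{Positive}) \prod_{w} P(w \mid \text{Positive})
  &= \tfrac{1}{2} \cdot \tfrac{2}{18} \cdot \tfrac{2}{18} \cdot \tfrac{3}{18} = \tfrac{6}{5832} \\
P(\text{Negative}) \prod_{w} P(w \mid \text{Negative})
  &= \tfrac{1}{2} \cdot \tfrac{2}{18} \cdot \tfrac{1}{18} \cdot \tfrac{3}{18} = \tfrac{3}{5832}
\end{aligned}
```

The positive score is twice the negative one, so after normalizing, P(Positive | Text) = 2/3 and P(Negative | Text) = 1/3, and the text is classified as positive.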

Evaluation

Test with new texts and calculate the accuracy.

This is a basic outline. In a real-world scenario, the dataset would be much larger, and additional preprocessing steps like removing stop words or using stemming might be necessary.

Python code snippet to implement a basic Naive Bayes classifier for sentiment analysis

# Import necessary libraries
import re
from collections import defaultdict

# Dummy dataset
data = [
    ("I love this product", "positive"),
    ("I hate this product", "negative"),
    ("This is a great product", "positive"),
    ("This is a bad product", "negative"),
]

# Tokenize and clean the text
def tokenize(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\W+', ' ', text) # Remove punctuation
    tokens = text.split() # Split into tokens
    return tokens

# Count word frequencies
def count_words(data):
    word_counts = defaultdict(lambda: {'positive': 0, 'negative': 0})
    for text, sentiment in data:
        tokens = tokenize(text)
        for token in tokens:
            word_counts[token][sentiment] += 1
    return word_counts

# Calculate word probabilities with Laplace (add-one) smoothing
def word_probabilities(word_counts, total_pos, total_neg, smoothing=1):
    vocab_size = len(word_counts)  # Number of distinct words across both classes
    probabilities = defaultdict(dict)
    for word in word_counts:
        # The denominator adds the smoothing constant once per vocabulary word
        probabilities[word]['positive'] = \
          (word_counts[word]['positive'] + smoothing) / (total_pos + smoothing * vocab_size)
        probabilities[word]['negative'] = \
          (word_counts[word]['negative'] + smoothing) / (total_neg + smoothing * vocab_size)
    return probabilities

# Classify a new text
def classify(text, word_probs):
    text_tokens = tokenize(text)
    # Equal class priors are assumed, which holds for the balanced dummy dataset
    pos_prob = neg_prob = 1
    for token in text_tokens:
        if token in word_probs:  # Skip words never seen in training
            pos_prob *= word_probs[token]['positive']
            neg_prob *= word_probs[token]['negative']
    # A tie falls back to 'negative'
    return 'positive' if pos_prob > neg_prob else 'negative'

# Training the classifier
word_counts = count_words(data)
total_pos = total_neg = 0
for sentiment_counts in word_counts.values():
    total_pos += sentiment_counts['positive']
    total_neg += sentiment_counts['negative']

word_probs = word_probabilities(word_counts, total_pos, total_neg)

# Test the classifier
test_text = "I love this movie"
print(f"Classification: {classify(test_text, word_probs)}")
