Logistic Regression Model for Sentiment Analysis from Scratch

Creating a logistic regression model for sentiment analysis from scratch involves several steps. Here’s a simplified, step-by-step procedure tailored for a dummy dataset:

1. Understand the Dataset

  • Let’s assume a dataset with two columns: text (containing sentences) and sentiment (labeled as 0 for negative and 1 for positive).

Texts: ["I love this product", "I hate this product", "This is the best product", "This is the worst product"]

Sentiments: [1, 0, 1, 0]

2. Preprocess the Data

  • Tokenize Text: Split sentences into words.
  • Remove Stopwords: Eliminate common words like ‘the’, ‘is’, etc.
  • Stemming/Lemmatization: Convert words to their base form.

After lowercasing, removing non-word characters, dropping stopwords, and stemming, each sentence becomes a list of words. E.g., [['love', 'product'], ['hate', 'product'], ...]

3. Feature Extraction

  • Bag of Words: Create a matrix where each unique word is a feature and each entry counts that word's occurrences in a text.
  • TF-IDF: Alternatively, weight the counts by Term Frequency-Inverse Document Frequency (a sketch follows this list).

Feature Extraction – Bag of Words

  • Vocabulary: Unique set of words in all texts. E.g., {'love', 'hate', 'product', 'best', 'worst'}
  • Features: Numeric vectors representing the frequency of vocabulary words in each text.
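The implementation below sticks with plain bag-of-words counts, but a from-scratch TF-IDF variant is a small change. Here is a minimal sketch, assuming processed_texts and vocab as built in the code further below and that every document has at least one token; the +1 smoothing in the IDF term is one common convention, not the only one, and tfidf_vectors is a hypothetical helper name:

import math
from collections import Counter

def tfidf_vectors(processed_texts, vocab):
    # Weight each term's frequency by its (smoothed) inverse document frequency.
    n_docs = len(processed_texts)
    # Document frequency: in how many documents does each word appear?
    df = {word: sum(1 for doc in processed_texts if word in doc) for word in vocab}
    vectors = []
    for doc in processed_texts:
        counts = Counter(doc)
        vec = []
        for word in vocab:
            tf = counts.get(word, 0) / len(doc)           # term frequency
            idf = math.log(n_docs / (1 + df[word])) + 1   # smoothed IDF
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors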

4. Create Target Variable

  • Your target variable is the sentiment column.
  • Labels: Numpy array of the sentiments. E.g., array([1, 0, 1, 0])

5. Split the Dataset

  • Divide the dataset into training and testing sets (e.g., 80% train, 20% test).

6. Initialize Parameters

  • Initialize one weight per feature, plus a bias, all set to zero.
  • Weights: E.g., array([0., 0., 0., 0., 0.]) for a five-word vocabulary.
  • Bias: E.g., 0

7. Define the Sigmoid Function

  • sigmoid(z) = 1 / (1 + exp(-z))
  • Sigmoid Output: always between 0 and 1, so it can be read as a probability.

8. Compute the Prediction

  • Calculate z = dot(weights, features) + bias (a dot product, not an element-wise multiply).
  • Apply sigmoid on z to get predictions between 0 and 1.

9. Calculate the Loss Function

  • Predictions (y_hat): Probability values after applying the sigmoid function.
  • Use Binary Cross-Entropy Loss, averaged over the m training examples:
    • loss = -(1/m) * sum[y*log(y_hat) + (1-y)*log(1-y_hat)]

10. Gradient Descent

  • Update weights and bias to minimize the loss.
  • weight = weight - learning_rate * d_weight
  • bias = bias - learning_rate * d_bias
  • Where d_weight and d_bias are the gradients of the loss w.r.t. the weights and bias (given below).
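For binary cross-entropy with a sigmoid output, these gradients have a standard closed form (X is the m-by-n feature matrix, y_hat the vector of predicted probabilities):

  • d_weight = (1/m) * X^T (y_hat - y)
  • d_bias = (1/m) * sum(y_hat - y)

This is exactly what the update_weights function in the implementation below computes.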

11. Repeat for Multiple Epochs

  • Perform steps 8-10 for a set number of iterations (epochs).

12. Make Predictions on Test Data

  • Use the trained model to predict sentiments on the test set.

13. Evaluate the Model

  • Use metrics such as accuracy, precision, recall, and F1-score (a sketch follows).
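The implementation below reports only accuracy; the remaining metrics are a few extra lines with scikit-learn. A minimal sketch, assuming y_test and y_pred as produced in that code (zero_division=0 avoids undefined values when a class is never predicted on a tiny test set):

from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_test, y_pred, zero_division=0))
print("F1-score: ", f1_score(y_test, y_pred, zero_division=0))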

14. Tune the Model

  • Adjust hyperparameters such as the learning rate and the number of epochs for better performance (see the sweep below).
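A minimal sweep over a few hypothetical learning rates, reusing the train and predict functions defined in the implementation below. On real data you would score each candidate on a separate validation set rather than the test set:

for lr in [0.001, 0.01, 0.1]:
    # Re-initialize parameters so each run starts from scratch
    w = np.zeros(X_train.shape[1])
    b = 0.0
    w, b = train(X_train, y_train, w, b, learning_rate=lr, epochs=1000)
    acc = accuracy_score(y_test, predict(X_test, w, b))
    print(f"learning_rate={lr}: accuracy={acc}")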

15. Deployment (Optional)

  • Integrate the model into an application for real-time sentiment analysis.

Remember, this is a basic outline. Real-world scenarios might require more sophisticated preprocessing and feature engineering techniques.

Python implementation of logistic regression for sentiment analysis on a dummy dataset

This code uses NLTK for stopword removal and stemming, NumPy for the mathematical operations, and scikit-learn for the train/test split and evaluation. To run it, install the dependencies and download the NLTK stopwords corpus:

!pip install nltk numpy scikit-learn
!python -m nltk.downloader stopwords
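Alternatively, the stopwords corpus can be fetched from inside Python:

import nltk
nltk.download('stopwords')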

This example will follow the steps I previously outlined, but keep in mind it’s a basic illustration and might need adjustments for real-world data.

import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter

# Step 1: Dummy Dataset
texts = [
    "I love this product",
    "I hate this product",
    "This is the best product",
    "This is the worst product",
]
sentiments = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Step 2: Preprocess the Data
stop_words = set(stopwords.words('english'))  # set lookup avoids re-reading the corpus per word
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, strip non-word characters, drop stopwords, and stem."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return [stemmer.stem(word) for word in words]

processed_texts = [preprocess(text) for text in texts]

# Step 3: Feature Extraction - Bag of Words
def create_bag_of_words(processed_texts):
    """Build the vocabulary: unique words across all texts, in first-seen order."""
    all_words = [word for doc in processed_texts for word in doc]
    return list(Counter(all_words).keys())

vocab = create_bag_of_words(processed_texts)

def text_to_vector(text, vocab):
    text_counts = Counter(text)
    return [text_counts.get(word, 0) for word in vocab]

features = np.array([text_to_vector(text, vocab) for text in processed_texts])

# Step 4: Create Target Variable
labels = np.array(sentiments)

# Step 5: Split the Dataset
# With only four samples, test_size=0.2 leaves a single test example
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Step 6: Initialize Parameters
weights = np.zeros(X_train.shape[1])
bias = 0.0

# Step 7: Define the Sigmoid Function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Step 8 & 9: Compute Prediction and Calculate Loss
def compute_loss(y, y_hat):
    """Binary cross-entropy; y_hat is clipped to avoid log(0)."""
    m = y.shape[0]
    y_hat = np.clip(y_hat, 1e-15, 1 - 1e-15)
    return -(1/m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 10: Gradient Descent
def update_weights(X, y, weights, bias, learning_rate):
    m = X.shape[0]
    y_hat = sigmoid(np.dot(X, weights) + bias)
    d_weight = (1/m) * np.dot(X.T, (y_hat - y))
    d_bias = (1/m) * np.sum(y_hat - y)
    weights -= learning_rate * d_weight
    bias -= learning_rate * d_bias
    return weights, bias

# Step 11: Training the Model
def train(X, y, weights, bias, learning_rate, epochs):
    for epoch in range(epochs):
        weights, bias = update_weights(X, y, weights, bias, learning_rate)
        y_hat = sigmoid(np.dot(X, weights) + bias)
        loss = compute_loss(y, y_hat)
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss {loss}")
    return weights, bias

# Train the model
weights, bias = train(X_train, y_train, weights, bias, learning_rate=0.01, epochs=1000)

# Step 12 & 13: Make Predictions and Evaluate the Model
def predict(X, weights, bias):
    """Threshold the sigmoid probabilities at 0.5 to get 0/1 labels."""
    return [1 if p > 0.5 else 0 for p in sigmoid(np.dot(X, weights) + bias)]

y_pred = predict(X_test, weights, bias)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# This code sets up a simple logistic regression model for sentiment analysis.
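As a quick usage check, and the simplest form of the optional deployment step, the trained model can score new raw text by chaining the same helpers. A small sketch; the input sentence is made up, and words unseen during training simply contribute nothing to the feature vector:

def predict_sentiment(raw_text, weights, bias, vocab):
    # Preprocess a raw string and return a 0/1 sentiment prediction
    tokens = preprocess(raw_text)
    vector = np.array(text_to_vector(tokens, vocab))
    probability = sigmoid(np.dot(vector, weights) + bias)
    return 1 if probability > 0.5 else 0

print(predict_sentiment("I really love this", weights, bias, vocab))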