Logistic Regression Model for Sentiment Analysis from Scratch

Creating a logistic regression model for sentiment analysis from scratch involves several steps. Here’s a simplified, step-by-step procedure tailored for a dummy dataset:

1. Understand the Dataset

  • Let’s assume a dataset with two columns: text (containing sentences) and sentiment (labeled as 0 for negative and 1 for positive).

Texts: ["I love this product", "I hate this product", "This is the best product", "This is the worst product"]

Sentiments: [1, 0, 1, 0]

2. Preprocess the Data

  • Tokenize Text: Split sentences into words.
  • Remove Stopwords: Eliminate common words like ‘the’, ‘is’, etc.
  • Stemming/Lemmatization: Convert words to their base form.

After lowercasing, removing non-word characters, dropping stopwords, and stemming, each sentence becomes a list of words. E.g., [['love', 'product'], ['hate', 'product'], ...]

3. Feature Extraction

  • Bag of Words: Create a matrix where each unique word is a feature and each entry counts that word's occurrences in a text.
  • TF-IDF: Alternatively, weight the counts by Term Frequency-Inverse Document Frequency (a sketch follows this list).

Feature Extraction – Bag of Words

  • Vocabulary: Unique set of words in all texts. E.g., {'love', 'hate', 'product', 'best', 'worst'}
  • Features: Numeric vectors representing the frequency of vocabulary words in each text.
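The implementation below sticks with plain bag-of-words counts, but a from-scratch TF-IDF variant is a small change. Here is a minimal sketch, assuming processed_texts and vocab as built in the code further below and that every document has at least one token; the +1 smoothing in the IDF term is one common convention, not the only one, and tfidf_vectors is a hypothetical helper name:

import math
from collections import Counter

def tfidf_vectors(processed_texts, vocab):
    # Weight each term's frequency by its (smoothed) inverse document frequency.
    n_docs = len(processed_texts)
    # Document frequency: in how many documents does each word appear?
    df = {word: sum(1 for doc in processed_texts if word in doc) for word in vocab}
    vectors = []
    for doc in processed_texts:
        counts = Counter(doc)
        vec = []
        for word in vocab:
            tf = counts.get(word, 0) / len(doc)           # term frequency
            idf = math.log(n_docs / (1 + df[word])) + 1   # smoothed IDF
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors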

4. Create Target Variable

  • Your target variable is the sentiment column.
  • Labels: Numpy array of the sentiments. E.g., array([1, 0, 1, 0])

5. Split the Dataset

  • Divide the dataset into training and testing sets (e.g., 80% train, 20% test).

6. Initialize Parameters

  • Initialize one weight per feature, plus a bias, all set to zero.
  • Weights: E.g., array([0., 0., 0., 0., 0.]) for a five-word vocabulary.
  • Bias: E.g., 0

7. Define the Sigmoid Function

  • sigmoid(z) = 1 / (1 + exp(-z))
  • Sigmoid Output: always between 0 and 1, so it can be read as a probability.

8. Compute the Prediction

  • Calculate z = dot(weights, features) + bias (a dot product, not an element-wise multiply).
  • Apply sigmoid on z to get predictions between 0 and 1.

9. Calculate the Loss Function

  • Predictions (y_hat): Probability values after applying the sigmoid function.
  • Use Binary Cross-Entropy Loss, averaged over the m training examples:
    • loss = -(1/m) * sum[y*log(y_hat) + (1-y)*log(1-y_hat)]

10. Gradient Descent

  • Update weights and bias to minimize the loss.
  • weight = weight - learning_rate * d_weight
  • bias = bias - learning_rate * d_bias
  • Where d_weight and d_bias are the gradients of the loss w.r.t. the weights and bias (given below).
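For binary cross-entropy with a sigmoid output, these gradients have a standard closed form (X is the m-by-n feature matrix, y_hat the vector of predicted probabilities):

  • d_weight = (1/m) * X^T (y_hat - y)
  • d_bias = (1/m) * sum(y_hat - y)

This is exactly what the update_weights function in the implementation below computes.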

11. Repeat for Multiple Epochs

  • Perform steps 8-10 for a set number of iterations (epochs).

12. Make Predictions on Test Data

  • Use the trained model to predict sentiments on the test set.

13. Evaluate the Model

  • Use metrics such as accuracy, precision, recall, and F1-score (a sketch follows).
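The implementation below reports only accuracy; the remaining metrics are a few extra lines with scikit-learn. A minimal sketch, assuming y_test and y_pred as produced in that code (zero_division=0 avoids undefined values when a class is never predicted on a tiny test set):

from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_test, y_pred, zero_division=0))
print("F1-score: ", f1_score(y_test, y_pred, zero_division=0))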

14. Tune the Model

  • Adjust hyperparameters such as the learning rate and the number of epochs for better performance (see the sweep below).
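A minimal sweep over a few hypothetical learning rates, reusing the train and predict functions defined in the implementation below. On real data you would score each candidate on a separate validation set rather than the test set:

for lr in [0.001, 0.01, 0.1]:
    # Re-initialize parameters so each run starts from scratch
    w = np.zeros(X_train.shape[1])
    b = 0.0
    w, b = train(X_train, y_train, w, b, learning_rate=lr, epochs=1000)
    acc = accuracy_score(y_test, predict(X_test, w, b))
    print(f"learning_rate={lr}: accuracy={acc}")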

15. Deployment (Optional)

  • Integrate the model into an application for real-time sentiment analysis.

Remember, this is a basic outline. Real-world scenarios might require more sophisticated preprocessing and feature engineering techniques.

Python implementation of logistic regression for sentiment analysis on a dummy dataset

This code uses NLTK for stopword removal and stemming, NumPy for the mathematical operations, and scikit-learn for the train/test split and evaluation. To run it, install the dependencies and download the NLTK stopwords corpus:

!pip install nltk numpy scikit-learn
!python -m nltk.downloader stopwords
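Alternatively, the stopwords corpus can be fetched from inside Python:

import nltk
nltk.download('stopwords')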

This example will follow the steps I previously outlined, but keep in mind it’s a basic illustration and might need adjustments for real-world data.

import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter

# Step 1: Dummy Dataset
texts = [
    "I love this product",
    "I hate this product",
    "This is the best product",
    "This is the worst product",
]
sentiments = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Step 2: Preprocess the Data
stop_words = set(stopwords.words('english'))  # set lookup avoids re-reading the corpus per word
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, strip non-word characters, drop stopwords, and stem."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return [stemmer.stem(word) for word in words]

processed_texts = [preprocess(text) for text in texts]

# Step 3: Feature Extraction - Bag of Words
def create_bag_of_words(processed_texts):
    """Build the vocabulary: unique words across all texts, in first-seen order."""
    all_words = [word for doc in processed_texts for word in doc]
    return list(Counter(all_words).keys())

vocab = create_bag_of_words(processed_texts)

def text_to_vector(text, vocab):
    text_counts = Counter(text)
    return [text_counts.get(word, 0) for word in vocab]

features = np.array([text_to_vector(text, vocab) for text in processed_texts])

# Step 4: Create Target Variable
labels = np.array(sentiments)

# Step 5: Split the Dataset
# With only four samples, test_size=0.2 leaves a single test example
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Step 6: Initialize Parameters
weights = np.zeros(X_train.shape[1])
bias = 0.0

# Step 7: Define the Sigmoid Function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Step 8 & 9: Compute Prediction and Calculate Loss
def compute_loss(y, y_hat):
    """Binary cross-entropy; y_hat is clipped to avoid log(0)."""
    m = y.shape[0]
    y_hat = np.clip(y_hat, 1e-15, 1 - 1e-15)
    return -(1/m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 10: Gradient Descent
def update_weights(X, y, weights, bias, learning_rate):
    m = X.shape[0]
    y_hat = sigmoid(np.dot(X, weights) + bias)
    d_weight = (1/m) * np.dot(X.T, (y_hat - y))
    d_bias = (1/m) * np.sum(y_hat - y)
    weights -= learning_rate * d_weight
    bias -= learning_rate * d_bias
    return weights, bias

# Step 11: Training the Model
def train(X, y, weights, bias, learning_rate, epochs):
    for epoch in range(epochs):
        weights, bias = update_weights(X, y, weights, bias, learning_rate)
        y_hat = sigmoid(np.dot(X, weights) + bias)
        loss = compute_loss(y, y_hat)
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss {loss}")
    return weights, bias

# Train the model
weights, bias = train(X_train, y_train, weights, bias, learning_rate=0.01, epochs=1000)

# Step 12 & 13: Make Predictions and Evaluate the Model
def predict(X, weights, bias):
    """Threshold the sigmoid probabilities at 0.5 to get 0/1 labels."""
    return [1 if p > 0.5 else 0 for p in sigmoid(np.dot(X, weights) + bias)]

y_pred = predict(X_test, weights, bias)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# This code sets up a simple logistic regression model for sentiment analysis.
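As a quick usage check, and the simplest form of the optional deployment step, the trained model can score new raw text by chaining the same helpers. A small sketch; the input sentence is made up, and words unseen during training simply contribute nothing to the feature vector:

def predict_sentiment(raw_text, weights, bias, vocab):
    # Preprocess a raw string and return a 0/1 sentiment prediction
    tokens = preprocess(raw_text)
    vector = np.array(text_to_vector(tokens, vocab))
    probability = sigmoid(np.dot(vector, weights) + bias)
    return 1 if probability > 0.5 else 0

print(predict_sentiment("I really love this", weights, bias, vocab))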