Supervised ML and Sentiment Analysis

Supervised machine learning (ML) is a type of machine learning where the algorithm is trained on a labeled dataset, which means that the input data is paired with corresponding output labels. The goal is for the algorithm to learn a mapping from inputs to outputs so that it can make predictions on new, unseen data. Sentiment analysis is a natural language processing (NLP) task where the goal is to determine the sentiment expressed in a piece of text, such as positive, negative, or neutral.

Steps for Sentiment Analysis

Here’s how supervised ML is commonly applied to sentiment analysis:

Dataset Preparation:
- Collect a labeled dataset for sentiment analysis, containing text samples and corresponding sentiment labels (positive, negative, neutral).
Vocabulary & Feature Extraction:
- Create a vocabulary from the text data. Use techniques like TF-IDF or word embeddings for feature extraction.
Negative and Positive Frequencies:
- Analyze the dataset to determine the frequencies of negative and positive sentiments.
Feature Extraction with Frequencies:
- Incorporate sentiment frequencies into the feature extraction process. For example, you might consider the frequency of certain positive or negative words as features.
Preprocessing:
- Preprocess the text data, including tasks like lowercasing, removing stop words, and handling special characters.
Visualizing Word Frequencies:
- Visualize the word frequencies in the dataset to gain insights into the distribution of words associated with positive and negative sentiments.
Logistic Regression Overview:
- Understand the logistic regression algorithm, a supervised learning algorithm suited to binary classification, such as labeling text as positive or negative. It can also be extended to more than two classes (for example, to include a neutral label).
Model Training:
- Train a logistic regression model on the preprocessed and feature-extracted dataset. The model learns to map text features to sentiment labels.
Build a Sentiment Analysis Classifier using Logistic Regression:
- Implement and train a logistic regression model specifically tailored for sentiment analysis. Ensure the model is capable of making predictions on new text data.
Model Evaluation:
- Evaluate the performance of the logistic regression sentiment analysis model using metrics such as accuracy, precision, recall, F1 score, and confusion matrix.
Fine-Tuning (Optional):
- Fine-tune the logistic regression model by adjusting hyperparameters or exploring advanced techniques like regularization to optimize performance.
Prediction on New Data:
- Use the trained logistic regression model to predict sentiments in new, unseen text data.
Deployment:
- Deploy the logistic regression sentiment analysis classifier in a real-world application where it can automatically analyze sentiments in text data.
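
The frequency-based feature steps and the logistic regression steps above can be sketched without any libraries. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`tokenize`, `build_freqs`, `extract_features`, `train`), the four-sentence corpus, the regex tokenizer, and the learning rate and epoch count are all invented for the example, and only two classes (positive and negative) are handled. Each text becomes a three-number vector (a bias term plus summed positive and negative word counts), and the weights are fitted with batch gradient descent on the log loss.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes (a deliberately simple tokenizer).
    return re.findall(r"[a-z']+", text.lower())

def build_freqs(texts, labels):
    # Count how often each (word, label) pair occurs in the labeled corpus.
    freqs = Counter()
    for text, label in zip(texts, labels):
        for word in tokenize(text):
            freqs[(word, label)] += 1
    return freqs

def extract_features(text, freqs):
    # Feature vector: [bias, summed positive counts, summed negative counts],
    # taken over the unique words of the text.
    words = set(tokenize(text))
    pos = sum(freqs[(w, "positive")] for w in words)
    neg = sum(freqs[(w, "negative")] for w in words)
    return [1.0, float(pos), float(neg)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights):
    # Estimated probability that the text is positive.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(X, y, lr=0.1, epochs=200):
    # Batch gradient descent on the mean log loss (assumed hyperparameters).
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = [predict_proba(x, weights) - yi for x, yi in zip(X, y)]
        for j in range(len(weights)):
            grad = sum(e * x[j] for e, x in zip(errors, X)) / len(X)
            weights[j] -= lr * grad
    return weights

# Toy labeled corpus (illustration only).
texts = ["I love this!", "This is great.", "Hate it!", "Not good at all."]
labels = ["positive", "positive", "negative", "negative"]

freqs = build_freqs(texts, labels)
X = [extract_features(t, freqs) for t in texts]
y = [1.0 if label == "positive" else 0.0 for label in labels]
weights = train(X, y)

results = {}
for text in ["This is great, I love it", "Not good, hate it"]:
    p = predict_proba(extract_features(text, freqs), weights)
    results[text] = "positive" if p >= 0.5 else "negative"
    print(text, "->", results[text])
```

Raw counts work here because the corpus is tiny. On a real dataset the counts can grow large, so scaling the features or using a library implementation is usually the safer choice.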

Create a Synthetic Dataset for Sentiment Analysis

We can create a simple synthetic dataset for sentiment analysis in the code itself. Here’s an example:

import pandas as pd
import numpy as np

# Function to generate synthetic sentiment data
def generate_sentiment_data(num_samples=1000):
    np.random.seed(42)  # fixed seed so the dataset is reproducible
    positive_samples = ["I love this!", "This is great.", "Amazing product."]
    negative_samples = ["Terrible experience.", "Hate it!", "Not good at all."]

    # Draw an equal number of positive and negative samples
    half = num_samples // 2
    positive_data = np.random.choice(positive_samples, size=half)
    negative_data = np.random.choice(negative_samples, size=half)

    data = np.concatenate([positive_data, negative_data])
    labels = np.array(['positive'] * half + ['negative'] * half)

    # Shuffle the data (len(data) avoids an index error when num_samples is odd)
    indices = np.arange(len(data))
    np.random.shuffle(indices)

    return pd.DataFrame({'text': data[indices], 'sentiment': labels[indices]})

# Generate synthetic sentiment data
sentiment_data = generate_sentiment_data(num_samples=1000)

# Save the dataset to a CSV file
sentiment_data.to_csv('sentiment_data.csv', index=False)

# Display the first few rows of the dataset
print(sentiment_data.head())
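
Before features are extracted from text like this, it is usually cleaned as described in the preprocessing step. The sketch below shows one possible version using only the standard library. The `preprocess` helper and the tiny `STOP_WORDS` set are assumptions made for illustration; real projects typically use a much larger stop-word list. Note that "not" is deliberately kept, since dropping negations can flip the sentiment of a sentence.

```python
import re

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "at", "i"}

def preprocess(text):
    # Lowercase, replace anything that is not a letter or whitespace,
    # split on whitespace, and drop stop words.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

samples = ["I love this!", "Not good at all."]
cleaned = [preprocess(s) for s in samples]
print(cleaned)
```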

Basic Example using Logistic Regression

Here’s a basic example using logistic regression and TF-IDF for feature extraction:

# Step 1: Dataset Preparation
import pandas as pd

# Assume you have a CSV file with 'text' and 'sentiment' columns
df = pd.read_csv('sentiment_data.csv')

# Step 2: Vocabulary & Feature Extraction with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['text'])
y = df['sentiment']

# Step 3: Negative and Positive Frequencies
negative_count = df[df['sentiment'] == 'negative'].shape[0]
positive_count = df[df['sentiment'] == 'positive'].shape[0]

# Step 4: Feature Extraction with Frequencies (Example: Word Counts)
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(df['text'])

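# Step 5: Preprocessing (Example: Lowercasing)
# Note: TfidfVectorizer and CountVectorizer lowercase text by default, so this
# step mainly affects the word-frequency plot below.
df['text'] = df['text'].str.lower()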

# Step 6: Visualizing Word Frequencies
import matplotlib.pyplot as plt

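# Strip punctuation first so that "this!" and "this." are counted as the same word
words = df['text'].str.replace(r'[^\w\s]', '', regex=True).str.split(expand=True).stack()
word_freq = words.value_counts()
word_freq.plot(kind='bar', title='Word Frequencies')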
plt.show()

# Step 7: Logistic Regression Overview
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Step 8: Model Training
# Build a Sentiment Analysis Classifier using Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Step 9: Model Evaluation
y_pred = logistic_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Step 10: Prediction on New Data
new_data = ["I love this product!", "This movie is terrible."]
new_data_features = tfidf_vectorizer.transform(new_data)
predictions = logistic_model.predict(new_data_features)
print("Predictions for new data:", predictions)
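
The accuracy, precision, recall, and F1 score named in the evaluation step all derive from the four counts of a binary confusion matrix. As a library-free sketch, they can be computed by hand as shown below. The `binary_metrics` helper is hypothetical (it is not part of scikit-learn), and the two label lists are made-up illustration data, not results from the model above.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    # Tally the four confusion-matrix cells for the chosen positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration only.
y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice, `sklearn.metrics.confusion_matrix` produces the same counts, and `classification_report` reports precision, recall, and F1 for each class.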

Natural Language Processing