Supervised ML and Sentiment Analysis

Supervised machine learning (ML) is a type of machine learning where the algorithm is trained on a labeled dataset, which means that the input data is paired with corresponding output labels. The goal is for the algorithm to learn a mapping from inputs to outputs so that it can make predictions on new, unseen data. Sentiment analysis is a natural language processing (NLP) task where the goal is to determine the sentiment expressed in a piece of text, such as positive, negative, or neutral.
Steps for Sentiment Analysis
Here’s how supervised ML is commonly applied to sentiment analysis:
- Dataset Preparation:
  - Collect a labeled dataset for sentiment analysis, containing text samples and corresponding sentiment labels (positive, negative, neutral).
- Vocabulary & Feature Extraction:
  - Create a vocabulary from the text data. Use techniques like TF-IDF or word embeddings for feature extraction.
- Negative and Positive Frequencies:
  - Analyze the dataset to determine the frequencies of negative and positive sentiments.
- Feature Extraction with Frequencies:
  - Incorporate sentiment frequencies into the feature extraction process. For example, you might consider the frequency of certain positive or negative words as features.
- Preprocessing:
  - Preprocess the text data, including tasks like lowercasing, removing stop words, and handling special characters.
- Visualizing Word Frequencies:
  - Visualize the word frequencies in the dataset to gain insights into the distribution of words associated with positive and negative sentiments.
- Logistic Regression Overview:
  - Understand the logistic regression algorithm, which is a supervised learning algorithm suitable for binary classification tasks like sentiment analysis.
- Model Training:
  - Train a logistic regression model on the preprocessed and feature-extracted dataset. The model learns to map text features to sentiment labels.
- Build a Sentiment Analysis Classifier using Logistic Regression:
  - Implement and train a logistic regression model specifically tailored for sentiment analysis. Ensure the model is capable of making predictions on new text data.
- Model Evaluation:
  - Evaluate the performance of the logistic regression sentiment analysis model using metrics such as accuracy, precision, recall, F1 score, and confusion matrix.
- Fine-Tuning (Optional):
  - Fine-tune the logistic regression model by adjusting hyperparameters or exploring advanced techniques like regularization to optimize performance.
- Prediction on New Data:
  - Use the trained logistic regression model to predict sentiments in new, unseen text data.
- Deployment:
  - Deploy the logistic regression sentiment analysis classifier in a real-world application where it can automatically analyze sentiments in text data.
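The "Negative and Positive Frequencies" and "Feature Extraction with Frequencies" steps can be sketched without any ML library. The following is a minimal illustration, not the only way to do it: the helper names `tokenize`, `build_freqs`, and `extract_features`, the regex tokenizer, and the four-sentence corpus are all assumptions made for this example. Each text is reduced to three numbers: a bias term, the summed positive-class counts of its words, and the summed negative-class counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_freqs(texts, labels):
    """Count how often each word occurs under each sentiment label."""
    freqs = Counter()
    for text, label in zip(texts, labels):
        for word in tokenize(text):
            freqs[(word, label)] += 1
    return freqs

def extract_features(text, freqs):
    """Return [bias, positive frequency sum, negative frequency sum]."""
    words = set(tokenize(text))  # count each distinct word once
    pos = sum(freqs[(w, "positive")] for w in words)
    neg = sum(freqs[(w, "negative")] for w in words)
    return [1, pos, neg]

# Tiny hand-made corpus (hypothetical, for illustration only)
texts = ["I love this!", "This is great.", "Hate it!", "Not good at all."]
labels = ["positive", "positive", "negative", "negative"]

freqs = build_freqs(texts, labels)
print(extract_features("I love this great product", freqs))  # [1, 5, 0]
```

A feature vector like this can be passed to a logistic regression model in place of TF-IDF features. Words never seen in training, such as "product" here, simply contribute zero to both sums.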
Create a Synthetic Dataset for Sentiment Analysis
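We can generate a simple synthetic dataset for sentiment analysis directly in code. The generator below draws from only six fixed sentences, so the result is useful for testing the pipeline end to end, not for measuring real-world accuracy. Here’s an example: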
import pandas as pd
import numpy as np

# Function to generate synthetic sentiment data
def generate_sentiment_data(num_samples=1000):
    np.random.seed(42)  # fixed seed so the dataset is reproducible
    positive_samples = ["I love this!", "This is great.", "Amazing product."]
    negative_samples = ["Terrible experience.", "Hate it!", "Not good at all."]
    positive_data = np.random.choice(positive_samples, size=num_samples // 2)
    negative_data = np.random.choice(negative_samples, size=num_samples // 2)
    data = np.concatenate([positive_data, negative_data])
    labels = np.array(['positive'] * (num_samples // 2) + ['negative'] * (num_samples // 2))
    # Shuffle the data (use len(data) so an odd num_samples cannot index out of range)
    indices = np.arange(len(data))
    np.random.shuffle(indices)
    return pd.DataFrame({'text': data[indices], 'sentiment': labels[indices]})

# Generate synthetic sentiment data
sentiment_data = generate_sentiment_data(num_samples=1000)
# Save the dataset to a CSV file
sentiment_data.to_csv('sentiment_data.csv', index=False)
# Display the first few rows of the dataset
print(sentiment_data.head())
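Before training on a dataset like this, it can help to check the class balance and how many texts are actually distinct. The sketch below rebuilds a smaller frame with the same assumed columns (`text`, `sentiment`) so that it runs on its own. The 50-per-class size and the variable names are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a small frame shaped like sentiment_data (assumed columns: 'text', 'sentiment')
rng = np.random.default_rng(42)
positive_samples = ["I love this!", "This is great.", "Amazing product."]
negative_samples = ["Terrible experience.", "Hate it!", "Not good at all."]
df = pd.DataFrame({
    "text": np.concatenate([
        rng.choice(positive_samples, size=50),
        rng.choice(negative_samples, size=50),
    ]),
    "sentiment": ["positive"] * 50 + ["negative"] * 50,
})

# Class balance: how many rows carry each label
class_counts = df["sentiment"].value_counts()
# Distinct texts versus repeated rows
n_unique = df["text"].nunique()
n_duplicates = int(df.duplicated(subset="text").sum())

print(class_counts)
print("Distinct texts:", n_unique)
print("Duplicate rows:", n_duplicates)
```

Because nearly every row is a repeat of one of six sentences, a train/test split of such data will place identical sentences on both sides, so scores measured on it are likely to look far better than they would on real text.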
Basic Example using Logistic Regression
Here’s a basic example using logistic regression and TF-IDF for feature extraction:
# Step 1: Dataset Preparation
import pandas as pd
# Assume you have a CSV file with 'text' and 'sentiment' columns
df = pd.read_csv('sentiment_data.csv')
# Step 2: Vocabulary & Feature Extraction with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['text'])
y = df['sentiment']
# Step 3: Negative and Positive Frequencies
negative_count = df[df['sentiment'] == 'negative'].shape[0]
positive_count = df[df['sentiment'] == 'positive'].shape[0]
# Step 4: Feature Extraction with Frequencies (Example: Word Counts)
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(df['text'])
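# Step 5: Preprocessing (Example: Lowercasing)
# Note: TfidfVectorizer and CountVectorizer already lowercase text by default
# (lowercase=True), so this step only affects the word-frequency plot below.
df['text'] = df['text'].str.lower()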
# Step 6: Visualizing Word Frequencies
import matplotlib.pyplot as plt
word_freq = df['text'].str.split(expand=True).stack().value_counts()
word_freq.plot(kind='bar', title='Word Frequencies')
plt.show()
# Step 7: Logistic Regression Overview
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
# Step 8: Model Training
# Build a Sentiment Analysis Classifier using Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
# Step 9: Model Evaluation
y_pred = logistic_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Step 10: Prediction on New Data
new_data = ["I love this product!", "This movie is terrible."]
new_data_features = tfidf_vectorizer.transform(new_data)
predictions = logistic_model.predict(new_data_features)
print("Predictions for new data:", predictions)
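The example above fits the TF-IDF vectorizer on the full dataset before splitting, and it does not cover the fine-tuning step or the confusion matrix mentioned in the step list. One possible way to address all three is sketched below, using scikit-learn's `Pipeline` and `GridSearchCV`. The inline dataset, the grid of `C` values, the `f1_macro` scoring choice, and `cv=3` are assumptions made for this illustration, not recommended settings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Small inline dataset (hypothetical) so the sketch runs without sentiment_data.csv
texts = ["I love this!", "This is great.", "Amazing product.",
         "Terrible experience.", "Hate it!", "Not good at all."] * 10
labels = (["positive"] * 3 + ["negative"] * 3) * 10
df = pd.DataFrame({"text": texts, "sentiment": labels})

# Split the raw text first so the vectorizer is fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"],
    test_size=0.2, random_state=42, stratify=df["sentiment"],
)

# Vectorizer and classifier are chained so both are refitted inside each CV fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# C is the inverse regularization strength: smaller C means stronger regularization
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Evaluate the best model found by the search on the held-out test set
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=["negative", "positive"])
print("Best C:", search.best_params_["clf__C"])
print("Confusion matrix (rows = true, columns = predicted):\n", cm)
```

Keeping the vectorizer inside the pipeline means new text can be passed directly to `search.predict`, with no separate `transform` call.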