Natural Language Processing
- Natural Language Processing with Deep Learning
- NLP with Classification and Vector Spaces
- Logistic Regression [Simply Explained]
- Supervised ML and Sentiment Analysis
- Sentiment Analysis with Logistic Regression
- Logistic Regression Model for Sentiment Analysis from Scratch
- Sentiment Analysis using the Naive Bayes algorithm.
- Naive Bayes classifier for sentiment analysis from scratch
- Vector Space Models
- Implement a Vector Space Model from Scratch
Implement a Vector Space Model from Scratch

Implementing a basic Vector Space Model (VSM) from scratch involves several steps, mainly focusing on text processing, creating a term-document matrix, and then using this matrix for various analyses like similarity computations. We’ll go through these steps using Python, without relying on external machine learning libraries.
Step 1: Sample Data Preparation
First, you need a set of documents (texts) to work with. For simplicity, let’s consider a small dataset of sample sentences.
documents = [ "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun" ]
Step 2: Text Preprocessing
This includes converting text to lowercase, removing punctuation, and tokenizing (splitting text into words).
import string
def preprocess(document):
# Convert text to lowercase
document = document.lower()
# Remove punctuation
document = document.translate(str.maketrans('', '', string.punctuation))
return document.split()
# Preprocess each document
processed_docs = [preprocess(doc) for doc in documents]
Step 3: Building the Vocabulary
Create a list of all unique words across all documents.
vocabulary = set()
for doc in processed_docs:
vocabulary.update(doc)
vocabulary = list(vocabulary)
Step 4: Creating the Term-Document Matrix
This matrix represents the frequency of each word in each document. This function, create_term_doc_matrix, generates a term-document matrix from a given list of documents (docs) and a vocabulary (vocab). The resulting matrix represents the frequency of each term in each document.
import numpy as np
def create_term_doc_matrix(docs, vocab):
term_doc_matrix = np.zeros((len(vocab), len(docs)))
for i, word in enumerate(vocab):
for j, doc in enumerate(docs):
term_doc_matrix[i, j] = doc.count(word)
return term_doc_matrix
term_doc_matrix = create_term_doc_matrix(processed_docs, vocabulary)
- Rows: Each row in the matrix corresponds to a unique term (word) in the vocabulary (
vocab). The order of the rows is determined by the order of terms in the vocabulary list. - Columns: Each column in the matrix corresponds to a document in the corpus (
docs). The order of the columns is determined by the order of documents in thedocslist.
The element at position [i, j] represents the frequency of the term at index i in the vocabulary within the document at index j.
For example, if the term “apple” is at index 0 in your vocabulary, and the document “Doc1” is at index 1 in your document list, then term_doc_matrix[0, 1] would represent the frequency of the term “apple” in “Doc1”.
Step 5: Computing Similarities
We can use cosine similarity (which we implemented earlier) to compute similarities between documents.
def cosine_similarity_matrix(matrix):
similarity_matrix = np.zeros((matrix.shape[1], matrix.shape[1]))
for i in range(matrix.shape[1]):
for j in range(matrix.shape[1]):
similarity_matrix[i, j] = cosine_similarity(matrix[:, i], matrix[:, j])
return similarity_matrix
similarity_matrix = cosine_similarity_matrix(term_doc_matrix)
This function, cosine_similarity_matrix, calculates the cosine similarity between each pair of columns in a given matrix. The matrix is assumed to represent document vectors where each column corresponds to the vector representation of a document. The cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, often used to compare the similarity between documents.
Step 6: Querying the VSM
Let’s say you want to find the document most similar to a query.
def query_vsm(query, docs, vocab, term_doc_matrix):
query_processed = preprocess(query)
query_vec = np.zeros(len(vocab))
for word in query_processed:
if word in vocab:
query_vec[vocab.index(word)] += 1
similarities = [cosine_similarity(query_vec, term_doc_matrix[:, i]) for i in range(term_doc_matrix.shape[1])]
most_similar_doc_index = np.argmax(similarities)
return docs[most_similar_doc_index]
query = "bright blue sky"
most_similar_document = query_vsm(query, documents, vocabulary, term_doc_matrix)
This basic VSM implementation includes fundamental text processing, term-document matrix creation, similarity computations, and a simple query mechanism. Remember, this is a rudimentary model and lacks many refinements found in real-world applications, such as handling synonyms, advanced tokenization, and normalization techniques.