Implement a Vector Space Model from Scratch

Implementing a basic Vector Space Model (VSM) from scratch involves several steps, mainly focusing on text processing, creating a term-document matrix, and then using this matrix for various analyses like similarity computations. We’ll go through these steps using Python, without relying on external machine learning libraries.

Step 1: Sample Data Preparation

First, you need a set of documents (texts) to work with. For simplicity, let’s consider a small dataset of sample sentences.

documents = [
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
]

Step 2: Text Preprocessing

This includes converting text to lowercase, removing punctuation, and tokenizing (splitting text into words).

import string

def preprocess(document):
    # Convert text to lowercase
    document = document.lower()
    # Remove punctuation
    document = document.translate(str.maketrans('', '', string.punctuation))
    return document.split()

# Preprocess each document
processed_docs = [preprocess(doc) for doc in documents]

Step 3: Building the Vocabulary

Create a list of all unique words across all documents.

vocabulary = set()
for doc in processed_docs:
    vocabulary.update(doc)
vocabulary = list(vocabulary)

Step 4: Creating the Term-Document Matrix

This matrix represents the frequency of each word in each document. This function, create_term_doc_matrix, generates a term-document matrix from a given list of documents (docs) and a vocabulary (vocab). The resulting matrix represents the frequency of each term in each document.

import numpy as np

def create_term_doc_matrix(docs, vocab):
    term_doc_matrix = np.zeros((len(vocab), len(docs)))
    for i, word in enumerate(vocab):
        for j, doc in enumerate(docs):
            term_doc_matrix[i, j] = doc.count(word)
    return term_doc_matrix

term_doc_matrix = create_term_doc_matrix(processed_docs, vocabulary)
  • Rows: Each row in the matrix corresponds to a unique term (word) in the vocabulary (vocab). The order of the rows is determined by the order of terms in the vocabulary list.
  • Columns: Each column in the matrix corresponds to a document in the corpus (docs). The order of the columns is determined by the order of documents in the docs list.

The element at position [i, j] represents the frequency of the term at index i in the vocabulary within the document at index j.

For example, if the term “apple” is at index 0 in your vocabulary, and the document “Doc1” is at index 1 in your document list, then term_doc_matrix[0, 1] would represent the frequency of the term “apple” in “Doc1”.

Step 5: Computing Similarities

We can use cosine similarity (which we implemented earlier) to compute similarities between documents.

def cosine_similarity_matrix(matrix):
    similarity_matrix = np.zeros((matrix.shape[1], matrix.shape[1]))
    for i in range(matrix.shape[1]):
        for j in range(matrix.shape[1]):
           similarity_matrix[i, j] = cosine_similarity(matrix[:, i], matrix[:, j])
    return similarity_matrix

similarity_matrix = cosine_similarity_matrix(term_doc_matrix)

This function, cosine_similarity_matrix, calculates the cosine similarity between each pair of columns in a given matrix. The matrix is assumed to represent document vectors where each column corresponds to the vector representation of a document. The cosine similarity is a measure of similarity between two non-zero vectors of an inner product space, often used to compare the similarity between documents.

Step 6: Querying the VSM

Let’s say you want to find the document most similar to a query.

def query_vsm(query, docs, vocab, term_doc_matrix):
    query_processed = preprocess(query)
    query_vec = np.zeros(len(vocab))
    for word in query_processed:
        if word in vocab:
            query_vec[vocab.index(word)] += 1
    similarities = [cosine_similarity(query_vec, term_doc_matrix[:, i]) for i in range(term_doc_matrix.shape[1])]
    most_similar_doc_index = np.argmax(similarities)
    return docs[most_similar_doc_index]

query = "bright blue sky"
most_similar_document = query_vsm(query, documents, vocabulary, term_doc_matrix)

This basic VSM implementation includes fundamental text processing, term-document matrix creation, similarity computations, and a simple query mechanism. Remember, this is a rudimentary model and lacks many refinements found in real-world applications, such as handling synonyms, advanced tokenization, and normalization techniques.