Comprehensive Guide to Semantic Chunking with Free Python Code

In the ever-evolving landscape of natural language processing (NLP) and artificial intelligence (AI), semantic chunking has emerged as a powerful technique for breaking down large pieces of text into smaller, more manageable chunks based on their meaning.

This guide delves into the intricacies of semantic chunking, providing you with a comprehensive understanding of the technique, practical examples, and free Python code to implement it.

Whether you’re working on text analysis, AI, or software development, semantic chunking can significantly enhance your projects.



A Technique That Requires Chunking

Semantic chunking is indispensable for various applications, from improving educational methods to enhancing AI models.

One of the main challenges in generative AI models like ChatGPT and Gemini is hallucination, where the AI provides incorrect or irrelevant responses.

For example, when asked to count the number of ‘m’s in “banana,” a generative model might confidently report that the letter appears, even though “banana” contains no ‘m’ at all.

To address this, Retrieval-Augmented Generation (RAG) comes into play.

RAG enhances the accuracy and reliability of generative AI by fetching facts from external sources.

When you provide the model with an external resource, such as an essay, and then ask it related questions, it analyzes the text and retrieves specific information rather than hallucinating answers.

This technique relies heavily on efficient chunking to split the text into manageable chunks for quick and accurate information retrieval.
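To make that retrieval step concrete, here is a minimal sketch (an illustration of the idea, not the internals of any particular product): the document is chunked, every chunk is embedded, and the chunk closest to the question's embedding is handed to the model as context. The helper names and the model choice are assumptions for this example.

```python
import openai
import numpy as np

def embed(texts):
    # Embed a list of strings with OpenAI (requires OPENAI_API_KEY to be set).
    response = openai.embeddings.create(input=texts, model="text-embedding-3-small")
    return np.array([item.embedding for item in response.data])

def retrieve_best_chunk(question, chunks):
    # Cosine similarity between the question and every chunk embedding.
    chunk_vectors = embed(chunks)
    question_vector = embed([question])[0]
    sims = chunk_vectors @ question_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
    )
    return chunks[int(np.argmax(sims))]

chunks = ["The essay argues that...", "Its second section covers..."]  # from your chunker
context = retrieve_best_chunk("What does the essay argue?", chunks)
# The retrieved chunk is then placed in the prompt so the model answers
# from the source text instead of guessing.
```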

Chunking Methods

Semantic chunking is one of many chunking methods used in text analysis and AI.

Here are some of the key methods:

  1. By Character: Splits text into fixed-size runs of characters; simple and fast, though boundaries can fall mid-word or mid-sentence.
  2. By Character + SimplerLLM: Chunks text by characters while preserving sentence structure for meaningful segments.
  3. By Token: Segments text into tokens, such as words or subwords, common in NLP.
  4. By Paragraph: Chunks text by paragraphs, maintaining the text structure.
  5. Recursive Chunking: Repeatedly breaks down data into smaller chunks, often used in hierarchical structures.
  6. Semantic Chunking: Groups text based on meaning, crucial for understanding the context.
  7. Agentic Chunking: Delegates the splitting decision to an LLM agent, which groups statements the way a human reader would.

Each method has its unique applications and advantages.

However, semantic chunking stands out for tasks requiring deep contextual understanding.
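For contrast, here is a minimal sketch of two of the simpler methods from the list above, fixed-size character chunking and paragraph chunking (illustrative helpers, not taken from any particular library):

```python
def chunk_by_characters(text, size=500):
    # Fixed-size character windows; fast and predictable, but can cut mid-sentence.
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_paragraphs(text):
    # Split on blank lines, keeping the document's own structure.
    return [p.strip() for p in text.split('\n\n') if p.strip()]
```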

How Does Semantic Chunking Work?

Semantic chunking splits a text wherever its meaning shifts.

The process involves splitting the text into sentences, converting each sentence into a vector embedding, and calculating the cosine similarity between consecutive embeddings.

A threshold is then set (e.g., 0.8): as long as the cosine similarity between consecutive segments stays above this threshold, they remain in the same chunk; when it drops below the threshold, a split occurs.

Theoretical Example

Consider a theoretical example with sentences 1, 2, and 3.

If the cosine similarity between sentences 1 and 2 is 0.85 (above the 0.8 threshold), the two sentences are grouped into the same chunk.

If the similarity between sentences 2 and 3 is only 0.3 (below 0.8), a split occurs between them, and sentence 3 starts a new chunk.
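To see that decision in isolation, here is the comparison on two toy vectors (the numbers are invented for illustration; real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_1 = np.array([0.9, 0.1, 0.3])   # toy embedding of sentence 1
emb_2 = np.array([0.8, 0.2, 0.35])  # toy embedding of sentence 2

sim = cosine_sim(emb_1, emb_2)       # about 0.99 for these toy vectors
print(sim >= 0.8)                    # True, so the two sentences stay in one chunk
```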

Prototype

Here is a prototype of a semantic chunker based on the above algorithm, with two tweaks: adjacent sentences are combined before embedding, and the split threshold is a percentile of the observed distances rather than a fixed value:

```python
import re
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def chunk_text(text, breakpoint_percentile_threshold=80):
    # Split into sentences, then embed each sentence together with its
    # neighbors so the embedding captures local context.
    single_sentences_list = _split_sentences(text)
    combined_sentences = _combine_sentences(single_sentences_list)
    embeddings = convert_to_vector(combined_sentences)
    distances = _calculate_cosine_distances(embeddings)

    # A breakpoint is any gap whose cosine distance is above the given
    # percentile of all gaps (distance = 1 - similarity, so a high distance
    # means low similarity between neighboring sentences).
    breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
    indices_above_thresh = [i for i, distance in enumerate(distances) if distance > breakpoint_distance_threshold]

    # Rebuild chunks from the original (uncombined) sentences.
    chunks = []
    start_index = 0
    for index in indices_above_thresh:
        chunks.append(' '.join(single_sentences_list[start_index:index + 1]))
        start_index = index + 1
    if start_index < len(single_sentences_list):
        chunks.append(' '.join(single_sentences_list[start_index:]))
    return chunks

def _split_sentences(text):
    # Naive sentence splitter: break after ., ?, or ! followed by whitespace.
    return re.split(r'(?<=[.?!])\s+', text)

def _combine_sentences(sentences):
    # Attach each sentence to its immediate neighbors (a sliding window of
    # three) so single short sentences don't produce noisy embeddings.
    combined_sentences = []
    for i in range(len(sentences)):
        combined_sentence = sentences[i]
        if i > 0:
            combined_sentence = sentences[i - 1] + ' ' + combined_sentence
        if i < len(sentences) - 1:
            combined_sentence += ' ' + sentences[i + 1]
        combined_sentences.append(combined_sentence)
    return combined_sentences

def convert_to_vector(texts):
    # Requires OPENAI_API_KEY to be set in the environment.
    try:
        response = openai.embeddings.create(
            input=texts,
            model="text-embedding-3-small"
        )
        return np.array([item.embedding for item in response.data])
    except Exception as e:
        print("An error occurred:", e)
        return np.array([])

def _calculate_cosine_distances(embeddings):
    # Cosine distance between each pair of consecutive sentence embeddings.
    distances = []
    for i in range(len(embeddings) - 1):
        similarity = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
        distances.append(1 - similarity)
    return distances

text = """Your_Input_Text"""
chunks = chunk_text(text)
print("Chunks:", chunks)
```

This prototype splits the text into sentences, combines adjacent sentences, converts them into vector embeddings, calculates cosine distances, and places chunk boundaries where the distance exceeds the percentile threshold.

You can experiment with the percentile threshold to see how it affects chunking.
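Since the percentile is a parameter of `chunk_text` above, comparing settings takes one line per value (using the placeholder `text` from the prototype):

```python
# Lower percentiles create more breakpoints, and therefore more, smaller chunks.
for pct in (60, 70, 80, 90):
    print(f"percentile={pct}: {len(chunk_text(text, pct))} chunks")
```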

Libraries That Have Semantic Chunkers Built-in

For those who prefer ready-made tools, several libraries offer semantic chunking functions:

  1. LangChain: An open-source library for building language model applications; its experimental package ships a semantic chunker (see the sketch after this list).
  2. LlamaIndex: Provides efficient indexing and retrieval for large-scale language model applications, integrating semantic chunking for improved search precision.
  3. SimplerLLM: A forthcoming library that will include advanced chunking functions.

These libraries vary in their approach and customization options, making it essential to choose one that aligns with your project’s requirements.
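As an example of the ready-made route, here is roughly how LangChain's experimental semantic chunker is used (check the current langchain_experimental documentation, as this API may change):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# SemanticChunker embeds sentences and splits at semantic breakpoints,
# much like the prototype above.
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([text])
print(len(docs), "chunks")
```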

Let’s Visualize the Results

Visualizing the results of semantic chunking can provide valuable insights.

Here’s additional code to read a blog post using SimplerLLM and plot the cosine similarities:

```python
from SimplerLLM.tools.generic_loader import load_content
import matplotlib.pyplot as plt
import numpy as np

def plot_cosine_similarities(text):
    # Reuses _split_sentences, _combine_sentences, convert_to_vector, and
    # _calculate_cosine_distances from the prototype above.
    sentences = _split_sentences(text)
    combined_sentences = _combine_sentences(sentences)
    embeddings = convert_to_vector(combined_sentences)
    similarities = [1 - d for d in _calculate_cosine_distances(embeddings)]

    # Breakpoints are the lowest-similarity gaps: the 20th percentile of
    # similarities mirrors the 80th percentile of distances used above.
    breakpoint_threshold = np.percentile(similarities, 20)

    plt.figure(figsize=(10, 5))
    plt.plot(similarities, marker='o', linestyle='-', color='blue', label='Cosine Similarity')
    for i, similarity in enumerate(similarities):
        if similarity <= breakpoint_threshold:
            plt.plot(i, similarity, marker='o', color='red')
    plt.title('Cosine Similarities Between Consecutive Sentences')
    plt.xlabel('Sentence Pair Index')
    plt.ylabel('Cosine Similarity')
    plt.grid(True)
    plt.legend()
    plt.show()

load = load_content("https://learnwithhasan.com/how-to-build-a-semantic-plagiarism-detector/")
text = load.content
plot_cosine_similarities(text)
```

This code reads a blog post, calculates cosine similarities, and plots the results.

Similarities that dip below the breakpoint threshold are highlighted in red; these low-similarity gaps are where the chunker will split the text.

Semantic chunking is a versatile and powerful technique in text analysis and AI.

Breaking down text based on meaning enhances the efficiency and accuracy of information retrieval and natural language understanding.

Whether you’re developing AI models, analyzing large datasets, or exploring NLP, semantic chunking offers significant benefits.

With the provided Python code and practical insights, you can start implementing and experimenting with semantic chunking in your projects today.


This article has provided a comprehensive guide to semantic chunking, complete with theoretical explanations, practical examples, and free Python code.

By understanding and utilizing semantic chunking, you can enhance your text analysis and AI applications, making them more efficient and accurate.

