2023-09-11

From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques

Discover text chunking - the secret sauce behind accurate search results and smarter language models! By understanding how to effectively chunk text, we can improve the way we index documents, handle user queries, and utilize search results. Ready to uncover the secrets of text chunking?

Understanding Chunking

Chunking is a process that aims to embed a piece of content with as little noise as possible while maintaining semantic relevance[^2]. This process is particularly useful in semantic search, where we index a corpus of documents, each containing valuable information on a specific topic.

An effective chunking strategy ensures that search results accurately capture the essence of a user's query. If our chunks are too small or too large, retrieval may return imprecise results or miss relevant content entirely. As a rule of thumb, if a chunk of text makes sense to a human without the surrounding context, it will likely make sense to the language model as well[^2]. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.

Factors Influencing Chunking Strategy

There are three main factors to consider when determining a chunking strategy for a specific use case and application:

  1. The size of the texts to be indexed and chunked
  2. The length and complexity of user queries
  3. The utilization of the retrieved results in the application

Size of the Texts to be Indexed

The chunking unit and size should be adjusted according to the nature of the text. The chunk should be long enough to contain the relevant semantic load. For instance, individual words may not convey a specific message or piece of information, while putting an entire encyclopedia in one chunk may result in a chunk that is "about everything."

Length and Complexity of User Queries

  • Longer queries or those with greater complexity typically benefit from a smaller chunk length. This helps to narrow down the search space and improve the precision of the search results. Smaller chunks allow more focused matching against embeddings, reducing the impact of irrelevant parts within the query.
  • Shorter and simpler queries might not require chunking at all, as they can be processed as a single unit. Chunking may introduce unnecessary overhead in these cases, potentially hampering search performance.

Utilization of the Retrieved Results in the Application

In cases where search results are only an intermediate step in the application's pipeline, the chunk size can matter a great deal for seamless operation. For example, if results from multiple search queries form the input context for an LLM prompt, small chunks make it easier to fit all inputs within the maximum allowed context size of a given LLM. Conversely, if the search result is presented directly to the user, larger chunks may be more appropriate.

Chunking Methods

There are several methods for chunking text, each with its own advantages and disadvantages. The choice of method depends on the specific requirements of the use case and application.

Fixed-size (in characters) Overlapping Sliding Window

The Fixed-size overlapping sliding window method is a naive approach to text chunking: the text is divided into fixed-size pieces based on character count, which makes the method straightforward to implement. The overlap helps preserve the integrity of sentences or thoughts, so they are not cut in the middle - if one window truncates a thought, another window might contain it in full.

However, this method presents certain limitations. One significant drawback is the lack of precise control over the context size. Most language models operate on the basis of tokens rather than characters or words, making this method less efficient. The strict and fixed-size nature of the window might also result in severing words, sentences, or paragraphs in the middle, which could impede comprehension and disrupt the flow of information.

Furthermore, this method does not take semantics into account, providing no guarantee that the semantic unit of the text capturing a given idea or thought will be accurately encapsulated within a chunk. Consequently, one chunk may not be semantically distinct from another.
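
To make the mechanics concrete, here is a minimal sketch of a character-based overlapping sliding window in Python (the chunk size and overlap values are illustrative, not recommendations):

```python
def chunk_by_characters(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size, overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```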

Use Cases

The Fixed-size overlapping sliding window method can be beneficial in certain scenarios. It is especially useful in preliminary exploratory data analysis, where the goal is to obtain a general understanding of the text structure rather than a deep semantic analysis. Additionally, it could be employed in scenarios where the text data does not have a strong semantic structure, such as in certain types of raw data or logs.

However, for tasks that require semantic understanding and precise context, such as sentiment analysis, question-answering systems, or text summarization, more sophisticated text chunking methods would be more appropriate.

Summary

Pros:

  • Counting characters makes implementation easy
  • Overlap helps to avoid cutting sentences or thoughts in the middle - if one window cuts a thought, another may contain it in one piece

Cons:

  • No precise control over the context size - models operate on and measure text in tokens, not in characters or words
  • A strict, fixed-size window might cut words, sentences, or paragraphs in the middle
  • Doesn't take semantics into account - there is no guarantee that the semantic unit of text capturing a given idea or thought will land in a single chunk, with another chunk dedicated to another idea

Use cases:

  • Preliminary exploratory data analysis where a general understanding of the text is required
  • Scenarios where the text does not have a strong semantic structure, such as certain types of raw data or logs
  • Not recommended for tasks requiring semantic understanding and precise contexts like sentiment analysis, question-answering systems, or text summarization

Fixed-size (in tokens) Overlapping Sliding Window

The Fixed-size sliding window method in tokens is another approach to text chunking. Unlike the character-based method, this approach divides the text into chunks based on the count of tokens produced by a tokenizer, making it more aligned with the way language models operate.

In this method, the size of the context is controlled more precisely, as it works on tokens rather than characters. A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. Counting tokens makes cutting words in the middle somewhat less likely than counting characters, but the problem persists: the window can still sever sentences or thoughts, which disrupts the flow of information. Moreover, like the character-based method, this approach does not take semantics into account. There is no guarantee that a chunk accurately captures a unique thought or idea, so chunks may be semantically inconsistent.
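
A minimal sketch of the token-based variant, assuming the tiktoken tokenizer library (the encoding name and window sizes are illustrative):

```python
import tiktoken  # OpenAI's tokenizer library

def chunk_by_tokens(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into fixed-size, overlapping windows counted in tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models
    tokens = enc.encode(text)
    step = chunk_size - overlap
    # Decode each token window back into text
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```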

Where to Use It

The use cases are similar to those of the character-based fixed-size window, with one difference: when the count is based on tokens, the method works better for tasks where we are limited by the LLM context size.

Summary

Pros:

  • More precise control over LLM context size as it operates on tokens, not characters.
  • Still relatively easy to implement

Cons:

  • Can still sever sentences or thoughts in the middle
  • Does not take semantics into account, hence no guarantee that a chunk accurately captures a unique thought or idea

Use cases:

  • For exploratory, initial work with LLMs
  • Not recommended for tasks requiring a deep understanding of the semantics and context of the text, like sentiment analysis or text summarization

Recursive Structure Aware Splitting

Recursive Structure-Aware Splitting is a hybrid approach to text chunking, combining elements of the fixed-size sliding window method and the structure-aware splitting method. This method attempts to create chunks of approximately fixed sizes, either in characters or tokens, while also trying to preserve the original units of text such as words, sentences, or paragraphs.

In this method, the text is recursively split using various separators such as paragraph breaks ("\n\n"), new lines ("\n"), or spaces (" "), moving to the next level of granularity only when necessary. This allows the method to balance the need for a fixed chunk size with the desire to respect the natural linguistic boundaries of the text.

The major advantage of this method is its flexibility. It provides more precise control over context size compared to fixed-size methods, while also ensuring that semantic units of text are not unnecessarily severed.

However, this method also has its drawbacks. The complexity of implementation is higher due to the recursive nature of the splitting. There's also the risk of ending up with chunks of highly variable sizes, especially with texts of varying structural complexity.

NOTE: LangChain implements this method in its RecursiveCharacterTextSplitter.
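
A short usage sketch (the parameter values are illustrative, and the import path may differ between LangChain versions):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target chunk size in characters
    chunk_overlap=50,  # overlap between consecutive chunks
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_text(text)  # `text` is the document to be chunked
```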

Where to Use It

Recursive Structure Aware Splitting is particularly useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial. This includes tasks such as text summarization, sentiment analysis, and document classification.

However, due to its complexity, it might not be the best fit for tasks that require quick and simple text chunking, or for tasks involving texts with inconsistent or unclear structural divisions.

Summary

Pros:

  • Balances the need for fixed chunk sizes with the preservation of natural linguistic boundaries
  • Provides more precise control over the context size

Cons:

  • Higher complexity of implementation due to the recursive nature of the splitting
  • Risk of ending up with chunks of highly variable sizes

Use cases:

  • Useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial, such as text summarization, sentiment analysis, and document classification
  • Not recommended for tasks requiring quick and simple text chunking, or tasks involving texts with inconsistent or unclear structural divisions

Structure Aware Splitting (by Sentence, Paragraph, Section, Chapter)

Structure Aware Splitting is an advanced approach to text chunking, which takes into account the inherent structure of the text. Instead of using a fixed-size window, this method divides the text into chunks based on its natural divisions such as sentences, paragraphs, sections, or chapters.

This method is particularly beneficial as it respects the natural linguistic boundaries of the text, ensuring that words, sentences, and thoughts are not cut in the middle. This aids in preserving the semantic integrity of the information within each chunk.

However, this method does have certain limitations. Handling text of varying structural complexity might be challenging. For instance, some texts might not have clearly defined sections or chapters - e.g., text extracted from OCR output, unformatted speech-to-text output, or text extracted from tables. Also, while it is more semantically aware than the fixed-size methods, it still doesn't guarantee perfect semantic consistency within chunks, especially for larger structural units like sections or chapters.
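
As an illustration, a minimal sketch that splits on natural boundaries, using NLTK for sentence detection (assuming the punkt sentence model has been downloaded; the exact resource name may vary between NLTK versions):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence-boundary model, fetched once

def chunk_by_structure(text: str, level: str = "paragraph") -> list[str]:
    """Split text along natural boundaries: paragraphs or sentences."""
    if level == "paragraph":
        # Assumes paragraphs are separated by blank lines
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    if level == "sentence":
        return sent_tokenize(text)
    raise ValueError(f"unsupported level: {level}")
```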

Where to Use It

Structure Aware Splitting is highly effective for tasks that require a good understanding of the context and semantics of the text. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.

However, it might not be the best fit for tasks involving texts that lack defined structural divisions, or for tasks that require a finer granularity, such as word-level Named Entity Recognition (NER).

Summary

Pros:

  • Respects natural linguistic boundaries, avoiding severing words, sentences, or thoughts
  • Preserves the semantic integrity of information within each chunk

Cons:

  • Challenging to handle text with varying structural complexity
  • Does not guarantee perfect semantic consistency within chunks, especially for larger structural units
  • No control over chunk size - chunks from a given document might vary significantly in size

Use cases:

  • Effective for tasks requiring good understanding of context and semantics, such as text summarization, sentiment analysis, and document classification
  • Not recommended for tasks involving texts that lack defined structural divisions, or tasks needing finer granularity, like word-level NER

NLP Chunking: Tracking Topic Changes

NLP Chunking with Topic Tracking is a sophisticated approach to text chunking. This method divides the text into chunks based on semantic understanding, specifically by detecting significant shifts in the topics of sentences. If the topic of a sentence significantly differs from the topic of the previous chunk, this sentence is considered the beginning of a new chunk.

This method has the distinct advantage of maintaining semantic consistency within each chunk. By tracking the changes in topics, this method ensures that each chunk is semantically distinct from the others, thereby capturing the inherent structure and meaning of the text.

However, this method is not without its challenges. It requires advanced NLP techniques to accurately detect topic shifts, which adds to the complexity of implementation. Additionally, the accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used.
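
One way to approximate this, sketched below, is to embed consecutive sentences and start a new chunk whenever their cosine similarity drops below a threshold. This is a simplification (it compares each sentence only with its immediate predecessor, not with the whole preceding chunk), and the model name and threshold are illustrative assumptions:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

def chunk_by_topic(text: str, threshold: float = 0.5) -> list[str]:
    """Start a new chunk when a sentence diverges semantically from the previous one."""
    sentences = sent_tokenize(text)
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model
    # Normalized embeddings make the dot product equal to cosine similarity
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if float(np.dot(prev_emb, emb)) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```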

Where to Use It

NLP Chunking with Topic Tracking is highly effective for tasks that require an understanding of the semantic context and topic continuity. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.

This method might not be the best fit for tasks involving texts that have a high degree of topic overlap or for tasks that require simple text chunking without the need for deep semantic understanding.

Summary

Pros:

  • Maintains semantic consistency within each chunk
  • Captures the inherent structure and meaning of the text by tracking topic changes

Cons:

  • Requires advanced NLP techniques, increasing the complexity of implementation
  • The accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used

Use cases:

  • Highly effective for tasks requiring semantic context and topic continuity, such as text summarization, sentiment analysis, and document classification
  • Not recommended for tasks involving texts with high degrees of topic overlap or tasks requiring simple text chunking without the need for deep semantic understanding

Content-Aware Splitting (Markdown, LaTeX, HTML)

Content-Aware Splitting is a method of text chunking that focuses on the type and structure of the content, particularly in structured documents like those written in Markdown, LaTeX, or HTML. This method identifies and respects the inherent structure and divisions of the content, such as headings, code blocks, and tables, to create distinct chunks.

The primary advantage of this method is that it ensures different types of content are not mixed within a single chunk. For instance, a chunk containing a code block will not also contain a part of a table. This helps maintain the integrity and context of the content within each chunk.

However, this method also presents certain challenges. It requires understanding and parsing the specific syntax of the structured document format, which can increase the complexity of implementation. Moreover, it might not be suitable for documents that lack clear structural divisions or those written in plain text without any specific format.
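
For Markdown, for example, LangChain's MarkdownHeaderTextSplitter splits a document along its headings and keeps the heading path as metadata (the header mapping below is illustrative):

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "section"),      # top-level headings
    ("##", "subsection"),  # second-level headings
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = splitter.split_text(markdown_text)  # `markdown_text` is the document to split
```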

Where to Use It

Content Aware Splitting is especially useful when dealing with structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages. It helps ensure that the chunks created are meaningful and contextually consistent.

However, this method might not be the best fit for unstructured or plain text documents, or for tasks that do not require a deep understanding of the content structure.

Summary

Pros:

  • Ensures different types of content are not mixed within a single chunk
  • Respects and maintains the integrity and context of the content within each chunk

Cons:

  • Requires understanding and parsing the specific syntax of the structured document format
  • Might not be suitable for unstructured or plain text documents

Where to Use It:

  • Particularly useful for structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages
  • Not recommended for unstructured or plain text documents, or tasks that do not require a deep understanding of the content structure

Adding Extra Context to the Chunk (metadata, summaries)

Adding extra context to the chunks in the form of metadata or summaries can significantly enhance the value of each chunk and improve the overall understanding of the text[^3]. Here are two strategies:

Adding Metadata to Each Chunk

This strategy involves adding relevant metadata to each chunk. Metadata could include information such as the source of the text, the author, the date of publication, or even data about the content of the chunk itself, like its topic or keywords. This extra context can provide valuable insights and make the chunks more meaningful and easier to analyze.

NOTE: If the chunks are vectorized using text embeddings, be aware that vector databases typically allow metadata to be stored alongside the embedding vectors.
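
To illustrate, a minimal sketch of a chunk record as it might be sent to a vector database - all the metadata fields below are illustrative, not required by any particular database:

```python
chunk_record = {
    "text": "An effective chunking strategy ensures that search results...",
    "metadata": {
        "source": "docs/chunking-guide.md",  # hypothetical path
        "author": "Jane Doe",                # hypothetical author
        "published": "2023-09-11",
        "topic": "text chunking",
        "keywords": ["chunking", "semantic search", "embeddings"],
    },
}
# Vector databases typically accept such metadata as a "payload" stored next to
# the embedding vector, so it can be used for filtering at query time.
```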

Pros:

  • Provides additional information about each chunk
  • Enhances the value of each chunk, making them more meaningful and easier to analyze
  • Can help to produce more effective embeddings by supplying broader context for the chunk

Cons:

  • Requires additional processing to generate and attach the metadata
  • The usefulness of the metadata depends on its relevance and accuracy

Where to Use It:

  • Especially useful in tasks that involve analyzing the origin, authorship, or content of the chunks, such as text classification, document clustering, or information retrieval
  • Can be used to filter the sources used to provide context to LLMs.

You can get an intuition of what is possible by reading the llama_index documentation on metadata extraction and usage: Metadata Extraction Usage Pattern - LlamaIndex 🦙 0.9.30

Passing on Chunk Summaries

In this strategy, each chunk is summarized, and that summary is passed on to the next chunk. This method provides a 'running context' that can enhance the understanding of the text and maintain the continuity of information.
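
A sketch of the idea, with a hypothetical summarize() helper standing in for an actual LLM or summarization-model call:

```python
def summarize(text: str) -> str:
    """Hypothetical helper - in practice, an LLM or summarization-model call."""
    raise NotImplementedError

def add_running_context(chunks: list[str]) -> list[str]:
    """Prepend a summary of the preceding chunk to each chunk."""
    enriched, previous_summary = [], ""
    for chunk in chunks:
        if previous_summary:
            enriched.append(f"Context so far: {previous_summary}\n\n{chunk}")
        else:
            enriched.append(chunk)
        previous_summary = summarize(chunk)  # becomes context for the next chunk
    return enriched
```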

Pros:

  • Enhances the understanding of the text by maintaining a running context
  • Helps to ensure the continuity of information across chunks

Cons:

  • Requires advanced NLP techniques to generate accurate and meaningful summaries
  • The effectiveness of this method depends on the quality of the summaries

Where to Use It:

  • Particularly useful in tasks where understanding the continuity and context of the text is crucial, such as text summarization or reading comprehension tasks

Other Experimental Strategies for Adding Context to the Chunks

  1. Keyword Tagging: This method involves identifying and tagging the most important keywords or phrases in each chunk. These tags then serve as a quick reference or summary of the chunk's content. Advanced NLP techniques can be used to identify these keywords based on their relevance and frequency.

  2. Sentiment Analysis: For text that contains opinions or reviews, performing sentiment analysis on each chunk and attaching the sentiment score (positive, negative, neutral) as metadata can provide valuable context. This can be particularly useful in tasks such as customer feedback analysis or social media monitoring.

  3. Entity Recognition: Applying Named Entity Recognition (NER) techniques to each chunk can identify and label entities such as names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. This entity information can be added to each chunk, providing valuable context, especially in tasks like information extraction or knowledge graph construction (see the sketch after this list).

  4. Topic Classification: Each chunk can be classified into one or more topics using machine learning or NLP techniques. This topic label can provide a quick understanding of what each chunk is about, adding valuable context, especially for tasks like document classification or recommendation.

  5. Chunk Linking: This method involves creating links between related chunks based on shared keywords, entities, or topics. These links can provide a 'map' of the content, showing how different chunks relate to each other. This can be particularly useful in tasks involving large and complex texts, where understanding the overall structure and relations between different parts is important.
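
As an illustration of the entity-recognition strategy, a minimal sketch using spaCy (assuming the en_core_web_sm model has been installed via python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component

def tag_entities(chunk: str) -> dict:
    """Attach named entities found in the chunk as metadata."""
    doc = nlp(chunk)
    return {
        "text": chunk,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

# e.g. tag_entities("The report was published in Berlin in 2020.")
# might return entities like ('Berlin', 'GPE') and ('2020', 'DATE'),
# depending on the model.
```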

Conclusions

In the field of Natural Language Processing, text chunking emerges as a powerful technique that significantly enhances the performance of semantic search and language models. By breaking down text into manageable, contextually relevant chunks, we can ensure more accurate and meaningful search results.

The choice of chunking method, whether it's fixed-size, structure-aware, or NLP chunking, depends on the specific requirements of the use case and application. Each method has its own strengths and limitations, and understanding these is crucial to implementing an effective chunking strategy.

Moreover, adding extra context to the chunks, such as metadata or summaries, can further enhance the value of each chunk and improve the overall understanding of the text. Experimental strategies like keyword tagging, sentiment analysis, entity recognition, topic classification, and chunk linking offer promising avenues for further exploration.

Any comments or suggestions? Let me know.

References

Further Reading

Edits:

  • 2023-11-06 - added reference: Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex
  • 2023-11-13 - added video from Weaviate


To cite this article:

@article{Saf2023From,
    author  = {Krystian Safjan},
    title   = {From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques},
    journal = {Krystian Safjan's Blog},
    year    = {2023},
}