Establishing a high-quality RAG pipeline is essential for ensuring the responses you get from your AI are coherent and accurate.
The first step, of course, is to create a vector database containing chunks of your text data. The standard approaches include separator-based splitting or recursive text splitting with fixed, predetermined chunk sizes. These can be problematic when your documents are highly varied.
We propose an AI-in-the-loop framework that generates more contextual chunks by using LLMs to sample the input documents and determine suitable chunk sizes and overlaps.
Text Chunking Techniques
Separator-Based
We have utilized separator-based chunking methods in the past with positive results.
This works by simply setting a separator such as ‘,’ or ‘.’ in your text splitter. This will split your text into chunks wherever the separator is present.
This technique is useful when each sentence contains useful information you don’t want to miss or have misconstrued by overlapping sentences or statements.
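As a minimal sketch of the idea (the function name and the sample sentence are illustrative, not the exact splitter we use), separator-based chunking can be as simple as splitting on the chosen character and keeping each non-empty piece as its own chunk:

def split_on_separator(text, separator='.'):
    # Split wherever the separator appears and drop empty or whitespace-only fragments
    pieces = [piece.strip() for piece in text.split(separator)]
    return [piece for piece in pieces if piece]

# Example usage
sentences = split_on_separator("RAG needs good chunks. Separators keep sentences intact. Overlap is optional.")
print(sentences)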
Variable Chunk and Overlap Size
If you are chunking using size and overlap parameters, we recommend using a control mechanism to influence the chunk size.
This might be counting the number of characters or words in the text and using a formula to determine the ideal chunk size. For instance, an article or blog you would like to ingest might be around 800 words. Instead of having a fixed chunk size regardless of the article size, your chunk size would be determined by the word count. In this case, the chunk size might be 50 words with an overlap of 15 words.
import requests
import pypdf

def extract_text_from_pdf_url(pdf_url):
    response = requests.get(pdf_url)
    response.raise_for_status()  # Ensure we notice bad responses
    with open('temp.pdf', 'wb') as file:
        file.write(response.content)
    reader = pypdf.PdfReader('temp.pdf')
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

def determine_chunk_size(word_count):
    chunk_size = int(word_count * 0.05)  # 5% of word count for chunk size
    overlap = int(word_count * 0.02)     # 2% of word count for overlap
    return chunk_size, overlap

def chunk_text(text, chunk_size, overlap):
    words = text.split()
    chunks = []
    # Slide a window of chunk_size words forward by (chunk_size - overlap) each step
    for i in range(0, len(words), chunk_size - overlap):
        chunk = words[i:i + chunk_size]
        chunks.append(' '.join(chunk))
    return chunks

def main(pdf_url):
    text = extract_text_from_pdf_url(pdf_url)
    word_count = len(text.split())
    chunk_size, overlap = determine_chunk_size(word_count)
    chunks = chunk_text(text, chunk_size, overlap)
    return chunks, chunk_size, overlap

# Example usage
pdf_url = 'https://www.press.umich.edu/pdf/0472108670-02.pdf'  # Must be a .pdf file
chunks, chunk_size, overlap = main(pdf_url)
print(chunk_size, overlap, chunks[0])
There are all sorts of ways to algorithmically determine a suitable chunk size. One option is to have an LLM evaluate the type of text: it can be presented with a portion of the document and then decide what chunk size is ideal.
Letting AI Choose
Instead of using separators or fixed chunk sizes, we can get an LLM to evaluate a sample of text from the document.
First, we need to provide it guidance on what chunk sizes are appropriate for different document types.
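As a rough illustration of that guidance (the dictionary below simply restates the percentages used in the prompt further down, with single values picked from the stated ranges), a sizing guide might look like this:

# Chunk size as a percentage of total word count, per document type.
# Values restate the guidance embedded in the prompt below; ranges are collapsed to one value.
SIZING_GUIDE_PCT = {
    'articles/blogs': 5,
    'news articles': 10,
    'scientific papers': 4,   # guidance range: 3-5%
    'legal documents': 3,     # guidance range: 2-5%
    'books': 1.5,             # guidance range: 1-2%
}
# Overlap: roughly 10-20% of the chosen chunk size.
# Code: chunk by logical blocks (functions, classes) rather than by percentage.
# Social media posts/emails: keep the entire post or email as a single chunk.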
A Basic Implementation
We can format the sizing guide above into a prompt to feed to the LLM.
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def ai_chunk_identifier(sample_text):
    config = """You are an intelligent and diverse document analysis tool.
Your main purpose is to observe samples of text and return an ideal chunk size to split the text, and an appropriate overlap size.
Here's your guidance:
* Keep chunks short: Smaller chunks generally improve focus and coherence.
* Overlap chunks: Aim for 10-20% of the chunk size. For example, if the chunk size is 5%, overlap should be 1%.
* Chunk size by text type (as a percentage of total text):
  * Articles/Blogs: 5%
  * News Articles: 10%
  * Scientific Papers: 3-5%
  * Legal Documents: 2-5%
  * Books: 1-2%
  * Code: Chunk by logical blocks (functions, classes)
  * Social Media Posts/Emails: Use the entire post/email as a single chunk.
Return the chunk size percentage and overlap percentage as comma-separated values (e.g., "5,1").
"""
    message = f"Based on this sample text, what is the ideal chunk size and overlap size?:\n\n{sample_text}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": config},
            {"role": "user", "content": message}
        ],
        max_tokens=2000
    )
    chunk_size_overlap = response.choices[0].message.content
    return chunk_size_overlap
The inputs include:
- A sample of text from the document; roughly 10% is enough to identify the document type.
- That’s it!
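Tying the pieces together, a usage sketch might look like the following. The chunk_with_ai helper is hypothetical glue code: it assumes the earlier extract_text_from_pdf_url and chunk_text functions are in scope and that the model follows the "chunk_pct,overlap_pct" reply format requested in the prompt (error handling is omitted):

def chunk_with_ai(pdf_url):
    # Reuse the extraction and chunking helpers defined earlier
    text = extract_text_from_pdf_url(pdf_url)
    words = text.split()

    # Roughly 10% of the document is enough for the LLM to identify its type
    sample_size = max(1, len(words) // 10)
    sample_text = ' '.join(words[:sample_size])

    # Assumes the model replies in the requested "chunk_pct,overlap_pct" format
    chunk_pct, overlap_pct = [float(v) for v in ai_chunk_identifier(sample_text).split(',')]

    # Convert the percentages of total word count into word counts
    chunk_size = max(1, int(len(words) * chunk_pct / 100))
    overlap = min(int(len(words) * overlap_pct / 100), chunk_size - 1)  # keep the window step positive

    return chunk_text(text, chunk_size, overlap), chunk_size, overlap

# Example usage
chunks, chunk_size, overlap = chunk_with_ai(pdf_url)
print(chunk_size, overlap, chunks[0])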
The Results
Here are some examples from a prospect theory PDF, accessible here. It has a total word count of 13,427.
The LLM selected a chunk size of just under 1%, which is closest to the 'book' category, a fair choice considering the document is nearly 40 pages long. It was able to determine the type of document from a small snippet of text.
Let's try an article instead, which should receive a 5% chunk size. The article tested can be accessed here. It has just over 1100 words.
A chunk size selection of 55 words, roughly 5% of the total word count, puts it in the 'article' document type.
You can adjust the guidance to be more specific to your document types. This approach is just an example of how to use AI to more intelligently process text data.
Intelligent text chunking techniques like the one described above provide more contextually aware chunks for the vector database.
A combination of sentiment analysis and simple word counting can provide a robust text chunking process able to adapt to different file types and text types.
We look forward to developing this framework to incorporate more stringent and measured guidance to generate even more useful chunks for RAG purposes.
Learn more about Asycd and our work here. Check out our most recent digital art collection which is an excellent showcase of our commitment to putting out meaningful work advancing creative AI.