So, you’ve created a RAG (Retrieval Augmented Generation) application and things seem to be going well. You’ve found a couple of documents that you want to query with an LLM (Large Language Model), so you send all that content along with your questions to the LLM, and it gives you confident, totally plausible responses. This is great! Gen AI is so cool! But wait a second: these responses aren’t always complete, sometimes they don’t reference the right document, and sometimes they’re not even correct… what gives? Gen AI is supposed to solve all my problems! Turns out you need LLMOps.
One of the key components of an LLMOps workflow is grounding the model in established, factual data that it can reference when answering a user’s query. Providing this grounding is a time-consuming and complex process, but it’s absolutely necessary if you want a generative AI model like ChatGPT or Llama to answer questions about data that’s relevant to you.
Where’s all my stuff?
The first step is to evaluate the information available to determine what types of data are present (text, non-text, tabular, image, video, etc.) and which ones would be the most valuable to us. This in and of itself can be a daunting prospect if information is scattered across different silos. All of that content needs to be brought together into one spot, like Azure Data Lake Storage Gen2 or the OneLake option provided by the recently released Microsoft Fabric.
Figure a: Microsoft Fabric foundation
What’s in all this stuff?
Once we have the information that we want to interact with, how do we actually access the content from each type of data and convert it into a format the LLM can use? Here’s where additional AI tools like Azure AI Document Intelligence, AI Vision, and AI Video Indexer come in handy. Each of these tools can be leveraged to extract text from various types of data, which the LLM can then ingest as context for a query.
The recently updated Document Intelligence prebuilt layout model (API version 2024-02-29-preview) can accurately extract text, tables, and figures from PDF documents, JPEGs and other image files, as well as Microsoft Word (.docx), Excel (.xlsx), and PowerPoint (.pptx) files.
Figure b: Azure AI Document Intelligence Studio analysis
It does a very good job of recognizing the layout of a page and maintaining the flow of text through the document. This makes it a great option for extracting the content you need from all that information that used to live in those data silos.
Are we done yet?
Well, not quite. Finally, we need to determine how to chunk and embed the extracted content in a way that preserves meaning, fits within an LLM’s token limits, and sets the LLM up for success in identifying which content to use to answer a question.
That’s a big final step, and it brings up a lot of questions: What is chunking? How do I know if my chunk size preserves meaning? Given that token limits rise with each new model release, does chunking even matter? Won’t the LLM be successful regardless of how much content I give it?
Less is More
Chunking is the process of dividing large amounts of information into smaller pieces so that the relevant pieces can be more easily matched to a user’s question. If chunks are too big, relevant information may get lost in all the content around it and not be found when needed. If chunks are too small, they may not contain enough information to answer the question.
Picking a chunk size, along with a corresponding amount of overlap between chunks, is vital for retrieving information that, when provided to an LLM alongside the question, produces a successful response. The ideal chunk and overlap size is unique to the kind of information being chunked and requires some experimentation to find the right fit.
So what do I do?
You need a testing plan! There are several different chunking approaches (naïve, sentence tokenizers, recursive text splitters, semantic chunking, agentic chunking), the chunk size can be token-based or character-based, and then there’s the overlap percentage to try out: maybe 10% or 20%?
You’ll need to embed those chunks (i.e., convert them to vectors using an embedding model) and store those vectors in a database. Azure OpenAI offers the latest embedding models, like text-embedding-3-small and text-embedding-3-large, and Azure AI Search supports vector storage and vector queries, so those resources will be vital for our experiments.
You’ll also need a test set of questions that are relevant to your data so you can determine whether the chunks coming back from the search contain the answer to the question being asked (a sketch of this check follows the code figures below).
Table a: Chunk evaluation table
Figure c: Code for cracking document with Azure AI Document Intelligence
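A minimal sketch of this document-cracking step in Python, assuming the azure-ai-documentintelligence package and made-up environment variable and file names (DOCINTEL_ENDPOINT, DOCINTEL_KEY, sample.pdf), might look something like this:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

# Assumed environment variable names for the Document Intelligence resource.
endpoint = os.environ["DOCINTEL_ENDPOINT"]
key = os.environ["DOCINTEL_KEY"]

client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

# Send the raw file bytes to the prebuilt layout model and wait for the analysis.
with open("sample.pdf", "rb") as f:  # hypothetical file name
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        f,
        content_type="application/octet-stream",
    )
result = poller.result()

# result.content holds the extracted text in reading order;
# detected tables are available via result.tables.
document_text = result.content
print(document_text[:500])
```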
Figure d: Code for chunking document with Langchain Recursive Character Text Splitter
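A minimal sketch of the chunking step, assuming the langchain-text-splitters package; the 1,000-character chunk size and roughly 10% overlap are illustrative starting points to experiment with, not recommendations:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative settings; experiment with chunk size and overlap for your own data.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters per chunk
    chunk_overlap=100,   # ~10% overlap between neighboring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # prefer paragraph, then sentence, then word breaks
)

# document_text comes from the document-cracking sketch above.
chunks = splitter.split_text(document_text)
print(f"Produced {len(chunks)} chunks")
```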
Figure e: Code for setting up Azure AI Search to store embedded chunks
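A minimal sketch of the index setup, assuming the azure-search-documents package (version 11.4 or later), made-up environment variable names, a hypothetical index name of chunk-index, and 1,536 vector dimensions to match text-embedding-3-small:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SearchableField,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

# Assumed environment variable names for the Azure AI Search resource.
search_endpoint = os.environ["SEARCH_ENDPOINT"]
search_key = os.environ["SEARCH_ADMIN_KEY"]

index_client = SearchIndexClient(search_endpoint, AzureKeyCredential(search_key))

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SimpleField(name="source", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,  # matches text-embedding-3-small
        vector_search_profile_name="vector-profile",
    ),
]

# HNSW approximate nearest-neighbor configuration used by the vector field's profile.
vector_search = VectorSearch(
    algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
    profiles=[
        VectorSearchProfile(name="vector-profile", algorithm_configuration_name="hnsw-config")
    ],
)

index = SearchIndex(name="chunk-index", fields=fields, vector_search=vector_search)
index_client.create_or_update_index(index)
```

HNSW is the common choice for approximate nearest-neighbor search at scale; Azure AI Search also offers an exhaustive KNN configuration that can be worth trying on smaller indexes during experimentation.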
Figure f: Code for adding embedded chunks to Azure AI Search
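A minimal sketch of embedding each chunk with Azure OpenAI and uploading it to the index above, assuming the openai package (1.x), an embedding deployment named text-embedding-3-small, and the chunks variable from the splitter sketch:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Assumed environment variable names for the Azure OpenAI resource.
openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_key=os.environ["AOAI_KEY"],
    api_version="2024-02-01",
)

search_client = SearchClient(
    os.environ["SEARCH_ENDPOINT"],
    "chunk-index",
    AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)

# Embed each chunk and build the documents to upload; chunks comes from the splitter sketch.
documents = []
for i, chunk in enumerate(chunks):
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed deployment name
        input=chunk,
    ).data[0].embedding
    documents.append(
        {"id": str(i), "content": chunk, "source": "sample.pdf", "content_vector": embedding}
    )

search_client.upload_documents(documents=documents)
```

With the chunks indexed, each test question can be embedded the same way and run as a vector query to check whether a relevant chunk comes back. The question and expected text below are purely hypothetical:

```python
from azure.search.documents.models import VectorizedQuery

test_questions = [
    # Hypothetical example; replace with questions grounded in your own documents.
    {"question": "What were the key findings of the 2023 report?", "expected_text": "key findings"},
]

for case in test_questions:
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=case["question"],
    ).data[0].embedding

    results = search_client.search(
        search_text=None,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=3, fields="content_vector")
        ],
        select=["content", "source"],
    )

    found = any(case["expected_text"].lower() in doc["content"].lower() for doc in results)
    print(f"{case['question']!r}: {'relevant chunk retrieved' if found else 'no relevant chunk'}")
```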
This all sounds like a lot of work.
It is! But maybe there are other ways to accomplish this task. Share with us! How are you determining which chunking approach is best for your data? How are you extracting the valuable information locked away in non-text files? Add a comment below or contact us today to share your thoughts.
If you’re looking for some guidance on how to tackle LLMOps with RAG, SpyglassMTG has just launched AI Genie, an accelerator package designed to get you set up with all the infrastructure you need to start implementing AI initiatives, along with a customizable web chat interface, not to mention the expertise needed to help you determine the best RAG implementation for your data. Reach out and let us know what we can do for you!