How to test RAG systems
One of the most common patterns in modern LLM development is Retrieval-Augmented Generation (RAG).
These systems respond to queries by retrieving relevant information from data sources and then generating a response to the query with an LLM using that information.
Note that in this post I use a broad definition of RAG that includes keyword-based retrieval and other RAG architectures. Some people use a narrower definition that requires embeddings and a vector database for the retrieval step.
RAG systems can be used to:
- Search the internet: like in ChatGPT Web Browsing, which uses RAG to query the public internet and an OpenAI model to generate a response to a query
- Query knowledge: like Mem.ai and Notion QA, which query your knowledge base to generate a response
- Summarize datasets: like Amazon’s Customer Review Highlights, which creates a high-level summary of all customer feedback about a product
For generative AI developers, RAG can be an easy way to stand up a quick MVP (“Wow, look at how our customers can now chat with their data in our product!”). But once something is up and running, developers often struggle to systematically make changes that improve quality and reduce hallucination.
While it’s still a new field, best practices for testing RAG systems have already emerged and can be gradually implemented as follows.
Setup work: create some test data
In any of the scenarios below, you’ll need some test data.
Generally, a good test system consists of the following:
- Access to a large data system on which you’ll be doing retrieval
- A suite of 10+ test queries and good responses, tagged with their user ID or any other metadata necessary to handle the query.
For example, let’s say we wanted to implement a Q&A feature for customers of our knowledge-base (KB) product.
The test suite might look like:
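As a concrete sketch, a minimal suite could be a list of query/metadata records like the following (the queries and user IDs here are made up for illustration):

```python
# A hypothetical test suite for a KB-product Q&A feature.
# Queries and user IDs are illustrative, not real data.
test_suite = [
    {"query": "How do I reset my password?", "user_id": "user-123"},
    {"query": "Which plans support single sign-on?", "user_id": "user-456"},
    {"query": "Can I export my articles to PDF?", "user_id": "user-789"},
]
```

The user ID matters because retrieval in a multi-tenant KB product usually depends on which documents the querying user can see.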
Basic testing system: ask GPT-4 if the facts are right
Once you have your test suite, ask a model (e.g., GPT-4) if the facts are right. This technique, known as AI evaluation, requires some work to get right.
First, enrich your test dataset with good expected answers:
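For example, each test case gains an `expected` field holding a hand-written reference answer (the query and answer below are made up for illustration):

```python
# Enriching one test case with a hand-written expected answer.
# Both the query and the answer are illustrative, not real data.
case = {"query": "How do I reset my password?", "user_id": "user-123"}

case["expected"] = (
    "Go to Settings > Security and choose 'Reset password'; "
    "a reset link is emailed to you."
)
```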
Then, programmatically generate outputs using your RAG pipeline, and grade them using a prompt similar to the following.
This prompt converts the input query, expected value, and actual output into a grade between A and E.
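A sketch of such a prompt is below, adapted from the multiple-choice style used by the OpenAI Evals “fact” template; the exact wording is my assumption, not a prescribed prompt. The `parse_grade` helper is also hypothetical:

```python
import re

# Sketch of an A-E fact-grading prompt (wording is an assumption,
# loosely modeled on the OpenAI Evals "fact" template).
GRADING_PROMPT = """\
You are comparing a submitted answer to an expert answer for a question.

[Question]: {query}
[Expert answer]: {expected}
[Submitted answer]: {output}

Choose the option that best describes the submitted answer:
(A) It is a subset of the expert answer and fully consistent with it.
(B) It is a superset of the expert answer and fully consistent with it.
(C) It contains all the same details as the expert answer.
(D) It disagrees with the expert answer.
(E) It differs, but the differences don't matter for factuality.

Respond with a single letter, A through E.
"""

def parse_grade(model_response: str) -> str:
    """Extract the first standalone A-E letter from the model's reply."""
    letters = re.findall(r"\b[A-E]\b", model_response.upper())
    if not letters:
        raise ValueError(f"No grade found in: {model_response!r}")
    return letters[0]
```

You would fill the template with `GRADING_PROMPT.format(...)`, send it to the grading model, and run `parse_grade` on the reply.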
Once you have automated grading working, compare and contrast different experiments to systematically get better performance.
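One simple way to compare experiments is to map the letter grades onto numeric scores and average them per run. The scale below (A–C pass, D fails, E half credit) and the grade lists are arbitrary illustrations:

```python
# Map A-E grades to scores and average per experiment run.
# This scoring scale is an arbitrary choice, not a standard.
GRADE_SCORES = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.0, "E": 0.5}

def score_run(grades):
    """Average score for one experiment's list of letter grades."""
    return sum(GRADE_SCORES[g] for g in grades) / len(grades)

baseline_grades = ["A", "D", "C", "B", "D"]   # made-up grades, pipeline v1
candidate_grades = ["A", "C", "C", "B", "D"]  # made-up grades, pipeline v2
```

Comparing `score_run(baseline_grades)` against `score_run(candidate_grades)` tells you whether a change moved the needle across the whole suite rather than on one cherry-picked query.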
With the basic pipeline, you’ll probably notice that it’s difficult to differentiate between “retrieval” system failures and “generation” system failures.
Furthermore, the “Fact” evaluator does not do a good job of understanding the completeness of the response.
Let’s upgrade our testing system with the following enhancements.
Enhancement #1: Test specific stages of the RAG pipeline
This enhancement answers the question “where did my system break?”
To understand where failures are happening at-a-glance, extend your automated output collection to collect the outputs of each step.
Then, set up individual evaluations on each step.
And if you’re dealing with more complex RAG scenarios, the two main steps may have sub-steps. For example, your retrieval step might consist of one sub-step for SQL query generation and one for query execution. Test each sub-step as necessary.
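A minimal sketch of per-step output collection, assuming a two-step pipeline (the function names and stub steps below are hypothetical; in a real system they would call your retriever and your LLM):

```python
# Capture each step's output so retrieval and generation can be
# evaluated separately. Step functions here are illustrative stubs.
def run_with_traces(query, retrieve, generate):
    docs = retrieve(query)
    answer = generate(query, docs)
    return {"query": query, "retrieved_docs": docs, "answer": answer}

def fake_retrieve(query):
    return ["KB article: resetting your password"]

def fake_generate(query, docs):
    return f"Based on {len(docs)} document(s): reset it under Settings."

trace = run_with_traces("How do I reset my password?", fake_retrieve, fake_generate)
```

Each evaluator then reads only its own key: a retrieval evaluation checks `trace["retrieved_docs"]` against the documents you expected, while a generation evaluation checks `trace["answer"]` given those documents.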
Enhancement #2: break down the Fact evaluation into pieces
The “fact” evaluation above is a good starting point for RAG evaluation.
However, if you need to improve performance, I recommend breaking it down into a pair of evaluations.
This evaluation fails when there are new or contradictory facts in a generated output.
This evaluation fails when the generated output is not as complete as the expected answer.
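The pair might look like the following sketches, one prompt per evaluation. The wording is my assumption, loosely inspired by metrics such as Ragas’s faithfulness, rather than a prescribed prompt:

```python
# Sketches of the two narrower evaluation prompts; wording is an
# assumption, not an established template.
FAITHFULNESS_PROMPT = """\
[Question]: {query}
[Expert answer]: {expected}
[Submitted answer]: {output}

Does the submitted answer contain any facts that are new or that
contradict the expert answer? Answer PASS if not, FAIL if so.
"""

COMPLETENESS_PROMPT = """\
[Question]: {query}
[Expert answer]: {expected}
[Submitted answer]: {output}

Does the submitted answer cover every point made in the expert answer?
Answer PASS if so, FAIL if it omits any point.
"""

def render(template, query, expected, output):
    """Fill a prompt template with one test case's fields."""
    return template.format(query=query, expected=expected, output=output)
```

Splitting the evaluation this way means a failure tells you *which* property broke: the output invented or contradicted facts, or it merely left some out.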
I’m Doug Safreno, co-founder and CEO at Gentrace. Gentrace automates test and production evaluations like the ones above for technology companies building with generative AI, helping them improve the quality of LLM outputs at scale.
Thank you to our customers and partners for helping me with this post. Also, thank you to the OpenAI Evals contributors and Ragas contributors for providing inspiration for many of the ideas.