
    How to test RAG systems

    Doug Safreno
    1/11/2024

    One of the most common patterns in modern LLM development is Retrieval-Augmented Generation (RAG).

    These systems respond to queries by retrieving relevant information from data sources and then generating a response to the query with an LLM using that information.

    Note that in this post I use a broad definition of RAG that includes keyword-based RAG and other RAG architectures. Some people use a narrower definition that requires embeddings and a vector DB for the retrieval step.


    RAG systems can be used to:

    • Search the internet: like in ChatGPT Web Browsing, which uses RAG to query the public internet and an OpenAI model to generate a response to a query
    • Query knowledge: like Mem.ai and Notion QA, which query your knowledge base to generate a response
    • Summarize datasets: like Amazon’s Customer Review Highlights, which creates a high-level summary of all customer feedback about a product

    For generative AI developers, RAG can be an easy way to stand up a quick MVP (“Wow, look at how our customers can now chat with their data in our product!”). But, once they get something up and running, developers struggle to systematically make changes to improve quality and reduce hallucination.

    While it’s still a new field, best practices for testing RAG systems have already emerged and can be gradually implemented as follows.

    Setup work: create some test data

    In any of the scenarios below, you’ll need some test data.

    Generally, a good test system consists of the following:

    • Access to a large data system on which you’ll be doing retrieval
    • A suite of 10+ test queries and good responses, tagged with their user ID or any other metadata necessary to handle the query.

    For example, let’s say we wanted to implement a Q&A for customers of our knowledge base (KB) product.

    The test suite might look something like the following.

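    As an illustration, here is a minimal sketch of such a suite as structured data. The queries, user IDs, and expected answers below are hypothetical placeholders; substitute real queries and known-good answers from your own product.

    Python
    # Minimal, illustrative test suite for a knowledge base Q&A product.
    # Every field here is a hypothetical placeholder.
    test_suite = [
        {
            "query": "How do I invite a teammate to my workspace?",
            "expected": "Go to Settings > Members, click 'Invite', and enter their email.",
            "metadata": {"user_id": "user_123", "workspace_id": "ws_001"},
        },
        {
            "query": "What file types can I attach to an article?",
            "expected": "PNG, JPEG, and PDF attachments up to 25 MB are supported.",
            "metadata": {"user_id": "user_456", "workspace_id": "ws_002"},
        },
        # ...at least 10 cases, covering the query types that matter most to you
    ]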

    Basic testing system: ask GPT-4 if the facts are right

    Once you have your test suite, ask a model (e.g., GPT-4) whether the facts are right. This technique, known as AI evaluation, requires some work to get right.

    First, enrich your test dataset with good expected answers.


    Then, programmatically generate outputs using your RAG pipeline, and grade them using a prompt similar to the following.

    Text
    You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Expert]: {expected}
    ************
    [Submission]: {output}
    ************
    [END DATA]
    Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation. The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.

    Credit: OpenAI evals, from their fact evaluation.

    This prompt converts the input query, expected value, and actual output into a grade between A and E.
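    As a rough sketch of what automated grading could look like in code (assuming the OpenAI Python client v1+; run_rag_pipeline is a hypothetical stand-in for your own system, and the prompt is condensed here):

    Python
    import re

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Condensed version of the grading prompt above; paste the full text in practice.
    GRADER_PROMPT = (
        "You are comparing a submitted answer to an expert answer on a given question.\n"
        "[Question]: {input}\n[Expert]: {expected}\n[Submission]: {output}\n"
        "Answer by selecting one of the options (A) through (E)."
    )

    def grade_case(case, rag_output):
        """Ask the grading model to compare a RAG output against the expected answer."""
        prompt = GRADER_PROMPT.format(
            input=case["query"], expected=case["expected"], output=rag_output
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Pull the first letter grade out of the model's reply.
        match = re.search(r"[A-E]", response.choices[0].message.content)
        return match.group(0) if match else "?"

    # Hypothetical usage, where run_rag_pipeline wraps your own system:
    # grades = [grade_case(c, run_rag_pipeline(c["query"], **c["metadata"])) for c in test_suite]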


    Once you have automated grading working, compare and contrast different experiments to systematically get better performance.
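    For example (a sketch reusing grade_case and test_suite from above; the two pipeline variants are placeholders for whatever you are comparing, such as different chunk sizes, retrievers, or prompts):

    Python
    from collections import Counter

    def evaluate_experiment(test_suite, pipeline):
        """Run every test case through one pipeline variant and tally the letter grades."""
        grades = Counter()
        for case in test_suite:
            output = pipeline(case["query"], **case["metadata"])
            grades[grade_case(case, output)] += 1
        return grades

    # Hypothetical comparison of two variants of your RAG pipeline:
    # print("baseline:     ", evaluate_experiment(test_suite, baseline_pipeline))
    # print("larger chunks:", evaluate_experiment(test_suite, large_chunk_pipeline))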


    Optional enhancements

    With the basic pipeline, you’ll probably notice that it’s difficult to differentiate “retrieval” failures from “generation” failures.

    Furthermore, the “Fact” evaluator does not do a good job of understanding the completeness of the response.

    Let’s upgrade our testing system with two enhancements.


    Enhancement #1: Test specific stages of the RAG pipeline

    This enhancement answers the question of “where did my system break?”

    To understand at a glance where failures are happening, extend your automated output collection to collect the outputs of each step.
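    One simple way to do this is to have the pipeline return each stage’s output alongside the final answer; in the sketch below, retrieve and generate are placeholders for your own retrieval and generation steps.

    Python
    def run_rag_pipeline_with_trace(query, **metadata):
        """Run the pipeline and keep each stage's output so it can be evaluated separately.

        retrieve() and generate() are placeholders for your own retrieval and
        generation steps.
        """
        retrieved_docs = retrieve(query, **metadata)  # stage 1: retrieval
        answer = generate(query, retrieved_docs)      # stage 2: generation
        return {"query": query, "retrieved_docs": retrieved_docs, "answer": answer}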

    Then, set up individual evaluations on each step.

    For example, for the retrieval step you may want to use Ragas’ context precision and recall evaluations to see how well your retrieval is performing.
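    A rough sketch with Ragas might look like the following; the exact imports, metric names, and expected dataset columns vary by Ragas version, so treat this as a starting point rather than a drop-in snippet.

    Python
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision, context_recall

    # Collect traced outputs for every test case (see the earlier sketch).
    traces = [run_rag_pipeline_with_trace(c["query"], **c["metadata"]) for c in test_suite]

    # Ragas expects the question, the retrieved contexts (a list of strings),
    # the generated answer, and a ground-truth reference answer.
    dataset = Dataset.from_dict({
        "question": [c["query"] for c in test_suite],
        "contexts": [t["retrieved_docs"] for t in traces],
        "answer": [t["answer"] for t in traces],
        "ground_truth": [c["expected"] for c in test_suite],
    })

    results = evaluate(dataset, metrics=[context_precision, context_recall])
    print(results)  # per-metric scores between 0 and 1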

    And if you’re dealing with more complex RAG scenarios, the two main steps may have sub-steps. For example, your retrieval step might consist of one sub-step for SQL query generation and one for query execution. Test each sub-step as necessary.

    Enhancement #2: break down the Fact evaluation into pieces

    The “fact” evaluation above is a good starting point for RAG evaluation.

    However, if you need to improve performance, I recommend breaking it down into a pair of evaluations.

    Compliance

    This evaluation fails when there are new or contradictory facts in a generated output.

    Text
    You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {{ input }}
    ************
    [Expert]: {{ expected }}
    ************
    [Submission]: {{ output }}
    ************
    [END DATA]
    Compare the compliance of the facts of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation. Also, ignore any missing information in the submission; we only care if there is new or contradictory information. Select one of the following options:
    (A) All facts in the submitted answer are consistent with the expert answer.
    (B) The submitted answer contains new information not present in the expert answer.
    (C) There is a disagreement between the submitted answer and the expert answer.

    Completeness

    This evaluation fails when the generated output is not as complete as the expected answer.

    Text
    You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {{ input }}
    ************
    [Expert]: {{ expected }}
    ************
    [Submission]: {{ output }}
    ************
    [END DATA]
    Compare the completeness of the submitted answer and the expert answer to the question. Ignore any differences in style, grammar, or punctuation. Also, ignore any extra information in the submission; we only care that the submission completely answers the question. Select one of the following options:
    (A) The submitted answer completely answers the question in a way that is consistent with the expert answer.
    (B) The submitted answer is missing information present in the expert answer, but this does not matter for completeness.
    (C) The submitted answer is missing information present in the expert answer, which reduces the completeness of the response.
    (D) There is a disagreement between the submitted answer and the expert answer.
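    If you want to roll the two grades back up into a single pass/fail signal per test case, a minimal sketch is below; the pass criteria are one reasonable reading of the options above, not a standard.

    Python
    def fact_checks_pass(compliance_grade: str, completeness_grade: str) -> bool:
        """Pass when no new or contradictory facts were introduced (compliance A)
        and the question is completely answered (completeness A or B)."""
        return compliance_grade == "A" and completeness_grade in ("A", "B")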

    Credits

    I’m Doug Safreno, co-founder and CEO at Gentrace. Gentrace helps automate test and production evaluations like the ones mentioned above for technology companies leveraging generative AI, and is useful for improving the quality of LLM outputs at scale.

    Thank you to our customers and partners for helping me with this post. Also, thank you to the OpenAI Evals contributors and Ragas contributors for providing inspiration for many of the ideas.