How to test for AI hallucination
AI hallucinations occur when an LLM returns incoherent and/or factually inaccurate information in response to a query.
Hallucinations can be fixed through many types of iterative enhancements, including higher quality context, a more pointed prompt, fine-tuning, or post-processing of generative output.
Developers need to test these (and other) changes to see if they improve or worsen hallucination.
In this post, I share four strategies for testing for hallucination. Two are totally automated, one involves an expert, and one requires end-user feedback.
Method 1: Automated fact comparison
A fast, general-purpose way to test for hallucination is to automatically compare your generative AI's output on a task against one or more ground truths or expected values.
Compile a test dataset
In order to perform automated factual comparison, first create a test dataset.
A good test dataset consists of an input and some ground truth. The ground truth is organized however you want, but I find the easiest way is to just use a single value representing an ideal response from the generative AI.
To borrow from our RAG testing guide, a good test dataset might look like this:
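As an invented illustration (these cases are not from the guide itself), a minimal dataset could be represented in Python as:

```python
# Hypothetical test dataset: each case pairs an input with a single
# ground-truth "expected" value representing an ideal response.
test_dataset = [
    {
        "input": "What is the capital of France?",
        "expected": "The capital of France is Paris.",
    },
    {
        "input": "Who wrote Pride and Prejudice?",
        "expected": "Pride and Prejudice was written by Jane Austen.",
    },
]
```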
Grade the output
Run your test dataset inputs through your generative AI feature, converting the inputs to outputs.
Then, grade the output using an automated comparison of factualness. To do this, use AI or model-graded evaluation. This technique prompts an AI and asks it to do the grading for us.
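The input-to-output step can be sketched as follows; `generate` stands in for your app's generation function (an assumption, not a real API):

```python
def run_dataset(dataset, generate):
    """Run each test input through the generative feature and pair the
    model's output with its expected value for grading.

    `generate` is a stand-in for however your app turns an input into
    a generation (hypothetical; substitute your own call).
    """
    return [
        {
            "input": case["input"],
            "expected": case["expected"],
            "output": generate(case["input"]),
        }
        for case in dataset
    ]
```

From here, each `output` / `expected` pair is handed to the grader.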
I recommend using the following prompt as a starting point (from the OpenAI evals Fact template).
As you can see, this template returns a value between A and E. Let's dive into what each of them means.
- (A) means factual omission - some fact(s) in the expected value did not make it into the actual output
- (B) means POTENTIAL HALLUCINATION - the output contains some "new facts" that weren't present in the expected output
- (C) means factual consistency - there is no factual difference
- (D) means inaccuracy - there is an actual discrepancy between the facts
- (E) means style-only differences - there is no factual difference, but the two look different
B (and occasionally D) show possible hallucination. Measure how often they occur across your test dataset. Consider also tracking omission (A).
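Once grades are collected, the measurement is simple arithmetic. A minimal sketch, assuming the grades arrive as single letters:

```python
def hallucination_rate(grades, flagged=("B", "D")):
    """Fraction of graded outputs flagged as possible hallucination.

    `grades` is a list of single-letter grades (A-E) parsed from the
    model-graded eval's responses; B and, optionally, D count as hits.
    """
    if not grades:
        return 0.0
    hits = sum(1 for grade in grades if grade.strip().upper() in flagged)
    return hits / len(grades)
```

The same helper with `flagged=("A",)` tracks omission instead.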
But wait, can I really trust an LLM to grade itself?
It's a good question; in this case, the answer is generally yes.
To get a feel for why, consider that the LLM originally did a very difficult thing - produce something from scratch.
In this evaluation, the LLM does something much easier - compare two things.
As a result, even if the same model both produced and evaluated the output, it will do a much better job of finding factual mismatches than it did of producing the result.
Method 2: Automated specific checks
Consider the case of this email generator.
Jack Matthews is never mentioned as a Stanford alumnus, but GPT-3.5, attempting to write the best email possible, decides to slip that claim in.
Let's say this is happening frequently in your email generator - far more often than any other form of hallucination.
Method 1 is overkill in this case. We just need to create an evaluator that looks out for this particular form of hallucination.
As in Method 1, first produce a test dataset. This time, there's no need for expected values.
Once you convert the inputs into outputs, again leverage model-graded evaluation, only this time with a more specific prompt:
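The exact prompt depends on your feature; as a hypothetical sketch (not the author's actual prompt) for the Stanford-affiliation example above:

```python
# Hypothetical grader prompt targeting one failure mode: inventing a
# university affiliation that the input facts never state.
SPECIFIC_CHECK_PROMPT = """\
You are checking an AI-written email for one specific error.

[Facts provided to the email generator]
{facts}

[Generated email]
{email}

Does the email claim a university affiliation for any person that is not
stated in the facts above? Answer with exactly one letter:
(A) Yes - the email invents a university affiliation
(B) No - every affiliation in the email appears in the facts
"""


def build_specific_check(facts: str, email: str) -> str:
    """Fill the template; send the result to your grading model."""
    return SPECIFIC_CHECK_PROMPT.format(facts=facts, email=email)
```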
The number of (A) answers is the count of this specific type of hallucination.
Note that this form of automated evaluation has a nice characteristic - it doesn't rely on ground truth / an expected value, so it can be run against production traffic as it happens.
Consider siphoning ~100 samples per day from your production traffic into specific evaluations like this to measure production hallucination.
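A minimal sketch of that sampling step, assuming each day's traffic is available as a list of generation records:

```python
import random


def sample_for_eval(traffic, k=100, seed=None):
    """Pick up to k of the day's production generations for evaluation.

    `traffic` is assumed to be a list of generation records; pass a seed
    to make the sample reproducible.
    """
    rng = random.Random(seed)
    if len(traffic) <= k:
        return list(traffic)
    return rng.sample(traffic, k)
```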
Method 3: Manual expert review
The third method is to have experts manually review generations for hallucination.
To make this work correctly, make sure the experts have access to all relevant information. It should be easy for an expert to validate or invalidate facts they don't already have in their head.
This approach is slow as it requires manual human intervention.
To make things faster, pair this method with the automated methods above. For example, you may want to run automated checks on 100% of test outputs but only manually review 10% of them.
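One way to route a fixed fraction of outputs to review is deterministic sampling keyed on an output id (a hypothetical scheme):

```python
import random


def needs_expert_review(output_id: str, review_fraction: float = 0.10) -> bool:
    """Deterministically route ~review_fraction of outputs to manual review.

    Seeding on a stable output id (a hypothetical scheme) means the same
    output always gets the same routing decision, so automated and manual
    pipelines agree on which ~10% the experts see.
    """
    rng = random.Random(output_id)
    return rng.random() < review_fraction
```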
Method 4: End-user feedback
The last way to measure hallucination is to provide end-users with a way to flag it.
An example: you implement a thumbs up / thumbs down UI that shows up whenever a generation is presented to the user. When the user clicks the thumbs down, the UI asks why the generation was wrong, where hallucination is a possible option.
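The captured feedback can be as simple as a record like this (the field names are invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationFeedback:
    """One end-user rating of a generation (hypothetical schema)."""
    generation_id: str
    thumbs_up: bool
    reason: Optional[str] = None  # e.g. "hallucination", set on thumbs-down
```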
The data isn't perfect. End-users flag hallucinations that aren't actually there, or more frequently, they miss hallucinations or fail to report them.
Even so, the data is directionally useful, as this "hallucination report rate" can be used as a benchmark for the (likely higher) true hallucination rate.
Additionally, the data can be improved by combining this mechanism with other methods above; e.g. you could have an expert review the potential hallucinations to remove false positives.
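Putting the counts together is a one-liner; a sketch, with an optional expert-confirmed count to strip false positives:

```python
def hallucination_report_rate(n_generations, n_flagged, n_confirmed=None):
    """Share of generations flagged as hallucination by end-users.

    If experts reviewed the flags, pass the confirmed count so false
    positives don't inflate the rate (hypothetical helper).
    """
    flagged = n_flagged if n_confirmed is None else n_confirmed
    return flagged / n_generations
```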
There's no one-size-fits-all solution to the hallucination problem. Start with automated checks, and then experiment with approaches in both development and production to maximize your odds of success.