How to test for AI hallucination

AI hallucinations occur when an LLM returns incoherent and/or factually inaccurate information in response to a query.
Doug Safreno
January 24, 2024

Hallucinations can be reduced through several kinds of iterative improvement, including higher-quality context, a more pointed prompt, fine-tuning, or post-processing of the generated output.

Developers need to test these (and other) changes to see if they improve or worsen hallucination.

In this post, I share four strategies for testing for hallucination. Two are fully automated, one involves an expert, and one relies on end-user feedback.

Method 1: Automated fact comparison

A fast, general-purpose way to test for hallucination is to automatically compare your generative AI's output for a task against one or more ground truths or expected values.

Compile a test dataset

In order to perform automated factual comparison, first create a test dataset.

A good test dataset consists of inputs, each paired with some ground truth. The ground truth can be organized however you want, but I find the easiest way is to use a single value representing an ideal response from the generative AI.

To borrow from our RAG testing guide, a good test dataset might look like this:

image
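As a rough sketch of the same idea in code, a test dataset can be a plain list of input/expected pairs. The `Example` class, the `TEST_DATASET` name, and both rows below are hypothetical and only illustrate the shape; they are not taken from the guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    input: str     # what gets sent to the generative AI feature
    expected: str  # a single ideal response, used as the ground truth


# Hypothetical rows for illustration only; use cases from your own domain.
TEST_DATASET = [
    Example(
        input="What is the refund window for annual plans?",
        expected="Annual plans can be refunded within 30 days of purchase.",
    ),
    Example(
        input="Which file formats does the importer accept?",
        expected="The importer accepts CSV and JSON files.",
    ),
]
```

Keeping a single `expected` string per row keeps the dataset easy to write by hand and easy to pass to a grader later.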

Grade the output

Run your test dataset inputs through your generative AI feature, converting the inputs to outputs.

Then, grade the output using an automated comparison of factuality. To do this, use AI-graded (also called model-graded) evaluation. This technique prompts an AI and asks it to do the grading for us.

I recommend using the following prompt as a starting point (from the OpenAI evals Fact template).

You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{ input }}
************
[Expert]: {{ expected }}
************
[Submission]: {{ output }}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
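One possible way to wire this prompt into a test run is sketched below. It fills the `{{ input }}`, `{{ expected }}`, and `{{ output }}` placeholders and pulls the chosen letter out of the grader's reply. The function names are hypothetical, and `ask_model` is an assumed callable that takes a prompt string and returns the model's text reply; substitute whatever client you actually use.

```python
import re
from typing import Callable, Optional


def render(template: str, **values: str) -> str:
    """Replace each {{ name }} placeholder with the matching keyword argument."""
    def substitute(match: "re.Match[str]") -> str:
        return values[match.group(1)]  # raises KeyError if a value is missing

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def parse_choice(reply: str, choices: str = "ABCDE") -> Optional[str]:
    """Return the selected option letter, or None if the reply is unclear."""
    match = re.search(rf"\(([{choices}])\)", reply)
    if match:
        return match.group(1)
    stripped = reply.strip()
    # Also accept a bare letter such as "B" or "B." at the start of the reply.
    if stripped and stripped[0] in choices:
        if len(stripped) == 1 or not stripped[1].isalnum():
            return stripped[0]
    return None


def grade(
    ask_model: Callable[[str], str],  # assumption: prompt in, reply text out
    template: str,
    input: str,
    expected: str,
    output: str,
) -> Optional[str]:
    """Grade one output against its expected value using the given template."""
    prompt = render(template, input=input, expected=expected, output=output)
    return parse_choice(ask_model(prompt))
```

Returning `None` for unparseable replies, rather than guessing, makes it possible to count and inspect grader failures separately from real grades.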

As you can see, this template returns one of five options, A through E. Here is what each of them means.

  • (A) means factual omission - one or more facts in the expected value did not make it into the actual output
  • (B) means POTENTIAL HALLUCINATION - the output contains "new facts" that weren't present in the expected output
  • (C) means factual consistency - there is no factual difference
  • (D) means inaccuracy - the facts in the output conflict with the facts in the expected value
  • (E) means style-only differences - there is no factual difference, but the two answers are worded differently

B (and occasionally D) show possible hallucination. Measure how often they occur across your test dataset. Consider also tracking omission (A).
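Measuring how often each grade occurs can be sketched as a simple tally. The function name and the example grades below are hypothetical; the only assumption is that each test case has already been reduced to a single letter from A to E.

```python
from collections import Counter
from typing import Dict, Iterable


def grade_rates(grades: Iterable[str]) -> Dict[str, float]:
    """Return the share of test cases that received each grade A-E."""
    grades = list(grades)
    if not grades:
        return {}
    counts = Counter(grades)
    return {letter: counts[letter] / len(grades) for letter in "ABCDE"}


# Hypothetical grades for an 8-case test dataset.
rates = grade_rates(["C", "B", "A", "C", "D", "E", "C", "B"])

possible_hallucination = rates["B"] + rates["D"]  # new or conflicting facts
omission = rates["A"]                             # missing facts
```

Tracking these rates across prompt or model changes shows whether a change made hallucination better or worse.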

But wait, can I really trust an LLM to grade itself?

It's a good question; in this case, the answer is generally yes.

To give a feel for why, consider how the LLM originally did a very difficult thing - produce something from scratch.

In this evaluation, the LLM does something much easier - compare two things.

As a result, even if the same model is used to produce and evaluate the output, it will generally do a much better job of finding factual mismatches than of producing the result.

Method 2: Automated specific checks

Consider the case of this email generator.

image

Jack Matthews is never mentioned as a Stanford alumnus, but GPT-3.5, attempting to write the best email possible, decides to slip it in.

Let's say this is happening frequently in your email generator; way more than any other form of hallucination.

Method 1 is overkill in this case. We just need to create an evaluator that looks out for this particular form of hallucination.

Setup

As in Method 1, first produce a test dataset. This time, there's no need for expected values.

Once you convert the inputs into outputs, again leverage model-graded evaluation, only this time with a more specific prompt:

Does the following email mention where the author went to college? Ignore mentions of where others (including the recipient) went to college.
[BEGIN DATA]
************
[Email]: {{ output }}
************
[END DATA]

(A) The email mentions where the author went to college.
(B) The email does not mention where the author went to college. 

The number of (A) answers is the number of times this specific type of hallucination occurred.
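A minimal sketch of running this check over a batch of generated emails follows. The prompt text is the one above; `count_college_mentions` is a hypothetical name, and `ask_model` is an assumed callable that takes a prompt string and returns the grader's reply.

```python
import re
from typing import Callable, Iterable

EMAIL_CHECK_PROMPT = """Does the following email mention where the author went to college? Ignore mentions of where others (including the recipient) went to college.
[BEGIN DATA]
************
[Email]: {{ output }}
************
[END DATA]

(A) The email mentions where the author went to college.
(B) The email does not mention where the author went to college."""


def count_college_mentions(
    outputs: Iterable[str],
    ask_model: Callable[[str], str],  # assumption: prompt in, reply text out
) -> int:
    """Count the emails that the grader marks as (A)."""
    flagged = 0
    for email in outputs:
        reply = ask_model(EMAIL_CHECK_PROMPT.replace("{{ output }}", email))
        # Accept "(A)", "A", or "A." at the start, but not words like "Apparently".
        if re.match(r"\(?A\)?(\W|$)", reply.strip()):
            flagged += 1
    return flagged
```

Dividing the result by the number of emails checked gives a rate that can be compared across versions of the generator.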

In production

Note that this form of automated evaluation has a nice characteristic - it doesn't rely on ground truth / an expected value, so it can be run against production traffic as it happens.

Consider siphoning ~100 samples per day from your production traffic to specific evaluations like this to measure production hallucination.
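Because production traffic arrives as a stream of unknown length, one way to draw a fixed-size daily sample is reservoir sampling. This is my own suggestion for the sampling step, not something the evaluation itself requires; the function name and the `rng` parameter are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int = 100, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = rng or random.Random()
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item   # replace with probability k / (i + 1)
    return sample
```

Each item seen during the day ends up in the sample with equal probability, and memory use stays at `k` items no matter how much traffic there is.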

Method 3: Manual expert review

The third method is to have experts manually review generations for hallucination.

To make this work correctly, make sure that experts have access to all relevant information. It should be easy for an expert to validate or invalidate a fact even when they don't know the answer off the top of their head.

This approach is slow because it requires a human to review each generation.

To make things faster, pair this method with the automated methods above. For example, you may want to run automated checks on 100% of test outputs but only manually review 10% of them.
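One way to pick the 10% for manual review is to hash each output's ID, so the same output is always either in or out of the review set. This is a sketch under assumptions: `needs_manual_review` is a hypothetical name, and it presumes every output has a stable string ID.

```python
import hashlib


def needs_manual_review(output_id: str, fraction: float = 0.10) -> bool:
    """Deterministically select roughly `fraction` of outputs for expert review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction
```

A deterministic rule avoids re-reviewing different items on every run, and the fraction can be raised later without dropping items that were already selected.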

Method 4: End-user feedback

The last way to measure hallucination is to provide end-users with a way to flag it.

For example, you implement a thumbs up / thumbs down UI that shows up whenever a generation is presented to the user. When the user clicks the thumbs down, the UI asks why the generation was wrong, with hallucination as one of the options.

image

The data isn't perfect. End-users sometimes flag hallucinations that aren't actually there and, more frequently, miss hallucinations or fail to report them.

Even so, the data is directionally useful, as this "hallucination report rate" can be used as a benchmark for a (likely higher) true hallucination rate.
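As an illustration, the report rate can be computed from feedback events like this. The `Feedback` record, its fields, and the `"hallucination"` reason string are all hypothetical; adapt them to however your UI stores feedback.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Feedback:
    thumbs_up: bool
    reason: Optional[str] = None  # set only on thumbs down, e.g. "hallucination"


def hallucination_report_rate(
    generations_shown: int, feedback: Iterable[Feedback]
) -> float:
    """Share of all shown generations that users flagged as hallucination."""
    if generations_shown <= 0:
        return 0.0
    reports = sum(
        1 for f in feedback if not f.thumbs_up and f.reason == "hallucination"
    )
    return reports / generations_shown
```

The denominator is the number of generations shown, not the number of feedback events, because most users never click either button.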

Additionally, the data can be improved by combining this mechanism with other methods above; e.g. you could have an expert review the potential hallucinations to remove false positives.

Conclusion

There's no one-size-fits-all solution to the hallucination problem. Start with automated checks, and then experiment with approaches in both development and production to maximize your odds of success.
