How to test for AI hallucination

Method 1: Automated fact comparison

A fast, general-purpose way to test for hallucination is to automatically compare your generative AI's output for a task against one or more ground truths or expected values.

Compile a test dataset

In order to perform automated factual comparison, first create a test dataset.

A good test dataset consists of inputs, each paired with some ground truth. The ground truth can be organized however you want, but I find the easiest way is to use a single value representing an ideal response from the generative AI.

To borrow from our RAG testing guide, a good test dataset might look like this:
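As an illustration, here is a minimal sketch of such a dataset in Python. The questions, the `input` / `expected` field names (chosen to match the placeholders in the grading prompt later in this post), and the `validate_dataset` helper are all assumptions made for this example, not the format from the RAG guide.

```python
# Hypothetical test cases: each pairs an input with a single ideal response.
test_dataset = [
    {
        "input": "What year was the Eiffel Tower completed?",
        "expected": "The Eiffel Tower was completed in 1889.",
    },
    {
        "input": "Which planet is closest to the Sun?",
        "expected": "Mercury is the planet closest to the Sun.",
    },
]


def validate_dataset(cases):
    """Return the number of cases, raising ValueError on a missing or blank field."""
    for index, case in enumerate(cases):
        for key in ("input", "expected"):
            value = case.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"case {index} is missing '{key}'")
    return len(cases)
```

Checking the dataset up front is optional, but it catches blank ground truths before they skew your grades.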

Grade the output

Run your test dataset inputs through your generative AI feature, converting the inputs to outputs.
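A rough sketch of this step, assuming your generative AI feature can be called as a plain Python function that takes an input string and returns an output string. Both `generate_outputs` and `fake_feature` are hypothetical names used only for illustration.

```python
from typing import Callable


def generate_outputs(dataset: list[dict], feature: Callable[[str], str]) -> list[dict]:
    """Return a copy of each test case with the feature's output attached."""
    results = []
    for case in dataset:
        # Copy the case so the original dataset is left untouched.
        results.append({**case, "output": feature(case["input"])})
    return results


# Stand-in for the real feature, used only to make the sketch runnable.
def fake_feature(prompt: str) -> str:
    return f"Answer to: {prompt}"


results = generate_outputs(
    [{"input": "What is 2 + 2?", "expected": "4"}],
    fake_feature,
)
```

Keeping the input, expected value, and output together in one record makes the grading step easier to wire up.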

Then, grade the output using an automated comparison of factuality. To do this, use AI-graded (also called model-graded) evaluation. This technique prompts an AI and asks it to do the grading for us.

I recommend using the following prompt as a starting point (from the OpenAI evals Fact template).

You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{ input }}
************
[Expert]: {{ expected }}
************
[Submission]: {{ output }}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
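Filling in a template like this and reading back the grader's choice could be sketched as follows. The helper names `render_prompt` and `parse_choice` are hypothetical, and the parser assumes the grader replies either with a bare letter or with the letter in parentheses, such as "(B)". Adjust it to whatever reply format you actually ask for.

```python
import re


def render_prompt(template: str, values: dict) -> str:
    """Fill {{ name }} placeholders in the template with the given values."""
    def substitute(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def parse_choice(reply: str, choices: str = "ABCDE"):
    """Return the chosen letter from the grader's reply, or None if unclear."""
    text = reply.strip()
    # Case 1: the reply is just the letter.
    if len(text) == 1 and text in choices:
        return text
    # Case 2: the reply contains the letter in parentheses, e.g. "(B)".
    match = re.search(rf"\(([{choices}])\)", text)
    return match.group(1) if match else None
```

Returning `None` for an unparseable reply lets you count grading failures separately instead of silently mislabeling them.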

As you can see, this template returns a letter from A to E. Let's dive into what each of them means.

(A) means factual omission - one or more facts in the expected value did not make it into the actual output
(B) means POTENTIAL HALLUCINATION - there are some "new facts" in the output that weren't present in the expected output
(C) means factual consistency - there is no factual difference
(D) means inaccuracy - there is an actual discrepancy between the facts
(E) means style-only differences - there is no factual difference, but they do look different

B (and occasionally D) indicate possible hallucination. Measure how often they occur across your test dataset. Consider also tracking omission (A).
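Once every test case has a letter grade, measuring these rates is simple arithmetic. Here is a minimal sketch, where the `grades` list is made-up data and `grade_rates` is a hypothetical helper.

```python
from collections import Counter

# Meaning of each grade, as described above.
LABELS = {
    "A": "omission",
    "B": "potential hallucination",
    "C": "consistent",
    "D": "inaccuracy",
    "E": "style-only difference",
}


def grade_rates(grades: list[str]) -> dict[str, float]:
    """Return the fraction of graded cases that received each letter."""
    counts = Counter(grades)
    total = len(grades)
    return {
        letter: (counts[letter] / total if total else 0.0)
        for letter in LABELS
    }


# Made-up grades for ten test cases.
rates = grade_rates(["C", "B", "A", "C", "D", "C", "E", "B", "C", "C"])
possible_hallucination_rate = rates["B"] + rates["D"]
```

Tracking B and D separately as well as together is useful, since D also covers plain inaccuracies that aren't hallucinations.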

But wait, can I really trust an LLM to grade itself?

It's a good question; in this case, the answer is generally yes.

To give a feel for why, consider how the LLM originally did a very difficult thing - produce something from scratch.

In this evaluation, the LLM does something much easier - compare two things.

As a result, even if the same model is used to produce and evaluate the output, it will generally do a much better job of finding factual mismatches than it did of producing the result.

Method 2: Automated specific checks

Consider the case of this email generator.

Jack Matthews is never mentioned as a Stanford alumnus, but GPT-3.5, attempting to write the best email possible, decides to slip it in.

Let's say this is happening frequently in your email generator, far more often than any other form of hallucination.

Method 1 is overkill in this case. We just need to create an evaluator that looks out for this particular form of hallucination.

Setup

As in Method 1, first produce a test dataset. This time, there's no need for expected values.

Once you convert the inputs into outputs, again leverage model-graded evaluation, only this time with a more specific prompt:

Does the following email mention where the author went to college? Ignore mentions of where others (including the recipient) went to college.
[BEGIN DATA]
************
[Email]: {{ output }}
************
[END DATA]

(A) The email mentions where the author went to college.
(B) The email does not mention where the author went to college.

The number of (A) answers is the number of occurrences of this specific type of hallucination.
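Counting those answers might look like the sketch below. It assumes a `grader` function that takes an email and returns "A" or "B". The `keyword_grader` here is only a stand-in so the example runs without a model; in practice the grader would send the prompt above to an LLM.

```python
from typing import Callable


def count_flagged(emails: list[str], grader: Callable[[str], str]) -> int:
    """Count emails for which the grader answers "A"."""
    return sum(1 for email in emails if grader(email) == "A")


# Stand-in grader for demonstration only. It is NOT a real hallucination
# check; a real grader would call an LLM with the prompt above.
def keyword_grader(email: str) -> str:
    return "A" if "fellow stanford alum" in email.lower() else "B"


# Made-up generated emails.
emails = [
    "Hi Jack, as a fellow Stanford alum, I wanted to reach out.",
    "Hi Jack, I saw your talk last week and wanted to reach out.",
]
flagged = count_flagged(emails, keyword_grader)
hallucination_rate = flagged / len(emails)
```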

In production

Note that this form of automated evaluation has a nice characteristic - it doesn't rely on ground truth / an expected value, so it can be run against production traffic as it happens.

Consider siphoning ~100 samples per day from your production traffic to specific evaluations like this to measure production hallucination.
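One way to pick those samples without storing a full day of traffic is reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. This is a sketch of that general technique, not a description of any particular product's sampling. The function name and the `seed` parameter are assumptions for the example.

```python
import random
from typing import Iterable, Optional


def reservoir_sample(stream: Iterable, k: int = 100, seed: Optional[int] = None) -> list:
    """Return up to k items chosen uniformly at random from the stream."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = item
    return sample


# Example: sample 100 generations from 1,000 made-up request ids.
daily_sample = reservoir_sample(range(1000), k=100, seed=0)
```

If a day's traffic is smaller than `k`, the function simply returns everything it saw.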

Method 3: Manual expert review

The third method is to have experts manually review generations for hallucination.

To make this work correctly, make sure that experts have access to all relevant information. It should be easy for the expert to validate / invalidate facts if they don't have the information in their head.

This approach is slow as it requires manual human intervention.

To make things faster, pair this method with the automated methods above. For example, you may want to run automated checks on 100% of test outputs but only manually review 10% of them.
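Choosing which outputs go to an expert can be as simple as a seeded random draw. In this sketch, `pick_for_review` is a hypothetical helper and the fixed seed is an assumption, used so the same subset is selected if the script is rerun.

```python
import random


def pick_for_review(outputs: list, fraction: float = 0.10, seed: int = 0) -> list:
    """Return a random subset of outputs (at least one) for manual review."""
    if not outputs:
        return []
    count = max(1, round(len(outputs) * fraction))
    return random.Random(seed).sample(outputs, count)


# Example: 100 made-up output ids, 10% of which go to an expert.
all_outputs = [f"output-{i}" for i in range(100)]
review_queue = pick_for_review(all_outputs)
```

You may prefer to skew the draw toward outputs the automated checks already flagged. Uniform sampling is just the simplest starting point.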

Method 4: End-user feedback

The last way to measure hallucination is to provide end-users with a way to flag it.

An example: you implement a thumbs up / thumbs down UI that shows up whenever a generation is presented to the user. When the user clicks the thumbs down, the UI asks why the generation was wrong, with hallucination as one of the options.

The data isn't perfect. End-users flag hallucinations that aren't actually there, or more frequently, they miss hallucinations or fail to report them.

Even so, the data is directionally useful, as this "hallucination report rate" can be used as a benchmark for a (likely higher) true hallucination rate.

Additionally, the data can be improved by combining this mechanism with other methods above; e.g. you could have an expert review the potential hallucinations to remove false positives.
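Here is a minimal sketch of both numbers, assuming each generation shown to a user is logged as one record. The record shape (`reason` and `expert_confirmed` fields) is invented for this example and will differ in your own logging.

```python
def hallucination_report_rate(events: list[dict]) -> float:
    """Fraction of shown generations that users flagged as hallucination."""
    if not events:
        return 0.0
    flagged = sum(1 for event in events if event.get("reason") == "hallucination")
    return flagged / len(events)


def confirmed_report_rate(events: list[dict]) -> float:
    """Same as above, but counting only flags an expert has confirmed."""
    if not events:
        return 0.0
    confirmed = sum(
        1
        for event in events
        if event.get("reason") == "hallucination" and event.get("expert_confirmed")
    )
    return confirmed / len(events)


# Made-up feedback log: one record per generation shown to a user.
events = [
    {"rating": "up", "reason": None},
    {"rating": "down", "reason": "hallucination", "expert_confirmed": True},
    {"rating": "down", "reason": "hallucination", "expert_confirmed": False},
    {"rating": "down", "reason": "tone"},
]
report_rate = hallucination_report_rate(events)
confirmed_rate = confirmed_report_rate(events)
```

The gap between the two rates gives a rough sense of how many user reports are false positives.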

Conclusion

There's no one-size-fits-all solution to the hallucination problem. Start with automated checks, and then experiment with approaches in both development and production to maximize your odds of success.

