
AI evaluators

AI evaluators use AI to evaluate your generative output. We use OpenAI (gpt-4-0613 with function calling) to grade test results. We will add additional foundation model providers as configurable options in the future.

Evaluator enum scoring template

With AI evaluators, you can:

  • Classify the sentiment of the output.
  • Semantically compare the output to the test case.
  • Verify that the output is logically consistent throughout the entire generation.

Defining AI grading prompts

In any AI grading prompt, you can reference values from a test case or a test result.

Test case values

  • {{ inputs.<key> }} references a key that is specified on the "inputs" value from a test case
  • {{ expectedOutputs.<key> }} references a key that is specified on the "Expected" value from a test case

Test result values

  • {{ outputs.<key> }} references a key that is specified in the actual outputs of a test result created by your generative pipeline
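
Taken together, the values a grading prompt can reference have roughly the shape below. This is a minimal TypeScript sketch: the field names mirror the placeholders above, and modeling every value as a string is an assumption made for illustration only.

typescript
// Rough shape of the data an AI grading prompt can reference.
// The field names mirror the {{ ... }} placeholders above; treating every
// value as a string is an assumption made for illustration only.
interface EvaluatorPromptContext {
  inputs: Record<string, string>;          // "inputs" from the test case
  expectedOutputs: Record<string, string>; // "Expected" value from the test case
  outputs: Record<string, string>;         // actual outputs from the test result
}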

Let's make this concrete. We created a test case and a corresponding test result for it. Here are the attributes of the test case and test result.

Test case

toml
[name]
Message to Lex Luthor
[inputs]
{
"sender": "[email protected]",
"receiver": "[email protected]",
"query": "bragging about superiority"
}
[expectedOutputs]
{
"value": "Dear Lex,\\n\\nI hope this email finds you well. I am writing to address some concerns I have about your \\nrecent behavior. It has come to my attention that you have been bragging about your superiority \\nover others, and I must say, I find this behavior quite concerning.\\n\\nAs a member of the Justice League, I believe in working together as equals, rather than trying \\nto prove oneself superior to others. It is important that we all work towards a common goal, \\nrather than trying to one-up each other.\\n\\nI hope you will take my concerns to heart and reconsider your behavior. If you would like to \\ndiscuss this further, I am available to talk.\\n\\nSincerely,\\nDiana"
}

Test result

toml
[outputs]
{
"value": "Dear Lex,\\n\\nI am writing to express my concern about some recent comments you made regarding your superiority \\nover others. As a fellow superhero, I believe it is important that we work together as a team, \\nrather than trying to prove ourselves better than one another.\\n\\nI understand that competition can be healthy, but when it becomes a matter of superiority, it \\ncan create unnecessary tension and division. Our goal should be to work towards a common good, \\nand to support each other in achieving our objectives.\\n\\nI would appreciate it if you could reconsider your attitude towards others, and focus on \\ncollaboration rather than competition. If you would like to discuss this further, please let me \\nknow.\\n\\nSincerely,\\nDiana"
}
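
Assembled into the shape sketched earlier, this test case and test result would give the evaluator roughly the following data. This is a hand-written illustration, and the email bodies are truncated for brevity.

typescript
// Hypothetical context assembled from the test case and test result above.
// It matches the EvaluatorPromptContext shape sketched earlier; the email
// bodies are truncated here for brevity.
const context = {
  inputs: {
    sender: "[email protected]",
    receiver: "[email protected]",
    query: "bragging about superiority",
  },
  expectedOutputs: {
    value: "Dear Lex,\n\nI hope this email finds you well. ...", // full expected email
  },
  outputs: {
    value: "Dear Lex,\n\nI am writing to express my concern ...", // full generated email
  },
};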

You could write an AI evaluator prompt like the one below that leverages this data.

handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from {{ inputs.sender }} to {{ inputs.receiver }} {{ inputs.query }}
************
[Expert]:
{{ expectedOutputs.value }}
************
[Submission]:
{{ outputs.value }}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

The final interpolated string that is sent to OpenAI would look like this.

handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from [email protected] to [email protected] bragging about superiority
************
[Expert]:
Dear Lex,
I hope this email finds you well. I am writing to address some concerns I have about your
recent behavior. It has come to my attention that you have been bragging about your superiority
over others, and I must say, I find this behavior quite concerning.
As a member of the Justice League, I believe in working together as equals, rather than trying
to prove oneself superior to others. It is important that we all work towards a common goal,
rather than trying to one-up each other.
I hope you will take my concerns to heart and reconsider your behavior. If you would like to
discuss this further, I am available to talk.
Sincerely,
Diana
************
[Submission]:
Dear Lex,
I am writing to express my concern about some recent comments you made regarding your superiority
over others. As a fellow superhero, I believe it is important that we work together as a team,
rather than trying to prove ourselves better than one another.
I understand that competition can be healthy, but when it becomes a matter of superiority, it
can create unnecessary tension and division. Our goal should be to work towards a common good,
and to support each other in achieving our objectives.
I would appreciate it if you could reconsider your attitude towards others, and focus on
collaboration rather than competition. If you would like to discuss this further, please let me
know.
Sincerely,
Diana
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

In this case,

  • {{ inputs.sender }} maps to [email protected]
  • {{ inputs.receiver }} maps to [email protected]
  • {{ inputs.query }} maps to bragging about superiority
  • {{ outputs.value }} maps to the test result email specified above
  • {{ expectedOutputs.value }} maps to the ideal email from the test case specified above
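
Gentrace performs this interpolation for you, but you can reproduce it with the standard Handlebars library. The sketch below uses the handlebars npm package; it illustrates the mechanics and is not Gentrace's internal implementation.

typescript
import Handlebars from "handlebars";

// Interpolates an evaluator prompt template with test case and test result values.
// noEscape keeps Handlebars from HTML-escaping the plain-text values.
function interpolateEvaluatorPrompt(
  promptTemplate: string,
  context: {
    inputs: Record<string, string>;
    expectedOutputs: Record<string, string>;
    outputs: Record<string, string>;
  },
): string {
  return Handlebars.compile(promptTemplate, { noEscape: true })(context);
}

// Usage: pass the evaluator prompt shown earlier plus the context object
// sketched after the test result to obtain the fully interpolated string.
// const finalPrompt = interpolateEvaluatorPrompt(evaluatorPrompt, context);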

Testing AI evaluators

Gentrace provides an interactive tool to check that your AI grading prompt works correctly before integrating it with your code. When you create an AI evaluator with a prompt, you can test it by specifying a test case and an example output.


AI Templates

We provide several AI templates as a starting point to define AI evaluators. Templates provide reasonable default scoring values and a prompt template that works for many generic use cases.

  • Factual consistency: this template is inspired by OpenAI's Fact eval template. It compares the output to the benchmark for factual consistency.
  • Company safety policy: this template compares whether the output conforms to a company safety policy.
  • Sentiment: this template classifies the sentiment of the output.
  • Grammar: this template evaluates the grammatical correctness of the output.
  • Logical consistency: this template determines whether the output is logically consistent.

You can find out more information about these templates in our AI evaluator creation tool.

Chain-of-Thought (CoT) Prompting

Our templates use a prompting strategy called Chain-of-Thought Prompting.

This form of prompting helps large language models perform complex reasoning. It works by having the model produce a series of intermediate reasoning steps before committing to a final answer, which it can use to work through the problem at hand.

Here's an example from our grammar template.

You are evaluating an AI-generated output for its grammatical correctness and spelling. Here is
the data:
[BEGIN DATA]
************
[Submission]:
{{ outputs.value }}
************
[END DATA]
Evaluate whether the AI-generated response is grammatically correct and free of spelling errors.
Ignore any differences in style or punctuation. The submitted answer may be entirely grammatically
correct and free of spelling errors, or it may have some discrepancies. Determine which case
applies. Answer the question by selecting one of the following options:
(A) The AI-generated response is grammatically correct and has no spelling errors.
(B) The AI-generated response is grammatically correct but contains minor spelling errors.
(C) The AI-generated response contains minor grammatical errors but no spelling errors.
(D) The AI-generated response contains both minor grammatical errors and spelling errors.
(E) The AI-generated response contains significant grammatical errors or spelling errors.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

The last two lines are what makes this prompt a CoT prompt.

Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

OpenAI recommends using this prompting method to get the best results for its GPT models.
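
To make the grading call itself concrete, here is a rough sketch of sending a fully interpolated CoT grading prompt to gpt-4-0613 and capturing the final letter through function calling with the openai Node SDK. The function name, schema, and response parsing are assumptions for illustration, not Gentrace's internal implementation.

typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Sends a fully interpolated CoT grading prompt to gpt-4-0613 and captures the
// final letter grade via function calling. The function name and schema here
// are hypothetical and shown only to illustrate the overall pattern.
async function gradeWithFunctionCalling(finalPrompt: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4-0613",
    messages: [{ role: "user", content: finalPrompt }],
    functions: [
      {
        name: "report_grade",
        description: "Report the reasoning and the final letter grade",
        parameters: {
          type: "object",
          properties: {
            reasoning: { type: "string", description: "Step-by-step reasoning" },
            grade: { type: "string", enum: ["A", "B", "C", "D", "E"] },
          },
          required: ["reasoning", "grade"],
        },
      },
    ],
    function_call: { name: "report_grade" }, // force a structured grade
  });

  const call = response.choices[0].message.function_call;
  const { grade } = JSON.parse(call?.arguments ?? "{}");
  return grade; // one of "A" through "E"
}

Forcing the function call keeps the reasoning and the final letter in structured fields, so the grade can be read without parsing the last line of free-form text.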