
Evaluators

Evaluators score the live production data and test results that you submit from your generative AI pipelines on a scale that you specify.

Creating evaluators

Create an evaluator by selecting "Evaluators" underneath the relevant pipeline and then selecting "New evaluator".

Create new evaluator

Evaluator scoring

Evaluators must define a scoring method. Gentrace provides two different ways to define scoring.

Enum scoring

In enum scoring, each enum option that the evaluator can output is mapped to a percentage value, and these percentages are used to calculate the final score of evaluations. All enum options must have a percentage mapping. The final score for a series of evaluations is computed from the associated percentages.

Evaluator enum scoring
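
For illustration only, here is a minimal sketch of how enum options mapped to percentages could roll up into a final score. The option names and the averaging step are assumptions made for the example, not a description of Gentrace's exact internal computation.

typescript
// Hypothetical enum-to-percentage mapping (configured in the evaluator UI).
const enumScores: Record<string, number> = {
  A: 1.0, // fully correct
  B: 0.5, // partially correct
  C: 0.0, // incorrect
};

// Enum options produced by the evaluator across a series of evaluations.
const evaluations = ["A", "A", "B", "C"];

// Map each option to its percentage and roll the results up (here, a simple mean).
const finalScore =
  evaluations.reduce((sum, option) => sum + enumScores[option], 0) /
  evaluations.length;

console.log(finalScore); // 0.625 → 62.5%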

Percentage scoring

In percentage scoring, the evaluator must output a value between 0 and 1, inclusive.

Evaluator types

Gentrace has three core evaluator types: AI, heuristic, and human. Comparative and classification evaluators are covered further below.

AI evaluators

AI evaluators use AI to evaluate your generative output. We allow using several OpenAI models (including the gpt-4 and gpt-3.5-turbo families) to grade test results. We will add additional foundation model providers as configurable options in the future.

Evaluator enum scoring template

With AI evaluators, you can:

  • Classify the sentiment of the output.
  • Semantically compare the output to the test case.
  • Verify that the output is logically consistent throughout the entire generation.

Defining AI grading prompts

In any AI grading prompt, you can reference values from a test case or a test result.

Test case values

  • {{ inputs.<key> }} references a key that is specified on the "inputs" value from a test case
  • {{ expectedOutputs.<key> }} references a key that is specified on the "Expected" value from a test case

Test output values

  • {{ outputs.<key> }} references a key in the actual outputs of a test output created by your generative pipeline
  • {{ steps }} references the list of output steps created by your generative pipeline
  • (Comparison only) {{ comparison.outputs }} is the same as the outputs object but for the comparison result in a comparative evaluator
  • (Comparison only) {{ comparison.steps }} is the same as the steps array but for the comparison result in a comparative evaluator

Let's make this concrete. Below are the attributes of an example test case and a corresponding test result produced by the pipeline.

Test case

toml
[name]
Message to Lex Luthor
[input]
{
"sender": "[email protected]",
"receiver": "[email protected]",
"query": "bragging about superiority"
}
[expectedOutputs]
{
"value": "Dear Lex,\\n\\nI hope this email finds you well. I am writing to address some concerns I have about your \\nrecent behavior. It has come to my attention that you have been bragging about your superiority \\nover others, and I must say, I find this behavior quite concerning.\\n\\nAs a member of the Justice League, I believe in working together as equals, rather than trying \\nto prove oneself superior to others. It is important that we all work towards a common goal, \\nrather than trying to one-up each other.\\n\\nI hope you will take my concerns to heart and reconsider your behavior. If you would like to \\ndiscuss this further, I am available to talk.\\n\\nSincerely,\\nDiana"
}

Test result

toml
[outputs]
{
"value": "Dear Lex,\\n\\nI am writing to express my concern about some recent comments you made regarding your superiority \\nover others. As a fellow superhero, I believe it is important that we work together as a team, \\nrather than trying to prove ourselves better than one another.\\n\\nI understand that competition can be healthy, but when it becomes a matter of superiority, it \\ncan create unnecessary tension and division. Our goal should be to work towards a common good, \\nand to support each other in achieving our objectives.\\n\\nI would appreciate it if you could reconsider your attitude towards others, and focus on \\ncollaboration rather than competition. If you would like to discuss this further, please let me \\nknow.\\n\\nSincerely,\\nDiana"
}

You could write an AI evaluator prompt similar to the one below that leverages this data.

handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from {{ inputs.sender }} to {{ inputs.receiver }} {{ inputs.query }}
************
[Expert]:
{{ expectedOutputs.value }}
************
[Submission]:
{{ outputs.value }}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

The final interpolated string that is sent to OpenAI would look like this.

handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from [email protected] to [email protected] bragging about superiority
************
[Expert]:
Dear Lex,
I hope this email finds you well. I am writing to address some concerns I have about your
recent behavior. It has come to my attention that you have been bragging about your superiority
over others, and I must say, I find this behavior quite concerning.
As a member of the Justice League, I believe in working together as equals, rather than trying
to prove oneself superior to others. It is important that we all work towards a common goal,
rather than trying to one-up each other.
I hope you will take my concerns to heart and reconsider your behavior. If you would like to
discuss this further, I am available to talk.
Sincerely,
Diana
************
[Submission]:
Dear Lex,
I am writing to express my concern about some recent comments you made regarding your superiority
over others. As a fellow superhero, I believe it is important that we work together as a team,
rather than trying to prove ourselves better than one another.
I understand that competition can be healthy, but when it becomes a matter of superiority, it
can create unnecessary tension and division. Our goal should be to work towards a common good,
and to support each other in achieving our objectives.
I would appreciate it if you could reconsider your attitude towards others, and focus on
collaboration rather than competition. If you would like to discuss this further, please let me
know.
Sincerely,
Diana
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

In this case,

  • {{ inputs.sender }} maps to [email protected]
  • {{ inputs.receiver }} maps to [email protected]
  • {{ inputs.query }} maps to bragging about superiority
  • {{ outputs.value }} maps to the test result email specified above
  • {{ expectedOutputs.value }} maps to the test case ideal email specified above

Testing AI evaluators

Gentrace provides an interactive tool to check that your AI prompt works correctly before integrating with code. When you create an AI evaluator with a prompt, you can test it by specifying a test case and example output.

Testing AI evaluators

AI starting templates

We provide several AI templates as a starting point to define AI evaluators. Templates provide reasonable default scoring values and a prompt template that works for many generic use cases.

  • Factual consistency: this template is inspired by OpenAI's Fact eval template. It compares the output to the benchmark for factual consistency.
  • Company safety policy: this template compares whether the output conforms to a company safety policy.
  • Grammar: this template evaluates the grammatical correctness of the output.
  • Sentiment analysis: this template classifies the sentiment of the output.
  • Logical consistency: this template determines whether the output is logically consistent.

You can find out more information about these templates in our AI evaluator creation tool.

Chain-of-Thought (CoT) prompting

Our templates use a prompting strategy called Chain-of-Thought Prompting.

This form of prompting helps large language models perform complex reasoning. It works by having the model produce a series of intermediate reasoning steps before committing to a final answer.

Here's an example from our grammar template.

You are evaluating an AI-generated output for its grammatical correctness and spelling. Here is
the data:
[BEGIN DATA]
************
[Submission]:
{{ outputs.value }}
************
[END DATA]
Evaluate whether the AI-generated response is grammatically correct and free of spelling errors.
Ignore any differences in style or punctuation. The submitted answer may be entirely grammatically
correct and free of spelling errors, or it may have some discrepancies. Determine which case
applies. Answer the question by selecting one of the following options:
(A) The AI-generated response is grammatically correct and has no spelling errors.
(B) The AI-generated response is grammatically correct but contains minor spelling errors.
(C) The AI-generated response contains minor grammatical errors but no spelling errors.
(D) The AI-generated response contains both minor grammatical errors and spelling errors.
(E) The AI-generated response contains significant grammatical errors or spelling errors.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

The last two lines are what makes this prompt a CoT prompt.

Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.

OpenAI recommends using this prompting method to get the best results for its GPT models.

Image evaluation

Gentrace supports certain models that can evaluate images, e.g., GPT-4 with Vision.

Submit images to Gentrace

To use an image evaluator, first send image URLs to Gentrace.

These can be in outputs, metadata, expected outputs, or even on step metadata.

If you want Gentrace to host the image, first upload it to Gentrace, then include the URL in the relevant location.

If you do not host it on Gentrace, make sure the URL is publicly accessible.

Here's an example code snippet where Gentrace hosts the images.

typescript
import fs from "fs/promises";
import path from "path";

// getTestCases, submitTestResult, and uploadFile are Gentrace SDK helpers;
// makeHTML and convertHtmlToFsPng are app-specific helpers (not shown).
const runOneTest = async (testCase: TestCase) => {
  // Uses an LLM to generate HTML from a prompt
  const htmlStr = await makeHTML(testCase.inputs.prompt);
  // Converts the HTML to a PNG file on disk
  const fPath = htmlStr ? await convertHtmlToFsPng(htmlStr) : null;
  if (!fPath) {
    return { output: htmlStr, image: null };
  }
  // Uploads the image to the Gentrace server
  const buffer = await fs.readFile(fPath);
  const fName = path.basename(fPath);
  const imageUrl = await uploadFile(
    new File([buffer], fName, { type: "image/png", lastModified: Date.now() })
  );
  // Returns both the raw HTML and the hosted image URL for evaluation
  return { output: htmlStr, image: imageUrl };
};

const test = async () => {
  const tcs = await getTestCases(pipelineSlug);
  const outputs = await Promise.all(tcs.map((testCase) => runOneTest(testCase)));
  await submitTestResult(pipelineSlug, tcs, outputs);
};

Set up an image evaluator

First, pick a model that supports image evaluation (for example, "OpenAI GPT-4 with Vision"). Then, specify the paths to the image URLs in your expectedOutputs, inputs, outputs, or steps that you want to evaluate.
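
For instance, with the runOneTest snippet above, each submitted output contains the image URL under the image key, so that is the path you would point the evaluator at. The object below is a rough, hypothetical sketch of one such output (the URL is a placeholder).

typescript
// Hypothetical shape of a single output submitted by runOneTest above.
const exampleOutput = {
  output: "<html>...</html>", // raw generated HTML
  image: "https://example.com/rendered.png", // hosted PNG of the rendered HTML
};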

Image evaluator

When you run the evaluator, it generates a report that includes the images and the evaluation.

Image evaluation report

Image eval description

Heuristic evaluators

You can define functions to assess quality. For example, you can:

  • Check that the output contains required keywords from a whitelist (or excludes keywords from a blacklist)
  • Check that the output only contains alphanumeric characters
  • Validate that the output is in JSON format

Heuristic functions can be written in either JavaScript or Python.

Heuristic evaluator functions in JavaScript or Python

Heuristic evaluator function: Language dropdown

Heuristic functions are passed attributes that you can use in your evaluation. These attributes are listed below.

Test case values

  • inputs is an object that represents all the key/value pairs that were defined on the "inputs" value of a test case
  • expectedOutputs is an object that represents all of the key/value pairs that were defined on the "expected outputs" value from a test case

Test output values

  • outputs is an object with the actual output key/value pairs from a test run of your generative pipeline
  • steps references the list of output steps that is created by your generative pipeline.
  • (Comparison only) comparison.outputs is the same as the outputs object but for the comparison output in a comparative evaluator
  • (Comparison only) comparison.steps is the same as the steps array but for the comparison output in a comparative evaluator
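
To make these attributes concrete, here is an illustrative JavaScript heuristic function that validates that the output is JSON, one of the example checks above. It is a sketch, not an official Gentrace template, and it assumes the same evaluate({ ... }) signature used by the comparative example later on this page.

typescript
// Illustrative heuristic function (assumed signature, not an official template).
// Receives the attributes described above and returns a percentage score.
function evaluate({ inputs, expectedOutputs, outputs, steps }) {
  try {
    // Validate that the output is JSON: 1 = pass, 0 = fail.
    JSON.parse(outputs.value);
    return 1;
  } catch (err) {
    return 0;
  }
}
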
JavaScript functions and sandboxing

JavaScript heuristic functions are executed in sandboxed WebAssembly QuickJS contexts. Functions cannot invoke network requests or access host system resources. ES2023 JavaScript language features are supported out of the box.

Python functions and included imports

These imports are available for use in Python heuristic functions:

  • base64
  • bisect
  • bz2
  • calendar
  • cmath
  • codecs
  • collections
  • configparser
  • copy
  • csv
  • datetime
  • decimal
  • difflib
  • enum
  • fractions
  • functools
  • graphlib
  • gzip
  • hashlib
  • heapq
  • hmac
  • html
  • ipaddress
  • itertools
  • json
  • locale
  • lzma
  • mailbox
  • marshal
  • math
  • mimetypes
  • netrc
  • numbers
  • operator
  • plistlib
  • quopri
  • random
  • re
  • secrets
  • statistics
  • string
  • stringprep
  • struct
  • tarfile
  • textwrap
  • time
  • tomllib
  • types
  • uuid
  • weakref
  • xml
  • zipfile
  • zlib
  • zoneinfo

Testing heuristic evaluators

Gentrace provides an interactive tool to check that your heuristic evaluator works correctly. Specify a test case, expected outputs, and test outputs to run the evaluation.

Test heuristic evaluator

Human evaluators

Certain generative pipelines might require manual review before you deploy changes. Human evaluators provide manual annotation workflows so that your team can grade results.

When creating your evaluators, write instructions in markdown that your team can follow to grade your results.

Create human evaluator

Your team can then manually grade a generative output using the markdown instructions that you provide.

Comparative evaluators

Comparative evaluators compare the output of two different results. This evaluator type can be either a heuristic or AI evaluator.

The main difference between comparative and non-comparative evaluators is that comparative evaluators do not run when a test result is submitted. They run specifically when comparing two test results within the Gentrace UI.

For example, you can compare:

  • Which output has more professional language
  • Which output is more succinct
  • Which output gives a more comprehensive answer to a question

Setting run condition

To specify that an evaluator is comparative, you must set the run condition of the evaluator to Comparison (2). This specifies that the evaluator will run only when comparing two test results.

Comparative evaluator run condition

Scoring

With comparative evaluators, you specify a score from 0% to 100% either as an enum or percentage value. If the score is:

  • Closer to 0%, the baseline output more closely satisfies the evaluation criteria.
  • Closer to 100%, the comparison output more closely satisfies the evaluation criteria.
  • Exactly at 50%, the baseline and comparison outputs equally match the evaluation criteria.

The image below shows an example of enum scoring.

Comparative evaluator scoring

Accessing comparison data

With regular AI and heuristic evaluators, you have access to outputs and steps variables that contain the output data.

With comparative evaluators, you can access an additional variable comparison that contains the comparison's outputs and steps data.

Here's an AI evaluation prompt that uses this comparison data. This prompt evaluates which email is better written.

handlebars
[BEGIN DATA]
************
[Task]: Which email is better written?
************
[Baseline]: {{ outputs.value }}
************
[Comparison]: {{ comparison.outputs.value }}
************
[END DATA]
(A) Comparison
(B) Baseline
(C) About the same
Explain your reasoning, then on the last line of your answer, write the letter corresponding to your choice. Do not
write anything except the letter.

Here's a heuristic evaluation function that uses comparison data. The function evaluates which output is shorter.

typescript
function evaluate({ outputs, comparison }) {
  if (outputs.value.length < comparison.outputs.value.length) {
    // Closer to 0% means the baseline output better satisfies the criteria (it is shorter)
    return 0;
  } else if (outputs.value.length > comparison.outputs.value.length) {
    // Closer to 100% means the comparison output better satisfies the criteria (it is shorter)
    return 1;
  } else {
    // Exactly 50% means both outputs satisfy the criteria equally (same length)
    return 0.5;
  }
}

Testing

You can use the evaluation editor to test that you have created your comparative evaluator correctly.

When your run condition is set to Comparison (2), the editor shows two "Test outputs" blocks so that you can specify both the baseline and comparison outputs.

Comparative evaluator output blocks

Creating comparative evaluations

To run your comparative evaluator on two test results, select a test result and then select another test result to compare against.

Comparative evaluator - select comparison

Once you have selected two test results, you can run your comparative evaluator to create evaluations between the results.

Comparative evaluator - run comparatives

Comparative evaluator - final evaluations

Classification evaluators

Classification evaluators check whether a predicted classification matches an expected classification.

Use them to create confusion matrices and aggregate statistics on classification outputs.

Classification evaluator - result

Setup

To set up a classification evaluator, first pick the classification template during setup:

Classification template

Then, choose the paths to the predicted value (generally in your outputs or processed values) and expected value (generally in your expectedOutputs).

Classification binary

Finally, choose whether this value is binary or multiclass.

For binary outputs, we convert all values to strings and treat "false", "False", "f", "F", "0", "undefined", "null", or "" as falsey, and anything else as truthy.
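
As a rough sketch of that coercion rule (illustrative only, not Gentrace's internal code):

typescript
// Values treated as falsey after string conversion, per the rule above.
const FALSEY = new Set(["false", "False", "f", "F", "0", "undefined", "null", ""]);

function toBinary(value: unknown): boolean {
  // Convert the value to a string, then check it against the falsey set.
  return !FALSEY.has(String(value));
}

toBinary("F"); // false
toBinary(0); // false ("0" after string conversion)
toBinary(undefined); // false ("undefined" after string conversion)
toBinary("yes"); // true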

For multiclass, we expect you to define the possible set of values up front.

Classification multiclass

Now, when you submit new results to Gentrace or re-run evaluators on old results, Gentrace computes aggregate statistics and confusion matrices for these evaluators.