Evaluators
Evaluators evaluate the live production data and test results that you submit from your generative AI pipelines and score them on a scale that you specify.
Creating evaluators
Create an evaluator by selecting "Evaluators" underneath the relevant pipeline and then selecting "New evaluator".
Evaluator scoring
Evaluators must define a scoring method. Gentrace provides two different ways to define scoring.
Enum scoring
In enum scoring, the evaluator output must map options to percentage values, which are used to calculate the final score of evaluations. All enum options must have a percentage mapping. We compute the final score of a series of evaluations by using the associated percentage.
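As a rough illustration of enum scoring, the sketch below maps each option to a percentage and averages the mapped values across a series of evaluations. The option names, the percentages, the `aggregateEnumScores` helper, and the use of a plain average are all assumptions made for this example, not a description of Gentrace's internal computation.

```typescript
// Hypothetical enum-to-percentage mapping; option names and values are illustrative.
const optionScores: Record<string, number> = {
  A: 1.0,
  B: 0.75,
  C: 0.5,
  D: 0.0,
};

// Converts each selected option to its mapped percentage and averages the results.
// Averaging is an assumption about how a series of evaluations is aggregated.
function aggregateEnumScores(
  selected: string[],
  mapping: Record<string, number>
): number {
  if (selected.length === 0) {
    throw new Error("No evaluations to aggregate");
  }
  let total = 0;
  for (const option of selected) {
    const score = mapping[option];
    if (score === undefined) {
      // Every enum option must have a percentage mapping.
      throw new Error(`No percentage mapping for option "${option}"`);
    }
    total += score;
  }
  return total / selected.length;
}

const finalScore = aggregateEnumScores(["A", "C", "D"], optionScores);
console.log(finalScore); // 0.5
```

An option without a mapping raises an error here, which mirrors the requirement that all enum options have a percentage mapping.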
Percentage scoring
In percentage scoring, the evaluator must output a value between 0 and 1, inclusive.
Evaluator types
Gentrace has three evaluator types.
AI evaluators
AI evaluators use AI to evaluate your generative output. We support several AI model providers including OpenAI (e.g. gpt-4o, gpt-4o-mini), Google (Gemini), Anthropic (e.g. Claude 3.5 Sonnet), and Azure OpenAI (same models) to grade test results.
💡 Want to use your own models as evaluators? Check out our guide on custom provider models to learn how to integrate your own AI models for evaluation.
With AI evaluators, you can:
- Classify the sentiment of the output.
- Semantically compare the output to the test case.
- Verify that the output is logically consistent throughout the entire generation.
Defining AI grading prompts
In any AI grading prompt, you can reference values from a test case or a test result.
Test case values
- {{ inputs.<key> }} references a key that is specified on the "inputs" value from a test case
- {{ expectedOutputs.<key> }} references a key that is specified on the "Expected" value from a test case
Test output values
- {{ outputs.<key> }} references a key that is specified in the actual outputs from a test output that is created by your generative pipeline
- {{ steps }} references the list of output steps that is created by your generative pipeline
- (Comparison only) {{ comparison.outputs }} is the same as the outputs object but for the comparison result in a comparative evaluator
- (Comparison only) {{ comparison.steps }} is the same as the steps array but for the comparison result in a comparative evaluator
Let's make this concrete. Suppose we created a test case and a corresponding test result. Here are the attributes of the test case and test result.
Test case
toml
[name]
Message to Lex Luthor

[input]
{
  "sender": "[email protected]",
  "receiver": "[email protected]",
  "query": "bragging about superiority"
}

[expectedOutputs]
{
  "value": "Dear Lex,\\n\\nI hope this email finds you well. I am writing to address some concerns I have about your \\nrecent behavior. It has come to my attention that you have been bragging about your superiority \\nover others, and I must say, I find this behavior quite concerning.\\n\\nAs a member of the Justice League, I believe in working together as equals, rather than trying \\nto prove oneself superior to others. It is important that we all work towards a common goal, \\nrather than trying to one-up each other.\\n\\nI hope you will take my concerns to heart and reconsider your behavior. If you would like to \\ndiscuss this further, I am available to talk.\\n\\nSincerely,\\nDiana"
}
Test result
toml
[outputs]
{
  "value": "Dear Lex,\\n\\nI am writing to express my concern about some recent comments you made regarding your superiority \\nover others. As a fellow superhero, I believe it is important that we work together as a team, \\nrather than trying to prove ourselves better than one another.\\n\\nI understand that competition can be healthy, but when it becomes a matter of superiority, it \\ncan create unnecessary tension and division. Our goal should be to work towards a common good, \\nand to support each other in achieving our objectives.\\n\\nI would appreciate it if you could reconsider your attitude towards others, and focus on \\ncollaboration rather than competition. If you would like to discuss this further, please let me \\nknow.\\n\\nSincerely,\\nDiana"
}
You could write an AI evaluator prompt similar to below that leverages this data.
handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from {{ inputs.sender }} to {{ inputs.receiver }} {{ inputs.query }}
************
[Expert]:
{{ expectedOutputs.value }}
************
[Submission]:
{{ outputs.value }}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.
The final interpolated string that is sent to OpenAI would look like this.
handlebars
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Task]: Write an email from [email protected] to [email protected] bragging about superiority
************
[Expert]:
Dear Lex,

I hope this email finds you well. I am writing to address some concerns I have about your
recent behavior. It has come to my attention that you have been bragging about your superiority
over others, and I must say, I find this behavior quite concerning.

As a member of the Justice League, I believe in working together as equals, rather than trying
to prove oneself superior to others. It is important that we all work towards a common goal,
rather than trying to one-up each other.

I hope you will take my concerns to heart and reconsider your behavior. If you would like to
discuss this further, I am available to talk.

Sincerely,
Diana
************
[Submission]:
Dear Lex,

I am writing to express my concern about some recent comments you made regarding your superiority
over others. As a fellow superhero, I believe it is important that we work together as a team,
rather than trying to prove ourselves better than one another.

I understand that competition can be healthy, but when it becomes a matter of superiority, it
can create unnecessary tension and division. Our goal should be to work towards a common good,
and to support each other in achieving our objectives.

I would appreciate it if you could reconsider your attitude towards others, and focus on
collaboration rather than competition. If you would like to discuss this further, please let me
know.

Sincerely,
Diana
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any
differences in style, grammar, or punctuation. The submitted answer may either be a subset
or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.
In this case:
- {{ inputs.sender }} maps to [email protected]
- {{ inputs.receiver }} maps to [email protected]
- {{ inputs.query }} maps to bragging about superiority
- {{ outputs.value }} maps to the test result email specified above
- {{ expectedOutputs.value }} maps to the test case ideal email specified above
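To illustrate how placeholders resolve against test data, here is a minimal sketch of dotted-path interpolation. The `lookup` and `interpolate` helpers and the sample values are hypothetical; this shows the general idea only and is not how Gentrace renders prompts internally.

```typescript
type TemplateData = Record<string, unknown>;

// Looks up a dotted path such as "inputs.sender" in a nested object.
function lookup(data: TemplateData, path: string): unknown {
  let current: unknown = data;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Replaces each {{ path }} placeholder with the value found at that path.
// Missing values render as an empty string (an assumption for this sketch).
function interpolate(template: string, data: TemplateData): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_match: string, path: string) => {
      const value = lookup(data, path);
      return value === undefined || value === null ? "" : String(value);
    }
  );
}

// Illustrative data only; the sender and receiver values are placeholders.
const rendered = interpolate(
  "Write an email from {{ inputs.sender }} to {{ inputs.receiver }} {{ inputs.query }}",
  { inputs: { sender: "Diana", receiver: "Lex", query: "bragging about superiority" } }
);
console.log(rendered);
// Write an email from Diana to Lex bragging about superiority
```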
Testing AI evaluators
Gentrace provides an interactive tool to check that your AI prompt works correctly before integrating with code. When you create an AI evaluator with a prompt, you can test it by specifying a test case and example output.
AI starting templates
We provide several AI templates as a starting point to define AI evaluators. Templates provide reasonable default scoring values and a prompt template that works for many generic use cases.
- Factual consistency: this template is inspired by OpenAI's Fact eval template. It compares the output to the benchmark for factual consistency.
- Company safety policy: this template compares whether the output conforms to a company safety policy.
- Grammar: this template evaluates the grammatical correctness of the output.
- Sentiment analysis: this template classifies the sentiment of the output.
- Logical consistency: this template determines whether the output is logically consistent.
You can find out more information about these templates in our AI evaluator creation tool.
Chain-of-Thought (CoT) prompting
Our templates use a prompting strategy called Chain-of-Thought Prompting.
This form of prompting helps large language models perform complex reasoning. It works by providing the model with a series of intermediate reasoning steps, which the model can then use to solve the problem at hand.
Here's an example from our grammar template.
You are evaluating an AI-generated output for its grammatical correctness and spelling. Here is
the data:
[BEGIN DATA]
************
[Submission]:
{{ outputs.value }}
************
[END DATA]

Evaluate whether the AI-generated response is grammatically correct and free of spelling errors.
Ignore any differences in style or punctuation. The submitted answer may be entirely grammatically
correct and free of spelling errors, or it may have some discrepancies. Determine which case
applies. Answer the question by selecting one of the following options:
(A) The AI-generated response is grammatically correct and has no spelling errors.
(B) The AI-generated response is grammatically correct but contains minor spelling errors.
(C) The AI-generated response contains minor grammatical errors but no spelling errors.
(D) The AI-generated response contains both minor grammatical errors and spelling errors.
(E) The AI-generated response contains significant grammatical errors or spelling errors.

Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.
The last two lines are what make this prompt a CoT prompt.
Explain your reasoning, then on the last line of your answer, write the letter corresponding to
your choice. Do not write anything except the letter.
OpenAI recommends using this prompting method to get the best results for its GPT models.
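Because the prompt asks the model to put its choice on the last line, the selected option can be read back from the end of the response. The sketch below shows one way to do that; the `parseChoice` helper and its tolerance for parentheses around the letter are assumptions made for this example, not Gentrace's actual parsing logic.

```typescript
// Extracts the option letter from the last non-empty line of a model response.
// Returns null when that line is not a single letter from A to E
// (optionally wrapped in parentheses).
function parseChoice(response: string): string | null {
  const lines = response
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) {
    return null;
  }
  const lastLine = lines[lines.length - 1];
  const match = /^\(?([A-E])\)?$/.exec(lastLine);
  return match ? match[1] : null;
}

const choice = parseChoice(
  "The response has one misspelled word but sound grammar.\n\nB"
);
console.log(choice); // B
```

Returning null for an unparseable response lets the caller decide whether to retry or record a failed evaluation.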
Image evaluation
Gentrace supports certain models that can evaluate images, e.g. GPT-4 with Vision.
Submit images to Gentrace
To use an image evaluator, first send image URLs to Gentrace.
These can be in outputs, metadata, expected outputs, or even on step metadata.
If you want Gentrace to host the image, first upload it to Gentrace, then include the URL in the relevant location.
If you do not host it on Gentrace, make sure the URL is publicly accessible.
Here's an example code snippet where Gentrace hosts the images.
typescript
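import { promises as fs } from "fs";
import * as path from "path";

// makeHTML and convertHtmlToFsPng are application-specific helpers (not shown).
// uploadFile, getTestCases, submitTestResult, TestCase, and pipelineSlug are assumed to be in scope.

const runOneTest = async (testCase: TestCase) => {
  // uses an LLM to generate HTML from a prompt
  const htmlStr = await makeHTML(testCase.inputs.prompt);

  // converts the HTML to an image; skips the upload when no HTML was generated
  const fPath = htmlStr ? await convertHtmlToFsPng(htmlStr) : null;
  if (!fPath) {
    return { output: htmlStr, image: null };
  }

  // uploads the image to the Gentrace server
  const buffer = await fs.readFile(fPath);
  const fName = path.basename(fPath);
  const imageUrl = await uploadFile(
    new File([buffer], fName, { type: "image/png", lastModified: Date.now() })
  );

  // returns both the raw HTML and image for evaluation
  return { output: htmlStr, image: imageUrl };
};

const test = async () => {
  const tcs = await getTestCases(pipelineSlug);
  const outputs = await Promise.all(tcs.map((testCase) => runOneTest(testCase)));
  await submitTestResult(pipelineSlug, tcs, outputs);
};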
Set up an image evaluator
First, pick a model that supports image evaluation (for example, "OpenAI GPT-4 with Vision"). Then, specify the paths to the image URLs in your expectedOutputs, inputs, outputs, or steps that you want to evaluate.
When you run the evaluator, it generates a report that includes the images and the evaluation.
Heuristic evaluators
You can define functions to assess quality. For example, you can:
- Check that the output contains a whitelist (or blacklist) of keywords
- Check that the output only contains alphanumeric characters
- Validate that the output is in JSON format
Heuristic functions can be written in either JavaScript or Python.
Heuristic functions are passed attributes that you can use in your evaluation. These attributes are listed below.
Test case values
- inputs is an object that represents all the key/value pairs that were defined on the "inputs" value of a test case
- expectedOutputs is an object that represents all of the key/value pairs that were defined on the "expected outputs" value from a test case
Test output values
- outputs is an object with the actual output key/value pairs from a test run of your generative pipeline
- steps references the list of output steps that is created by your generative pipeline
- (Comparison only) comparison.outputs is the same as the outputs object but for the comparison output in a comparative evaluator
- (Comparison only) comparison.steps is the same as the steps array but for the comparison output in a comparative evaluator
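As an illustration of a heuristic function that uses these attributes, the sketch below checks that an output is valid JSON and returns a percentage score of 1 or 0. The `HeuristicArgs` type, the `value` output key, and the `evaluate({ ... })` function shape are assumptions for this example; check the evaluator editor for the exact signature it expects.

```typescript
// Assumed shape of the attributes passed to a heuristic function.
type HeuristicArgs = {
  inputs?: Record<string, unknown>;
  expectedOutputs?: Record<string, unknown>;
  outputs: Record<string, unknown>;
};

// Returns 1 when outputs.value is a string containing valid JSON, otherwise 0.
function evaluate({ outputs }: HeuristicArgs): number {
  const value = outputs.value;
  if (typeof value !== "string") {
    return 0;
  }
  try {
    JSON.parse(value);
    return 1;
  } catch {
    return 0;
  }
}

console.log(evaluate({ outputs: { value: '{"status": "ok"}' } })); // 1
console.log(evaluate({ outputs: { value: "not json" } })); // 0
```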
JavaScript functions and sandboxing
JavaScript heuristic functions are executed in sandboxed WebAssembly QuickJS contexts. Functions cannot invoke network requests or access host system resources. ES2023 JavaScript language features are supported out of the box.
Python functions and included imports
These imports are available for use in Python heuristic functions:
- base64
- bisect
- bz2
- calendar
- cmath
- codecs
- collections
- configparser
- copy
- csv
- datetime
- decimal
- difflib
- enum
- fractions
- functools
- graphlib
- gzip
- hashlib
- heapq
- hmac
- html
- ipaddress
- itertools
- json
- locale
- lzma
- mailbox
- marshal
- math
- mimetypes
- netrc
- numbers
- operator
- plistlib
- quopri
- random
- re
- secrets
- statistics
- string
- stringprep
- struct
- tarfile
- textwrap
- time
- tomllib
- types
- uuid
- weakref
- xml
- zipfile
- zlib
- zoneinfo
Testing heuristic evaluators
Gentrace provides an interactive tool to check that your heuristic evaluator works correctly. Specify a test case, expected outputs, and test outputs to run the evaluation.
Human evaluators
Certain generative pipelines might require that you spend time manually reviewing before deploying your changes. Human evaluators provide manual annotation workflows for your team to grade.
When creating your evaluators, write instructions in markdown that your team can follow to grade your results.
Your team can then manually grade a generative output using the markdown instructions that you provide.
Comparative evaluators
Comparative evaluators compare the output of two different results. This evaluator type can be either a heuristic or AI evaluator.
The main difference between comparative and non-comparative evaluators is that comparative evaluators do not run when a test result is submitted. They run specifically when comparing two test results within the Gentrace UI.
For example, you can compare:
- Which output has more professional language
- Which output is more succinct
- Which output gives a more comprehensive answer to a question
Setting run condition
To specify that an evaluator is comparative, you must set the run condition of the evaluator to Comparison (2). This specifies that the evaluator will run only when comparing two test results.
Scoring
With comparative evaluators, you specify a score from 0% to 100% either as an enum or percentage value. If the score is:
- Closer to 0%, the baseline output more closely satisfies the evaluation criteria.
- Closer to 100%, the comparison output more closely satisfies the evaluation criteria.
- Exactly at 50%, the baseline and comparison outputs equally match the evaluation criteria.
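The scoring convention above can be sketched as a small helper that turns a comparative score into a verdict. The `interpretComparativeScore` name and the `Verdict` type are hypothetical and exist only to illustrate the 0%, 50%, and 100% reference points; scores are expressed as fractions between 0 and 1.

```typescript
type Verdict = "baseline" | "comparison" | "tie";

// Interprets a comparative score expressed as a fraction between 0 and 1.
// Below 0.5 favors the baseline, above 0.5 favors the comparison, 0.5 is a tie.
function interpretComparativeScore(score: number): Verdict {
  if (Number.isNaN(score) || score < 0 || score > 1) {
    throw new RangeError("Score must be between 0 and 1, inclusive");
  }
  if (score === 0.5) {
    return "tie";
  }
  return score < 0.5 ? "baseline" : "comparison";
}

console.log(interpretComparativeScore(0)); // baseline
console.log(interpretComparativeScore(0.5)); // tie
console.log(interpretComparativeScore(1)); // comparison
```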
The image below shows an example of enum scoring.
Accessing comparison data
With regular AI and heuristic evaluators, you have access to outputs and steps variables that contain the output data.
With comparative evaluators, you can access an additional variable comparison that contains the comparison's outputs and steps data.
Here's an AI evaluation prompt that uses this comparison data. This prompt evaluates which email is better written.
handlebars
[BEGIN DATA]
************
[Task]: Which email is better written?
************
[Baseline]: {{ outputs.value }}
************
[Comparison]: {{ comparison.outputs.value }}
************
[END DATA]

(A) Comparison
(B) Baseline
(C) About the same

Explain your reasoning, then on the last line of your answer, write the letter corresponding to your choice. Do not
write anything except the letter.
Here's a heuristic evaluation function that uses comparison data. The function evaluates which output is shorter.
typescript
function evaluate({ outputs, comparison }) {
  if (outputs.value.length < comparison.outputs.value.length) {
    // The baseline is shorter. Closer to 0% means the baseline output is better
    return 0;
  } else if (outputs.value.length > comparison.outputs.value.length) {
    // The comparison is shorter. Closer to 100% means the comparison output is better
    return 1;
  } else {
    // Exactly at 50% means both outputs are equally good
    return 0.5;
  }
}
Testing
You can use the evaluation editor to test that you have created your comparative evaluator correctly.
If your run condition is set to Comparison (2), you should instead see two "Test outputs" blocks where you specify your baseline and comparison outputs.
Creating comparative evaluations
To run your comparative evaluator on two test results, select a test result and then select another test result to compare against.
Once you have selected two test results, you can run your comparative evaluator to create evaluations between the results.
Classification evaluators
Classification evaluators evaluate whether or not a classification matches an expected classification.
Use them to create confusion matrices and aggregate statistics on classification outputs.
Setup
To set up a classification evaluator, first pick the classification template during setup.
Then, choose the paths to the predicted value (generally in your outputs or processed values) and expected value (generally in your expectedOutputs).
Finally, choose whether this value is binary or multiclass.
For binary outputs, we convert all values to strings and treat "false", "False", "f", "F", "0", "undefined", "null", or "" as falsey, and anything else as truthy.
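The binary conversion rule described above can be sketched as follows. The `FALSEY_STRINGS` set and `toBinary` helper are hypothetical names used for illustration; the list of falsey strings is taken directly from the rule above.

```typescript
// String forms that are treated as falsey for binary classification.
const FALSEY_STRINGS = new Set<string>([
  "false",
  "False",
  "f",
  "F",
  "0",
  "undefined",
  "null",
  "",
]);

// Converts any value to a string, then treats it as truthy unless the
// string form appears in the falsey list.
function toBinary(value: unknown): boolean {
  return !FALSEY_STRINGS.has(String(value));
}

console.log(toBinary("False")); // false
console.log(toBinary(0)); // false
console.log(toBinary(null)); // false
console.log(toBinary("yes")); // true
console.log(toBinary("FALSE")); // true, because "FALSE" is not in the list
```

Note that the match is exact, so a string such as "FALSE" or "no" counts as truthy under this rule.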
For multiclass, we expect you to define the possible set of values up front.
Now, when you submit new results to Gentrace or re-run evaluators on old results, Gentrace computes aggregate statistics and confusion matrices for these evaluators.
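To make the idea of a confusion matrix concrete, the sketch below counts (expected, predicted) pairs over a set of classes defined up front and derives accuracy from the diagonal. The class names, the `buildConfusionMatrix` and `accuracy` helpers, and the sample data are all hypothetical; this illustrates the general technique rather than Gentrace's own computation.

```typescript
type Pair = { expected: string; predicted: string };
// matrix[expected][predicted] holds the number of results with that combination.
type ConfusionMatrix = Record<string, Record<string, number>>;

function buildConfusionMatrix(classes: string[], pairs: Pair[]): ConfusionMatrix {
  const matrix: ConfusionMatrix = {};
  // Initialize every cell to zero so unseen combinations still appear.
  for (const expected of classes) {
    matrix[expected] = {};
    for (const predicted of classes) {
      matrix[expected][predicted] = 0;
    }
  }
  for (const { expected, predicted } of pairs) {
    if (classes.indexOf(expected) === -1 || classes.indexOf(predicted) === -1) {
      throw new Error(`Unknown class in pair: ${expected} / ${predicted}`);
    }
    matrix[expected][predicted] += 1;
  }
  return matrix;
}

// Accuracy is the share of results on the diagonal (expected === predicted).
function accuracy(classes: string[], matrix: ConfusionMatrix, total: number): number {
  let correct = 0;
  for (const label of classes) {
    correct += matrix[label][label];
  }
  return total === 0 ? 0 : correct / total;
}

const classes = ["spam", "ham"];
const pairs: Pair[] = [
  { expected: "spam", predicted: "spam" },
  { expected: "spam", predicted: "ham" },
  { expected: "ham", predicted: "ham" },
  { expected: "ham", predicted: "ham" },
];
const matrix = buildConfusionMatrix(classes, pairs);
console.log(matrix.spam.ham); // 1
console.log(accuracy(classes, matrix, pairs.length)); // 0.75
```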