Releases
Production evaluation graphs
Production evaluators now automatically create graphs to show how performance is trending over time.
For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average output "Safety" trends over time.
Local evals & local datasets
Gentrace now allows you to more easily define local evaluations and use completely local data / datasets.
This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
Revamped, realtime test results
Over the past several months, you let us know that Gentrace's results view was too complex. This made it hard for users to adopt Gentrace without being taught by someone else.
To solve this, Gentrace has revamped the core test result UI, decomposing the old, cluttered test result view into three different ones for better clarity.
We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.
Aggregate comparison
The new aggregate view shows the statistical differences between different versions of your LLM-based feature.
Added explicit compare button
Over the past several months, you let us know that Gentrace's results view was too complex. This made it hard for users to adopt Gentrace without being taught by someone else.
To solve this, Gentrace has revamped the core test result UI, decomposing the old, cluttered test result view into three different ones for better clarity.
We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.
Aggregate comparison
The new aggregate view shows the statistical differences between different versions of your LLM-based feature.
List of test cases with evaluations
The new list view shows all of the scores for different test cases.
When doing a comparison, it highlights what changed between the different modes.
Drilldown per test case
The new drilldown view presents a clear picture of the output(s) for a specific test case. It includes the JSON representation, evaluations, and timeline.
Accessing the legacy view
For the time being, you can access the old view by clicking the "Legacy results page" link at the top of the new results view. Please provide us with feedback if you find yourself going back to that view.