Releases
Production evaluation graphs
Production evaluators now automatically create graphs to show how performance is trending over time.
For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average "Safety" score trends over time.
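As a rough illustration of the idea (not Gentrace's evaluator API), an LLM-as-a-judge safety scorer boils down to prompting a judge model with your policy and the output, then parsing a structured verdict. In the sketch below, the `callJudgeModel` helper and the prompt format are hypothetical stand-ins:

```typescript
// Illustrative LLM-as-a-judge "Safety" scorer. The judge prompt and the
// `callJudgeModel` helper are hypothetical; they stand in for however you
// call your judge model.
type SafetyScore = { compliant: boolean; reasoning: string };

// Hypothetical helper: sends a prompt to a judge model and returns raw text.
declare function callJudgeModel(prompt: string): Promise<string>;

async function scoreSafety(output: string, policy: string): Promise<SafetyScore> {
  const prompt = [
    "You are a safety reviewer.",
    `Policy:\n${policy}`,
    `Output to review:\n${output}`,
    'Respond with JSON: {"compliant": true or false, "reasoning": "..."}',
  ].join("\n\n");

  const raw = await callJudgeModel(prompt);
  // Parse the judge's structured verdict. Averaging the resulting 0/1
  // compliance values over time is what produces the trend graph.
  return JSON.parse(raw) as SafetyScore;
}
```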
Local evals & local datasets
Gentrace now makes it easier to define local evaluations and to use completely local data and datasets.
This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
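To make the "works with existing unit testing frameworks" point concrete, here is a hedged sketch of a fully local evaluation in a Vitest-style test. The dataset, the `translate` function under test, and the `exactMatch` scorer are all hypothetical and live entirely in your repo; none of this is the actual Gentrace SDK surface.

```typescript
// Hypothetical example of a fully local evaluation inside a standard
// unit-testing framework (Vitest/Jest style).
import { describe, it, expect } from "vitest";

// Local dataset: plain objects checked into the repository.
const cases = [
  { input: "Translate 'hello' to French", expected: "bonjour" },
  { input: "Translate 'goodbye' to French", expected: "au revoir" },
];

// Hypothetical LLM-backed function under test.
declare function translate(input: string): Promise<string>;

// Local evaluator: a simple heuristic score computed in-process.
function exactMatch(output: string, expected: string): number {
  return output.trim().toLowerCase() === expected ? 1 : 0;
}

describe("translation pipeline", () => {
  it.each(cases)("scores $input", async ({ input, expected }) => {
    const output = await translate(input);
    expect(exactMatch(output, expected)).toBe(1);
  });
});
```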
Added explicit compare button
Over the past several months, you've told us that Gentrace's results view was too complex, which made it hard to adopt Gentrace without being walked through it by someone else.
To solve this, we've revamped the core test result UI, splitting the old, cluttered test result view into three focused views.
We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.
Aggregate comparison
The new aggregate view shows the statistical differences between different versions of your LLM-based feature.
List of test cases with evaluations
The new list view shows all of the scores for different test cases.
When doing a comparison, it highlights what changed between the versions being compared.
Drilldown per test case
The new drilldown view presents a clear picture of the output(s) for a specific test case. It includes the JSON representation, evaluations, and timeline.
Accessing the legacy view
For the time being, you can access the old view by clicking the "Legacy results page" link at the top of the new results view. Please provide us with feedback if you find yourself going back to that view.
Datasets
Gentrace test cases in pipelines currently work well when only one engineer is working on the pipeline. However, once more than one engineer is working on the same pipeline, it becomes difficult to manage test data in Gentrace.
In practice, engineers end up cloning pipelines or overwriting test data, both of which have significant drawbacks.
To solve this, Gentrace has:
- Introduced datasets, which organize test data into separate groups within a pipeline
- Migrated existing test data into a default "Golden dataset"
- Made existing API routes and SDK methods operate on the "Golden dataset" by default, and added optional parameters or new versions that let you specify an alternative dataset (see the sketch below)
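As a sketch of what the dataset-aware call pattern might look like, the method and parameter names below are assumptions for illustration, not the exact SDK signatures:

```typescript
// Hypothetical sketch of the dataset-aware call pattern; the function and
// parameter names here are illustrative, not the exact SDK surface.
declare function getTestCases(params: {
  pipelineSlug: string;
  datasetId?: string; // optional: omit to use the pipeline's "Golden dataset"
}): Promise<Array<{ id: string; inputs: Record<string, unknown> }>>;

async function loadCases() {
  // Existing calls keep working: without a dataset, the pipeline's
  // "Golden dataset" is used.
  const goldenCases = await getTestCases({ pipelineSlug: "translate" });

  // New optional parameter: target a specific dataset, e.g. one owned by
  // a teammate. The ID below is a placeholder.
  const stagingCases = await getTestCases({
    pipelineSlug: "translate",
    datasetId: "dataset_staging_placeholder",
  });

  return { goldenCases, stagingCases };
}
```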
Please give us feedback on how datasets feel.
Test result settings memory
Settings in any of the test result pages (such as hiding evaluators; collapsing inputs, outputs, or metadata; re-ordering fields; etc.) are now remembered across test results in the same pipeline.
This makes it easier to see exactly what you want (and only exactly what you want), without having to redo your work every time.