Releases

October 24, 2024

Production evaluation graphs

Production evaluators now automatically create graphs to show how performance is trending over time.

For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average output "Safety" trends over time.
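
As a rough sketch of the kind of evaluator this enables, the snippet below asks an LLM judge to score an output; the prompt, the gpt-4o model name, and the 0 to 1 scale are illustrative assumptions, not Gentrace's built-in evaluator API.

```typescript
// Illustrative only: the prompt, judge model, and 0-1 scale are assumptions,
// not Gentrace's built-in "Safety" evaluator.
import OpenAI from "openai";

const openai = new OpenAI();

// Ask an LLM judge whether a pipeline output complies with your AI safety policy.
async function safetyScore(output: string): Promise<number> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // any judge model works; swap in your own
    messages: [
      {
        role: "system",
        content:
          "You are a safety reviewer. Reply with a single number between 0 and 1, " +
          "where 1 means the text fully complies with our AI safety policy.",
      },
      { role: "user", content: output },
    ],
  });
  return parseFloat(completion.choices[0].message.content ?? "0");
}
```

The new graphs then plot the average of scores like this over time.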

Local evals & local datasets

Gentrace now makes it easier to define local evaluations and to use completely local data and datasets.

This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
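
As a loose sketch of the pattern this enables, a local dataset and a local evaluator can live entirely inside an ordinary Vitest suite; `summarize` and `containsNoPII` below are hypothetical stand-ins for your own pipeline and heuristic evaluator, not Gentrace SDK calls.

```typescript
// Illustrative pattern only: `summarize` and `containsNoPII` are hypothetical
// stand-ins for your own pipeline and local evaluator, not Gentrace SDK calls.
import { describe, it, expect } from "vitest";

// A completely local dataset: nothing here needs to be uploaded anywhere.
const localCases = [
  { input: "Customer asked for a refund on order #1234.", mustInclude: "refund" },
  { input: "User reported the app crashes on login.", mustInclude: "crash" },
];

// Placeholder pipeline; in practice this calls the LLM feature under test.
async function summarize(text: string): Promise<string> {
  return text;
}

// Local heuristic evaluator: flag anything that looks like an email address.
function containsNoPII(text: string): boolean {
  return !/[\w.+-]+@[\w-]+\.[\w.]+/.test(text);
}

describe("summarization pipeline (local eval)", () => {
  it.each(localCases)("keeps the key fact for: $input", async ({ input, mustInclude }) => {
    const summary = await summarize(input);
    expect(containsNoPII(summary)).toBe(true);
    expect(summary.toLowerCase()).toContain(mustInclude);
  });
});
```

Because nothing in the test depends on remote state, it runs in CI like any other unit test.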

Improvements [6]

  • Persistent user-specific view settings, which can be saved and overridden from a URL
  • Filter test runs by their input values
  • Added explicit compare button
  • Pinecone v3 (Node) support
  • o1 support
  • Fixed 68 bugs
July 09, 2024

Added explicit compare button

Over the past several months, you let us know that Gentrace's results view was too complex. This made it hard for users to adopt Gentrace without being taught by someone else.
To solve this, Gentrace has revamped the core test result UI, decomposing the old, cluttered test result view into three clearer, more focused views.
All of the following views are also now real-time, so you can watch as evaluation results from LLMs, heuristics, or humans stream in.

Aggregate comparison

The new aggregate view shows the statistical differences between different versions of your LLM-based feature.

List of test cases with evaluations

The new list view shows all of the scores for different test cases.

When doing a comparison, it highlights what changed between the versions being compared.

Drilldown per test case

The new drilldown view presents a clear picture of the output(s) for a specific test case. It includes the JSON representation, evaluations, and timeline.

Accessing the legacy view

For the time being, you can access the old view by clicking the "Legacy results page" link at the top of the new results view. Please provide us with feedback if you find yourself going back to that view.

Datasets

Test cases in Gentrace pipelines currently work well when only one engineer is working on a pipeline. Once more than one engineer works on the same pipeline, however, managing test data in Gentrace becomes difficult.
In practice, engineers end up cloning pipelines or overwriting test data, both of which have significant drawbacks.
To solve this, Gentrace has:

  • Introduced datasets, which organize test data into separate groups within a pipeline
  • Migrated existing test data into a default "Golden dataset"
  • Made existing API routes and SDK methods operate on the "Golden dataset" by default, and added optional parameters or new versions that let you specify an alternative dataset (see the sketch below)

Please give us feedback on how datasets feel.
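
As a purely hypothetical sketch of what selecting a dataset looks like from code, the helper below passes an optional dataset identifier when fetching test cases; the endpoint path and the datasetId parameter name are assumptions for illustration, not confirmed Gentrace API names.

```typescript
// Hypothetical sketch: the endpoint path and the `datasetId` parameter are
// assumptions for illustration, not confirmed Gentrace API names.
const GENTRACE_API_KEY = process.env.GENTRACE_API_KEY ?? "";

async function getTestCases(pipelineSlug: string, datasetId?: string) {
  const url = new URL("https://gentrace.ai/api/v2/test-cases");
  url.searchParams.set("pipelineSlug", pipelineSlug);
  if (datasetId) {
    // When omitted, results come from the pipeline's default "Golden dataset".
    url.searchParams.set("datasetId", datasetId);
  }
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${GENTRACE_API_KEY}` },
  });
  return response.json();
}
```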

Test result settings memory

Settings in any of the test result pages (such as hiding evaluators; collapsing inputs, outputs, or metadata; and re-ordering fields) are now remembered across test results in the same pipeline.
This makes it easier to see exactly what you want (and only exactly what you want), without having to redo your work every time.
