Releases
Production evaluation graphs
Production evaluators now automatically create graphs to show how performance is trending over time.
For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average "Safety" score trends over time.
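As a rough illustration of the idea (not Gentrace's evaluator API), an LLM-as-a-judge safety scorer boils down to prompting a judge model with your policy and the output, then parsing a structured verdict. In the sketch below, the `callJudgeModel` helper and the prompt format are hypothetical stand-ins:

```typescript
// Illustrative LLM-as-a-judge "Safety" scorer. The judge prompt and the
// `callJudgeModel` helper are hypothetical; they stand in for however you
// call your judge model.
type SafetyScore = { compliant: boolean; reasoning: string };

// Hypothetical helper: sends a prompt to a judge model and returns raw text.
declare function callJudgeModel(prompt: string): Promise<string>;

async function scoreSafety(output: string, policy: string): Promise<SafetyScore> {
  const prompt = [
    "You are a safety reviewer.",
    `Policy:\n${policy}`,
    `Output to review:\n${output}`,
    'Respond with JSON: {"compliant": true or false, "reasoning": "..."}',
  ].join("\n\n");

  const raw = await callJudgeModel(prompt);
  // Parse the judge's structured verdict. Averaging the resulting 0/1
  // compliance values over time is what produces the trend graph.
  return JSON.parse(raw) as SafetyScore;
}
```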
Local evals & local datasets
Gentrace now makes it easier to define local evaluations and to use completely local data and datasets.
This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
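To make the "works with existing unit testing frameworks" point concrete, here is a hedged sketch of a fully local evaluation in a Vitest-style test. The dataset, the `translate` function under test, and the `exactMatch` scorer are all hypothetical and live entirely in your repo; none of this is the actual Gentrace SDK surface.

```typescript
// Hypothetical example of a fully local evaluation inside a standard
// unit-testing framework (Vitest/Jest style).
import { describe, it, expect } from "vitest";

// Local dataset: plain objects checked into the repository.
const cases = [
  { input: "Translate 'hello' to French", expected: "bonjour" },
  { input: "Translate 'goodbye' to French", expected: "au revoir" },
];

// Hypothetical LLM-backed function under test.
declare function translate(input: string): Promise<string>;

// Local evaluator: a simple heuristic score computed in-process.
function exactMatch(output: string, expected: string): number {
  return output.trim().toLowerCase() === expected ? 1 : 0;
}

describe("translation pipeline", () => {
  it.each(cases)("scores $input", async ({ input, expected }) => {
    const output = await translate(input);
    expect(exactMatch(output, expected)).toBe(1);
  });
});
```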
Added explicit compare button
Over the past several months, you've told us that Gentrace's results view was too complex, which made it hard to adopt Gentrace without being walked through it by someone else.
To solve this, we've revamped the core test result UI, splitting the old, cluttered test result view into three focused views.
We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.
Aggregate comparison
The new aggregate view shows the statistical differences between different versions of your LLM-based feature.
List of test cases with evaluations
The new list view shows all of the scores for different test cases.
When doing a comparison, it highlights what changed between the versions being compared.
Drilldown per test case
The new drilldown view presents a clear picture of the output(s) for a specific test case. It includes the JSON representation, evaluations, and timeline.
Accessing the legacy view
For the time being, you can access the old view by clicking the "Legacy results page" link at the top of the new results view. Please provide us with feedback if you find yourself going back to that view.
Datasets
Gentrace test cases in pipelines currently work well when only one engineer is working on the pipeline. However, once more than one engineer is working on the same pipeline, it becomes difficult to manage test data in Gentrace.
In practice, engineers end up cloning pipelines or overwriting test data, both of which have significant drawbacks.
To solve this, Gentrace has:
- Introduced datasets, which organize test data into separate groups within a pipeline
- Migrated existing test data into a default "Golden dataset"
- Made existing API routes and SDK methods operate on the "Golden dataset" by default, and added optional parameters or new versions that let you specify an alternative dataset (see the sketch below)
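As a sketch of what the dataset-aware call pattern might look like, the method and parameter names below are assumptions for illustration, not the exact SDK signatures:

```typescript
// Hypothetical sketch of the dataset-aware call pattern; the function and
// parameter names here are illustrative, not the exact SDK surface.
declare function getTestCases(params: {
  pipelineSlug: string;
  datasetId?: string; // optional: omit to use the pipeline's "Golden dataset"
}): Promise<Array<{ id: string; inputs: Record<string, unknown> }>>;

async function loadCases() {
  // Existing calls keep working: without a dataset, the pipeline's
  // "Golden dataset" is used.
  const goldenCases = await getTestCases({ pipelineSlug: "translate" });

  // New optional parameter: target a specific dataset, e.g. one owned by
  // a teammate. The ID below is a placeholder.
  const stagingCases = await getTestCases({
    pipelineSlug: "translate",
    datasetId: "dataset_staging_placeholder",
  });

  return { goldenCases, stagingCases };
}
```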
Please give us feedback on how datasets feel.
Test result settings memory
Settings in any of the test result pages (such as hiding evaluators; collapsing inputs, outputs, or metadata; re-ordering fields; etc.) are now remembered across test results in the same pipeline.
This makes it easier to see exactly what you want (and only exactly what you want), without having to redo your work every time.