Version: 4.7.1


Gentrace evaluates generative AI in development and production using heuristics, AI, and manual human grading. This helps developers catch regressions and improve the quality of their generative AI features and products.

The Problem

In traditional software development, you write a series of tests and use a test runner like Jest (JavaScript/TypeScript) or Pytest (Python) to validate the expected behavior of your code locally.

Then before deploying your code, you use a CI/CD pipeline to validate your code on more complex integration tests. If a test fails, your CI/CD system throws an error notifying you that something is wrong with your code. This prevents costly regressions.

With generative AI, this model breaks down. The output of generative AI often does not have a definitive answer that can be evaluated deterministically. As a result, AI developers don’t have clear visibility into whether changes will cause regressions.

Why evaluate with Gentrace?

Gentrace evaluates your code locally, in CI/CD pipelines, and in production using evaluators based on heuristics, AI, and manual human workflows. These evaluators are designed specifically to evaluate generative output.

For example:

  • If your generative AI should only output JSON, define a heuristic evaluator to validate that it parses correctly.
  • If your generative AI should comply with your company’s safety policy, define an AI evaluator that scores how well it complies.
  • If your pipeline is especially important and you need a human to manually review the output, define a human evaluator that creates an annotation workflow for your team to review the output.
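
As an illustration of the first example, the core check of a JSON heuristic could look like the sketch below. It is plain Python and does not use the Gentrace SDK; the function name `is_valid_json` is a hypothetical choice for this example, and how such a check is registered as an evaluator is not shown here.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if `output` parses as JSON, False otherwise."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON; TypeError: input is not str/bytes.
        return False
    return True


print(is_valid_json('{"answer": 42}'))                           # True
print(is_valid_json('Sure! Here is your JSON: {"answer": 42}'))  # False
```

The second call fails because the model wrapped the JSON in conversational text, which is a typical failure this kind of heuristic is meant to catch.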

Gentrace provides a different testing model: a developer-centric approach to evaluating generative AI.


Our product introduces several entities to help developers. Links for more information on each entity are provided below.

  • Pipelines group test cases, evaluators, and test runs together.
  • Live data comes from production data sources. It may include traces and other context that you define.
  • Test cases are example scenarios that you define in Gentrace. Your code pulls test cases using our SDK and passes them to your AI pipeline to generate test output. Once your pipeline finishes, you submit the test results back to our service using the SDK.
  • Test results are reports of the performance of your generative AI pipeline. After your code pulls the test cases and generates a run for each case, you submit all runs as a single result using our SDK. A test result represents a single submission of runs. Each run is then evaluated by the evaluators you defined. Results may include traces and other context that you define.
  • Evaluators evaluate results or live data that you submit to Gentrace. Evaluators score results in one of two ways: enum or percentage. There are also different evaluator methods: AI, heuristic, and human.
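
To make the two scoring styles concrete, here is a small sketch of how an enum score and a percentage score differ as data. These classes are hypothetical and are not Gentrace types; the 0.0 to 1.0 range for percentages and the label names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PercentageScore:
    """A score on a continuous scale; the 0.0-1.0 range is an assumption."""
    value: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"percentage score out of range: {self.value}")


@dataclass(frozen=True)
class EnumScore:
    """A score chosen from a fixed set of labels (labels are illustrative)."""
    label: str
    options: Tuple[str, ...]

    def __post_init__(self) -> None:
        if self.label not in self.options:
            raise ValueError(f"{self.label!r} is not one of {self.options}")


similarity = PercentageScore(0.82)
safety = EnumScore("compliant", options=("compliant", "borderline", "violation"))
print(similarity, safety)
```

A percentage suits graded judgments such as similarity to an expected answer, while an enum suits categorical judgments such as policy compliance.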

Integration requirements

We provide SDKs for both Node.js (>= 16.16.0) and Python (>= 3.7.1). If you're not using either language, you can interact with our API directly.

Model agnostic

We do not expect you to use a particular model. Gentrace only needs the output of your generative pipelines, submitted as strings, to compare against your defined test cases.


Once you've defined your test cases and evaluators, you'll need to learn more about how these entities fit together with your code. Here's a simplified architecture.

Production evaluation

Step 1 applies only during experimentation and CI/CD. In production, you run only steps 2-4.

Here's a summary of the steps:

  1. (Does not apply in production) Pull your test cases for use in your code
  2. Run the specific version of your generative code on these test cases
  3. Aggregate the test results and submit them to our service for processing
  4. Wait until the evaluators you defined finish evaluating the results
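
The four steps can be sketched as a single loop. Every function below is a local, hypothetical stand-in written for this example; none of them is part of the Gentrace SDK, and the real calls, identifiers, and return shapes will differ.

```python
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    name: str
    inputs: Dict[str, str]


@dataclass
class Run:
    case_name: str
    output: str


def pull_test_cases() -> List[Case]:
    """Step 1 (hypothetical): stand-in for fetching test cases via an SDK."""
    return [Case("greet-ada", {"name": "Ada"}), Case("greet-alan", {"name": "Alan"})]


def generate(inputs: Dict[str, str]) -> str:
    """Step 2 (hypothetical): stand-in for your generative pipeline."""
    return f"Hello, {inputs['name']}!"


def submit_result(runs: List[Run]) -> Dict[str, object]:
    """Step 3 (hypothetical): stand-in for submitting all runs as one result."""
    return {"id": "local-demo", "run_count": len(runs)}


def evaluation_finished(result: Dict[str, object]) -> bool:
    """Step 4 (hypothetical): stand-in for checking evaluator status."""
    return True


def run_test() -> Dict[str, object]:
    cases = pull_test_cases()                                # step 1
    runs = [Run(c.name, generate(c.inputs)) for c in cases]  # step 2
    result = submit_result(runs)                             # step 3
    while not evaluation_finished(result):                   # step 4
        time.sleep(1)
    return result


if __name__ == "__main__":
    print(run_test())
```

In production there are no test cases to pull, so the inputs to `generate` would come from live traffic instead of `pull_test_cases`.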

For an in-depth walkthrough, read through the quickstart.