Version: 3.0.3

Evaluate overview

Gentrace helps developers improve their generative AI pipelines. We do this by evaluating their generative data using heuristics, AI, and manual human grading.

The Problem

In traditional software development, you write a series of tests and use a test runner like Jest (JavaScript/TypeScript) or Pytest (Python) to validate the expected behavior of your code locally.

Then before deploying your code, you use a CI/CD pipeline to validate your code on more complex integration tests. If a test fails, your CI/CD system throws an error notifying you that something is wrong with your code. This prevents costly regressions.

With generative AI, this model breaks down. The output of generative AI often does not have a definitive answer that can be evaluated deterministically. As a result, AI developers don’t have clear visibility into whether changes will cause regressions.

Why evaluate with Gentrace?

Gentrace evaluates your generative output locally and in CI/CD pipelines using evaluators based on heuristics, AI, and manual human review. These evaluators are designed specifically for generative output.

For example:

  • If your generative AI should only output JSON, define a heuristic evaluator to validate that it parses correctly.
  • If your generative AI should comply with your company’s safety policy, define an AI evaluator that scores how well it complies.
  • If your pipeline is especially important and you need a human to manually review the output, define a human evaluator that creates an annotation workflow for your team to review the output.
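
As a rough illustration of the first bullet, a heuristic evaluator can be as small as a function that tries to parse the output. This is a minimal sketch in plain Python; the function name `json_parses` is hypothetical and is not part of the Gentrace SDK.

```python
import json


def json_parses(output: str) -> bool:
    """Hypothetical heuristic evaluator: True if the output is valid JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON; TypeError: output was not a string.
        return False
    return True


print(json_parses('{"answer": 42}'))                          # True
print(json_parses('Sure! Here is the JSON: {"answer": 42}'))  # False
```

The second call fails because the model wrapped the JSON in prose, which is the kind of regression a deterministic check like this can catch.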

Gentrace offers a different, developer-centric testing model for evaluating generative AI. We call this model continuous evaluation.


Our product introduces several entities to help developers. Links for more information on each entity are provided below.

  • Pipelines
  • Test cases
  • Test results
  • Evaluators

Pipelines group test cases, evaluators, and test results together.

Test cases are example scenarios that you define in Gentrace. Your code pulls test cases using our SDK and passes them to your AI pipeline to generate test output. Once your pipeline finishes, you submit the test results back to our service using the SDK.

Test results are reports of the performance of your generative AI pipeline. After your code pulls the test cases and generates a run for each case, you submit all runs as a single result using our SDK. A test result represents a single submission of runs. Each run is then evaluated by the evaluators you defined.

Evaluators evaluate the test results that your code submitted. Evaluators score results in one of two ways: enum or percentage. There are also different evaluator methods: AI, heuristic, and human.
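
To make the two scoring styles concrete, here is a small sketch of how an enum score and a percentage score might be represented. Everything in it is an assumption for illustration: the `Score` class, the helper names, the example labels, and the 0 to 100 range are hypothetical and do not describe how Gentrace stores scores.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Score:
    """Hypothetical container for one evaluator's verdict on one run."""
    kind: str                      # "enum" or "percentage"
    label: Optional[str] = None    # set for enum scores
    value: Optional[float] = None  # set for percentage scores


def enum_score(label: str, options: Sequence[str]) -> Score:
    """Pick exactly one label from a fixed set of options."""
    if label not in options:
        raise ValueError(f"{label!r} is not one of {list(options)}")
    return Score(kind="enum", label=label)


def percentage_score(value: float) -> Score:
    """Score on a continuous scale (assumed here to run from 0 to 100)."""
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"{value} is outside the range 0 to 100")
    return Score(kind="percentage", value=float(value))


print(enum_score("compliant", ["compliant", "non-compliant"]))
print(percentage_score(87.5))
```

An enum score suits categorical judgments such as the safety-policy check, while a percentage suits graded judgments such as how closely an answer matches a reference.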

Integration requirements

We provide SDKs for both Node.js (>= 16.16.0) and Python (>= 3.7.1). If you're not using either language, you can interact with our API directly.

Model agnostic

We do not require you to use a particular model. Gentrace only needs the output of your generative pipeline, submitted as strings, which it evaluates against your defined test cases.


Once you've defined your test cases and evaluators, the next step is to understand how these entities fit together with your code. Here's a simplified architecture.

Here's a summary of the steps:

  1. Pull your test cases for use in your code
  2. Run the specific version of your generative code on these test cases
  3. Aggregate the test results and submit them to our service for processing
  4. Wait until the evaluators you defined finish evaluating the results
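
The first three steps can be sketched as a small loop. This is a sketch under stated assumptions, not the SDK's interface: `pull_test_cases`, `generate`, and `submit_result` are hypothetical in-memory stand-ins, and step 4 (waiting for evaluators) is left out because it happens on the service side.

```python
from typing import Callable, Dict, List

# Hypothetical test cases; in practice these would be defined in Gentrace.
TEST_CASES: List[Dict[str, str]] = [
    {"id": "case-1", "input": "Say hello"},
    {"id": "case-2", "input": "Say goodbye"},
]


def pull_test_cases() -> List[Dict[str, str]]:
    """Step 1 (stand-in): fetch the test cases, here from a local list."""
    return list(TEST_CASES)


def generate(prompt: str) -> str:
    """Step 2 (stand-in): your generative code; any model could sit here."""
    return f"echo: {prompt}"


def submit_result(runs: List[Dict[str, str]]) -> Dict[str, object]:
    """Step 3 (stand-in): bundle all runs into one result, returned locally."""
    return {"runs": runs, "status": "submitted"}


def run_evaluation(pipeline: Callable[[str], str]) -> Dict[str, object]:
    """Pull the cases, produce one run per case, and submit them together."""
    cases = pull_test_cases()
    runs = [{"case_id": c["id"], "output": pipeline(c["input"])} for c in cases]
    return submit_result(runs)


result = run_evaluation(generate)
print(len(result["runs"]))  # 2
```

Passing the pipeline in as a plain callable that maps a string to a string mirrors the model-agnostic design: only the string output matters, not which model produced it.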

For an in-depth walkthrough, read through the quickstart.