
Evaluate quickstart

This guide covers how to get started with Gentrace Evaluate. This product evaluates any generative pipeline with AI, heuristics, and manual human grading.

The other critical part of our product is Gentrace Observe, which traces your generative requests and tracks speed, cost and aggregate statistics. We will not cover Observe in this guide. Visit here to get started with Observe.

To do so, we'll set up a pipeline with:

  • Test cases
  • Evaluators
  • Test script

Then, we'll optionally:

  • Integrate with your CI / CD environment

Creating your first pipeline

To create your first pipeline, create a Gentrace account or sign in.

Then, navigate to the "Evaluate" subpage and select "New Pipeline" (or click here).

  • Name: choose a name for your pipeline.
  • Main branch: If you use Git, specify a branch name as a baseline. We will compare evaluation results against this baseline.

Test cases

Test cases consist of:

  • A unique name
  • Inputs that will be passed to your AI pipeline, expressed as a JSON string.
  • Expected outputs (optional, depending on the evaluators that you plan to create)
info

Inputs should match what your AI pipeline expects.

For example, if your pipeline compose() expects { sender: string, query: string }, an input could be { "sender": "[email protected]", "query": "say hello" }

Otherwise, you'll need to transform your inputs in code as part of your testing script.
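If a transformation is needed, it can be a small mapping step in your test script. Here's a minimal sketch, assuming hypothetical stored field names (from, question) that differ from what the compose() pipeline above expects:

typescript
// Hypothetical mapping: the stored test case inputs use different field names
// than the pipeline expects, so reshape them before calling the pipeline.
type StoredInputs = { from: string; question: string }; // assumed shape saved in Gentrace
type ComposeInputs = { sender: string; query: string }; // shape compose() expects

function toComposeInputs(inputs: StoredInputs): ComposeInputs {
  return { sender: inputs.from, query: inputs.question };
}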

Navigate to "Test cases" for the Pipeline. If you only need to specify a few test cases, you can create them directly from the UI by selecting "New test case".

New test case

You can alternatively bulk upload by clicking the "Import from CSV" button in the top right.

Evaluators

Evaluators score test results using AI, heuristics, or humans.

In this quickstart, we will create an AI evaluator using a template to automatically grade outputs. Read more about the other types of evaluators.

To start, create an evaluator by selecting "Evaluators" underneath "Evaluate" on the relevant pipeline and then select "New evaluator."

Select an evaluator template

Let's start with the AI Factual Consistency (CoT + Classification) template.

This evaluator compares the expected output (from the test case) to the actual output of your pipeline, using a prompt template that you can adjust.

New evaluator

Notice that this prompt template includes a TODO block. For this template, this block is where you specify the instruction that your AI should perform.

Specify AI evaluator task

Modify the [Task] block to reflect our pipeline's goal. Inputs from the test cases are interpolated by prefixing the JSON input variable names with inputs. (for example, inputs.query for the query field).

AI evaluator inputs

Adjust evaluator scoring

info

Adjusting evaluator scoring is optional. We have selected reasonable default values for the provided AI templates.

The provided template uses enum scoring, which describes a set of options that maps to percentage grades. Read more about all supported scoring types here.

The evaluator will use the provided options to grade the output. You can make adjustments to these options and percentages.

Define AI options mapping and scoring
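For illustration, an enum mapping for this evaluator could look like the following. The option names and percentages here are assumptions for the sake of the example, not Gentrace defaults; configure the real values in the UI.

typescript
// Illustrative enum-to-grade mapping (assumed values; configured in the Gentrace UI):
const exampleEnumScoring: Record<string, number> = {
  "Fully consistent": 100, // output matches the expected output
  "Partially consistent": 50, // output is only partly supported by the expected output
  "Inconsistent": 0, // output contradicts the expected output
};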

The template also includes the options in the prompt template to describe what each option represents. You must enumerate and describe all the options within the prompt template.

Define AI options

Test your evaluator

Press "Test" in the top right, then select a test case and write test output to see if your evaluator performs well.

Select test case for evaluator

For instance, you might test with output that is exactly what is expected by the test case.

Test evaluator output

A few seconds after submission, you should see the reasoning and result.

When you are satisfied with your evaluator, press "Finish".

Test script

Once your test cases and evaluators have been created, you need to write code that:

  • Pulls the test cases for a given set using our SDK
  • Runs your test cases against your generative AI pipeline
  • Submits the test results as a single test result for evaluation using our SDK

SDK installation

We currently provide our SDK for two platforms: Node.JS and Python. The examples in this guide use the Node.JS SDK (@gentrace/core).

bash
npm i @gentrace/core
danger

Please only use these libraries on the server side. Using them on the client side will reveal your API key.

Making your script

Let's say you have a generative pipeline pipeline(). Make a test script that looks like the following:

gentrace/tests/MyPipelineGentraceTest.ts
typescript
import { init, getTestCases, submitTestResult } from "@gentrace/core";
import { pipeline } from "../src/pipeline"; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = "your-pipeline-slug";

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((test) => pipeline(test));

  const outputs = await Promise.all(promises);

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
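For context, pipeline(test) above receives the full test case object, including its inputs. As a rough sketch (not part of the Gentrace SDK; the OpenAI call, model name, and input field names are assumptions for illustration), the imported pipeline might look something like this:

typescript
// Hypothetical pipeline used by the test script above, for illustration only.
// Assumes each test case exposes an `inputs` object with the fields from the
// earlier example ({ sender, query }) and that OPENAI_API_KEY is set.
import OpenAI from "openai";

const openai = new OpenAI();

export async function pipeline(testCase: { inputs: Record<string, any> }) {
  const { sender, query } = testCase.inputs;
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: `You are drafting a reply to ${sender}.` },
      { role: "user", content: query },
    ],
  });
  return completion.choices[0].message.content; // a plain string
}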

The submission function (submit_test_result() in Python, submitTestResult() in Node.JS) expects the outputs array to be an array of dictionaries. If your pipeline returns a primitive type (e.g. a completion response as a string), you can simply wrap the value in an object.

gentrace/tests/MyPipelineGentraceTest.ts
typescript
import { init, getTestCases, submitTestResult } from "@gentrace/core";
import { pipeline } from "../src/pipeline"; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = "your-pipeline-slug";

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((test) => pipeline(test)); // TODO: REPLACE WITH YOUR PIPELINE

  const outputs = await Promise.all(promises);

  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(
    PIPELINE_SLUG,
    testCases,
    outputDicts
  );
}

main();
info

Earlier versions of our SDK required the pipeline's UUID rather than its slug. The SDK now accepts either a slug or a UUID:

  • NPM package version >= 0.15.9 for our v3 OpenAI-compatible package
  • NPM package version >= 1.0.4 for our v4 OpenAI-compatible package

If your AI pipeline contains multiple interim steps, you can follow this guide to capture the steps and process them for use in an evaluator.

View your test results

Once Gentrace receives your test result, the evaluators start grading your submitted results. You can view the in-progress evaluation in the Gentrace UI in the "Results" tab for a pipeline.

In the image below, you can see how the pipeline performed on the various evaluators.

View evaluation result

Integrating with your CI/CD

To run your script in a CI/CD environment, you can add the script as a step in your workflow configuration.

If you define the environment variables $GENTRACE_BRANCH and $GENTRACE_COMMIT in your CI/CD environment, the Gentrace SDK will automatically tag all submitted test results with that metadata.

Here's an example for GitHub Actions with a TypeScript script.

.github/workflows/gentrace.yml
yaml
name: Run Gentrace Tests
on:
  workflow_dispatch:
  push:
env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
jobs:
  # Workflow for TypeScript
  typescript:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18.20.2
      - name: Install dependencies
        run: npm install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environmental variables
      # will be passed to the script. Replace "pipeline-test.ts" with the path to your
      # script.
      - name: Run Gentrace tests
        run: npx tsx ./pipeline-test.ts

After your script finishes submitting the test result to Gentrace, the result row should reflect the branch name and commit SHA.

Result Git SHA

See also

This wraps up the quickstart. See the links below to learn more about evaluators.

Evaluator types

Evaluator scoring