Version: 4.7.38

Evaluate while experimenting

This guide covers how to get started with Gentrace for testing generative pipelines during experimentation, using AI, heuristic, and human grading.

To do so, we'll set up a pipeline with:

  • Datasets
  • Test cases (contained within datasets)
  • Evaluators
  • Test script

Then, we'll optionally:

  • Integrate with your CI / CD environment

Creating your first pipeline

To create your first pipeline, create a Gentrace account or sign in.

Then, select "New Pipeline" (or click here).

  • Name: choose a name for your pipeline.
  • Main branch: If you use Git, specify a branch name as a baseline. We will compare evaluation results against this baseline.

Datasets and Test cases

Datasets are used to organize test cases into groups within a pipeline. Each dataset contains multiple test cases.

Test cases consist of:

  • A unique name
  • Inputs that will be passed to your AI pipeline, expressed as a JSON string.
  • Expected outputs (optional, depending on the evaluators that you plan to create)
info

Inputs should match what your AI pipeline expects.

For example, if your pipeline expects { sender: string, query: string }, an input could be { "sender": "[email protected]", "query": "say hello" }.

Otherwise, you'll need to transform your inputs in code as part of your testing script.
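
As a minimal sketch of such a transformation, suppose the stored inputs use the keys from and prompt while the pipeline expects sender and query. The ExampleTestCase shape, the key names, and the toEmailParams helper are all hypothetical and only illustrate the idea; they are not part of the Gentrace SDK.

```typescript
// Hypothetical shape of a test case, following the description above:
// a name, JSON inputs, and optional expected outputs.
interface ExampleTestCase {
  name: string;
  inputs: Record<string, unknown>;
  expectedOutputs?: Record<string, unknown>;
}

// Hypothetical parameter shape expected by the pipeline.
interface EmailParams {
  sender: string;
  query: string;
}

// Map stored inputs (assumed keys: "from", "prompt") onto the pipeline's parameters.
function toEmailParams(testCase: ExampleTestCase): EmailParams {
  const { from, prompt } = testCase.inputs;
  if (typeof from !== 'string' || typeof prompt !== 'string') {
    throw new Error(`Test case "${testCase.name}" has unexpected inputs`);
  }
  return { sender: from, query: prompt };
}
```

In a test script, a helper like this would be applied to each test case before calling the pipeline. Failing loudly on unexpected inputs makes a malformed test case easier to spot than a silently empty prompt.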

Navigate to "Datasets" for the Pipeline. You can create a new dataset by selecting "New Dataset". Within each dataset, you can add test cases.

Alternatively, you can use the golden dataset for the pipeline.

New dataset

If you only need to specify a few test cases, you can create them directly from the UI by selecting "New test case" within a dataset.

New test case

You can alternatively bulk upload test cases by clicking the "Import from CSV" button in the top right of the dataset view.

Evaluators

Evaluators score test results using AI, heuristics, or humans.

In this quickstart, we will create an AI evaluator using a template to automatically grade outputs. Read more about the other types of evaluators.

To start, create an evaluator by selecting "Evaluators" underneath "Evaluate" on the relevant pipeline and then select "New evaluator."

Select an evaluator template

Let's start with the AI Factual Consistency (CoT + Classification) template.

This evaluator compares the expected output (from the test case) to the actual output of your pipeline, using a prompt template that you can adjust.

New evaluator

Notice that this prompt template includes a TODO block. For this template, this block is where you specify the instruction that your AI should perform.

Specify AI evaluator task

Modify the [Task] block to reflect our pipeline's goal. We interpolate inputs from the test cases by prefixing the JSON input variables with "inputs.".

AI evaluator inputs

Adjust evaluator scoring

info

Adjusting evaluator scoring is optional. We have selected reasonable default values for the provided AI templates.

The provided template uses enum scoring, which describes a set of options that maps to percentage grades. Read more about all supported scoring types here.

The evaluator will use the provided options to grade the output. You can make adjustments to these options and percentages.
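
Conceptually, enum scoring is a lookup from the option the evaluator picks to a percentage grade. The sketch below illustrates that idea only; the option labels, the percentages, and the gradeFromOption helper are hypothetical and are not the template's actual values or Gentrace's implementation.

```typescript
// Hypothetical options and grades (1.0 = 100%); adjust to match your evaluator.
const scoring = new Map<string, number>([
  ['A', 1.0], // output fully matches the expected output
  ['B', 0.5], // output partially matches
  ['C', 0.0], // output contradicts the expected output
]);

// Look up the grade for the option chosen by the evaluator.
function gradeFromOption(option: string, options: Map<string, number>): number {
  const grade = options.get(option);
  if (grade === undefined) {
    throw new Error(`Unknown option: ${option}`);
  }
  return grade;
}
```

Under these assumed values, an evaluator that answers B would yield a grade of 0.5, which is 50%.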

Define AI options mapping and scoring

The template also includes the options in the prompt template to describe what each option represents. You must enumerate and describe all the options within the prompt template.

Define AI options

Test your evaluator

Press "Test" in the top right, then select a test case and write test output to see if your evaluator performs well.

Select test case for evaluator

For instance, you might test with output that is exactly what is expected by the test case.

Test evaluator output

A few seconds after submission, you should see the reasoning and result.

When you are satisfied with your evaluator, press finish.

Test script

Once your datasets and test cases have been created, you need to write code that:

  • Pulls the test cases for a given dataset using our SDK
  • Runs your test cases against your generative AI pipeline
  • Submits the test results as a single test result for evaluation using our SDK

SDK installation

We currently provide our SDK for two platforms: Node.js and Python. Choose the appropriate version below.

bash
npm i @gentrace/core
danger

Please use these libraries only on the server side. Using them on the client side will reveal your API key.

Making your script

Let's say you have logic to compose an email with generative AI: createEmailWithAI() in Node.js or create_email_with_ai() in Python. Make a test script that looks like the following:

gentrace/tests/MyPipelineGentraceTest.ts
typescript
import { init, getTestCasesForDataset, getTestCases, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

const DATASET_ID = '123e4567-e89b-12d3-a456-426614174000';

async function main() {
  const testCases = await getTestCasesForDataset(DATASET_ID);
  // Using getTestCases() with a pipeline slug pulls test cases for
  // the golden dataset in the pipeline

  // const testCases = await getTestCases(PIPELINE_SLUG);

  // Run the pipeline on every test case's inputs concurrently
  const promises = testCases.map((testCase) => createEmailWithAI(testCase.inputs));

  const outputs = await Promise.all(promises);

  // Wrap each output in an object, as submitTestResult() expects
  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(
    PIPELINE_SLUG,
    testCases,
    outputDicts,
  );
}

main();

The submission function (submit_test_result() in Python, submitTestResult() in Node.js) expects the outputs to be an array of dictionaries. If your pipeline returns a primitive type (e.g. a completion response as a string), you can simply wrap the value in an object.

gentrace/tests/MyPipelineGentraceTest.ts
typescript
import { init, getTestCasesForDataset, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

const DATASET_ID = '123e4567-e89b-12d3-a456-426614174000';

async function main() {
  const testCases = await getTestCasesForDataset(DATASET_ID);

  const promises = testCases.map((test) => createEmailWithAI(test.inputs)); // TODO: REPLACE WITH YOUR PIPELINE

  const outputs = await Promise.all(promises);

  // Wrap each primitive output (e.g. a string) in an object
  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(
    PIPELINE_SLUG,
    testCases,
    outputDicts,
  );
}

main();
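
If a pipeline sometimes returns an object and sometimes a primitive, the wrapping step can be written so that it leaves objects untouched. The following is a minimal sketch of that idea; toOutputDict is a hypothetical helper, not an SDK function, and the value key simply mirrors the script above.

```typescript
type OutputDict = Record<string, unknown>;

// Pass plain objects through unchanged; wrap everything else
// (strings, numbers, arrays, null) under a "value" key.
function toOutputDict(output: unknown): OutputDict {
  if (typeof output === 'object' && output !== null && !Array.isArray(output)) {
    return output as OutputDict;
  }
  return { value: output };
}
```

With this helper, the mapping step in the script could become outputs.map(toOutputDict).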

If your AI pipeline contains multiple interim steps, you can follow this guide to capture the steps and process them for use in an evaluator.

View your test results

Once Gentrace receives your test result, the evaluators start grading your submitted results. You can view the in-progress evaluation in the Gentrace UI in the "Results" tab for a pipeline.

In the image below, you can see how the pipeline has performed on our various evaluators.

View evaluation result

Integrating with your CI/CD

To run your script in a CI/CD environment, you can add the script as a step in your workflow configuration.

If you define the env variables $GENTRACE_BRANCH and $GENTRACE_COMMIT in your CI/CD environment, the Gentrace SDK will automatically tag all submitted test results with that metadata.

Here's an example for GitHub Actions with a TypeScript script.

.github/workflows/gentrace.yml
yaml
name: Run Gentrace Tests
on:
  workflow_dispatch:
  push:
env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
jobs:
  # Workflow for TypeScript
  typescript:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18.20.2
      - name: Install dependencies
        run: npm install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environmental variables
      # will be passed to the script. Replace "pipeline-test.ts" with the path to your
      # script.
      - name: Run Gentrace tests
        run: npx tsx ./pipeline-test.ts

After your script finishes submitting the test result to Gentrace, the result's row should show the branch and commit.
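
If the branch or commit is missing from a result, one likely cause is that the variables were not set where the script ran, for example during a local run. A minimal sketch of a guard that a test script could call before submitting follows; missingGitMetadata is a hypothetical helper and not part of the SDK.

```typescript
type Env = Record<string, string | undefined>;

// Return the names of the Git metadata variables that are unset or empty.
function missingGitMetadata(env: Env): string[] {
  return ['GENTRACE_BRANCH', 'GENTRACE_COMMIT'].filter((name) => !env[name]);
}

// Example usage in a Node.js script:
//   const missing = missingGitMetadata(process.env);
//   if (missing.length > 0) console.warn(`Not set: ${missing.join(', ')}`);
```

The helper takes the environment as a parameter rather than reading process.env directly, which keeps it easy to test.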

Result Git SHA

See also

This wraps up the quickstart. Learn more about evaluators, including the three types and how they are scored, here.