Evaluate while experimenting
This guide covers how to get started with Gentrace for testing generative pipelines with AI, heuristics, and manual human grading during experimentation.
To do so, we'll set up a pipeline with:
- Datasets
- Test cases (contained within datasets)
- Evaluators
- Test script
Then, we'll optionally:
- Integrate with your CI/CD environment
Creating your first pipeline
To create your first pipeline, create a Gentrace account or sign in.
Then, select "New Pipeline".
- Name: choose a name for your pipeline.
- Main branch: If you use Git, specify a branch name as a baseline. We will compare evaluation results against this baseline.
Datasets and Test cases
Datasets are used to organize test cases into groups within a pipeline. Each dataset contains multiple test cases.
Test cases consist of:
- A unique name
- Inputs that will be passed to your AI pipeline, expressed as a JSON string.
- Expected outputs (optional, depending on the evaluators that you plan to create)
Inputs should match what your AI pipeline expects. For example, if your pipeline `compose` expects `{ sender: string, query: string }`, an input could be `{ "sender": "[email protected]", "query": "say hello" }`. Otherwise, you'll need to transform your inputs in code as part of your testing script.
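For illustration, a test case for that hypothetical `compose` pipeline could look roughly like this (field names are illustrative; you create test cases in the Gentrace UI or via CSV import rather than in code):

```typescript
// Illustrative shape of a single test case for the `compose` example above.
const exampleTestCase = {
  name: 'Simple greeting request',                           // unique name
  inputs: { sender: '[email protected]', query: 'say hello' },  // must match the pipeline's expected inputs
  expectedOutputs: { value: 'Hello! Happy to help...' },     // optional, used by some evaluators
};
```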
Navigate to "Datasets" for the Pipeline. You can create a new dataset by selecting "New Dataset". Within each dataset, you can add test cases.
Alternatively, you can use the golden dataset for the pipeline.
If you only need to specify a few test cases, you can create them directly from the UI by selecting "New test case" within a dataset.
You can alternatively bulk upload test cases by clicking the "Import from CSV" button in the top right of the dataset view.
Evaluators
Evaluators score test results using AI, heuristics, or humans.
In this quickstart, we will create an AI evaluator using a template to automatically grade outputs. Read more about the other types of evaluators.
To start, create an evaluator by selecting "Evaluators" underneath "Evaluate" on the relevant pipeline and then select "New evaluator."
Select an evaluator template
Let's start with the AI Factual Consistency (CoT + Classification) template.
This evaluator compares the expected output (from the test case) to the actual output of your pipeline, using a prompt template that you can adjust.
Notice that this prompt template includes a `TODO` block. For this template, this block is where you specify the instruction that your AI evaluator should follow.

Modify the `[Task]` block to reflect our pipeline's goal. We interpolate inputs from the test cases by prefixing the JSON input variable names with `inputs.`.
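For instance, with the `sender` and `query` inputs from earlier, the adjusted `[Task]` block might read roughly like this (a hypothetical sketch; use whatever interpolation syntax the template editor shows for referencing `inputs.` variables):

```text
[Task]
Compose a helpful email reply to the request in {{ inputs.query }},
addressed to the sender {{ inputs.sender }}.
```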
Adjust evaluator scoring
Adjusting evaluator scoring is optional. We have selected reasonable default values for the provided AI templates.
The provided template uses enum scoring, which describes a set of options that maps to percentage grades. Read more about all supported scoring types here.
The evaluator will use the provided options to grade the output. You can make adjustments to these options and percentages.
The template also includes the options in the prompt template to describe what each option represents. You must enumerate and describe all the options within the prompt template.
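As a rough illustration of the idea (the actual options and grades are configured in the evaluator UI, not in code), an enum scoring setup might map options to grades like this:

```typescript
// Hypothetical enum scoring for a factual-consistency evaluator.
// Each option the grading model can choose maps to a percentage grade.
const enumScoring: Record<string, number> = {
  'A: Fully consistent with the expected output': 1.0,
  'B: Partially consistent, with minor deviations': 0.5,
  'C: Contradicts or ignores the expected output': 0.0,
};
```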
Test your evaluator
Press "Test" in the top right, then select a test case and write test output to see if your evaluator performs well.
For instance, you might test with output that is exactly what is expected by the test case.
A few seconds after submission, you should see the reasoning and result.
When you are satisfied with your evaluator, press "Finish".
Test script
Once your datasets and test cases have been created, you need to write code that:
- Pulls the test cases for a given dataset using our SDK
- Runs your test cases against your generative AI pipeline
- Submits the test results as a single test result for evaluation using our SDK
SDK installation
We currently provide SDKs for two platforms: Node.js and Python. Choose the appropriate version below.
- (JS) npm

```bash
npm i @gentrace/core
```

- (JS) yarn

```bash
yarn add @gentrace/core
```

- (JS) pnpm

```bash
pnpm i @gentrace/core
```

- (Python) pip

```bash
pip install gentrace-py
```

- (Python) poetry

```bash
poetry add gentrace-py
```
Please only use these libraries on the server side. Using them on the client side will reveal your API key.
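For example, on the server you might read the key from an environment variable instead of hard-coding it (a minimal sketch using the same `init` call as the scripts below):

```typescript
import { init } from '@gentrace/core';

// Keep the API key in the server's environment; never expose it to the browser.
const apiKey = process.env.GENTRACE_API_KEY;
if (!apiKey) {
  throw new Error('GENTRACE_API_KEY is not set');
}

init({ apiKey });
```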
Making your script
Let's say you have logic to compose an email with generative AI: `createEmailWithAI()` in TypeScript or `create_email_with_ai()` in Python. Make a test script that looks like the following:
- TypeScript
- Python
gentrace/tests/MyPipelineGentraceTest.ts

```typescript
import {
  init,
  getTestCasesForDataset,
  getTestCases,
  submitTestResult,
} from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';
const DATASET_ID = '123e4567-e89b-12d3-a456-426614174000';

async function main() {
  const testCases = await getTestCasesForDataset(DATASET_ID);

  // Using getTestCases() with a pipeline slug pulls test cases for
  // the golden dataset in the pipeline
  // const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) => createEmailWithAI(testCase.inputs));
  const outputs = await Promise.all(promises);

  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputDicts);
}

main();
```
gentrace/tests/my_pipeline_gentrace_test.py

```python
import gentrace
import os
from dotenv import load_dotenv

from email_creation import create_email_with_ai  # TODO: REPLACE WITH YOUR PIPELINE

load_dotenv()

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"
DATASET_ID = "123e4567-e89b-12d3-a456-426614174000"


def main():
    test_cases = gentrace.get_test_cases(dataset_id=DATASET_ID)

    # Using get_test_cases() with a pipeline slug pulls test cases for
    # the golden dataset in the pipeline
    # test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append({"value": result})

    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)


main()
```
The submission function (`submit_test_result()` in Python, `submitTestResult()` in Node.js) expects the `outputs` array to be an array of dictionaries. If your pipeline returns a primitive type (e.g. a completion response as a string), you can simply wrap the value in an object.
- TypeScript
- Python
gentrace/tests/MyPipelineGentraceTest.ts

```typescript
import { init, getTestCasesForDataset, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';
const DATASET_ID = '123e4567-e89b-12d3-a456-426614174000';

async function main() {
  const testCases = await getTestCasesForDataset(DATASET_ID);

  const promises = testCases.map((test) => createEmailWithAI(test.inputs)); // TODO: REPLACE WITH YOUR PIPELINE
  const outputs = await Promise.all(promises);

  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputDicts);
}

main();
```
gentrace/tests/my_pipeline_gentrace_test.py

```python
# Omit initialization

test_cases = gentrace.get_test_cases(dataset_id=DATASET_ID)

outputs = []
for test_case in test_cases:
    inputs = test_case["inputs"]
    result = create_email_with_ai(inputs)  # TODO: REPLACE WITH YOUR PIPELINE
    outputs.append({"value": result})

response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
```
If your AI pipeline contains multiple interim steps, you can follow this guide to capture the steps and process them for use in an evaluator.
View your test results
Once Gentrace receives your test result, the evaluators start grading your submitted results. You can view the in-progress evaluation in the Gentrace UI in the "Results" tab for a pipeline.
In the image below, you can see how the pipeline performed on our various evaluators.
Integrating with your CI/CD
To run your script in a CI/CD environment, you can add the script as a step in your workflow configuration.
If you define the environment variables `$GENTRACE_BRANCH` and `$GENTRACE_COMMIT` in your CI/CD environment, the Gentrace SDK will automatically tag all submitted test results with that metadata.
Here are example GitHub Actions workflows for TypeScript and Python scripts.
- TypeScript CI
- Python CI
.github/workflows/gentrace.yml

```yaml
name: Run Gentrace Tests

on:
  workflow_dispatch:
  push:

env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}

jobs:
  # Workflow for TypeScript
  typescript:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18.20.2
      - name: Install dependencies
        run: npm install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environment variables
      # will be passed to the script. Replace "pipeline-test.ts" with the path to your
      # script.
      - name: Run Gentrace tests
        run: npx tsx ./pipeline-test.ts
```
.github/workflows/gentrace.yml

```yaml
name: Run Gentrace Tests

on:
  workflow_dispatch:
  push:

env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}

jobs:
  # Workflow for Python
  python:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true
      - name: Poetry install and build
        run: |
          poetry install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environment variables
      # will be passed to the script. Replace "pipeline_test.py" with the path to your
      # script.
      - name: Run Gentrace tests
        run: poetry run python ./pipeline_test.py
```
After your script finishes submitting the test result to Gentrace, the result row should reflect the branch name and commit.
See also
This wraps up the quickstart. Learn more about evaluators, including the three types and how they are scored, here.