Evaluate quickstart
This guide covers how to get started with Gentrace Evaluate. This product evaluates any generative pipeline with AI, heuristics, and manual human grading.
The other critical part of our product is Gentrace Observe, which traces your generative requests and tracks speed, cost and aggregate statistics. We will not cover Observe in this guide. Visit here to get started with Observe.
To get started with Evaluate, we'll set up a pipeline with:
- Test cases
- Evaluators
- Test script
Then, we'll optionally:
- Integrate with your CI / CD environment
Creating your first pipeline
To create your first pipeline, create a Gentrace account or sign in.
Then, navigate to the "Evaluate" subpage and select "New Pipeline" (or click here).
- Name: choose a name for your pipeline.
- Main branch: If you use Git, specify a branch name as a baseline. We will compare evaluation results against this baseline.
Test cases
Test cases consist of:
- A unique name
- Inputs that will be passed to your AI pipeline, expressed as a JSON string.
- Expected outputs (optional, depending on the evaluators that you plan to create)
Inputs should match what your AI pipeline expects.
For example, if your pipeline `compose` expects `{ sender: string, query: string }`, an input could be `{ "sender": "[email protected]", "query": "say hello" }`. Otherwise, you'll need to transform your inputs in code as part of your testing script.
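For reference, a complete test case for the hypothetical `compose` pipeline above might look something like this (the exact field names in the UI may differ, and the expected output is only needed if an evaluator compares against it):

```json
{
  "name": "Say hello to a known sender",
  "inputs": {
    "sender": "[email protected]",
    "query": "say hello"
  },
  "expectedOutputs": {
    "value": "Hello Vivek, thanks for reaching out!"
  }
}
```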
Navigate to "Test cases" for the Pipeline. If you only need to specify a few test cases, you can create them directly from the UI by selecting "New test case".
You can alternatively bulk upload by clicking the "Import from CSV" button in the top right.
Evaluators
Evaluators score test results using AI, heuristics, or humans.
In this quickstart, we will create an AI evaluator using a template to automatically grade outputs. Read more about the other types of evaluators.
To start, create an evaluator by selecting "Evaluators" underneath "Evaluate" on the relevant pipeline and then selecting "New evaluator."
Select an evaluator template
Let's start with the AI "Factual Consistency (CoT + Classification)" template.
This evaluator compares the expected output (from the test case) to the actual output of your pipeline, using a prompt template that you can adjust.
Notice that this prompt template includes a `TODO` block. For this template, this block is where you specify the instruction that your AI should perform.
Modify the `[Task]` block to reflect our pipeline's goal. We interpolate inputs from the test cases by prefixing the JSON input variables with `inputs.`.
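For instance, a `[Task]` block for the hypothetical `compose` pipeline above could read roughly as follows. This is only a sketch, and it assumes the template interpolates variables with `{{ }}`-style placeholders:

```
[Task]
Determine whether the generated email correctly fulfills the request in
{{ inputs.query }} from {{ inputs.sender }}, without adding information that
is not supported by the expected output.
```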
Adjust evaluator scoring
Adjusting evaluator scoring is optional. We have selected reasonable default values for the provided AI templates.
The provided template uses enum scoring, which describes a set of options that maps to percentage grades. Read more about all supported scoring types here.
The evaluator will use the provided options to grade the output. You can make adjustments to these options and percentages.
The template also includes the options in the prompt template to describe what each option represents. You must enumerate and describe all the options within the prompt template.
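As a concrete illustration (the defaults shipped with the template may differ), an enum scoring setup for this pipeline could map options to grades like this, with each option also described in the prompt template:

```
A: fully consistent with the expected output     -> 100%
B: partially consistent, with minor omissions    -> 50%
C: contradicts or ignores the expected output    -> 0%
```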
Test your evaluator
Press "Test" in the top right, then select a test case and write test output to see if your evaluator performs well.
For instance, you might test with output that is exactly what is expected by the test case.
A few seconds after submission, you should see the reasoning and result.
When you are satisfied with your evaluator, press "Finish".
Test script
Once your test cases and evaluators have been created, you need to write code that:
- Pulls the test cases for a given pipeline using our SDK
- Runs your test cases against your generative AI pipeline
- Submits the test results as a single test result for evaluation using our SDK
SDK installation
We currently provide our SDK for two platforms: Node.js and Python. Choose the appropriate package manager below.
npm (JS):

```bash
npm i @gentrace/core
```

yarn (JS):

```bash
yarn add @gentrace/core
```

pnpm (JS):

```bash
pnpm i @gentrace/core
```

pip (Python):

```bash
pip install gentrace-py
```

poetry (Python):

```bash
poetry add gentrace-py
```
Please only use these libraries on the server side. Using them on the client side will reveal your API key.
Making your script
Let's say you have a generative pipeline `pipeline()`. Make a test script that looks like the following:
TypeScript (`gentrace/tests/MyPipelineGentraceTest.ts`):

```typescript
import { init, getTestCases, submitTestResult } from "@gentrace/core";
import { pipeline } from "../src/pipeline"; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = "your-pipeline-slug";

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((test) => pipeline(test));
  const outputs = await Promise.all(promises);

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
```

Python (`gentrace/tests/my_pipeline_gentrace_test.py`):

```python
import gentrace
import os
from dotenv import load_dotenv
from pipeline import pipeline  # TODO: REPLACE WITH YOUR PIPELINE

load_dotenv()

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"


def main():
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = pipeline(inputs)
        outputs.append(result)

    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)


main()
```
The submission function (`submit_test_result()` in Python, `submitTestResult()` in Node.js) expects the `outputs` array to be an array of dictionaries. If your pipeline returns a primitive type (e.g. a completion response as a string), you can simply wrap the value in an object.
TypeScript (`gentrace/tests/MyPipelineGentraceTest.ts`):

```typescript
import { init, getTestCases, submitTestResult } from "@gentrace/core";
import { pipeline } from "../src/pipeline"; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = "your-pipeline-slug";

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((test) => pipeline(test)); // TODO: REPLACE WITH YOUR PIPELINE
  const outputs = await Promise.all(promises);

  const outputDicts = outputs.map((output) => {
    return {
      value: output,
    };
  });

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputDicts);
}

main();
```

Python (`gentrace/tests/my_pipeline_gentrace_test.py`):

```python
# Omit initialization
test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

outputs = []
for test_case in test_cases:
    inputs = test_case["inputs"]
    result = pipeline(inputs)  # TODO: REPLACE WITH YOUR PIPELINE
    outputs.append({"value": result})

response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
```
Earlier versions of our SDK required the UUID of the pipeline instead of the slug. We have changed the logic to accept either a slug or a UUID in the following versions:

- TypeScript: NPM package version >= 0.15.9 for our v3 OpenAI-compatible SDK, or >= 1.0.4 for our v4 OpenAI-compatible SDK
- Python: PyPI package version >= 0.15.6
If your AI pipeline contains multiple interim steps, you can follow this guide to capture the steps and process them for use in an evaluator.
View your test results
Once Gentrace receives your test result, the evaluators start grading your submitted results. You can view the in-progress evaluation in the Gentrace UI in the "Results" tab for a pipeline.
In the image below, you can see how the pipeline has performed on our various evaluators.
Integrating with your CI/CD
To run your script in a CI/CD environment, you can add the script as a step in your workflow configuration.
If you define the env variables `$GENTRACE_BRANCH` and `$GENTRACE_COMMIT` in your CI/CD environment, the Gentrace SDK will automatically tag all submitted test results with that metadata.
Here's an example for GitHub Actions with a TypeScript script.
TypeScript CI (`.github/workflows/gentrace.yml`):

```yaml
name: Run Gentrace Tests

on:
  workflow_dispatch:
  push:

env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}

jobs:
  # Workflow for TypeScript
  typescript:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18.20.2
      - name: Install dependencies
        run: npm install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environment variables
      # will be passed to the script. Replace "pipeline-test.ts" with the path to your
      # script.
      - name: Run Gentrace tests
        run: npx tsx ./pipeline-test.ts
```
Python CI (`.github/workflows/gentrace.yml`):

```yaml
name: Run Gentrace Tests

on:
  workflow_dispatch:
  push:

env:
  GENTRACE_BRANCH: ${{ github.head_ref || github.ref_name }}
  GENTRACE_COMMIT: ${{ github.sha }}
  GENTRACE_API_KEY: ${{ secrets.GENTRACE_API_KEY }}
  OPENAI_KEY: ${{ secrets.OPENAI_KEY }}

jobs:
  # Workflow for Python
  python:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true
      - name: Poetry install and build
        run: |
          poetry install
      # $GENTRACE_API_KEY, $GENTRACE_BRANCH, and $GENTRACE_COMMIT environment variables
      # will be passed to the script. Replace "pipeline_test.py" with the path to your
      # script.
      - name: Run Gentrace tests
        run: poetry run python ./pipeline_test.py
```
After your script finishes submitting the test result to Gentrace, the result row should reflect the branch and commit.
See also
This wraps up the quickstart. See the links below to learn more about evaluators.
- Evaluator types
- Evaluator scoring