Running local evaluations

Gentrace lets you run evaluations locally. If you need more control over how evaluations are executed, you can run and manage them directly in your code.

🛑 Alpha

Local evaluation is currently in alpha and may undergo significant changes.

No Python support

We have not implemented this SDK function in Python yet.

Using runner.addEval() syntax

Add evals by instrumenting your code using the pipeline runner syntax from our tracing guide. The example below calculates a score from 0 to 1 based on the email's length.

typescript
import { Pipeline, getTestRunners, submitTestRunners } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";

const PIPELINE_SLUG = "compose";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: PIPELINE_SLUG,
  plugins: {
    openai: plugin,
  },
});

// Get test runners for all test cases in the pipeline
const testRunners = await getTestRunners(pipeline);

// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  // This handle is a near type-match of the official OpenAI Node.js package.
  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";
  const wordCount = initialDraft.split(/\s+/).length;

  // Bell curve scoring function to evaluate whether the created email has an appropriate length.
  // Penalizes emails that are too short or too long, with the ideal length around 100 words.
  const calculateScore = (count) => {
    const mean = 100;
    const stdDev = 30;
    const maxScore = 1;
    const zScore = Math.abs(count - mean) / stdDev;
    return Math.max(0, maxScore * Math.exp(-0.5 * Math.pow(zScore, 2)));
  };

  const score = calculateScore(wordCount);

  runner.addEval({
    name: "initial-draft-word-count-score",
    value: score,
  });
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
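
The Promise.all call runs every test case concurrently. If you need to cap concurrency (for example, to stay within OpenAI rate limits), one option is to process the runners in fixed-size batches. A minimal sketch that replaces the final Promise.all and submission above; the batch size is an illustrative choice, not an SDK setting:

typescript
// Process test cases in batches of 5 instead of all at once.
// BATCH_SIZE is illustrative -- tune it to your rate limits.
const BATCH_SIZE = 5;

for (let i = 0; i < testRunners.length; i += BATCH_SIZE) {
  const batch = testRunners.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(processTestCase));
}

// Submit all test runners once every batch has finished
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);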

Using the @gentrace/evals library

To simplify eval creation, we created a separate open-source library for computing common evaluations.

bash
# Execute only one, depending on your package manager
npm i @gentrace/evals
yarn add @gentrace/evals
pnpm i @gentrace/evals

Currently, we support a single, flexible LLM-as-a-judge evaluation with evals.llm.base().

typescript
// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";

  // `evals` comes from the @gentrace/evals package installed above
  const evalResult = await evals.llm.base({
    name: "email-sentiment-evaluation",
    prompt: `Evaluate the sentiment of the following email draft:

"${initialDraft}"

This email was written by ${sender} to ${receiver} based on the following query:

${query}

Please provide a score and reasoning for your evaluation. Consider the overall tone
and emotional content of the email. Rate the sentiment as one of the following:

- Very Negative
- Negative
- Neutral
- Positive
- Very Positive`,
    scoreAs: {
      "Very Negative": 0,
      "Negative": 0.25,
      "Neutral": 0.5,
      "Positive": 0.75,
      "Very Positive": 1,
    },
  });

  runner.addEval(evalResult);
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
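
The snippet above reuses the pipeline and testRunners setup from the first example and additionally needs the evals helpers from @gentrace/evals. A minimal sketch of that surrounding wiring; the exact import shape of @gentrace/evals is an assumption here, so verify it against the package's README:

typescript
import { Pipeline, getTestRunners, submitTestRunners } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";
// Assumed import shape for the evals helpers -- verify against the
// @gentrace/evals package documentation.
import { evals } from "@gentrace/evals";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: "compose",
  plugins: { openai: plugin },
});

// Get test runners for all test cases in the pipeline, as in the first example
const testRunners = await getTestRunners(pipeline);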

Disabling remote evals

If you want to run only local evaluations without triggering remote Gentrace evals, pass triggerRemoteEvals: false to the runner submission.

typescript
const result = await submitTestRunners(pipeline, testRunners, {
  triggerRemoteEvals: false,
});

Viewing results in Gentrace

Locally created evaluations will render in the results UI alongside evaluations run by Gentrace.

Result grid

Local evaluations result grid

Test case view

Local evaluations in test case view

Evaluation view

Local evaluations in evaluation view