Running local evaluations

Gentrace lets you run evaluations locally. If you need more control over how evaluations are executed, you can run and manage them directly in your code.

🛑 Alpha

Local evaluation is currently in alpha and may undergo significant changes.

No Python support

We have not implemented this SDK function in Python yet.

Using runner.addEval() syntax

Add evals by instrumenting your code using the pipeline runner syntax from our tracing guide. The example below calculates a score from 0 to 1 based on the email's length.

typescript
import { Pipeline, getTestRunners, submitTestRunners } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";

const PIPELINE_SLUG = "compose";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: PIPELINE_SLUG,
  plugins: {
    openai: plugin,
  },
});

// Get test runners for all test cases in the pipeline
const testRunners = await getTestRunners(pipeline);

// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  // This handle is a near type-match of the official OpenAI Node.js package.
  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";
  const wordCount = initialDraft.split(/\s+/).length;

  // Bell curve scoring function to evaluate whether the created email has an appropriate length.
  // Penalizes emails that are too short or too long, with the ideal length around 100 words.
  const calculateScore = (count) => {
    const mean = 100;
    const stdDev = 30;
    const maxScore = 1;
    const zScore = Math.abs(count - mean) / stdDev;
    return Math.max(0, maxScore * Math.exp(-0.5 * Math.pow(zScore, 2)));
  };

  const score = calculateScore(wordCount);

  runner.addEval({
    name: "initial-draft-word-count-score",
    value: score,
  });
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
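
The Promise.all call runs every test case concurrently. If you need to cap concurrency (for example, to stay within OpenAI rate limits), one option is to process the runners in fixed-size batches. A minimal sketch that replaces the final Promise.all and submission above; the batch size is an illustrative choice, not an SDK setting:

typescript
// Process test cases in batches of 5 instead of all at once.
// BATCH_SIZE is illustrative -- tune it to your rate limits.
const BATCH_SIZE = 5;

for (let i = 0; i < testRunners.length; i += BATCH_SIZE) {
  const batch = testRunners.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(processTestCase));
}

// Submit all test runners once every batch has finished
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);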

Using the @gentrace/evals library

To simplify eval creation, we created a separate open-source library for computing common evaluations.

bash
# Execute only one, depending on your package manager
npm i @gentrace/evals
yarn add @gentrace/evals
pnpm i @gentrace/evals

Currently, we support a single, flexible LLM-as-a-judge evaluation with evals.llm.base().

typescript
// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";

  // `evals` comes from the @gentrace/evals package installed above
  const evalResult = await evals.llm.base({
    name: "email-sentiment-evaluation",
    prompt: `Evaluate the sentiment of the following email draft:

"${initialDraft}"

This email was written by ${sender} to ${receiver} based on the following query:

${query}

Please provide a score and reasoning for your evaluation. Consider the overall tone
and emotional content of the email. Rate the sentiment as one of the following:

- Very Negative
- Negative
- Neutral
- Positive
- Very Positive`,
    scoreAs: {
      "Very Negative": 0,
      "Negative": 0.25,
      "Neutral": 0.5,
      "Positive": 0.75,
      "Very Positive": 1,
    },
  });

  runner.addEval(evalResult);
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
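
The snippet above reuses the pipeline and testRunners setup from the first example and additionally needs the evals helpers from @gentrace/evals. A minimal sketch of that surrounding wiring; the exact import shape of @gentrace/evals is an assumption here, so verify it against the package's README:

typescript
import { Pipeline, getTestRunners, submitTestRunners } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";
// Assumed import shape for the evals helpers -- verify against the
// @gentrace/evals package documentation.
import { evals } from "@gentrace/evals";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: "compose",
  plugins: { openai: plugin },
});

// Get test runners for all test cases in the pipeline, as in the first example
const testRunners = await getTestRunners(pipeline);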

Disabling remote evals

If you want to run only local evaluations without triggering remote Gentrace evals, pass triggerRemoteEvals: false to the runner submission.

typescript
const result = await submitTestRunners(pipeline, testRunners, {
  triggerRemoteEvals: false,
});

Viewing results in Gentrace

Locally created evaluations will render in the results UI alongside evaluations run by Gentrace.

Result grid

Local evaluations result grid

Test case view

Local evaluations in test case view

Evaluation view

Local evaluations in evaluation view