The Python eval() and TypeScript evalOnce() functions create individual test cases within an experiment(). Each call captures the execution of a specific piece of test logic, creates an OpenTelemetry span with detailed tracing information, and associates the result with the parent experiment. Both functions must be called within the context of an experiment().
The Gentrace SDK automatically configures OpenTelemetry when you call init(). If you have an existing OpenTelemetry setup or need custom configuration, see the manual setup guide.

Basic usage

import { init, experiment, evalOnce, interaction } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Define your AI function with interaction wrapper
const myAIFunction = interaction(
  'my-ai-function',
  async (input: string): Promise<string> => {
    // Your AI logic here - e.g., call to OpenAI, Anthropic, etc.
    return "This is a sample AI response";
  },
  { pipelineId: PIPELINE_ID }
);

// Basic evalOnce usage within an experiment
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-accuracy-test', async () => {
    const input = 'What is 2 + 2?';
    const result = await myAIFunction(input);
    const expected = '4';

    return {
      input,
      result,
      expected,
      passed: result.includes(expected)
    };
  });
});

Overview

Individual evaluations in Gentrace represent specific test cases or validation steps within your AI pipeline testing. The eval() and evalOnce() functions provide the test execution capabilities summarized below.

Key features

  • OpenTelemetry integration: Generate OpenTelemetry spans with detailed execution tracing for each test case
  • Automatic data capture: Record the return value of eval() or @eval()-decorated functions as the output for each test span
  • Experiment association: Link test cases to the parent experiment context for proper grouping and analysis
  • Error resilience: Preserve full error information in traces while allowing tests to continue running
  • Execution flexibility: Support both synchronous and asynchronous test functions seamlessly, as sketched below
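
Both callback forms are accepted by evalOnce(). The sketch below reuses myAIFunction and PIPELINE_ID from the basic usage example above; the first call passes a synchronous callback, the second an asynchronous one.

experiment(PIPELINE_ID, async () => {
  // Synchronous callback: returns a value directly
  await evalOnce('sync-check', () => {
    const output = 'HELLO';
    return { output, passed: output === output.toUpperCase() };
  });

  // Asynchronous callback: returns a Promise
  await evalOnce('async-check', async () => {
    const result = await myAIFunction('Say hello');
    return { result, passed: result.length > 0 };
  });
});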

Parameters

/**
 * Run a single evaluation test case
 * 
 * @param spanName - A descriptive name for the test case,
 *                   used for tracing and reporting
 * @param callback - The function containing your test logic.
 *                   Can be synchronous or asynchronous
 * 
 * @returns A Promise<TResult | null> that resolves with:
 *          - The result of your callback function on success
 *          - null if an error occurs (errors are captured
 *            in the OpenTelemetry span)
 */
function evalOnce<TResult>(
  spanName: string,
  callback: () => TResult | null | Promise<TResult | null>
): Promise<TResult | null>
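
Because errors resolve to null rather than rejecting, you can check the resolved value to tell whether a test case produced a result. A minimal sketch, reusing myAIFunction and PIPELINE_ID from the basic usage example above:

experiment(PIPELINE_ID, async () => {
  // Resolves with the callback's return value, or null if the callback threw
  const outcome = await evalOnce('resolved-value-check', async () => {
    const result = await myAIFunction('What is 2 + 2?');
    return { result, passed: result.includes('4') };
  });

  if (outcome === null) {
    // The error was captured on the evaluation's span
    console.warn('resolved-value-check errored - see its span for details');
  } else {
    console.log('passed:', outcome.passed);
  }
});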

Multiple test cases with different evaluation types

import { callAIModel } from './models';
import { experiment, evalOnce, interaction } from 'gentrace';

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

const myAIFunction = interaction(
  'multi-test-function',
  async (input: string): Promise<string> => {
    return await callAIModel(input);
  },
  { pipelineId: PIPELINE_ID }
);

function calculateAccuracy(result: string, expected: string): number {
  const resultWords = result.toLowerCase().split(/\s+/);
  const expectedWords = expected.toLowerCase().split(/\s+/);
  const matches = expectedWords.filter(word => resultWords.includes(word));
  return matches.length / expectedWords.length;
}

experiment(PIPELINE_ID, async () => {
  // Test accuracy
  await evalOnce('accuracy-test', async () => {
    const input = 'What is machine learning?';
    const expected = 'machine learning is artificial intelligence';
    const result = await myAIFunction(input);
    const accuracy = calculateAccuracy(result, expected);

    return {
      input,
      result,
      expected,
      accuracy,
      passed: accuracy >= 0.7
    };
  });

  // Test latency
  await evalOnce('latency-test', async () => {
    const start = Date.now();
    const result = await myAIFunction('Quick test input');
    const latency = Date.now() - start;

    return {
      result,
      latency,
      threshold: 2000,
      passed: latency < 2000
    };
  });
});

OTEL span error integration

When errors occur during individual evaluations:
  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each evaluation gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed evaluations don’t stop the execution of other evaluations in the experiment
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual evaluation spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details
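
For example, a throwing evaluation is isolated to its own span and does not prevent later evaluations from running. A minimal sketch, reusing PIPELINE_ID from the examples above:

experiment(PIPELINE_ID, async () => {
  // Throws: the exception is recorded on this evaluation's span,
  // which is marked with ERROR status
  await evalOnce('failing-test', async () => {
    throw new Error('Simulated validation failure');
  });

  // Still runs: the failure above does not affect this evaluation
  await evalOnce('passing-test', async () => {
    return { message: 'Still executed', passed: true };
  });
});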

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Experiment Context: Must be called within an experiment() function
  • Valid Pipeline ID: Parent experiment must have a valid Gentrace pipeline ID

Context requirements

Both eval() and evalOnce() must be called within an active experiment context. They automatically:
  1. Retrieve experiment context from the parent experiment() function
  2. Associate spans with the experiment ID for proper grouping
  3. Inherit experiment metadata and configuration

// ❌ This will throw an error - no experiment context
await evalOnce('invalid-test', async () => {
  return 'This will fail';
});

// ✅ Correct usage within experiment context
experiment(PIPELINE_ID, async () => {
  await evalOnce('valid-test', async () => {
    return { message: 'This will work', passed: true };
  });
});

OpenTelemetry integration

The evaluation functions create rich OpenTelemetry spans with comprehensive tracing information:

Span attributes

  • gentrace.experiment_id (string, required): Links the evaluation to its parent experiment
  • gentrace.test_case_name (string, required): The name provided to the evaluation function
  • Error information (automatic): Automatic error type and message capture

Span events

  • gentrace.fn.output: Records function outputs for result tracking
  • Exception events: Automatic exception recording with full stack traces

Example span structure
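
The span produced for the simple-accuracy-test evaluation from the basic usage example looks roughly like the following illustrative sketch (exact serialization may vary by exporter). The gentrace.fn.output event carries the serialized return value of the callback, and the status becomes ERROR when the callback throws:

{
  "name": "simple-accuracy-test",
  "attributes": {
    "gentrace.experiment_id": "<parent experiment ID>",
    "gentrace.test_case_name": "simple-accuracy-test"
  },
  "events": [
    { "name": "gentrace.fn.output" }
  ],
  "status": "OK"
}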

Related functions

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • evalDataset() / eval_dataset() - Run tests against a dataset within an experiment
  • experiment() - Create experiment contexts for grouping evaluations
  • traced() - Alternative approach for tracing functions