The Python eval() and TypeScript evalOnce() functions create individual test cases within an experiment(). Each call captures the execution of a specific piece of test logic, creates an OpenTelemetry span with detailed tracing information, and associates the result with the parent experiment. Both functions must be called within the context of an experiment().
The Gentrace SDK automatically configures OpenTelemetry when you call init(). If you have an existing OpenTelemetry setup or need custom configuration, see the manual setup guide.

Basic usage

import { init, experiment, evalOnce, interaction } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Define your AI function with interaction wrapper
const myAIFunction = interaction(
  'my-ai-function',
  async (input: string): Promise<string> => {
    // Your AI logic here - e.g., call to OpenAI, Anthropic, etc.
    return "This is a sample AI response";
  },
  { pipelineId: PIPELINE_ID }
);

// Basic evalOnce usage within an experiment
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-accuracy-test', async () => {
    const input = 'What is 2 + 2?';
    const result = await myAIFunction(input);
    const expected = '4';

    return {
      input,
      result,
      expected,
      passed: result.includes(expected)
    };
  });
});

Overview

Individual evaluations in Gentrace represent specific test cases or validation steps within your AI pipeline testing. The eval() and evalOnce() functions provide the test execution capabilities summarized below.

Key features

  • OpenTelemetry integration: Generate OpenTelemetry spans with detailed execution tracing for each test case
  • Automatic data capture: Record the return value of eval() or @eval()-decorated functions as the output for each test span
  • Experiment association: Link test cases to the parent experiment context for proper grouping and analysis
  • Error resilience: Preserve full error information in traces while allowing tests to continue running
  • Execution flexibility: Support both synchronous and asynchronous test functions seamlessly, as sketched below
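
Both callback forms are accepted by evalOnce(). The sketch below reuses myAIFunction and PIPELINE_ID from the basic usage example above; the first call passes a synchronous callback, the second an asynchronous one.

experiment(PIPELINE_ID, async () => {
  // Synchronous callback: returns a value directly
  await evalOnce('sync-check', () => {
    const output = 'HELLO';
    return { output, passed: output === output.toUpperCase() };
  });

  // Asynchronous callback: returns a Promise
  await evalOnce('async-check', async () => {
    const result = await myAIFunction('Say hello');
    return { result, passed: result.length > 0 };
  });
});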

Parameters

/**
 * Run a single evaluation test case
 * 
 * @param spanName - A descriptive name for the test case,
 *                   used for tracing and reporting
 * @param callback - The function containing your test logic.
 *                   Can be synchronous or asynchronous
 * 
 * @returns A Promise<TResult | null> that resolves with:
 *          - The result of your callback function on success
 *          - null if an error occurs (errors are captured
 *            in the OpenTelemetry span)
 */
function evalOnce<TResult>(
  spanName: string,
  callback: () => TResult | null | Promise<TResult | null>
): Promise<TResult | null>
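
Because errors resolve to null rather than rejecting, you can check the resolved value to tell whether a test case produced a result. A minimal sketch, reusing myAIFunction and PIPELINE_ID from the basic usage example above:

experiment(PIPELINE_ID, async () => {
  // Resolves with the callback's return value, or null if the callback threw
  const outcome = await evalOnce('resolved-value-check', async () => {
    const result = await myAIFunction('What is 2 + 2?');
    return { result, passed: result.includes('4') };
  });

  if (outcome === null) {
    // The error was captured on the evaluation's span
    console.warn('resolved-value-check errored - see its span for details');
  } else {
    console.log('passed:', outcome.passed);
  }
});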

Multiple test cases with different evaluation types

import { callAIModel } from './models';
import { experiment, evalOnce, interaction } from 'gentrace';

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

const myAIFunction = interaction(
  'multi-test-function',
  async (input: string): Promise<string> => {
    return await callAIModel(input);
  },
  { pipelineId: PIPELINE_ID }
);

function calculateAccuracy(result: string, expected: string): number {
  const resultWords = result.toLowerCase().split(/\s+/);
  const expectedWords = expected.toLowerCase().split(/\s+/);
  const matches = expectedWords.filter(word => resultWords.includes(word));
  return matches.length / expectedWords.length;
}

experiment(PIPELINE_ID, async () => {
  // Test accuracy
  await evalOnce('accuracy-test', async () => {
    const input = 'What is machine learning?';
    const expected = 'machine learning is artificial intelligence';
    const result = await myAIFunction(input);
    const accuracy = calculateAccuracy(result, expected);

    return {
      input,
      result,
      expected,
      accuracy,
      passed: accuracy >= 0.7
    };
  });

  // Test latency
  await evalOnce('latency-test', async () => {
    const start = Date.now();
    const result = await myAIFunction('Quick test input');
    const latency = Date.now() - start;

    return {
      result,
      latency,
      threshold: 2000,
      passed: latency < 2000
    };
  });
});

OTEL span error integration

When errors occur during individual evaluations:
  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each evaluation gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed evaluations don’t stop the execution of other evaluations in the experiment
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual evaluation spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details
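
For example, a throwing evaluation is isolated to its own span and does not prevent later evaluations from running. A minimal sketch, reusing PIPELINE_ID from the examples above:

experiment(PIPELINE_ID, async () => {
  // Throws: the exception is recorded on this evaluation's span,
  // which is marked with ERROR status
  await evalOnce('failing-test', async () => {
    throw new Error('Simulated validation failure');
  });

  // Still runs: the failure above does not affect this evaluation
  await evalOnce('passing-test', async () => {
    return { message: 'Still executed', passed: true };
  });
});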

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Experiment Context: Must be called within an experiment() function
  • Valid Pipeline ID: Parent experiment must have a valid Gentrace pipeline ID

Context requirements

Both eval() and evalOnce() must be called within an active experiment context. They automatically:
  1. Retrieve experiment context from the parent experiment() function
  2. Associate spans with the experiment ID for proper grouping
  3. Inherit experiment metadata and configuration

// ❌ This will throw an error - no experiment context
await evalOnce('invalid-test', async () => {
  return 'This will fail';
});

// ✅ Correct usage within experiment context
experiment(PIPELINE_ID, async () => {
  await evalOnce('valid-test', async () => {
    return { message: 'This will work', passed: true };
  });
});

OpenTelemetry integration

The evaluation functions create rich OpenTelemetry spans with comprehensive tracing information:

Span attributes

  • gentrace.experiment_id (string, required): Links the evaluation to its parent experiment
  • gentrace.test_case_name (string, required): The name provided to the evaluation function
  • Error information (automatic): Automatic error type and message capture

Span events

  • gentrace.fn.output: Records function outputs for result tracking
  • Exception events: Automatic exception recording with full stack traces

Example span structure
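
The span produced for the simple-accuracy-test evaluation from the basic usage example looks roughly like the following illustrative sketch (exact serialization may vary by exporter). The gentrace.fn.output event carries the serialized return value of the callback, and the status becomes ERROR when the callback throws:

{
  "name": "simple-accuracy-test",
  "attributes": {
    "gentrace.experiment_id": "<parent experiment ID>",
    "gentrace.test_case_name": "simple-accuracy-test"
  },
  "events": [
    { "name": "gentrace.fn.output" }
  ],
  "status": "OK"
}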

Related functions

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • evalDataset() / eval_dataset() - Run tests against a dataset within an experiment
  • experiment() - Create experiment contexts for grouping evaluations
  • traced() - Alternative approach for tracing functions