Version: 4.7.66

eval() / evalOnce()

The eval() and evalOnce() functions create individual test cases within an experiment(). These functions capture the execution of specific test logic, automatically create OpenTelemetry spans with detailed tracing information, and associate the results with the parent experiment.

These functions must be called within the context of an experiment() and automatically create individual test spans for each evaluation.

tip

OpenTelemetry must be configured for trace data to be captured and reported. See the OpenTelemetry Setup guide for configuration details.

Basic usage

typescript
import { init, experiment, evalOnce, interaction } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Define your AI function with interaction wrapper
const myAIFunction = interaction(
  'my-ai-function',
  async (input: string): Promise<string> => {
    // Your AI logic here - e.g., call to OpenAI, Anthropic, etc.
    return "This is a sample AI response";
  },
  { pipelineId: PIPELINE_ID }
);

// Basic evalOnce usage within an experiment
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-accuracy-test', async () => {
    const input = 'What is 2 + 2?';
    const result = await myAIFunction(input);
    const expected = '4';
    return {
      input,
      result,
      expected,
      passed: result.includes(expected)
    };
  });
});

Overview

Individual evaluations in Gentrace represent specific test cases or validation steps within your AI pipeline testing. The eval() and evalOnce() functions:

  1. Create test case spans in OpenTelemetry with detailed execution tracing
  2. Capture inputs and outputs for each test span automatically; the return value of eval() or the @eval()-decorated function is recorded as the output for that span
  3. Associate with experiments by linking to the parent experiment context
  4. Handle errors gracefully while preserving full error information in traces
  5. Support both sync and async test functions seamlessly
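
For example, since the callback can be synchronous as well as asynchronous, a deterministic check like the sketch below also works. This assumes it runs inside an experiment() block, as in the other examples on this page:

typescript
// A synchronous callback - evalOnce() still returns a Promise you can await
await evalOnce('sync-format-check', () => {
  const output = 'Hello, world!';
  return {
    output,
    passed: output.endsWith('!'),
  };
});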

Parameters

Function signature

typescript
function evalOnce<TResult>(
  spanName: string,
  callback: () => TResult | null | Promise<TResult | null>
): Promise<TResult | null>

Parameters

  • spanName (string, required): A descriptive name for the test case, used for tracing and reporting
  • callback (function, required): The function containing your test logic. Can be synchronous or asynchronous

Return Value

Returns a Promise<TResult | null> that resolves with:

  • The result of your callback function on success
  • null if an error occurs (errors are captured in the OpenTelemetry span)
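
Because failures resolve to null instead of throwing, you can check the return value directly. A minimal sketch, assuming it runs inside an experiment() block:

typescript
const outcome = await evalOnce('null-check-example', async () => {
  return { passed: true };
});

if (outcome === null) {
  // The callback threw; the error details are recorded on the evaluation's span
  console.warn('Evaluation failed - see the OpenTelemetry span for the exception');
}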

Multiple test cases with different evaluation types

typescript
import { callAIModel } from './models';
import { experiment, evalOnce, interaction } from 'gentrace';

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

const myAIFunction = interaction(
  'multi-test-function',
  async (input: string): Promise<string> => {
    return await callAIModel(input);
  },
  { pipelineId: PIPELINE_ID }
);

function calculateAccuracy(result: string, expected: string): number {
  const resultWords = result.toLowerCase().split(/\s+/);
  const expectedWords = expected.toLowerCase().split(/\s+/);
  const matches = expectedWords.filter(word => resultWords.includes(word));
  return matches.length / expectedWords.length;
}

experiment(PIPELINE_ID, async () => {
  // Test accuracy
  await evalOnce('accuracy-test', async () => {
    const input = 'What is machine learning?';
    const expected = 'machine learning is artificial intelligence';
    const result = await myAIFunction(input);
    const accuracy = calculateAccuracy(result, expected);
    return {
      input,
      result,
      expected,
      accuracy,
      passed: accuracy >= 0.7
    };
  });

  // Test latency
  await evalOnce('latency-test', async () => {
    const start = Date.now();
    const result = await myAIFunction('Quick test input');
    const latency = Date.now() - start;
    return {
      result,
      latency,
      threshold: 2000,
      passed: latency < 2000
    };
  });
});

OTEL span error integration

When errors occur during individual evaluations:

  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each evaluation gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed evaluations don't stop the execution of other evaluations in the experiment
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual evaluation spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details
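
As an illustration, a failing evaluation does not abort the experiment. The sketch below uses hypothetical test names and a deliberately thrown error to show that the failed call resolves to null while the surrounding evaluations still run:

typescript
experiment(PIPELINE_ID, async () => {
  await evalOnce('passing-test', async () => ({ passed: true }));

  // This callback throws: the exception is recorded on the evaluation span,
  // the span status is set to ERROR, and evalOnce() resolves to null
  const failed = await evalOnce('failing-test', async () => {
    throw new Error('Model returned malformed JSON');
  });
  console.log(failed); // null

  // Execution continues - later evaluations are unaffected
  await evalOnce('still-runs-test', async () => ({ passed: true }));
});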

Requirements

  • OpenTelemetry Setup: Individual evaluations require OpenTelemetry to be configured for tracing (see OpenTelemetry setup guide)
  • Experiment Context: Must be called within an experiment() function
  • Valid Pipeline ID: Parent experiment must have a valid Gentrace pipeline ID
  • API Key: Gentrace API key must be configured via init()

Context requirements

Both eval() and evalOnce() must be called within an active experiment context. They automatically:

  1. Retrieve experiment context from the parent experiment() function
  2. Associate spans with the experiment ID for proper grouping
  3. Inherit experiment metadata and configuration

typescript
// ❌ This will throw an error - no experiment context
await evalOnce('invalid-test', async () => {
  return 'This will fail';
});

// ✅ Correct usage within experiment context
experiment(PIPELINE_ID, async () => {
  await evalOnce('valid-test', async () => {
    return { message: 'This will work', passed: true };
  });
});

OpenTelemetry integration

The evaluation functions create rich OpenTelemetry spans with comprehensive tracing information:

Span attributes

  • gentrace.experiment_id: Links the evaluation to its parent experiment
  • gentrace.test_case_name: The name provided to the evaluation function
  • Error information: Automatic error type and message capture
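
If you need additional attributes on an evaluation span (such as the custom.metadata value shown in the example structure below), one option is the standard OpenTelemetry API. This sketch assumes the evaluation span is the active span inside your callback:

typescript
import { trace } from '@opentelemetry/api';

await evalOnce('accuracy-test', async () => {
  // Attach a custom attribute to the currently active span, if any
  trace.getActiveSpan()?.setAttribute('custom.metadata', 'value');
  return { passed: true };
});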

Span events

  • gentrace.fn.output: Records function outputs for result tracking
  • Exception events: Automatic exception recording with full stack traces

Example span structure

Experiment Span (experiment-name)
├── Evaluation Span (accuracy-test)
│   ├── Attributes:
│   │   ├── gentrace.experiment_id: "exp-123"
│   │   ├── gentrace.test_case_name: "accuracy-test"
│   │   └── custom.metadata: "value"
│   ├── Events:
│   │   ├── gentrace.fn.args: {"args": ["input data"]}
│   │   └── gentrace.fn.output: {"output": {"accuracy": 0.95}}
│   └── Child Spans (from interaction() calls)
└── Evaluation Span (latency-test)
    └── ...

See also

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • evalDataset() / eval_dataset() - Run tests against a dataset within an experiment
  • experiment() - Create experiment contexts for grouping evaluations
  • traced() - Alternative approach for tracing functions