The Python eval_dataset() and TypeScript evalDataset() functions run a series of evaluations against a dataset using a provided interaction() function. These functions must be called within the context of an experiment() and automatically create individual test spans for each test case in the dataset.
The Gentrace SDK automatically configures OpenTelemetry when you call init(). If you have an existing OpenTelemetry setup or need custom configuration, see the manual setup guide.

Basic usage

import { init, experiment, evalDataset, testCases, interaction } from 'gentrace';
import { OpenAI } from 'openai';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;
const DATASET_ID = process.env.GENTRACE_DATASET_ID!;

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define your AI function
async function processAIRequest(inputs: Record<string, any>) {
  // Your AI implementation here (e.g., call to OpenAI, Anthropic, custom model, etc.)
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: inputs.prompt }],
  });

  return {
    result: response.choices[0].message.content,
    metadata: { processed: true }
  };
}

// Wrap with interaction tracing (see interaction() docs)
const tracedProcessAIRequest = interaction('Process AI Request', processAIRequest, {
  pipelineId: PIPELINE_ID,
});

// Basic dataset evaluation
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: tracedProcessAIRequest,
  });
});

Overview

Dataset evaluation functions allow you to:
  1. Run batch evaluations against multiple test cases from a dataset
  2. Validate inputs using optional schema validation (Pydantic for Python, Zod or any Standard Schema-compliant schema library for TypeScript)
  3. Trace each test case as individual OpenTelemetry spans within the experiment context
  4. Handle errors gracefully with automatic span error recording
  5. Process results from both synchronous and asynchronous data providers and interaction functions

Parameters

Function signature

function evalDataset<
  TSchema extends ParseableSchema<any> | undefined = undefined,
  TInput = TSchema extends ParseableSchema<infer TOutput> ? TOutput : Record<string, any>,
>(options: EvalDatasetOptions<TSchema>): Promise<void>

Parameters

  • options (EvalDatasetOptions, required): Configuration object containing the data provider, the interaction function, and an optional input schema

TestInput Type

type TestInput<TInput extends Record<string, any>> = {
  name?: string | undefined;
  id?: string | undefined;
  inputs: TInput;
}
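
For example, a single test case conforming to this type might look like (the values are illustrative):

const exampleTestCase = {
  name: 'Basic greeting test',      // optional display name
  id: 'test-1',                     // optional identifier
  inputs: { prompt: 'Say hello' },  // passed to the interaction function
};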

Return Value

In Python, eval_dataset() returns a sequence of results from the interaction function; test cases that fail (for example, due to validation errors) have None in the corresponding positions. In TypeScript, evalDataset() resolves to void, as the signature above shows.

EvalDatasetOptions fields

  • data (required): Function that returns a sequence of test cases, either synchronously or asynchronously
  • schema (optional): Input validation schema; a Pydantic model in Python, or a Zod (or other Standard Schema-compliant) schema in TypeScript
  • interaction (required): The function to test against each test case. Typically created with interaction().
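
Putting these together, the options object has roughly the shape sketched below. This is inferred from the parameters above and is not the SDK's actual EvalDatasetOptions definition:

// Sketch only -- the real EvalDatasetOptions type exported by the gentrace package may differ
interface EvalDatasetOptionsSketch<TInput extends Record<string, any>> {
  // Returns the test cases to evaluate, synchronously or asynchronously
  data: () => TestInput<TInput>[] | Promise<TestInput<TInput>[]>;
  // Optional input validation schema (e.g., a Zod schema)
  schema?: { parse: (value: unknown) => TInput };
  // The function to run against each test case, typically wrapped with interaction()
  interaction: (inputs: TInput) => unknown | Promise<unknown>;
}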

Advanced usage

With schema validation

Schema validation ensures that your test cases have the correct structure and data types before being passed to your interaction function. Use Zod for TypeScript or Pydantic for Python to define your input schemas.
import { z } from 'zod';
import { experiment, evalDataset, testCases } from 'gentrace';
import { OpenAI } from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define input schema
const InputSchema = z.object({
  prompt: z.string(),
  temperature: z.number().optional().default(0.7),
  maxTokens: z.number().optional().default(100)
});

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    schema: InputSchema,
    interaction: async ({ prompt, temperature, maxTokens }) => {
      // Inputs are now typed and validated
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        temperature,
        max_tokens: maxTokens,
      });
      return response.choices[0].message.content;
    },
  });
});

Custom data providers

You can provide test cases from any source by implementing a custom data provider function. Each test case must conform to the TestInput type defined above. This is useful when you want to pull test cases from the Gentrace API via the test cases SDK or define them inline.
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      {
        name: "Basic greeting test",
        id: "test-1",
        inputs: { prompt: "Say hello", temperature: 0.5 }
      },
      {
        name: "Complex reasoning test", 
        id: "test-2",
        inputs: { prompt: "Explain quantum computing", temperature: 0.8 }
      }
    ],
    interaction: async (inputs) => {
      return await myAIFunction(inputs);
    },
  });
});
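
Data providers can also load test cases from sources outside the Gentrace API. A minimal sketch, assuming a local test-cases.json file whose entries already match the TestInput shape ({ name?, id?, inputs }):

import { readFile } from 'fs/promises';

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      // Load test cases from a local JSON file instead of the Gentrace API
      const raw = await readFile('test-cases.json', 'utf-8');
      return JSON.parse(raw);
    },
    interaction: tracedProcessAIRequest,
  });
});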

Error handling

The dataset evaluation functions handle errors gracefully and automatically associate all errors and exceptions with OpenTelemetry spans. When validation or interaction errors occur, they are captured as span events and attributes.
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      { inputs: { prompt: "Valid input" } },
      { inputs: { invalid: "data" } }, // Fails schema validation (missing "prompt")
      { inputs: { prompt: "Another valid input" } }
    ],
    schema: z.object({ prompt: z.string() }),
    interaction: async (inputs) => {
      // This function might throw errors
      if (inputs.prompt.includes('error')) {
        throw new Error('Simulated error');
      }
      return await myAIFunction(inputs);
    },
  });
  
  // evalDataset continues processing other test cases even if some fail
  // All errors are automatically captured in OpenTelemetry spans
});

OTEL span error integration

When errors occur during dataset evaluation:
  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each test case gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed test cases don’t stop the evaluation of other test cases
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual test case spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details

Example span structure
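
The tree below is an illustrative sketch rather than exact SDK output. It assumes an experiment-level parent span with one child span per test case (named after the interaction), where failed cases carry an ERROR status and a recorded exception event, as described above:

experiment (parent span, assumed)
├── Process AI Request: "Basic greeting test"      status: OK
├── Process AI Request: "Complex reasoning test"   status: OK
└── Process AI Request: "Invalid input test"       status: ERROR
    └── exception event (error type and message recorded)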

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Experiment Context: Must be called within an experiment() function
  • Valid Data Provider: The data function must return an array of test cases

Related functions

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • experiment() - Create experiment context for dataset evaluations
  • eval() / evalOnce() - Run individual test cases within an experiment
  • traced() - Alternative approach for tracing functions