The Python eval_dataset() and TypeScript evalDataset() functions run a series of evaluations against a dataset using a provided interaction() function. These functions must be called within the context of an experiment() and automatically create individual test spans for each test case in the dataset.
The Gentrace SDK automatically configures OpenTelemetry when you call init(). If you have an existing OpenTelemetry setup or need custom configuration, see the manual setup guide.

Basic usage

import { init, experiment, evalDataset, testCases, interaction } from 'gentrace';
import { OpenAI } from 'openai';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;
const DATASET_ID = process.env.GENTRACE_DATASET_ID!;

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define your AI function
async function processAIRequest(inputs: Record<string, any>) {
  // Your AI implementation here (e.g., call to OpenAI, Anthropic, custom model, etc.)
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: inputs.prompt }],
  });

  return {
    result: response.choices[0].message.content,
    metadata: { processed: true }
  };
}

// Wrap with interaction tracing (see interaction() docs)
const tracedProcessAIRequest = interaction('Process AI Request', processAIRequest, {
  pipelineId: PIPELINE_ID,
});

// Basic dataset evaluation
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: tracedProcessAIRequest,
  });
});

Overview

Dataset evaluation functions allow you to:
  1. Run batch evaluations against multiple test cases from a dataset
  2. Validate inputs using optional schema validation (Pydantic for Python, Zod or any Standard Schema-compliant schema library for TypeScript)
  3. Trace each test case as individual OpenTelemetry spans within the experiment context
  4. Handle errors gracefully with automatic span error recording
  5. Process results from both synchronous and asynchronous data providers and interaction functions

Parameters

Function signature

function evalDataset<
  TSchema extends ParseableSchema<any> | undefined = undefined,
  TInput = TSchema extends ParseableSchema<infer TOutput> ? TOutput : Record<string, any>,
>(options: EvalDatasetOptions<TSchema>): Promise<void>

Parameters

  • options (EvalDatasetOptions, required): Configuration object containing the data provider, the interaction function, and an optional input schema

TestInput Type

type TestInput<TInput extends Record<string, any>> = {
  name?: string | undefined;
  id?: string | undefined;
  inputs: TInput;
}
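
For example, a single test case conforming to this type might look like (the values are illustrative):

const exampleTestCase = {
  name: 'Basic greeting test',      // optional display name
  id: 'test-1',                     // optional identifier
  inputs: { prompt: 'Say hello' },  // passed to the interaction function
};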

Return Value

In Python, eval_dataset() returns a sequence of results from the interaction function; test cases that fail (for example, due to validation errors) have None in the corresponding positions. In TypeScript, evalDataset() resolves to void, as the signature above shows.

EvalDatasetOptions fields

  • data (required): Function that returns a sequence of test cases, either synchronously or asynchronously
  • schema (optional): Input validation schema; a Pydantic model in Python, or a Zod (or other Standard Schema-compliant) schema in TypeScript
  • interaction (required): The function to test against each test case. Typically created with interaction().
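
Putting these together, the options object has roughly the shape sketched below. This is inferred from the parameters above and is not the SDK's actual EvalDatasetOptions definition:

// Sketch only -- the real EvalDatasetOptions type exported by the gentrace package may differ
interface EvalDatasetOptionsSketch<TInput extends Record<string, any>> {
  // Returns the test cases to evaluate, synchronously or asynchronously
  data: () => TestInput<TInput>[] | Promise<TestInput<TInput>[]>;
  // Optional input validation schema (e.g., a Zod schema)
  schema?: { parse: (value: unknown) => TInput };
  // The function to run against each test case, typically wrapped with interaction()
  interaction: (inputs: TInput) => unknown | Promise<unknown>;
}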

Advanced usage

With schema validation

Schema validation ensures that your test cases have the correct structure and data types before being passed to your interaction function. Use Zod for TypeScript or Pydantic for Python to define your input schemas.
import { z } from 'zod';
import { experiment, evalDataset, testCases } from 'gentrace';
import { OpenAI } from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define input schema
const InputSchema = z.object({
  prompt: z.string(),
  temperature: z.number().optional().default(0.7),
  maxTokens: z.number().optional().default(100)
});

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    schema: InputSchema,
    interaction: async ({ prompt, temperature, maxTokens }) => {
      // Inputs are now typed and validated
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        temperature,
        max_tokens: maxTokens,
      });
      return response.choices[0].message.content;
    },
  });
});

Custom data providers

You can provide test cases from any source by implementing a custom data provider function. Each test case must conform to the TestInput type defined above. This is useful when you want to pull test cases from the Gentrace API via the test cases SDK or define them inline.
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      {
        name: "Basic greeting test",
        id: "test-1",
        inputs: { prompt: "Say hello", temperature: 0.5 }
      },
      {
        name: "Complex reasoning test", 
        id: "test-2",
        inputs: { prompt: "Explain quantum computing", temperature: 0.8 }
      }
    ],
    interaction: async (inputs) => {
      return await myAIFunction(inputs);
    },
  });
});
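
Data providers can also load test cases from sources outside the Gentrace API. A minimal sketch, assuming a local test-cases.json file whose entries already match the TestInput shape ({ name?, id?, inputs }):

import { readFile } from 'fs/promises';

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      // Load test cases from a local JSON file instead of the Gentrace API
      const raw = await readFile('test-cases.json', 'utf-8');
      return JSON.parse(raw);
    },
    interaction: tracedProcessAIRequest,
  });
});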

Error handling

The dataset evaluation functions handle errors gracefully and automatically associate all errors and exceptions with OpenTelemetry spans. When validation or interaction errors occur, they are captured as span events and attributes.
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      { inputs: { prompt: "Valid input" } },
      { inputs: { invalid: "data" } }, // Fails schema validation (missing "prompt")
      { inputs: { prompt: "Another valid input" } }
    ],
    schema: z.object({ prompt: z.string() }),
    interaction: async (inputs) => {
      // This function might throw errors
      if (inputs.prompt.includes('error')) {
        throw new Error('Simulated error');
      }
      return await myAIFunction(inputs);
    },
  });
  
  // evalDataset continues processing other test cases even if some fail
  // All errors are automatically captured in OpenTelemetry spans
});

OTEL span error integration

When errors occur during dataset evaluation:
  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each test case gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed test cases don’t stop the evaluation of other test cases
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual test case spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details

Example span structure
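
The tree below is an illustrative sketch rather than exact SDK output. It assumes an experiment-level parent span with one child span per test case (named after the interaction), where failed cases carry an ERROR status and a recorded exception event, as described above:

experiment (parent span, assumed)
├── Process AI Request: "Basic greeting test"      status: OK
├── Process AI Request: "Complex reasoning test"   status: OK
└── Process AI Request: "Invalid input test"       status: ERROR
    └── exception event (error type and message recorded)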

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Experiment Context: Must be called within an experiment() function
  • Valid Data Provider: The data function must return an array of test cases

Related functions

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • experiment() - Create experiment context for dataset evaluations
  • eval() / evalOnce() - Run individual test cases within an experiment
  • traced() - Alternative approach for tracing functions