Version: 4.7.66

eval_dataset() / evalDataset()

The Python eval_dataset() and TypeScript evalDataset() functions run a series of evaluations against a dataset using a provided interaction() function.

These functions must be called within the context of an experiment() and automatically create individual test spans for each test case in the dataset.

tip

OpenTelemetry must be set up for trace data to be captured correctly. See the OpenTelemetry Setup guide for configuration details.

Basic usage

typescript
import { init, experiment, evalDataset, testCases, interaction } from 'gentrace';
import { OpenAI } from 'openai';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;
const DATASET_ID = process.env.GENTRACE_DATASET_ID!;

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define your AI function
async function processAIRequest(inputs: Record<string, any>) {
  // Your AI implementation here (e.g., call to OpenAI, Anthropic, custom model, etc.)
  return {
    result: "This is a sample AI response",
    metadata: { processed: true },
  };
}

// Wrap with interaction tracing (see interaction() docs)
const tracedProcessAIRequest = interaction('Process AI Request', processAIRequest, {
  pipelineId: PIPELINE_ID,
});

// Basic dataset evaluation
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: tracedProcessAIRequest,
  });
});

Overview

Dataset evaluation functions allow you to:

  1. Run batch evaluations against multiple test cases from a dataset
  2. Validate inputs using optional schema validation (Pydantic for Python, Zod or any Standard Schema-compliant schema library for TypeScript)
  3. Trace each test case as individual OpenTelemetry spans within the experiment context (requires OpenTelemetry setup)
  4. Handle errors gracefully with automatic span error recording
  5. Process results from both synchronous and asynchronous data providers and interaction functions

Parameters

Function signature

typescript
function evalDataset<
  TSchema extends ParseableSchema<any> | undefined = undefined,
  TInput = TSchema extends ParseableSchema<infer TOutput> ? TOutput : Record<string, any>,
>(options: EvalDatasetOptions<TSchema>): Promise<void>

Parameters

  • options (EvalDatasetOptions, required): Configuration object containing data provider, interaction function, and optional schema

EvalDatasetOptions

typescript
type EvalDatasetOptions<TSchema extends ParseableSchema<any> | undefined> = {
  data: () => Promise<TestInput<Record<string, any>>[]> | TestInput<Record<string, any>>[];
  schema?: TSchema;
  interaction: (arg: TSchema extends ParseableSchema<infer O> ? O : Record<string, any>) => any; // See interaction()
};

  • data (function, required): Function that returns test cases, either synchronously or asynchronously
  • schema (ParseableSchema, optional): Schema object with a parse() method for input validation
  • interaction (function, required): The function to test against each test case. Typically created with interaction().

TestInput Type

typescript
type TestInput<TInput extends Record<string, any>> = {
  name?: string | undefined;
  id?: string | undefined;
  inputs: TInput;
};

Advanced usage

With schema validation

Schema validation ensures that your test cases have the correct structure and data types before being passed to your interaction function.

Use Zod for TypeScript or Pydantic for Python to define your input schemas.

typescript
import { z } from 'zod';
import { evalDataset, experiment, testCases } from 'gentrace';
import { OpenAI } from 'openai';

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;
const DATASET_ID = process.env.GENTRACE_DATASET_ID!;

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define input schema
const InputSchema = z.object({
  prompt: z.string(),
  temperature: z.number().optional().default(0.7),
  maxTokens: z.number().optional().default(100),
});

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    schema: InputSchema,
    interaction: async ({ prompt, temperature, maxTokens }) => {
      // Inputs are now typed and validated
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        temperature,
        max_tokens: maxTokens,
      });
      return response.choices[0].message.content;
    },
  });
});
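
Zod is the most common choice, but per the ParseableSchema shape above, any object with a parse() method that returns validated inputs (or throws) can be passed as schema. A minimal hand-rolled validator, shown here as a sketch with illustrative field names, works the same way:

typescript
// Sketch of a hand-rolled schema: the only requirement documented above is a
// parse() method that returns the validated inputs or throws on invalid input.
const ManualInputSchema = {
  parse(value: unknown): { prompt: string; temperature: number } {
    const v = value as Record<string, any>;
    if (typeof v.prompt !== 'string') {
      throw new Error('prompt must be a string');
    }
    return {
      prompt: v.prompt,
      temperature: typeof v.temperature === 'number' ? v.temperature : 0.7,
    };
  },
};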

Custom data providers

You can provide test cases from any source by implementing a custom data provider function. Each data point must conform to the TestInput structure shown above.

This is useful when you want to pull test cases from Gentrace via the test cases API, define them inline, or load them from another source (see the examples below).

typescript
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      {
        name: "Basic greeting test",
        id: "test-1",
        inputs: { prompt: "Say hello", temperature: 0.5 },
      },
      {
        name: "Complex reasoning test",
        id: "test-2",
        inputs: { prompt: "Explain quantum computing", temperature: 0.8 },
      },
    ],
    interaction: async (inputs) => {
      return await myAIFunction(inputs);
    },
  });
});
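
Because data is just a function, test cases can also be loaded from external sources. The following sketch reads them from a local JSON file (the file path is hypothetical) and reuses the traced interaction from the basic usage example:

typescript
import { readFile } from 'node:fs/promises';

// Hypothetical: ./test-cases.json contains an array shaped like TestInput[],
// e.g. [{ "name": "Basic greeting test", "inputs": { "prompt": "Say hello" } }]
const fileDataProvider = async () => {
  const raw = await readFile('./test-cases.json', 'utf-8');
  return JSON.parse(raw) as { name?: string; id?: string; inputs: Record<string, any> }[];
};

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: fileDataProvider,
    interaction: tracedProcessAIRequest,
  });
});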

Error handling

The dataset evaluation functions handle errors gracefully and automatically associate all errors and exceptions with OpenTelemetry spans. When validation or interaction errors occur, they are captured as span events and attributes.

typescript
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      { inputs: { prompt: "Valid input" } },
      { inputs: { invalid: "data" } }, // This might fail validation
      { inputs: { prompt: "Another valid input" } },
    ],
    schema: z.object({ prompt: z.string() }),
    interaction: async (inputs) => {
      // This function might throw errors
      if (inputs.prompt.includes('error')) {
        throw new Error('Simulated error');
      }
      return await myAIFunction(inputs);
    },
  });
  // evalDataset continues processing other test cases even if some fail
  // All errors are automatically captured in OpenTelemetry spans
});

OTEL span error integration

When errors occur during dataset evaluation:

  • Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
  • Individual Spans: Each test case gets its own span, so errors are isolated to specific test cases
  • Continued Processing: Failed test cases don't stop the evaluation of other test cases
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: Individual test case spans are marked with ERROR status when exceptions occur
  • Error Handling: See OpenTelemetry error recording and exception recording for more details
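
To inspect these events and attributes locally, you can attach any standard OpenTelemetry exporter. The following is a minimal sketch using the generic OpenTelemetry Node SDK with a console exporter; it is not Gentrace-specific, and for real runs you should use the configuration from the OpenTelemetry Setup guide:

typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';

// Print finished spans (including exception events and ERROR status)
// to stdout so failing test cases can be inspected during development.
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
});
sdk.start();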

Requirements

  • OpenTelemetry Setup: Dataset evaluation requires OpenTelemetry to be configured for tracing (see OpenTelemetry setup guide)
  • Experiment Context: Must be called within an experiment() function
  • Valid Data Provider: The data function must return an array of test cases
  • API Key: Gentrace API key must be configured via init()
See also

  • init() - Initialize the Gentrace SDK
  • interaction() - Instrument AI functions for tracing within experiments
  • experiment() - Create experiment context for dataset evaluations
  • eval() / evalOnce() - Run individual test cases within an experiment
  • traced() - Alternative approach for tracing functions