eval_dataset() / evalDataset()
The Python eval_dataset() and TypeScript evalDataset() functions run a series of evaluations against a dataset using a provided interaction() function.
These functions must be called within the context of an experiment() and automatically create individual test spans for each test case in the dataset.
OpenTelemetry must be set up correctly for trace data to be reported; see the OpenTelemetry Setup guide for configuration details.
Basic usage
- TypeScript
- Python
```typescript
import { init, experiment, evalDataset, testCases, interaction } from 'gentrace';
import { OpenAI } from 'openai';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;
const DATASET_ID = process.env.GENTRACE_DATASET_ID!;

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define your AI function
async function processAIRequest(inputs: Record<string, any>) {
  // Your AI implementation here (e.g., call to OpenAI, Anthropic, custom model, etc.)
  return {
    result: "This is a sample AI response",
    metadata: { processed: true }
  };
}

// Wrap with interaction tracing (see interaction() docs)
const tracedProcessAIRequest = interaction('Process AI Request', processAIRequest, {
  pipelineId: PIPELINE_ID,
});

// Basic dataset evaluation
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: tracedProcessAIRequest,
  });
});
```
```python
import os
import asyncio

from gentrace import init, experiment, eval_dataset, test_cases, interaction
from openai import AsyncOpenAI

init(api_key=os.environ["GENTRACE_API_KEY"])

PIPELINE_ID = os.environ["GENTRACE_PIPELINE_ID"]
DATASET_ID = os.environ["GENTRACE_DATASET_ID"]

openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define your AI function
async def process_ai_request(inputs):
    # Your AI implementation here (e.g., call to OpenAI, Anthropic, custom model, etc.)
    return {
        "result": "This is a sample AI response",
        "metadata": {"processed": True},
    }

# Wrap with interaction tracing (see interaction() docs)
@interaction(pipeline_id=PIPELINE_ID, name="Process AI Request")
async def traced_process_ai_request(inputs):
    return await process_ai_request(inputs)

@experiment(pipeline_id=PIPELINE_ID)
async def dataset_evaluation() -> None:
    async def fetch_test_cases():
        test_case_list = await test_cases.list(dataset_id=DATASET_ID)
        return test_case_list.data

    await eval_dataset(
        data=fetch_test_cases,
        interaction=traced_process_ai_request,
    )

# Run the experiment
asyncio.run(dataset_evaluation())
```
Overview
Dataset evaluation functions allow you to:
- Run batch evaluations against multiple test cases from a dataset
- Validate inputs using optional schema validation (Pydantic for Python, Zod or any Standard Schema-compliant schema library for TypeScript)
- Trace each test case as individual OpenTelemetry spans within the experiment context (requires OpenTelemetry setup)
- Handle errors gracefully with automatic span error recording
- Process results from both synchronous and asynchronous data providers and interaction functions (see the sketch below)
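To illustrate the last point, here is a minimal Python sketch with an entirely synchronous data provider and interaction function, which the eval_dataset() signature allows. score_prompt is a placeholder for your own logic and PIPELINE_ID is carried over from the basic usage example; in practice you would usually wrap the interaction with interaction().

```python
# Minimal sketch: eval_dataset() also accepts plain synchronous callables for
# both the data provider and the interaction function.
from gentrace import experiment, eval_dataset

def score_prompt(inputs):
    # Plain synchronous logic; no await required (placeholder implementation)
    return {"prompt_length": len(inputs["prompt"])}

@experiment(pipeline_id=PIPELINE_ID)
async def sync_provider_example() -> None:
    results = await eval_dataset(
        data=lambda: [  # synchronous data provider returning TestInput-shaped dicts
            {"name": "Sync example", "inputs": {"prompt": "Say hello"}},
        ],
        interaction=score_prompt,  # synchronous interaction function
    )
```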
Parameters
- TypeScript
- Python
Function signature
```typescript
function evalDataset<
  TSchema extends ParseableSchema<any> | undefined = undefined,
  TInput = TSchema extends ParseableSchema<infer TOutput> ? TOutput : Record<string, any>,
>(options: EvalDatasetOptions<TSchema>): Promise<void>
```
Parameters
- options (EvalDatasetOptions, required): Configuration object containing the data provider, interaction function, and optional schema
EvalDatasetOptions
```typescript
type EvalDatasetOptions<TSchema extends ParseableSchema<any> | undefined> = {
  data: () => Promise<TestInput<Record<string, any>>[]> | TestInput<Record<string, any>>[];
  schema?: TSchema;
  interaction: (arg: TSchema extends ParseableSchema<infer O> ? O : Record<string, any>) => any; // See interaction()
}
```
- data (function, required): Function that returns test cases, either synchronously or asynchronously
- schema (ParseableSchema, optional): Schema object with a parse() method for input validation
- interaction (function, required): The function to test against each test case. Typically created with interaction().
TestInput Type
```typescript
type TestInput<TInput extends Record<string, any>> = {
  name?: string | undefined;
  id?: string | undefined;
  inputs: TInput;
}
```
Function signature
```python
async def eval_dataset(
    *,
    data: DataProviderType[InputPayload],
    schema: Optional[Type[SchemaPydanticModel]] = None,
    interaction: Callable[[Any], Union[TResult, Awaitable[TResult]]],
) -> Sequence[Optional[TResult]]
```
Parameters
- data (Callable, required): Function that returns a sequence of test cases, either synchronously or asynchronously
- schema (Type[BaseModel], optional): Pydantic model for input validation
- interaction (Callable, required): The function to test against each test case. Typically created with interaction().
TestInput Type
```python
class TestInput(TypedDict, Generic[InputPayload], total=False):
    name: str
    inputs: InputPayload
```
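Any object with this shape works as a test case. As an illustration, here is a sketch of a data provider that builds TestInput dicts from a local JSON file; the file name and its structure are assumptions for this example, not part of the SDK.

```python
# Illustrative sketch: build TestInput-shaped dicts from a local JSON file.
# "test_cases.json" and its field names are assumptions, not part of the SDK.
import json

def load_local_test_cases():
    with open("test_cases.json") as f:
        raw_cases = json.load(f)  # e.g. [{"name": "...", "inputs": {...}}, ...]
    return [
        {"name": case.get("name", "unnamed case"), "inputs": case["inputs"]}
        for case in raw_cases
    ]

# Inside an @experiment function:
#     results = await eval_dataset(data=load_local_test_cases, interaction=...)
```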
Return Value
Returns a sequence of results from the interaction function. Failed test cases (due to validation errors) will have None values in the corresponding positions.
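Because results are positional, you can pair them with the test cases you supplied to see exactly which cases failed. A minimal sketch, run inside an @experiment function and reusing traced_process_ai_request from the basic usage example:

```python
# Minimal sketch: results line up positionally with the supplied test cases,
# so zip() shows which cases failed (None) and which succeeded.
inline_cases = [
    {"name": "Basic greeting test", "inputs": {"prompt": "Say hello"}},
    {"name": "Complex reasoning test", "inputs": {"prompt": "Explain quantum computing"}},
]

results = await eval_dataset(
    data=lambda: inline_cases,
    interaction=traced_process_ai_request,  # from the basic usage example
)

for case, result in zip(inline_cases, results):
    print(f"{case['name']}: {'ok' if result is not None else 'failed'}")
```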
Advanced usage
With schema validation
Schema validation ensures that your test cases have the correct structure and data types before being passed to your interaction function.
Use Zod for TypeScript or Pydantic for Python to define your input schemas.
- TypeScript
- Python
```typescript
import { z } from 'zod';
import { experiment, evalDataset, testCases } from 'gentrace';
import { OpenAI } from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Define input schema
const InputSchema = z.object({
  prompt: z.string(),
  temperature: z.number().optional().default(0.7),
  maxTokens: z.number().optional().default(100)
});

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    schema: InputSchema,
    interaction: async ({ prompt, temperature, maxTokens }) => {
      // Inputs are now typed and validated
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        temperature,
        max_tokens: maxTokens,
      });
      return response.choices[0].message.content;
    },
  });
});
```
```python
import os

from openai import AsyncOpenAI
from pydantic import BaseModel

from gentrace import eval_dataset, experiment, test_cases

openai = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

class QueryInputs(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 100

@experiment(pipeline_id=PIPELINE_ID)
async def schema_validation_example() -> None:
    async def fetch_test_cases():
        test_case_list = await test_cases.list(dataset_id=DATASET_ID)
        return test_case_list.data

    async def ai_interaction(inputs):
        # Inputs are validated against the QueryInputs schema
        response = await openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": inputs["prompt"]}],
            temperature=inputs.get("temperature", 0.7),
            max_tokens=inputs.get("max_tokens", 100),
        )
        return response.choices[0].message.content

    results = await eval_dataset(
        data=fetch_test_cases,
        schema=QueryInputs,
        interaction=ai_interaction,
    )
    print(f"Processed {len(results)} test cases")
```
Custom data providers
You can provide test cases from any source by implementing a custom data provider function. Each data point must conform to the TestInput structure described above.
This is useful whether you fetch test cases from Gentrace via the test cases API or define them inline.
- TypeScript
- Python
```typescript
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      {
        name: "Basic greeting test",
        id: "test-1",
        inputs: { prompt: "Say hello", temperature: 0.5 }
      },
      {
        name: "Complex reasoning test",
        id: "test-2",
        inputs: { prompt: "Explain quantum computing", temperature: 0.8 }
      }
    ],
    interaction: async (inputs) => {
      return await myAIFunction(inputs);
    },
  });
});
```
```python
@experiment(pipeline_id=PIPELINE_ID)
async def custom_data_example() -> None:
    async def ai_interaction(inputs):
        return await my_ai_function(inputs)

    results = await eval_dataset(
        data=lambda: [
            {
                "name": "Basic greeting test",
                "inputs": {"prompt": "Say hello", "temperature": 0.5},
            },
            {
                "name": "Complex reasoning test",
                "inputs": {"prompt": "Explain quantum computing", "temperature": 0.8},
            },
        ],
        interaction=ai_interaction,
    )
```
Error handling
The dataset evaluation functions handle errors gracefully and automatically associate all errors and exceptions with OpenTelemetry spans. When validation or interaction errors occur, they are captured as span events and attributes.
- TypeScript
- Python
```typescript
experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: () => [
      { inputs: { prompt: "Valid input" } },
      { inputs: { invalid: "data" } }, // This might fail validation
      { inputs: { prompt: "Another valid input" } }
    ],
    schema: z.object({ prompt: z.string() }),
    interaction: async (inputs) => {
      // This function might throw errors
      if (inputs.prompt.includes('error')) {
        throw new Error('Simulated error');
      }
      return await myAIFunction(inputs);
    },
  });
  // evalDataset continues processing other test cases even if some fail
  // All errors are automatically captured in OpenTelemetry spans
});
```
```python
@experiment(pipeline_id=PIPELINE_ID)
async def error_handling_example() -> None:
    def get_test_cases_with_errors():
        return [
            {"inputs": {"prompt": "Valid input"}},
            {"inputs": {"invalid": "data"}},  # This might fail validation
            {"inputs": {"prompt": "Another valid input"}},
        ]

    async def ai_interaction_with_errors(inputs):
        # This function might raise exceptions
        if "error" in inputs.get("prompt", ""):
            raise ValueError("Simulated error")
        return await my_ai_function(inputs)

    results = await eval_dataset(
        data=get_test_cases_with_errors,
        schema=QueryInputs,
        interaction=ai_interaction_with_errors,
    )

    # results will contain None for failed test cases
    # All exceptions are automatically captured in OpenTelemetry spans
    successful_results = [r for r in results if r is not None]
    print(f"Successfully processed {len(successful_results)} out of {len(results)} test cases")
```
OTEL span error integration
When errors occur during dataset evaluation:
- Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
- Individual Spans: Each test case gets its own span, so errors are isolated to specific test cases
- Continued Processing: Failed test cases don't stop the evaluation of other test cases
- Error Attributes: Error messages, types, and metadata are recorded as span attributes
- Span Status: Individual test case spans are marked with ERROR status when exceptions occur
- Error Handling: See OpenTelemetry error recording and exception recording for more details
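To inspect these error spans locally during development, one option is a plain console exporter. The sketch below uses the standard opentelemetry-sdk package rather than Gentrace-specific configuration, and may be superseded by whatever the OpenTelemetry setup guide configures for you:

```python
# Local-debugging sketch (assumes the opentelemetry-sdk package): prints every
# finished span, including ERROR-status spans and their exception events, to stdout.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)  # call once, before running the experiment
```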
Requirements
- OpenTelemetry Setup: Dataset evaluation requires OpenTelemetry to be configured for tracing (see OpenTelemetry setup guide)
- Experiment Context: Must be called within an experiment() function
- Valid Data Provider: The data function must return an array of test cases
- API Key: Gentrace API key must be configured via init()
Related functions
- init() - Initialize the Gentrace SDK
- interaction() - Instrument AI functions for tracing within experiments
- experiment() - Create experiment context for dataset evaluations
- eval() / evalOnce() - Run individual test cases within an experiment
- traced() - Alternative approach for tracing functions