experiment()
The `experiment()` function creates a testing context for grouping related evaluations and tests. It manages the lifecycle of a Gentrace experiment, automatically starting and finishing the experiment while providing context for evaluation functions like `eval()` and `evalDataset()`.
Overview
An experiment in Gentrace represents a collection of test cases or evaluations run against your AI pipeline. The `experiment()` function:
- Creates an experiment run in Gentrace with a unique experiment ID
- Provides context for evaluation functions to associate their results with the experiment
- Manages lifecycle by automatically starting and finishing the experiment
- Captures metadata and organizes test results for analysis
Basic Usage
- TypeScript
- Python
```typescript
import { init, experiment, evalOnce } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Basic experiment usage
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-test', async () => {
    // Your test logic here
    return 'test result';
  });
});
```
```python
import os
import asyncio

from gentrace import init, experiment, eval

init(api_key=os.environ["GENTRACE_API_KEY"])

PIPELINE_ID = os.environ["GENTRACE_PIPELINE_ID"]

@experiment(pipeline_id=PIPELINE_ID)
async def my_experiment() -> None:
    @eval(name="simple-test")
    async def simple_test() -> str:
        # Your test logic here
        return "test result"

    await simple_test()

# Run the experiment
asyncio.run(my_experiment())
```
Parameters
- TypeScript
- Python
Function Signature
```typescript
function experiment<T>(
  pipelineId: string,
  callback: () => T | Promise<T>,
  options?: ExperimentOptions
): Promise<T>
```
Parameters
- `pipelineId` (string, required): The UUID of the Gentrace pipeline to associate with this experiment
- `callback` (function, required): The function containing your experiment logic and test cases
- `options` (ExperimentOptions, optional): Additional configuration options
ExperimentOptions
```typescript
type ExperimentOptions = {
  metadata?: Record<string, any>;
};
```
- `metadata` (object, optional): Custom metadata to associate with the experiment run
Decorator Signature
```python
def experiment(
    *,
    pipeline_id: str,
    options: Optional[ExperimentOptions] = None,
) -> Callable[[Callable[P, Any]], Callable[P, Coroutine[Any, Any, None]]]
```
Parameters
- `pipeline_id` (str, required): The UUID of the Gentrace pipeline to associate with this experiment
- `options` (ExperimentOptions, optional): Additional configuration options
ExperimentOptions
```python
class ExperimentOptions(TypedDict, total=False):
    name: Optional[str]
    metadata: Optional[Dict[str, Any]]
```
- `name` (str, optional): A descriptive name for the experiment run
- `metadata` (dict, optional): Custom metadata to associate with the experiment run
Advanced Usage
With Metadata
- TypeScript
- Python
```typescript
import { experiment, evalOnce } from 'gentrace';

experiment(
  PIPELINE_ID,
  async () => {
    await evalOnce('model-comparison-test', async () => {
      // Test logic comparing different models
      return { accuracy: 0.95, latency: 120 };
    });
  },
  {
    metadata: {
      model: 'gpt-4o',
      temperature: 0.7,
      version: '1.2.0',
      environment: 'staging',
    },
  }
);
```
```python
@experiment(
    pipeline_id=PIPELINE_ID,
    options={
        "name": "Model Comparison Experiment",
        "metadata": {
            "model": "gpt-4o",
            "temperature": 0.7,
            "version": "1.2.0",
            "environment": "staging",
        },
    },
)
async def model_comparison_experiment() -> None:
    @eval(name="model-comparison-test")
    async def model_comparison_test() -> dict:
        # Test logic comparing different models
        return {"accuracy": 0.95, "latency": 120}

    await model_comparison_test()
```
Multiple Test Cases
- TypeScript
- Python
```typescript
import { experiment, evalOnce, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // Individual test cases
  await evalOnce('accuracy-test', async () => {
    const result = await myAIFunction('test input');
    return { accuracy: calculateAccuracy(result) };
  });

  await evalOnce('latency-test', async () => {
    const start = Date.now();
    await myAIFunction('test input');
    const latency = Date.now() - start;
    return { latency };
  });

  // Dataset evaluation
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: myAIFunction,
  });
});
```
```python
import time

from gentrace import eval, eval_dataset, experiment, test_cases

@experiment(pipeline_id=PIPELINE_ID)
async def comprehensive_experiment() -> None:
    @eval(name="accuracy-test")
    async def accuracy_test() -> dict:
        result = await my_ai_function("test input")
        return {"accuracy": calculate_accuracy(result)}

    @eval(name="latency-test")
    async def latency_test() -> dict:
        start = time.time()
        await my_ai_function("test input")
        latency = time.time() - start
        return {"latency": latency}

    # Run individual tests
    await accuracy_test()
    await latency_test()

    # Run dataset evaluation
    await eval_dataset(
        data=lambda: test_cases.list(dataset_id=DATASET_ID).data,
        interaction=my_ai_function,
    )
```
Context and Lifecycle
The `experiment()` function manages the experiment lifecycle automatically:
- Start: Creates a new experiment run in Gentrace
- Context: Provides experiment context to nested evaluation functions
- Execution: Runs your experiment callback/function
- Finish: Marks the experiment as complete in Gentrace
Accessing Experiment Context
- TypeScript
- Python
```typescript
import { getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  console.log('Experiment ID:', context?.experimentId);
  console.log('Pipeline ID:', context?.pipelineId);

  // Your test logic here
});
```
```python
from gentrace import get_current_experiment_context

@experiment(pipeline_id=PIPELINE_ID)
async def my_experiment() -> None:
    context = get_current_experiment_context()
    print(f"Experiment ID: {context['experiment_id']}")
    print(f"Pipeline ID: {context['pipeline_id']}")

    # Your test logic here
```
Error Handling
The experiment function handles errors gracefully and automatically associates all errors and exceptions with the OpenTelemetry span. When an error occurs within an experiment or evaluation, it is captured as a span event with error attributes, providing full traceability in your observability stack.
- TypeScript
- Python
```typescript
experiment(PIPELINE_ID, async () => {
  try {
    await evalOnce('test-that-might-fail', async () => {
      // Test logic that might throw an error
      // This error will be automatically captured in the OTEL span
      throw new Error('Test failed');
    });
  } catch (error) {
    console.log('Test failed as expected:', error.message);
    // The error is already recorded in the span with full stack trace
  }

  // Experiment will still finish properly
  // All error information is preserved in the OpenTelemetry trace
});
```
```python
@experiment(pipeline_id=PIPELINE_ID)
async def error_handling_experiment() -> None:
    @eval(name="test-that-might-fail")
    async def failing_test() -> None:
        # Test logic that might raise an error
        # This exception will be automatically captured in the OTEL span
        raise ValueError("Test failed")

    try:
        await failing_test()
    except ValueError as e:
        print(f"Test failed as expected: {e}")
        # The exception is already recorded in the span with full stack trace

    # Experiment will still finish properly
    # All error information is preserved in the OpenTelemetry trace
```
OTEL Span Error Integration
When errors occur within experiments:
- Automatic Capture: All `Error` objects (TypeScript) and exceptions (Python) are automatically captured as span events
- Stack Traces: Full stack traces are preserved in the span attributes for debugging
- Error Attributes: Error messages, types, and metadata are recorded as span attributes
- Span Status: The span status is automatically set to `ERROR` when unhandled exceptions occur
This integration ensures that failed experiments and evaluations are fully observable and debuggable through your OpenTelemetry-compatible monitoring tools.
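For reference, the sketch below uses the plain `@opentelemetry/api` package, not Gentrace internals, to illustrate what "captured as a span event" and "span status set to `ERROR`" mean in standard OpenTelemetry terms; the exact attributes Gentrace records may differ.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Illustrative only: standard OpenTelemetry error-recording semantics,
// shown with a manually managed span rather than Gentrace's own instrumentation.
const tracer = trace.getTracer('example');

tracer.startActiveSpan('test-that-might-fail', (span) => {
  try {
    throw new Error('Test failed');
  } catch (err) {
    // The exception becomes a span event ("exception") carrying type,
    // message, and stack trace attributes
    span.recordException(err as Error);
    // The span status is marked as ERROR
    span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
  } finally {
    span.end();
  }
});
```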
Best Practices
1. Use Descriptive Names and Metadata
```typescript
// Good: Descriptive metadata
experiment(PIPELINE_ID, async () => {
  // tests...
}, {
  metadata: {
    model: 'gpt-4o',
    prompt_version: 'v2.1',
    test_suite: 'regression',
    branch: 'feature/new-prompts',
  },
});
```
2. Group Related Tests
Organize related test cases within a single experiment:
```typescript
experiment(PIPELINE_ID, async () => {
  // All accuracy-related tests
  await evalOnce('accuracy-basic', async () => { /* ... */ });
  await evalOnce('accuracy-edge-cases', async () => { /* ... */ });

  // All performance-related tests
  await evalOnce('latency-test', async () => { /* ... */ });
  await evalOnce('throughput-test', async () => { /* ... */ });
});
```
3. Handle Async Operations Properly
- TypeScript
- Python
```typescript
// Ensure all async operations are awaited
experiment(PIPELINE_ID, async () => {
  await evalOnce('test-1', async () => { /* ... */ });
  await evalOnce('test-2', async () => { /* ... */ });

  // Run tests in parallel if they're independent
  await Promise.all([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);
});
```
```python
import asyncio

@experiment(pipeline_id=PIPELINE_ID)
async def async_experiment() -> None:
    @eval(name="test-1")
    async def test_1() -> None:
        # Test logic here
        pass

    @eval(name="test-2")
    async def test_2() -> None:
        # Test logic here
        pass

    @eval(name="parallel-test-1")
    async def parallel_test_1() -> None:
        # Test logic here
        pass

    @eval(name="parallel-test-2")
    async def parallel_test_2() -> None:
        # Test logic here
        pass

    # Ensure all async operations are awaited
    await test_1()
    await test_2()

    # Run tests in parallel if they're independent
    await asyncio.gather(
        parallel_test_1(),
        parallel_test_2(),
    )
```
Requirements
- OpenTelemetry Setup: The `experiment()` function requires OpenTelemetry to be configured for tracing (see the sketch after this list)
- Valid Pipeline ID: Must provide a valid UUID for an existing Gentrace pipeline
- API Key: Gentrace API key must be configured via `init()`
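If your application does not already register an OpenTelemetry tracer provider, a minimal Node setup looks roughly like the sketch below. This is a generic OpenTelemetry example, not a Gentrace-specific configuration; check whether your `init()` setup already registers a provider, and which exporter endpoint and headers Gentrace expects, before relying on it.

```typescript
// Generic OpenTelemetry Node setup sketch (assumption: you are wiring OTEL yourself).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // Placeholder exporter configuration: use the endpoint and authentication
  // specified by your Gentrace setup documentation.
  traceExporter: new OTLPTraceExporter({
    url: 'https://example.com/otel/v1/traces', // placeholder endpoint
  }),
});

// Registers the tracer provider so experiment() spans have somewhere to go
sdk.start();
```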
Related Functions
- `init()` - Initialize the Gentrace SDK
- `interaction()` - Instrument AI functions for tracing within experiments
- `evalDataset()` / `eval_dataset()` - Run tests against a dataset within an experiment
- `evalOnce()` / `eval()` - Run individual test cases within an experiment