experiment()
The `experiment()` function creates a testing context for grouping related evaluations and tests. It manages the lifecycle of a Gentrace experiment, automatically starting and finishing the experiment while providing context for evaluation functions like `eval()` and `evalDataset()`.
Overview
An experiment in Gentrace represents a collection of test cases or evaluations run against your AI pipeline. The `experiment()` function:
- Creates an experiment run in Gentrace with a unique experiment ID
- Provides context for evaluation functions to associate their results with the experiment
- Manages lifecycle by automatically starting and finishing the experiment
- Captures metadata and organizes test results for analysis
Basic Usage
- TypeScript
- Python
```typescript
import { init, experiment, evalOnce } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Basic experiment usage
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-test', async () => {
    // Your test logic here
    return 'test result';
  });
});
```
```python
import os
import asyncio

from gentrace import init, experiment, eval

init(api_key=os.environ["GENTRACE_API_KEY"])

PIPELINE_ID = os.environ["GENTRACE_PIPELINE_ID"]

@experiment(pipeline_id=PIPELINE_ID)
async def my_experiment() -> None:
    @eval(name="simple-test")
    async def simple_test() -> str:
        # Your test logic here
        return "test result"

    await simple_test()

# Run the experiment
asyncio.run(my_experiment())
```
Parameters
- TypeScript
- Python
Function Signature
```typescript
function experiment<T>(
  pipelineId: string,
  callback: () => T | Promise<T>,
  options?: ExperimentOptions
): Promise<T>
```
Parameters
- `pipelineId` (string, required): The UUID of the Gentrace pipeline to associate with this experiment
- `callback` (function, required): The function containing your experiment logic and test cases
- `options` (ExperimentOptions, optional): Additional configuration options
ExperimentOptions
```typescript
type ExperimentOptions = {
  metadata?: Record<string, any>;
};
```
- `metadata` (object, optional): Custom metadata to associate with the experiment run
Decorator Signature
```python
def experiment(
    *,
    pipeline_id: str,
    options: Optional[ExperimentOptions] = None,
) -> Callable[[Callable[P, Any]], Callable[P, Coroutine[Any, Any, None]]]
```
Parameters
- `pipeline_id` (str, required): The UUID of the Gentrace pipeline to associate with this experiment
- `options` (ExperimentOptions, optional): Additional configuration options
ExperimentOptions
```python
class ExperimentOptions(TypedDict, total=False):
    name: Optional[str]
    metadata: Optional[Dict[str, Any]]
```
- `name` (str, optional): A descriptive name for the experiment run
- `metadata` (dict, optional): Custom metadata to associate with the experiment run
Advanced Usage
With Metadata
- TypeScript
- Python
```typescript
import { experiment, evalOnce } from 'gentrace';

experiment(
  PIPELINE_ID,
  async () => {
    await evalOnce('model-comparison-test', async () => {
      // Test logic comparing different models
      return { accuracy: 0.95, latency: 120 };
    });
  },
  {
    metadata: {
      model: 'gpt-4o',
      temperature: 0.7,
      version: '1.2.0',
      environment: 'staging',
    },
  }
);
```
```python
@experiment(
    pipeline_id=PIPELINE_ID,
    options={
        "name": "Model Comparison Experiment",
        "metadata": {
            "model": "gpt-4o",
            "temperature": 0.7,
            "version": "1.2.0",
            "environment": "staging",
        },
    },
)
async def model_comparison_experiment() -> None:
    @eval(name="model-comparison-test")
    async def model_comparison_test() -> dict:
        # Test logic comparing different models
        return {"accuracy": 0.95, "latency": 120}

    await model_comparison_test()
```
Multiple Test Cases
- TypeScript
- Python
```typescript
import { experiment, evalOnce, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // Individual test cases
  await evalOnce('accuracy-test', async () => {
    const result = await myAIFunction('test input');
    return { accuracy: calculateAccuracy(result) };
  });

  await evalOnce('latency-test', async () => {
    const start = Date.now();
    await myAIFunction('test input');
    const latency = Date.now() - start;
    return { latency };
  });

  // Dataset evaluation
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: myAIFunction,
  });
});
```
```python
import time

from gentrace import eval, eval_dataset, experiment, test_cases

@experiment(pipeline_id=PIPELINE_ID)
async def comprehensive_experiment() -> None:
    @eval(name="accuracy-test")
    async def accuracy_test() -> dict:
        result = await my_ai_function("test input")
        return {"accuracy": calculate_accuracy(result)}

    @eval(name="latency-test")
    async def latency_test() -> dict:
        start = time.time()
        await my_ai_function("test input")
        latency = time.time() - start
        return {"latency": latency}

    # Run individual tests
    await accuracy_test()
    await latency_test()

    # Run dataset evaluation
    await eval_dataset(
        data=lambda: test_cases.list(dataset_id=DATASET_ID).data,
        interaction=my_ai_function,
    )
```
Context and Lifecycle
The `experiment()` function manages the experiment lifecycle automatically:
- Start: Creates a new experiment run in Gentrace
- Context: Provides experiment context to nested evaluation functions
- Execution: Runs your experiment callback/function
- Finish: Marks the experiment as complete in Gentrace
Accessing Experiment Context
- TypeScript
- Python
```typescript
import { getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  console.log('Experiment ID:', context?.experimentId);
  console.log('Pipeline ID:', context?.pipelineId);

  // Your test logic here
});
```
```python
from gentrace import get_current_experiment_context

@experiment(pipeline_id=PIPELINE_ID)
async def my_experiment() -> None:
    context = get_current_experiment_context()
    print(f"Experiment ID: {context['experiment_id']}")
    print(f"Pipeline ID: {context['pipeline_id']}")

    # Your test logic here
```
Error Handling
The experiment function handles errors gracefully and automatically associates all errors and exceptions with the OpenTelemetry span. When an error occurs within an experiment or evaluation, it is captured as a span event with error attributes, providing full traceability in your observability stack.
- TypeScript
- Python
```typescript
experiment(PIPELINE_ID, async () => {
  try {
    await evalOnce('test-that-might-fail', async () => {
      // Test logic that might throw an error
      // This error will be automatically captured in the OTEL span
      throw new Error('Test failed');
    });
  } catch (error) {
    console.log('Test failed as expected:', error.message);
    // The error is already recorded in the span with full stack trace
  }

  // Experiment will still finish properly
  // All error information is preserved in the OpenTelemetry trace
});
```
```python
@experiment(pipeline_id=PIPELINE_ID)
async def error_handling_experiment() -> None:
    @eval(name="test-that-might-fail")
    async def failing_test() -> None:
        # Test logic that might raise an error
        # This exception will be automatically captured in the OTEL span
        raise ValueError("Test failed")

    try:
        await failing_test()
    except ValueError as e:
        print(f"Test failed as expected: {e}")
        # The exception is already recorded in the span with full stack trace

    # Experiment will still finish properly
    # All error information is preserved in the OpenTelemetry trace
```
OTEL Span Error Integration
When errors occur within experiments:
- Automatic Capture: All `Error` objects (TypeScript) and exceptions (Python) are automatically captured as span events
- Stack Traces: Full stack traces are preserved in the span attributes for debugging
- Error Attributes: Error messages, types, and metadata are recorded as span attributes
- Span Status: The span status is automatically set to `ERROR` when unhandled exceptions occur
This integration ensures that failed experiments and evaluations are fully observable and debuggable through your OpenTelemetry-compatible monitoring tools.
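For reference, the sketch below uses the plain `@opentelemetry/api` package, not Gentrace internals, to illustrate what "captured as a span event" and "span status set to `ERROR`" mean in standard OpenTelemetry terms; the exact attributes Gentrace records may differ.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Illustrative only: standard OpenTelemetry error-recording semantics,
// shown with a manually managed span rather than Gentrace's own instrumentation.
const tracer = trace.getTracer('example');

tracer.startActiveSpan('test-that-might-fail', (span) => {
  try {
    throw new Error('Test failed');
  } catch (err) {
    // The exception becomes a span event ("exception") carrying type,
    // message, and stack trace attributes
    span.recordException(err as Error);
    // The span status is marked as ERROR
    span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
  } finally {
    span.end();
  }
});
```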
Best Practices
1. Use Descriptive Names and Metadata
```typescript
// Good: Descriptive metadata
experiment(PIPELINE_ID, async () => {
  // tests...
}, {
  metadata: {
    model: 'gpt-4o',
    prompt_version: 'v2.1',
    test_suite: 'regression',
    branch: 'feature/new-prompts',
  },
});
```
2. Group Related Tests
Organize related test cases within a single experiment:
```typescript
experiment(PIPELINE_ID, async () => {
  // All accuracy-related tests
  await evalOnce('accuracy-basic', async () => { /* ... */ });
  await evalOnce('accuracy-edge-cases', async () => { /* ... */ });

  // All performance-related tests
  await evalOnce('latency-test', async () => { /* ... */ });
  await evalOnce('throughput-test', async () => { /* ... */ });
});
```
3. Handle Async Operations Properly
- TypeScript
- Python
```typescript
// Ensure all async operations are awaited
experiment(PIPELINE_ID, async () => {
  await evalOnce('test-1', async () => { /* ... */ });
  await evalOnce('test-2', async () => { /* ... */ });

  // Run tests in parallel if they're independent
  await Promise.all([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);
});
```
```python
import asyncio

@experiment(pipeline_id=PIPELINE_ID)
async def async_experiment() -> None:
    @eval(name="test-1")
    async def test_1() -> None:
        # Test logic here
        pass

    @eval(name="test-2")
    async def test_2() -> None:
        # Test logic here
        pass

    @eval(name="parallel-test-1")
    async def parallel_test_1() -> None:
        # Test logic here
        pass

    @eval(name="parallel-test-2")
    async def parallel_test_2() -> None:
        # Test logic here
        pass

    # Ensure all async operations are awaited
    await test_1()
    await test_2()

    # Run tests in parallel if they're independent
    await asyncio.gather(
        parallel_test_1(),
        parallel_test_2(),
    )
```
Requirements
- OpenTelemetry Setup: The `experiment()` function requires OpenTelemetry to be configured for tracing (see the sketch after this list)
- Valid Pipeline ID: Must provide a valid UUID for an existing Gentrace pipeline
- API Key: Gentrace API key must be configured via `init()`
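If your application does not already register an OpenTelemetry tracer provider, a minimal Node setup looks roughly like the sketch below. This is a generic OpenTelemetry example, not a Gentrace-specific configuration; check whether your `init()` setup already registers a provider, and which exporter endpoint and headers Gentrace expects, before relying on it.

```typescript
// Generic OpenTelemetry Node setup sketch (assumption: you are wiring OTEL yourself).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // Placeholder exporter configuration: use the endpoint and authentication
  // specified by your Gentrace setup documentation.
  traceExporter: new OTLPTraceExporter({
    url: 'https://example.com/otel/v1/traces', // placeholder endpoint
  }),
});

// Registers the tracer provider so experiment() spans have somewhere to go
sdk.start();
```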
Related Functions
- `init()` - Initialize the Gentrace SDK
- `interaction()` - Instrument AI functions for tracing within experiments
- `evalDataset()` / `eval_dataset()` - Run tests against a dataset within an experiment
- `evalOnce()` / `eval()` - Run individual test cases within an experiment