The experiment() function creates a testing context for grouping related evaluations and tests. It manages the lifecycle of a Gentrace experiment, automatically starting and finishing the experiment while providing context for evaluation functions like eval() / evalOnce() and evalDataset() / eval_dataset().
Important: The experiment() function is designed to work with evaluation functions. To use experiments effectively, you should also review the documentation for those evaluation functions.

Basic usage

import { init, experiment, evalOnce } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Basic experiment usage
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-test', async () => {
    // Your test logic here
    return 'test result';
  });
});

Overview

An experiment in Gentrace represents a collection of test cases or evaluations run against your AI pipeline. The experiment() function:
  1. Creates an experiment run in Gentrace with a unique experiment ID
  2. Provides context for evaluation functions to associate their results with the experiment
  3. Manages lifecycle by automatically starting and finishing the experiment
  4. Captures metadata and organizes test results for analysis

Parameters

/**
 * @param {string} pipelineId - The UUID of the Gentrace pipeline to
 *   associate with this experiment
 * @param {() => T | Promise<T>} callback - The function containing
 *   your experiment logic and test cases
 * @param {ExperimentOptions} [options] - Additional configuration options
 * @param {Record<string, any>} [options.metadata] - Custom metadata
 *   to associate with the experiment run
 * @returns {Promise<T>} A promise that resolves with the callback's
 *   return value once the experiment has finished
 */
function experiment<T>(
  pipelineId: string,
  callback: () => T | Promise<T>,
  options?: ExperimentOptions
): Promise<T>
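
Because experiment() resolves with the callback's return value (the generic T in the signature above), results can be passed back to the caller. A minimal sketch; the summary object returned here is illustrative, not a Gentrace type:

import { experiment, evalOnce } from 'gentrace';

// experiment() resolves with whatever the callback returns
const summary = await experiment(PIPELINE_ID, async () => {
  await evalOnce('smoke-test', async () => 'ok');
  return { testsRun: 1 }; // illustrative return value
});

console.log('Tests run:', summary.testsRun);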

Advanced usage

With metadata

import { experiment, evalOnce } from 'gentrace';

experiment(
  PIPELINE_ID,
  async () => {
    await evalOnce('model-comparison-test', async () => {
      // Test logic comparing different models
      return { accuracy: 0.95, latency: 120 };
    });
  },
  {
    metadata: {
      model: 'o3',
      version: '1.2.0',
      environment: 'staging'
    }
  }
);

Multiple test cases

import { experiment, evalOnce, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // Individual test cases
  await evalOnce('accuracy-test', async () => {
    const result = await myAIFunction('test input');
    return { accuracy: calculateAccuracy(result) };
  });

  await evalOnce('latency-test', async () => {
    const start = Date.now();
    await myAIFunction('test input');
    const latency = Date.now() - start;
    return { latency };
  });

  // Dataset evaluation
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: myAIFunction,
  });
});
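
The data function can also transform fetched cases before returning them. A hedged sketch that narrows a dataset to a subset; it assumes fetched test cases expose a name field, and the name-based filter is an illustrative convention, not a Gentrace feature:

import { experiment, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      // Run only cases tagged for regression (hypothetical naming convention)
      return testCasesList.data.filter((tc) => tc.name.includes('regression'));
    },
    interaction: myAIFunction,
  });
});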

Context and lifecycle

The experiment() function manages the experiment lifecycle automatically:
  1. Start: Creates a new experiment run in Gentrace
  2. Context: Provides experiment context to nested evaluation functions
  3. Execution: Runs your experiment callback
  4. Finish: Marks the experiment as complete in Gentrace with status updates

Accessing experiment context

import { getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  console.log('Experiment ID:', context?.experimentId);
  console.log('Pipeline ID:', context?.pipelineId);

  // Your test logic here
});
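
The context fields shown above can also tag your own logs so they correlate with the experiment run in Gentrace. A small sketch using plain console logging:

import { experiment, getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  const tag = `[experiment ${context?.experimentId ?? 'unknown'}]`;

  // Prefix log lines with the experiment ID for later correlation
  console.log(`${tag} starting evaluations`);
});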

Error handling

The experiment function handles errors gracefully and automatically associates all errors and exceptions with the OpenTelemetry span. When an error occurs within an experiment or evaluation, it is captured as span events and attributes, providing full traceability in your observability stack.
experiment(PIPELINE_ID, async () => {
  try {
    await evalOnce('test-that-might-fail', async () => {
      // Test logic that might throw an error
      // This error will be automatically captured in the OTEL span
      throw new Error('Test failed');
    });
  } catch (error) {
    console.log('Test failed as expected:', error.message);
    // The error is already recorded in the span with full stack trace
  }
  
  // Experiment will still finish properly
  // All error information is preserved in the OpenTelemetry trace
});

OTEL span error integration

When errors occur within experiments:
  • Automatic Capture: All Error objects (TypeScript) and exceptions (Python) are automatically captured as span events
  • Stack Traces: Full stack traces are preserved in the span attributes for debugging
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: The span status is automatically set to ERROR when unhandled exceptions occur
This integration ensures that failed experiments and evaluations are fully observable and debuggable through your OpenTelemetry-compatible monitoring tools.
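
In practice this means a failing evaluation does not abort the rest of the run. A short sketch restating the pattern from the example above:

experiment(PIPELINE_ID, async () => {
  await evalOnce('passing-test', async () => 'ok');

  try {
    await evalOnce('failing-test', async () => {
      throw new Error('boom'); // recorded on the span as an event
    });
  } catch {
    // Already captured in the OTEL span with a full stack trace
  }

  await evalOnce('still-runs', async () => 'ok'); // the experiment continues
});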

Best practices

1. Use descriptive names and metadata

// Good: Descriptive metadata
experiment(
  PIPELINE_ID,
  async () => {
    // tests...
  },
  {
    metadata: {
      model: 'o3',
      prompt_version: 'v2.1',
      test_suite: 'regression',
      branch: 'feature/new-prompts',
    },
  },
);
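
In CI, metadata can be derived from the environment so each run is traceable to a commit. The variable names below are illustrative, not Gentrace conventions:

// Sketch: derive experiment metadata from CI environment variables
experiment(
  PIPELINE_ID,
  async () => {
    // tests...
  },
  {
    metadata: {
      branch: process.env.GIT_BRANCH ?? 'local',
      commit: process.env.GIT_COMMIT ?? 'uncommitted',
      test_suite: 'regression',
    },
  },
);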

2. Group related evaluations

Organize related test cases within a single experiment:
experiment(PIPELINE_ID, async () => {
  // All accuracy-related tests
  await evalOnce('accuracy-basic', async () => {
    /* ... */
  });
  await evalOnce('accuracy-edge-cases', async () => {
    /* ... */
  });

  // All performance-related tests
  await evalOnce('latency-test', async () => {
    /* ... */
  });
  await evalOnce('throughput-test', async () => {
    /* ... */
  });
});

3. Handle async operations properly

// Ensure all async operations are awaited
experiment(PIPELINE_ID, async () => {
  await evalOnce('test-1', async () => { /* ... */ });
  await evalOnce('test-2', async () => { /* ... */ });
  
  // Run tests in parallel if they're independent
  await Promise.all([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);
});
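
When parallel evaluations can fail independently, Promise.allSettled (standard JavaScript, not Gentrace-specific) lets the remaining evaluations finish instead of rejecting at the first failure:

experiment(PIPELINE_ID, async () => {
  // allSettled waits for every evaluation, even if some reject
  const results = await Promise.allSettled([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);

  const failed = results.filter((r) => r.status === 'rejected');
  console.log(`${failed.length} evaluation(s) failed`);
});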

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Valid Pipeline ID: Must provide a valid UUID for an existing Gentrace pipeline
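
A minimal setup sketch that fails fast when the required environment variables are missing (init() and the variable names come from the examples above; the guard itself is illustrative):

import { init } from 'gentrace';

const apiKey = process.env.GENTRACE_API_KEY;
const pipelineId = process.env.GENTRACE_PIPELINE_ID;

if (!apiKey || !pipelineId) {
  throw new Error('Set GENTRACE_API_KEY and GENTRACE_PIPELINE_ID before running experiments');
}

init({ apiKey });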