The experiment() function creates a testing context for grouping related evaluations and tests. It manages the lifecycle of a Gentrace experiment, automatically starting and finishing the experiment while providing context for evaluation functions like eval() / evalOnce() and evalDataset() / eval_dataset().
Important: The experiment() function is designed to work with evaluation functions. To use experiments effectively, you should also review the documentation for those evaluation functions.

Basic usage

import { init, experiment, evalOnce } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Basic experiment usage
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-test', async () => {
    // Your test logic here
    return 'test result';
  });
});

Overview

An experiment in Gentrace represents a collection of test cases or evaluations run against your AI pipeline. The experiment() function:
  1. Creates an experiment run in Gentrace with a unique experiment ID
  2. Provides context for evaluation functions to associate their results with the experiment
  3. Manages lifecycle by automatically starting and finishing the experiment
  4. Captures metadata and organizes test results for analysis

Parameters

/**
 * @param {string} pipelineId - The UUID of the Gentrace pipeline to
 *   associate with this experiment
 * @param {() => T | Promise<T>} callback - The function containing
 *   your experiment logic and test cases
 * @param {ExperimentOptions} [options] - Additional configuration options
 * @param {Record<string, any>} [options.metadata] - Custom metadata
 *   to associate with the experiment run
 * @returns {Promise<T>} A promise that resolves with the callback's
 *   return value once the experiment has finished
 */
function experiment<T>(
  pipelineId: string,
  callback: () => T | Promise<T>,
  options?: ExperimentOptions
): Promise<T>
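
Because experiment() resolves with the callback's return value (the generic T in the signature above), results can be passed back to the caller. A minimal sketch; the summary object returned here is illustrative, not a Gentrace type:

import { experiment, evalOnce } from 'gentrace';

// experiment() resolves with whatever the callback returns
const summary = await experiment(PIPELINE_ID, async () => {
  await evalOnce('smoke-test', async () => 'ok');
  return { testsRun: 1 }; // illustrative return value
});

console.log('Tests run:', summary.testsRun);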

Advanced usage

With metadata

import { experiment, evalOnce } from 'gentrace';

experiment(
  PIPELINE_ID,
  async () => {
    await evalOnce('model-comparison-test', async () => {
      // Test logic comparing different models
      return { accuracy: 0.95, latency: 120 };
    });
  },
  {
    metadata: {
      model: 'o3',
      version: '1.2.0',
      environment: 'staging'
    }
  }
);

Multiple test cases

import { experiment, evalOnce, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // Individual test cases
  await evalOnce('accuracy-test', async () => {
    const result = await myAIFunction('test input');
    return { accuracy: calculateAccuracy(result) };
  });

  await evalOnce('latency-test', async () => {
    const start = Date.now();
    await myAIFunction('test input');
    const latency = Date.now() - start;
    return { latency };
  });

  // Dataset evaluation
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      return testCasesList.data;
    },
    interaction: myAIFunction,
  });
});
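
The data function can also transform fetched cases before returning them. A hedged sketch that narrows a dataset to a subset; it assumes fetched test cases expose a name field, and the name-based filter is an illustrative convention, not a Gentrace feature:

import { experiment, evalDataset, testCases } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  await evalDataset({
    data: async () => {
      const DATASET_ID = process.env.GENTRACE_DATASET_ID!;
      const testCasesList = await testCases.list({ datasetId: DATASET_ID });
      // Run only cases tagged for regression (hypothetical naming convention)
      return testCasesList.data.filter((tc) => tc.name.includes('regression'));
    },
    interaction: myAIFunction,
  });
});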

Context and lifecycle

The experiment() function manages the experiment lifecycle automatically:
  1. Start: Creates a new experiment run in Gentrace
  2. Context: Provides experiment context to nested evaluation functions
  3. Execution: Runs your experiment callback
  4. Finish: Marks the experiment as complete in Gentrace with status updates

Accessing experiment context

import { getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  console.log('Experiment ID:', context?.experimentId);
  console.log('Pipeline ID:', context?.pipelineId);

  // Your test logic here
});
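
The context fields shown above can also tag your own logs so they correlate with the experiment run in Gentrace. A small sketch using plain console logging:

import { experiment, getCurrentExperimentContext } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const context = getCurrentExperimentContext();
  const tag = `[experiment ${context?.experimentId ?? 'unknown'}]`;

  // Prefix log lines with the experiment ID for later correlation
  console.log(`${tag} starting evaluations`);
});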

Error handling

The experiment function handles errors gracefully and automatically associates all errors and exceptions with the OpenTelemetry span. When an error occurs within an experiment or evaluation, it is captured as span events and attributes, providing full traceability in your observability stack.
experiment(PIPELINE_ID, async () => {
  try {
    await evalOnce('test-that-might-fail', async () => {
      // Test logic that might throw an error
      // This error will be automatically captured in the OTEL span
      throw new Error('Test failed');
    });
  } catch (error) {
    console.log('Test failed as expected:', error.message);
    // The error is already recorded in the span with full stack trace
  }
  
  // Experiment will still finish properly
  // All error information is preserved in the OpenTelemetry trace
});

OTEL span error integration

When errors occur within experiments:
  • Automatic Capture: All Error objects (TypeScript) and exceptions (Python) are automatically captured as span events
  • Stack Traces: Full stack traces are preserved in the span attributes for debugging
  • Error Attributes: Error messages, types, and metadata are recorded as span attributes
  • Span Status: The span status is automatically set to ERROR when unhandled exceptions occur
This integration ensures that failed experiments and evaluations are fully observable and debuggable through your OpenTelemetry-compatible monitoring tools.
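
In practice this means a failing evaluation does not abort the rest of the run. A short sketch restating the pattern from the example above:

experiment(PIPELINE_ID, async () => {
  await evalOnce('passing-test', async () => 'ok');

  try {
    await evalOnce('failing-test', async () => {
      throw new Error('boom'); // recorded on the span as an event
    });
  } catch {
    // Already captured in the OTEL span with a full stack trace
  }

  await evalOnce('still-runs', async () => 'ok'); // the experiment continues
});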

Best practices

1. Use descriptive names and metadata

// Good: Descriptive metadata
experiment(
  PIPELINE_ID,
  async () => {
    // tests...
  },
  {
    metadata: {
      model: 'o3',
      prompt_version: 'v2.1',
      test_suite: 'regression',
      branch: 'feature/new-prompts',
    },
  },
);
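
In CI, metadata can be derived from the environment so each run is traceable to a commit. The variable names below are illustrative, not Gentrace conventions:

// Sketch: derive experiment metadata from CI environment variables
experiment(
  PIPELINE_ID,
  async () => {
    // tests...
  },
  {
    metadata: {
      branch: process.env.GIT_BRANCH ?? 'local',
      commit: process.env.GIT_COMMIT ?? 'uncommitted',
      test_suite: 'regression',
    },
  },
);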

2. Group related evaluations

Organize related test cases within a single experiment:
experiment(PIPELINE_ID, async () => {
  // All accuracy-related tests
  await evalOnce('accuracy-basic', async () => {
    /* ... */
  });
  await evalOnce('accuracy-edge-cases', async () => {
    /* ... */
  });

  // All performance-related tests
  await evalOnce('latency-test', async () => {
    /* ... */
  });
  await evalOnce('throughput-test', async () => {
    /* ... */
  });
});

3. Handle async operations properly

// Ensure all async operations are awaited
experiment(PIPELINE_ID, async () => {
  await evalOnce('test-1', async () => { /* ... */ });
  await evalOnce('test-2', async () => { /* ... */ });
  
  // Run tests in parallel if they're independent
  await Promise.all([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);
});
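
When parallel evaluations can fail independently, Promise.allSettled (standard JavaScript, not Gentrace-specific) lets the remaining evaluations finish instead of rejecting at the first failure:

experiment(PIPELINE_ID, async () => {
  // allSettled waits for every evaluation, even if some reject
  const results = await Promise.allSettled([
    evalOnce('parallel-test-1', async () => { /* ... */ }),
    evalOnce('parallel-test-2', async () => { /* ... */ }),
  ]);

  const failed = results.filter((r) => r.status === 'rejected');
  console.log(`${failed.length} evaluation(s) failed`);
});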

Requirements

  • Gentrace SDK Initialization: Must call init() with a valid API key. The SDK automatically configures OpenTelemetry for you. For custom OpenTelemetry setups, see the manual setup guide
  • Valid Pipeline ID: Must provide a valid UUID for an existing Gentrace pipeline
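
A minimal setup sketch that fails fast when the required environment variables are missing (init() and the variable names come from the examples above; the guard itself is illustrative):

import { init } from 'gentrace';

const apiKey = process.env.GENTRACE_API_KEY;
const pipelineId = process.env.GENTRACE_PIPELINE_ID;

if (!apiKey || !pipelineId) {
  throw new Error('Set GENTRACE_API_KEY and GENTRACE_PIPELINE_ID before running experiments');
}

init({ apiKey });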