eval() / evalOnce()
The eval() and evalOnce() functions create individual test cases within an experiment(). They capture the execution of specific test logic, automatically creating OpenTelemetry spans with detailed tracing information and associating the results with the parent experiment.
These functions must be called within the context of an experiment() and automatically create an individual test span for each evaluation.
OpenTelemetry must be configured correctly for trace data to appear in Gentrace. See the OpenTelemetry setup guide for configuration details.
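If you have not wired up tracing yet, the following is a minimal sketch of one common pattern: starting the standard OpenTelemetry Node SDK with an OTLP/HTTP exporter before calling any Gentrace functions. The endpoint URL and header shown here are placeholders, not official Gentrace values; follow the OpenTelemetry setup guide for the exact configuration.

```typescript
// Minimal OpenTelemetry bootstrap sketch. The endpoint and header below are
// placeholders -- consult the OpenTelemetry setup guide for the real values.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'https://<your-otlp-endpoint>/v1/traces', // placeholder endpoint
    headers: { authorization: `Bearer ${process.env.GENTRACE_API_KEY}` }, // placeholder header
  }),
});

// Start the SDK before init() / experiment() so evaluation spans are exported.
sdk.start();
```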
Basic usage
- TypeScript
- Python
```typescript
import { init, experiment, evalOnce, interaction } from 'gentrace';

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_ID = process.env.GENTRACE_PIPELINE_ID!;

// Define your AI function with interaction wrapper
const myAIFunction = interaction(
  'my-ai-function',
  async (input: string): Promise<string> => {
    // Your AI logic here - e.g., call to OpenAI, Anthropic, etc.
    return "This is a sample AI response";
  },
  { pipelineId: PIPELINE_ID }
);

// Basic evalOnce usage within an experiment
experiment(PIPELINE_ID, async () => {
  await evalOnce('simple-accuracy-test', async () => {
    const input = 'What is 2 + 2?';
    const result = await myAIFunction(input);
    const expected = '4';

    return {
      input,
      result,
      expected,
      passed: result.includes(expected),
    };
  });
});
```
```python
import asyncio
import os

from gentrace import init, experiment, eval, interaction

init(api_key=os.environ["GENTRACE_API_KEY"])

PIPELINE_ID = os.environ["GENTRACE_PIPELINE_ID"]

# Define your AI function with interaction wrapper
@interaction(name="my-ai-function", pipeline_id=PIPELINE_ID)
async def my_ai_function(input_text: str) -> str:
    # Your AI logic here - e.g., call to OpenAI, Anthropic, etc.
    return "This is a sample AI response"

@experiment(pipeline_id=PIPELINE_ID)
async def my_experiment() -> None:
    @eval(name="simple-accuracy-test")
    async def simple_accuracy_test() -> dict:
        input_text = "What is 2 + 2?"
        result = await my_ai_function(input_text)
        expected = "4"
        return {
            "input": input_text,
            "result": result,
            "expected": expected,
            "passed": expected in result,
        }

    await simple_accuracy_test()

# Run the experiment
asyncio.run(my_experiment())
```
Overview
Individual evaluations in Gentrace represent specific test cases or validation steps within your AI pipeline testing. The eval() and evalOnce() functions:
- Create test case spans in OpenTelemetry with detailed execution tracing
- Capture inputs and outputs for each test span automatically; the return value of eval() or the @eval()-decorated function is recorded as the output for that span (see the sketch after this list)
- Associate with experiments by linking to the parent experiment context
- Handle errors gracefully while preserving full error information in traces
- Support both sync and async test functions seamlessly
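As a minimal sketch of the last two points (hypothetical test names, reusing PIPELINE_ID and myAIFunction from the Basic usage example above): the same evalOnce() call accepts either a synchronous or an asynchronous callback, and whatever the callback returns is recorded as that span's output.

```typescript
import { experiment, evalOnce } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // Synchronous callback: the returned object is recorded as the span output.
  await evalOnce('sync-check', () => {
    return { passed: 1 + 1 === 2 };
  });

  // Asynchronous callback: the awaited result is recorded the same way; a
  // thrown error would be captured on the span and evalOnce would resolve to null.
  await evalOnce('async-check', async () => {
    const result = await myAIFunction('ping');
    return { result, passed: result.length > 0 };
  });
});
```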
Parameters
- TypeScript
- Python
Function signature
```typescript
function evalOnce<TResult>(
  spanName: string,
  callback: () => TResult | null | Promise<TResult | null>
): Promise<TResult | null>
```
Parameters
- spanName (string, required): A descriptive name for the test case, used for tracing and reporting
- callback (function, required): The function containing your test logic. Can be synchronous or asynchronous
Return Value
Returns a Promise<TResult | null> that resolves with:
- The result of your callback function on success
- null if an error occurs (errors are captured in the OpenTelemetry span; see the sketch below)
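Because failures resolve to null instead of rejecting, callers can branch on the result. A minimal sketch (hypothetical test name, reusing PIPELINE_ID and myAIFunction from the examples above):

```typescript
import { experiment, evalOnce } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  const outcome = await evalOnce('null-on-error-check', async () => {
    const result = await myAIFunction('What is 2 + 2?');
    return { result, passed: result.includes('4') };
  });

  if (outcome === null) {
    // The callback threw. The exception is already recorded on the evaluation
    // span, so this branch is only for local logging or follow-up logic.
    console.warn('null-on-error-check failed -- see its span for details');
  }
});
```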
Function signature
```python
def eval(
    *,
    name: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Callable[[Callable[P, Any]], Callable[P, Coroutine[Any, Any, Any]]]
```
Parameters
- name (str, required): A descriptive name for the test case, used for tracing and reporting
- metadata (dict, optional): Custom metadata to associate with this evaluation's span
Return Value
Returns a decorator that wraps your function. The wrapped function, when called, will:
- Execute within an OpenTelemetry span
- Capture inputs, outputs, and any errors
- Return the original function's result
Multiple test cases with different evaluation types
- TypeScript
- Python
```typescript
import { callAIModel } from './models';
import { experiment, evalOnce, interaction } from 'gentrace';

const myAIFunction = interaction(
  'multi-test-function',
  async (input: string): Promise<string> => {
    return await callAIModel(input);
  },
  { pipelineId: PIPELINE_ID }
);

function calculateAccuracy(result: string, expected: string): number {
  const resultWords = result.toLowerCase().split(/\s+/);
  const expectedWords = expected.toLowerCase().split(/\s+/);
  const matches = expectedWords.filter(word => resultWords.includes(word));
  return matches.length / expectedWords.length;
}

experiment(PIPELINE_ID, async () => {
  // Test accuracy
  await evalOnce('accuracy-test', async () => {
    const input = 'What is machine learning?';
    const expected = 'machine learning is artificial intelligence';
    const result = await myAIFunction(input);
    const accuracy = calculateAccuracy(result, expected);

    return {
      input,
      result,
      expected,
      accuracy,
      passed: accuracy >= 0.7,
    };
  });

  // Test latency
  await evalOnce('latency-test', async () => {
    const start = Date.now();
    const result = await myAIFunction('Quick test input');
    const latency = Date.now() - start;

    return {
      result,
      latency,
      threshold: 2000,
      passed: latency < 2000,
    };
  });
});
```
```python
import time
import asyncio

from models import call_ai_model

@interaction(name="multi-test-function", pipeline_id=PIPELINE_ID)
async def my_ai_function(input_text: str) -> str:
    return await call_ai_model(input_text)

def calculate_accuracy(result: str, expected: str) -> float:
    result_words = set(result.lower().split())
    expected_words = set(expected.lower().split())
    matches = result_words.intersection(expected_words)
    return len(matches) / len(expected_words) if expected_words else 0.0

@experiment(pipeline_id=PIPELINE_ID)
async def comprehensive_experiment() -> None:
    @eval(name="accuracy-test")
    async def accuracy_test() -> dict:
        input_text = "What is machine learning?"
        expected = "machine learning is artificial intelligence"
        result = await my_ai_function(input_text)
        accuracy = calculate_accuracy(result, expected)
        return {
            "input": input_text,
            "result": result,
            "expected": expected,
            "accuracy": accuracy,
            "passed": accuracy >= 0.7,
        }

    @eval(name="latency-test")
    async def latency_test() -> dict:
        start = time.time()
        result = await my_ai_function("Quick test input")
        latency = time.time() - start
        return {
            "result": result,
            "latency": latency,
            "threshold": 2.0,
            "passed": latency < 2.0,
        }

    await accuracy_test()
    await latency_test()
```
OTEL span error integration
When errors occur during individual evaluations:
- Automatic Capture: All validation errors and interaction exceptions are automatically captured as span events
- Individual Spans: Each evaluation gets its own span, so errors are isolated to specific test cases
- Continued Processing: Failed evaluations don't stop the execution of other evaluations in the experiment (see the sketch after this list)
- Error Attributes: Error messages, types, and metadata are recorded as span attributes
- Span Status: Individual evaluation spans are marked with ERROR status when exceptions occur
- Error Handling: See OpenTelemetry error recording and exception recording for more details
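A minimal sketch of the isolation behavior described above (hypothetical test names, reusing PIPELINE_ID from the earlier examples): the first evaluation throws, so its span is marked ERROR with the exception recorded, while the second evaluation still runs.

```typescript
import { experiment, evalOnce } from 'gentrace';

experiment(PIPELINE_ID, async () => {
  // This evaluation throws: the exception is recorded on its own span, the
  // span status is set to ERROR, and evalOnce resolves to null.
  await evalOnce('always-fails', async () => {
    throw new Error('simulated validation failure');
  });

  // This evaluation is unaffected by the failure above and still executes.
  await evalOnce('still-runs', async () => {
    return { passed: true };
  });
});
```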
Requirements
- OpenTelemetry Setup: Individual evaluations require OpenTelemetry to be configured for tracing (see OpenTelemetry setup guide)
- Experiment Context: Must be called within an experiment() function
- Valid Pipeline ID: Parent experiment must have a valid Gentrace pipeline ID
- API Key: Gentrace API key must be configured via init()
Context requirements
Both eval() and evalOnce() must be called within an active experiment context. They automatically:
- Retrieve experiment context from the parent experiment() function
- Associate spans with the experiment ID for proper grouping
- Inherit experiment metadata and configuration
- TypeScript
- Python
```typescript
// ❌ This will throw an error - no experiment context
await evalOnce('invalid-test', async () => {
  return 'This will fail';
});

// ✅ Correct usage within experiment context
experiment(PIPELINE_ID, async () => {
  await evalOnce('valid-test', async () => {
    return { message: 'This will work', passed: true };
  });
});
```
```python
# ❌ This will throw an error - no experiment context
@eval(name="invalid-test")
async def invalid_test() -> str:
    return "This will fail"

await invalid_test()  # RuntimeError: must be called within @experiment context

# ✅ Correct usage within experiment context
@experiment(pipeline_id=PIPELINE_ID)
async def valid_experiment() -> None:
    @eval(name="valid-test")
    async def valid_test() -> dict:
        return {"message": "This will work", "passed": True}

    await valid_test()
```
OpenTelemetry integration
The evaluation functions create rich OpenTelemetry spans with comprehensive tracing information:
Span attributes
- gentrace.experiment_id: Links the evaluation to its parent experiment
- gentrace.test_case_name: The name provided to the evaluation function
- Error information: Automatic error type and message capture
Span events
- gentrace.fn.output: Records function outputs for result tracking
- Exception events: Automatic exception recording with full stack traces
Example span structure
```
Experiment Span (experiment-name)
├── Evaluation Span (accuracy-test)
│   ├── Attributes:
│   │   ├── gentrace.experiment_id: "exp-123"
│   │   ├── gentrace.test_case_name: "accuracy-test"
│   │   └── custom.metadata: "value"
│   ├── Events:
│   │   ├── gentrace.fn.args: {"args": ["input data"]}
│   │   └── gentrace.fn.output: {"output": {"accuracy": 0.95}}
│   └── Child Spans (from interaction() calls)
└── Evaluation Span (latency-test)
    └── ...
```
Related functions
- init() - Initialize the Gentrace SDK
- interaction() - Instrument AI functions for tracing within experiments
- evalDataset() / eval_dataset() - Run tests against a dataset within an experiment
- experiment() - Create experiment contexts for grouping evaluations
- traced() - Alternative approach for tracing functions