Test results
Test results are the evaluation reports generated for your generative AI pipeline.
Creating test results
After your code pulls the cases and generates the outputs for each case, you submit the outputs using our SDK. A test result represents a single submission of a group of test cases and their outputs. The result is then evaluated by the evaluators you defined.
- TypeScript
- Python
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) =>
    createEmailWithAI(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query,
    ),
  );

  const outputs = await Promise.all(promises);

  // This SDK method creates a single result entity on our servers
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
```
```python
import os

import gentrace
from dotenv import load_dotenv

from email_creation import create_email_with_ai

load_dotenv()

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # This SDK method creates a single result entity on our servers
    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
    print("Result ID: ", response["resultId"])

main()
```
Once the test result has been submitted, your evaluators start grading your outputs. You can view in-progress and finished results in the Gentrace UI. In the example below, several evaluators score the result.
Get test results from the SDK
We expose a few methods to retrieve your submitted test results.
Get test result (single)
- TypeScript
- Python
```typescript
import { getTestResult } from '@gentrace/core';

const result = await getTestResult('ede8271a-699f-4db7-a198-2c51a99e2dab');

console.log('Created at: ', result.createdAt);
```
```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

result = gentrace.get_test_result(result_id="ede8271a-699f-4db7-a198-2c51a99e2dab")
print("Created at: ", result["createdAt"])
```
This test result endpoint returns a deeply expanded object that includes the runs, pipeline, evaluations, and evaluators associated with that test result. This entity allows you to perform deeper analysis on a given test result.
Since the returned object is more complex, consult the API response reference for this route here to learn how to traverse the structure. Select the 200 status code button once you reach the page.
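For instance, here is a minimal sketch of walking the expanded object. The `runs`, `evaluations`, `evaluator`, and `evalValue` keys are assumptions based on the description above; confirm the exact shape against the API response reference.

```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

# Hypothetical traversal of the deeply expanded test result object.
# Key names below are assumptions -- verify them in the API response reference.
result = gentrace.get_test_result(result_id="ede8271a-699f-4db7-a198-2c51a99e2dab")

for run in result.get("runs", []):
    for evaluation in run.get("evaluations", []):
        evaluator = evaluation.get("evaluator", {})
        print(evaluator.get("name"), evaluation.get("evalValue"))
```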
Get test results (multiple)
- TypeScript
- Python
```typescript
import { getTestResults } from '@gentrace/core';

const results = await getTestResults('guess-the-year');

for (const result of results) {
  console.log('Result ID: ', result.id);
}
```
```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

results = gentrace.get_test_results(pipeline_slug="guess-the-year")

for result in results:
    print(result.get("id"))
```
This test result endpoint returns a list of test result objects. Each test result object is purposefully thin, containing only the test result metadata (e.g., the name, branch, and commit information). Use the `gentrace.get_test_result()` SDK method to pull detailed information for a given test result.
To traverse this structure, consult the API response reference for this route here. Select the 200 status code button once you reach the page.
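For example, a sketch of the list-then-expand pattern described above. The `id` field appears in the snippet earlier; `name` and `branch` are assumed metadata keys based on the prose description, so check them against the API reference.

```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

# List the thin test result objects for a pipeline
results = gentrace.get_test_results(pipeline_slug="guess-the-year")

for result in results:
    # "name" and "branch" are assumed metadata keys -- see the API reference
    print(result.get("id"), result.get("name"), result.get("branch"))

# Expand the first result into the detailed, deeply nested object
if results:
    detailed = gentrace.get_test_result(result_id=results[0]["id"])
```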
Attaching result metadata
You can attach metadata to your test results. For example, if you want to associate a URL to an external service or a prompt variant with a result, you can use metadata to tag it.
This attaches metadata to an entire result (e.g., the version of the codebase). To attach metadata to a single run, see the run metadata section.
- TypeScript
- Python
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { pipeline } from '../src/pipeline'; // TODO: REPLACE WITH YOUR PIPELINE

console.log('Branch name: ', process.env.GENTRACE_BRANCH);
console.log('Commit: ', process.env.GENTRACE_COMMIT);

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

async function main() {
  // If no dataset ID is provided, the golden dataset will be selected by default
  const testCases = await getTestCases(PIPELINE_SLUG);

  // To specify a particular dataset, you can provide its ID:
  // const testCases = await getTestCasesForDataset("123e4567-e89b-12d3-a456-426614174000");

  const promises = testCases.map((testCase) => pipeline(testCase));
  const outputs = await Promise.all(promises);

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs, {
    metadata: {
      // We support multiple metadata keys within the metadata object
      externalServiceUrl: {
        // Every metadata object must have a "type" key
        type: 'url',
        url: 'https://external-service.example.com',
        text: 'External service',
      },
      gitSha: {
        type: 'string',
        value: '71548c22b939f3ef4ab3a5f067e1a5746fa352ff',
      },
    },
  });
}

main();
```
```python
import os

import gentrace

from pipeline import pipeline  # TODO: REPLACE WITH YOUR PIPELINE

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = pipeline(inputs)
        outputs.append(result)

    # This SDK method creates a single result entity on our servers
    response = gentrace.submit_test_result(
        PIPELINE_SLUG,
        test_cases,
        outputs,
        {
            "metadata": {
                "externalServiceUrl": {
                    # Every metadata object must have a "type" key
                    "type": "url",
                    "url": "https://external-service.example.com",
                    "text": "External service",
                },
                "promptVariantId": {
                    "type": "string",
                    "value": "SGVsbG8sIFdvcmxkIQ==",
                },
            }
        },
    )
    print("Result ID: ", response["resultId"])

main()
```
The value object for each metadata key is a union type. We support only two value types at this time and will add more upon request:
- `type: url`: you must specify `url` and `text` fields.
- `type: string`: you must specify a `value` field.
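For illustration, here is a hypothetical metadata object that uses both supported types (the `docsLink` and `modelVersion` keys are made up for this sketch):

```python
# Hypothetical metadata dict showing the two supported value types
metadata = {
    "docsLink": {
        # "url" type: both "url" and "text" are required
        "type": "url",
        "url": "https://example.com/run/123",
        "text": "Run dashboard",
    },
    "modelVersion": {
        # "string" type: a "value" field is required
        "type": "string",
        "value": "gpt-4-0613",
    },
}
```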
If your metadata is not properly formatted with the specified type, our API will reject your request.
When viewing the detailed result page, the supplied metadata renders in the sidebar.
You can create additional metadata keys and edit the existing values in-line.
Setting a name
Setting custom names for test results makes it easier to tell different results apart.
If you are comparing the `gpt-4` and `gpt-3.5-turbo` models, you could name the test results `Experiment: GPT-4` and `Experiment: GPT-3.5 Turbo`, respectively.
Gentrace allows you to set names through your code or our UI.
SDK
You can use the SDKs to set result names.
- TypeScript
- Python
Setting from `init()` or the `GENTRACE_RESULT_NAME` environment variable.
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

const PIPELINE_SLUG = 'your-pipeline-slug';

init({
  apiKey: process.env.GENTRACE_API_KEY,
  // Set the result name when initializing the SDK.
  resultName: 'try gpt-4',
});

// Alternatively, you can set the result name with the GENTRACE_RESULT_NAME env variable. Our
// library will access this if `resultName` was not passed to the init() function. If both
// values are supplied, the `resultName` parameter is given precedence.
console.log('Result name: ', process.env.GENTRACE_RESULT_NAME);

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) =>
    createEmailWithAI(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query,
    ),
  );

  const outputs = await Promise.all(promises);

  // Upon submission, the test result will have the provided result name in the Gentrace UI.
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
```
Setting directly from `submitTestResult()`.
```typescript
async function main() {
  // Grabs the golden dataset for the pipeline
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) => createEmailWithAI(testCase));
  const outputs = await Promise.all(promises);

  // Upon submission, the test result will have the provided result name in the Gentrace UI.
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs, {
    name: 'try gpt-4',
  });
}

main();
```
Setting from `init()` or the `GENTRACE_RESULT_NAME` environment variable.
```python
import os

import gentrace
from dotenv import load_dotenv

from email_creation import create_email_with_ai

load_dotenv()

print("Branch name: ", os.getenv("GENTRACE_BRANCH"))
print("Commit: ", os.getenv("GENTRACE_COMMIT"))

gentrace.init(
    api_key=os.getenv("GENTRACE_API_KEY"),
    result_name="try gpt-4",
)

# Alternatively, you can set the result name with the GENTRACE_RESULT_NAME env variable. Our
# library will access the variable if `result_name` was not passed to the init() function. If both
# values are supplied, the `result_name` parameter is given precedence.
print("Result name: ", os.getenv("GENTRACE_RESULT_NAME"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # Upon submission, the test result will have the provided result name in the Gentrace UI.
    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
    print("Result ID: ", response["resultId"])

main()
```
Setting directly from `submit_test_result()`.
```python
def main():
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # Note: If you want to use a specific dataset instead of the golden dataset,
    # you can provide a dataset ID like this:
    # test_cases = gentrace.get_test_cases(dataset_id="your-dataset-id-here")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # Upon submission, the test result will have the provided result name in the Gentrace UI.
    response = gentrace.submit_test_result(
        PIPELINE_SLUG, test_cases, outputs, result_name="try gpt-4"
    )
    print("Result ID: ", response["resultId"])

main()
```
View test results in the UI
You can compare test results by checking their corresponding rows and selecting the "Compare" button at the top right of the screen.
Dashboard
This view provides a comprehensive look at how your test runs are performing across different evaluators. Each graph consolidates all the test runs, grouped by their evaluated results for easier comparison.
Click on the Runs tab to zoom in on the graded output for your test cases.
You can set a different test result as the baseline and the rest as benchmarks in the Runs table. Notice how the delta changes relative to the baseline. You can also click on the column headers to sort by results delta.
Select an evaluation cell to see its grading history.