Test results
Test results are the evaluation reports generated for your generative AI pipeline.
Creating test results
After your code pulls the cases and generates the outputs for each case, you submit the outputs using our SDK. A test result represents a single submission of a group of test cases and their outputs. The result is then evaluated by the evaluators you defined.
- TypeScript
- Python
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) =>
    createEmailWithAI(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query,
    ),
  );

  const outputs = await Promise.all(promises);

  // This SDK method creates a single result entity on our servers
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
```
```python
import os

import gentrace
from dotenv import load_dotenv

from email_creation import create_email_with_ai

load_dotenv()

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # This SDK method creates a single result entity on our servers
    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
    print("Result ID: ", response["resultId"])

main()
```
Once the test result has been submitted, your evaluators start grading your outputs. You can view in-progress and finished results in the Gentrace UI. In the example below, several evaluators score the result.
Get test results from the SDK
We expose a few methods to retrieve your submitted test results.
Get test result (single)
- TypeScript
- Python
```typescript
import { getTestResult } from '@gentrace/core';

const result = await getTestResult('ede8271a-699f-4db7-a198-2c51a99e2dab');

console.log('Created at: ', result.createdAt);
```
```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

result = gentrace.get_test_result(result_id="ede8271a-699f-4db7-a198-2c51a99e2dab")
print("Created at: ", result["createdAt"])
```
This test result endpoint returns a deeply expanded object that includes the runs, pipeline, evaluations, and evaluators associated with that test result. This entity allows you to perform deeper analysis on a given test result.
Since the returned object is more complex, consult the API response reference for this route here to learn how to traverse the structure. Select the 200 status code button once you reach the page.
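For instance, here is a minimal sketch of walking the expanded object. The `runs`, `evaluations`, `evaluator`, and `evalValue` keys are assumptions based on the description above; confirm the exact shape against the API response reference.

```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

# Hypothetical traversal of the deeply expanded test result object.
# Key names below are assumptions -- verify them in the API response reference.
result = gentrace.get_test_result(result_id="ede8271a-699f-4db7-a198-2c51a99e2dab")

for run in result.get("runs", []):
    for evaluation in run.get("evaluations", []):
        evaluator = evaluation.get("evaluator", {})
        print(evaluator.get("name"), evaluation.get("evalValue"))
```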
Get test results (multiple)
- TypeScript
- Python
```typescript
import { getTestResults } from '@gentrace/core';

const results = await getTestResults('guess-the-year');

for (const result of results) {
  console.log('Result ID: ', result.id);
}
```
```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

results = gentrace.get_test_results(pipeline_slug="guess-the-year")

for result in results:
    print(result.get("id"))
```
This test result endpoint returns a list of test result objects. Each test result object is purposefully thin, containing only the test result metadata (e.g., the name, branch, and commit information). Use the `gentrace.get_test_result()` SDK method to pull detailed information for a given test result.
To traverse this structure, consult the API response reference for this route here. Select the 200 status code button once you reach the page.
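For example, a sketch of the list-then-expand pattern described above. The `id` field appears in the snippet earlier; `name` and `branch` are assumed metadata keys based on the prose description, so check them against the API reference.

```python
import os

import gentrace

gentrace.init(api_key=os.getenv("GENTRACE_API_KEY"))

# List the thin test result objects for a pipeline
results = gentrace.get_test_results(pipeline_slug="guess-the-year")

for result in results:
    # "name" and "branch" are assumed metadata keys -- see the API reference
    print(result.get("id"), result.get("name"), result.get("branch"))

# Expand the first result into the detailed, deeply nested object
if results:
    detailed = gentrace.get_test_result(result_id=results[0]["id"])
```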
Attaching result metadata
You can attach metadata to your test results. For example, if you want to associate a URL to an external service or a prompt variant with a result, you can use metadata to tag it.
This attaches metadata to an entire result (e.g., the version of the codebase). To attach metadata to a single run, see the run metadata section.
- TypeScript
- Python
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { pipeline } from '../src/pipeline'; // TODO: REPLACE WITH YOUR PIPELINE

console.log('Branch name: ', process.env.GENTRACE_BRANCH);
console.log('Commit: ', process.env.GENTRACE_COMMIT);

init({ apiKey: process.env.GENTRACE_API_KEY });

const PIPELINE_SLUG = 'your-pipeline-slug';

async function main() {
  // If no dataset ID is provided, the golden dataset will be selected by default
  const testCases = await getTestCases(PIPELINE_SLUG);

  // To specify a particular dataset, you can provide its ID:
  // const testCases = await getTestCasesForDataset("123e4567-e89b-12d3-a456-426614174000");

  const promises = testCases.map((testCase) => pipeline(testCase));
  const outputs = await Promise.all(promises);

  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs, {
    metadata: {
      // We support multiple metadata keys within the metadata object
      externalServiceUrl: {
        // Every metadata object must have a "type" key
        type: 'url',
        url: 'https://external-service.example.com',
        text: 'External service',
      },
      gitSha: {
        type: 'string',
        value: '71548c22b939f3ef4ab3a5f067e1a5746fa352ff',
      },
    },
  });
}

main();
```
```python
import os

import gentrace

from pipeline import pipeline  # TODO: REPLACE WITH YOUR PIPELINE

gentrace.init(os.getenv("GENTRACE_API_KEY"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = pipeline(inputs)
        outputs.append(result)

    # This SDK method creates a single result entity on our servers
    response = gentrace.submit_test_result(
        PIPELINE_SLUG,
        test_cases,
        outputs,
        {
            "metadata": {
                "externalServiceUrl": {
                    # Every metadata object must have a "type" key
                    "type": "url",
                    "url": "https://external-service.example.com",
                    "text": "External service",
                },
                "promptVariantId": {
                    "type": "string",
                    "value": "SGVsbG8sIFdvcmxkIQ==",
                },
            }
        },
    )
    print("Result ID: ", response["resultId"])

main()
```
The value object for each metadata key is a union type. We support only two value types at this time and will add more upon request:
- `type: url`: you must specify `url` and `text` fields.
- `type: string`: you must specify a `value` field.
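For illustration, here is a hypothetical metadata object that uses both supported types (the `docsLink` and `modelVersion` keys are made up for this sketch):

```python
# Hypothetical metadata dict showing the two supported value types
metadata = {
    "docsLink": {
        # "url" type: both "url" and "text" are required
        "type": "url",
        "url": "https://example.com/run/123",
        "text": "Run dashboard",
    },
    "modelVersion": {
        # "string" type: a "value" field is required
        "type": "string",
        "value": "gpt-4-0613",
    },
}
```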
If your metadata is not properly formatted with the specified type, our API will reject your request.
When viewing the detailed result page, the supplied metadata renders in the sidebar.
You can create additional metadata keys and edit the existing values in-line.
Setting a name
Setting custom names for test results makes it easier to tell different results apart.
If you are comparing the `gpt-4` and `gpt-3.5-turbo` models, you could name the test results `Experiment: GPT-4` and `Experiment: GPT-3.5 Turbo`, respectively.
Gentrace allows you to set names through your code or our UI.
SDK
You can use the SDKs to set result names.
- TypeScript
- Python
Setting from `init()` or the `GENTRACE_RESULT_NAME` environment variable.
```typescript
import { init, getTestCases, submitTestResult } from '@gentrace/core';
import { createEmailWithAI } from '../src/email-creation'; // TODO: REPLACE WITH YOUR PIPELINE

const PIPELINE_SLUG = 'your-pipeline-slug';

init({
  apiKey: process.env.GENTRACE_API_KEY,
  // Set the result name when initializing the SDK.
  resultName: 'try gpt-4',
});

// Alternatively, you can set the result name with the GENTRACE_RESULT_NAME env variable. Our
// library will access this if `resultName` was not passed to the init() function. If both
// values are supplied, the `resultName` parameter is given precedence.
console.log('Result name: ', process.env.GENTRACE_RESULT_NAME);

async function main() {
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) =>
    createEmailWithAI(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query,
    ),
  );

  const outputs = await Promise.all(promises);

  // Upon submission, the test result will have the provided result name in the Gentrace UI.
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs);
}

main();
```
Setting directly from `submitTestResult()`.
```typescript
async function main() {
  // Grabs the golden dataset for the pipeline
  const testCases = await getTestCases(PIPELINE_SLUG);

  const promises = testCases.map((testCase) => createEmailWithAI(testCase));
  const outputs = await Promise.all(promises);

  // Upon submission, the test result will have the provided result name in the Gentrace UI.
  const response = await submitTestResult(PIPELINE_SLUG, testCases, outputs, {
    name: 'try gpt-4',
  });
}

main();
```
Setting from `init()` or the `GENTRACE_RESULT_NAME` environment variable.
```python
import os

import gentrace
from dotenv import load_dotenv

from email_creation import create_email_with_ai

load_dotenv()

print("Branch name: ", os.getenv("GENTRACE_BRANCH"))
print("Commit: ", os.getenv("GENTRACE_COMMIT"))

gentrace.init(
    api_key=os.getenv("GENTRACE_API_KEY"),
    result_name="try gpt-4",
)

# Alternatively, you can set the result name with the GENTRACE_RESULT_NAME env variable. Our
# library will access the variable if `result_name` was not passed to the init() function. If both
# values are supplied, the `result_name` parameter is given precedence.
print("Result name: ", os.getenv("GENTRACE_RESULT_NAME"))

PIPELINE_SLUG = "your-pipeline-slug"

def main():
    # If no dataset ID is provided, the golden dataset will be selected by default
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # To specify a particular dataset, you can provide its ID:
    # test_cases = gentrace.get_test_cases(dataset_id="123e4567-e89b-12d3-a456-426614174000")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # Upon submission, the test result will have the provided result name in the Gentrace UI.
    response = gentrace.submit_test_result(PIPELINE_SLUG, test_cases, outputs)
    print("Result ID: ", response["resultId"])

main()
```
Setting directly from `submit_test_result()`.
```python
def main():
    test_cases = gentrace.get_test_cases(pipeline_slug=PIPELINE_SLUG)

    # Note: If you want to use a specific dataset instead of the golden dataset,
    # you can provide a dataset ID like this:
    # test_cases = gentrace.get_test_cases(dataset_id="your-dataset-id-here")

    outputs = []
    for test_case in test_cases:
        inputs = test_case["inputs"]
        result = create_email_with_ai(inputs)
        outputs.append(result)

    # Upon submission, the test result will have the provided result name in the Gentrace UI.
    response = gentrace.submit_test_result(
        PIPELINE_SLUG, test_cases, outputs, result_name="try gpt-4"
    )
    print("Result ID: ", response["resultId"])

main()
```
View test results in the UI
You can compare test results by checking their corresponding rows and selecting the "Compare" button at the top right of the screen.
Dashboard
This view provides a comprehensive look at how your test runs are performing across different evaluators. Each graph consolidates all the test runs, grouped by their evaluated results for easier comparison.
Click on the Runs tab to zoom in on the graded output for your test cases.
You can set a different test result as the baseline and the rest as benchmarks in the Runs table. Notice how the delta changes relative to the baseline. You can also click on the column headers to sort by results delta.
Select an evaluation cell to see its grading history.