Building comparisons
After creating your first result, you will likely need to change your generative AI pipeline and compare the quality of results. Comparisons help you understand whether your generative AI feature is improving over time.
Creating both results
To create a comparison, you first need two test results.
If you're not familiar with creating test results, read the prior section.
To make the two test results easy to differentiate in the Gentrace UI, you have several options (each is sketched in the example below):

- Use the `$GENTRACE_RESULT_NAME` environment variable
- Use the `init()` function with the `resultName` parameter
- Specify the `name` parameter in the `submitTestResult()` method of your pipeline runner
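As a minimal TypeScript sketch of these options: the option names (`GENTRACE_RESULT_NAME`, `resultName`, `submitTestResult()`) come from the list above, but the surrounding shapes, such as the runner object and its options, are illustrative assumptions, so verify them against the SDK reference for your version.

```typescript
// Sketch: three ways to name a test result so it is easy to find in the UI.
// Exact signatures are assumptions; check the Gentrace SDK reference.
import { init } from "@gentrace/core";

// Option 1: set the environment variable before running your tests, e.g.
//   GENTRACE_RESULT_NAME="Try gpt-4" npx ts-node run-tests.ts

// Option 2: pass resultName when initializing the SDK.
init({
  apiKey: process.env.GENTRACE_API_KEY ?? "",
  resultName: "Try gpt-4",
});

// Option 3: pass a name when submitting from your pipeline runner
// (the `runner` variable and options shape here are illustrative only):
// await runner.submitTestResult({ name: "Try gpt-4" });
```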
Result creation walkthrough
Let's say that we have a generative AI feature, powered by a single gpt-4o prompt. We follow these steps to create a result:
- Construct a suite of test cases within a dataset.
- Evaluate the feature's output with an AI evaluator that checks whether the output complies with our company safety policy.
- Write and invoke the associated submission code with the result name `Try gpt-4` (a sketch follows this list).
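Here is a hedged sketch of what that submission code might look like. The pipeline slug `guide-feature`, the input key `query`, and the `getTestCases()`/`submitTestResult()` exports and their signatures are assumptions for illustration; consult the Gentrace SDK reference for the exact API in your version.

```typescript
// Sketch of the submission script for this walkthrough. The pipeline slug,
// input key, and the getTestCases()/submitTestResult() exports are assumptions.
import { init, getTestCases, submitTestResult } from "@gentrace/core";
import OpenAI from "openai";

init({
  apiKey: process.env.GENTRACE_API_KEY ?? "",
  resultName: "Try gpt-4", // the result name used in this walkthrough
});

const openai = new OpenAI();

// The generative AI feature: a single gpt-4o prompt.
async function runFeature(input: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: input }],
  });
  return completion.choices[0].message.content ?? "";
}

async function main() {
  // Pull the suite of test cases from the dataset.
  const testCases = await getTestCases("guide-feature");

  // Run the pipeline on every test case and collect the outputs.
  const outputs: Array<Record<string, unknown>> = [];
  for (const testCase of testCases) {
    const output = await runFeature(String(testCase.inputs.query));
    outputs.push({ value: output });
  }

  // Submit the outputs; the Safety AI evaluator then grades each one.
  await submitTestResult("guide-feature", testCases, outputs);
}

main();
```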
We end up with a report similar to this.
We then change our generative AI pipeline to use GPT 3.5 Turbo to reduce cost. We want to see whether the performance of our feature has regressed. We follow the above steps with a different result name and generate a new report.
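Concretely, relative to the sketch above, only the result name and the model string need to change; for example:

```typescript
// Re-running the walkthrough under a different result name. Everything else
// in the earlier sketch stays the same.
import { init } from "@gentrace/core";

init({
  apiKey: process.env.GENTRACE_API_KEY ?? "",
  resultName: "Try gpt-3.5-turbo", // a distinct name for the new result
});

// ...and inside runFeature(), the model becomes "gpt-3.5-turbo" instead of "gpt-4o".
```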
Comparing results
Now that we have two results, we can compare them. Select the result that you want to compare against.
In the comparison view, we can see that the GPT 3.5 Turbo pipeline performs worse than the gpt-4o pipeline on 8 more test cases when using the Safety evaluator.
This may be acceptable to us, given the cost savings. If not, we can refine our GPT 3.5 Turbo prompt and compare the results again to improve performance.