After creating your first result, you will likely need to change your generative AI pipeline and compare the quality of results. Comparisons help you understand whether your generative AI feature is improving over time.
Creating both results
To create a comparison, you need to create two results.
If you're not familiar with creating test results, read the prior section.
Result creation walkthrough
Let's say that we have a generative AI feature, powered by a single GPT-4 prompt. We follow these steps to create a result:
- Construct a suite of test cases.
- Evaluate the feature's output with an AI evaluator that checks whether the output complies with our company safety policy.
- Write and invoke the associated submission code with the result name "Try gpt-4", set using the $GENTRACE_RESULT_NAME value.
We end up with a report similar to this.
We then change our generative AI pipeline to use GPT 3.5 Turbo to reduce cost. We want to see whether the performance of our feature has regressed. We follow the above steps with a different $GENTRACE_RESULT_NAME value and generate a new report.
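Because each run must carry its own name, it can help to have the submission code refuse to run when the name is missing. The following is a minimal sketch, assuming that $GENTRACE_RESULT_NAME is read as an environment variable; the helper `result_name_from_env` is hypothetical and is not part of any SDK.

```python
import os


def result_name_from_env(environ=os.environ):
    """Return the label for this run, read from GENTRACE_RESULT_NAME.

    Hypothetical helper for illustration only. It raises instead of
    falling back to a default, so two runs never share a name by accident.
    """
    name = environ.get("GENTRACE_RESULT_NAME", "").strip()
    if not name:
        raise ValueError("Set GENTRACE_RESULT_NAME to label this run")
    return name


if __name__ == "__main__":
    # Pass the returned name to whatever submission code you already use.
    print(f"Submitting result named: {result_name_from_env()}")
```

With a script like this saved under a name of your choosing, the second run would be started with a different value, for example `GENTRACE_RESULT_NAME="Try gpt-3.5-turbo"` (an illustrative name, not one taken from the walkthrough).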
Now that we have two results, we can compare them. Select the result that you want to compare against.
In the comparison view, we can see that the GPT 3.5 Turbo pipeline performs worse than GPT-4 on 8 more test cases when using the Safety evaluator.
This may be acceptable to us, given the cost savings. If not, we can refine our GPT 3.5 Turbo prompt and compare the results again until the performance improves.
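To make the idea of "performs worse on more test cases" concrete, here is a minimal sketch of the same kind of comparison done locally. It assumes each result can be reduced to a mapping from test case ID to a pass/fail outcome for a single evaluator; that data shape, the function name `compare_results`, and the sample values are all hypothetical and do not reflect how the comparison view is implemented.

```python
def compare_results(baseline, candidate):
    """Compare two results for one evaluator.

    Both arguments map a test case ID to True (passed) or False (failed).
    Only test cases present in both results are compared.
    Returns (regressed, improved) as sorted lists of test case IDs.
    """
    shared = baseline.keys() & candidate.keys()
    regressed = sorted(c for c in shared if baseline[c] and not candidate[c])
    improved = sorted(c for c in shared if not baseline[c] and candidate[c])
    return regressed, improved


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    gpt4 = {"case-1": True, "case-2": True, "case-3": False}
    gpt35 = {"case-1": True, "case-2": False, "case-3": True}

    regressed, improved = compare_results(gpt4, gpt35)
    print("Regressed:", regressed)
    print("Improved:", improved)
    print("Net change in failing cases:", len(regressed) - len(improved))
```

Listing regressions and improvements separately, rather than only the net difference, shows which test cases to inspect first when refining the prompt.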