Version: 4.6.19

Building comparisons

After creating your first result, you will likely need to change your generative AI pipeline and compare the quality of results. Comparisons help you understand whether your generative AI feature is improving over time.

Comparing GPT-4 and GPT-3.5 Turbo

Creating both results

To create a comparison, you need to create two results.


If you're not familiar with creating test results, read the prior section.

To easily differentiate between two test results in the Gentrace UI, you can name your results upon submission using the $GENTRACE_RESULT_NAME environment variable or the init() function.
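As a rough sketch of the naming behavior described above (the helper below is hypothetical, not part of the Gentrace SDK, and the precedence order is an assumption):

```python
import os

def resolve_result_name(explicit_name=None, default="Untitled result"):
    """Hypothetical helper mirroring the two naming options described above.

    Assumed precedence (an assumption, not confirmed by the docs): a name
    passed explicitly to init() wins, then $GENTRACE_RESULT_NAME, then a
    generic fallback.
    """
    return explicit_name or os.environ.get("GENTRACE_RESULT_NAME") or default

# Naming via the environment variable:
os.environ["GENTRACE_RESULT_NAME"] = "Try gpt-4"
print(resolve_result_name())  # → Try gpt-4

# Naming via an explicit init()-style argument:
print(resolve_result_name("Try gpt-3.5-turbo"))  # → Try gpt-3.5-turbo
```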

Result creation walkthrough

Let's say that we have a generative AI feature, powered by a single GPT-4 prompt. We follow these steps to create a result:

  1. Construct a suite of test cases.
  2. Evaluate the feature's output with an AI evaluator that checks whether the output complies with our company safety policy.
  3. Write and invoke the associated submission code, naming the result Try gpt-4 via the $GENTRACE_RESULT_NAME environment variable.
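To make step 2 concrete, here is a minimal, self-contained sketch of what a safety check could look like. The policy terms and the verdict structure are invented for illustration; a real AI evaluator would call a model rather than do a keyword screen, and this is not Gentrace's evaluator API:

```python
# Invented policy terms; a stand-in for a company safety policy.
BANNED_TERMS = {"medical advice", "legal advice"}

def safety_evaluator(output: str) -> dict:
    """Return a pass/fail verdict for a single pipeline output."""
    violations = [term for term in BANNED_TERMS if term in output.lower()]
    return {"passed": not violations, "violations": violations}

verdict = safety_evaluator("General information only, not medical advice.")
print(verdict["passed"])  # False: the output mentions a banned term
```

Each test case in the suite would get one such verdict, and the aggregate pass rate is what the report summarizes.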

We end up with a report similar to this.

We then change our generative AI pipeline to use GPT-3.5 Turbo to reduce cost. We want to see whether the performance of our feature has regressed. We follow the above steps with a different $GENTRACE_RESULT_NAME value and generate a new report.

Comparing results

Now that we have two results, we can compare them. Select the result that you want to compare against.

In the comparison view, we can see that the GPT-3.5 Turbo pipeline performs worse than GPT-4 on 8 more test cases when using the Safety evaluator.
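Per evaluator, this kind of comparison boils down to counting test cases that passed in the baseline result but fail in the new one. A self-contained sketch, with invented pass/fail data keyed by test case ID:

```python
def count_regressions(baseline: dict, candidate: dict) -> int:
    """Count test cases that passed in `baseline` but fail in `candidate`."""
    return sum(
        1 for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )

# Invented verdicts for three test cases under the Safety evaluator.
gpt4_result = {"case-1": True, "case-2": True, "case-3": False}
gpt35_result = {"case-1": True, "case-2": False, "case-3": False}

print(count_regressions(gpt4_result, gpt35_result))  # → 1
```

Running the same count in the other direction would surface any test cases where the cheaper pipeline improved.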

This may be acceptable to us, given the cost savings. If not, we can refine our GPT-3.5 prompt and re-run the comparison to improve performance.