Initially, when developing LLM products, "vibes" are enough to make decisions. However, to make products reliable, companies generally end up building out evaluations. Evaluations consist of:
- Datasets of inputs to the LLM product
- Evaluators for grading the LLM product outputs
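To make those two pieces concrete, here is a minimal sketch of the shapes involved; the names `DatasetRow` and `Evaluator` are illustrative, not from any particular framework:

```typescript
// Hypothetical shapes for the two pieces of an evaluation.
interface DatasetRow {
  input: string;            // the input sent to the LLM product
  expectedOutcome?: string; // what a good output should contain, if known
}

// An evaluator grades the product's actual output for a given row.
type Evaluator = (row: DatasetRow, actualOutput: string) => {
  pass: boolean;
  reason: string;
};
```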
A lot has been written about making better evaluators. However, I frequently see companies make the same mistakes when building their datasets, slowing down their development process and leading to stale tests that don't accurately capture how well the product will perform in the hands of users.
In this post, I provide a system for building, maintaining, and scaling datasets for modern LLM product workflows.
The myth of the “golden” dataset
The biggest misconception I see from folks getting started with evaluation datasets is that they need to find their "golden" dataset. They believe this "golden" dataset needs to be a perfect set of 200 to 10,000 examples, and they commission internal experts or outsourced firms with 2-3 month projects to put together the ultimate dataset.
This doesn't work for a few reasons:
- It paralyzes development teams. The search for the "golden" dataset takes months, forcing teams to wait before they can start testing when they could have started much sooner.
- The dataset doesn't keep up with evolving product specifications. If your product works well for one use case, you'll inevitably get requests for enhancements and new capabilities. If it doesn't work well, you may need to scope down your ambitions. Either way, the dataset needs to be able to evolve with the scope of the application.
- The dataset may not reflect common ways the application breaks. Inevitably, any golden dataset will have gaps, and developers or users will break the app in a way that wasn't captured by the dataset. When they do find these gaps, they won't have a process to bring those learnings quickly back to the golden dataset, so the same mistakes happen repeatedly until the next golden dataset is created several months later.
Start small, iterate continuously
Instead of spending months searching for a golden dataset, the best teams build their datasets continuously.
Change your mindset
I encourage teams to adopt the following mindset changes:
- My dataset is dynamic and will never be perfect.
- My dataset starts small and grows slowly and steadily.
- Every time someone uses the product and something goes wrong, there's an opportunity to create a corresponding failing test.
In practice, this means they do the following:
- Start with 5-10 examples that reflect the most common use cases for the product.
- Record all user interactions with the application. Automate the flagging of potential negative user experiences. Review negative outcomes and ensure a corresponding "failing test" is created in the dataset, reflecting the inputs that led to the outcome (see the sketch below).
- Involve stakeholders. Ask product managers and subject matter experts to directly engage with the dataset, adding examples and running them to see if they think the outputs are being correctly evaluated.
Over time, it's normal for datasets to grow into hundreds of examples. But the best ones don't start there. They grow organically over time.
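As a rough sketch of the feedback loop, here is how a flagged production interaction might become a failing test case. `Interaction` and the in-memory `dataset` array are hypothetical stand-ins for whatever logging store and dataset API you actually use:

```typescript
// Hypothetical: convert flagged production interactions into failing tests.
interface Interaction {
  input: string;
  output: string;
  userFlaggedNegative: boolean;
}

interface TestCase {
  input: string;
  expectedOutcome?: string; // filled in by a reviewer during triage
}

function addFailingTests(interactions: Interaction[], dataset: TestCase[]): void {
  for (const interaction of interactions) {
    if (!interaction.userFlaggedNegative) continue;
    // Record the inputs that led to the bad outcome so the failure can be
    // reproduced; a teammate adds the expected outcome during review.
    dataset.push({ input: interaction.input });
  }
}
```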
Connect your dataset to real, or realistic, datasources
When building RAG systems, the input dataset might consist of queries that operate against a vector database. Many folks try to set up their test database with either:
- A bunch of fixtures hard-coded into the test code
- A snapshotted, perfect DB of representative examples.
In practice, I've found that the best companies do neither of these approaches - they test their RAG system directly against a staging or production system.
This does lead to the obvious downside that tests may start failing due to changes to the underlying data.
For example, if I'm doing RAG over my docs, I might include a row in my test dataset asking "What authentication methods are supported?" with an expected outcome of "Magic link and password." When we launch SAML support and update our docs, the RAG system returns "Magic link, password, and SAML" and the test starts failing - not because of a change to the LLM product code, but because the underlying data changed.
With this process, you therefore need to review failing tests to determine whether they've gone stale and update them if so. However, I've found that this approach gives far more realistic results because it constantly stays up-to-date with the characteristics of real-world data, so it's well worth the time invested and leads to a much more reliable LLM application.
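To illustrate, here is roughly what such a test might look like in a Jest-style test runner. `ragAnswer` and `STAGING_INDEX_URL` are hypothetical stand-ins for your own retrieval pipeline and environment:

```typescript
// Hypothetical: query the live staging index through your own RAG pipeline.
declare function ragAnswer(query: string, indexUrl: string): Promise<string>;

const STAGING_INDEX_URL = process.env.STAGING_INDEX_URL ?? "https://staging.example.com";

test("answers the authentication question from live docs", async () => {
  const answer = await ragAnswer("What authentication methods are supported?", STAGING_INDEX_URL);
  // If this fails, first check whether the docs changed (a stale test)
  // before assuming a regression in the product code.
  expect(answer).toMatch(/magic link/i);
  expect(answer).toMatch(/password/i);
});
```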
The future of dataset tooling
Dataset tooling today suffers from drawbacks that make continuous development painful. In the future, tooling must evolve to make this process easier.
There are two especially big problems that will be solved:
Problem #1: Textual formats (e.g., JSON, CSV) are disconnected from LLM application code.
Let's say we're testing an LLM product for Q&A against a document. Its function signature is `answerQuestion(question: string, document: Document)`. Because the dataset is in a CSV, it has columns `question` and `documentId`, with the latter being difficult to audit or edit without another tool for querying the underlying document. As such, people don't add rows because there's too much friction.
Solution: Make custom columns in datasets that correspond to registered object types in code. Allow for selection and search in the actual application. In this case, “document” would be a picker of document names, corresponding to fixtures in code or database entries.
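One possible sketch of such a software-defined column, with illustrative names (`registeredDocuments`, `QARow`) rather than any specific product's API:

```typescript
// Hypothetical: the `document` field references objects registered in code,
// so tooling can render a picker of document titles instead of asking for
// an opaque documentId string.
interface Document {
  id: string;
  title: string;
  body: string;
}

// Fixtures registered in code (these could equally be database entries).
const registeredDocuments = {
  "billing-faq": { id: "billing-faq", title: "Billing FAQ", body: "..." },
  "auth-guide": { id: "auth-guide", title: "Authentication Guide", body: "..." },
} satisfies Record<string, Document>;

interface QARow {
  question: string;
  document: keyof typeof registeredDocuments; // constrained to registered docs
}

const row: QARow = { question: "How do I reset my password?", document: "auth-guide" };
```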
Problem #2: Important large-scale dataset edits are too difficult.
For example, we might discover that the LLM product performs well in English, but then learn that it's not performing well when the query is in Japanese. How do we create the Japanese dataset without a bunch of scripting or spreadsheets?
Solution: Dataset tooling includes LLM-based assistance, such as the ability to “clone this dataset but translate the query column to Japanese.”
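A minimal sketch of what that bulk edit might look like under the hood; `translateWithLLM` is a stand-in for a call to whichever model provider you use:

```typescript
// Hypothetical: clone a dataset, translating the `query` column to Japanese.
interface Row {
  query: string;
  expectedOutcome: string;
}

declare function translateWithLLM(text: string, targetLanguage: string): Promise<string>;

async function cloneWithTranslatedQueries(rows: Row[]): Promise<Row[]> {
  return Promise.all(
    rows.map(async (row) => ({
      ...row,
      query: await translateWithLLM(row.query, "Japanese"),
    }))
  );
}
```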
Get started
Don't spend months building the perfect "golden" dataset, only to discover it's out of date or missing key use cases and needs to be updated anyway. Instead, start immediately with a small set of test examples and slowly but steadily build up to a comprehensive "golden" dataset. Here's how to set up your first dataset in Gentrace.
Gentrace is building the next generation of dataset tooling, with software-defined columns and LLM assistance for faster editing. If you're interested in working on this problem, please reach out - we're hiring.