Gentrace suffered an evaluator outage that affected all customers who ran tests from Friday, July 26th at 1:42 PM PT to Monday, July 29th at ~4:30 PM PT.
How we discovered it
We did not detect this issue automatically, either before deploying or through our post-deploy alerting. We learned of it when a customer reported the issue on Monday at 3:17 PM PT, and we promoted it to a P0.
Why this happened
Over the last few months, as users have submitted more and more data to Gentrace, we noticed that our databases (ClickHouse and Postgres) were struggling to keep up with the submissions, which often degraded page load times for end users and caused frustrating evaluation delays.
A major source of this load was our task runner. We passed our evaluation tasks only the IDs to operate on, so each task would refetch its data and then perform the evaluation. Those refetches added up to a large share of the load.
To fix this, we introduced a new architecture where all necessary data is included directly in the task. This eliminated the excessive querying because data is included up front rather than re-fetched in the task.
This significantly improved performance but introduced a critical bug: the outputs data was not passed into the evaluator context correctly. Evaluations failed because those values were blank.
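The difference between the two task styles can be sketched as follows. This is a minimal illustration, not our actual task code: `TestResult`, `IdTask`, `DataTask`, `makeFakeDb`, `runIdTask`, and `enqueueDataTasks` are all hypothetical names, and the in-memory store is synchronous only to keep the example short.

```typescript
// Hypothetical shapes for illustration; not Gentrace's actual task types.
interface TestResult {
  id: string;
  inputs: Record<string, string>;
  outputs: Record<string, string>;
}

// Old style: the task payload is only an ID, so each task re-fetches its data.
interface IdTask {
  evaluatorId: string;
  resultId: string;
}

// New style: the task payload carries everything the evaluator needs.
interface DataTask {
  evaluatorId: string;
  data: TestResult;
}

interface FakeDb {
  queryCount(): number;
  getResult(id: string): TestResult;
}

// In-memory stand-in for the database that counts how often it is queried.
function makeFakeDb(rows: TestResult[]): FakeDb {
  const byId = new Map<string, TestResult>();
  for (const row of rows) byId.set(row.id, row);
  let queries = 0;
  return {
    queryCount: () => queries,
    getResult(id: string): TestResult {
      queries += 1;
      const row = byId.get(id);
      if (row === undefined) throw new Error(`unknown result ${id}`);
      return row;
    },
  };
}

// Old style: one query per task.
function runIdTask(task: IdTask, db: FakeDb): TestResult {
  return db.getResult(task.resultId);
}

// New style: one query up front, then no queries inside the tasks.
function enqueueDataTasks(resultId: string, evaluatorIds: string[], db: FakeDb): DataTask[] {
  const data = db.getResult(resultId);
  return evaluatorIds.map((evaluatorId) => ({ evaluatorId, data }));
}
```

With three evaluators on one result, the ID-based style issues three queries and the data-based style issues one. The trade-off is that the code building `data` becomes a new place where fields can be dropped or mapped incorrectly, which is the kind of step where our bug occurred.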
Why we didn’t catch this sooner
Manual testing
Although we manually ran evaluations on our seed data, our testing focused on submitting AI evaluations to gain confidence that basic functionality was working. These AI evaluations were “silently failing”: they still returned the expected evaluation even though the prompt was malformed. Had we looked directly at the generated evaluation prompt, the issue would have been evident.
Automated testing
Our automated testing was built under the assumption that tasks fetch and process their own data. As such, we had tests similar to this:
test('Validate that AIEvaluationTask saves the correct evaluation result', async () => {
  // ...
  await fireTask(
    new AIEvaluationTask(evaluationId)
  );
  // ...
  expect(evaluation.evalLabel).toBe('D');
});
After we introduced the BatchEvaluationTask, we changed our tests to reflect the new way we passed data:
test('Validate that AIEvaluationTask saves the correct evaluation result', async () => {
  // ...
  await fireTask(
    new AIEvaluationTask({
      data: { /* big data payload */ }
    })
  );
  // ...
  expect(evaluation.evalLabel).toBe('D');
});
The problem is that the modified tests only verified that, given correctly formatted data, the evaluation was correct. They did not verify that the data was being formatted correctly in the first place, which is where the bug occurred.
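A check that covers this gap would start from a raw submission rather than a hand-built payload, and would assert on the prompt the model receives, not only on the label. The sketch below shows the idea in plain TypeScript. Every name in it (`Submission`, `TaskPayload`, `buildTaskPayload`, `runEvaluation`, `evaluateSubmission`) and the `{{ name }}` template syntax are assumptions made for illustration, not our actual code.

```typescript
// Hypothetical names throughout; this is not Gentrace's actual code.
interface Submission {
  inputs: Record<string, string>;
  outputs: Record<string, string>;
}

interface TaskPayload {
  template: string;
  values: Record<string, string>;
}

// The step the modified tests skipped: turning a raw submission into task data.
function buildTaskPayload(submission: Submission, template: string): TaskPayload {
  return { template, values: { ...submission.inputs, ...submission.outputs } };
}

// Stand-in for the model call, so the check can inspect the prompt it receives.
type Model = (prompt: string) => string;

// Fill the template from the payload (missing keys become blank) and ask the model.
function runEvaluation(payload: TaskPayload, model: Model): string {
  const prompt = payload.template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (_match: string, key: string) => payload.values[key] ?? "",
  );
  return model(prompt);
}

// Drive the whole path from raw submission to label and return what the model saw.
function evaluateSubmission(submission: Submission, template: string) {
  const prompts: string[] = [];
  const stubModel: Model = (prompt) => {
    prompts.push(prompt);
    return "D"; // returns a label no matter what the prompt contains
  };
  const label = runEvaluation(buildTaskPayload(submission, template), stubModel);
  return { label, prompt: prompts[0] ?? "" };
}
```

In a Jest suite, the body of a test would call `evaluateSubmission` and then use `expect` on both `label` and `prompt`. An assertion on `label` alone would still pass if `buildTaskPayload` dropped the outputs, because the stub returns 'D' regardless. An assertion that the prompt contains the submitted output text would fail.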
Monitoring
Our monitoring differentiates between evaluation failures and task failures. A task failure is where the task itself errors out, whereas an evaluation failure is shown to the user so that they can fix their evaluator.
In this case, because we were interpolating blank data into the evaluators, our monitoring considered it to be an evaluation failure rather than a task failure.
Currently, we only alert on spikes in task failure rather than evaluation failure, as evaluation failure can happen for user-generated reasons. As such, we did not alert here.
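One way to alert on evaluation failures without paging on ordinary user-caused ones is to compare the current failure rate against a recent baseline instead of alerting on any failure at all. The following is a minimal sketch under stated assumptions: `WindowCounts`, `failureRate`, `isFailureSpike`, and the default thresholds are all hypothetical and would need tuning against real traffic.

```typescript
// Counts of evaluations observed in one time window (hypothetical shape).
interface WindowCounts {
  total: number;
  failed: number;
}

// Assumed thresholds for illustration only.
interface SpikeOptions {
  minSamples: number; // ignore windows with too little traffic
  minRate: number;    // absolute failure rate that must be exceeded
  ratio: number;      // how many times the baseline rate counts as a spike
}

const defaultOptions: SpikeOptions = { minSamples: 50, minRate: 0.2, ratio: 3 };

// Fraction of failed evaluations in a window; 0 when the window is empty.
function failureRate(window: WindowCounts): number {
  return window.total === 0 ? 0 : window.failed / window.total;
}

// True when the current window fails far more often than the baseline window.
function isFailureSpike(
  baseline: WindowCounts,
  current: WindowCounts,
  options: SpikeOptions = defaultOptions,
): boolean {
  if (current.total < options.minSamples) return false;
  const baselineRate = failureRate(baseline);
  const currentRate = failureRate(current);
  return currentRate >= options.minRate && currentRate >= baselineRate * options.ratio;
}
```

Requiring both an absolute floor and a multiple of the baseline keeps a steady background of user-caused evaluation failures from alerting, while a sudden jump across all customers still would.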
Improvements moving forward
Most important:
- To improve our automated testing, we plan to add “router-level” integration tests for test result submission that test the entire lifecycle of an evaluation rather than just a single evaluation task.
- We will add anomaly detection and alerting for spikes in evaluation failure rates, not just >0 task failures, as spikes in evaluation failures can also indicate an issue on our end.
Also:
- To improve our manual testing, we will broaden the variety of examples in our seed data so that evaluator failures are more obvious.
- Validate existing task failure monitoring and alerting.