Running local evaluations
Gentrace lets you run evaluations locally. If you need more control over how your evaluations are executed, you can run and manage them directly in your code.
Local evaluation is currently in alpha and may undergo significant changes.
Local evaluation is not yet implemented in the Python SDK.
Using runner.addEval() syntax
Add evals by instrumenting your code using the pipeline runner syntax from our tracing guide. The example below calculates a score from 0 to 1 based on the email's length.
```typescript
import { Pipeline, getTestRunners, submitTestRunners } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";

const PIPELINE_SLUG = "compose";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: PIPELINE_SLUG,
  plugins: { openai: plugin },
});

// Get test runners for all test cases in the pipeline
const testRunners = await getTestRunners(pipeline);

// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  // This is a near type-match of the official OpenAI Node.JS package handle.
  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";
  const wordCount = initialDraft.split(/\s+/).length;

  // Bell curve scoring function to evaluate if the created email has an appropriate length
  // Penalizes emails that are too short or too long, with the ideal length around 100 words
  const calculateScore = (count) => {
    const mean = 100;
    const stdDev = 30;
    const maxScore = 1;
    const zScore = Math.abs(count - mean) / stdDev;
    return Math.max(0, maxScore * Math.exp(-0.5 * Math.pow(zScore, 2)));
  };

  const score = calculateScore(wordCount);

  runner.addEval({
    name: "initial-draft-word-count-score",
    value: score,
  });
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
```
Using the @gentrace/evals library
To simplify eval creation, we built a separate open-source library for computing common evaluations.
```bash
# Execute only one, depending on your package manager
npm i @gentrace/evals
yarn add @gentrace/evals
pnpm i @gentrace/evals
```
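The example in the next section calls evals.llm.base() without showing its import. A minimal sketch is below; it assumes the package exposes a named evals export (an assumption here, so check the package README for the exact export name).

```typescript
// Sketch only: assumes @gentrace/evals exposes a named `evals` export.
import { evals } from "@gentrace/evals";
```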
Currently, we support a single, flexible LLM-as-a-judge evaluation with evals.llm.base().
```typescript
// Function to process each test case
async function processTestCase([runner, testCase]) {
  const { sender, receiver, query } = testCase.inputs;

  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]?.message?.content || "";

  const evalResult = await evals.llm.base({
    name: "email-sentiment-evaluation",
    prompt: `Evaluate the sentiment of the following email draft:

"${initialDraft}"

This email was written by ${sender} to ${receiver} based on the following query:

${query}

Please provide a score and reasoning for your evaluation. Consider the overall tone
and emotional content of the email. Rate the sentiment as one of the following:

- Very Negative
- Negative
- Neutral
- Positive
- Very Positive`,
    scoreAs: {
      "Very Negative": 0,
      "Negative": 0.25,
      "Neutral": 0.5,
      "Positive": 0.75,
      "Very Positive": 1,
    },
  });

  runner.addEval(evalResult);
}

// Process all test cases (you can implement parallelization here if needed)
await Promise.all(testRunners.map(processTestCase));

// Submit all test runners
const result = await submitTestRunners(pipeline, testRunners);
console.log("Test result ID:", result.resultId);
```
Disabling remote evals
If you want to run only local evaluations without triggering remote Gentrace evals, pass triggerRemoteEvals: false in the options when submitting the runners.
```typescript
const result = await submitTestRunners(pipeline, testRunners, {
  triggerRemoteEvals: false,
});
```
Viewing results in Gentrace
Locally created evaluations render in the results UI alongside evaluations run remotely by Gentrace.