Advanced version with tracing
Gentrace offers an advanced SDK that supports tracing multiple outputs and interim steps for evaluation. This is particularly useful for agents and chains.
Example pipeline: Let’s say you have an agent that receives a user chat message, uses it as instructions to crawl through a file structure (making modifications along the way), and then responds to the user. You probably do not want to evaluate just the final chatbot response. You also want to evaluate the actual changes that were made to the files.
Gentrace now supports this via:
- Capturing all “steps” along the way
- Allowing custom processing of “steps” to turn them into processed “outputs” that can be easily evaluated
Capturing steps
In our quickstart, we pull test cases and submit test results using our basic SDK evaluation functions.
To capture multiple outputs/interim steps, you need to change the test script to use the function run_test() in Python or runTest() in TypeScript.
If we're building a simple email compose feature, our AI pipeline might have two stages:
- Create an initial draft of the email
- Simplify the language of the email to make it more readable
We want to evaluate whether the initial draft (from stage 1) preserves the same meaning as the simplified email (stage 2).
Test script modifications
If you set up Gentrace using the simple SDK version from the quickstart or the simple version docs on the previous page, replace any getTestCases() and submitTestResult() invocations with a single runTest() invocation (as shown below).
Note that a single invocation of runTest() creates a single test result report.
If you set up Gentrace using the simple SDK version from the quickstart or the simple version docs on the previous page, replace any get_test_cases() and submit_test_result() invocations with a single run_test() invocation (as shown below).
Note that a single invocation of run_test() creates a single test result report.
- TypeScript
- Python
typescript
import { init, runTest, Configuration } from "@gentrace/core";
import { compose } from "../src/compose"; // TODO: REPLACE WITH YOUR PIPELINE

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_SLUG = "your-pipeline-id";

async function main() {
  // runTest() unifies getTestCases() and submitTestResult() into a single
  // function.
  //
  // The callback passed as the second parameter is invoked once
  // for every test case associated with the pipeline.
  //
  // This assumes you have previously created test cases for the specified
  // pipeline that have three parameters: sender, receiver, and query.
  await runTest(PIPELINE_SLUG, async (testCase) => {
    return compose(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query
    );
  });
}

main();
python
import os

import gentrace
import pipelines

gentrace.init(
    api_key=os.environ["GENTRACE_API_KEY"],
    host="https://gentrace.ai/api/v1",
)


def test_case_callback(test_case):
    return pipelines.compose(
        test_case.get("inputs").get("sender"),
        test_case.get("inputs").get("receiver"),
        test_case.get("inputs").get("query"),
    )


def main():
    # run_test() unifies get_test_cases() and submit_test_result() into a single
    # function.
    #
    # The callback passed as the second parameter is invoked once for every test
    # case associated with the pipeline.
    #
    # This assumes you have previously created test cases for the specified
    # pipeline that have three parameters: sender, receiver, and query.
    gentrace.run_test(pipelines.PIPELINE_SLUG, test_case_callback)


main()
Pipeline feature modifications
Then, you need to modify your compose feature to use a Gentrace pipeline runner to instrument your code.
Here's how you might modify the compose() function.
- TypeScript
- Python
typescript
import { Pipeline } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";

const PIPELINE_SLUG = "compose";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: PIPELINE_SLUG,
  plugins: {
    openai: plugin,
  },
});

await pipeline.setup();

export const compose = async (
  sender: string,
  receiver: string,
  query: string
) => {
  // Runner automatically captures and meters invocations to OpenAI
  const runner = pipeline.start();

  // This is a near type-match of the official OpenAI Node.js package handle.
  const openai = runner.openai;

  // FIRST STAGE (INITIAL DRAFT).
  // Since we're using the OpenAI handle provided by our runner, we capture inputs
  // and outputs automatically as a distinct step.
  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]!.message!.content;

  // SECOND STAGE (SIMPLIFICATION)
  // We also automatically capture inputs and outputs as a step here too.
  const simplificationResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content:
          "Simplify the following email draft for clarity: \n" + initialDraft,
      },
    ],
  });

  const simplification = simplificationResponse.choices[0]!.message!.content;

  await runner.submit();

  return [simplification, runner];
};
runTest() callback must return an [output, runner] array
Keep in mind that the runTest() callback function requires that an array is returned where the second value is the Gentrace TypeScript runner that captures pipeline step information. Notice that the final line of the compose() function above returns [simplification, runner].
python
import os

import gentrace

PIPELINE_SLUG = "compose"

pipeline = gentrace.Pipeline(
    PIPELINE_SLUG,
    openai_config={
        "api_key": os.environ["OPENAI_KEY"],
    },
)

pipeline.setup()


def compose(sender, receiver, query):
    # Runner automatically captures and meters invocations to OpenAI
    runner = pipeline.start()

    # Near type-match of the official OpenAI Python package handle.
    openai = runner.get_openai()

    # FIRST STAGE (INITIAL DRAFT).
    # Since we're using the OpenAI handle provided by our runner, we capture inputs
    # and outputs automatically as a distinct step.
    initial_draft_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": f"Write an email on behalf of {sender} to {receiver}: {query}",
            },
        ],
        model="gpt-3.5-turbo",
    )

    initial_draft = initial_draft_response.choices[0].message.content

    # SECOND STAGE (SIMPLIFICATION)
    # We also automatically capture inputs and outputs as a step here too.
    simplification_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": "Simplify the following email draft for clarity: \n" + initial_draft,
            },
        ],
        model="gpt-3.5-turbo",
    )

    simplification = simplification_response.choices[0].message.content

    runner.submit()

    return [simplification, runner]
run_test() callback must return an [output, runner] list
Keep in mind that the run_test() callback function requires that a two-element list (or tuple) is returned where the second value is the Gentrace Python runner that captures pipeline step information. Notice that the final line of the compose() function above returns [simplification, runner].
Metering custom steps
In TypeScript, the Gentrace pipeline runner tracks OpenAI and Pinecone out-of-the-box through the runner.openai and runner.pinecone handles.
In Python, it tracks them through the runner.get_openai() and runner.get_pinecone() handles.
However, many AI pipelines are more complex than single-step LLM/vector store queries. They might involve multiple network requests to databases/external APIs to construct more complex prompts.
If you want to track network calls to databases or APIs, you can wrap your invocations with runner.measure(). Inputs and outputs are automatically captured as they are passed into measure().
typescript
export const composeEmailForMembers = async (
  sender: string,
  organizationId: string,
  organizationName: string,
) => {
  // Runner captures and meters invocations to OpenAI
  const runner = pipeline.start();

  const usernames = await runner.measure(
    (organizationId) => {
      // Sends a database call to retrieve the names of users in the organization
      return getOrganizationUsernames(organizationId);
    },
    [organizationId],
  );

  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to the members of organization ${organizationName}. These are the names of the members: ${usernames.join(", ")}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]!.message!.content;

  await runner.submit();

  return [initialDraft, runner];
};
python
def compose_email_for_members(sender, organization_id, organization_name):
    runner = pipeline.start()

    usernames = runner.measure(
        # TODO: replace with your own function that you want to measure
        get_organization_usernames,
        org_id=organization_id,
    )

    openai = runner.get_openai()

    initial_draft_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": f"Write an email on behalf of {sender} to the members of "
                f"organization {organization_name}. These are the names of the "
                f"members: {', '.join(usernames)}",
            },
        ],
        model="gpt-3.5-turbo",
    )

    initial_draft = initial_draft_response.choices[0].message.content

    runner.submit()

    return [initial_draft, runner]
Creating the evaluator
Once the test script is created and your feature is properly instrumented, you need to create an evaluator that compares the initial draft email and the simplified email against an evaluation rubric.
We will use an AI (model-graded) evaluator to compare the steps.
Defining a processor
In the image above, we first use a basic processor called "Extract Steps" to transform the two OpenAI completions from the pipeline into the variables processed.initialDraft and processed.simplification, which can then be interpolated into the AI (model-graded) evaluation.
typescript
function process({ outputs, steps }) {
  const processedOutputs = {
    // Convert the output JSON object returned by OpenAI to a string value
    // of only the completion.
    initialDraft: steps[0].outputs.choices[0].message.content,
    simplification: steps[1].outputs.choices[0].message.content,
  };

  return processedOutputs;
}
The return value of any processor is made available to the AI evaluator as the processed object.
Note that the processor runs within Gentrace, not within your own code. Here's how the processor function looks within our UI.
This means that you can use our Python or TypeScript SDKs to write your evaluation code, but the processor must be written in JavaScript.
With this specific processor, two keys are made available to the evaluator: processed.initialDraft (the initial draft as a string) and processed.simplification (the simplified draft as a string).
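To make the mapping concrete, here is a rough sketch of the data the processor operates on and what the evaluator sees afterwards. The email strings below are hypothetical placeholders; the only assumption is that each captured step's outputs field mirrors the OpenAI chat completion response, as in the processor above.
typescript
// Hypothetical contents of `steps` as the processor might receive them.
// The real values come from the OpenAI completions captured by the runner.
const steps = [
  {
    // Step 1: initial draft completion
    outputs: {
      choices: [{ message: { role: "assistant", content: "Dear team, ..." } }],
    },
  },
  {
    // Step 2: simplification completion
    outputs: {
      choices: [{ message: { role: "assistant", content: "Hi team, ..." } }],
    },
  },
];

// After the processor runs, the evaluator can reference:
//   processed.initialDraft    -> "Dear team, ..."
//   processed.simplification  -> "Hi team, ..."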
Defining a prompt
We then interpolate the two keys (processed.initialDraft and processed.simplification) into the following prompt.
handlebars
You are comparing two emails. Here is the data:

[BEGIN DATA]
************
[Task]: Are these emails semantically similar?
************
[First Email]: {{ processed.initialDraft }}
************
[Second Email]: {{ processed.simplification }}
************
[END DATA]

Select one of the two options below:

(A) The first email is essentially semantically identical to the second email
(B) The first email is fundamentally semantically different from the second email
Testing the evaluator
To test that your evaluator is working correctly, you can select a previously observed pipeline output and then press "Evaluate". You will see the evaluator's response in the far-right pane.
Once you're done testing, you can press Finish to create the evaluator.