Advanced version with tracing
Gentrace offers an advanced SDK that supports tracing multiple outputs and interim steps for evaluation. This is particularly useful for agents and chains.
Example pipeline: Let’s say you have an agent that receives a user chat message, uses it as instructions to crawl through a file structure (making modifications along the way), and then responds to the user. You probably do not want to evaluate just the final chatbot response. You also want to evaluate the actual changes that were made to the files.
Gentrace now supports this via:
- Capturing all “steps” along the way
- Allowing custom processing of “steps” to turn them into processed “outputs” that can be easily evaluated
Capturing steps
In our quickstart, we pull test cases and submit test results using our basic SDK evaluation functions.
To capture multiple outputs/interim steps, you need to change the test script to use the function run_test() in Python or runTest() in TypeScript.
If we're building a simple email compose feature, our AI pipeline might have two stages:
- Create an initial draft of the email
- Simplify the language of the email to make it more readable
We want to evaluate whether the initial draft (from stage 1) preserves the same meaning as the simplified email (stage 2).
Test script modifications
If you set up Gentrace using the simple SDK version from the quickstart or the simple version docs on the previous page, replace any getTestCases() and submitTestResult() invocations with a single runTest() invocation (as shown below).
Note that a single invocation of runTest() creates a single test result report.
If you set up Gentrace using the simple SDK version from the quickstart or the simple version docs on the previous page, replace any get_test_cases() and submit_test_result() invocations with a single run_test() invocation (as shown below).
Note that a single invocation of run_test() creates a single test result report.
- TypeScript
- Python
typescript
import { init, runTest, Configuration } from "@gentrace/core";
import { compose } from "../src/compose"; // TODO: REPLACE WITH YOUR PIPELINE

init({
  apiKey: process.env.GENTRACE_API_KEY,
});

const PIPELINE_SLUG = "your-pipeline-id";

async function main() {
  // runTest() unifies getTestCases() and submitTestResult() into a single
  // function.
  //
  // The callback passed as the second parameter is invoked once
  // for every test case associated with the pipeline.
  //
  // This assumes you have previously created test cases for the specified
  // pipeline that have three parameters: sender, receiver, and query.
  await runTest(PIPELINE_SLUG, async (testCase) => {
    return compose(
      testCase.inputs.sender,
      testCase.inputs.receiver,
      testCase.inputs.query
    );
  });
}

main();
python
import os

import gentrace
import pipelines

gentrace.init(
    api_key=os.environ["GENTRACE_API_KEY"],
    host="https://gentrace.ai/api/v1",
)


def test_case_callback(test_case):
    return pipelines.compose(
        test_case.get("inputs").get("sender"),
        test_case.get("inputs").get("receiver"),
        test_case.get("inputs").get("query"),
    )


def main():
    # run_test() unifies get_test_cases() and submit_test_result() into a single
    # function.
    #
    # The callback passed as the second parameter is invoked once for every test
    # case associated with the pipeline.
    #
    # This assumes you have previously created test cases for the specified
    # pipeline that have three parameters: sender, receiver, and query.
    gentrace.run_test(pipelines.PIPELINE_SLUG, test_case_callback)


main()
Pipeline feature modifications
Then, you need to modify your compose feature to use a Gentrace pipeline runner to instrument your code.
Here's how you might modify the compose() function.
- TypeScript
- Python
typescript
import { Pipeline } from "@gentrace/core";
import { initPlugin } from "@gentrace/openai";

const PIPELINE_SLUG = "compose";

const plugin = await initPlugin({
  apiKey: process.env.OPENAI_KEY,
});

const pipeline = new Pipeline({
  slug: PIPELINE_SLUG,
  plugins: {
    openai: plugin,
  },
});

await pipeline.setup();

export const compose = async (
  sender: string,
  receiver: string,
  query: string
) => {
  // Runner automatically captures and meters invocations to OpenAI
  const runner = pipeline.start();

  // This is a near type-match of the official OpenAI Node.js package handle.
  const openai = runner.openai;

  // FIRST STAGE (INITIAL DRAFT).
  // Since we're using the OpenAI handle provided by our runner, we capture inputs
  // and outputs automatically as a distinct step.
  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to ${receiver}: ${query}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]!.message!.content;

  // SECOND STAGE (SIMPLIFICATION)
  // We also automatically capture inputs and outputs as a step here too.
  const simplificationResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content:
          "Simplify the following email draft for clarity: \n" + initialDraft,
      },
    ],
  });

  const simplification = simplificationResponse.choices[0]!.message!.content;

  await runner.submit();

  return [simplification, runner];
};
runTest() callback must return an [output, runner] array
Keep in mind that the runTest() callback function requires that an array is returned where the second value is the Gentrace TypeScript runner that captures pipeline step information. Notice that the final line of the compose() function above returns [simplification, runner].
python
import os

import gentrace

PIPELINE_SLUG = "compose"

pipeline = gentrace.Pipeline(
    PIPELINE_SLUG,
    openai_config={
        "api_key": os.environ["OPENAI_KEY"],
    },
)

pipeline.setup()


def compose(sender, receiver, query):
    # Runner automatically captures and meters invocations to OpenAI
    runner = pipeline.start()

    # Near type-match of the official OpenAI Python package handle.
    openai = runner.get_openai()

    # FIRST STAGE (INITIAL DRAFT).
    # Since we're using the OpenAI handle provided by our runner, we capture inputs
    # and outputs automatically as a distinct step.
    initial_draft_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": f"Write an email on behalf of {sender} to {receiver}: {query}",
            },
        ],
        model="gpt-3.5-turbo",
    )

    initial_draft = initial_draft_response.choices[0].message.content

    # SECOND STAGE (SIMPLIFICATION)
    # We also automatically capture inputs and outputs as a step here too.
    simplification_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": "Simplify the following email draft for clarity: \n" + initial_draft,
            },
        ],
        model="gpt-3.5-turbo",
    )

    simplification = simplification_response.choices[0].message.content

    runner.submit()

    return [simplification, runner]
run_test() callback must return an [output, runner] list
Keep in mind that the run_test() callback function requires that a two-element list (or tuple) is returned where the second value is the Gentrace Python runner that captures pipeline step information. Notice that the final line of the compose() function above returns [simplification, runner].
Metering custom steps
In TypeScript, the Gentrace pipeline runner tracks OpenAI and Pinecone out-of-the-box through the runner.openai and runner.pinecone handles.
In Python, it tracks them through the runner.get_openai() and runner.get_pinecone() handles.
However, many AI pipelines are more complex than single-step LLM/vector store queries. They might involve multiple network requests to databases/external APIs to construct more complex prompts.
If you want to track network calls to databases or APIs, you can wrap your invocations with runner.measure(). Inputs and outputs are automatically captured as they are passed into measure().
typescript
export const composeEmailForMembers = async (
  sender: string,
  organizationId: string,
  organizationName: string,
) => {
  // Runner captures and meters invocations to OpenAI
  const runner = pipeline.start();

  const usernames = await runner.measure(
    (organizationId) => {
      // Sends a database call to retrieve the names of users in the organization
      return getOrganizationUsernames(organizationId);
    },
    [organizationId],
  );

  const openai = runner.openai;

  const initialDraftResponse = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.8,
    messages: [
      {
        role: "system",
        content: `Write an email on behalf of ${sender} to the members of organization ${organizationName}. These are the names of the members: ${usernames.join(", ")}`,
      },
    ],
  });

  const initialDraft = initialDraftResponse.choices[0]!.message!.content;

  await runner.submit();

  return [initialDraft, runner];
};
python
def compose_email_for_members(sender, organization_id, organization_name):
    runner = pipeline.start()

    usernames = runner.measure(
        # TODO: replace with your own function that you want to measure
        get_organization_usernames,
        org_id=organization_id,
    )

    openai = runner.get_openai()

    initial_draft_response = openai.ChatCompletion.create(
        messages=[
            {
                "role": "system",
                "content": f"Write an email on behalf of {sender} to the members of "
                f"organization {organization_name}. These are the names of the "
                f"members: {', '.join(usernames)}",
            },
        ],
        model="gpt-3.5-turbo",
    )

    initial_draft = initial_draft_response.choices[0].message.content

    runner.submit()

    return [initial_draft, runner]
Creating the evaluator
Once the test script is created and your feature is properly instrumented, you need to create an evaluator that compares the initial draft email and the simplified email against an evaluation rubric.
We will use an AI (model-graded) evaluator to compare the steps.
Defining a processor
In the image above, we first use a basic processor called "Extract Steps" to transform the two OpenAI completions from the pipeline into the variables processed.initialDraft and processed.simplification, which can then be interpolated into the AI (model-graded) evaluation.
typescript
function process({ outputs, steps }) {
  const processedOutputs = {
    // Convert the output JSON object returned by OpenAI to a string value
    // of only the completion.
    initialDraft: steps[0].outputs.choices[0].message.content,
    simplification: steps[1].outputs.choices[0].message.content,
  };

  return processedOutputs;
}
The return value of any processor is made available to the AI evaluator as the processed object.
Note that the processor runs within Gentrace, not within your own code. Here's how the processor function looks within our UI.
This means that you can use our Python or TypeScript SDKs to write your evaluation code, but the processor must be written in JavaScript.
With this specific processor, two keys are made available to the evaluator: processed.initialDraft (the initial draft as a string) and processed.simplification (the simplified draft as a string).
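To make the mapping concrete, here is a rough sketch of the data the processor operates on and what the evaluator sees afterwards. The email strings below are hypothetical placeholders; the only assumption is that each captured step's outputs field mirrors the OpenAI chat completion response, as in the processor above.
typescript
// Hypothetical contents of `steps` as the processor might receive them.
// The real values come from the OpenAI completions captured by the runner.
const steps = [
  {
    // Step 1: initial draft completion
    outputs: {
      choices: [{ message: { role: "assistant", content: "Dear team, ..." } }],
    },
  },
  {
    // Step 2: simplification completion
    outputs: {
      choices: [{ message: { role: "assistant", content: "Hi team, ..." } }],
    },
  },
];

// After the processor runs, the evaluator can reference:
//   processed.initialDraft    -> "Dear team, ..."
//   processed.simplification  -> "Hi team, ..."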
Defining a prompt
We then interpolate the two keys (processed.initialDraft and processed.simplification) into the following prompt.
handlebars
You are comparing two emails. Here is the data:

[BEGIN DATA]
************
[Task]: Are these emails semantically similar?
************
[First Email]: {{ processed.initialDraft }}
************
[Second Email]: {{ processed.simplification }}
************
[END DATA]

Select one of the two options below:

(A) The first email is essentially semantically identical to the second email
(B) The first email is fundamentally semantically different from the second email
Testing the evaluator
To test that your evaluator is working correctly, you can select a previously observed pipeline output and then press "Evaluate". You will see the evaluator's response in the far-right pane.
Once you're done testing, you can press Finish to create the evaluator.