Revamped, realtime test results

    Over the past several months, you let us know that Gentrace's results view was too complex. This made it hard for users to adopt Gentrace without being taught by someone else.

    To solve this, Gentrace has revamped the core test result UI, decomposing the old, cluttered test result view into three different ones for better clarity.

    We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.

    Aggregate comparison


    The new aggregate view shows the statistical differences between different versions of your LLM-based feature.

    List of test cases with evaluations

    The new list view shows all of the scores for different test cases.


    When doing a comparison, it highlights what changed between the versions being compared.


    Drilldown per test case


    The new drilldown view presents a clear picture of the output(s) for a specific test case. It includes the JSON representation, evaluations, and timeline.

    Accessing the legacy view

    For the time being, you can access the old view by clicking the "Legacy results page" link at the top of the new results view. Please provide us with feedback if you find yourself going back to that view.

    Other changes

    • Performance work: Gentrace is much faster in many views
    • gpt-4o is now supported as a grader in the playground beta
    • Editing a processor will now show which evaluators will be affected
    • Improved human evaluation experience, particularly when taking notes
    • Fixed 83 bugs

    Comparative (head-to-head) evaluators

    Gentrace now supports evaluators that compare two outputs for head-to-head types of analysis.

    Use them to determine which of two experiments is better using an LLM or heuristics.

    Read more
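    Conceptually, a heuristic head-to-head evaluator is just a function that takes two outputs and names a winner. A toy sketch of the idea (the function shape here is illustrative, not Gentrace's evaluator API):

```python
def compare_outputs(output_a: str, output_b: str) -> str:
    """Toy heuristic head-to-head evaluator: prefer the answer that is
    non-empty and more concise. Returns "A", "B", or "TIE"."""
    a_ok, b_ok = bool(output_a.strip()), bool(output_b.strip())
    if a_ok != b_ok:
        return "A" if a_ok else "B"
    if len(output_a) == len(output_b):
        return "TIE"
    return "A" if len(output_a) < len(output_b) else "B"

print(compare_outputs("Paris.", "The capital of France is Paris."))  # → "A"
```

    An LLM-based comparative evaluator would replace the length heuristic with a prompt that asks a model to pick the better of the two outputs.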

    Classification evaluators

    Gentrace now supports classification evaluators that do a simple comparison of classification outputs and generate more robust aggregate metrics like precision, recall, and F1.

    Read more
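    Given paired expected and predicted labels, those aggregate metrics reduce to counting true positives, false positives, and false negatives. A quick illustration using the standard definitions (not Gentrace internals):

```python
from collections import Counter

def classification_metrics(expected, predicted, positive_label):
    """Compute precision, recall, and F1 for one class from paired labels."""
    counts = Counter()
    for exp, pred in zip(expected, predicted):
        if pred == positive_label and exp == positive_label:
            counts["tp"] += 1
        elif pred == positive_label:
            counts["fp"] += 1
        elif exp == positive_label:
            counts["fn"] += 1
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: grading a spam classifier's predictions against expected labels
expected  = ["spam", "spam", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(classification_metrics(expected, predicted, "spam"))
```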

    Better code editor and Python support


    We've upgraded the Gentrace code editor to dramatically improve the code writing experience. Gentrace also now supports Python for writing processors and evaluators.

    Creating test data from live runs

    Gentrace supports creating test data directly from your live runs.

    When you create test data from live data, Gentrace allows you to edit the data and choose the pipeline to which it gets saved.

    Read more

    Other changes

    • Significant performance work on data ingestion, the results view, and the live view
    • Revamped filters and added many new ones (especially in the live view)
    • Updated the daily AI evaluation limit for production evaluation to be per evaluator (rather than per organization)
    • Streaming evaluator outputs during testing
    • Save expected and actual outputs in test cases on blur
    • Added boolean metadata type
    • CSV export of test results now includes expected values
    • Fixed 73 bugs

    Image evaluators

    Gentrace now supports evaluating images using GPT-4 with Vision.

    Example use cases:

    • Validate that a generated webpage has certain visual characteristics (written as a rubric on an expected value)
    • Compare a generated image to a baseline image to see if a new model produces higher quality

    Read more

    Evaluator and processor templates


    Evaluator and processor templates are now available to be shared between pipelines, users, and teams.

    If you create the perfect factualness check, or a very useful conciseness evaluator, consider creating a template to share with the rest of your organization.

    ISO 27001 Compliance

    Gentrace has completed its audit and is now certified as ISO 27001 compliant. Read more.

    New blog and docs

    To share our thinking on best practices in generative AI development, we've created a blog. We've also re-organized our docs and added an AI-powered search.

    Our first two blog posts are now available:

    Other changes

    • Grader role: for teammates who should be able to manually grade outputs but not build / edit pipelines
    • When testing evaluators during creation or edit, output now streams to the client
    • More evaluator rerun options on test result (missing evaluations, errored evaluations, all evaluations)
    • Show instructions for humans in the test result -> run evaluation side bar
    • Always make right bar statistics available on a test result
    • Collapse the left bar when you want more real estate
    • Filter live data by metadata
    • Download live data as CSV
    • Control which evaluators run on which test cases with processors
    • Fixed 34 bugs

    Production evaluation

    Gentrace now supports running evaluations in production. Configure AI, Heuristic, or Human evaluators as before, then specify a sampling rate to run them continuously in production.

    With production evaluation, we are also launching a revamped live production data view. This allows you to view production outputs inline and customize display options.
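    Sampling-based production evaluation boils down to a weighted coin flip per run. A sketch of the idea (`should_evaluate` is a hypothetical name for illustration, not the SDK API):

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether to run evaluators on a given production output."""
    return rng.random() < sampling_rate

# At a 10% sampling rate, roughly 1 in 10 production runs gets evaluated.
rng = random.Random(0)
sampled = sum(should_evaluate(0.1, rng) for _ in range(10_000))
print(f"evaluated {sampled} of 10000 runs")
```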

    SSO & SCIM

    Gentrace now supports SSO via OIDC and SCIM for user provisioning.

    Read more in our docs (SSO, SCIM).

    External evaluations (Import or API)


    Gentrace now supports importing evaluations or receiving them via our API.

    Create an evaluator, then use the "Import evaluations" option in our UI to import from CSV, JSONL, or JSON.

    Alternatively, use our API to programmatically upload grades to the evaluator.
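    As a rough illustration, a JSONL import pairs each run with a grade, one evaluation per line. The field names below are hypothetical, not Gentrace's documented schema; consult the docs for the real format:

```python
import json

# Illustrative JSONL payload: one evaluation per line. Field names here
# are hypothetical -- see the Gentrace docs for the actual schema.
jsonl = """\
{"runId": "run-1", "evaluator": "factualness", "value": 0.9}
{"runId": "run-2", "evaluator": "factualness", "value": 0.4}
"""

evaluations = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
average = sum(e["value"] for e in evaluations) / len(evaluations)
print(f"parsed {len(evaluations)} evaluations, average grade {average:.2f}")
```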

    Other changes

    • Added display options to result / live views
    • Made display options saveable
    • Added blind display option for unbiased, comparative grading
    • Made side bar more concise
    • Revamped top nav bar to be more concise and functional
    • Pipeline slugs are now automatically computed during creation (can be customized under "Advanced")
    • Updated pricing for OpenAI models
    • Run a test with a single-case / few-cases
    • Show grades without hovering in comparative results graphs
    • Pick folder while creating new pipeline
    • Edit result names inline
    • Fixed 27 bugs

    Evaluate UI 2.0

    We've improved the way you interact with Evaluate results and evaluations, collectively reflecting the 2.0 release of our Evaluate UI.

    We've improved the evaluation mini UI by adding:

    • More concise, appealing design
    • Inline editing
    • The ability to add notes

    We've improved the overall result view with:

    • Resizable first column
    • Better expanded view
    • More concise compact view
    • Support for archiving and deleting test results

    Finally, we've improved evaluator creation and testing:

    • Enhanced heuristic evaluator debugging with console.log statements
    • Improved AI evaluator debugging with OpenAI error statements
    • Edit evaluators without cloning
    • Edit and delete processors

    New docs


    We've launched a new docs site. The new docs feature a cleaner design, a new SDK section, and better organization.

    We are keeping the old docs up for 2 more weeks, but then they will be removed and redirected to the new ones. Let us know if there's any way we can make the new docs better.

    OpenAI DevDay features

    We've made a series of changes in response to OpenAI DevDay:

    • We updated GPT cost information in our Observe product
    • We now support more models for AI evaluation, including GPT-4 Turbo (128k)

    We're also adding support for the following soon:

    • Assistants API observability
    • Image AI grading with GPT-4 Vision



    Folders

    Gentrace now has folders for organizing pipelines across different teams. Folders can be nested and collapsed or expanded in the left bar.

    Other changes

    • Added some new v2 API routes - they are more standardized, and we will gradually migrate all routes to v2
    • Fixed 29 bugs

    Multi-modal evaluation

    Evaluate image-to-text generative pipelines by uploading images into Gentrace test cases (programmatically or in our UI).

    Gentrace invokes your generative pipeline on these image test cases. Text outputs are then uploaded to Gentrace and evaluated the usual way.

    Better organizing


    We've made a series of changes to improve Gentrace organizations for larger teams:

    • Personal pipelines: pipelines can now be scoped to an individual for experimentation
    • Pipeline cloning: you can now clone pipelines (alongside their test cases and evaluators) to new private or team pipelines.
    • Editor role: Gentrace now supports three roles: Admin, Editor, and Viewer. You can learn more about these roles in our docs.

    Threading (e.g. Chats / Conversations)


    To better track chats, conversations, and other multi-turn interactions with an AI feature / product, Gentrace SDKs and APIs now support linking together pipeline runs into Threads.

    Learn more in our docs.
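    Conceptually, a Thread is a shared identifier linking pipeline runs. A sketch of that data model (field names are illustrative, not the SDK's):

```python
from collections import defaultdict

# Hypothetical shape: each pipeline run carries an optional thread id.
runs = [
    {"id": "run-1", "threadId": "chat-42", "output": "Hi! How can I help?"},
    {"id": "run-2", "threadId": "chat-42", "output": "Sure, here's an example."},
    {"id": "run-3", "threadId": None, "output": "One-off completion."},
]

# Group runs by thread to reconstruct a multi-turn conversation.
threads = defaultdict(list)
for run in runs:
    if run["threadId"] is not None:
        threads[run["threadId"]].append(run["id"])

print(dict(threads))  # {'chat-42': ['run-1', 'run-2']}
```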



    Metadata

    Gentrace now supports arbitrary metadata on pipeline runs, evaluation runs, and overall evaluation results. Metadata can be added and edited via our SDK/API or in the UI.

    Metadata can be used to:

    • Include environmental information about where the test was run
    • Preformat / prettify data
    • Specify prompt or model versions
    • Link to additional debugging context
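    For instance, a run's metadata covering the uses above might look like the following (the keys are illustrative, not a required schema):

```python
# Hypothetical metadata attached to a pipeline run; keys are illustrative.
metadata = {
    "environment": "ci",                          # where the test was run
    "promptVersion": "v3",                        # prompt or model version
    "model": "gpt-4-turbo",
    "traceUrl": "https://example.com/debug/123",  # link to debugging context
    "cacheHit": True,                             # boolean metadata type
}

# Metadata like this can drive filtering, e.g. select only CI runs:
runs = [
    {"id": "run-1", "metadata": metadata},
    {"id": "run-2", "metadata": {"environment": "local"}},
]
ci_runs = [r["id"] for r in runs if r["metadata"].get("environment") == "ci"]
print(ci_runs)  # ['run-1']
```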

    Other changes

    • We now integrate with Rivet for evaluating and tracing Rivet graphs - learn more in the docs
    • Evaluation results now automatically receive an average score
    • Migrated the User role to the Viewer role (read-only)
    • Added the Editor role, which can upload to / edit pipeline data but cannot administer
    • Prettier rendering of OpenAI function calls
    • Added a raw (JSON) view on timeline steps
    • Added a mechanism to render HTML outputs as HTML
    • Fixed 29 bugs

    Agent tracing (timeline)

    Gentrace now supports better tracing with our timeline view. This view shows all of the steps that occur when an agent runs (for example, which LLM calls occurred and which tools were used). It is especially useful for tracing agents and chains.

    In Observe, open a run in the right bar and click "Expand" or double-click a run to open the timeline view.

    In Evaluate, a condensed timeline view now shows up at the bottom of runs, which can be expanded.

    To take advantage of timeline, use our advanced SDK (Node, Python) to link together steps.
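    Conceptually, the timeline is a tree of nested steps rendered top to bottom. An illustrative sketch (the step structure here is hypothetical, not the SDK's format):

```python
# Illustrative nested steps captured from one agent run.
steps = [
    {"name": "agent", "children": [
        {"name": "llm:plan", "children": []},
        {"name": "tool:search", "children": []},
        {"name": "llm:answer", "children": []},
    ]},
]

def render(steps, depth=0, lines=None):
    """Flatten nested steps into an indented timeline, top to bottom."""
    lines = [] if lines is None else lines
    for step in steps:
        lines.append("  " * depth + step["name"])
        render(step["children"], depth + 1, lines)
    return lines

print("\n".join(render(steps)))
```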

    Multi-comparison; speed & cost comparison


    Gentrace now supports comparing many results at once and creating comparative graphs. This is particularly useful when considering many models, model versions, or many prompt options at once.

    To start a comparison, click into any evaluate result, then click "Add comparison" at the top. Charts will pop in from the right side.

    Note that charts vary with the evaluator type: you'll see a bar chart for "Options" evaluators (e.g. those returning "A", "B", "C") and swarm and box plots for "Percentage" evaluators (those returning a number between 0 and 1).

    In addition, Gentrace Evaluate now shows the speed and cost of runs and aggregates them on results. You'll need to use our advanced SDK (Node, Python).

    Other changes

    • OpenAI Node v4 SDK support (see Installation)
    • More useful, prettier rendering of Pinecone queries
    • Show feedback text details in Observe
    • Choose row height in Evaluate results
    • Updated OpenAI pricing
    • Made more right bars resizable (all coming soon)
    • Changed "raw" toggle to a better UI element
    • Fixed 14 bugs

    Multi-output / interim step evaluation

    Gentrace now supports multi-output and interim step evaluation. This is particularly useful for agents and chains.

    Let’s say you have an agent that receives a user chat, uses that as instructions to crawl through a file structure (making modifications along the way), and then responds to the user. You probably do not want to evaluate just the end chatbot response. You want to evaluate the actual changes that were made to the files.

    Gentrace now supports this via:

    • Capturing all “steps” along the way
    • Allowing custom processing of “steps” to turn them into “outputs” that can be easily evaluated

    Read more in the docs.
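    Conceptually, a processor is a function from captured steps to evaluable outputs. For the file-editing agent above, a sketch might look like this (the step shapes are illustrative, not Gentrace's format):

```python
# Hypothetical captured steps from the file-editing agent described above.
steps = [
    {"type": "chat", "content": "Rename config.yml to config.yaml"},
    {"type": "file_edit", "path": "config.yml", "action": "rename",
     "newPath": "config.yaml"},
    {"type": "chat", "content": "Done! I renamed the file."},
]

def file_changes_processor(steps):
    """Processor sketch: select only the file edits so evaluators grade
    the actual changes made, not just the final chat response."""
    return [s for s in steps if s["type"] == "file_edit"]

outputs = file_changes_processor(steps)
print(outputs)  # just the one file_edit step
```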

    Paid open beta (& legal)

    Gentrace has now ingested millions of generated data points and is working with many large and small companies.

    As a result, we’re transitioning to a paid open beta. You can view details about our Standard pay-as-you-go pricing model here.

    • New users: will receive a 14-day trial
    • Existing users: by default, will receive trial access until October 1
    • Larger companies: reach out ([email protected]) so we can help customize our trial process to your vendor process

    Please let us know if you have questions and/or feedback about our Standard pricing.

    Additionally, we launched our terms of service, privacy policy, DPA, subprocessors list, and SOC 2 page. Let us know if you have any feedback.


    Other changes

    • Added testRun SDK method for capturing steps with test outputs
    • Added processors in evaluators for converting any combination of data from any steps to outputs for evaluation
    • Added multiple expected outputs
    • Unified evaluate and observe under the same unit (“Pipeline”)
    • Added option to compare between arbitrary test results (not just to “main”)
    • Added a tabular editor option for editing inputs and expected outputs (makes copy / pasting blocks of text with new lines much easier)
    • Swapped the naming of result and run in Evaluate to be consistent with Observe
    • Added labels on Pipelines, and made it possible to query Pipelines by label in the SDK
    • Added a pricing page
    • Added a billing page (admin only) where you can manage your plan, update your usage limit, and view invoices
    • Fixed 12 bugs

    Gentrace Evaluate

    Evaluate, our continuous evaluation tool, is now in early access.

    Check out the video.

    Gentrace Evaluate tests generative AI pipelines like normal code using model-graded evaluation.

    It replaces spreadsheets and the many hours of manual work needed to grade those test cases.

    In Evaluate, you:

    • Create a set of test cases
    • Write a test script that pulls down those test cases, runs them through your generative pipeline, and uploads the results
    • Use customizable evaluators to automatically grade the results
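    The three steps above can be sketched as a test script like this (`get_test_cases`, `run_pipeline`, and `submit_results` are stand-ins for illustration, not the real SDK calls; see our docs for actual signatures):

```python
# Sketch of an Evaluate test script. The three functions below are
# hypothetical stand-ins for the real Gentrace SDK calls.

def get_test_cases():
    """Pull down the test cases for a pipeline."""
    return [{"id": "tc-1", "inputs": {"question": "What is 2 + 2?"}}]

def run_pipeline(inputs):
    """Your generative pipeline; in practice this would call an LLM."""
    return "4"

def submit_results(results):
    """Upload results so evaluators can grade them automatically."""
    print(f"uploading {len(results)} results for grading")

results = []
for case in get_test_cases():
    output = run_pipeline(case["inputs"])
    results.append({"caseId": case["id"], "output": output})

submit_results(results)
```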

    You can try Evaluate in Gentrace today. If you’re not already in early access, reach out to [email protected].


    Welcome to Releases

    Welcome to the releases page. We'll periodically update this page with release notes as we ship features and fix bugs in Gentrace.

    Gentrace alpha


    Gentrace is an evaluation and observability tool for generative AI builders, now in alpha.

    If you're interested in participating, please email [email protected].