Evaluate UI 2.0
We've improved how you interact with Evaluate results and evaluations. Together, these changes make up the 2.0 release of our Evaluate UI.
We've improved the evaluation mini UI by adding:
- More concise, appealing design
- Inline editing
- The ability to add notes
We've improved the overall result view with:
- Resizable first column
- Better expanded view
- More concise compact view
- Support for archiving and deleting test results
Finally, we've improved evaluator creation and testing:
- Enhanced heuristic evaluator debugging with console.log statements
- Improved AI evaluator debugging with OpenAI error statements
- Edit evaluators without cloning
- Edit and delete processors
We've launched a new docs site. The new docs feature a cleaner design, a new SDK section, and better organization.
We are keeping the old docs up for 2 more weeks, but then they will be removed and redirected to the new ones. Let us know if there's any way we can make the new docs better.
OpenAI DevDay features
We've made a series of changes in response to OpenAI DevDay:
- We updated GPT cost information in our Observe product
- We now support more models for AI evaluation, including GPT-4 Turbo (128k)
We're also adding support for the following soon:
- Assistants API observability
- Image AI grading with GPT-4 Vision
Gentrace now has folders for better organizing pipelines across different teams. Folders can be nested and collapsed/shown in the left bar.
- Added some new v2 API routes - they are more standardized, and we will gradually migrate all routes to v2
- Fixed 29 bugs
Evaluate image-to-text generative pipelines by uploading images into Gentrace test cases (programmatically or in our UI).
Gentrace invokes your generative pipeline on these image test cases. Text outputs are then uploaded to Gentrace and evaluated the usual way.
We've made a series of changes to improve Gentrace organizations for larger teams:
- Personal pipelines: pipelines can now be scoped to an individual for experimentation
- Pipeline cloning: you can now clone pipelines (alongside their test cases and evaluators) to new private or team pipelines.
- Editor role: Gentrace now supports three roles: Admin, Editor, and Viewer. You can learn more about these roles in our docs.
Threading (e.g. Chats / Conversations)
To better track chats, conversations, and other multi-turn interactions with an AI feature / product, Gentrace SDKs and APIs now support linking together pipeline runs into Threads.
Learn more in our docs.
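To illustrate the idea of linking runs into Threads, here is a minimal sketch in plain Python. It does not use the Gentrace SDK; the `PipelineRun` class and its `thread_id` and `turn` fields are hypothetical names chosen for this example only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PipelineRun:
    run_id: str
    thread_id: str  # hypothetical field: runs in one conversation share this value
    turn: int       # position of the run within the conversation
    output: str


def group_into_threads(runs):
    """Group runs by thread_id and order each thread by turn."""
    threads = defaultdict(list)
    for run in runs:
        threads[run.thread_id].append(run)
    return {tid: sorted(rs, key=lambda r: r.turn) for tid, rs in threads.items()}


runs = [
    PipelineRun("r2", "chat-1", 2, "Sure, here is more detail."),
    PipelineRun("r1", "chat-1", 1, "Hello! How can I help?"),
    PipelineRun("r3", "chat-2", 1, "Welcome back."),
]
threads = group_into_threads(runs)
print([r.run_id for r in threads["chat-1"]])  # ['r1', 'r2']
```

Each pipeline run stays an independent record; the shared identifier is what lets a multi-turn conversation be read back in order.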
Gentrace now supports arbitrary metadata on pipeline runs, evaluation runs, and total evaluation results. Metadata can be added and edited in our SDK/API or via the UI.
Metadata can be used to:
- Include environmental information about where the test was run
- Preformat / prettify data
- Specify prompt or model versions
- Link to additional debugging context
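As a rough illustration of the uses listed above, the sketch below assembles a metadata dictionary for a single run. It is plain Python rather than the Gentrace SDK, and the key names (`environment`, `promptVersion`, `model`, `debugUrl`) and the example URL are assumptions made for this example; check the docs for the shape the SDK/API actually expects.

```python
import json
import platform


def build_run_metadata(prompt_version, model, debug_url):
    """Assemble a metadata dictionary for one run (hypothetical key names)."""
    return {
        # Environmental information about where the test was run
        "environment": {
            "python": platform.python_version(),
            "os": platform.system(),
        },
        # Prompt and model versions
        "promptVersion": prompt_version,
        "model": model,
        # Link to additional debugging context
        "debugUrl": debug_url,
    }


metadata = build_run_metadata("v3", "gpt-4", "https://example.com/logs/123")
print(json.dumps(metadata, indent=2))
```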
- We now integrate with Rivet for evaluating and tracing Rivet graphs - learn more in the docs
- Evaluation results now automatically receive an average score
- Migrated the User role to the Viewer role (read-only)
- Added the Editor role which can upload to / edit pipeline data but cannot administer (see details:
- Prettier rendering of OpenAI function calls
- Added a raw (JSON) view on timeline steps
- Added a mechanism to render HTML outputs as HTML
- Fixed 29 bugs
Agent tracing (timeline)
Gentrace now supports better tracing with our timeline view. This view shows all of the steps that occur when an agent runs (for example, which LLM calls occurred and which tools were used). It is especially useful for tracing agents and chains.
In Observe, open a run in the right bar and click "Expand" or double-click a run to open the timeline view.
In Evaluate, a condensed timeline view now shows up at the bottom of runs, which can be expanded.
Multi-comparison; speed & cost comparison
Gentrace now supports comparing many results at once and creating comparative graphs. This is particularly useful when considering many models, model versions, or many prompt options at once.
To start a comparison, click into any evaluate result, then click "Add comparison" at the top. Charts will pop in from the right side.
Note that charts vary with the evaluator type - you'll see the bar chart for "Options" evaluators (e.g. return "A", "B", "C") and the swarm and box plots for "Percentage" evaluators (return a number between 0 and 1).
- OpenAI Node v4 SDK support (see Installation)
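To show why the two evaluator types lead to different charts, here is a small sketch of the summaries each chart would be drawn from. This is illustrative Python, not Gentrace code; the function names are hypothetical, and the quartiles come from the standard library's `statistics.quantiles`.

```python
from collections import Counter
from statistics import quantiles


def summarize_options(labels):
    """Count how often each option was returned (the data behind a bar chart)."""
    return dict(Counter(labels))


def summarize_percentages(scores):
    """Return min, quartiles, and max (the data behind a box plot)."""
    if any(not 0 <= s <= 1 for s in scores):
        raise ValueError("percentage scores must lie between 0 and 1")
    q1, median, q3 = quantiles(scores, n=4)
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


option_summary = summarize_options(["A", "B", "A", "C", "A"])
percentage_summary = summarize_percentages([0.2, 0.5, 0.7, 0.9, 1.0])
print(option_summary)  # {'A': 3, 'B': 1, 'C': 1}
print(percentage_summary)
```

Categorical outputs only support counting, so a bar chart fits; numeric scores have a distribution, which box and swarm plots show directly.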
- More useful, prettier rendering of Pinecone queries
- Show feedback text details in Observe
- Choose row height in Evaluate results
- Updated OpenAI pricing
- Made more right bars resizable (all coming soon)
- Changed "raw" toggle to a better UI element
- Fixed 14 bugs
Multi-output / interim step evaluation
Gentrace now supports multi-output and interim step evaluation. This is particularly useful for agents and chains.
Let’s say you have an agent that receives a user chat, uses that as instructions to crawl through a file structure (making modifications along the way), and then responds to the user. You probably do not want to evaluate just the end chatbot response. You want to evaluate the actual changes that were made to the files.
Gentrace now supports this via:
- Capturing all “steps” along the way
- Allowing custom processing of “steps” to turn them into “outputs” that can be easily evaluated
Read more in the docs.
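For the file-editing agent described above, a processor could look roughly like the sketch below. It is plain Python for illustration only: the step dictionaries, the `type` values, and the `file_changes_processor` name are all hypothetical and do not reflect how Gentrace stores steps or defines processors.

```python
def file_changes_processor(steps):
    """Hypothetical processor: collapse file-edit steps into final file contents."""
    files = {}
    for step in steps:
        if step["type"] == "write_file":
            # Later writes to the same path overwrite earlier ones
            files[step["path"]] = step["content"]
        elif step["type"] == "delete_file":
            files.pop(step["path"], None)
        # Other step types (e.g. LLM calls) are ignored by this processor
    return {"files": files}


steps = [
    {"type": "llm_call", "output": "I will update the README."},
    {"type": "write_file", "path": "README.md", "content": "# Draft"},
    {"type": "write_file", "path": "README.md", "content": "# Final"},
    {"type": "write_file", "path": "tmp.txt", "content": "scratch"},
    {"type": "delete_file", "path": "tmp.txt"},
    {"type": "llm_call", "output": "Done, I updated the README."},
]
outputs = file_changes_processor(steps)
print(outputs)  # {'files': {'README.md': '# Final'}}
```

An evaluator can then grade `outputs["files"]` against the expected file contents instead of grading only the final chat response.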
Paid open beta (& legal)
Gentrace has now ingested millions of generated data points and is working with many large and small companies.
As a result, we’re transitioning to a paid open beta. You can view details about our Standard pay-as-you-go pricing model here.
- New users: will receive 14-day trials
- Existing users: by default, will receive trial access until October 1
- Larger companies: reach out ([email protected]) so we can help customize our trial process to your vendor process
Please let us know if you have questions and/or feedback about our Standard pricing.
- Added testRun SDK method for capturing steps with test outputs
- Added processors in evaluators for converting any combination of data from any steps to outputs for evaluation
- Added multiple expected outputs
- Unified evaluate and observe under the same unit (“Pipeline”)
- Added option to compare between arbitrary test results (not just to “main”)
- Added a tabular editor option for editing inputs and expected outputs (makes copy / pasting blocks of text with new lines much easier)
- Swapped the naming of result and run in Evaluate to be consistent with Observe
- Added labels on Pipelines, and made it possible to query Pipelines by label in the SDK
- Added a pricing page
- Added a billing page (admin only) where you can manage your plan, update your usage limit, and view invoices
- Fixed 12 bugs
Evaluate, our continuous evaluation tool, is now in early access.
Gentrace Evaluate tests generative AI pipelines like normal code using model-graded evaluation.
It replaces spreadsheets and many hours of manual work to grade those test cases.
In Evaluate, you:
- Create a set of test cases
- Write a test script that pulls down those test cases, runs them through your generative pipeline, and uploads the results
- Use customizable evaluators to automatically grade the results
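The three steps above can be sketched as a small, self-contained loop. Everything here is a stand-in: `fake_pipeline`, `exact_match_evaluator`, and `run_test` are hypothetical names, the test cases are defined inline rather than pulled from Gentrace, and the evaluator is a simple exact-match heuristic rather than the model-graded evaluation that Evaluate provides.

```python
def fake_pipeline(inputs):
    """Stand-in for a generative pipeline; a real one would call a model."""
    return inputs["text"].upper()


def exact_match_evaluator(output, expected):
    """Heuristic evaluator: 1.0 on exact match, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_test(test_cases, pipeline, evaluator):
    """Run every test case through the pipeline and grade each output."""
    results = []
    for case in test_cases:
        output = pipeline(case["inputs"])
        score = evaluator(output, case["expected"])
        results.append({"name": case["name"], "output": output, "score": score})
    return results


# Step 1: a set of test cases (defined inline for this sketch)
test_cases = [
    {"name": "greeting", "inputs": {"text": "hello"}, "expected": "HELLO"},
    {"name": "mismatch", "inputs": {"text": "bye"}, "expected": "Goodbye"},
]

# Steps 2 and 3: run the pipeline on each case, then grade the outputs
results = run_test(test_cases, fake_pipeline, exact_match_evaluator)
average = sum(r["score"] for r in results) / len(results)
print(average)  # 0.5
```

In a real test script, the test cases would be fetched from Gentrace and the outputs uploaded back for grading, using whatever calls the SDK docs describe.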
You can try Evaluate in Gentrace today. If you’re not already in early access, reach out to [email protected]
Welcome to Releases
Welcome to the releases page. We'll periodically update this page with release notes as we ship features and fix bugs in Gentrace.
Gentrace is an evaluation and observability tool for generative AI builders, now in alpha.
If you're interested in participating, please email [email protected].