Evaluations

What are Evals?

Evals, short for Evaluations, let you test and compare different AI model configurations, prompts, and Flows side by side using real data. Instead of guessing which setup works best, Evals give you concrete metrics so you can make confident decisions about what to deploy.

You can run Evals from the Flow editor or the Flows table. Results are always available under Evals in the sidebar.

Why use Evals?

  • Compare models: See how GPT-5, Claude Sonnet 4.6, and Gemini 3 Flash perform on your actual use case.

  • Test prompt variations: Determine which prompt wording produces the best results.

  • Tune parameters: Find the right temperature, max tokens, reasoning mode, and other settings.

  • Validate before deploying: Confirm that changes improve quality before they reach users.

  • Measure consistency: Understand how reliably a configuration performs across diverse inputs.

How Evals work

  1. Choose a Flow to evaluate.

  2. Provide test data using either a collection of existing Records or inline test inputs.

  3. Define variations for the different configurations you want to compare, such as models, prompts, or temperature.

  4. Run the Eval so Runtype executes the Flow with each variation against each test input.

  5. Review results to compare outputs, latency, cost, success rates, and more.
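The steps above boil down to a cross-product: every variation runs against every test input. The sketch below illustrates that shape in Python; the function and config keys (`run_flow`, `model`, `temperature`) are illustrative assumptions, not Runtype's actual API.

```python
# Illustrative sketch of the Eval loop: each variation is executed
# against each test input, producing one result per pair.
test_inputs = ["How do I reset my password?", "Why was I billed twice?"]
variations = [
    {"model": "gpt-5", "temperature": 0.7},
    {"model": "claude-sonnet-4.6", "temperature": 0.7},
]

def run_flow(variation, test_input):
    # Stand-in for executing the Flow with a variation's overrides applied.
    return {"variation": variation, "input": test_input, "output": "..."}

results = [run_flow(v, t) for v in variations for t in test_inputs]
print(len(results))  # one result per (variation, input) pair: 2 * 2 = 4
```

This is why Eval run counts multiply quickly: the total number of executions is always variations times test inputs.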

Eval components

Test dataset

A good test dataset includes representative inputs that cover your typical use cases along with a handful of edge cases. You have two options:

  • Record collection: Select existing Records, such as a Record type you have already created. You can upload a CSV in the Records area to quickly build a reusable test dataset.

  • Inline test data: Provide test inputs directly without storing them as Records. This is useful for quick, one-off comparisons.

Start with 10 to 20 diverse test cases. This is usually enough to surface meaningful differences between variations. When you are ready to validate for production, expand to 50 to 100 or more.
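If you prepare your dataset as a CSV for upload to the Records area, a small script can assemble it. The column names below (`input`, `category`) are assumptions for illustration, not a required schema.

```python
# Assemble a small, reusable test dataset as a CSV file.
import csv

rows = [
    {"input": "How do I reset my password?", "category": "routine"},
    {"input": "I was billed twice AND my plan changed?!", "category": "edge-case"},
]

with open("eval_test_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "category"])
    writer.writeheader()
    writer.writerows(rows)
```

Tagging each row with a category (routine vs. edge case) makes it easier to keep the mix representative as the dataset grows.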

Variations

Variations are the different configurations you want to compare. They are configured per step, so you can test different models or settings on different steps within the same Flow in a single Eval run.

Available overrides include model, prompt, temperature, max tokens, reasoning mode, and other step-level response format or tool settings.
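Because variations are configured per step, each variation is essentially a map from step to a set of overrides that win over the step's existing settings. A rough sketch of that shape (the field and step names are assumptions, not Runtype's schema):

```python
# One variation can override different settings on different steps
# within the same Flow.
variation = {
    "name": "claude-low-temp",
    "steps": {
        "draft_response": {"model": "claude-sonnet-4.6", "temperature": 0.3},
        "classify_intent": {"max_tokens": 256},
    },
}

def apply_overrides(step_config, overrides):
    # Override values replace the step's existing settings; everything
    # else is left as configured in the Flow.
    return {**step_config, **overrides}

base = {"model": "gpt-5", "temperature": 0.7}
merged = apply_overrides(base, variation["steps"]["draft_response"])
print(merged["model"])  # claude-sonnet-4.6
```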

Metrics

Evals capture several dimensions so you can make well-rounded decisions:

  • Quality: Compare outputs side by side, use keyword analysis to spot language patterns, and assess accuracy and relevance.

  • Success rate: See which variations complete successfully and which encounter errors.

  • Latency: Review response time per step and overall, especially for user-facing Flows where speed matters.

  • Cost: Compare token usage and cost per execution by variation.

Use keyword analysis to search for specific phrases across all outputs. This helps you spot patterns, such as whether one model consistently uses more empathetic language or includes details you care about.
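Runtype runs keyword analysis for you in the results view, but the underlying idea is simple: count how often each phrase appears in each variation's outputs. A minimal sketch, with made-up outputs for illustration:

```python
# Count phrase occurrences per variation to surface language patterns.
from collections import Counter

outputs = {
    "gpt-5": ["Please follow the reset link.", "Your bill is attached."],
    "claude": ["I understand, let me help you reset it.", "I understand the confusion."],
}
phrases = ["i understand", "let me help"]

counts = {
    name: Counter({p: sum(p in text.lower() for text in texts) for p in phrases})
    for name, texts in outputs.items()
}
print(counts["claude"]["i understand"])  # 2
```

Comparing these counts across variations is how you spot, for example, that one model consistently leans on more empathetic phrasing.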

Two ways to run Evals

  • Batch mode: Results are stored and available in the Evals section for ongoing reference. Use this for thorough comparisons and production validation.

  • Realtime mode: Results stream live as they are generated. Use this for quick iteration during development.

Evals in practice

Ken runs customer support automation for a SaaS platform with 200,000 users. His team built a Flow that drafts responses to common questions such as password resets, billing inquiries, and feature requests. It works well, but users occasionally complain that responses feel too robotic or miss important context.

He has been using GPT-5 at temperature 0.7 since launch. Newer models are available, and he wants to explore them. With the Flow handling more than 500 requests daily, he needs to make sure any change is an improvement.

Setting up the Eval

Ken creates a test dataset from 50 real customer questions pulled from the past month. Most are routine, but the set also includes edge cases such as frustrated users, ambiguous requests, and multi-part questions. He uploads these as Records through CSV so he can reuse them for future testing.

He sets up four variations:

  • GPT-5 at temp 0.7: His current production setup and baseline.

  • Claude Sonnet 4.6 at temp 0.7: Known for nuanced, empathetic responses.

  • Gemini 3 Flash at temp 0.7: A fast, cost-effective alternative.

  • GPT-5 at temp 0.3: More deterministic, which might reduce the robotic feel.

He runs the Eval from the Flow editor. Runtype executes his Flow 200 times (4 variations × 50 questions) and presents the results side by side.

Reviewing results

The data reveals insights he could not have guessed:

  • Claude Sonnet 4.6 produces noticeably more empathetic, natural-sounding responses. When Ken reviews individual outputs, he sees Claude consistently acknowledge user frustration and provide clearer next steps.

  • Gemini 3 Flash is significantly faster and more affordable, with quality close to GPT-5 for straightforward questions, though it occasionally misses context on complex multi-part requests.

  • GPT-5 at temp 0.3 is faster than temp 0.7, but the responses feel more formulaic, which matches the robotic feedback users reported.

  • Cost breakdown: Claude costs 15% more per response than GPT-5, while Gemini 3 Flash costs 60% less. Latency differences are minimal across all three.

  • Edge case discovery: All models struggle with one specific question type: multi-part billing questions. Ken flags this for a prompt revision.

The keyword analysis feature highlights that Claude uses phrases like “I understand” and “let me help” significantly more than GPT-5, which aligns with his team’s support style.

The decision

Ken switches to Claude Sonnet 4.6. The 15% cost increase is worth it because better responses lead to fewer escalations to human support staff, which saves more than the added model cost. He also updates his prompt to handle multi-part billing questions and runs a follow-up Eval with 20 targeted test cases to confirm the fix.

Two weeks later, user complaints about robotic responses drop by 60%. He now runs Evals monthly with fresh customer questions to catch quality regressions and test new models as they are released.

When to run Evals

  • Before deploying: Validate new prompts or model changes against real data.

  • After making changes: Confirm that modifications actually improve performance.

  • On a regular cadence: Periodic regression testing helps catch quality degradation early.

  • When optimizing costs: Find models that deliver acceptable quality at a lower price point.

You can also run Evals on Product Surfaces to compare how different model and configuration choices affect the end-user experience, including routing, orchestration, and Capability settings. If you want to publish a tested Flow for end users, see Quickstart: From Flow to Live Surface.

Next steps
