What are Evals?
Evals (evaluations) test and compare AI model configurations, prompts, and Flows across datasets to find the best-performing setup. Run them from the Flow editor or Flows table; results appear under Evals in the sidebar.
Why use Evals?
Instead of guessing which model or prompt works best, Evals provide data-driven answers:
Compare models: gpt-5 vs claude-sonnet-4-5 vs gemini-3-flash
Test prompts: Which prompt variation produces better results?
Optimize parameters: Find ideal temperature, max tokens, etc.
Validate changes: Ensure updates improve quality before deploying
Measure consistency: How reliably does a configuration perform?
How Evals work
Choose a Flow to evaluate.
Provide a test dataset (Records or inline test data).
Define variations to compare (models, prompts, temperature, max tokens, etc.).
Run the Eval (executes the Flow with each variation for each input; see the sketch after this list).
Review results (outputs, latency, cost, success rate, keyword analysis, record-level drill-downs).
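In other words, an Eval run is a grid: every variation runs against every test input, and each execution records its output plus metrics. The sketch below illustrates that loop in plain Python; run_flow, the field names, and the example data are assumptions for illustration, not the product's actual API.

```python
# Minimal sketch of the Eval loop: every variation runs against every test input.
# run_flow() and the result fields are illustrative placeholders, not a real API.
import time

def run_flow(variation: dict, test_input: dict) -> dict:
    """Placeholder for executing a Flow with one variation's step overrides."""
    started = time.time()
    output = f"(response from {variation['model']} for {test_input['question']!r})"
    return {
        "variation": variation["name"],
        "input_id": test_input["id"],
        "output": output,
        "latency_s": time.time() - started,
        "success": True,
        "cost_usd": 0.0,  # would come from token usage in a real run
    }

variations = [
    {"name": "baseline", "model": "gpt-5", "temperature": 0.7},
    {"name": "candidate", "model": "claude-sonnet-4-5", "temperature": 0.7},
]
test_dataset = [
    {"id": 1, "question": "How do I reset my password?"},
    {"id": 2, "question": "Why was I billed twice this month?"},
]

# 2 variations x 2 inputs = 4 executions, compared side by side afterwards.
results = [run_flow(v, t) for v in variations for t in test_dataset]
```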
Eval components
Test dataset
Representative inputs covering typical use cases and edge cases. A test dataset can be either:
Record collection — select existing Records (e.g. a Record type). You can create Records by uploading a CSV in the Records area, then use them here; a CSV sketch follows this list.
Inline test data — provide test inputs directly (e.g. via API or ad-hoc Records) without storing Records.
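If you go the Records route, a small CSV is usually the quickest way to build the dataset. Below is a minimal sketch using Python's standard csv module; the column names (question, expected_topic) are made up for illustration, since your Record type defines the actual fields.

```python
# Write a small test dataset as a CSV for upload in the Records area.
# Column names here are examples; use the fields your Record type expects.
import csv

rows = [
    {"question": "How do I reset my password?", "expected_topic": "account"},
    {"question": "Why was I billed twice this month?", "expected_topic": "billing"},
    {"question": "Can I export my data to CSV?", "expected_topic": "feature"},
]

with open("eval_test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_topic"])
    writer.writeheader()
    writer.writerows(rows)
```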
Variations
Different configurations to compare. Variations are configured per step, so different steps in the same Flow can use different models or settings in one Eval (see the sketch after this list):
Model: gpt-5, claude-sonnet-4-5, gemini-3-flash (model IDs depend on what's enabled in Settings)
Prompt: Different prompt templates or saved prompts
Temperature: 0.0, 0.5, 0.7, 1.0
Max tokens: 256, 512, 1024
Response format, tools: Other step-level overrides
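One way to picture variations is as a list of per-step overrides, where each entry changes only the settings it cares about and inherits the rest from the Flow. The structure below is an illustrative sketch, not the exact shape the product uses; the step name draft_reply and the values are assumptions.

```python
# Illustrative variation definitions: each dict overrides settings for one step.
# Keys and values are examples; available model IDs depend on your Settings.
variations = [
    {"step": "draft_reply", "model": "gpt-5", "temperature": 0.7, "max_tokens": 512},
    {"step": "draft_reply", "model": "claude-sonnet-4-5", "temperature": 0.7, "max_tokens": 512},
    {"step": "draft_reply", "model": "gemini-3-flash", "temperature": 0.3, "max_tokens": 256},
]
```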
Metrics
What to measure:
Quality: Compare outputs side by side and use keyword analysis; assess accuracy, relevance, and completeness manually or with your own tools
Success rate: Step and completion success across variations
Latency: Response time per step and aggregate
Cost: Tokens used and cost per execution
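Given per-execution results like those in the earlier sketch, the aggregate numbers are straightforward to compute. A minimal sketch, assuming each result carries variation, success, latency_s, and cost_usd fields (all illustrative names, not the product's schema):

```python
# Aggregate per-variation metrics from individual execution results.
# Assumes each result dict has: variation, success, latency_s, cost_usd.
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    grouped = defaultdict(list)
    for r in results:
        grouped[r["variation"]].append(r)

    summary = {}
    for name, runs in grouped.items():
        summary[name] = {
            "success_rate": sum(r["success"] for r in runs) / len(runs),
            "avg_latency_s": sum(r["latency_s"] for r in runs) / len(runs),
            "total_cost_usd": sum(r["cost_usd"] for r in runs),
        }
    return summary
```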
Evals in practice
Ken runs customer support automation for a SaaS platform with 200,000 users. His team built a Flow that drafts responses to common questions—password resets, billing inquiries, feature requests. It works well enough, but users occasionally complain about responses being "too robotic" or missing important context.
He's been using GPT-4 at temperature 0.7 since launch, but newer models are available and costs are climbing. Before changing a Flow that handles 500+ requests daily, he needs confidence the change won't make things worse.
Setting up the Eval
Ken creates a test dataset from 50 real customer questions pulled from the past month—mostly routine questions, but including edge cases like frustrated users, ambiguous requests, and questions requiring multiple pieces of information. He uploads these as Records via CSV so he can reuse them for future testing.
He sets up three variations to test:
GPT-4 at temp 0.7 (current production setup—the baseline)
Claude Sonnet 4.5 at temp 0.7 (he's heard it's better at nuanced, empathetic responses)
GPT-4 at temp 0.3 (more deterministic—might reduce the "robotic" feel by being more consistent)
He runs the Eval from the Flow editor. Runtype executes his Flow 150 times (3 variations × 50 questions) and presents the results side by side.
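In the illustrative style used earlier, Ken's setup boils down to a three-entry grid; the names and fields below are examples, not the product's configuration format.

```python
# Ken's grid: 3 variations x 50 test questions = 150 Flow executions.
ken_variations = [
    {"name": "baseline", "model": "gpt-4", "temperature": 0.7},
    {"name": "claude",   "model": "claude-sonnet-4-5", "temperature": 0.7},
    {"name": "low-temp", "model": "gpt-4", "temperature": 0.3},
]
num_questions = 50
total_runs = len(ken_variations) * num_questions  # 150
```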
Reviewing results
The data immediately reveals insights he couldn't have guessed:
Claude Sonnet 4.5 produces noticeably more empathetic, natural-sounding responses. When he drills into individual outputs, Claude consistently acknowledges user frustration and provides clearer next steps.
GPT-4 at temp 0.3 is faster and cheaper than temp 0.7, but the responses feel more formulaic—exactly the "robotic" problem users complained about.
Cost breakdown: Claude costs 15% more per response than GPT-4, but latency is nearly identical.
Edge case discovery: Both models struggle with one specific question type (multi-part billing questions). He flags this for a prompt revision.
The keyword analysis feature highlights that Claude uses "I understand" and "let me help" significantly more than GPT-4—language that aligns with his team's support philosophy.
The decision
Ken switches to Claude Sonnet 4.5. The 15% cost increase is worth it—better responses mean fewer escalations to human agents, which saves far more than the added model cost. He also updates his prompt to handle multi-part billing questions better and runs a follow-up Eval with 20 test cases to confirm the fix works.
Two weeks later, user complaints about "robotic responses" drop by 60%. He now runs Evals monthly with fresh customer questions to catch quality regressions and test new models as they're released.
Start with 10-20 diverse test cases. This catches most issues without expensive Eval runs. Expand to 50-100+ for production validation.
When to run Evals
Before deploying: Validate new prompts or model changes
After changes: Ensure modifications improve performance
Periodic regression testing: Catch quality degradation
Cost optimization: Find cheaper models with acceptable quality
Next steps
Running an evaluation to execute your first Eval
Interpreting eval results to understand output
What are Flows? to build Flows for evaluation