Evaluations

Running an evaluation

Run evaluations to compare model and prompt variations across a test dataset, then choose the best-performing configuration.

Start an evaluation

  1. Open the flow you want to evaluate

  2. Click Run

  3. Select the Eval tab

  4. Configure the evaluation

  5. Click Run Eval

Configure test dataset

Choose where test inputs come from:

CSV upload

  1. Prepare a CSV file with inputs (one row per test case)

  2. Upload the file

  3. Map columns to flow variables

Example CSV:

customerMessage,priority
"Where is my order?",high
"How do I reset my password?",low
"I need to cancel my subscription",high

Record collection

  1. Select Use Records

  2. Choose record type

  3. Optionally filter records

  4. Map record fields to flow variables

Manual entry

  1. Select Manual Input

  2. Enter test cases one by one

  3. Click + Add Test Case for each

Define variations

Specify which configurations to test:

Model comparison

  1. Select the prompt step to vary

  2. Click Add Variation

  3. Choose Model

  4. Select models to compare (e.g., gpt-4o, claude-3-5-sonnet, gpt-4o-mini)

The eval runs your flow with each model.

Prompt comparison

  1. Click Add Variation

  2. Choose Prompt

  3. Enter or select different prompt variations

Example variations:

  • Variation A: "You are a helpful assistant. Answer concisely."

  • Variation B: "You are an expert customer support agent. Provide detailed, empathetic responses."

Parameter comparison

  1. Click Add Variation

  2. Choose Parameters

  3. Configure different temperature, max tokens, or other settings

Example:

  • Variation A: Temperature 0.3 (deterministic)

  • Variation B: Temperature 0.7 (balanced)

  • Variation C: Temperature 1.0 (creative)

Run the evaluation

  1. Review configuration (dataset + variations)

  2. Click Run Eval

  3. Monitor progress in the Evals page

The eval runs (number of test cases) × (number of variations) total executions.

Example: 20 test cases × 3 model variations = 60 flow executions

Evals run in the background. You can close the browser; results will be ready when the executions complete.

Grading responses

After execution, grade response quality:

Manual grading

  1. Review each output

  2. Rate quality (1-5 stars or thumbs up/down)

  3. Add notes about what's good or bad

AI-assisted grading

  1. Enable AI Grading

  2. Define grading criteria (e.g., "Is the response accurate, helpful, and professional?")

  3. AI grades all outputs automatically

  4. Review and adjust grades as needed

Custom grading function

Write JavaScript to grade programmatically:

function grade(input, output, expectedOutput) {
  // input: the test case inputs; output: the flow's response;
  // expectedOutput: the expected answer, if your dataset provides one.
  // Check whether the output contains the key information.
  if (output.includes('tracking number')) {
    return { score: 5, reason: 'Includes tracking info' };
  }
  return { score: 3, reason: 'Missing tracking info' };
}
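When your dataset includes expected answers, the third parameter lets you grade by comparison. A sketch, assuming expectedOutput arrives as a plain string (the scoring thresholds are illustrative):

```javascript
// Sketch: grade by comparing the output to an expected answer.
// Assumes expectedOutput is a plain string from the test dataset.
function grade(input, output, expectedOutput) {
  if (!expectedOutput) {
    return { score: 3, reason: 'No expected output to compare against' };
  }
  const normalize = (s) => s.trim().toLowerCase();
  if (normalize(output) === normalize(expectedOutput)) {
    return { score: 5, reason: 'Exact match with expected output' };
  }
  if (normalize(output).includes(normalize(expectedOutput))) {
    return { score: 4, reason: 'Expected output contained in response' };
  }
  return { score: 1, reason: 'Does not match expected output' };
}
```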

Cost considerations

Evals consume AI credits for each execution:

  • 20 test cases × 3 variations = 60 AI calls

  • Estimate cost before running large evals

  • Use cheaper models (gpt-4o-mini) for initial testing
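Total cost scales linearly with dataset size and variation count, so it helps to estimate before launching. A rough sketch (the per-call cost is a placeholder you would replace with your model's actual pricing, not a figure from this platform):

```javascript
// Rough cost estimate for an eval run.
// costPerCall is a placeholder assumption; substitute your model's real per-call cost.
function estimateEvalCost(testCases, variations, costPerCall) {
  const totalCalls = testCases * variations;
  return { totalCalls, estimatedCost: totalCalls * costPerCall };
}
```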

Best practices

  • Diverse dataset: Cover common cases and edge cases

  • Start small: 10-20 test cases for initial validation

  • Clear criteria: Define what "good" looks like before running

  • Baseline comparison: Always include your current configuration

  • Document findings: Note why the winning configuration performed better

Next steps

  • Interpreting eval results to analyze outcomes

  • What are evals? for conceptual background

  • Creating and editing flows to build flows for evaluation

  • What are prompts? for prompt management
