Running an evaluation
Run evaluations to compare model and prompt variations across a test dataset, then choose the best-performing configuration.
Start an evaluation
Open the flow you want to evaluate
Click Run
Select the Eval tab
Configure the evaluation
Click Run Eval
Configure test dataset
Choose where test inputs come from:
CSV upload
Prepare a CSV with inputs (one row per test case)
Upload file
Map columns to flow variables
Example CSV:
customerMessage,priority
"Where is my order?",high
"How do I reset my password?",low
"I need to cancel my subscription",high
Record collection
Select Use Records
Choose record type
Optionally filter records
Map record fields to flow variables
Manual entry
Select Manual Input
Enter test cases one by one
Click + Add Test Case for each
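Whichever source you choose, the result is the same: a list of test cases whose fields map onto flow variables. A minimal sketch of that mapping in JavaScript (the function and variable names here are illustrative, not part of the product):

```javascript
// Hypothetical sketch: turn parsed CSV rows (or record fields) into
// test cases whose columns map onto flow variables.
// Column-to-variable mapping as configured in the eval UI (illustrative names).
const columnMap = { customerMessage: 'customerMessage', priority: 'priority' };

function rowsToTestCases(rows) {
  return rows.map((row) => {
    const variables = {};
    for (const [column, variable] of Object.entries(columnMap)) {
      variables[variable] = row[column];
    }
    return { variables };
  });
}

// Two of the example rows from the CSV above.
const rows = [
  { customerMessage: 'Where is my order?', priority: 'high' },
  { customerMessage: 'How do I reset my password?', priority: 'low' },
];
const testCases = rowsToTestCases(rows);
```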
Define variations
Specify which configurations to test:
Model comparison
Select the prompt step to vary
Click Add Variation
Choose Model
Select models to compare (e.g., gpt-4o, claude-3-5-sonnet, gpt-4o-mini)
The eval runs your flow with each model.
Prompt comparison
Click Add Variation
Choose Prompt
Enter or select different prompt variations
Example variations:
Variation A: "You are a helpful assistant. Answer concisely."
Variation B: "You are an expert customer support agent. Provide detailed, empathetic responses."
Parameter comparison
Click Add Variation
Choose Parameters
Configure different temperature, max tokens, or other settings
Example:
Variation A: Temperature 0.3 (more deterministic)
Variation B: Temperature 0.7 (balanced)
Variation C: Temperature 1.0 (creative)
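You can think of each variation as a small configuration object that overrides one setting on the prompt step; the shape below is an illustrative sketch, not the product's actual schema:

```javascript
// Illustrative variation list for the parameter comparison above
// (field names are assumptions, not the product's schema).
const variations = [
  { name: 'A', temperature: 0.3 }, // more deterministic
  { name: 'B', temperature: 0.7 }, // balanced
  { name: 'C', temperature: 1.0 }, // creative
];
```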
Run the evaluation
Review configuration (dataset + variations)
Click Run Eval
Monitor progress in the Evals page
The eval executes: (number of test cases) × (number of variations) = total executions
Example: 20 test cases × 3 model variations = 60 flow executions
Evals run in the background. You can close the browser—results will be ready when executions complete.
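The execution count above is simple multiplication; a quick sketch:

```javascript
// Total flow executions = test cases × variations.
function totalExecutions(testCaseCount, variationCount) {
  return testCaseCount * variationCount;
}

// Example from the text: 20 test cases × 3 model variations.
console.log(totalExecutions(20, 3)); // 60
```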
Grading responses
After execution, grade response quality:
Manual grading
Review each output
Rate quality (1-5 stars or thumbs up/down)
Add notes about what's good or bad
AI-assisted grading
Enable AI Grading
Define grading criteria (e.g., "Is the response accurate, helpful, and professional?")
AI grades all outputs automatically
Review and adjust grades as needed
Custom grading function
Write JavaScript to grade programmatically:
function grade(input, output, expectedOutput) {
  // Check if output contains key information
  if (output.includes('tracking number')) {
    return { score: 5, reason: 'Includes tracking info' };
  }
  return { score: 3, reason: 'Missing tracking info' };
}
Cost considerations
Evals consume AI credits for each execution:
20 test cases × 3 variations = 60 AI calls
Estimate cost before running large evals
Use cheaper models (gpt-4o-mini) for initial testing
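To estimate cost before a large run, multiply the execution count by an approximate per-call cost for the model you chose. A sketch, with placeholder per-call prices (check your provider's actual pricing):

```javascript
// Rough cost estimate: executions × assumed per-call cost.
// Per-call costs here are placeholders, not real rates.
const assumedCostPerCall = { 'gpt-4o': 0.01, 'gpt-4o-mini': 0.001 };

function estimateCost(testCaseCount, variationCount, model) {
  return testCaseCount * variationCount * assumedCostPerCall[model];
}

// 20 test cases × 3 variations on the cheaper model.
console.log(estimateCost(20, 3, 'gpt-4o-mini')); // ≈ 0.06
```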
Best practices
Diverse dataset: Cover common cases and edge cases
Start small: 10-20 test cases for initial validation
Clear criteria: Define what "good" looks like before running
Baseline comparison: Always include your current configuration
Document findings: Note why winner performed better
Next steps
Interpreting eval results to analyze outcomes
What are evals? for conceptual background
Creating and editing flows to build flows for evaluation
What are prompts? for prompt management