
Run evaluations

The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.

Backward compatibility: All new features (multi-provider support, capture mode, output formats) work with existing evaluation suites. No code changes required.

Basic usage

Run all evaluations in the current directory:

Terminal
arcade evals .

The command searches for files starting with eval_ and ending with .py.

Show detailed results with critic feedback:

Terminal
arcade evals . --details

Filter to show only failures:

Terminal
arcade evals . --only-failed

Multi-provider support

Single provider with default model

Use OpenAI with default model (gpt-4o):

Terminal
export OPENAI_API_KEY=sk-...
arcade evals .

Use Anthropic with default model (claude-sonnet-4-5-20250929):

Terminal
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic

Specific models

Specify one or more models for a provider:

Terminal
arcade evals . --use-provider "openai:gpt-4o,gpt-4o-mini"

Multiple providers

Compare performance across providers (space-separated):

Terminal
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...

When you specify multiple models, results show side-by-side comparisons.
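
For example, you can combine a multi-provider run with the --output flag (described later on this page) to save the comparison as a report; the file name here is just a placeholder:

Terminal
# Compare two providers and save the results as Markdown (keys read from the environment or a .env file)
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --output comparison.md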

API keys

API keys are resolved in the following order:

Priority            Format
1. Explicit flag    --api-key provider:key (can repeat)
2. Environment      OPENAI_API_KEY, ANTHROPIC_API_KEY
3. .env file        OPENAI_API_KEY=..., ANTHROPIC_API_KEY=...

Create a .env file in your directory to avoid setting keys in every terminal session.
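
For example, you can create one with placeholder keys like this (a minimal sketch; replace the placeholder values with your real keys):

Terminal
# Create a .env file with provider keys in the directory where you run the command
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EOF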

Examples:

Terminal
# Single provider
arcade evals . --api-key openai:sk-...

# Multiple providers
arcade evals . \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...

Capture mode

Record calls without scoring to bootstrap test expectations:

Terminal
arcade evals . --capture --output captures/baseline.json

Include conversation in captured output:

Terminal
arcade evals . --capture --include-context --output captures/detailed.json
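
You can also scope a capture run to a specific provider or model by combining --capture with --use-provider (the output path below is just an example):

Terminal
# Capture calls from one model without scoring
arcade evals . --capture -p "openai:gpt-4o-mini" --output captures/gpt-4o-mini.json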

Capture mode is useful for:

  • Creating initial test expectations
  • Debugging model behavior
  • Understanding call patterns

See Capture mode for details.

Output formats

Save results to files

Specify output files with extensions - format is auto-detected:

Terminal
# Single format
arcade evals . --output results.md

# Multiple formats
arcade evals . --output results.md --output results.html --output results.json

# All formats (no extension)
arcade evals . --output results

Available formats

Extension   Format        Description
.txt        Plain text    Pytest-style output
.md         Markdown      Tables and collapsible sections
.html       HTML          Interactive report
.json       JSON          Structured data for programmatic use
(none)      All formats   Generates all four formats

Command options

Quick reference

Flag                Short   Purpose                   Example
--use-provider      -p      Select provider/model     -p "openai:gpt-4o"
--api-key           -k      Provider API key          -k openai:sk-...
--capture           -       Record without scoring    --capture
--details           -d      Show critic feedback      --details
--only-failed       -f      Filter failures           --only-failed
--output            -o      Output file(s)            -o results.md
--include-context   -       Add messages to output    --include-context
--max-concurrent    -c      Parallel limit            -c 10
--debug             -       Debug info                --debug

--use-provider, -p

Specify provider(s) and model(s) (space-separated):

Terminal
--use-provider "<provider>[:<models>] [<provider2>[:<models2>]]"

Supported providers:

  • openai (default: gpt-4o)
  • anthropic (default: claude-sonnet-4-5-20250929)

Anthropic model names include date stamps. Check Anthropic's model documentation for the latest model versions.

Examples:

Terminal
# Default model for provider
arcade evals . -p anthropic

# Specific model
arcade evals . -p "openai:gpt-4o-mini"

# Multiple models from same provider
arcade evals . -p "openai:gpt-4o,gpt-4o-mini"

# Multiple providers (space-separated)
arcade evals . -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929"

--api-key, -k

Provide API keys explicitly (this flag is repeatable):

Terminal
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...

--capture

Enable capture mode to record calls without scoring:

Terminal
arcade evals . --capture

--include-context

Include system messages and conversation history in output:

Terminal
arcade evals . --include-context --output results.md

--output, -o

Specify output file(s). Format is auto-detected from extension:

Terminal
# Single format
arcade evals . -o results.md

# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html

# All formats (no extension)
arcade evals . -o results

--details, -d

Show detailed results including critic feedback:

Terminal
arcade evals . --details

--only-failed, -f

Show only failed test cases:

Terminal
arcade evals . --only-failed

--max-concurrent, -c

Set maximum concurrent evaluations:

Terminal
arcade evals . --max-concurrent 10

Default is 1 concurrent evaluation.

--debug

Show debug information for troubleshooting:

Terminal
arcade evals . --debug

Displays detailed error traces and connection information.

Understanding results

Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.

Summary format

Results show overall performance:

PLAINTEXT
Summary -- Total: 5 -- Passed: 4 -- Failed: 1

How flags affect output:

  • --details: Adds per-critic breakdown for each case
  • --only-failed: Filters to show only failed cases (summary shows original totals)
  • --include-context: Includes system messages and conversation history
  • Multiple models: Switches to comparison table format
  • Comparative tracks: Shows side-by-side track comparison
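
For example, several of these flags can be combined in a single run:

Terminal
# Critic feedback for failed cases only, with conversation context, saved to Markdown
arcade evals . --details --only-failed --include-context --output results.md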

Case results

Each case displays status and score:

PLAINTEXT
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65

Detailed feedback

Use --details to see critic-level analysis:

PLAINTEXT
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30

Multi-model results

When using multiple models, results show comparison tables:

PLAINTEXT
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED

Advanced usage

High concurrency for fast execution

Increase concurrent evaluations:

Terminal
arcade evals . --max-concurrent 20

High concurrency may hit API rate limits. Start with default (1) and increase gradually.

Save comprehensive results

Generate all formats with full details:

Terminal
arcade evals . --details --include-context --output results

This creates:

  • results.txt
  • results.md
  • results.html
  • results.json

Troubleshooting

Missing dependencies

If you see ImportError: MCP SDK is required, install the full package:

Terminal
pip install 'arcade-mcp[evals]'

For Anthropic support:

Terminal
pip install anthropic

Tool name mismatches

Tool names are normalized (dots become underscores; for example, a tool named weather.forecast would appear as weather_forecast). Check your tool definitions if you see unexpected names.

API rate limits

Reduce --max-concurrent value:

Terminal
arcade evals . --max-concurrent 2

No evaluation files found

Ensure your evaluation files:

  • Start with eval_
  • End with .py
  • Contain functions decorated with @tool_eval()
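
To quickly confirm that matching files exist in the target directory, you can list them before running the command:

Terminal
# Verify that discoverable eval files are present
ls eval_*.py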
