
Run evaluations

The arcade evals command discovers and executes evaluation suites with support for multiple providers, models, and output formats.

Backward compatibility: All new features (multi-provider support, capture mode, output formats) work with existing evaluation suites. No code changes required.

Basic usage

Run all evaluations in the current directory:

Terminal
arcade evals .

The command searches for files starting with eval_ and ending with .py.

Show detailed results with critic feedback:

Terminal
arcade evals . --details

Filter to show only failures:

Terminal
arcade evals . --only-failed

Multi-provider support

Single provider with default model

Use OpenAI with default model (gpt-4o):

Terminal
export OPENAI_API_KEY=sk-...
arcade evals .

Use Anthropic with default model (claude-sonnet-4-5-20250929):

Terminal
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic

Specific models

Specify one or more models for a provider:

Terminal
arcade evals . --use-provider "openai:gpt-4o,gpt-4o-mini"

Multiple providers

Compare performance across providers (space-separated):

Terminal
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...

When you specify multiple models, results show side-by-side comparisons.
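
For example, you can combine a multi-provider run with the --output flag (described later on this page) to save the comparison as a report; the file name here is just a placeholder:

Terminal
# Compare two providers and save the results as Markdown (keys read from the environment or a .env file)
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --output comparison.md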

API keys

API keys are resolved in the following order:

Priority            Format
1. Explicit flag    --api-key provider:key (can repeat)
2. Environment      OPENAI_API_KEY, ANTHROPIC_API_KEY
3. .env file        OPENAI_API_KEY=..., ANTHROPIC_API_KEY=...

Create a .env file in your directory to avoid setting keys in every terminal session.
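
For example, you can create one with placeholder keys like this (a minimal sketch; replace the placeholder values with your real keys):

Terminal
# Create a .env file with provider keys in the directory where you run the command
cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
EOF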

Examples:

Terminal
# Single provider
arcade evals . --api-key openai:sk-...

# Multiple providers
arcade evals . \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...

Capture mode

Record calls without scoring to bootstrap test expectations:

Terminal
arcade evals . --capture --output captures/baseline.json

Include conversation in captured output:

Terminal
arcade evals . --capture --include-context --output captures/detailed.json
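
You can also scope a capture run to a specific provider or model by combining --capture with --use-provider (the output path below is just an example):

Terminal
# Capture calls from one model without scoring
arcade evals . --capture -p "openai:gpt-4o-mini" --output captures/gpt-4o-mini.json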

Capture mode is useful for:

  • Creating initial test expectations
  • Debugging model behavior
  • Understanding call patterns

See Capture mode for details.

Output formats

Save results to files

Specify output files with extensions - format is auto-detected:

Terminal
# Single format
arcade evals . --output results.md

# Multiple formats
arcade evals . --output results.md --output results.html --output results.json

# All formats (no extension)
arcade evals . --output results

Available formats

Extension   Format        Description
.txt        Plain text    Pytest-style output
.md         Markdown      Tables and collapsible sections
.html       HTML          Interactive report
.json       JSON          Structured data for programmatic use
(none)      All formats   Generates all four formats

Command options

Quick reference

Flag                Short   Purpose                   Example
--use-provider      -p      Select provider/model     -p "openai:gpt-4o"
--api-key           -k      Provider API key          -k openai:sk-...
--capture           -       Record without scoring    --capture
--details           -d      Show critic feedback      --details
--only-failed       -f      Filter failures           --only-failed
--output            -o      Output file(s)            -o results.md
--include-context   -       Add messages to output    --include-context
--max-concurrent    -c      Parallel limit            -c 10
--debug             -       Debug info                --debug

--use-provider, -p

Specify provider(s) and model(s) (space-separated):

Terminal
--use-provider "<provider>[:<models>] [<provider2>[:<models2>]]"

Supported providers:

  • openai (default: gpt-4o)
  • anthropic (default: claude-sonnet-4-5-20250929)

Anthropic model names include date stamps. Check Anthropic's model documentation for the latest model versions.

Examples:

Terminal
# Default model for provider
arcade evals . -p anthropic

# Specific model
arcade evals . -p "openai:gpt-4o-mini"

# Multiple models from same provider
arcade evals . -p "openai:gpt-4o,gpt-4o-mini"

# Multiple providers (space-separated)
arcade evals . -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929"

--api-key, -k

Provide API keys explicitly (this flag is repeatable):

Terminal
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...

--capture

Enable capture mode to record calls without scoring:

Terminal
arcade evals . --capture

--include-context

Include system messages and conversation history in output:

Terminal
arcade evals . --include-context --output results.md

--output, -o

Specify output file(s). Format is auto-detected from extension:

Terminal
# Single format
arcade evals . -o results.md

# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html

# All formats (no extension)
arcade evals . -o results

--details, -d

Show detailed results including critic feedback:

Terminal
arcade evals . --details

--only-failed, -f

Show only failed test cases:

Terminal
arcade evals . --only-failed

--max-concurrent, -c

Set maximum concurrent evaluations:

Terminal
arcade evals . --max-concurrent 10

Default is 1 concurrent evaluation.

--debug

Show debug information for troubleshooting:

Terminal
arcade evals . --debug

Displays detailed error traces and connection information.

Understanding results

Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.

Summary format

Results show overall performance:

PLAINTEXT
Summary -- Total: 5 -- Passed: 4 -- Failed: 1

How flags affect output:

  • --details: Adds per-critic breakdown for each case
  • --only-failed: Filters to show only failed cases (summary shows original totals)
  • --include-context: Includes system messages and conversation history
  • Multiple models: Switches to comparison table format
  • Comparative tracks: Shows side-by-side track comparison
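
For example, several of these flags can be combined in a single run:

Terminal
# Critic feedback for failed cases only, with conversation context, saved to Markdown
arcade evals . --details --only-failed --include-context --output results.md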

Case results

Each case displays status and score:

PLAINTEXT
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65

Detailed feedback

Use --details to see critic-level analysis:

PLAINTEXT
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30

Multi-model results

When using multiple models, results show comparison tables:

PLAINTEXT
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED

Advanced usage

High concurrency for fast execution

Increase concurrent evaluations:

Terminal
arcade evals . --max-concurrent 20

High concurrency may hit API rate limits. Start with default (1) and increase gradually.

Save comprehensive results

Generate all formats with full details:

Terminal
arcade evals . --details --include-context --output results

This creates:

  • results.txt
  • results.md
  • results.html
  • results.json

Troubleshooting

Missing dependencies

If you see ImportError: MCP SDK is required, install the full package:

Terminal
pip install 'arcade-mcp[evals]'

For Anthropic support:

Terminal
pip install anthropic

Tool name mismatches

Tool names are normalized (dots become underscores; for example, a tool named weather.forecast would appear as weather_forecast). Check your tool definitions if you see unexpected names.

API rate limits

Reduce --max-concurrent value:

Terminal
arcade evals . --max-concurrent 2

No evaluation files found

Ensure your evaluation files:

  • Start with eval_
  • End with .py
  • Contain functions decorated with @tool_eval()
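
To quickly confirm that matching files exist in the target directory, you can list them before running the command:

Terminal
# Verify that discoverable eval files are present
ls eval_*.py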
