Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

Jun 15, 2026
787,080 sessions
28 models
| Rank | Rank range | Model | Developer · License · Host | Overall | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-1 | Claude Fable 5 (High) | Anthropic · Proprietary | 14.17%±1.54% | 16.48%±2.55% | 29.65%±5.72% | 13.39%±2.94% | 9.46%±1.84% | 1.86%±0.23% | 16,240 |
| 2 | 2-7 | Claude Opus 4.8 (Thinking) | Anthropic · Proprietary | 9.04%±1.19% | 10.75%±2.15% | 14.36%±4.30% | 10.48%±2.23% | 9.08%±1.24% | 0.56%±1.20% | 27,536 |
| 3 | 2-10 | GPT 5.5 (xHigh) | OpenAI · Proprietary | 8.27%±1.73% | 4.42%±3.17% | 17.52%±6.44% | 3.12%±3.44% | 14.42%±1.24% | 1.86%±0.23% | 12,050 |
| 4 | 2-10 | Claude Opus 4.7 | Anthropic · Proprietary | 8.12%±1.51% | 4.63%±3.07% | 10.73%±5.33% | 11.69%±2.94% | 11.74%±1.45% | 1.81%±0.23% | 29,569 |
| 5 | 2-10 | Claude Opus 4.7 (Thinking) | Anthropic · Proprietary | 8.09%±1.48% | 4.51%±3.04% | 12.40%±5.31% | 8.66%±2.93% | 13.08%±0.99% | 1.82%±0.23% | 29,524 |
| 6 | 2-10 | GPT 5.5 (High) | OpenAI · Proprietary | 7.78%±1.07% | 5.93%±2.03% | 10.90%±3.80% | 7.56%±2.06% | 12.64%±1.19% | 1.86%±0.23% | 38,583 |
| 7 | 3-10 | GPT 5.5 | OpenAI · Proprietary | 6.73%±1.03% | 4.75%±1.97% | 8.13%±3.71% | 7.86%±1.96% | 11.07%±1.27% | 1.86%±0.23% | 38,899 |
| 8 | 2-10 | Claude Opus 4.6 | Anthropic · Proprietary | 6.73%±1.42% | 6.17%±2.98% | 7.40%±4.64% | 8.13%±2.83% | 10.11%±2.16% | 1.86%±0.23% | 29,564 |
| 9 | 3-10 | GPT 5.4 (High) | OpenAI · Proprietary | 6.54%±1.07% | 4.63%±2.12% | 6.55%±3.84% | 7.53%±2.12% | 12.14%±1.01% | 1.86%±0.23% | 38,629 |
| 10 | 3-13 | GLM 5.2 (Max) | Z.ai · MIT · SiliconFlow | 4.37%±2.48% | 9.43%±4.52% | 14.88%±9.11% | 6.00%±4.50% | 1.69%±3.28% | 1.86%±0.23% | 11,643 |
| 11 | 10-13 | Claude Opus 4.8 | Anthropic · Proprietary | 3.60%±1.55% | 7.31%±2.45% | 12.53%±4.73% | 6.05%±2.49% | 8.01%±1.40% | 15.90%±4.11% | 25,083 |
| 12 | 10-13 | Claude Sonnet 4.6 | Anthropic · Proprietary | 3.22%±1.38% | 2.69%±3.03% | 2.54%±4.19% | 3.98%±2.73% | 10.12%±2.45% | 1.85%±0.23% | 29,533 |
| 13 | 10-13 | | Z.ai · MIT · SiliconFlow | 2.66%±1.14% | 3.42%±2.37% | 1.34%±3.99% | 1.24%±2.33% | 5.44%±1.09% | 1.86%±0.23% | 33,648 |
| 14 | 14-20 | | DeepSeek · MIT · SiliconFlow | 0.10%±1.41% | 0.44%±3.07% | 0.07%±4.86% | 2.81%±3.04% | 2.93%±1.33% | 0.74%±0.32% | 28,316 |
| 15 | 14-19 | | Google · Proprietary | 0.04%±1.00% | 1.98%±2.24% | 2.29%±3.28% | 0.23%±1.88% | 2.98%±1.31% | 1.69%±0.26% | 31,624 |
| 16 | 14-20 | | Moonshot · Modified MIT · Fireworks | 0.50%±1.09% | 0.36%±2.28% | 1.70%±3.64% | 2.82%±2.20% | 0.22%±1.66% | 1.86%±0.23% | 35,163 |
| 17 | 14-20 | | Google · Proprietary | 0.78%±0.94% | 0.09%±2.09% | 1.97%±3.02% | 2.23%±1.74% | 5.87%±1.52% | 1.80%±0.24% | 38,652 |
| 18 | 14-20 | | DeepSeek · MIT · SiliconFlow | 1.18%±1.33% | 4.14%±2.56% | 1.50%±4.60% | 6.59%±2.76% | 2.42%±1.80% | 0.48%±0.38% | 33,886 |
| 19 | 14-23 | Kimi K2.7 Code | Moonshot · Modified MIT · Fireworks | 2.71%±2.39% | 3.82%±4.53% | 5.17%±7.84% | 12.25%±4.81% | 1.83%±4.94% | 1.86%±0.23% | 14,656 |
| 20 | 15-21 | | MiniMax · Proprietary · Fireworks | 2.79%±1.70% | 2.30%±3.91% | 9.39%±5.50% | 7.62%±3.78% | 3.52%±1.40% | 1.86%±0.23% | 12,557 |
| 21 | 19-23 | | Alibaba · Proprietary · Fireworks | 4.24%±1.20% | 2.84%±2.44% | 5.58%±4.08% | 8.94%±2.46% | 2.42%±1.62% | 1.41%±0.56% | 33,184 |
| 22 | 20-25 | Grok Build 0.1 | xAI · Proprietary | 6.20%±1.10% | 6.78%±2.57% | 11.49%±3.54% | 8.64%±2.24% | 1.65%±1.60% | 2.42%±0.47% | 28,487 |
| 23 | 22-26 | Grok 4.3 (High) | xAI · Proprietary | 7.21%±1.23% | 10.75%±2.83% | 15.89%±3.35% | 4.46%±2.25% | 4.44%±3.11% | 0.49%±0.55% | 16,888 |
| 24 | 22-26 | | MiniMax · Modified MIT · Fireworks | 7.81%±1.05% | 13.51%±2.58% | 15.50%±3.20% | 8.77%±2.04% | 3.05%±1.71% | 1.79%±0.25% | 33,820 |
| 25 | 23-26 | | Google · Proprietary | 8.47%±1.01% | 11.23%±2.19% | 13.73%±2.67% | 4.25%±1.78% | 14.72%±2.96% | 1.58%±0.30% | 38,834 |
| 26 | 20-27 | Nemotron 3 Ultra | Nvidia · OpenMDW-1.1 | 8.65%±4.12% | 4.39%±7.57% | 5.54%±14.18% | 22.66%±8.28% | 12.31%±7.55% | 1.63%±0.51% | 4,277 |
| 27 | 26-28 | | Google · Apache 2.0 | 12.73%±1.97% | 5.87%±2.40% | 7.22%±3.58% | 6.20%±2.20% | 27.61%±6.38% | 16.73%±5.59% | 27,803 |
| 28 | 27-28 | | xAI · Proprietary | 15.78%±1.46% | 12.19%±2.26% | 14.66%±2.76% | 4.01%±1.81% | 48.86%±5.94% | 0.79%±0.43% | 38,020 |
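
One way to read the ± values on the overall scores above is to check whether two models' intervals overlap. The sketch below assumes each ± value is a symmetric half-width around the score, which the page does not state. The `Score` class and `intervals_overlap` helper are hypothetical names introduced for illustration, and this is not presented as how the leaderboard derives its rank ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """A leaderboard value such as 14.17% ± 1.54%, in percentage points.

    Assumption: the ± figure is a symmetric half-width around the value.
    """
    value: float
    margin: float

    @property
    def low(self) -> float:
        return self.value - self.margin

    @property
    def high(self) -> float:
        return self.value + self.margin


def intervals_overlap(a: Score, b: Score) -> bool:
    """Return True when the two [low, high] intervals share any point."""
    return a.low <= b.high and b.low <= a.high


# Overall scores copied from the table rows for ranks 1, 3, and 4.
rank_1 = Score(14.17, 1.54)   # Claude Fable 5 (High)
rank_3 = Score(8.27, 1.73)    # GPT 5.5 (xHigh)
rank_4 = Score(8.12, 1.51)

print(intervals_overlap(rank_1, rank_3))  # False: rank 1 is clearly separated
print(intervals_overlap(rank_3, rank_4))  # True: ranks 3 and 4 are not separable
```

Under this reading, a model whose interval overlaps with no other model's would be the only one whose position is unambiguous, which matches the single-value rank range shown for rank 1.
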
Signal Leaders
  1. Claude Fable 5 (High) gets users to confirm the task is done most often: 16.48%±2.55%
  2. Claude Fable 5 (High) draws the most positive responses relative to negative ones: 29.65%±5.72%
  3. Claude Fable 5 (High) lands user corrections best: 13.39%±2.94%
  4. GPT 5.5 (xHigh) recovers from failed commands with the fewest steps: 14.42%±1.24%
  5. GPT 5.5 (High) is least likely to hallucinate tools it doesn't have: 1.86%±0.23%

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Fable 5 (High): 16.48%
  2. Claude Opus 4.8 (Thinking): 10.75%
  3. GLM 5.2 (Max): 9.43%
  4. Claude Opus 4.8: 7.31%
  5. Claude Opus 4.6: 6.17%
  6. GPT 5.5 (High): 5.93%
  7. GPT 5.5: 4.75%
  8. GPT 5.4 (High): 4.63%
  9. Claude Opus 4.7: 4.63%
  10. Claude Opus 4.7 (Thinking): 4.51%
260,536 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. Claude Fable 5 (High): 29.65%
  2. GPT 5.5 (xHigh): 17.52%
  3. GLM 5.2 (Max): 14.88%
  4. Claude Opus 4.8 (Thinking): 14.36%
  5. Claude Opus 4.8: 12.53%
  6. Claude Opus 4.7 (Thinking): 12.40%
  7. GPT 5.5 (High): 10.90%
  8. Claude Opus 4.7: 10.73%
  9. GPT 5.5: 8.13%
  10. Claude Opus 4.6: 7.40%
89,416 Sessions

Steerability

How well the model lands user corrections when they push back.

  1. Claude Fable 5 (High): 13.39%
  2. Claude Opus 4.7: 11.69%
  3. Claude Opus 4.8 (Thinking): 10.48%
  4. Claude Opus 4.7 (Thinking): 8.66%
  5. Claude Opus 4.6: 8.13%
  6. GPT 5.5: 7.86%
  7. GPT 5.5 (High): 7.56%
  8. GPT 5.4 (High): 7.53%
  9. Claude Opus 4.8: 6.05%
  10. Claude Sonnet 4.6: 3.98%
153,020 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (xHigh): 14.42%
  2. Claude Opus 4.7 (Thinking): 13.08%
  3. GPT 5.5 (High): 12.64%
  4. GPT 5.4 (High): 12.14%
  5. Claude Opus 4.7: 11.74%
  6. GPT 5.5: 11.07%
  7. Claude Sonnet 4.6: 10.12%
  8. Claude Opus 4.6: 10.11%
  9. Claude Fable 5 (High): 9.46%
  10. Claude Opus 4.8 (Thinking): 9.08%
145,672 Sessions

Tool Hallucination

How much the model hallucinates tools it doesn't have.

  1. GPT 5.5 (High): 1.86%
  2. GLM 5.2 (Max): 1.86%
  3. GPT 5.5: 1.86%
  4. Claude Fable 5 (High): 1.86%
  5. Kimi K2.6: 1.86%
  6. GLM 5.1: 1.86%
  7. GPT 5.5 (xHigh): 1.86%
  8. Minimax M3: 1.86%
  9. Kimi K2.7 Code: 1.86%
  10. GPT 5.4 (High): 1.86%
573,205 Sessions

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.