OpenClaw Model Research — Notes to myself

Updated: 2026-02-26

Testing various models for use with OpenClaw.

Test Prompts

ID	Prompt	Expects
grocery	“Is flank steak on sale at any of our local grocery stores?”	Tool use (grocery-compare skill), report no flank steak, broaden to similar
rocket	“When is the next rocket launch?”	Tool use (rocket-launches skill), real launch data
weather	“What’s the weather like in Jupiter FL right now?”	Tool use (weather skill), current conditions
general	“What is the capital of Denmark?”	No tools, answer “Copenhagen”

Results Summary

Sorted by 4-test cost (cheapest first):

Model	Grocery	Rocket	Weather	General	Avg Time	$/M in	$/M out	4-Test Cost
GPT-oss-20b	⚠️ 15s	✅ 9s	❌ 2s	✅ 2s	7.0s	$0.03	$0.14	$0.003
GPT-oss-120b	❌ 17s	✅ 14s	✅ 13s	✅ 8s	12.8s	$0.04	$0.19	$0.006
Grok 4.1 Fast	✅ 32s	✅ 21s	✅ 19s	✅ 6s	19.5s	$0.20	$0.50	$0.020
Kimi K2.5	⚠️ 37s	✅ 10s	✅ 41s	✅ 6s	23.3s	$0.45	$2.20	$0.041
GLM-4.6	✅ 82s	✅ 10s	✅ 19s	✅ 7s	29.7s	$0.35	$1.71	$0.043
Gemini 3 Flash	✅ 12s	✅ 35s	✅ 9s	✅ 23s	19.8s	$0.50	$3.00	$0.052
Claude Haiku 4.5	⚠️ 8s	⚠️ 11s	⚠️ 11s	✅ 4s	8.2s	$1.00	$5.00	$0.058
GLM-4.7	⚠️ 92s	✅ 13s	✅ 20s	✅ 5s	32.3s	$0.30	$1.40	$0.063
GPT-5.3-codex	✅ 17s	⚠️ 14s	✅ 15s	✅ 4s	12.3s	$1.75	$14.00	$0.074
MiniMax M2.5	✅ 15s	⚠️ 29s	⚠️ 16s	⚠️ 4s	16.0s	$0.30	$1.10	$0.051
MiniMax M2	✅ 24s	⚠️ 11s	✅ 15s	✅ 11s	15.4s	$0.26	$1.00	$0.056
GPT-4o-mini	⚠️ 35s	✅ 16s	⚠️ 31s	✅ 10s	22.9s	$0.15	$0.60	$0.010
Devstral Small	❌ 58s	⚠️ 50s	❌ 53s	✅ 2s	40.9s	$0.10	$0.30	$0.093
Claude Sonnet 4.6	✅ 13s	✅ 12s	✅ 10s	✅ 3s	9.3s	$3.00	$15.00	$0.176
Gemini 3.1 Pro	✅ 74s	✅ 22s	✅ 26s	✅ 19s	35.3s	$2.00	$12.00	$0.250
Qwen3 Max Thinking	❌* 33s	⚠️ 17s	⚠️ 14s	✅ 4s	16.9s	$1.20	$6.00	$0.318

Failed / Not Viable

Model	Notes
GPT-5.3-codex (OAuth)	Zero tool calls, hallucinated data. Works via OpenRouter but not OAuth.
Local ollama/qwen3:14b	Used web_search instead of skill tools. 14B too small for OpenClaw prompts.
Local ollama/glm-4.7-flash	Wrong tools + hallucinated. Same issue as qwen3:14b.
Local ollama/qwen3.5:35b	Too large for 24GB VRAM with OpenClaw’s ~19K token system prompt.

Model Notes

Grok 4.1 Fast — x-ai/grok-4.1-fast ⭐ Router Primary

Score: 4/4 ✅ | Cost: $0.020/run | Current: LIGHT + MEDIUM tier

The best cost/performance model tested. Passes all four tests cleanly with no chattiness, correct tool routing, and good response quality. Broadens grocery results to show alternative steak deals. Concise weather formatting. At $0.20/M input and $0.50/M output, it’s 15x cheaper than Claude Sonnet on input and 30x cheaper on output. The only downside is speed — averaging 19.5s per test vs Claude’s 9.3s.

Claude Sonnet 4.6 — anthropic/claude-sonnet-4.6 ⭐ Router HEAVY

Score: 4/4 ✅ | Cost: $0.176/run | Current: HEAVY tier

The fastest and highest quality model. Produces the most polished responses — adds context like “solid cut if you’re looking for…” on grocery deals, highlights rocket launch visibility from Jupiter, and gives weather summaries with personality (“Pretty nice out!”). The 2.8s general knowledge response is the fastest of any model. Premium pricing is justified only for complex, multi-step tasks where quality matters most.

Gemini 3 Flash Preview — google/gemini-3-flash-preview

Score: 4/4 ✅ | Cost: $0.052/run

Strong all-rounder and the best backup to Grok 4.1 Fast. Fastest on grocery (12s) and weather (9s). Uses correct tools every time with clean single-message responses. Only weakness is the general knowledge test where it takes 23s — oddly slow for a simple factual question. At $0.50/$3.00 per M, it’s 2.5x more expensive than Grok but still very affordable.

GLM-4.6 — z-ai/glm-4.6

Score: 4/4 ✅ | Cost: $0.043/run

Passes all tests but has a speed problem — grocery took 82s. Response quality is good but not exceptional. Humorously listed “Beefsteak Tomato $1.99” as a steak deal in the grocery results. Extremely concise on general knowledge (just “Copenhagen”, 4 output tokens). Pricing at $0.35/$1.71 is competitive but Grok 4.1 Fast is cheaper and faster.

GLM-4.7 — z-ai/glm-4.7

Score: 3✅ 1⚠️ | Cost: $0.063/run

Slightly cheaper than GLM-4.6 per token but uses significantly more tokens (194K total vs 121K). Chatty on the grocery test (2 messages). Grocery was extremely slow at 92s. Otherwise solid — correct tool usage, good rocket and weather responses. Not worth choosing over GLM-4.6 or Grok.

Claude Haiku 4.5 — anthropic/claude-haiku-4.5

Score: 1✅ 3⚠️ | Cost: $0.058/run

Gets every answer right with correct tool usage, but chatty on every tool-use test (2-4 messages instead of 1). Sends intermediate “thinking” messages like “Let me check…” before the actual answer. The fastest model on tool tasks (7-11s) and general knowledge (3.7s). Sits awkwardly between Grok ($0.20/$0.50) and Sonnet ($3.00/$15.00) — too expensive for LIGHT tier, too chatty for production use.

Gemini 3.1 Pro Preview — google/gemini-3.1-pro-preview

Score: 4/4 ✅ | Cost: $0.250/run

Clean 4/4 pass with excellent response quality, but the most expensive model tested at $0.250 per run. Grocery took 74s — slow. High output token counts (236-1877 per test) drive up cost. At $2.00/$12.00 per M, it’s nearly as expensive as Claude Sonnet but slower. Not cost-effective for any router tier.

GPT-5.3-codex — openai/gpt-5.3-codex

Score: 3✅ 1⚠️ | Cost: $0.074/run

Strong tool calling and fast responses (12.3s average — second only to Claude Sonnet’s 9.3s). Grocery is clean — correctly reports no flank steak, broadens to petite sirloin at Sprouts. Weather and general knowledge both pass cleanly. Rocket gets a minor chattiness ding (2 messages). Response quality is high with good emoji usage and concise formatting. The catch is cost: at $1.75/$14.00 per M it’s nearly as expensive as Claude Sonnet ($3.00/$15.00) but with a lower score. The grocery test alone consumed 19.5K input tokens ($0.034 input) due to skill data volume. Still proves GPT-5.3-codex is a capable agent model — the OAuth integration is the problem when trying to use it through a ChatGPT Plus subscription, not the model itself.

MiniMax M2.5 — minimax/minimax-m2.5

Score: 1✅ 3⚠️ | Cost: $0.051/run

Solid budget option at $0.30/$1.10 per M. Gets every answer right with correct tool usage — grocery correctly reports no flank steak and broadens to alternatives, rocket returns real launch data, weather is accurate. The ⚠️ ratings are all for minor chattiness (2-3 messages instead of 1), not wrong answers. The extra messages are things like brief formatting notes, not the problematic “let me check…” intermediate spam. One quirk: consistently shows the wrong flag for Denmark (🇨🇿 Czech Republic instead of 🇩🇰). At $0.051/run it’s competitive with Gemini 3 Flash ($0.052) but the chattiness and flag issue keep it behind Grok 4.1 Fast.

MiniMax M2 — minimax/minimax-m2

Score: 3✅ 1⚠️ | Cost: $0.056/run

Slightly better than its M2.5 sibling — less chatty overall (3✅ vs 1✅), with weather and grocery both passing clean. Correct tool routing on every test. The rocket test got a minor chattiness ding (2 messages). Same Czech flag bug as M2.5 (🇨🇿 instead of 🇩🇰 for Denmark) and oddly repeated the general knowledge answer twice. At $0.26/$1.00 per M it’s marginally cheaper than M2.5 but uses more input tokens (210K vs 165K total), landing at a similar per-run cost. A decent budget option but still behind Grok 4.1 Fast on both quality and price.

GPT-4o-mini — openai/gpt-4o-mini

Score: 2✅ 2⚠️ | Cost: $0.010/run

Cheapest per-run cost but poor skill routing. Spawned subagents for the grocery test instead of reading SKILL.md, used web_search for weather instead of the weather skill, and got the rocket launch date wrong (Feb 28 vs Feb 27). The older/smaller OpenAI models don’t navigate OpenClaw’s skill framework well. At $0.15/$0.60 per M it’s dirt cheap, but Grok 4.1 Fast ($0.20/$0.50) is barely more expensive and goes 4/4 clean.

Kimi K2.5 — moonshotai/kimi-k2.5

Score: 3✅ 1⚠️ | Cost: $0.041/run

Good all-rounder from Moonshot AI. Correct tool routing on every test — reads SKILL.md then runs the right script. Grocery is the only ding (3 messages — chatty while running multiple searches). Rocket response is nicely formatted with visibility info. Weather is accurate. Gets the Danish flag right (🇩🇰). At $0.45/$2.20 per M it’s mid-range pricing, landing at $0.041/run — cheaper than Gemini 3 Flash ($0.052) due to lower token consumption (83K total). Weather was slow at 41s. A solid option but Grok 4.1 Fast still wins on both price and clean 4/4 score.

Qwen3 Max Thinking — qwen/qwen3-max-thinking

Score: 1✅ 2⚠️ 1❌* | Cost: $0.318/run

The most expensive model tested at $0.318 per run, and it didn’t even pass cleanly. Chatty on 3 of 4 tests (2-4 messages). The grocery ❌ is likely a false positive — the model correctly said “No flank steak is currently on sale” but one of its earlier chatty messages probably contained intent phrasing like “let me check if flank steak is on sale” which triggered the hallucination detector. Correct tool usage throughout. Consumes massive input tokens (261K total) due to the thinking/reasoning overhead. At $1.20/$6.00 per M, it’s 6x more expensive than Grok 4.1 Fast on input and 12x on output with worse results. Not recommended for any router tier.

GPT-oss-120b — openai/gpt-oss-120b

Score: 3✅ 1❌ | Cost: $0.006/run

OpenAI’s larger open-weight MoE model — 117B params, 5.1B active per pass, runs on a single H100. At $0.04/$0.19 per M tokens it’s the second cheapest model tested, only behind its 20b sibling. Rocket launch response is excellent — well-formatted with emoji, visibility notes, all the right data. Weather passes cleanly with proper skill usage (read SKILL.md, ran script). General knowledge is quick and correct with the right flag (🇩🇰). The grocery test is a hard ❌ — it used the right tools (read SKILL.md, ran the compare script) but produced no response text, similar to the 20b’s weather failure. This “silent completion” pattern seems to be a quirk of the GPT-oss family where the model correctly executes tools but then fails to synthesize a response. At $0.006/run and 12.8s average, it’s tantalizing — 3x cheaper than Grok 4.1 Fast and comparable speed. If the grocery failure is a one-off, this could be a serious LIGHT tier contender. Worth retesting.

GPT-oss-20b — openai/gpt-oss-20b

Score: 2✅ 1⚠️ 1❌ | Cost: $0.003/run

The cheapest model tested by a wide margin at $0.03/$0.14 per M tokens — a full 7x cheaper than Grok 4.1 Fast on input. This is OpenAI’s open-weight 21B MoE model (3.6B active params) and it shows both the potential and limits of a small model on OpenClaw’s framework. Rocket launch is clean with nice formatting (visibility notes, emoji). General knowledge is instant at 2.1s. Grocery gets the right answer (no flank steak, broadens correctly) but the accuracy check flagged it ⚠️ — likely a pattern-matching issue since it listed Beefsteak Tomato as a “steak” deal. The weather test is a hard ❌ — it read the SKILL.md but never ran the script, returning no response. At $0.003/run it’s interesting as a potential LIGHT tier model for trivial queries, but the weather failure and grocery accuracy issue make it unreliable for tool-heavy workflows. The 7.0s average time is the second fastest after Claude Sonnet (9.3s).

Devstral Small — mistralai/devstral-small

Score: 1✅ 2❌ 1⚠️ | Cost: $0.093/run

A disaster for agent tasks despite rock-bottom per-token pricing ($0.10/$0.30). Spawned 41 subagents and consumed 357K input tokens on a simple grocery question with no response. Fired 24 web searches (454K tokens) for weather instead of using the skill. Burned 922K total tokens across 4 tests — the most of any model by 5x. Deceptively cheap per-token but ruinously expensive in practice due to uncontrolled tool loops.

Current Router Configuration

LIGHT  → x-ai/grok-4.1-fast     ($0.20/M in, $0.50/M out)
MEDIUM → x-ai/grok-4.1-fast     ($0.20/M in, $0.50/M out)
HEAVY  → anthropic/claude-sonnet-4.6  ($3.00/M in, $15.00/M out)

Key Findings

Per-token price is misleading. Devstral Small is the cheapest per token but the most expensive per test because it spirals into uncontrolled tool loops. Always evaluate total cost per task.
OAuth ≠ API. GPT-5.3-codex works through OpenRouter’s API but completely fails through ChatGPT Plus OAuth. The subscription model doesn’t support agent-style tool calling.
Local models can’t handle OpenClaw’s complexity. Models under 30B parameters pick generic tools (web_search) instead of skill-specific workflows. The ~19K token system prompt is too complex for small models to parse and prioritize.
Grok 4.1 Fast is the sweet spot. At 2 cents per 4-test run with 4/4 pass rate, it’s the best value by a wide margin. Gemini 3 Flash is a good backup at 5 cents.
Claude Sonnet justifies its premium only for HEAVY tasks. It’s the fastest and produces the best responses, but at 9x the cost of Grok, reserve it for complex multi-step reasoning.

OpenRouter Compliance Audit (2026-02-26)

Audited @mariozechner/pi-ai openai-completions adapter against OpenRouter docs.

Tool Calling: ✅ Fully Compliant — all 5 requirements met (tools in every request, tool result format, assistant tool_calls array, streaming accumulation, tool_choice support).

Interleaved Thinking: ⚠️ Two non-critical gaps — reasoning parameter uses OpenAI format (reasoning_effort) instead of OpenRouter unified format (reasoning: {effort}), and general reasoning_content isn’t preserved as reasoning_details across turns. Neither matters while thinkingDefault: "off".

No changes needed. If reasoning is enabled in the future, these would be upstream @mariozechner/pi-ai issues.