Answer three questions and get ranked model recommendations backed by real benchmark data from 38 coding tasks across 15 models. Scores, costs, and speeds come from actual test runs, not synthetic benchmarks.
## Best LLMs for Coding Agents (2026)
Compare cloud LLM performance on 38 real coding tasks including regex, API integration, debugging, and math.
| Model | Provider | Score | Cost/Task | Speed (tok/s) | Context |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 100.0% | $0.0052 | 53 | 200k |
| Claude Opus 4.6 | Anthropic | 98.6% | $0.0181 | 38 | 200k |
| MiniMax M2.5 | MiniMax | 98.6% | $0.0018 | 98 | 200k |
| Kimi K2.5 | Moonshot | 98.6% | $0.0034 | 48 | 128k |
| Gemini 2.5 Pro | Google | 98.3% | $0.0187 | 116 | 1000k |
| GPT-5.2 Codex | OpenAI | 98.3% | $0.0042 | 40 | 128k |
| GPT-5.2 | OpenAI | 98.0% | $0.0038 | 59 | 128k |
| Gemini 2.5 Flash | Google | 97.1% | $0.0001 | 112 | 1000k |
| DeepSeek R1 | DeepSeek | 96.8% | $0.0032 | 39 | 128k |
| Claude Haiku 4.5 | Anthropic | 95.9% | $0.0009 | 77 | 200k |
| GPT-5 Nano | OpenAI | 94.8% | $0.0007 | 104 | 128k |
| DeepSeek V3 | DeepSeek | 88.7% | $0.0002 | 20 | 128k |
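The Cost/Task figures above come from measured API spend, but the underlying arithmetic is simple: token counts times per-token prices. The sketch below illustrates that calculation with made-up token counts and prices, not the benchmark's actual numbers.

```python
def cost_per_task(input_tokens, output_tokens,
                  price_in_per_mtok, price_out_per_mtok):
    """Dollar cost of one task, given token counts and prices
    quoted in dollars per million tokens (the common API convention)."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical example: 1,500 prompt tokens and 800 completion tokens
# at $0.30 / $1.50 per million tokens.
print(cost_per_task(1500, 800, 0.30, 1.50))
```

Output prices scale the bill far faster than input prices for agentic coding, where models emit long diffs; that is why a cheap-input model can still land mid-table on Cost/Task.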
## Best Local LLMs for Coding (2026)
Run these models on your own hardware. Scores from our 38-task benchmark and the Aider edit leaderboard.
| Model | Score | Source | Min RAM | Quant |
|---|---|---|---|---|
| GPT-oss 20B | 98.3% | 38-task benchmark | 16 GB | N/A |
| Qwen 3.5 35B | 85.8% | 38-task benchmark | 24 GB | Q4_K_M |
| Gemma 3 12B | 80.6% | 38-task benchmark | 8 GB | N/A |
| Qwen 2.5 Coder 32B | 72.9% | Aider edit benchmark | 22 GB | N/A |
| DeepSeek Coder V2 | 72.9% | Aider edit benchmark | 42 GB | N/A |
| Qwen 2.5 Coder 14B | 69.2% | Aider edit benchmark | 10 GB | N/A |
| Llama 3.3 70B | 59.4% | Aider edit benchmark | 42 GB | N/A |
| Llama 3.1 70B | 58.6% | Aider edit benchmark | 42 GB | N/A |
| Qwen 2.5 Coder 7B | 57.9% | Aider edit benchmark | 5 GB | N/A |
| Codestral 22B | 48.1% | Aider edit benchmark | 14 GB | N/A |
| Llama 3.1 8B | 37.6% | Aider edit benchmark | 5 GB | N/A |
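The Min RAM column can be sanity-checked with a rule of thumb: memory needed is roughly parameter count times bytes per weight at the chosen quantization, plus overhead for the KV cache and runtime. The bytes-per-weight values and the 1.2x overhead factor below are assumptions for illustration, not measured figures.

```python
# Assumed average storage per weight for common quantization formats.
BYTES_PER_WEIGHT = {
    "Q4_K_M": 0.56,  # ~4.5 bits/weight on average (assumption)
    "Q8_0": 1.06,    # ~8.5 bits/weight (assumption)
    "F16": 2.0,      # 16-bit weights, exact
}

def min_ram_gb(params_billion, quant="Q4_K_M", overhead=1.2):
    """Rough estimate of GB needed to run a model at a given quant."""
    return params_billion * BYTES_PER_WEIGHT[quant] * overhead

# A 32B model at Q4_K_M lands near the table's ~22 GB figure.
print(f"{min_ram_gb(32):.1f} GB")
```

The estimate ignores context length: a long context window grows the KV cache well beyond the fixed overhead factor used here.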
Benchmark data comes from 38 real coding tasks run on each model. Both the cloud and local pickers use real test results, and pricing reflects actual API costs during testing. Updated daily with live pricing from OpenRouter.
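The "ranked model recommendations" mentioned at the top amount to filtering and sorting tables like these. One plausible ranking (the site's actual logic is not described here) is best score first, cheapest first on ties, after dropping models over a cost budget. The model list below is a hypothetical subset of the cloud table.

```python
# Hypothetical subset of the cloud table above.
models = [
    {"name": "Claude Sonnet 4.6", "score": 100.0, "cost": 0.0052},
    {"name": "MiniMax M2.5", "score": 98.6, "cost": 0.0018},
    {"name": "Gemini 2.5 Flash", "score": 97.1, "cost": 0.0001},
]

def rank(models, max_cost=None):
    """Drop models over budget, then sort score-descending,
    breaking ties by lower cost per task."""
    pool = [m for m in models if max_cost is None or m["cost"] <= max_cost]
    return sorted(pool, key=lambda m: (-m["score"], m["cost"]))

# With a $0.002-per-task budget, only the two cheapest models remain.
for m in rank(models, max_cost=0.002):
    print(m["name"])
```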