LLM Picker: Find the Right Model for Your Task

Answer three questions and get ranked model recommendations backed by data from 38 real coding tasks run across 15 models. Scores, costs, and speeds come from actual test runs, not synthetic benchmarks.

Best LLMs for Coding Agents (2026)

Compare cloud LLM performance on 38 real coding tasks including regex, API integration, debugging, and math.

| Model | Provider | Score | Cost/Task | Speed (tok/s) | Context |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 100.0% | $0.0052 | 53 | 200k |
| Claude Opus 4.6 | Anthropic | 98.6% | $0.0181 | 38 | 200k |
| MiniMax M2.5 | MiniMax | 98.6% | $0.0018 | 98 | 200k |
| Kimi K2.5 | Moonshot | 98.6% | $0.0034 | 48 | 128k |
| Gemini 2.5 Pro | Google | 98.3% | $0.0187 | 116 | 1000k |
| GPT-5.2 Codex | OpenAI | 98.3% | $0.0042 | 40 | 128k |
| GPT-5.2 | OpenAI | 98.0% | $0.0038 | 59 | 128k |
| Gemini 2.5 Flash | Google | 97.1% | $0.0001 | 112 | 1000k |
| DeepSeek R1 | DeepSeek | 96.8% | $0.0032 | 39 | 128k |
| Claude Haiku 4.5 | Anthropic | 95.9% | $0.0009 | 77 | 200k |
| GPT-5 Nano | OpenAI | 94.8% | $0.0007 | 104 | 128k |
| DeepSeek V3 | DeepSeek | 88.7% | $0.0002 | 20 | 128k |

Best Local LLMs for Coding (2026)

Run these models on your own hardware. Scores from our 38-task benchmark and the Aider edit leaderboard.

| Model | Score | Source | Min RAM | Quant |
|---|---|---|---|---|
| GPT-oss 20B | 98.3% | 38-task benchmark | 16 GB | N/A |
| Qwen 3.5 35B | 85.8% | 38-task benchmark | 24 GB | Q4_K_M |
| Gemma 3 12B | 80.6% | 38-task benchmark | 8 GB | N/A |
| Qwen 2.5 Coder 32B | 72.9% | Aider edit benchmark | 22 GB | N/A |
| DeepSeek Coder V2 | 72.9% | Aider edit benchmark | 42 GB | N/A |
| Qwen 2.5 Coder 14B | 69.2% | Aider edit benchmark | 10 GB | N/A |
| Llama 3.3 70B | 59.4% | Aider edit benchmark | 42 GB | N/A |
| Llama 3.1 70B | 58.6% | Aider edit benchmark | 42 GB | N/A |
| Qwen 2.5 Coder 7B | 57.9% | Aider edit benchmark | 5 GB | N/A |
| Codestral 22B | 48.1% | Aider edit benchmark | 14 GB | N/A |
| Llama 3.1 8B | 37.6% | Aider edit benchmark | 5 GB | N/A |
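Choosing a local model usually starts from available RAM. A minimal sketch that filters a few rows from the table above by a RAM budget and returns the best fit; the helper name and the budget values are illustrative:

```python
# Pick the highest-scoring local model that fits a given RAM budget.
# (Model, score %, minimum RAM in GB) -- values from the table above.
local_models = [
    ("GPT-oss 20B", 98.3, 16),
    ("Qwen 3.5 35B", 85.8, 24),
    ("Gemma 3 12B", 80.6, 8),
    ("Qwen 2.5 Coder 32B", 72.9, 22),
    ("Qwen 2.5 Coder 7B", 57.9, 5),
]

def best_for_ram(budget_gb):
    """Return the highest-scoring (model, score, ram) tuple within budget, or None."""
    fits = [m for m in local_models if m[2] <= budget_gb]
    return max(fits, key=lambda m: m[1]) if fits else None

print(best_for_ram(16))  # fits a 16 GB machine
print(best_for_ram(8))   # fits an 8 GB machine
```

Note the ranking is not monotone in size: GPT-oss 20B fits in 16 GB yet outscores larger models from the Aider-sourced rows, so filtering by RAM first can still land on a top scorer.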

Benchmark scores come from running the same 38 real coding tasks on each model; both the cloud and local pickers use those test results. Costs reflect actual API charges incurred during testing, and pricing is refreshed daily from OpenRouter.