Why Your llama.cpp Benchmarks Are Wrong: GPU Architecture and Real Numbers
I pulled an aging Quadro out of my homelab LLM box, dropped in an RTX 2060 SUPER, and the thing booted on the first try. The driver saw both cards. The fast utility model came back up and generated text at about 52.8 tokens per second. Respectable for an 8GB card.
That number was a lie. The binaries the new card was running had been compiled for the old card, so I was benchmarking a build I was never going to keep. A llama.cpp benchmark only means something if you ran it against the build you are actually going to ship.
The GPU swap that “just worked”
The box is ubuntu1, my homelab machine that runs a couple of local language models on separate GPUs. GPU0 had an NVIDIA Quadro M4000: an old Maxwell-generation card, 8GB of memory, the kind of thing you find cheap. I pulled it and put in an RTX 2060 SUPER, a Turing-generation card, also 8GB. (I bought it listed as a 2070; the driver reports it as a 2060 SUPER. Same slot, same plan either way.)
NVIDIA driver 580.159.03 saw both cards immediately, and I never touched the RTX 3090 Ti on GPU1 that does the heavy reasoning work. Open the case, change the part, close the case, everything comes back. Except it did not all come back.
The swap killed my pinned container
The fast utility model runs in a Docker container, and after the reboot that container was dead. It had exited with code 128 and refused to start.
The cause was a stability practice that bit me. I pin each model to a specific physical GPU using that GPU’s UUID, a unique identifier baked into the card, passed to the container through the NVIDIA_VISIBLE_DEVICES environment variable. Pinning by UUID guarantees a container always lands on the exact card you intended, even if the system renumbers devices between boots. The catch is that a new card comes with a new UUID. The container was still looking for the Quadro’s identifier, which no longer existed, so it had nothing to attach to and quit. The fix was one line: repoint at the new card’s UUID and bring it back up.
# old: container pinned to the Quadro's UUID, which no longer exists
NVIDIA_VISIBLE_DEVICES=GPU-<old-quadro-uuid>
# new: repoint at the 2060 SUPER's UUID
NVIDIA_VISIBLE_DEVICES=GPU-<new-2060super-uuid>
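A minimal sketch for catching a stale pin before the container fails, assuming `nvidia-smi` is available on the host. The helper name `uuid_present` is hypothetical. It reads the `index, name, uuid` listing on stdin and reports whether the pinned UUID still belongs to an installed card.

```shell
# List installed GPUs as "index, name, uuid" (these are real nvidia-smi query fields):
#   nvidia-smi --query-gpu=index,name,uuid --format=csv,noheader

# uuid_present UUID  (hypothetical helper)
# Reads that listing on stdin; exits 0 if UUID appears in the uuid column, 1 otherwise.
uuid_present() {
    awk -F', *' -v want="$1" '$3 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the host, assuming the pin holds a single UUID:
#   nvidia-smi --query-gpu=index,name,uuid --format=csv,noheader \
#     | uuid_present "$NVIDIA_VISIBLE_DEVICES" || echo "pinned GPU is gone"
```

Run as a pre-start check, this turns a silent exit at boot into an explicit message naming the cause.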
With that done, the utility model (Ministral-8B, a small 8-billion-parameter model, quantized to fit in roughly 5.1GB with all 37 of its layers offloaded onto the GPU) came up clean on the 2060 SUPER. Steady state: about 52.8 tokens per second generating text, about 106 tokens per second processing the prompt. Good enough that I almost stopped there.
The benchmark was measuring the wrong build
The binaries that model was running lived in a directory I had literally named llama-quadro-52-only. That build was compiled for the old Quadro’s architecture and nothing else.
Here is why that matters. When you compile GPU code with NVIDIA’s toolchain, you can target a specific GPU architecture and get native machine code (SASS) that runs directly on that exact generation of card. Each generation has an architecture number: the Quadro M4000 is sm_52 (Maxwell), the 2060 SUPER is sm_75 (Turing), and native code for one will not run as-is on the other. NVIDIA’s escape hatch is that a build can also include PTX, an intermediate representation not tied to any one architecture. When a binary lands on a card it has no native code for, the driver compiles that PTX into machine code at runtime. That is PTX JIT (“just-in-time” compilation), and it is why my Quadro-only build ran on the Turing card at all: no native Turing kernels, just the driver translating on the fly.
That architecture number is the thing you have to match. Here is the quick reference for the cards a homelab is likely to run:
| GPU generation | Architecture flag | Example cards |
|---|---|---|
| Maxwell | sm_52 | Quadro M4000, GTX 970/980 |
| Pascal | sm_61 | GTX 1080, Tesla P40 |
| Turing | sm_75 | RTX 2060 SUPER, RTX 2080, GTX 1660 |
| Ampere | sm_86 | RTX 3060, RTX 3090 Ti |
| Ada Lovelace | sm_89 | RTX 4090 |
| Hopper | sm_90 | H100 |
(Compute-capability codes like 8.6 map to the architecture code by dropping the dot: 8.6 becomes sm_86, and CMAKE_CUDA_ARCHITECTURES takes just the digits, 86. Datacenter Blackwell (10.0) and the consumer RTX 50-series (12.0) use newer codes again, so check your specific card if you are on the latest silicon.)
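The dot-dropping rule is easy to script. The sketch below assumes a driver recent enough to support the `compute_cap` query field in `nvidia-smi`. The helper name `cc_to_arch_list` is hypothetical. It reads one compute capability per line and prints a semicolon-joined list in the form CMAKE_CUDA_ARCHITECTURES expects.

```shell
# Query each installed card's compute capability:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# cc_to_arch_list  (hypothetical helper)
# Reads values such as "5.2" or "7.5" on stdin, one per line, and prints a
# de-duplicated, numerically sorted, semicolon-joined list such as "52;75".
cc_to_arch_list() {
    tr -d '. \r' | grep -v '^$' | sort -n -u | paste -s -d ';' -
}

# Usage:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | cc_to_arch_list
```

This only sees cards that are installed right now. An architecture for a card that is currently out of the machine, like a pulled Quadro, still has to be added by hand.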
So 52.8 tokens per second was not the 2060 SUPER’s number. It was the 2060 SUPER running translated Maxwell code, a floor on the wrong build.
Rebuilding llama.cpp for the right architecture
The fix was to recompile llama.cpp (the inference engine that runs these models) with native code for both architectures: sm_52 for the Quadro and sm_75 for the 2060 SUPER. Both, not just the new card, because the Quadro is going back into a different slot later, and a single binary that carries native code for both serves either card with no runtime translation on either. (If you want the from-scratch version, no swap involved, I documented a first CUDA build of llama.cpp on my 3090 Ti workstation separately. This is the sequel.)
I built from the exact commit my production setup was already running, so every command-line flag the utility model uses would behave identically. The build flags that matter:
# CMAKE_CUDA_ARCHITECTURES="52;75": native code for BOTH cards, no JIT
# GGML_CUDA_NCCL=OFF: single-GPU lane, drop multi-GPU comms
# GGML_NATIVE=OFF: don't bake the build host's CPU flags in
# GGML_CUDA_FA_ALL_QUANTS=ON: flash-attention for all KV-cache formats
# (run from the llama.cpp checkout; the "build" directory name is an assumption)
cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="52;75" \
  -DGGML_CUDA_NCCL=OFF \
  -DGGML_NATIVE=OFF \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_CUDA_FLAGS="-O3" -DCMAKE_C_FLAGS="-O3" -DCMAKE_CXX_FLAGS="-O3"
The line doing the real work is CMAKE_CUDA_ARCHITECTURES="52;75": emit native code for both card generations, so neither falls back to runtime translation. A few of the other choices earn a sentence. GGML_NATIVE=OFF keeps the build from baking the build machine’s CPU instructions into a binary that runs in a different container; bake in CPU features the runtime host lacks and it crashes on launch. GGML_CUDA_FA_ALL_QUANTS=ON builds flash-attention (a faster, memory-leaner way to compute attention) for every key-value cache format, including the compressed q8_0 cache this model uses. I skipped fast-math, the option that trades a little numerical accuracy for speed, because I wanted native kernels, not arithmetic that drifts from the production build.
I built inside NVIDIA’s CUDA “devel” container (the image that ships the full compiler toolchain), matched to the exact CUDA runtime version on the host so there was no version-mismatch risk, and launched it so it could only ever see GPU0. The 3090 Ti doing production reasoning on GPU1 was structurally invisible to the build and could not be disturbed.
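The exact launch command is not shown here, so the following is only a sketch of one way to get that isolation with Docker's `--gpus` option. The helper name `build_container_cmd`, the `/src` mount point, and both arguments are hypothetical. The image must be a CUDA devel tag that matches the host's CUDA runtime. The helper prints the command rather than running it.

```shell
# build_container_cmd IMAGE SRC_DIR  (hypothetical helper)
# Prints, without running it, a docker command that exposes only GPU index 0
# to the build container, so any other GPU stays invisible to the build.
build_container_cmd() {
    image="$1"    # a CUDA "devel" image tag matching the host's CUDA runtime
    src_dir="$2"  # path to the llama.cpp checkout on the host
    printf 'docker run --rm -it --gpus device=0 -v %s:/src -w /src %s bash\n' \
        "$src_dir" "$image"
}

# Usage (both arguments are placeholders to fill in):
#   build_container_cmd "$CUDA_DEVEL_IMAGE" "$HOME/llama.cpp"
```

Printing the command first makes it easy to confirm the device restriction before anything touches a machine with a production model running on another card.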
The libnccl.so.2 build that wouldn’t launch
Notice GGML_CUDA_NCCL=OFF in the flags. NCCL is NVIDIA’s library for multiple GPUs talking to each other during a computation. This is a single-GPU model. It does not need it. But the default is ON, and leaving it on cost me a build.
The first build came up fine inside the benchmark tool. Then the actual server refused to start: error while loading shared libraries: libnccl.so.2. The CUDA devel container I built in has NCCL installed; the slim runtime container the server actually runs in does not, and neither does the host library mount the binary borrows from. The binary had linked against a library that was not present where it would run. Set GGML_CUDA_NCCL=OFF and the dependency disappears, which is correct anyway for a single card.
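This class of failure can be caught before first launch by asking the dynamic linker what it cannot resolve. A minimal sketch, assuming `ldd` exists in the runtime container. The helper name `missing_libs` and the binary path in the usage comment are hypothetical.

```shell
# missing_libs  (hypothetical helper)
# Reads `ldd` output on stdin and prints the name of every library the
# dynamic linker could not resolve (lines of the form "libX.so => not found").
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage, run INSIDE the runtime container (the binary path is an assumption):
#   ldd ./llama-server | missing_libs
```

The check has to run where the binary will actually run. Inside the build container every library resolves, which is exactly why the problem stayed hidden there.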
How much faster the native build was
With the native multi-architecture build in place, the numbers were real, and they were better.
| Measure | Quadro-only build (PTX translated) | Native sm_75 build |
|---|---|---|
| Generation | 52.8 tok/s | ~62.5 tok/s |
| Prompt processing (live server) | ~106 tok/s | ~348 tok/s |
Generation went from 52.8 to about 62.5 tokens per second, a gain of roughly 18 percent, measured on the live server doing real 200-token generations with the model’s actual flags (two runs landed at 62.62 and 62.30). Prompt processing, the speed at which the model reads your input before it starts answering, jumped from about 106 to about 348 tokens per second, more than triple. In the raw benchmark tool with its default settings, prompt processing hit roughly 1,553 tokens per second. None of that gain existed in the first benchmark I almost trusted.
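For anyone redoing the comparison with their own numbers, the arithmetic behind the table is short. The helper names `pct_gain` and `speedup` are hypothetical, and the inputs are the figures from the table above.

```shell
# pct_gain BEFORE AFTER: percentage increase from BEFORE to AFTER, one decimal.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# speedup BEFORE AFTER: ratio AFTER/BEFORE, two decimals.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

pct_gain 52.8 62.5   # generation: prints 18.4
speedup 106 348      # prompt processing: prints 3.28
```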
The cutover was deliberately reversible. I backed up the container config, stopped the utility model just long enough to free the 8GB needed to bench cleanly (the 3090 Ti reasoning model stayed up the whole time), benched, then repointed the model from the old build directory to the new one and restarted. The old Quadro-only build stays on disk untouched: it is the known-good build for when the Quadro goes back in, and a one-line rollback if the new one misbehaves.
The check that confirmed the fix was the logs. The native build loads native code directly, so there is no PTX-translation line in the startup output. The model reported all 37 layers on the GPU, the device named as the 2060 SUPER, and no mention of runtime recompilation. That absence is how you know you are finally running the build you think you are running.
(One harmless oddity, in case it trips you up reading your own logs: the new binary reports itself as version: 1 while the production build says version: 34, even though they’re the exact same commit. That counter is a build number stamped only by the official release tooling, not anything in the code. Same hash, same flags, same behavior.)
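A complementary static check is to list which architectures have native code embedded in the build. `cuobjdump --list-elf` from the CUDA toolkit does this. The helper name `native_archs` is hypothetical, and the library path in the usage comment is an assumption about a shared-library build layout.

```shell
# native_archs  (hypothetical helper)
# Reads `cuobjdump --list-elf` output on stdin and prints each distinct sm_XX
# target that has native machine code (a cubin) embedded, one per line.
native_archs() {
    grep -o 'sm_[0-9][0-9]*' | sort -u
}

# Usage (path is an assumption; with BUILD_SHARED_LIBS=ON the CUDA kernels
# are expected in the ggml CUDA shared library rather than the server binary):
#   cuobjdump --list-elf build/bin/libggml-cuda.so | native_archs
```

If the list lacks the architecture of the card the model runs on, any benchmark from that build is measuring a translated fallback.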
What I check after a GPU swap now
- A clean health check and a plausible token rate can both be true while you run the wrong build. “Health: ok” tells you the server started, not what it started with. Read the startup logs and confirm the architecture and the absence of a runtime-recompile line before you believe any number.
- Never benchmark a build you would not actually ship. A number from the wrong binary feels like data and quietly anchors every decision after it.
- UUID pinning is correct for stability and a guaranteed surprise on a hardware swap. Pin your containers to specific cards, and write the repoint step down right next to the pin so the next swap is a one-liner.
- A build-environment dependency missing from your runtime environment is invisible until first launch. Turn off what you do not need (NCCL on a single-GPU box) and confirm the runtime can resolve every library before you trust the build.
- An 8GB Turing card is a fast utility lane, not a reasoning card. It is excellent for the quick jobs (classify this, extract that, draft a short response), and roughly 62 tokens per second on a small quantized model is plenty for that. The heavy reasoning belongs on a bigger card, like the 3090 Ti I tuned separately for that job.
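The log-reading step in the first item can be partly automated. This is a sketch under assumptions. The helper name `check_startup_log` is hypothetical, and the two log line shapes it looks for (a device line containing "compute capability X.Y" and a line of the form "offloaded N/M layers to GPU") are assumed from typical llama.cpp startup output, so verify them against your own logs. It does not look for a runtime-recompile line, because that line's wording is not given here.

```shell
# check_startup_log EXPECTED_CC  (hypothetical helper)
# Reads server startup output on stdin. Succeeds only if some line reports the
# expected compute capability AND the offload line shows every layer on the GPU.
check_startup_log() {
    awk -v cc="$1" '
        index($0, "compute capability " cc) { cc_ok = 1 }
        /offloaded [0-9]+\/[0-9]+ layers to GPU/ {
            for (i = 1; i <= NF; i++) if ($i == "offloaded") frac = $(i + 1)
            split(frac, n, "/")
            if (n[1] == n[2]) layers_ok = 1
        }
        END { exit (cc_ok && layers_ok) ? 0 : 1 }
    '
}

# Usage (container name is a placeholder):
#   docker logs <container> 2>&1 | check_startup_log 7.5 || echo "wrong card or partial offload"
```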
The whole episode took longer than the swap itself, and the payoff was an 18 percent gain I would otherwise have left on the table while congratulating myself on a clean boot.
Frequently asked questions
Why is llama.cpp slow even when it is using my GPU?
The most common silent cause is an architecture mismatch: your binary has no native code for the card it is running on, so the driver translates intermediate PTX code at runtime. The card works and the numbers look plausible, but you are running a slow fallback. Rebuilding with native code for your exact GPU architecture removes the translation layer. For me that alone lifted generation from 52.8 to about 62.5 tokens per second.
How do I specify the GPU architecture when building llama.cpp?
Pass -DCMAKE_CUDA_ARCHITECTURES to cmake with your card’s architecture number, for example -DCMAKE_CUDA_ARCHITECTURES="75" for a Turing card. List several, as in "52;75", to emit native code for multiple cards in one binary. Map the card’s compute capability to an sm_ code (7.5 becomes sm_75).
What does the libnccl.so.2 error at launch mean?
Your binary linked against NCCL in the build container, but the runtime container does not have that library. On a single-GPU box you do not need NCCL at all: rebuild with -DGGML_CUDA_NCCL=OFF and the dependency goes away.