Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive → DFlash → MTP for Qwen3.6-27B
The setup
The starting line was 43 tokens per second decode on vanilla llama.cpp. The finishing line, three months later, is 39 to 49 tokens per second decode that doesn’t collapse at long context, using a completely different speculative decoding technique than the one Claude and Ian started with. This box runs Qwen3.6-27B Q4_K_M on a single RTX 3090 Ti (the T5820 build notes are here), serving an agent stack (Hermes/k2) over the OpenAI-compatible llama.cpp HTTP API.
The full audit trail is below: autoregressive baseline, DFlash via the BeeLlama fork, then MTP via vanilla llama.cpp once the workload reality caught up to the bench. Every knob got measured, most got rejected, and the production state at the end is simpler than what it replaced.
Terms used:
- Autoregressive: baseline generation, one token at a time, no speculation.
- Drafter: small model that proposes tokens for the target to verify.
- KV cache: stored key/value pairs from previous tokens so attention doesn’t recompute every step.
- Prefill: the model reading the prompt before generating.
- Decode: generating tokens after prefill.
- TTFT: time to first token.
- MTP: multi-token prediction (extra head layers on the target that predict several tokens in parallel).
What’s running in prod right now
End state as of 2026-05-15:
- Binary: vanilla llama.cpp, `build-mtp/bin/llama-server`, built from the MTP PR branch (commit `ebe4fca`, PR #22673). PR #22673 merged to `master` on 2026-05-16, so any `master` checkout after that date ships `--spec-type mtp` natively.
- Model: `Qwen3.6-27B-Q4_K_M-mtp.gguf` from [unsloth/Qwen3.6-27B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) (MTP heads baked into the weights, no separate drafter file).
- Flags:

  ```
  --spec-type mtp \
  --spec-draft-n-max 6 \
  --spec-draft-p-min 0.75 \
  --reasoning-budget 256 \
  -c 131072 -fa on -ctk q4_0 -ctv q4_0 --jinja \
  --alias qwen3.6-27b-q4_k_m
  ```
- VRAM: ~20.4 to 20.9 GiB depending on measurement state (idle vs under light load on a 24 GiB card).
- Decode rate: 39 to 49 tok/s across tested output lengths (100, 500, 1000, 2000 tokens), low end at out=500, high end at out=2000. Plain autoregressive’s flat ~29 tok/s plus its faster prefill still beats MTP on wall clock below output ~900 tokens.
- Wall time at output 2000: 91 seconds, vs 112 seconds on DFlash and 107 on plain autoregressive.
- Reversibility: previous-state backups under `~/.config/systemd/user/llama-server.service.pre-*`. Each swap in the table below is one `cp` away from a revert.
Hermes hits the same alias it always did. Zero client changes through five binary swaps.
TL;DR
- MTP wins on wall clock above output ~900 tokens. Below that, plain autoregressive is faster.
- The DFlash Decode Collapse. DFlash decode drops from 46.9 to 30.1 tok/s as output grows from 100 to 2000 tokens. MTP holds 39 to 49 tok/s flat across the same range.
- Speculative decoding is NOT bit-for-bit lossless at temp=0 on free-form prose. Tool-call schema integrity is preserved identically. True for both DFlash and MTP. Backed empirically (the lossless probe, N=3 per workload) and academically (arxiv:2605.09992 on drafter attention drift).
- **`--spec-draft-p-min 0.75` is the vanilla llama.cpp flag** that changed the MTP verdict from “buried at Phase 2” to “shipped at Phase 9.” The filter lives in PR #22397 (April 28).
- **`--reasoning-budget 256`** saves ~10 seconds per request on Qwen3.6 with no quality regression.
- Single-prompt vendor benchmarks overstate DFlash gains by 30 to 60%. Distribution evidence (N=10+) is non-optional. N=1 misleads even at the right prompt size.
- Cache reuse beats every decode optimization when it hits. ~60x prefill speedup on realistic varying traffic with a stable prefix.
- PR #22673 (MTP) merged 2026-05-16. Builds from `master` after that date have `--spec-type mtp` natively.
What got tested
| # | Knob | Result | Status | Notes |
|---|---|---|---|---|
| 1 | Bigbatch (-ub 256 -b 2048) | +19% decode, +10% prompt, +186 MiB VRAM | ✅ kept | Free win on prompt-eval kernels. Carried into every later config. |
| 2 | DFlash (BeeLlama fork, default n_max=16) | 43 → 148 tok/s on linked-list code | ✅ shipped 2026-05-12 | First big swap. Same OpenAI/jinja API surface, same flags Hermes needed. |
| 3 | TurboQuant KV (turbo4 K + turbo3_tcq V) | Decode parity (178 vs 176), prompt -37% on sm_86 (Ampere, RTX 3090/3090 Ti) | ❌ rejected | Cross-stack corroboration: Red Hat AI researchers (Kurtić, Goin, Marques) reach the same verdict on H100 in the TurboQuant writeup on the vLLM blog. Hopper FP8 path not available on Ampere. |
| 4 | DDTree branch verify (--spec-branch-budget 22) | -49% to -58% decode across workloads | ❌ rejected | Anbeeld (BeeLlama maintainer) flags DDTree as “very much work in progress” in the README. Confirmed. |
| 5 | enable_thinking:false server-wide | +30% peak decode | ❌ rejected | Hermes/k2 hallucinated within minutes. Qwen3.6 reasoning is load-bearing. Reverted within ~15 min. |
| 6 | --spec-draft-n-max sweep (4 to 16) | 12 wins by a hair; surface flat ±5% across 10-14 | ✅ kept 12 | Initial sharp-peak finding flattened on the N=10 sweep. Same prod number, more nuance. |
| 7 | Q8_0 drafter (1.77 GB) | Tied Q4_K_M at N=10 | ❌ rejected | N=3 looked like a +10% surprise; N=10 collapsed it to noise. |
| 8 | Q5_K_S target (18 GB) + Q4_K_M drafter | +5% code, -10% chat | ❌ rejected | “Precision combo” wins on pure code. Hermes traffic is reasoning + chat heavy. |
| 9 | q8_0 KV cache (instead of q4_0) | -25% throughput; VRAM +1.8 GB | ❌ rejected | Re-confirmed pre-DFlash lesson under DFlash. Workload-consistent penalty. |
| 10 | CopySpec (suffix matching, no drafter) | >300x slowdown: 150-token prompt timed out at 600s | ❌ rejected | The drafter is load-bearing: without repetitive structure to match against, CopySpec never finished. On these workloads the drafter is the entire performance story for speculative decoding. |
| 11 | MTP, first pass (n_max=3, no p_min) | 1.4x autoregressive; ~2x slower than DFlash on long workloads | ❌ buried (Phase 2) | Tested at short context with original flags. Same prose drift profile as DFlash. Looked dead. Wasn’t. |
| 12 | **MTP, second pass (n_max=6 --spec-draft-p-min 0.75)** | 1.8x autoregressive; decode doesn’t collapse at long context | ✅ shipped 2026-05-15 | The --spec-draft-p-min filter (vanilla llama.cpp, PR #22397, April 28) changed the verdict. Decode holds 39 to 49 tok/s across every output length while DFlash drops 47 → 30 tok/s. |
| 13 | --reasoning-budget 256 | Saves ~10 seconds per request, no quality regression | ✅ shipped 2026-05-15 | Caps runaway reasoning chains at 256 tokens. Highest-impact, lowest-risk flag in the sweep. |
End-state decode rates:
| output_len | Autoregressive tok/s | DFlash tok/s | MTP (prod) tok/s | Wall-clock winner |
|---|---|---|---|---|
| 100 | 28.9 | 46.9 | 44.6 | Autoregressive (prefill dominates) |
| 500 | 29.1 | 37.0 | 39.1 | Autoregressive by 4-7s |
| 1000 | 29.1 | 30.2 | 44.4 | MTP/autoregressive tied within 86ms |
| 2000 | 29.0 | 30.1 | 48.9 | MTP by 16-21s |
What does “speed up a local LLM” actually mean?
Three numbers that decouple at production scale:
- Decode rate (tok/s): how fast tokens come out once generation starts.
- TTFT (time-to-first-token): how long until the first visible character appears.
- Wall clock (TTFT + output length ÷ decode rate): what users actually feel.
Most speculative decoding marketing optimizes the first number. Production users feel the third. At 43K of context (Hermes-shape traffic, N=10 sampling), prefill is roughly 80% of wall clock, reasoning is ~12%, and decode is the remaining ~8%. A 2x decode improvement doesn’t double the wall clock. It nudges the smallest of three timescales.
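To make that arithmetic concrete, here is a quick sketch using the rough 80/12/8 split measured above; the total request time is an assumed example, not a new measurement:

```python
# Rough wall-clock model for a 43K-context Hermes-shape request.
# Shares are the measured ~80% prefill / ~12% reasoning / ~8% decode split.
wall_s = 60.0                      # assumed total request time, for illustration
prefill_s = 0.80 * wall_s          # ~48 s reading the prompt
reasoning_s = 0.12 * wall_s        # ~7.2 s of reasoning tokens
decode_s = 0.08 * wall_s           # ~4.8 s of visible output

# Double the decode rate: the decode slice halves, everything else is untouched.
wall_2x_decode = prefill_s + reasoning_s + decode_s / 2
print(f"2x decode: {wall_s:.1f}s -> {wall_2x_decode:.1f}s "
      f"({100 * (1 - wall_2x_decode / wall_s):.0f}% faster)")   # ~4% faster

# Cut prefill by the ~60x cache-reuse factor instead: the wall clock collapses.
wall_cached = prefill_s / 60 + reasoning_s + decode_s
print(f"cached prefix: {wall_s:.1f}s -> {wall_cached:.1f}s")
```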
Three months of decode tuning got production from 43 to 124 tok/s on a synthetic linked-list benchmark, then a single Hermes-shape bench at 43K context showed the gains evaporating at the real workload. The fix was changing the speculative decoding technique to one that doesn’t collapse at long context.
Plain autoregressive: the baseline
Plain autoregressive means generating one token at a time with no speculation, no drafter, no MTP heads. It’s the reference every measurement here compares against, and it’s a live competitor: on certain workloads it still wins on wall clock.
On Qwen3.6-27B Q4_K_M with -c 131072 -fa on -ctk q4_0 -ctv q4_0, plain autoregressive decodes at ~29 tok/s and stays there regardless of output length. It has the fastest prefill of the three modes (~37.5s at 43K context, vs MTP’s ~49s and DFlash’s ~44s). No drafter KV cache means no bandwidth contention and no collapse at long output. It also never speeds up.
How DFlash got here (2026-05-12)
The Dell T5820 install was the hardware story (companion post forthcoming). DFlash was the software follow-up. An initial scan of Luce-Org/lucebox-hub (advertising 3.43x decode + 10x TTFT on an RTX 3090) ran into a blocker: their daemon is a raw generate primitive with no OpenAI API, no jinja chat templates, and no tool calling. Slotting it behind Hermes/k2 would have needed a chat-template shim written from scratch.
BeeLlama.cpp by Anbeeld already had the shim baked in: DFlash speculative decoding, TurboQuant KV cache, and CopySpec fallback layered onto the OpenAI server with --jinja and tool-call detection preserved. Different binary. Same flags Hermes needed.
The clean A/B: same Qwen3.6-27B Q4_K_M target, same KV quant, same -c 131072, same -fa on. Same workload (1200-token Python linked-list class, temperature 0, seed 42). BeeLlama with --spec-type dflash vs the same BeeLlama with no --spec-* flags. DFlash was the only variable.
| Config | Decode tok/s | Prompt tok/s | VRAM MiB | vs Autoregressive |
|---|---|---|---|---|
| Autoregressive baseline | 43.15 | 202 | 18634 | 1.00x |
| DFlash, q4_0 KV | 148.46 | 189 | 19760 | 3.44x |
| DFlash + bigbatch, thinking ON | ~124 | ~210 | 20372 | 2.88x |
| DFlash + bigbatch, thinking OFF (peak) | 176.02 | 209 | 19946 | 4.08x |
The headline is the third row. Server-wide enable_thinking:false was tested, ran for ~15 minutes in production, and reverted because the model started narrating work it never did (“running on a Pi, give it a second” on a 3090 Ti) and made up status messages. Qwen3.6 is a reasoning model. Server-wide thinking-off tanks output quality across the agent stack, and reasoning back on costs ~30% of the peak.
Tool calling stayed intact through the swap. Standard OpenAI-shape tools array request came back with finish_reason: "tool_calls" and a clean tool_calls array. No shim needed.
The drafter knobs
DFlash’s default --spec-draft-n-max is 16: the drafter guesses up to 16 tokens per round, the target verifies them all at once. Anything the drafter got right is free; anything past the first wrong guess is wasted compute. A sweep across 4, 8, 12, 16 (plus a fine pass at 10, 11, 13, 14 a week later) put the optimum at 12 on this workload, a wide valley flat to ±5% across n_max 10-14.
| Config | latency (tok/s) | tool-call (tok/s) | chat-short (tok/s) | code-long (tok/s) |
|---|---|---|---|---|
| prod (n_max=16, cross=1024, adaptive ON) | 92 | 76 | 59 | 130 |
| nmax-8 noadapt | 104 | 75 | 63 | 126 |
| nmax-12 noadapt | 111 | 78 | 73 | 137 |
| nmax-16 noadapt | 104 | 77 | 65 | 126 |
| crossctx-2048 | 98 | 70 | 64 | 157 |
The chat-short bump (+24% decode, -27% wall clock) came from making each wrong guess cheaper. With n_max=16, every rejected draft on `<think>` content burned 12-15 wasted tokens. At n_max=12, the same rejections cost 8-11 tokens. Multiplied across thousands of speculation cycles per response, that’s the 24%.
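The per-cycle arithmetic, as a hedged sketch; the acceptance length on reasoning content and the cycle count are assumed illustrative values, not logged numbers:

```python
# Wasted drafter tokens per speculation cycle: everything drafted past the
# first rejection is thrown away. Numbers are illustrative, not measured.
def wasted_per_cycle(n_max: int, accepted: int) -> int:
    return max(n_max - accepted, 0)

accepted_on_think = 3          # assumed typical acceptance on <think> content
cycles_per_response = 2000     # assumed: thousands of cycles per long response

for n_max in (12, 16):
    waste = wasted_per_cycle(n_max, accepted_on_think) * cycles_per_response
    print(f"n_max={n_max}: ~{waste:,} wasted draft tokens per response")
```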
The lossless probe
The textbook claim: speculative decoding with rejection sampling (the mechanism that accepts draft tokens matching the target’s distribution and corrects ones that don’t) at temperature 0 produces output bit-for-bit identical to autoregressive. At temp=0 there’s no random sampling, so every token should land on the target model’s argmax. Pure speed optimization, zero quality impact.
Nobody on either side of the Qwen 3.6 and BeeLlama conversation had published a measurement, so Claude and Ian ran one: same target, temp=0, fixed seed 42, five deterministic prompts. We cached the autoregressive baseline, then ran DFlash against it with a character-level Levenshtein diff.
| Workload | Identical to autoregressive? | Median char drift | Notes |
|---|---|---|---|
| latency (single digit “4”) | YES (3/3) | 0 | Trivial |
| tool-call (get_weather) | YES (3/3) | 0 | Schema + args match exactly |
| chat-short (TCP handshake, ~1300 chars) | NO (0/3) | ~1100 | ~86% of reference length. Semantically similar, textually distinct. |
| code-long (Python class, ~2600 chars) | NO (0/3) | 94 | ~3-6% drift. Variable names + docstrings varied. |
Lossless held narrowly for short deterministic answers and structured outputs like tool calls. It broke for sustained prose past a few hundred tokens. Probable cause: drafter distributional drift. The DFlash drafter is not an exact match for the target’s logit distribution, so at rejection-sampling boundaries where two tokens sit at near-equal probability, even small drafter drift flips the accepted token. One flipped token branches into a different sentence.
The agentic-stack consequence: tool-call schema integrity is preserved (tool_call_schema_match_all = true across all iterations). Function names, argument keys, JSON shape all stay identical run-to-run. Free-form chat text varies at temp=0, the same way any non-deterministic backend would.
The MTP head-to-head (Phase 2) ran the same probe a week later and got the same drift profile: ~1000 chars on chat-short, lossless on tool calls. Two implementations, same theoretical guarantee failing the same way. Academic backing arrived at the right time: arxiv:2605.09992 “Attention Drift in Autoregressive Speculative Decoding Drafters” measured the same phenomenon at the model-internal level. Our Levenshtein probe and their attention-pattern analysis are pointing at the same thing from opposite ends.
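The core of the probe is just a character-level edit distance between the cached autoregressive output and the speculative output for the same prompt. A minimal sketch (plain-Python DP, no external Levenshtein package; the file paths are hypothetical stand-ins for the cached runs):

```python
# Character-level Levenshtein distance between two completions.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical cached outputs: same prompt, temp=0, seed 42, two backends.
baseline = open("out/autoregressive/chat-short.txt").read()
spec     = open("out/dflash/chat-short.txt").read()

drift = levenshtein(baseline, spec)
print(f"char drift: {drift} over {len(baseline)} reference chars "
      f"({100 * drift / len(baseline):.1f}%)")
```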
Phase 1: what lost
Before committing to MTP testing, four more configs against the n_max=12 baseline:
| Config | latency (tok/s) | tool-call (tok/s) | chat-short (tok/s) | code-long (tok/s) | Outcome |
|---|---|---|---|---|---|
| prod (nmax=12) | 105.6 | 77.4 | 69.7 | 131.3 | Baseline |
| nmax-4 | 77.6 (-26%) | 64.9 (-16%) | 58.7 (-16%) | 82.4 (-37%) | Regression on every workload |
| nmax-8 | 105.7 (tie) | 74.7 (-3%) | 63.2 (-9%) | 124.4 (-5%) | Strictly worse |
| q8-kv (nmax=12) | 97.3 (-8%) | 68.0 (-12%) | 63.9 (-8%) | 100.2 (-24%) | 25% penalty confirmed, VRAM +1.8 GB |
| copyspec | TIMED OUT at >600s | – | – | – | Catastrophic |
The Unsloth MTP configuration guide’s recommendation of n_max=2 does not transfer to DFlash: the n_max curve gets monotonically worse as the window shrinks. MTP and DFlash are different techniques with different optimal draft windows. CopySpec without a drafter is a 300x slowdown, not a floor: BeeLlama’s README describes it as model-free suffix matching, and on the 150-token latency prompt it didn’t complete in 600 seconds. The drafter is load-bearing; the drafter is the entire performance story for speculative decoding on workloads with no repetitive structure.
Phase 2: MTP buried
Multi-token prediction in vanilla llama.cpp (PR #22673, am17an) predicts multiple target-model tokens in parallel without a separate draft model. Simpler architecture, smaller VRAM footprint, comparable advertised speedup.
The first MTP test ran at short context with --spec-draft-n-max 3 (no --spec-draft-p-min flag existed yet). Same target weights as DFlash, same q4_0 KV, reasoning ON. Three findings: MTP wasn’t lossless on prose either (~1000 chars drift on chat-short, same magnitude as DFlash); MTP ran ~1.3-1.4x slower than DFlash on matched chat-short workloads (52-57 tok/s vs ~73 tok/s); MTP preserved tool-call schema integrity (safe for Hermes).
Conclusion at the time: same drift profile on prose, lower throughput on matched workloads, no operational win. DFlash stays. Phase 2 looked dead. It was on the wrong settings, at the wrong context size.
Phase 6: prefill is most of TTFT
Production was running DFlash + bigbatch + nmax=12. Hermes felt slow. The decode bench said 70 tok/s on chat-short, which should have been ~10 seconds wall-clock on a typical answer. Real Hermes traffic was hitting 12-32 second TTFT. The numbers didn’t add up.
So Ian and Claude ran one bench at the actual Hermes workload shape: ~43K context (system message + multi-turn history + tools array), reasoning ON, then sent the same body twice in a row.
| Metric | Iter 1 (cold) | Iter 2 (warm, identical body) |
|---|---|---|
| Wall TTFT | 48.48s | 7.32s |
| Server prompt_ms (prefill) | 46.90s | 0.24s |
| Tokens evaluated (prompt_n) | 43,241 | 4 |
| Tokens reused (cache_n) | 0 | 43,237 |
| Decode rate | 33.9 tok/s | 43.2 tok/s |
Cold prefill was most of TTFT. The model spent 46.9 seconds reading the prompt before generating anything. The whole decode-throughput investigation had been tuning the smallest slice of the wall clock. Cache reuse was the entire game when it worked: on the second request, 43,237 of 43,241 tokens were reused and prefill dropped to 0.24s, a ~200x speedup on the slice that actually dominates wall clock.
Phase 7 immediately re-validated at N=10-20 and corrected the headlines: the 200x cache speedup required byte-identical bodies, realistic Hermes traffic (varying user message turn-to-turn) gets closer to 60x. The “96.7% prefill share of TTFT” was a high-tail observation; at N=10 the median is 88%. Reasoning budget effect, originally measured as <5%, was actually 20.7% with proper sampling. Three Phase 6 numbers off by 1.5x to 4x from a single observation.
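A minimal version of the two-request probe, as Python over the OpenAI-compatible endpoint. The URL and system-prompt file are hypothetical stand-ins for the prod setup; llama.cpp chat-completion responses include a `timings` object whose exact fields vary by build, so the sketch just prints whatever the server reports:

```python
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"   # hypothetical prod endpoint

# Identical body both times: a byte-identical prefix is what makes the cache hit.
body = json.dumps({
    "model": "qwen3.6-27b-q4_k_m",
    "messages": [
        {"role": "system", "content": open("hermes_system_prompt.txt").read()},
        {"role": "user", "content": "Summarize the last tool result."},
    ],
    "temperature": 0,
    "max_tokens": 100,
}).encode()

for label in ("cold", "warm"):
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.time()
    resp = json.loads(urllib.request.urlopen(req).read())
    print(label, f"wall={time.time() - t0:.2f}s",
          "timings:", resp.get("timings", {}))
```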
Phase 7: DFlash hurts prefill
The Phase 6 framing was “speculative decoding optimizes the wrong axis.” It implicitly assumed DFlash was neutral on prefill. Phase 7 actually measured it.
| Metric | Autoregressive (N=10) | DFlash (N=10) | Δ |
|---|---|---|---|
| Prefill at 43K context | 37,498 ms | 44,153 ms | DFlash is 17.8% SLOWER |
| Decode at 43K context (out=100) | 29.3 tok/s | 38.4 tok/s | DFlash +30% |
DFlash trades prefill speed for decode speed: the drafter prefills alongside the target, adding 17.8% wall time at 43K context. On a TTFT-dominated workload (large system message, mostly tool calls + short replies, which is roughly what Hermes runs), DFlash was making user-felt latency worse, not better. Production was about to swap.
Phase 8: the DFlash Decode Collapse
Two sweeps, four output lengths each at N=3.
The DFlash Decode Collapse:
| output_len | Autoregressive wall (ms) | DFlash wall (ms) | DFlash decode tok/s |
|---|---|---|---|
| 100 | 41,888 | 48,394 | 46.9 |
| 500 | 55,785 | 59,995 | 37.0 |
| 1000 | 72,966 | 79,528 | 30.2 |
| 2000 | 107,476 | 112,567 | 30.1 |
DFlash decode at out=100 is 46.9 tok/s. At out=2000 it’s 30.1, basically autoregressive speed. The drafter’s KV cache grows alongside the target’s, the small drafter is more bandwidth-bound, and the speedup erodes as the conversation gets longer. By out=2000, DFlash is paying its 6-7 second prefill tax for no decode benefit. arxiv:2604.26412 “When Hidden States Drift: KV Caches and Long-Range Speculative Decoding” names this drafter-bandwidth bottleneck at the research level; the table above is the practitioner measurement. Three months of tuning a drafter ratio that evaporates at output > 1000 tokens.
**MTP, with the --spec-draft-p-min 0.75 filter on drafter logits:**
| output_len | Autoregressive wall (ms) | DFlash wall (ms) | new MTP wall (ms) | MTP decode tok/s |
|---|---|---|---|---|
| 100 | 41,888 | 48,394 | 52,609 | 44.5 |
| 500 | 55,785 | 59,995 | 62,951 | 39.1 |
| 1000 | 72,966 | 79,528 | 73,089 | 44.5 |
| 2000 | 107,476 | 112,567 | 91,427 | 48.8 |
MTP’s decode rate does not collapse. It holds 39 to 49 tok/s across every output length tested. MTP has no separate drafter model: the multi-token heads share the target’s hidden state and its KV cache. No second KV cache to feed, no bandwidth contention, no drafter-KV-grows-with-output bottleneck.
Crossover math: MTP has worse prefill (~49s vs autoregressive’s 37.5s) but sustained-high decode (~46 tok/s vs autoregressive’s 29). MTP overcomes its 11.5s prefill tax at output ~900 tokens: 11.5 / (1/29 - 1/46) ≈ 902. Below that, autoregressive wins. Above, MTP wins. DFlash is below both at every tested length.
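The same crossover arithmetic as a quick check, using the measured rates (the prefill tax is MTP’s ~49s minus autoregressive’s ~37.5s):

```python
# Output length at which MTP's faster decode pays back its slower prefill.
prefill_tax_s = 49.0 - 37.5          # MTP prefill minus autoregressive prefill
ar_decode = 29.0                     # tok/s, flat
mtp_decode = 46.0                    # tok/s, sustained

# Seconds saved per generated token once decoding starts:
saved_per_token = 1 / ar_decode - 1 / mtp_decode
crossover_tokens = prefill_tax_s / saved_per_token
print(f"crossover at ~{crossover_tokens:.0f} output tokens")   # ~902
```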
Phase 2 buried MTP at short context with the wrong settings. The p_min=0.75 filter plus a long-context workload exhumed it.
Can a second GPU help?
The first instinct on the bandwidth-starved-drafter problem is to throw a second GPU at it: drafter on one card, target on the other, in parallel. The math doesn’t reward it. Within a single draft/verify cycle there’s no parallelism (target verification depends on drafter output). Across cycles, async or lookahead spec decoding gives a theoretical speedup ceiling of 1 + min(time_d, time_t) / max(time_d, time_t).
At long context the drafter dominates the cycle, so parallel speedup tops out around 1.125x. A second 3090 Ti buys a 1.2x win for ~$700. MTP gives the same architectural win for $0 via shared KV and no bandwidth contention. llama.cpp doesn’t support async 2-GPU spec decoding anyway. vLLM and TensorRT-LLM do, which means buying hardware AND switching the inference stack.
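The ceiling formula, plugged in. The drafter/target split at long context is an assumed illustrative ratio consistent with the ~1.125x figure above, not a measurement:

```python
# Async/lookahead speculative decoding across two GPUs can at best overlap the
# shorter half of the draft/verify cycle with the longer half.
def speedup_ceiling(time_draft: float, time_target: float) -> float:
    return 1 + min(time_draft, time_target) / max(time_draft, time_target)

# Assumed long-context split: the bandwidth-bound drafter dominates the cycle.
print(speedup_ceiling(time_draft=8.0, time_target=1.0))   # 1.125
```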
Phase 9: production switch (2026-05-15)
The llama-server unit on ubuntu1 got rewritten. BeeLlama out, vanilla llama.cpp build-mtp in. Standard Q4_K_M GGUF out, MTP-variant Q4_K_M in. Separate drafter file dropped entirely. Hermes didn’t need touching: same alias, same OpenAI shape, zero client changes. About 5 minutes total wall-time.
Verification bench against the live prod unit, N=3, four output lengths:
| output_len | wall median (ms) | decode tok/s | reasoning_chars median |
|---|---|---|---|
| 100 | 53,138 | 44.6 | 145 |
| 500 | 62,997 | 39.1 | 186 |
| 1000 | 73,052 | 44.4 | 803 |
| 2000 | 91,304 | 48.9 | 816 |
Numbers match Phase 8 standalone within 1%. --reasoning-budget 256 introduces no measurable regression. Production VRAM 20,902 MiB. The swap deleted code.
Sidebar: Tailscale relay vs LAN throughput
The MTP-variant model swap was painful the first time. Mac Studio and ubuntu1 are on the same Tailscale network, so the obvious move is scp over the 100.x address. That tops out at 1 to 2 MB/s because Tailscale routes through a DERP relay in Seattle. The 16 GB swap would have taken hours.
Both boxes are also physically Ethernet-bridged on the local LAN: ubuntu1 at 192.168.2.2, Mac Studio at 192.168.2.1. Same scp command pointed at the LAN address: 100 MB/s, ~3 minutes for the 16 GB model. 50 to 100x speedup over the Tailscale path.
The DigitalOcean droplet pulled MTP weights directly from HuggingFace at 67 MB/s when the Tailscale-from-Mac-Studio attempt hung on auth for 30 minutes. Public-internet egress was 30x faster than the mesh peer.
Methodology lessons
1. Distribution evidence is mandatory. Single-prompt benchmarks inflated DFlash gains 30 to 60% versus an 8-prompt diversity sweep. N=1 misleads even at the right prompt size: Phase 6 → 7 had three headline numbers off by 1.5x to 4x from one observation. When a vendor publishes “3.4x,” assume the median is closer to 1.5 to 2x. And match the bench prompt size to production: Phase 5 was the right experiment on the wrong workload (Phase 7 redid it at 43K context and got a 4x larger effect). A median-first bench shape is sketched right after this list.
2. Decode tok/s and TTFT decouple on reasoning models at production context. At 43K context they’re three timescales (prefill, reasoning, decode) at roughly 80% / 12% / 8% of wall clock. Optimize what users feel.
3. Spec decoding is workload-dependent, and “workload” includes output length. DFlash is 1.6x on chat-short (out~700) and 1.04x at out=2000. Autoregressive holds ~29 tok/s flat. MTP holds 39 to 49 tok/s under prod flags. Pick the technique whose curve fits your traffic. Cache reuse is binary: ~60x prefill speedup with a stable prefix, full cost when it isn’t.
4. Spec decoding at temp=0 is NOT bit-for-bit lossless on prose. Identical on tool calls and one-token answers, completely different on free-form prose. True for both DFlash and MTP. Both implementations fail the textbook lossless guarantee on prose-heavy workloads.
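The median-first bench shape from lesson 1, sketched; the per-run decode rates here are placeholders for whatever your harness records per iteration:

```python
import statistics

# One decode-rate sample per iteration, e.g. from repeated identical requests.
runs_tok_s = [38.2, 47.1, 44.6, 39.5, 46.0, 41.3, 45.2, 40.8, 48.3, 43.9]

best = max(runs_tok_s)
median = statistics.median(runs_tok_s)
print(f"N=1 headline (lucky run): {best:.1f} tok/s")
print(f"N={len(runs_tok_s)} median: {median:.1f} tok/s, "
      f"range {min(runs_tok_s):.1f}-{max(runs_tok_s):.1f}")
```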
FAQ
What is multi-token prediction (MTP)?
MTP adds “head” layers to the target model that predict several tokens in parallel with the main next-token prediction. Drafts get verified on the next forward pass: accepted tokens are free, the first rejected one cuts off the rest. No separate drafter model, shared KV cache, same speculative-decoding mechanism as DFlash but with drafts coming from inside the target.
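At temp=0 the accept/reject rule reduces to an argmax comparison: keep drafted tokens while they match what the target would have picked anyway, stop at the first mismatch. A toy sketch of that acceptance logic (the model calls are stand-in functions, not llama.cpp internals, and the real implementation scores all draft positions in a single forward pass rather than one call per token):

```python
from typing import Callable, List

def verify_greedy(draft: List[int],
                  target_argmax: Callable[[List[int]], int],
                  context: List[int]) -> List[int]:
    """Greedy (temp=0) verification: accept draft tokens that equal the
    target's argmax at each position; stop at the first mismatch and
    substitute the target's own token there."""
    accepted: List[int] = []
    for tok in draft:
        want = target_argmax(context + accepted)   # target's next-token pick
        if tok == want:
            accepted.append(tok)                   # free token
        else:
            accepted.append(want)                  # correction; rest of draft is wasted
            break
    return accepted
```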
Does MTP change the output?
On tool calls and short structured answers, bit-for-bit identical to autoregressive at temp=0 (verified N=3 per workload). On free-form prose past a few hundred tokens, ~1000 characters of textual drift on chat-short, same magnitude as DFlash. Semantically equivalent, textually different. For regression tests that diff against gold outputs, disable speculative decoding entirely.
MTP vs DFlash on a 3090 Ti, which one’s faster?
Depends on output length. Below output 500, autoregressive > DFlash > MTP because prefill dominates. By output 1000, autoregressive ≈ MTP > DFlash. By output 2000, MTP > autoregressive > DFlash by 16 to 21 seconds wall clock. DFlash decode collapses from 47 to 30 tok/s as output grows because its drafter’s KV cache competes for bandwidth. MTP shares the target’s KV, no collapse.
InsiderLLM has a DFlash-vs-MTP head-to-head on the same hardware that benches a single short-output point. The DFlash Decode Collapse only appears past output 500-1000 tokens, which is why it doesn’t surface in short-prompt comparisons.
Does MTP work on Qwen3.6-27B dense?
Yes. Unsloth ships Qwen3.6-27B-MTP-GGUF with the heads baked in. The HackMD MoE benchmark initially said “MTP doesn’t help” on Qwen3.6-35B-A3B, then a May 8 2026 update flipped to +27.5% with corrected flags. The MoE story is still evolving. Dense 27B with --spec-type mtp --spec-draft-p-min 0.75 --spec-draft-n-max 6 runs at 39 to 49 tok/s across the tested output lengths.
Why does MTP need a custom llama.cpp build?
PR #22673 (am17an, opened May 4 2026, merged to master May 16 2026) added --spec-type mtp. Builds from master after that date have it natively. For earlier checkouts, fetch the PR branch and build with -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_NATIVE=ON. Takes ~4 to 13 minutes depending on cache state.
What about Apple Silicon?
This post is RTX 3090 Ti specific. MTP on Metal has different gotchas (issue #23011 flags “MTP slower than baseline on Apple Metal despite high acceptance” on the 35B-A3B variant). The MLX backend is a separate story. For the Apple Silicon side of local inference (LM Studio tuning, KV-cache quantization, the sysctl GPU-memory-cap fix), see LM Studio Errors on Apple Silicon.
Money quotes
- “Same box, same model, same flags. Different binary. Nearly three times the throughput.”
- “Speculative decoding is supposed to be lossless at temp=0. We measured it. It isn’t, on prose. Tool calls survive.”
- “CopySpec without the drafter is a 300x slowdown. The drafter isn’t overhead. The drafter is the entire performance story.”
- “Three months of tuning a drafter ratio that evaporates at output > 1000 tokens.”
- “MTP doesn’t have a separate drafter. The heads are part of the same forward pass. There’s no second KV cache to feed. So when the context gets long, MTP doesn’t slow down. DFlash does.”
- “Production went from BeeLlama + DFlash + custom drafter back to vanilla llama.cpp + the MTP-variant GGUF. The swap deleted code.”
- “Hermes never noticed. We changed the binary, the model, the spec-decoding technique, and the drafter situation, and Hermes kept hitting the same alias.”
*Companion post forthcoming: Building a 3090 Ti Homelab Inference Node on a Dell Precision T5820. All commands and configs reproducible. Bench harnesses at ~/revalidation-results/ on ubuntu1.*