Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive → DFlash → MTP for Qwen3.6-27B
The setup
The starting line was 43 tokens per second decode on vanilla llama.cpp. The finishing line, three months later, is 39 to 49 tokens per second decode that doesn’t collapse at long context, using a completely different speculative decoding technique than the one Claude and Ian started with. This box runs Qwen3.6-27B Q4_K_M on a single RTX 3090 Ti (the T5820 build notes are here), serving an agent stack (Hermes/k2) over the OpenAI-compatible llama.cpp HTTP API.
The full audit trail is below: autoregressive baseline, DFlash via the BeeLlama fork, then MTP via vanilla llama.cpp once the workload reality caught up to the bench. Every knob got measured, most got rejected, and the production state at the end is simpler than what it replaced.
Terms used:
- Autoregressive: baseline generation, one token at a time, no speculation.
- Drafter: small model that proposes tokens for the target to verify.
- KV cache: stored key/value pairs from previous tokens so attention doesn’t recompute every step.
- Prefill: the model reading the prompt before generating.
- Decode: generating tokens after prefill.
- TTFT: time to first token.
- MTP: multi-token prediction (extra head layers on the target that predict several tokens in parallel).
What’s running in prod right now
End state as of 2026-05-15:
- Binary: vanilla llama.cpp, `build-mtp/bin/llama-server`, built from the MTP PR branch (commit `ebe4fca`, PR #22673). PR #22673 merged to `master` on 2026-05-16, so any `master` checkout after that date ships `--spec-type mtp` natively.
- Model: `Qwen3.6-27B-Q4_K_M-mtp.gguf` from [unsloth/Qwen3.6-27B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) (MTP heads baked into the weights, no separate drafter file).
- Flags:

  ```
  --spec-type mtp \
  --spec-draft-n-max 6 \
  --spec-draft-p-min 0.75 \
  --reasoning-budget 256 \
  -c 131072 -fa on -ctk q4_0 -ctv q4_0 --jinja \
  --alias qwen3.6-27b-q4_k_m
  ```
- VRAM: ~20.4 to 20.9 GiB depending on measurement state (idle vs under light load on a 24 GiB card).
- Decode rate: 39 to 49 tok/s across tested output lengths (100, 500, 1000, 2000 tokens), low end at out=500, high end at out=2000. Plain autoregressive’s flat ~29 tok/s plus its faster prefill still beats MTP on wall clock below output ~900 tokens.
- Wall time at output 2000: 91 seconds, vs 112 seconds on DFlash and 107 on plain autoregressive.
- Reversibility: previous-state backups under `~/.config/systemd/user/llama-server.service.pre-*`. Each swap in the table below is one `cp` away from a revert.
Hermes hits the same alias it always did. Zero client changes through five binary swaps.
TL;DR
- MTP wins on wall clock above output ~900 tokens. Below that, plain autoregressive is faster.
- The DFlash Decode Collapse. DFlash decode drops from 46.9 to 30.1 tok/s as output grows from 100 to 2000 tokens. MTP holds 39 to 49 tok/s flat across the same range.
- Speculative decoding is NOT bit-for-bit lossless at temp=0 on free-form prose. Tool-call schema integrity is preserved identically. True for both DFlash and MTP. Backed empirically (the lossless probe, N=3 per workload) and academically (arxiv:2605.09992 on drafter attention drift).
- **`--spec-draft-p-min 0.75` is the vanilla llama.cpp flag** that changed the MTP verdict from “buried at Phase 2” to “shipped at Phase 9.” The filter lives in PR #22397 (April 28).
- **`--reasoning-budget 256`** saves ~10 seconds per request on Qwen3.6 with no quality regression.
- Single-prompt vendor benchmarks overstate DFlash gains by 30 to 60%. Distribution evidence (N=10+) is non-optional. N=1 misleads even at the right prompt size.
- Cache reuse beats every decode optimization when it hits. ~60x prefill speedup on realistic varying traffic with a stable prefix.
- PR #22673 (MTP) merged 2026-05-16. Builds from `master` after that date have `--spec-type mtp` natively.
What got tested
| # | Knob | Result | Status | Notes |
|---|---|---|---|---|
| 1 | Bigbatch (-ub 256 -b 2048) | +19% decode, +10% prompt, +186 MiB VRAM | ✅ kept | Free win on prompt-eval kernels. Carried into every later config. |
| 2 | DFlash (BeeLlama fork, default n_max=16) | 43 → 148 tok/s on linked-list code | ✅ shipped 2026-05-12 | First big swap. Same OpenAI/jinja API surface, same flags Hermes needed. |
| 3 | TurboQuant KV (turbo4 K + turbo3_tcq V) | Decode parity (178 vs 176), prompt -37% on sm_86 (Ampere, RTX 3090/3090 Ti) | ❌ rejected | Cross-stack corroboration: Red Hat AI researchers (Kurtić, Goin, Marques) reach the same verdict on H100 in the TurboQuant writeup on the vLLM blog. Hopper FP8 path not available on Ampere. |
| 4 | DDTree branch verify (--spec-branch-budget 22) | -49% to -58% decode across workloads | ❌ rejected | Anbeeld (BeeLlama maintainer) flags DDTree as “very much work in progress” in the README. Confirmed. |
| 5 | enable_thinking:false server-wide | +30% peak decode | ❌ rejected | Hermes/k2 hallucinated within minutes. Qwen3.6 reasoning is load-bearing. Reverted within ~15 min. |
| 6 | --spec-draft-n-max sweep (4 to 16) | 12 wins by a hair; surface flat ±5% across 10-14 | ✅ kept 12 | Initial sharp-peak finding flattened on the N=10 sweep. Same prod number, more nuance. |
| 7 | Q8_0 drafter (1.77 GB) | Tied Q4_K_M at N=10 | ❌ rejected | N=3 looked like a +10% surprise; N=10 collapsed it to noise. |
| 8 | Q5_K_S target (18 GB) + Q4_K_M drafter | +5% code, -10% chat | ❌ rejected | “Precision combo” wins on pure code. Hermes traffic is reasoning + chat heavy. |
| 9 | q8_0 KV cache (instead of q4_0) | -25% throughput; VRAM +1.8 GB | ❌ rejected | Re-confirmed pre-DFlash lesson under DFlash. Workload-consistent penalty. |
| 10 | CopySpec (suffix matching, no drafter) | >300x slowdown: 150-token prompt timed out at 600s | ❌ rejected | The drafter is load-bearing: without repetitive structure to match against, CopySpec never finished. On these workloads the drafter is the entire performance story for speculative decoding. |
| 11 | MTP, first pass (n_max=3, no p_min) | 1.4x autoregressive; ~2x slower than DFlash on long workloads | ❌ buried (Phase 2) | Tested at short context with original flags. Same prose drift profile as DFlash. Looked dead. Wasn’t. |
| 12 | **MTP, second pass (n_max=6 --spec-draft-p-min 0.75)** | 1.8x autoregressive; decode doesn’t collapse at long context | ✅ shipped 2026-05-15 | The --spec-draft-p-min filter (vanilla llama.cpp, PR #22397, April 28) changed the verdict. Decode holds 39 to 49 tok/s across every output length while DFlash drops 47 → 30 tok/s. |
| 13 | --reasoning-budget 256 | Saves ~10 seconds per request, no quality regression | ✅ shipped 2026-05-15 | Caps runaway reasoning chains at 256 tokens. Highest-impact, lowest-risk flag in the sweep. |
End-state decode rates:
| output_len | Autoregressive tok/s | DFlash tok/s | MTP (prod) tok/s | Wall-clock winner |
|---|---|---|---|---|
| 100 | 28.9 | 46.9 | 44.6 | Autoregressive (prefill dominates) |
| 500 | 29.1 | 37.0 | 39.1 | Autoregressive by 4-7s |
| 1000 | 29.1 | 30.2 | 44.4 | MTP/autoregressive tied within 86ms |
| 2000 | 29.0 | 30.1 | 48.9 | MTP by 16-21s |
What does “speed up a local LLM” actually mean?
Three numbers that decouple at production scale:
- Decode rate (tok/s): how fast tokens come out once generation starts.
- TTFT (time-to-first-token): how long until the first visible character appears.
- Wall clock (TTFT + output length ÷ decode rate): what users actually feel.
Most speculative decoding marketing optimizes the first number. Production users feel the third. At 43K of context (Hermes-shape traffic, N=10 sampling), prefill is roughly 80% of wall clock, reasoning is ~12%, and decode is the remaining ~8%. A 2x decode improvement doesn’t double the wall clock. It nudges the smallest of three timescales.
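To make that arithmetic concrete, here is a quick sketch using the rough 80/12/8 split measured above; the total request time is an assumed example, not a new measurement:

```python
# Rough wall-clock model for a 43K-context Hermes-shape request.
# Shares are the measured ~80% prefill / ~12% reasoning / ~8% decode split.
wall_s = 60.0                      # assumed total request time, for illustration
prefill_s = 0.80 * wall_s          # ~48 s reading the prompt
reasoning_s = 0.12 * wall_s        # ~7.2 s of reasoning tokens
decode_s = 0.08 * wall_s           # ~4.8 s of visible output

# Double the decode rate: the decode slice halves, everything else is untouched.
wall_2x_decode = prefill_s + reasoning_s + decode_s / 2
print(f"2x decode: {wall_s:.1f}s -> {wall_2x_decode:.1f}s "
      f"({100 * (1 - wall_2x_decode / wall_s):.0f}% faster)")   # ~4% faster

# Cut prefill by the ~60x cache-reuse factor instead: the wall clock collapses.
wall_cached = prefill_s / 60 + reasoning_s + decode_s
print(f"cached prefix: {wall_s:.1f}s -> {wall_cached:.1f}s")
```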
Three months of decode tuning got production from 43 to 124 tok/s on a synthetic linked-list benchmark, then a single Hermes-shape bench at 43K context showed the gains evaporating at the real workload. The fix was changing the speculative decoding technique to one that doesn’t collapse at long context.
Plain autoregressive: the baseline
Plain autoregressive means generating one token at a time with no speculation, no drafter, no MTP heads. It’s the reference every measurement here compares against, and it’s a live competitor: on certain workloads it still wins on wall clock.
On Qwen3.6-27B Q4_K_M with -c 131072 -fa on -ctk q4_0 -ctv q4_0, plain autoregressive decodes at ~29 tok/s and stays there regardless of output length. It has the fastest prefill of the three modes (~37.5s at 43K context, vs MTP’s ~49s and DFlash’s ~44s). No drafter KV cache means no bandwidth contention and no collapse at long output. It also never speeds up.
How DFlash got here (2026-05-12)
The Dell T5820 install was the hardware story (companion post forthcoming). DFlash was the software follow-up. An initial scan of Luce-Org/lucebox-hub (advertising 3.43x decode + 10x TTFT on an RTX 3090) ran into a blocker: their daemon is a raw generate primitive with no OpenAI API, no jinja chat templates, and no tool calling. Slotting it behind Hermes/k2 would have needed a chat-template shim written from scratch.
BeeLlama.cpp by Anbeeld already had the shim baked in: DFlash speculative decoding, TurboQuant KV cache, and CopySpec fallback layered onto the OpenAI server with --jinja and tool-call detection preserved. Different binary. Same flags Hermes needed.
The clean A/B: same Qwen3.6-27B Q4_K_M target, same KV quant, same -c 131072, same -fa on. Same workload (1200-token Python linked-list class, temperature 0, seed 42). BeeLlama with --spec-type dflash vs the same BeeLlama with no --spec-* flags. DFlash was the only variable.
| Config | Decode tok/s | Prompt tok/s | VRAM MiB | vs Autoregressive |
|---|---|---|---|---|
| Autoregressive baseline | 43.15 | 202 | 18634 | 1.00x |
| DFlash, q4_0 KV | 148.46 | 189 | 19760 | 3.44x |
| DFlash + bigbatch, thinking ON | ~124 | ~210 | 20372 | 2.88x |
| DFlash + bigbatch, thinking OFF (peak) | 176.02 | 209 | 19946 | 4.08x |
The headline is the third row. Server-wide enable_thinking:false was tested, ran for ~15 minutes in production, and reverted because the model started narrating work it never did (“running on a Pi, give it a second” on a 3090 Ti) and made up status messages. Qwen3.6 is a reasoning model. Server-wide thinking-off tanks output quality across the agent stack, and reasoning back on costs ~30% of the peak.
Tool calling stayed intact through the swap. Standard OpenAI-shape tools array request came back with finish_reason: "tool_calls" and a clean tool_calls array. No shim needed.
The drafter knobs
DFlash’s default --spec-draft-n-max is 16: the drafter guesses up to 16 tokens per round, the target verifies them all at once. Anything the drafter got right is free; anything past the first wrong guess is wasted compute. A sweep across 4, 8, 12, 16 (plus a fine pass at 10, 11, 13, 14 a week later) put the optimum at 12 on this workload, a wide valley flat to ±5% across n_max 10-14.
| Config | latency (tok/s) | tool-call (tok/s) | chat-short (tok/s) | code-long (tok/s) |
|---|---|---|---|---|
| prod (n_max=16, cross=1024, adaptive ON) | 92 | 76 | 59 | 130 |
| nmax-8 noadapt | 104 | 75 | 63 | 126 |
| nmax-12 noadapt | 111 | 78 | 73 | 137 |
| nmax-16 noadapt | 104 | 77 | 65 | 126 |
| crossctx-2048 | 98 | 70 | 64 | 157 |
The chat-short bump (+24% decode, -27% wall clock) came from making each wrong guess cheaper. With n_max=16, every rejected draft on `<think>` content burned 12-15 wasted tokens. At n_max=12, the same rejections cost 8-11 tokens. Multiplied across thousands of speculation cycles per response, that’s the 24%.
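The per-cycle arithmetic, as a hedged sketch; the acceptance length on reasoning content and the cycle count are assumed illustrative values, not logged numbers:

```python
# Wasted drafter tokens per speculation cycle: everything drafted past the
# first rejection is thrown away. Numbers are illustrative, not measured.
def wasted_per_cycle(n_max: int, accepted: int) -> int:
    return max(n_max - accepted, 0)

accepted_on_think = 3          # assumed typical acceptance on <think> content
cycles_per_response = 2000     # assumed: thousands of cycles per long response

for n_max in (12, 16):
    waste = wasted_per_cycle(n_max, accepted_on_think) * cycles_per_response
    print(f"n_max={n_max}: ~{waste:,} wasted draft tokens per response")
```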
The lossless probe
The textbook claim: speculative decoding with rejection sampling (the mechanism that accepts draft tokens matching the target’s distribution and corrects ones that don’t) at temperature 0 produces output bit-for-bit identical to autoregressive. At temp=0 there’s no random sampling, so every token should land on the target model’s argmax. Pure speed optimization, zero quality impact.
Nobody on either side of the Qwen 3.6 and BeeLlama conversation had published a measurement, so Claude and Ian ran one: same target, temp=0, fixed seed 42, five deterministic prompts. We cached the autoregressive baseline, then ran DFlash against it with a character-level Levenshtein diff.
| Workload | Identical to autoregressive? | Median char drift | Notes |
|---|---|---|---|
| latency (single digit “4”) | YES (3/3) | 0 | Trivial |
| tool-call (get_weather) | YES (3/3) | 0 | Schema + args match exactly |
| chat-short (TCP handshake, ~1300 chars) | NO (0/3) | ~1100 | ~86% of reference length. Semantically similar, textually distinct. |
| code-long (Python class, ~2600 chars) | NO (0/3) | 94 | ~3-6% drift. Variable names + docstrings varied. |
Lossless held narrowly for short deterministic answers and structured outputs like tool calls. It broke for sustained prose past a few hundred tokens. Probable cause: drafter distributional drift. The DFlash drafter is not an exact match for the target’s logit distribution, so at rejection-sampling boundaries where two tokens sit at near-equal probability, even small drafter drift flips the accepted token. One flipped token branches into a different sentence.
The agentic-stack consequence: tool-call schema integrity is preserved (tool_call_schema_match_all = true across all iterations). Function names, argument keys, JSON shape all stay identical run-to-run. Free-form chat text varies at temp=0, the same way any non-deterministic backend would.
The MTP head-to-head (Phase 2) ran the same probe a week later and got the same drift profile: ~1000 chars on chat-short, lossless on tool calls. Two implementations, same theoretical guarantee failing the same way. Academic backing arrived at the right time: arxiv:2605.09992 “Attention Drift in Autoregressive Speculative Decoding Drafters” measured the same phenomenon at the model-internal level. Our Levenshtein probe and their attention-pattern analysis are pointing at the same thing from opposite ends.
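The core of the probe is just a character-level edit distance between the cached autoregressive output and the speculative output for the same prompt. A minimal sketch (plain-Python DP, no external Levenshtein package; the file paths are hypothetical stand-ins for the cached runs):

```python
# Character-level Levenshtein distance between two completions.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical cached outputs: same prompt, temp=0, seed 42, two backends.
baseline = open("out/autoregressive/chat-short.txt").read()
spec     = open("out/dflash/chat-short.txt").read()

drift = levenshtein(baseline, spec)
print(f"char drift: {drift} over {len(baseline)} reference chars "
      f"({100 * drift / len(baseline):.1f}%)")
```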
Phase 1: what lost
Before committing to MTP testing, four more configs against the n_max=12 baseline:
| Config | latency (tok/s) | tool-call (tok/s) | chat-short (tok/s) | code-long (tok/s) | Outcome |
|---|---|---|---|---|---|
| prod (nmax=12) | 105.6 | 77.4 | 69.7 | 131.3 | Baseline |
| nmax-4 | 77.6 (-26%) | 64.9 (-16%) | 58.7 (-16%) | 82.4 (-37%) | Regression on every workload |
| nmax-8 | 105.7 (tie) | 74.7 (-3%) | 63.2 (-9%) | 124.4 (-5%) | Strictly worse |
| q8-kv (nmax=12) | 97.3 (-8%) | 68.0 (-12%) | 63.9 (-8%) | 100.2 (-24%) | 25% penalty confirmed, VRAM +1.8 GB |
| copyspec | TIMED OUT at >600s | – | – | – | Catastrophic |
The Unsloth MTP configuration guide’s recommendation of n_max=2 does not transfer to DFlash: the n_max curve gets monotonically worse as the window shrinks. MTP and DFlash are different techniques with different optimal draft windows. CopySpec without a drafter is a 300x slowdown, not a floor: BeeLlama’s README describes it as model-free suffix matching, and on the 150-token latency prompt it didn’t complete in 600 seconds. The drafter is load-bearing; the drafter is the entire performance story for speculative decoding on workloads with no repetitive structure.
Phase 2: MTP buried
Multi-token prediction in vanilla llama.cpp (PR #22673, am17an) predicts multiple target-model tokens in parallel without a separate draft model. Simpler architecture, smaller VRAM footprint, comparable advertised speedup.
The first MTP test ran at short context with --spec-draft-n-max 3 (no --spec-draft-p-min flag existed yet). Same target weights as DFlash, same q4_0 KV, reasoning ON. Three findings: MTP wasn’t lossless on prose either (~1000 chars drift on chat-short, same magnitude as DFlash); MTP ran ~1.3-1.4x slower than DFlash on matched chat-short workloads (52-57 tok/s vs ~73 tok/s); MTP preserved tool-call schema integrity (safe for Hermes).
Conclusion at the time: same drift profile on prose, lower throughput on matched workloads, no operational win. DFlash stays. Phase 2 looked dead. It was on the wrong settings, at the wrong context size.
Phase 6: prefill is most of TTFT
Production was running DFlash + bigbatch + nmax=12. Hermes felt slow. The decode bench said 70 tok/s on chat-short, which should have been ~10 seconds wall-clock on a typical answer. Real Hermes traffic was hitting 12-32 second TTFT. The numbers didn’t add up.
So Ian and Claude ran one bench at the actual Hermes workload shape: ~43K context (system message + multi-turn history + tools array), reasoning ON, then sent the same body twice in a row.
| Metric | Iter 1 (cold) | Iter 2 (warm, identical body) |
|---|---|---|
| Wall TTFT | 48.48s | 7.32s |
| Server prompt_ms (prefill) | 46.90s | 0.24s |
| Tokens evaluated (prompt_n) | 43,241 | 4 |
| Tokens reused (cache_n) | 0 | 43,237 |
| Decode rate | 33.9 tok/s | 43.2 tok/s |
Cold prefill was most of TTFT. The model spent 46.9 seconds reading the prompt before generating anything. The whole decode-throughput investigation had been tuning the smallest slice of the wall clock. Cache reuse was the entire game when it worked: on the second request, 43,237 of 43,241 tokens were reused and prefill dropped to 0.24s, a ~200x speedup on the slice that actually dominates wall clock.
Phase 7 immediately re-validated at N=10-20 and corrected the headlines: the 200x cache speedup required byte-identical bodies, realistic Hermes traffic (varying user message turn-to-turn) gets closer to 60x. The “96.7% prefill share of TTFT” was a high-tail observation; at N=10 the median is 88%. Reasoning budget effect, originally measured as <5%, was actually 20.7% with proper sampling. Three Phase 6 numbers off by 1.5x to 4x from a single observation.
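A minimal version of the two-request probe, as Python over the OpenAI-compatible endpoint. The URL and system-prompt file are hypothetical stand-ins for the prod setup; llama.cpp chat-completion responses include a `timings` object whose exact fields vary by build, so the sketch just prints whatever the server reports:

```python
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"   # hypothetical prod endpoint

# Identical body both times: a byte-identical prefix is what makes the cache hit.
body = json.dumps({
    "model": "qwen3.6-27b-q4_k_m",
    "messages": [
        {"role": "system", "content": open("hermes_system_prompt.txt").read()},
        {"role": "user", "content": "Summarize the last tool result."},
    ],
    "temperature": 0,
    "max_tokens": 100,
}).encode()

for label in ("cold", "warm"):
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    t0 = time.time()
    resp = json.loads(urllib.request.urlopen(req).read())
    print(label, f"wall={time.time() - t0:.2f}s",
          "timings:", resp.get("timings", {}))
```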
Phase 7: DFlash hurts prefill
The Phase 6 framing was “speculative decoding optimizes the wrong axis.” It implicitly assumed DFlash was neutral on prefill. Phase 7 actually measured it.
| Metric | Autoregressive (N=10) | DFlash (N=10) | Δ |
|---|---|---|---|
| Prefill at 43K context | 37,498 ms | 44,153 ms | DFlash is 17.8% SLOWER |
| Decode at 43K context (out=100) | 29.3 tok/s | 38.4 tok/s | DFlash +30% |
DFlash trades prefill speed for decode speed: the drafter prefills alongside the target, adding 17.8% wall time at 43K context. On a TTFT-dominated workload (large system message, mostly tool calls + short replies, which is roughly what Hermes runs), DFlash was making user-felt latency worse, not better. Production was about to swap.
Phase 8: the DFlash Decode Collapse
Two sweeps, four output lengths each at N=3.
The DFlash Decode Collapse:
| output_len | Autoregressive wall (ms) | DFlash wall (ms) | DFlash decode tok/s |
|---|---|---|---|
| 100 | 41,888 | 48,394 | 46.9 |
| 500 | 55,785 | 59,995 | 37.0 |
| 1000 | 72,966 | 79,528 | 30.2 |
| 2000 | 107,476 | 112,567 | 30.1 |
DFlash decode at out=100 is 46.9 tok/s. At out=2000 it’s 30.1, basically autoregressive speed. The drafter’s KV cache grows alongside the target’s, the small drafter is more bandwidth-bound, and the speedup erodes as the conversation gets longer. By out=2000, DFlash is paying its 6-7 second prefill tax for no decode benefit. arxiv:2604.26412 “When Hidden States Drift: KV Caches and Long-Range Speculative Decoding” names this drafter-bandwidth bottleneck at the research level; the table above is the practitioner measurement. Three months of tuning a drafter ratio that evaporates at output > 1000 tokens.
**MTP, with the --spec-draft-p-min 0.75 filter on drafter logits:**
| output_len | Autoregressive wall (ms) | DFlash wall (ms) | new MTP wall (ms) | MTP decode tok/s |
|---|---|---|---|---|
| 100 | 41,888 | 48,394 | 52,609 | 44.5 |
| 500 | 55,785 | 59,995 | 62,951 | 39.1 |
| 1000 | 72,966 | 79,528 | 73,089 | 44.5 |
| 2000 | 107,476 | 112,567 | 91,427 | 48.8 |
MTP’s decode rate does not collapse. It holds 39 to 49 tok/s across every output length tested. MTP has no separate drafter model: the multi-token heads share the target’s hidden state and its KV cache. No second KV cache to feed, no bandwidth contention, no drafter-KV-grows-with-output bottleneck.
Crossover math: MTP has worse prefill (~49s vs autoregressive’s 37.5s) but sustained-high decode (~46 tok/s vs autoregressive’s 29). MTP overcomes its 11.5s prefill tax at output ~900 tokens: 11.5 / (1/29 - 1/46) ≈ 902. Below that, autoregressive wins. Above, MTP wins. DFlash is below both at every tested length.
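The same crossover arithmetic as a quick check, using the measured rates (the prefill tax is MTP’s ~49s minus autoregressive’s ~37.5s):

```python
# Output length at which MTP's faster decode pays back its slower prefill.
prefill_tax_s = 49.0 - 37.5          # MTP prefill minus autoregressive prefill
ar_decode = 29.0                     # tok/s, flat
mtp_decode = 46.0                    # tok/s, sustained

# Seconds saved per generated token once decoding starts:
saved_per_token = 1 / ar_decode - 1 / mtp_decode
crossover_tokens = prefill_tax_s / saved_per_token
print(f"crossover at ~{crossover_tokens:.0f} output tokens")   # ~902
```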
Phase 2 buried MTP at short context with the wrong settings. The p_min=0.75 filter plus a long-context workload exhumed it.
Can a second GPU help?
The first instinct on the bandwidth-starved-drafter problem is to throw a second GPU at it: drafter on one card, target on the other, in parallel. The math doesn’t reward it. Within a single draft/verify cycle there’s no parallelism (target verification depends on drafter output). Across cycles, async or lookahead spec decoding gives a theoretical speedup ceiling of 1 + min(time_d, time_t) / max(time_d, time_t).
At long context the drafter dominates the cycle, so parallel speedup tops out around 1.125x. A second 3090 Ti buys a 1.2x win for ~$700. MTP gives the same architectural win for $0 via shared KV and no bandwidth contention. llama.cpp doesn’t support async 2-GPU spec decoding anyway. vLLM and TensorRT-LLM do, which means buying hardware AND switching the inference stack.
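The ceiling formula, plugged in. The drafter/target split at long context is an assumed illustrative ratio consistent with the ~1.125x figure above, not a measurement:

```python
# Async/lookahead speculative decoding across two GPUs can at best overlap the
# shorter half of the draft/verify cycle with the longer half.
def speedup_ceiling(time_draft: float, time_target: float) -> float:
    return 1 + min(time_draft, time_target) / max(time_draft, time_target)

# Assumed long-context split: the bandwidth-bound drafter dominates the cycle.
print(speedup_ceiling(time_draft=8.0, time_target=1.0))   # 1.125
```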
Phase 9: production switch (2026-05-15)
The llama-server unit on ubuntu1 got rewritten. BeeLlama out, vanilla llama.cpp build-mtp in. Standard Q4_K_M GGUF out, MTP-variant Q4_K_M in. Separate drafter file dropped entirely. Hermes didn’t need touching: same alias, same OpenAI shape, zero client changes. About 5 minutes total wall-time.
Verification bench against the live prod unit, N=3, four output lengths:
| output_len | wall median (ms) | decode tok/s | reasoning_chars median |
|---|---|---|---|
| 100 | 53,138 | 44.6 | 145 |
| 500 | 62,997 | 39.1 | 186 |
| 1000 | 73,052 | 44.4 | 803 |
| 2000 | 91,304 | 48.9 | 816 |
Numbers match Phase 8 standalone within 1%. --reasoning-budget 256 introduces no measurable regression. Production VRAM 20,902 MiB. The swap deleted code.
Sidebar: Tailscale relay vs LAN throughput
The MTP-variant model swap was painful the first time. Mac Studio and ubuntu1 are on the same Tailscale network, so the obvious move is scp over the 100.x address. That tops out at 1 to 2 MB/s because Tailscale routes through a DERP relay in Seattle. The 16 GB swap would have taken hours.
Both boxes are also physically Ethernet-bridged on the local LAN: ubuntu1 at 192.168.2.2, Mac Studio at 192.168.2.1. Same scp command pointed at the LAN address: 100 MB/s, ~3 minutes for the 16 GB model. 50 to 100x speedup over the Tailscale path.
The DigitalOcean droplet pulled MTP weights directly from HuggingFace at 67 MB/s when the Tailscale-from-Mac-Studio attempt hung on auth for 30 minutes. Public-internet egress was 30x faster than the mesh peer.
Methodology lessons
1. Distribution evidence is mandatory. Single-prompt benchmarks inflated DFlash gains 30 to 60% versus an 8-prompt diversity sweep. N=1 misleads even at the right prompt size: Phase 6 → 7 had three headline numbers off by 1.5x to 4x from one observation. When a vendor publishes “3.4x,” assume the median is closer to 1.5 to 2x. And match the bench prompt size to production: Phase 5 was the right experiment on the wrong workload (Phase 7 redid it at 43K context and got a 4x larger effect). A median-first bench shape is sketched right after this list.
2. Decode tok/s and TTFT decouple on reasoning models at production context. At 43K context they’re three timescales (prefill, reasoning, decode) at roughly 80% / 12% / 8% of wall clock. Optimize what users feel.
3. Spec decoding is workload-dependent, and “workload” includes output length. DFlash is 1.6x on chat-short (out~700) and 1.04x at out=2000. Autoregressive holds ~29 tok/s flat. MTP holds 39 to 49 tok/s under prod flags. Pick the technique whose curve fits your traffic. Cache reuse is binary: ~60x prefill speedup with a stable prefix, full cost when it isn’t.
4. Spec decoding at temp=0 is NOT bit-for-bit lossless on prose. Identical on tool calls and one-token answers, completely different on free-form prose. True for both DFlash and MTP. Both implementations fail the textbook lossless guarantee on prose-heavy workloads.
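The median-first bench shape from lesson 1, sketched; the per-run decode rates here are placeholders for whatever your harness records per iteration:

```python
import statistics

# One decode-rate sample per iteration, e.g. from repeated identical requests.
runs_tok_s = [38.2, 47.1, 44.6, 39.5, 46.0, 41.3, 45.2, 40.8, 48.3, 43.9]

best = max(runs_tok_s)
median = statistics.median(runs_tok_s)
print(f"N=1 headline (lucky run): {best:.1f} tok/s")
print(f"N={len(runs_tok_s)} median: {median:.1f} tok/s, "
      f"range {min(runs_tok_s):.1f}-{max(runs_tok_s):.1f}")
```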
FAQ
What is multi-token prediction (MTP)?
MTP adds “head” layers to the target model that predict several tokens in parallel with the main next-token prediction. Drafts get verified on the next forward pass: accepted tokens are free, the first rejected one cuts off the rest. No separate drafter model, shared KV cache, same speculative-decoding mechanism as DFlash but with drafts coming from inside the target.
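At temp=0 the accept/reject rule reduces to an argmax comparison: keep drafted tokens while they match what the target would have picked anyway, stop at the first mismatch. A toy sketch of that acceptance logic (the model calls are stand-in functions, not llama.cpp internals, and the real implementation scores all draft positions in a single forward pass rather than one call per token):

```python
from typing import Callable, List

def verify_greedy(draft: List[int],
                  target_argmax: Callable[[List[int]], int],
                  context: List[int]) -> List[int]:
    """Greedy (temp=0) verification: accept draft tokens that equal the
    target's argmax at each position; stop at the first mismatch and
    substitute the target's own token there."""
    accepted: List[int] = []
    for tok in draft:
        want = target_argmax(context + accepted)   # target's next-token pick
        if tok == want:
            accepted.append(tok)                   # free token
        else:
            accepted.append(want)                  # correction; rest of draft is wasted
            break
    return accepted
```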
Does MTP change the output?
On tool calls and short structured answers, bit-for-bit identical to autoregressive at temp=0 (verified N=3 per workload). On free-form prose past a few hundred tokens, ~1000 characters of textual drift on chat-short, same magnitude as DFlash. Semantically equivalent, textually different. For regression tests that diff against gold outputs, disable speculative decoding entirely.
MTP vs DFlash on a 3090 Ti, which one’s faster?
Depends on output length. Below output 500, autoregressive > DFlash > MTP because prefill dominates. By output 1000, autoregressive ≈ MTP > DFlash. By output 2000, MTP > autoregressive > DFlash by 16 to 21 seconds wall clock. DFlash decode collapses from 47 to 30 tok/s as output grows because its drafter’s KV cache competes for bandwidth. MTP shares the target’s KV, no collapse.
InsiderLLM has a DFlash-vs-MTP head-to-head on the same hardware that benches a single short-output point. The DFlash Decode Collapse only appears past output 500-1000 tokens, which is why it doesn’t surface in short-prompt comparisons.
Does MTP work on Qwen3.6-27B dense?
Yes. Unsloth ships Qwen3.6-27B-MTP-GGUF with the heads baked in. The HackMD MoE benchmark initially said “MTP doesn’t help” on Qwen3.6-35B-A3B, then a May 8 2026 update flipped to +27.5% with corrected flags. The MoE story is still evolving. Dense 27B with --spec-type mtp --spec-draft-p-min 0.75 --spec-draft-n-max 6 runs at 39 to 49 tok/s across the tested output lengths.
Why does MTP need a custom llama.cpp build?
PR #22673 (am17an, opened May 4 2026, merged to master May 16 2026) added --spec-type mtp. Builds from master after that date have it natively. For earlier checkouts, fetch the PR branch and build with -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_NATIVE=ON. Takes ~4 to 13 minutes depending on cache state.
What about Apple Silicon?
This post is RTX 3090 Ti specific. MTP on Metal has different gotchas (issue #23011 flags “MTP slower than baseline on Apple Metal despite high acceptance” on the 35B-A3B variant). The MLX backend is a separate story. For the Apple Silicon side of local inference (LM Studio tuning, KV-cache quantization, the sysctl GPU-memory-cap fix), see LM Studio Errors on Apple Silicon.
Money quotes
- “Same box, same model, same flags. Different binary. Nearly three times the throughput.”
- “Speculative decoding is supposed to be lossless at temp=0. We measured it. It isn’t, on prose. Tool calls survive.”
- “CopySpec without the drafter is a 300x slowdown. The drafter isn’t overhead. The drafter is the entire performance story.”
- “Three months of tuning a drafter ratio that evaporates at output > 1000 tokens.”
- “MTP doesn’t have a separate drafter. The heads are part of the same forward pass. There’s no second KV cache to feed. So when the context gets long, MTP doesn’t slow down. DFlash does.”
- “Production went from BeeLlama + DFlash + custom drafter back to vanilla llama.cpp + the MTP-variant GGUF. The swap deleted code.”
- “Hermes never noticed. We changed the binary, the model, the spec-decoding technique, and the drafter situation, and Hermes kept hitting the same alias.”
*Companion post forthcoming: Building a 3090 Ti Homelab Inference Node on a Dell Precision T5820. All commands and configs reproducible. Bench harnesses at ~/revalidation-results/ on ubuntu1.*