Tuning llama-server for agent workloads: a week of receipts
A few of the hobby projects on this site lean on a local LLM server. The chook-manager vision pipeline pings one for every captured frame, and the brainstorming experiment I wrote up last week fans 30+ calls into one. Both got fast enough to feel native this week, but only after I swept a pile of llama-server configs and hit a few surprises.
Everything below comes from one card (RTX 4090, 24 GiB) and one server: llama-server from llama.cpp build b8933. The full sweep (exact flags, every trial, every dead end) is in results.md under the local-llm-tuning skill.
The headline numbers
Three things under those bars are worth pulling out.
The quant matters more than the model
The 0.5 tk/s reading was the first surprise of the week. Qwen3.6-35B-A3B is a hybrid architecture: 30 of its 40 layers are a Mamba-like state space model, and only 10 are attention. The lmstudio-community Q4_K_M quant loaded fine, fit in VRAM with room to spare, claimed all 41 layers on GPU, and then ran at half a token per second. Nothing looked broken. It was just slow.
The fix wasn’t a config flag. It was a different GGUF. Unsloth ship a “dynamic” quant of the same model (UD-Q4_K_XL), and on the same hardware, same flags, same binary, it runs at 152 tk/s with q4_0 KV cache. That’s roughly 300x faster than a quant of nominally the same thing. My best guess at what’s going on: llama.cpp’s SSM scan kernels have a fast path that’s picky about tensor layout, and the lmstudio quant trips a fallback. Either way, the rule I walked away with:
For any hybrid-SSM architecture (qwen35moe and its descendants), use Unsloth’s UD-* dynamic quants. Never assume two Q4_K_M GGUFs of the same model are interchangeable.
None of this shows up on a model card. I only found it because 0.5 tk/s was absurd enough that “the quant is broken” felt worth testing.
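That kind of check is easy to automate once you have a decode rate for each GGUF measured on the same hardware and flags. The sketch below is a minimal version: `decode_ratio` and `looks_pathological` are hypothetical helper names, and the 10x threshold is an arbitrary assumption rather than anything from the sweep.

```python
def decode_ratio(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster one decode rate is than another."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


def looks_pathological(candidate_tps: float, reference_tps: float,
                       threshold: float = 10.0) -> bool:
    """Flag a quant whose decode rate trails a reference by more than `threshold`x.

    The 10x default is an assumed cut-off: ordinary quant-to-quant
    differences should sit far below it, a kernel fallback far above.
    """
    return decode_ratio(reference_tps, candidate_tps) > threshold


# Decode rates quoted in this post (tokens per second).
lmstudio_q4_k_m = 0.5
unsloth_ud_q4_k_xl = 152.0

print(f"{decode_ratio(unsloth_ud_q4_k_xl, lmstudio_q4_k_m):.0f}x apart")
print("suspect quant:", looks_pathological(lmstudio_q4_k_m, unsloth_ud_q4_k_xl))
```

Running a comparison like this on every new GGUF before trusting it would have caught the slow quant in one step.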
Match the config to the shape of your context
With the quant sorted, two knobs looked promising: bigger batches (-b 4096 -ub 1024) and the KV-cache quant (q4_0 vs q8_0). I swept both against prompt sizes from empty to 55k tokens, same 230-token generation each time.
At 55k tokens, the wall time for the same turn drops from 352 seconds on the q4_0 config to 73 seconds on the q8_0 one. A/B-ing the two levers separately, they turn out to do almost unrelated things:
- -b 4096 -ub 1024 is a decode speedup: about 2x at any context size, independent of KV quant.
- q8_0 KV is a prefill speedup at large context: about 4x at 55k tokens.
So pick per workload. A single-turn chat with a fresh prompt is fine on q4_0. An agent that’s been piling up tool output for half an hour and sits 40k tokens deep wants q8_0 and the big batches. This is exactly why my April chook-manager vision runs felt like batch jobs: every turn re-primed a growing context and paid the prefill tax, minutes per turn. With the right flags the same turns finish in seconds, and the opencode-agent loop feels closer to a Claude Code session than a cron job.
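One way to see why the workload shape matters is to model a turn as prefill time plus decode time. The sketch below is only that model. The throughput figures in it are made-up placeholders for illustration, not measurements from the sweep, and `turn_seconds` and `prefill_share` are hypothetical helpers.

```python
def turn_seconds(prompt_tokens: int, gen_tokens: int,
                 prefill_tps: float, decode_tps: float) -> float:
    """Wall time for one turn: prompt processing plus token generation."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps


def prefill_share(prompt_tokens: int, gen_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Fraction of the turn's wall time spent on prefill."""
    total = turn_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps)
    if total == 0:
        return 0.0
    return (prompt_tokens / prefill_tps) / total


# Placeholder rates (assumptions, NOT numbers from results.md).
PREFILL_TPS = 1000.0
DECODE_TPS = 100.0

# Same 230-token generation, two very different context depths.
for prompt in (100, 55_000):
    share = prefill_share(prompt, 230, PREFILL_TPS, DECODE_TPS)
    print(f"{prompt:>6} prompt tokens: {share:.0%} of the turn is prefill")
```

With a short prompt the decode rate decides how the turn feels. Deep into an agent session, prefill dominates, so that is where the prefill-side lever pays off.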
The max-context flag is a tax you keep paying
This one surprised me most. -c (the maximum context the server will allocate) isn’t a “set it to the model’s max and forget it” flag. Decode slows down as you raise it, even when the prompt is tiny and the KV cache is empty.
I figured the cost came from the KV cache filling up. It doesn’t. The KV buffer barely grows across this sweep and nothing spills. What grows is llama.cpp’s internal compute graph, which is sized for the maximum context the server might ever see, and a bigger graph is slower to schedule even when most of it sits unused. Two rules fall out:
- Size -c to the workload, not the model card. A coding agent that rarely passes 30k runs at 152 tk/s on -c 32768; setting 131k “just in case” costs a 2.5x slowdown across the whole session.
- Round up to a power of 2. -c 50000 is slower than -c 65536; non-power-of-2 contexts hit a kernel fallback.
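Both rules fit in one small helper. This is a minimal sketch, assuming you can estimate the deepest context a session will reach; `round_context` and its `headroom` parameter are hypothetical names, not anything llama.cpp provides.

```python
import math


def round_context(expected_peak_tokens: int, headroom: float = 1.0) -> int:
    """Return the smallest power of two covering the expected peak context.

    `headroom` is an assumed safety multiplier (1.0 means none); keep it
    small, since every doubling of -c costs decode speed.
    """
    if expected_peak_tokens <= 0:
        raise ValueError("expected_peak_tokens must be positive")
    needed = math.ceil(expected_peak_tokens * headroom)
    # (needed - 1).bit_length() is the exponent of the next power of two.
    return 1 << (needed - 1).bit_length()


for peak in (30_000, 50_000):
    print(f"peak {peak} tokens -> -c {round_context(peak)}")
```

A value that is already a power of two is returned unchanged, so the helper never pushes you up a size you did not need.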
gpt-oss-20b, almost as an aside
The fastest config I found wasn’t any flavour of Qwen. OpenAI’s gpt-oss-20b in its native MXFP4 quant (a 32-expert MoE, about 12 GiB of weights, no SSM in the way) does 215 tk/s at its full 128k context, in about 14 GiB of VRAM. For this model, on this card, 128k context is basically free, with room left over for a vision tower next to it.
For brainstorming fan-out, 20B is plenty. For coding-agent work the Qwen MoE still wins on quality, and once both are tuned the speed gap is small. But a 215 tk/s config for a genuinely useful 20B model on a consumer card is the kind of shift that’s easy to miss, because nobody benchmarks the boring number on the boring quant.
The agent-shaped lesson
The thread tying all of this together: the default config is built for a different workload. llama.cpp’s defaults are tuned for the chatbot shape: one user, short prompt, long generation. Agent workloads flip every one of those:
- Many short turns, not one long generation
- Context grows monotonically across the session
- Prefill cost matters as much as decode cost
- The 30k-token-deep turn matters more than the empty-prompt turn
The flags that win for agents (bigger batches, q8_0 KV, context sized to the workload) quietly cost you a little on the chatbot case, which is why the “best llama.cpp config” guides don’t land on them. If your local server is feeding an agent loop, the defaults are leaving most of the speedup on the table.
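Pulled together, the agent-shaped flags can be assembled like this. It is a sketch, not the exact command line from results.md: `agent_server_args` is a hypothetical helper, the model path is a placeholder, and applying q8_0 to both the K and V caches is my assumption about what “q8_0 KV” means here. Depending on the build, a quantized V cache may also need flash attention enabled.

```python
import shlex


def agent_server_args(model_path: str, context_tokens: int,
                      kv_type: str = "q8_0") -> list[str]:
    """Build a llama-server argv with the agent-oriented flags from this post."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(context_tokens),   # sized to the workload, power of two
        "-b", "4096",                # logical batch size
        "-ub", "1024",               # physical (micro) batch size
        "--cache-type-k", kv_type,   # assumption: same quant type for K...
        "--cache-type-v", kv_type,   # ...and V caches
    ]


# Placeholder path; substitute the GGUF you actually run.
args = agent_server_args("/path/to/model.gguf", 32768)
print(shlex.join(args))
```

Keeping the flags in one function makes it easy to A/B a single lever at a time, for example by passing `kv_type="q4_0"` for the single-turn chat case.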
Next on my list is speculative decoding with a small Qwen draft model, the obvious win I haven’t tried yet. I’ll write it up if the numbers hold.