from old school to new frontier: Local LLMs for Delphi: A Production Benchmark

Part 1 described the setup: eight local models, five benchmark phases (AT1 through AT5), 30 real production Delphi units. The test domain was legacy code migration — but the phases themselves cover tasks that appear in any Delphi LLM pipeline: code analysis, comprehension, patch generation, routing, and tool calling. This post covers what the numbers showed.

The Scoreboard

Local models (Ollama, 32 GB VRAM):

Model	AT1	AT2	AT3	AT4	AT5	Overall
gemma4:26b	0.96	0.62	0.88	0.84	0.99	0.82
qwen3.5:27b	0.86	0.69	0.77	0.85	1.00	0.79
qwen3.6:27b	0.82	0.70	0.70	0.81	1.00	0.76
qwen3.6:35b-a3b	0.79	0.64	0.70	0.84	1.00	0.74
devstral-small-2:24b	0.70	0.58	0.71	0.80	1.00	0.70
qwen2.5-coder:14b	0.68	0.50	0.69	0.79	n/a	0.66

Cloud baseline (Anthropic API):

Model	AT1	AT2	AT3	AT4	AT5	Overall
Claude Sonnet 4.6 ☁	0.983	0.983	0.955	0.994	1.00	0.983

Finding 1: Comprehension and Patch Quality Are Not Correlated

gemma4:26b has the lowest AT2 score (0.62) and the highest AT3 score (0.88). A 26-point gap across two phases that test adjacent capabilities is not noise. qwen3.6:27b is the opposite: highest AT2 (0.70), mid-table AT3 (0.70). The two skills are measuring something genuinely different — not just two views of the same underlying "code understanding."

Finding 2: AT3 Failure Modes

Format refusal (gemma4:26b)

70% of gemma4's AT3 responses returned as natural-language prose — no JSON block. The 30% that did comply scored 0.88 on content — the best in the benchmark. This is an instruction-following problem, not a capability problem. Fixable with better prompting.

Systematic token corruption (qwen2.5-coder:14b)

Every generated code line was prefixed with L<n>: — e.g. L29: fLogEnabled : Boolean;. Not valid Pascal. Consistent across all 30 units. Content often correct, output systematically malformed.

Structural misplacement (qwen3-coder:30b)

Asked to add a method, the model inserted code into the Initialization block — runs once at startup instead of being callable. Compiles, passes syntax check, fails at runtime. Would not be caught without unit tests.

Finding 3: Speed — The MoE Advantage

Model	tok/s	Architecture
gemma4:26b	~170	Dense
qwen3.6:35b-a3b	~131	MoE
qwen2.5-coder:14b	~120	Dense, 14b
qwen3.5:27b / qwen3.6:27b	~28–29	Dense, 27b

For a batch run over 30 units, 170 tok/s vs 28 tok/s is measured in hours of wall clock time.

Finding 4: Tool Calling — A Binary Split

Models that returned proper tool_calls objects: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b (1.00 each), gemma4:26b (0.98).
Hard fail at the API level: qwen2.5-coder:14b, qwen3-coder:30b (text output only), deepseek-r1:8b (Ollama error: "does not support tools").

This is not a prompting problem — it is an API-level capability. A model either returns tool_calls objects in the response or it does not. For any pipeline that relies on IDE or MCP integration, this is a hard filter applied before any other consideration.

Finding 5: German vs English Prompts on AT1

Model	AT1 EN	AT1 DE	Delta
gemma4:26b	0.96	0.83	-0.13
qwen3.5:27b	0.86	0.78	-0.05
qwen2.5-coder:14b	0.68	0.39	-0.31

Every model scores lower in German. The losses range from 5 to 31 points. Recommendation: Use English for any prompt that includes source code or asks the model to name specific identifiers. Use German for everything that faces the developer.

Finding 6: Cloud vs Local — Where the Gap Actually Is

Phase	Best local	Sonnet 4.6	Delta
AT1 Code Detective	0.96 (gemma4)	0.983	+0.02 ≈ 0
AT2 Comprehension	0.70	0.983	+0.28 ← main gap
AT3 Patches	0.88	0.955	+0.08
AT4 Routing	1.00	0.994	≈ 0
AT5 Tool Calling	1.00	1.00	0

Local models are essentially on par with Sonnet in code extraction, routing, and tool calling. The comprehension gap (28 points on AT2) is where cloud-scale models are meaningfully better. For GDPR-constrained teams, local-only is viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.

Finding 7: AT4 Routing — Not All Models Are Reliable Routers

Routing accuracy is the capability that matters most in a multi-model pipeline: a misclassified task gets sent to the wrong model, which either wastes compute on an oversized model or produces a failure on an undersized one. The AT4 results show a meaningful spread across the field.

The top scorers — qwen3.5:27b (0.85), gemma4:26b (0.84), and qwen3.6:35b-a3b (0.84) — differ significantly from the problem case: qwen3-coder:30b at 0.71. In a pipeline of 90 routing decisions, a 0.71 accuracy means roughly 26 tasks misclassified. That is not recoverable through prompt engineering — classification drift at that scale corrupts the batching architecture that the pipeline depends on.

Speed is the second dimension here. Routing sits on the critical path of every task — every request passes through the router before any work begins. A model that routes accurately but slowly adds latency multiplied across the entire batch. qwen3.6:35b-a3b combines solid routing accuracy with 131 tok/s, making it the only model that is both reliable and fast enough to use as a production router.

VT Pre-Test Findings

Before the five AT phases, every model ran through seven validation tests (VT1–VT7) covering JSON compliance, instruction following, output consistency, basic Delphi understanding, and native tool-call support. All eight models passed every pre-test — this confirmed that the AT results reflect genuine capability differences, not basic reliability failures.

Two VT findings produced results specific enough to be worth documenting alongside the main benchmark.

VT6: Migration Risk Detection

VT6 tested whether models could identify concrete migration traps across three categories of Delphi code: ShortString serialization risks, binary-packed record layouts, and call-chain dependencies. The chart shows significant variation between models, even where their AT1–AT5 scores are similar — VT6 captures a targeted detection capability that the main phases only partially measure.

VT8: Token Budget and Thinking Mode

VT8 exposed a practical deployment trap: when a 300-token output budget is set, and thinking mode is active, all three tested Qwen models consume the entire budget on internal reasoning and return an empty response. The correct answer exists — the model knows it — but there is no token budget left to output it. Setting num_predict ≥ 800 for any thinking-enabled model, the issue is completely.

What the Numbers Add Up To

Seven findings point to a consistent picture. Comprehension and patch generation are genuinely different skills — the best model at one is not the best at the other. Speed varies by a factor of six between the fastest and slowest capable models. Tool calling eliminates three models outright at the API level. Language choice in prompts can cost 31 percentage points on code extraction tasks. Routing accuracy is not guaranteed — one model in the tested pool is too unreliable to use as a router. And the cloud comprehension gap, while real at 28 points, is the only phase where local models fall meaningfully short.

None of these findings argues for a single best model. They argue for a pipeline that routes different task types to different models — which is exactly what Part 3 covers.

Part 3 covers the practical takeaways: routing architecture, model selection by task, and the hybrid pipeline decision table.

from old school to new frontier

Buy Delphi

Thursday, May 28, 2026

Local LLMs for Delphi: A Production Benchmark — Part 2: What the Numbers Reveal