Part 1 described the setup: eight local models, five benchmark phases (AT1 through AT5), 30 real production Delphi units. The test domain was legacy code migration — but the phases themselves cover tasks that appear in any Delphi LLM pipeline: code analysis, comprehension, patch generation, routing, and tool calling. This post covers what the numbers showed.
The Scoreboard
Local models (Ollama, 32 GB VRAM):
| Model | AT1 | AT2 | AT3 | AT4 | AT5 | Overall |
|---|---|---|---|---|---|---|
| gemma4:26b | 0.96 | 0.62 | 0.88 | 0.84 | 0.99 | 0.82 |
| qwen3.5:27b | 0.86 | 0.69 | 0.77 | 0.85 | 1.00 | 0.79 |
| qwen3.6:27b | 0.82 | 0.70 | 0.70 | 0.81 | 1.00 | 0.76 |
| qwen3.6:35b-a3b | 0.79 | 0.64 | 0.70 | 0.84 | 1.00 | 0.74 |
| devstral-small-2:24b | 0.70 | 0.58 | 0.71 | 0.80 | 1.00 | 0.70 |
| qwen2.5-coder:14b | 0.68 | 0.50 | 0.69 | 0.79 | n/a | 0.66 |
Cloud baseline (Anthropic API):
| Model | AT1 | AT2 | AT3 | AT4 | AT5 | Overall |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 ☁ | 0.983 | 0.983 | 0.955 | 0.994 | 1.00 | 0.983 |
Finding 1: Comprehension and Patch Quality Are Not Correlated
gemma4:26b has the lowest AT2 score (0.62) and the highest AT3 score (0.88). A 26-point gap across two phases that test adjacent capabilities is not noise. qwen3.6:27b is the opposite: highest AT2 (0.70), mid-table AT3 (0.70). The two skills are measuring something genuinely different — not just two views of the same underlying "code understanding."
Finding 2: AT3 Failure Modes
Format refusal (gemma4:26b)
70% of gemma4's AT3 responses returned as natural-language prose — no JSON block. The 30% that did comply scored 0.88 on content — the best in the benchmark. This is an instruction-following problem, not a capability problem. Fixable with better prompting.
Systematic token corruption (qwen2.5-coder:14b)
Every generated code line was prefixed with L<n>: — e.g. L29: fLogEnabled : Boolean;. Not valid Pascal. Consistent across all 30 units. Content often correct, output systematically malformed.
Structural misplacement (qwen3-coder:30b)
Asked to add a method, the model inserted code into the Initialization block — runs once at startup instead of being callable. Compiles, passes syntax check, fails at runtime. Would not be caught without unit tests.
Finding 3: Speed — The MoE Advantage
| Model | tok/s | Architecture |
|---|---|---|
| gemma4:26b | ~170 | Dense |
| qwen3.6:35b-a3b | ~131 | MoE |
| qwen2.5-coder:14b | ~120 | Dense, 14b |
| qwen3.5:27b / qwen3.6:27b | ~28–29 | Dense, 27b |
For a batch run over 30 units, 170 tok/s vs 28 tok/s is measured in hours of wall clock time.
Finding 4: Tool Calling — A Binary Split
Models that returned proper tool_calls objects: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b (1.00 each), gemma4:26b (0.98).
Hard fail at the API level: qwen2.5-coder:14b, qwen3-coder:30b (text output only), deepseek-r1:8b (Ollama error: "does not support tools").
This is not a prompting problem — it is an API-level capability. A model either returns tool_calls objects in the response or it does not. For any pipeline that relies on IDE or MCP integration, this is a hard filter applied before any other consideration.
Finding 5: German vs English Prompts on AT1
| Model | AT1 EN | AT1 DE | Delta |
|---|---|---|---|
| gemma4:26b | 0.96 | 0.83 | -0.13 |
| qwen3.5:27b | 0.86 | 0.78 | -0.05 |
| qwen2.5-coder:14b | 0.68 | 0.39 | -0.31 |
Every model scores lower in German. The losses range from 5 to 31 points. Recommendation: Use English for any prompt that includes source code or asks the model to name specific identifiers. Use German for everything that faces the developer.
Finding 6: Cloud vs Local — Where the Gap Actually Is
| Phase | Best local | Sonnet 4.6 | Delta |
|---|---|---|---|
| AT1 Code Detective | 0.96 (gemma4) | 0.983 | +0.02 ≈ 0 |
| AT2 Comprehension | 0.70 | 0.983 | +0.28 ← main gap |
| AT3 Patches | 0.88 | 0.955 | +0.08 |
| AT4 Routing | 1.00 | 0.994 | ≈ 0 |
| AT5 Tool Calling | 1.00 | 1.00 | 0 |
Local models are essentially on par with Sonnet in code extraction, routing, and tool calling. The comprehension gap (28 points on AT2) is where cloud-scale models are meaningfully better. For GDPR-constrained teams, local-only is viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.
Finding 7: AT4 Routing — Not All Models Are Reliable Routers
Routing accuracy is the capability that matters most in a multi-model pipeline: a misclassified task gets sent to the wrong model, which either wastes compute on an oversized model or produces a failure on an undersized one. The AT4 results show a meaningful spread across the field.
The top scorers — qwen3.5:27b (0.85), gemma4:26b (0.84), and qwen3.6:35b-a3b (0.84) — differ significantly from the problem case: qwen3-coder:30b at 0.71. In a pipeline of 90 routing decisions, a 0.71 accuracy means roughly 26 tasks misclassified. That is not recoverable through prompt engineering — classification drift at that scale corrupts the batching architecture that the pipeline depends on.
Speed is the second dimension here. Routing sits on the critical path of every task — every request passes through the router before any work begins. A model that routes accurately but slowly adds latency multiplied across the entire batch. qwen3.6:35b-a3b combines solid routing accuracy with 131 tok/s, making it the only model that is both reliable and fast enough to use as a production router.
VT Pre-Test Findings
Before the five AT phases, every model ran through seven validation tests (VT1–VT7) covering JSON compliance, instruction following, output consistency, basic Delphi understanding, and native tool-call support. All eight models passed every pre-test — this confirmed that the AT results reflect genuine capability differences, not basic reliability failures.
Two VT findings produced results specific enough to be worth documenting alongside the main benchmark.
VT6: Migration Risk Detection
VT6 tested whether models could identify concrete migration traps across three categories of Delphi code: ShortString serialization risks, binary-packed record layouts, and call-chain dependencies. The chart shows significant variation between models, even where their AT1–AT5 scores are similar — VT6 captures a targeted detection capability that the main phases only partially measure.
VT8: Token Budget and Thinking Mode
VT8 exposed a practical deployment trap: when a 300-token output budget is set, and thinking mode is active, all three tested Qwen models consume the entire budget on internal reasoning and return an empty response. The correct answer exists — the model knows it — but there is no token budget left to output it. Setting num_predict ≥ 800 for any thinking-enabled model, the issue is completely.
What the Numbers Add Up To
Seven findings point to a consistent picture. Comprehension and patch generation are genuinely different skills — the best model at one is not the best at the other. Speed varies by a factor of six between the fastest and slowest capable models. Tool calling eliminates three models outright at the API level. Language choice in prompts can cost 31 percentage points on code extraction tasks. Routing accuracy is not guaranteed — one model in the tested pool is too unreliable to use as a router. And the cloud comprehension gap, while real at 28 points, is the only phase where local models fall meaningfully short.
None of these findings argues for a single best model. They argue for a pipeline that routes different task types to different models — which is exactly what Part 3 covers.
Part 3 covers the practical takeaways: routing architecture, model selection by task, and the hybrid pipeline decision table.

.png)
.png)
.png)
.png)
.png)
No comments:
Post a Comment