This is the final post in a three-part series. Part 1 covered benchmark design and methodology. Part 2 covered what the numbers revealed. This post covers what you should actually do with those results.
After running 8 local models through 5 benchmark phases on 30 real Delphi production units, the most useful thing I can offer is not another table of scores — it is a set of concrete decisions. If you are planning to integrate local LLMs into any Delphi pipeline — migration, code review, documentation, or IDE integration — this post tells you which model to use for each job, which two to skip entirely, and where the remaining rough edges are.
Pick the Right Model for the Right Job
The clearest finding from this benchmark is that no single model dominates across all five task types. The right approach is task routing — matching each class of work to the model best suited for it.
Code Analysis / Risk Discovery (AT1)
Use gemma4:26b (score 0.96) for targeted fact extraction. Use qwen3.5:27b (0.86) or qwen3.6:27b (0.82) when you need coherent functional explanations.
Code Comprehension and Q&A (AT2)
Use qwen3.6:27b (score 0.70) or qwen3.5:27b (0.69). Both score significantly above the field.
Patch Generation / Code Writing (AT3)
Use gemma4:26b (score 0.88, 170 tok/s) as your primary patcher, and add format validation to catch the 70% of its responses that are non-compliant. Use qwen3.5:27b (0.77) as a fallback when compliance matters more than speed.
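The validate-then-fall-back pattern for patching can be sketched as follows. This is a minimal illustration, not the benchmark's own code: `call_model` and `is_valid` are hypothetical callables you would supply yourself, since the expected patch format depends on your pipeline, and the retry count is an arbitrary assumption.

```python
from typing import Callable

PRIMARY_PATCHER = "gemma4:26b"    # fast, but many responses miss the format
FALLBACK_PATCHER = "qwen3.5:27b"  # slower, used when compliance matters


def generate_patch(
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> raw text
    is_valid: Callable[[str], bool],        # hypothetical: your own format check
    max_primary_attempts: int = 2,          # assumption: retry budget for the primary
) -> tuple[str, str]:
    """Return (model_used, response), or raise if nothing passes validation."""
    for _ in range(max_primary_attempts):
        response = call_model(PRIMARY_PATCHER, prompt)
        if is_valid(response):
            return PRIMARY_PATCHER, response

    # The primary never produced a compliant patch: try the fallback once.
    response = call_model(FALLBACK_PATCHER, prompt)
    if is_valid(response):
        return FALLBACK_PATCHER, response
    raise ValueError("no model produced a response in the expected patch format")
```

In a batched pipeline it is probably cheaper to collect the failed prompts and run them together in the fallback model's batch, rather than swapping models per task as this simplified version does.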
Routing / Complexity Classification (AT4)
Use qwen3.6:35b-a3b (score 1.00, 131 tok/s). Perfect routing accuracy combined with MoE speed. Avoid qwen3-coder:30b (0.71) — inconsistent classification defeats the router.
Tool Calling / IDE Integration (AT5)
Use only devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b, or gemma4:26b. Hard-disqualified: qwen2.5-coder:14b, qwen3-coder:30b, and deepseek-r1:8b. These fail at the API level, which prompting cannot fix.
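The recommendations above fit in a small lookup table. The sketch below is one way to encode them; the task keys (`analysis`, `comprehension`, and so on) are labels invented for this example, while the model names and the tool-capable list come from the results above.

```python
# Preferred models per task type, best first (from the AT1-AT4 recommendations).
TASK_MODELS = {
    "analysis": ["gemma4:26b", "qwen3.5:27b", "qwen3.6:27b"],  # AT1
    "comprehension": ["qwen3.6:27b", "qwen3.5:27b"],           # AT2
    "patching": ["gemma4:26b", "qwen3.5:27b"],                 # AT3
    "routing": ["qwen3.6:35b-a3b"],                            # AT4
}

# Models that passed the tool-calling phase (AT5).
TOOL_CAPABLE = {
    "devstral", "qwen3.5:27b", "qwen3.6:27b", "qwen3.6:35b-a3b", "gemma4:26b",
}


def pick_model(task: str, needs_tools: bool = False) -> str:
    """Return the first preferred model for a task, honoring the tool requirement."""
    for model in TASK_MODELS[task]:
        if not needs_tools or model in TOOL_CAPABLE:
            return model
    raise LookupError(f"no tool-capable model configured for task {task!r}")


print(pick_model("patching"))                   # gemma4:26b
print(pick_model("routing", needs_tools=True))  # qwen3.6:35b-a3b
```

Every model currently in the table is tool-capable, so the `needs_tools` check acts only as a guard against someone adding a disqualified model later.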
The Routing Pipeline Architecture
The single most impactful design decision is the batching strategy. GPU model loading takes 30–60 seconds per swap. Batch by tier:
                     Incoming task
                     |
                     Router model (qwen3.6:35b-a3b, fast)
                     |
+--------------------+--------------------+
|                    |                    |
local                mid-tier             complex
gemma4:26b           qwen3.5:27b          qwen3.6:27b
(fast, high vol)     (reliable, balanced) (best understanding)
For a batch of 50 units across three complexity tiers, batch-by-tier scheduling can eliminate 45+ minutes of pure idle time.
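The saving from batch-by-tier scheduling can be estimated with a few lines of arithmetic. The sketch below assumes a worst-case arrival order in which the tier changes on every task, and it counts one model load per run of consecutive same-tier tasks. Both are simplifying assumptions, not measurements from the benchmark.

```python
from itertools import groupby


def count_model_loads(tiers: list[str]) -> int:
    """One model load per run of consecutive tasks that share a tier."""
    return sum(1 for _ in groupby(tiers))


# Hypothetical worst case: 50 units whose tiers alternate on arrival.
arrival_order = [("local", "mid-tier", "complex")[i % 3] for i in range(50)]
batched_order = sorted(arrival_order)  # groups all tasks of one tier together

loads_unbatched = count_model_loads(arrival_order)
loads_batched = count_model_loads(batched_order)
avoided = loads_unbatched - loads_batched

for seconds_per_load in (30, 60):  # swap-time range quoted above
    minutes = avoided * seconds_per_load / 60
    print(f"{seconds_per_load}s per load: {minutes:.1f} min saved")
```

Under these assumptions, batching avoids 47 of 50 loads, which comes to about 23.5 minutes at 30 seconds per swap and 47 minutes at 60 seconds. The 45+ minute figure therefore corresponds to the slow end of the swap range.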
The Context Window Problem: Why a Proxy Layer Is Not Optional
Model selection and batching strategy are the two decisions this benchmark directly informs. But there is a third decision that matters at least as much for local deployments, and it has nothing to do with which model scores highest on any phase.
If you are running a real development pipeline, you are not running one MCP server. You are running several. A realistic Delphi development setup includes a source analysis server (DelphiMCP: ~40 tools), a document server (WordMCP: ~40 tools), a code indexer (PasIndexer: ~15 tools), a file editor (StrEditor: ~15 tools), and a handful of supporting services. That is well over 150 tool declarations present in context before a single line of code is analyzed.
Each tool declaration — name, parameter schema, description — costs between 150 and 300 tokens. At 150 tools, you are looking at 22,000 to 45,000 tokens of overhead at the start of every session. For a local model running with a 32k context window, that is between 70% and 140% of the available context consumed before the first user message arrives.
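The overhead figures follow directly from the per-tool cost. The short calculation below reproduces them, assuming that "32k" means 32,768 tokens. The tool count and the per-tool range are the ones given above.

```python
TOOLS = 150
CONTEXT_WINDOW = 32_768  # assumption: "32k" taken as 32,768 tokens


def declaration_overhead(tools: int, tokens_per_tool: int, window: int) -> tuple[int, float]:
    """Return (total declaration tokens, share of the context window)."""
    total = tools * tokens_per_tool
    return total, total / window


low_tokens, low_share = declaration_overhead(TOOLS, 150, CONTEXT_WINDOW)
high_tokens, high_share = declaration_overhead(TOOLS, 300, CONTEXT_WINDOW)

print(f"low:  {low_tokens} tokens, {low_share:.0%} of context")
print(f"high: {high_tokens} tokens, {high_share:.0%} of context")
```

This gives 22,500 tokens (about 69% of the window) at the low end and 45,000 tokens (about 137%) at the high end, which matches the rounded 70% to 140% range.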
VT5 makes this worse, not better. The benchmark showed that local models score only 1–34% on Phase 2 of the tool-name test — they cannot reliably interpret a tool's purpose from its name alone. They depend on the description field in the schema. Compressing tool declarations to save context is not an option; the descriptions are load-bearing for local models.
ProxyMCP addresses this directly. As a single MCP endpoint, it does not expose all tools from all servers to the model. It exposes only the tools relevant to the current task — typically 3 to 8 — and routes the call to the appropriate backend server. From the model's perspective, the tool surface is minimal and always task-scoped. From the pipeline's perspective, every server is still fully available.
The practical effect: a local model operating through ProxyMCP sees a context overhead of roughly 500–1,500 tokens for tool declarations rather than 22,000–45,000. That difference is the difference between a model that has room to reason and a model that is fighting for context from the first token. For cloud models, the same architecture translates directly into cost savings — fewer tokens declared means fewer tokens billed, on every single request.
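To make the idea of a task-scoped tool surface concrete, here is a minimal sketch of the filtering and routing steps. It is not ProxyMCP's actual implementation. The `Tool` type, the tag-based matching, and the tool names in `CATALOG` are all hypothetical, chosen only to illustrate the pattern. Descriptions are passed through unchanged, since local models depend on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    server: str          # backend MCP server that owns the tool
    description: str     # kept intact: local models rely on it
    tags: frozenset      # hypothetical task tags used for filtering


# Hypothetical catalog entries; real tool names will differ.
CATALOG = [
    Tool("find_references", "PasIndexer", "Find all references to a symbol.",
         frozenset({"analysis", "navigation"})),
    Tool("replace_text", "StrEditor", "Replace text in a source file.",
         frozenset({"patching"})),
    Tool("read_document", "WordMCP", "Read a Word document.",
         frozenset({"documentation"})),
    Tool("list_units", "DelphiMCP", "List the units in a project.",
         frozenset({"analysis"})),
]


def tools_for_task(catalog, task_tags, limit=8):
    """Expose only tools whose tags overlap the task, best match first."""
    matches = [t for t in catalog if t.tags & task_tags]
    matches.sort(key=lambda t: (-len(t.tags & task_tags), t.name))
    return matches[:limit]


def backend_for(catalog, tool_name):
    """Look up which backend server a tool call must be forwarded to."""
    for tool in catalog:
        if tool.name == tool_name:
            return tool.server
    raise KeyError(tool_name)


exposed = tools_for_task(CATALOG, {"analysis"})
print([t.name for t in exposed])                # ['find_references', 'list_units']
print(backend_for(CATALOG, "find_references"))  # PasIndexer
```

The model only ever sees the `exposed` list, while the full catalog stays available for routing. How relevance is actually decided (tags, embeddings, a router model) is a separate design choice.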
Decision Table (32 GB VRAM Setup)
| Use case | Recommended model | Speed | Notes |
|---|---|---|---|
| High-volume patching | gemma4:26b | 170 tok/s | Add format validation layer |
| Code understanding | qwen3.6:27b | 28 tok/s | Best comprehension overall |
| Routing decisions | qwen3.6:35b-a3b | 131 tok/s | Perfect routing + fast |
| Balanced all-rounder | qwen3.5:27b | 29 tok/s | Strong across all phases |
| Tool/IDE integration | qwen3.6:35b-a3b | 131 tok/s | Best speed + tool support |
What About Cloud?
I ran the full five-phase evaluation on Claude Sonnet 4.6 via the Anthropic API for comparison. Where does cloud actually outperform local?

| Phase | Best local | Sonnet 4.6 | Delta |
|---|---|---|---|
| AT1 Code Detective | 0.96 (gemma4) | 0.983 | +0.02 ≈ 0 |
| AT2 Comprehension | 0.70 (qwen3.6:27b) | 0.983 | +0.28 ← gap |
| AT3 Patches | 0.88 (gemma4) | 0.955 | +0.08 |
| AT4 Routing | 1.00 (multiple) | 0.994 | ≈ 0 |
| AT5 Tool Calling | 1.00 (multiple) | 1.00 | 0 |
Practical conclusion: Local models are at parity with cloud on extraction, routing, and tool calling. The 28-point comprehension gap (AT2) is the only argument for selective cloud use. For teams constrained by GDPR (DSGVO), local-only remains viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.
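The hybrid rule can be derived from the comparison table rather than hard-coded. The sketch below uses the scores from the table. The 0.10 threshold is an assumption chosen for illustration, not a value from the benchmark.

```python
# (best local score, Sonnet 4.6 score) per phase, from the comparison table.
RESULTS = {
    "AT1": (0.96, 0.983),
    "AT2": (0.70, 0.983),
    "AT3": (0.88, 0.955),
    "AT4": (1.00, 0.994),
    "AT5": (1.00, 1.00),
}

GAP_THRESHOLD = 0.10  # assumption: smallest gap worth paying cloud costs for


def cloud_phases(results: dict, threshold: float) -> list[str]:
    """Phases where cloud beats the best local model by more than the threshold."""
    return [
        phase
        for phase, (local, cloud) in results.items()
        if cloud - local > threshold
    ]


print(cloud_phases(RESULTS, GAP_THRESHOLD))  # ['AT2']
```

With this threshold only AT2 is routed to the cloud. A local-only deployment is the same table with the threshold set above 1.0.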
The Realistic Assessment
Local LLMs are genuinely useful for Delphi development — as a force multiplier for analysis and mechanical transformation phases. What works: breaking work into distinct task types, batching by model, adding format validation, using comprehension models for risk analysis first, and treating tool calling as a hard capability requirement.
The models that fit this architecture on a 32 GB setup — qwen3.6:27b for understanding, gemma4:26b for patching, qwen3.5:27b as the balanced mid-tier, qwen3.6:35b-a3b for routing and tool calls — are capable enough to make local-only, GPU-resident Delphi LLM pipelines a practical option today.
