Let me save you some time upfront: this is not a post about prompting ChatGPT to help with Delphi. This is about running local, on-premise LLMs — models that never send your source code to any server — through a structured, five-phase benchmark built from 30 real production files.
The test domain is a large legacy codebase undergoing Unicode migration — demanding enough to expose real capability gaps across code analysis, patch generation, routing, and tool calling. The findings apply broadly: if you are integrating local LLMs into any Delphi workflow, this benchmark tells you which models handle which tasks, which ones fail at the API level, and where the architecture decisions actually matter.
The Problem Is Bigger Than It Looks
The codebase in question is large. The production compile base is Delphi 2007, partially migrated to a modern Delphi version. What does legacy here actually mean? ShortStrings used for binary serialization. Records with fixed memory layouts that must not change because data was written to disk in that exact layout twenty years ago. No MVC, no MVVM — a custom control hierarchy that grew organically before those terms meant anything in the Delphi world. The kind of codebase where a seemingly innocent String vs ShortString swap can corrupt a file format three layers down.
A cloud AI service is not an option. The code is proprietary, the clients are sensitive, and the answer to "does your code leave the building?" must always be no. So the question became: can local LLMs running on our own hardware actually help, or are we on our own?
The Hardware Reality
One GPU with 32 GB VRAM. An Ollama server reachable at a local network address. We tested eight models:
| Model | Parameters | Architecture |
|---|---|---|
| qwen2.5-coder:14b | 14b | Dense |
| devstral-small-2:24b | 24b | Dense |
| qwen3-coder:30b | 30b | Dense |
| qwen3.5:27b | 27b | Dense |
| qwen3.6:27b | 27b | Dense |
| qwen3.6:35b-a3b | 35b (MoE) | Mixture-of-Experts |
| gemma4:26b | 26b | Dense |
| deepseek-r1:8b | 8b | Dense |
Models That Didn't Make the Cut
Six models failed pre-screening and never entered the main evaluation:
| Model | Reason for rejection |
|---|---|
| granite4:32b-a9b-h | Only 16% GPU utilization — 84% CPU offloading. Unusably slow. |
| nemotron-3-nano:30b-a3b | Zero valid JSON responses across 120 validation attempts. |
| llama4:scout | Failed instruction following and basic code output tests. |
| qwen3-coder-next:latest | Failed instruction following and basic code output tests. |
| deepseek-r1:32b | Failed validation tests. |
| deepseek-r1:14b | Failed validation tests. |
Why a Five-Phase Benchmark?
- Find hidden problems before you touch anything
- Understand what a unit actually does, not just what it says it does
- Write correct Delphi code, not plausible-looking pseudocode
- Know when a task is trivial and when it needs a more capable model
- Call tools — interacting with an IDE or AST system rather than outputting text
Those five capabilities became five test phases, applied to 30 carefully selected production units.
The Production Context
This benchmark was not designed in isolation. It was built to answer a concrete engineering question: which local models can reliably handle which classes of work in a production-grade, on-premise AI pipeline?
That pipeline consists of two components developed in parallel with this benchmark work.
ProxyMCP is a lightweight routing layer that sits between IDE tooling and the model backend. It implements the Model Context Protocol, accepts tool calls from development environments, and forwards them — without transforming the payload — to the backend layer that handles execution.
Enterprise Server is the orchestration layer behind ProxyMCP. It handles model routing, task classification, session management, and audit logging across multiple users and projects. It is designed for on-premise deployment — no data leaves the local network — and built to the requirements of teams that need traceable, reproducible AI-assisted workflows with defined service levels.
The routing architecture described in Part 3 of this series is not a theoretical recommendation. It is the architecture these components implement. The benchmark determined which models fill which roles.
Phase AT1: Code Detective Work
Each of the 30 units contains one non-obvious fact — something you can only find by genuinely reading the code. Example:
Unit: AsyncSettings.pas
Question: Which fields of TAsyncSettings are shared across every instance rather than per-instance, making the class effectively a singleton without the type system saying so?
Required: Name the specific identifiers.
Phase AT2: Comprehension (120 Tasks)
Each unit generates four tasks: one unit-level summary and three focused questions. Scored by Claude Opus 4.7 as judge using HIT/PARTIAL/MISS grading.
Phase AT3: Patch Generation
Each unit has one realistic migration task. The model must output a structured JSON patch — insert after line N, replace lines N through M, delete lines N through M. Maximum three operations:
{
"operations": [
{
"op": "insert_after_line",
"line": 35,
"content": " Class procedure WaitForCompletion(aTimeoutMs : Cardinal);"
}
]
}
Phase AT4: Routing (90 Decisions)
30 units × 3 tasks each. Classify as local (trivial), mid (moderate), or top (deep understanding required). A model that consistently over- or under-classifies is useless as a router.
Phase AT5: Tool Calling (73 Tasks)
Can these models use Ollama's native tool_calls API? This is not about prompting a model to output JSON — it is about whether the model returns actual tool_calls objects in the API response. Several otherwise capable models fail here completely.
The Judge and the Baseline: Two Roles for Claude Models
This benchmark uses Claude models from Anthropic in two completely separate roles — and it is worth being explicit about this, because the distinction matters for how you read the results.
| Model | Role | What it does |
|---|---|---|
| Claude Opus 4.7 | Judge & Gold Standard Author | Reads each of the 30 source files independently, writes the gold standard answers, then scores every model response as HIT / PARTIAL / MISS. Does not participate as a candidate. |
| Claude Sonnet 4.6 | Cloud Baseline Candidate | Runs the identical five-phase benchmark as the local models — same questions, same tasks, same scoring. Acts as a reference point: how much better does a capable cloud model actually score? |
Opus is the most capable model in Anthropic's lineup. Using it as the judge — rather than a fixed rubric or human annotators — means the gold standard is set by the model with the deepest understanding of the source material. Sonnet then competes against local models on the same playing field, graded by the same judge.
Is it a conflict of interest that the judge and one of the candidates come from the same company? It is worth noting. In practice, Opus's judgments on Delphi code analysis were consistent with what a senior developer would recognize as correct — the gold standard answers were not padded to favor any particular model family. And the local models' scores speak for themselves: gemma4:26b reaches 96% on AT1, within two points of Sonnet. A biased judge would not produce numbers like that.
A Note on the Cloud Baseline
After completing the local evaluation, we ran the same five-phase benchmark on Claude Sonnet 4.6 via the Anthropic API — not to re-open the DSGVO argument, but to establish a reference point. The short version: local models are essentially at parity with Sonnet on code extraction, routing, and tool calling. The comprehension gap (AT2) is real and significant — 28 percentage points between the best local model and Sonnet. Whether that gap matters depends on whether you need the model to explain code or just to find things in it. Part 2 covers this in detail.
Part 2 covers the actual results: which models performed where, the failure modes, and what surprised us. Part 3 covers the practical takeaways — routing architecture and model selection by task type.

.png)
No comments:
Post a Comment