Tuesday, June 9, 2026

Beyond Simple Prompts: Building an Enterprise AI Toolchain in Delphi

The conversation about AI in software development tends to revolve around prompts. Write a better prompt, get a better answer. Use a smarter model, get smarter code. And for a one-off task, that is entirely true. But when you try to integrate AI seriously into a professional Delphi development workflow — one with hundreds of units, multi-day tasks, multiple concurrent sessions, and real-world complexity — you quickly hit a ceiling that no amount of prompt engineering can break through. The problem is not the model. The problem is the architecture around it.


This post describes what we built to break through that ceiling: a full enterprise backend for AI-driven Delphi development, consisting of a central backend server, a thin MCP proxy layer, and a growing ecosystem of loadable server modules. It also addresses a concern that matters for many professional teams: with local models integrated directly into the backend, agents can do their work without sending any content over the internet at all. It is the result of months of iterative development, and it is currently being prepared for commercial release. If you are just getting started with AI and Delphi, the standalone MCP server versions on our website are a great entry point — their source code is available for purchase, so you can learn the patterns and build your own tools. But if you are ready to go further, this article is about what comes next.

The Architecture at a Glance

Before diving into the individual capabilities, it helps to understand the overall structure. The system has three layers.

At the top sits the AI agent — in our case, Claude Code, though the architecture is model-agnostic. The agent communicates via the standard MCP protocol over stdio, which means it works with any MCP-compatible client without modification.

In the middle is a lightweight proxy process — the only MCP endpoint the agent ever sees. It does not contain business logic. Its job is to translate MCP protocol messages into a compact binary TCP protocol and forward them to the backend server. It also handles reconnection, session registration, and the agent-facing tool list. This separation is deliberate: the proxy is thin and replaceable, the backend is where everything interesting happens.

At the bottom is the backend server itself — a Delphi application that loads a collection of server modules as DLLs, manages sessions, routes tool calls, and maintains shared state. Each module is a self-contained DLL that exposes a defined interface. The server knows how to load them, call them, hot-swap them, and in some cases run them autonomously on a timer.

There are three kinds of modules. Agent-facing modules expose tools that the AI can call directly — everything the agent needs to read, write, compile, debug, search, or communicate. Backend modules run autonomously without direct agent involvement, executing scheduled work independently in the background. Infrastructure modules provide shared services that other modules consume, such as the database connector or the source code formatter. This three-way classification is not cosmetic; it shapes how each module is loaded, called, and lifecycle-managed by the server. The individual modules — and what each one enables — are described in the sections that follow.

One Backend, Many Agents — The Efficiency Argument

In a conventional MCP setup, every agent session starts its own MCP server process. If you have three Claude Code windows open, you have three instances of every MCP server running — three times the memory, three independent states, no shared knowledge between them. For simple tools, this is fine. For a module like the Delphi code analyzer, which parses and indexes source files, loads an abstract syntax tree into memory, and maintains a registry of nodes across a codebase of 1,600 units and 2.6 million lines of code, spinning up a fresh instance per session is simply not practical.

With the backend server, every module loads exactly once. All agent sessions connect to the same backend via TCP, each gets its own session context, but the heavy shared resources — loaded ASTs, database connections, job queues — exist only once. A new session costs a TCP handshake and a session registration record, not a full process with gigabytes of loaded state. This alone changes what is practically possible.

The server also handles thread safety automatically. Each module declares whether it is thread-safe. Thread-safe modules handle concurrent requests directly from multiple sessions. Modules that maintain internal state are protected by a request queue — calls are serialized through a single worker thread transparently, without any change needed in the module itself. The agent never sees any of this; it just calls tools and gets results.

PlanMCP: Persistent Memory Across Sessions

Anyone who has worked seriously with AI coding assistants will recognize this pattern: you start a session, the agent makes good progress, but the context window fills up. You need to start a new session. So you write a summary to a Markdown file, describe where you left off, note the open tasks, and hope you remembered everything important. Then you paste it into the next session and pick up from there. It works, sort of. But it is fragile, manual, and does not scale.

PlanMCP replaces this entirely. It is a backend module backed by a MySQL database with thirteen tables for projects, tasks, decisions, knowledge gaps, sessions, artifacts, and events. When the agent starts a session, it calls the planning menu tool first — it gets back a formatted overview of the current project state: open tasks, the next task ready to start, decisions already taken, and known gaps to work around. No Markdown file, no manual briefing, no hoping nothing was forgotten.

This makes the /clear command trivial. When the context window is getting full, you just clear it. The agent picks up exactly where it left off in the next session, because the actual state — the task list, the progress, the constraints, the open questions — lives in the database, not in the conversation. A restart hint mechanism makes this even smoother: before clearing, the agent writes a short note about what to do next. This note is stored in the database and automatically injected into the initialization response of the next session. The new session's first action is always: read the menu, check for a resume hint, and continue.

Tasks in PlanMCP are not just to-do entries. They have dependencies — a task can be blocked until another is completed. They carry context: code snippets, decisions, constraints, and links to artifacts. They have a full status history. The agent can query the next task that is actually ready to start, given what has already been completed. For larger projects spanning weeks and dozens of tasks, this is the difference between a genuinely productive agent and one that needs constant hand-holding to know where it is.


DelphiMCP: Not Just a Compiler Wrapper

The most common reaction when people hear about an AI-Delphi integration is: "So it can compile code?" Yes, but that is the least interesting part. Let us start with project creation. The Delphi module can create a fully structured new Delphi project from scratch — DPR, DPROJ, all base structures — following the exact conventions the Delphi toolchain expects. The agent does not write a DPR file into a text buffer and hope for the best; it calls a tool that understands what a valid Delphi project looks like and produces one correctly.

The more significant capability is the Abstract Syntax Tree interface. When the agent works with a Delphi unit, it does not receive raw Pascal source code — it receives a structured JSON representation of the syntax tree, with every node carrying a stable identifier. A class declaration, a method body, an interface section, a conditional compilation block: each is a named, addressable node in a tree. The agent can navigate this tree, inspect individual nodes at three levels of detail (compact for orientation, standard for normal work, rich for deep analysis), and perform surgical edits: replace a node, delete a node, insert a node before or after another. These operations are precise and unambiguous. There is no "find line 847 and replace the third occurrence of this string." There is "replace the node with this identifier," and it works.

This matters enormously at scale. A codebase with 1,600 units and 2.6 million lines cannot be navigated by reading files. It can be navigated by querying a structured index: find all types that descend from TComponent and have no override of BeforeDestruction, find all methods that call a deprecated API, and find all units in the Uses chain of a given file. These are real queries that the module can answer without the agent ever seeing a single line of source code it did not ask for.

LSPHandler: The IDE's Own Semantic Engine

The AST interface gives the agent structural access to source code: navigate the tree, inspect nodes, and make surgical edits. What it does not provide is semantic understanding — the difference between knowing that a variable named Sender exists and knowing that it is of type TObject in this specific call context, or knowing not just where a method is declared but where it is actually defined in the inheritance chain. For that level of understanding, there is only one authoritative source: the Delphi compiler itself.

The LSPHandler module connects to Delphi's own language server — the same process the IDE uses internally to power code completion, hover tooltips, and go-to-definition navigation. When the agent opens a project through this module, the language server starts in the background and processes all project units using the real Delphi compiler. From that point on, every query the agent makes is answered by the same semantic engine that answers your questions when you hold Ctrl and hover over a symbol in the IDE.

The hover tool returns the type and symbol information for any position in a source file — identical to what appears in the IDE tooltip when you move the mouse over an identifier. The definition and declaration tools resolve a symbol at a given position and return the file path and line number where it is defined or declared, following the full inheritance and unit resolution chain. The symbols tool returns a hierarchical symbol tree for an entire source file — every class, method, property, field, constructor, and constant, structured the same way the IDE's structure view organises them. The diagnostics tool retrieves the compiler's error and warning output after background compilation, with the same precision as a full build in the IDE.

The two modules complement each other cleanly. The AST interface is used when the agent needs to edit code: navigate the tree, find the right node, and replace it precisely. The language server is used when the agent needs to understand code: resolve a type, find the real definition, and confirm that a change is semantically correct. Together, they give the agent the same combination of capabilities a developer has in the IDE — structural editing through the code model, semantic understanding through the compiler.

DebuggerMCP: The Agent Steps Through Code

This is the capability that tends to produce the strongest reaction in experienced Delphi developers: the AI can debug. Not simulate debugging, not guess at what a debugger would say — actually control a live Delphi debug session through the IDE's own debugger interface, the same one you use when you press F9 to start the program, F7 to step into a call, F8 to step over a statement, and F4 to run to the cursor.

The architecture behind this involves two components. A small IDE package registers itself as a TCP server inside the running Delphi IDE, listening for commands from outside. The debugger module in the backend server acts as a TCP client to that package. When the agent calls a debugger tool, the backend relays the command through the TCP connection to the IDE plugin, which executes it against the live Delphi debugger API — the same IOTADebuggerServices interface that the IDE itself uses internally. The response travels back the same way.

The agent has access to the full range of debugger operations: open a project, set a breakpoint at a specific file and line, optionally with a condition or pass count, start the debugger, wait for execution to stop, read the value of any variable in scope, step over or step into the current statement, continue execution, and stop the debugger. It can also list all active IDE instances — if you have Delphi 2007 and Delphi XE open at the same time, the agent can discover both and choose which one to target. The IDE plugin supports both versions, because the underlying debugger interface has been available since Delphi 2007.

One scenario that illustrates the multi-IDE capability particularly well is cross-version debugging. Imagine a legacy Delphi 2007 executable that calls into a modern Delphi 13 DLL. Both are running in their respective IDEs simultaneously. The agent calls the session discovery tool, receives a list of both active IDE instances with their version identifiers, and can attach to either one independently. It sets a breakpoint in the Delphi 2007 EXE at the point where it calls into the DLL, steps into that call, and then switches its attention to the Delphi 13 IDE session to inspect the state inside the DLL. Following a call across a version boundary — something that is genuinely awkward to do manually — becomes a routine operation. The agent does not care that the two binaries were compiled seventeen years apart; it just follows the execution.

To make this concrete with a simpler example, we verified the full flow end-to-end. The agent opened a project, set a breakpoint with a condition that fires only when a specific variable has a specific value, started the debugger, waited for the breakpoint to be hit, read the variable value to confirm the condition, stepped over a statement, read the variable again to verify the updated value, and then stopped the session. Every step of that sequence is something a developer does manually today. The agent did it without being told which file to open or where the bug might be — it reasoned about the code via the AST module, formed a hypothesis, and verified it through the debugger.

Multi-Agent Coordination: The Intent Lock Protocol

When multiple agent sessions work on the same project simultaneously — one writing code, another running analysis, a third tracking tasks — they need a way to coordinate without stepping on each other. The backend provides a cooperative locking mechanism called the intent lock. An agent declares its intent for a resource (typically a project working directory), acquires the lock, does its work, and releases it. Other agents watching the same resource are notified when the lock state changes. If an agent crashes or its session ends abnormally, the lock expires automatically after a configurable timeout, so the resource never stays blocked indefinitely. This is agent-to-agent coordination at the protocol level, without any external orchestration tool.

Tool Names Optimized Per Model

One of the less obvious findings from our benchmarking work — described in the earlier posts in this series — is that different AI models respond differently to the same tool names and descriptions. A name that is intuitive to a large frontier model may be ambiguous to a smaller local model. A description optimized for Claude reads differently to an Ollama-hosted model. This is not a hypothetical concern: in our benchmark across 198 tools, we found 75 meaningful differences between what Opus and Sonnet considered the clearest way to name and describe the same tool.

The system handles this through a model identity mechanism. When a session starts, the agent identifies itself with its model name. From that point, every tool list request is answered with names and descriptions tuned for that specific model. The translations are defined inside each module DLL — the module author knows their tools best and maintains the per-model variants alongside the rest of the module code. The backend server passes the model identity through to the module and delivers whatever the module returns. No central mapping file, no configuration outside the codebase.

The tool list itself is also structured differently from a flat list of tools. Tools are organized into groups. On first connection, the agent receives only the group overview — the categories of available functionality, not every individual tool. It can load a group when it needs those tools, and within a group, it can request either a minimal set of the most essential tools or the full extended set. This keeps the context window impact of tool discovery proportional to what the agent is actually doing.

Keeping the Context Window Free

A system with nearly 200 tools across a dozen modules could easily overwhelm an agent's context window before it has written a single line of code. The proxy architecture solves this through lazy loading. When a session starts, the agent receives only a compact group overview: the categories of available functionality, not the individual tools within them. The agent loads a group only when it actually needs those tools. Within a group, it can choose between a minimal set covering the most common operations and a full extended set. Context window impact scales with what the agent is doing, not with the total size of the toolchain.

A practical example: an agent that only needs to edit a Word document loads the document tools and nothing else. It has no knowledge of the compiler, the debugger, or the Jira integration — and it does not need to. Those modules exist on the server, ready to be called, but they take up no space in the context window until they are needed. This design principle — expose what is necessary, hide what is not — is what makes it feasible to have a rich, deep toolchain without paying a constant context window tax on every session.

The same principle applies at the individual tool level. Every tool in the system has an explain function — a built-in mechanism that returns a detailed description of exactly how to call that tool, including concrete examples. Instead of front-loading the agent with exhaustive documentation at startup, the agent can query the explanation for any tool on demand, precisely when it is about to use it. This keeps the tool descriptions accurate, contextual, and out of the way until they are actually needed.

Local Models as First-Class Citizens

Cloud AI is excellent for interactive work — it is fast, capable, and handles complex reasoning well. But it is not free, it is not private, and it is not well-suited for batch processing thousands of items. For teams where it matters that no source code, business logic, or project content is sent outside the local network, local models running on your own hardware are the right answer. There is no external API call, nothing logged on a third-party server, and no dependency on an internet connection during inference. For tasks like populating a code knowledge base with summaries and analysis of 1,600 Delphi units, a local model running on your own hardware is the right tool: no per-token cost, no data leaving your network, and it can run overnight while you sleep.

The Ollama integration module treats local inference as a proper backend service, not an afterthought. It connects to a local Ollama instance, maintains an asynchronous job queue for batch work, and includes a GPU guard that monitors actual GPU utilization before starting any inference. If the GPU is busy — because you decided to play a game or run another process — new batch jobs wait rather than starving your system. When GPU headroom becomes available, the queue resumes automatically. A separate model unload tool frees VRAM on demand when you need it for something else.

The split between interactive and batch inference is clean: the AI agent uses cloud models for real-time reasoning and tool calls, while the Ollama module handles background batch jobs asynchronously. The agent submits a job and moves on. The backend processes it when resources allow and stores the result. No blocking, no wasted context window, no timeout errors on long-running generations. The entire configuration — which model, which endpoint, GPU threshold, retry interval — lives in a single INI file section and can be changed without recompiling anything.

Remote Deployment and Direct Database Access

The proxy and the backend server do not have to run on the same machine. A relay interface module transparently forwards TCP connections to a backend server running on a different host — a more powerful build server on the local network, a team server accessible to multiple developers, or a cloud VM. From the agent's perspective, nothing changes; it connects to the same proxy and gets the same tool interface. The relay handles the routing. TLS support is built in for deployments where the connection crosses a network boundary that requires encryption.

Database access follows the same injection model as everything else in the system. A MySQL connector module is loaded once by the server, and its interface is injected into any other module that declares a need for it. The modules that use the database — the planning system, the blog publisher, the code knowledge base, the feature request tracker — all receive the same shared connection without knowing or caring how it was established. Schema management is self-contained: each module creates its own tables if they do not exist, using a consistent naming convention. There is no separate migration tool, no schema file to manually apply. Deploy the module, start the server, and the tables appear.

Closing the Loop: Jira, Outlook, and Word

Software development does not happen in a code editor in isolation. Some tickets need to be read and updated. There are emails from stakeholders that describe requirements or report problems. There are Word documents with specifications, release notes, and design decisions that need to stay in sync with the code. Every time these things live in separate tools that the AI cannot reach, the developer becomes the manual bridge — copy a ticket description into the chat, paste the AI's output back into the document, and summarize the email manually. The workflow integration modules eliminate that bridge.

The Jira module gives the agent direct access to your issue tracker. It can read tickets, analyze their content, assess scope and risk, create new issues, update status, and link related items — all without leaving the coding session. When the agent finishes implementing a feature, it can close the ticket that requested it. When it encounters a problem that should be tracked separately, it can open a new one.

The Outlook module integrates email into the same workflow. The agent can read incoming messages, understand their context in relation to the current project, compose replies, manage folders, and handle attachments. For developers who receive bug reports or requirements by email — which is most of us — this means the agent can act on that information directly rather than waiting for a human to relay it.

The Word module — the one used to write this very document — gives the agent structured access to Word files. It can create documents, add and edit paragraphs with full formatting control, insert tables, manage headers and footers, replace specific text ranges, and work with the document's paragraph structure by stable identifier rather than by line number. Specifications, release notes, API documentation, design decisions: anything that lives in a Word document becomes part of the same connected workflow. When code changes, documentation can change with it, in the same session, without a copy-paste step in between.

To illustrate the combined workflow, an agent receives a Jira ticket describing a reported bug. It reads the ticket, queries the AST module to find the relevant code path, forms a hypothesis about the cause, sets a conditional breakpoint via the debugger module, steps through execution to confirm, makes the fix via the AST mutation interface, compiles and verifies, updates the Jira ticket as resolved, and appends a note to the release notes document in Word. That is an end-to-end development cycle — from bug report to resolved ticket and updated documentation — with the developer reviewing and approving, but not manually executing any of the individual steps.

What This Actually Enables

There is a pipeline running in this system right now that requires no human interaction to operate. A scheduler fires at a configured time. It triggers an orchestrator module, which picks up pending analysis jobs. The orchestrator dispatches those jobs to the Ollama inference module, which processes them against the local model and stores the results in the code knowledge database. The database grows. The next time the agent needs to understand a part of the codebase, the answers are already there. This does not involve a prompt. It does not require a developer to be at their desk. It is software doing what software should do: running reliably in the background, building something useful.

There is also a feedback channel built in at the tool level. When an agent encounters a situation where the right tool does not exist — the capability it needs is not exposed by any current module — it can file a feature request directly through a built-in mechanism. The request is stored in the database, categorized automatically, and appears in the planning system, where a developer will see it. The agent is not just using the toolchain; it is actively contributing to its own improvement.

None of this is achievable with prompts alone, no matter how carefully crafted. Prompts are ephemeral. They vanish when the context window clears. They cannot set a breakpoint. They cannot remember last week's architectural decision. They cannot trigger at 3 AM when no one is watching. The real shift in capability comes not from smarter prompts but from treating AI as a component in a proper software architecture — one with persistent state, typed interfaces, modularity, lifecycle management, and integration with the real tools that development actually depends on.



Getting Started

If you are new to AI-assisted Delphi development and want to understand the foundations, the standalone MCP server versions on our website are the right starting point. Each one is a self-contained server covering a specific area of functionality. Their source code is available for purchase, designed to be readable and instructive — a solid foundation for understanding how MCP servers work in a Delphi context and building your own.

The enterprise backend server with the full DLL module ecosystem — the system described in this post — will be available for purchase soon. If you are interested, feel free to get in touch via the contact form on this blog — I am happy to answer questions and discuss what fits your situation.

The benchmark series that started this blog explored which AI models perform best on Delphi tasks. This post describes the infrastructure that puts those findings to work. The next step — a code knowledge base that lets any agent navigate a million-line legacy codebase without reading a single file — is already in progress. We will write about that too, when it is ready.

Friday, June 5, 2026

Local LLMs for Delphi: A Production Benchmark — Part 3: What to Actually Use

This is the final post in a three-part series. Part 1 covered benchmark design and methodology. Part 2 covered what the numbers revealed. This post covers what you should actually do with those results.

After running 8 local models through 5 benchmark phases on 30 real Delphi production units, the most useful thing I can offer is not another table of scores — it is a set of concrete decisions. If you are planning to integrate local LLMs into any Delphi pipeline — migration, code review, documentation, or IDE integration — this post tells you which model to use for each job, which two to skip entirely, and where the remaining rough edges are.

Pick the Right Model for the Right Job

The clearest finding from this benchmark is that no single model dominates across all five task types. The right approach is task routing — matching each class of work to the model best suited for it.

Code Analysis / Risk Discovery (AT1)

Use gemma4:26b (score 0.96) for targeted fact extraction. Use qwen3.5:27b (0.86) or qwen3.6:27b (0.82) when you need coherent functional explanations.

Code Comprehension and Q&A (AT2)

Use qwen3.6:27b (score 0.70) or qwen3.5:27b (0.69). Both score significantly above the field.

Patch Generation / Code Writing (AT3)

Use gemma4:26b (score 0.88, 170 tok/s) as your primary patcher — add format validation for the 70% non-compliant responses. Use qwen3.5:27b (0.77) as a fallback when compliance matters more than speed.

Routing / Complexity Classification (AT4)

Use qwen3.6:35b-a3b (score 1.00, 131 tok/s). Perfect routing accuracy combined with MoE speed. Avoid qwen3-coder:30b (0.71) — inconsistent classification defeats the router.

Tool Calling / IDE Integration (AT5)

Only: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b, gemma4:26b. Hard disqualified: qwen2.5-coder:14b, qwen3-coder:30b, deepseek-r1:8b — API-level failure, not fixable through prompting.


The Routing Pipeline Architecture

The single most impactful design decision is the batching strategy. GPU model loading takes 30–60 seconds per swap. Batch by tier:

Incoming task
        |
   Router model (qwen3.6:35b-a3b, fast)
        |
   +--------------------+--------------------+
   |                    |                    |
 local              mid-tier              complex
gemma4:26b        qwen3.5:27b          qwen3.6:27b
(fast, high vol) (reliable, balanced)  (best understanding)

For a batch of 50 units across three complexity tiers, batch-by-tier scheduling can eliminate 45+ minutes of pure idle time.


The Context Window Problem: Why a Proxy Layer Is Not Optional

Model selection and batching strategy are the two decisions this benchmark directly informs. But there is a third decision that matters at least as much for local deployments, and it has nothing to do with which model scores highest on any phase.

If you are running a real development pipeline, you are not running one MCP server. You are running several. A realistic Delphi development setup includes a source analysis server (DelphiMCP: ~40 tools), a document server (WordMCP: ~40 tools), a code indexer (PasIndexer: ~15 tools), a file editor (StrEditor: ~15 tools), and a handful of supporting services. That is well over 150 tool declarations present in context before a single line of code is analyzed.

Each tool declaration — name, parameter schema, description — costs between 150 and 300 tokens. At 150 tools, you are looking at 22,000 to 45,000 tokens of overhead at the start of every session. For a local model running with a 32k context window, that is between 70% and 140% of the available context consumed before the first user message arrives.

VT5 makes this worse, not better. The benchmark showed that local models score only 1–34% on Phase 2 of the tool-name test — they cannot reliably interpret a tool's purpose from its name alone. They depend on the description field in the schema. Compressing tool declarations to save context is not an option; the descriptions are load-bearing for local models.

ProxyMCP addresses this directly. As a single MCP endpoint, it does not expose all tools from all servers to the model. It exposes only the tools relevant to the current task — typically 3 to 8 — and routes the call to the appropriate backend server. From the model's perspective, the tool surface is minimal and always task-scoped. From the pipeline's perspective, every server is still fully available.

The practical effect: a local model operating through ProxyMCP sees a context overhead of roughly 500–1,500 tokens for tool declarations rather than 22,000–45,000. That difference is the difference between a model that has room to reason and a model that is fighting for context from the first token. For cloud models, the same architecture translates directly into cost savings — fewer tokens declared means fewer tokens billed, on every single request.


Decision Table (32 GB VRAM Setup)

Use caseRecommended modelSpeedNotes
High-volume patchinggemma4:26b170 tok/sAdd format validation layer
Code understandingqwen3.6:27b28 tok/sBest comprehension overall
Routing decisionsqwen3.6:35b-a3b131 tok/sPerfect routing + fast
Balanced all-rounderqwen3.5:27b29 tok/sStrong across all phases
Tool/IDE integrationqwen3.6:35b-a3b131 tok/sBest speed + tool support

What About Cloud?

I ran the full five-phase evaluation on Claude Sonnet 4.6 via the Anthropic API for comparison. Where does cloud actually outperform local?

PhaseBest localSonnet 4.6Delta
AT1 Code Detective0.96 (gemma4)0.983+0.02 ≈ 0
AT2 Comprehension0.70 (qwen3.6:27b)0.983+0.28 ← gap
AT3 Patches0.88 (gemma4)0.955+0.08
AT4 Routing1.00 (multiple)0.994≈ 0
AT5 Tool Calling1.00 (multiple)1.000

Practical conclusion: Local models are at parity with cloud on extraction, routing, and tool calling. The 28-point comprehension gap (AT2) is the only argument for selective cloud use. For GDPR(DSGVO)-constrained teams, local-only remains viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.


The Realistic Assessment

Local LLMs are genuinely useful for Delphi development — as a force multiplier for analysis and mechanical transformation phases. What works: breaking work into distinct task types, batching by model, adding format validation, using comprehension models for risk analysis first, and treating tool calling as a hard capability requirement.

The models that fit this architecture on a 32 GB setup — qwen3.6:27b for understanding, gemma4:26b for patching, qwen3.5:27b as the balanced mid-tier, qwen3.6:35b-a3b for routing and tool calls — are capable enough to make local-only, GPU-resident Delphi LLM pipelines a practical option today.



Thursday, May 28, 2026

Local LLMs for Delphi: A Production Benchmark — Part 2: What the Numbers Reveal

Part 1 described the setup: eight local models, five benchmark phases (AT1 through AT5), 30 real production Delphi units. The test domain was legacy code migration — but the phases themselves cover tasks that appear in any Delphi LLM pipeline: code analysis, comprehension, patch generation, routing, and tool calling. This post covers what the numbers showed.

The Scoreboard

Local models (Ollama, 32 GB VRAM):

ModelAT1AT2AT3AT4AT5Overall
gemma4:26b0.960.620.880.840.990.82
qwen3.5:27b0.860.690.770.851.000.79
qwen3.6:27b0.820.700.700.811.000.76
qwen3.6:35b-a3b0.790.640.700.841.000.74
devstral-small-2:24b0.700.580.710.801.000.70
qwen2.5-coder:14b0.680.500.690.79n/a0.66

Cloud baseline (Anthropic API):

ModelAT1AT2AT3AT4AT5Overall
Claude Sonnet 4.6 ☁0.9830.9830.9550.9941.000.983

Finding 1: Comprehension and Patch Quality Are Not Correlated


gemma4:26b
has the lowest AT2 score (0.62) and the highest AT3 score (0.88). A 26-point gap across two phases that test adjacent capabilities is not noise. qwen3.6:27b is the opposite: highest AT2 (0.70), mid-table AT3 (0.70). The two skills are measuring something genuinely different — not just two views of the same underlying "code understanding."


Finding 2: AT3 Failure Modes

Format refusal (gemma4:26b)

70% of gemma4's AT3 responses returned as natural-language prose — no JSON block. The 30% that did comply scored 0.88 on content — the best in the benchmark. This is an instruction-following problem, not a capability problem. Fixable with better prompting.

Systematic token corruption (qwen2.5-coder:14b)

Every generated code line was prefixed with L<n>: — e.g. L29: fLogEnabled : Boolean;. Not valid Pascal. Consistent across all 30 units. Content often correct, output systematically malformed.

Structural misplacement (qwen3-coder:30b)

Asked to add a method, the model inserted code into the Initialization block — runs once at startup instead of being callable. Compiles, passes syntax check, fails at runtime. Would not be caught without unit tests.


Finding 3: Speed — The MoE Advantage

Modeltok/sArchitecture
gemma4:26b~170Dense
qwen3.6:35b-a3b~131MoE
qwen2.5-coder:14b~120Dense, 14b
qwen3.5:27b / qwen3.6:27b~28–29Dense, 27b

For a batch run over 30 units, 170 tok/s vs 28 tok/s is measured in hours of wall clock time.


Finding 4: Tool Calling — A Binary Split

Models that returned proper tool_calls objects: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b (1.00 each), gemma4:26b (0.98).
Hard fail at the API level: qwen2.5-coder:14b, qwen3-coder:30b (text output only), deepseek-r1:8b (Ollama error: "does not support tools").

This is not a prompting problem — it is an API-level capability. A model either returns tool_calls objects in the response or it does not. For any pipeline that relies on IDE or MCP integration, this is a hard filter applied before any other consideration.


Finding 5: German vs English Prompts on AT1

ModelAT1 ENAT1 DEDelta
gemma4:26b0.960.83-0.13
qwen3.5:27b0.860.78-0.05
qwen2.5-coder:14b0.680.39-0.31

Every model scores lower in German. The losses range from 5 to 31 points. Recommendation: Use English for any prompt that includes source code or asks the model to name specific identifiers. Use German for everything that faces the developer.


Finding 6: Cloud vs Local — Where the Gap Actually Is


PhaseBest localSonnet 4.6Delta
AT1 Code Detective0.96 (gemma4)0.983+0.02 ≈ 0
AT2 Comprehension0.700.983+0.28 ← main gap
AT3 Patches0.880.955+0.08
AT4 Routing1.000.994≈ 0
AT5 Tool Calling1.001.000

Local models are essentially on par with Sonnet in code extraction, routing, and tool calling. The comprehension gap (28 points on AT2) is where cloud-scale models are meaningfully better. For GDPR-constrained teams, local-only is viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.


Finding 7: AT4 Routing — Not All Models Are Reliable Routers

Routing accuracy is the capability that matters most in a multi-model pipeline: a misclassified task gets sent to the wrong model, which either wastes compute on an oversized model or produces a failure on an undersized one. The AT4 results show a meaningful spread across the field.

The top scorers — qwen3.5:27b (0.85), gemma4:26b (0.84), and qwen3.6:35b-a3b (0.84) — differ significantly from the problem case: qwen3-coder:30b at 0.71. In a pipeline of 90 routing decisions, a 0.71 accuracy means roughly 26 tasks misclassified. That is not recoverable through prompt engineering — classification drift at that scale corrupts the batching architecture that the pipeline depends on.

Speed is the second dimension here. Routing sits on the critical path of every task — every request passes through the router before any work begins. A model that routes accurately but slowly adds latency multiplied across the entire batch. qwen3.6:35b-a3b combines solid routing accuracy with 131 tok/s, making it the only model that is both reliable and fast enough to use as a production router.


VT Pre-Test Findings

Before the five AT phases, every model ran through seven validation tests (VT1–VT7) covering JSON compliance, instruction following, output consistency, basic Delphi understanding, and native tool-call support. All eight models passed every pre-test — this confirmed that the AT results reflect genuine capability differences, not basic reliability failures.

Two VT findings produced results specific enough to be worth documenting alongside the main benchmark.

VT6: Migration Risk Detection

VT6 tested whether models could identify concrete migration traps across three categories of Delphi code: ShortString serialization risks, binary-packed record layouts, and call-chain dependencies. The chart shows significant variation between models, even where their AT1–AT5 scores are similar — VT6 captures a targeted detection capability that the main phases only partially measure.

VT8: Token Budget and Thinking Mode

VT8 exposed a practical deployment trap: when a 300-token output budget is set, and thinking mode is active, all three tested Qwen models consume the entire budget on internal reasoning and return an empty response. The correct answer exists — the model knows it — but there is no token budget left to output it. Setting num_predict ≥ 800 for any thinking-enabled model, the issue is completely.

What the Numbers Add Up To

Seven findings point to a consistent picture. Comprehension and patch generation are genuinely different skills — the best model at one is not the best at the other. Speed varies by a factor of six between the fastest and slowest capable models. Tool calling eliminates three models outright at the API level. Language choice in prompts can cost 31 percentage points on code extraction tasks. Routing accuracy is not guaranteed — one model in the tested pool is too unreliable to use as a router. And the cloud comprehension gap, while real at 28 points, is the only phase where local models fall meaningfully short.

None of these findings argues for a single best model. They argue for a pipeline that routes different task types to different models — which is exactly what Part 3 covers.


Part 3 covers the practical takeaways: routing architecture, model selection by task, and the hybrid pipeline decision table.



Friday, May 22, 2026

Local LLMs for Delphi: A Production Benchmark — Part 1: Design and Methodology

Let me save you some time upfront: this is not a post about prompting ChatGPT to help with Delphi. This is about running local, on-premise LLMs — models that never send your source code to any server — through a structured, five-phase benchmark built from 30 real production files.

The test domain is a large legacy codebase undergoing Unicode migration — demanding enough to expose real capability gaps across code analysis, patch generation, routing, and tool calling. The findings apply broadly: if you are integrating local LLMs into any Delphi workflow, this benchmark tells you which models handle which tasks, which ones fail at the API level, and where the architecture decisions actually matter.


The Problem Is Bigger Than It Looks

The codebase in question is large. The production compile base is Delphi 2007, partially migrated to a modern Delphi version. What does legacy here actually mean? ShortStrings used for binary serialization. Records with fixed memory layouts that must not change because data was written to disk in that exact layout twenty years ago. No MVC, no MVVM — a custom control hierarchy that grew organically before those terms meant anything in the Delphi world. The kind of codebase where a seemingly innocent String vs ShortString swap can corrupt a file format three layers down.

A cloud AI service is not an option. The code is proprietary, the clients are sensitive, and the answer to "does your code leave the building?" must always be no. So the question became: can local LLMs running on our own hardware actually help, or are we on our own?

The Hardware Reality

One GPU with 32 GB VRAM. An Ollama server reachable at a local network address. We tested eight models:

ModelParametersArchitecture
qwen2.5-coder:14b14bDense
devstral-small-2:24b24bDense
qwen3-coder:30b30bDense
qwen3.5:27b27bDense
qwen3.6:27b27bDense
qwen3.6:35b-a3b35b (MoE)Mixture-of-Experts
gemma4:26b26bDense
deepseek-r1:8b8bDense

Models That Didn't Make the Cut

Six models failed pre-screening and never entered the main evaluation:

ModelReason for rejection
granite4:32b-a9b-hOnly 16% GPU utilization — 84% CPU offloading. Unusably slow.
nemotron-3-nano:30b-a3bZero valid JSON responses across 120 validation attempts.
llama4:scoutFailed instruction following and basic code output tests.
qwen3-coder-next:latestFailed instruction following and basic code output tests.
deepseek-r1:32bFailed validation tests.
deepseek-r1:14bFailed validation tests.

Why a Five-Phase Benchmark?


What would a real Delphi LLM assistant actually need to do? Whether the task is migration, refactoring, documentation, or IDE integration, the core capability requirements are the same:
  • Find hidden problems before you touch anything
  • Understand what a unit actually does, not just what it says it does
  • Write correct Delphi code, not plausible-looking pseudocode
  • Know when a task is trivial and when it needs a more capable model
  • Call tools — interacting with an IDE or AST system rather than outputting text

Those five capabilities became five test phases, applied to 30 carefully selected production units.


The Production Context

This benchmark was not designed in isolation. It was built to answer a concrete engineering question: which local models can reliably handle which classes of work in a production-grade, on-premise AI pipeline?

That pipeline consists of two components developed in parallel with this benchmark work.

ProxyMCP is a lightweight routing layer that sits between IDE tooling and the model backend. It implements the Model Context Protocol, accepts tool calls from development environments, and forwards them — without transforming the payload — to the backend layer that handles execution.

Enterprise Server is the orchestration layer behind ProxyMCP. It handles model routing, task classification, session management, and audit logging across multiple users and projects. It is designed for on-premise deployment — no data leaves the local network — and built to the requirements of teams that need traceable, reproducible AI-assisted workflows with defined service levels.

The routing architecture described in Part 3 of this series is not a theoretical recommendation. It is the architecture these components implement. The benchmark determined which models fill which roles.


Phase AT1: Code Detective Work

Each of the 30 units contains one non-obvious fact — something you can only find by genuinely reading the code. Example:

Unit: AsyncSettings.pas
Question: Which fields of TAsyncSettings are shared across every instance rather than per-instance, making the class effectively a singleton without the type system saying so?
Required: Name the specific identifiers.

Phase AT2: Comprehension (120 Tasks)

Each unit generates four tasks: one unit-level summary and three focused questions. Scored by Claude Opus 4.7 as judge using HIT/PARTIAL/MISS grading.

Phase AT3: Patch Generation

Each unit has one realistic migration task. The model must output a structured JSON patch — insert after line N, replace lines N through M, delete lines N through M. Maximum three operations:

{
  "operations": [
    {
      "op": "insert_after_line",
      "line": 35,
      "content": "      Class procedure WaitForCompletion(aTimeoutMs : Cardinal);"
    }
  ]
}

Phase AT4: Routing (90 Decisions)

30 units × 3 tasks each. Classify as local (trivial), mid (moderate), or top (deep understanding required). A model that consistently over- or under-classifies is useless as a router.

Phase AT5: Tool Calling (73 Tasks)

Can these models use Ollama's native tool_calls API? This is not about prompting a model to output JSON — it is about whether the model returns actual tool_calls objects in the API response. Several otherwise capable models fail here completely.


The Judge and the Baseline: Two Roles for Claude Models

This benchmark uses Claude models from Anthropic in two completely separate roles — and it is worth being explicit about this, because the distinction matters for how you read the results.

ModelRoleWhat it does
Claude Opus 4.7Judge & Gold Standard AuthorReads each of the 30 source files independently, writes the gold standard answers, then scores every model response as HIT / PARTIAL / MISS. Does not participate as a candidate.
Claude Sonnet 4.6Cloud Baseline CandidateRuns the identical five-phase benchmark as the local models — same questions, same tasks, same scoring. Acts as a reference point: how much better does a capable cloud model actually score?

Opus is the most capable model in Anthropic's lineup. Using it as the judge — rather than a fixed rubric or human annotators — means the gold standard is set by the model with the deepest understanding of the source material. Sonnet then competes against local models on the same playing field, graded by the same judge.

Is it a conflict of interest that the judge and one of the candidates come from the same company? It is worth noting. In practice, Opus's judgments on Delphi code analysis were consistent with what a senior developer would recognize as correct — the gold standard answers were not padded to favor any particular model family. And the local models' scores speak for themselves: gemma4:26b reaches 96% on AT1, within two points of Sonnet. A biased judge would not produce numbers like that.


A Note on the Cloud Baseline

After completing the local evaluation, we ran the same five-phase benchmark on Claude Sonnet 4.6 via the Anthropic API — not to re-open the DSGVO argument, but to establish a reference point. The short version: local models are essentially at parity with Sonnet on code extraction, routing, and tool calling. The comprehension gap (AT2) is real and significant — 28 percentage points between the best local model and Sonnet. Whether that gap matters depends on whether you need the model to explain code or just to find things in it. Part 2 covers this in detail.


Part 2 covers the actual results: which models performed where, the failure modes, and what surprised us. Part 3 covers the practical takeaways — routing architecture and model selection by task type.