from old school to new frontier: June 2026

Tuesday, June 9, 2026

Beyond Simple Prompts: Building an Enterprise AI Toolchain in Delphi

The conversation about AI in software development tends to revolve around prompts. Write a better prompt, get a better answer. Use a smarter model, get smarter code. And for a one-off task, that is entirely true. But when you try to integrate AI seriously into a professional Delphi development workflow — one with hundreds of units, multi-day tasks, multiple concurrent sessions, and real-world complexity — you quickly hit a ceiling that no amount of prompt engineering can break through. The problem is not the model. The problem is the architecture around it.

This post describes what we built to break through that ceiling: a full enterprise backend for AI-driven Delphi development, consisting of a central backend server, a thin MCP proxy layer, and a growing ecosystem of loadable server modules. It also addresses a concern that matters for many professional teams: with local models integrated directly into the backend, agents can do their work without sending any content over the internet at all. It is the result of months of iterative development, and it is currently being prepared for commercial release. If you are just getting started with AI and Delphi, the standalone MCP server versions on our website are a great entry point — their source code is available for purchase, so you can learn the patterns and build your own tools. But if you are ready to go further, this article is about what comes next.

The Architecture at a Glance

Before diving into the individual capabilities, it helps to understand the overall structure. The system has three layers.

At the top sits the AI agent: in our case Claude Code, though the architecture is model-agnostic. The agent communicates via standard MCP over stdio, which means it works with any MCP-compatible client without modification.

In the middle is a lightweight proxy process, the only MCP endpoint the agent ever sees. It does not contain business logic. Its job is to translate MCP messages into a compact binary protocol carried over TCP and forward them to the backend server. It also handles reconnection, session registration, and the agent-facing tool list. This separation is deliberate: the proxy is thin and replaceable, the backend is where everything interesting happens.
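To make the translation step more tangible, here is a minimal Python sketch of how an MCP JSON-RPC message could be wrapped in a binary frame and recovered from a TCP receive buffer. The frame layout (a 4-byte big-endian length followed by the JSON payload), the function names, and the tool name are assumptions for illustration only; the real wire format is not described in this post.

```python
import json
import struct

# Hypothetical frame layout (an assumption, not the real protocol):
# 4-byte big-endian payload length, followed by the UTF-8 JSON payload.
HEADER = struct.Struct(">I")

def encode_frame(message: dict) -> bytes:
    """Serialize an MCP-style JSON-RPC message into one binary frame."""
    payload = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return HEADER.pack(len(payload)) + payload

def decode_frames(buffer: bytes):
    """Split a receive buffer into complete messages plus leftover bytes."""
    messages = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer)
        end = HEADER.size + length
        if len(buffer) < end:
            break  # frame not complete yet; wait for more data
        messages.append(json.loads(buffer[HEADER.size:end].decode("utf-8")))
        buffer = buffer[end:]
    return messages, buffer

# "example_tool" is a placeholder name, not a tool from the real system.
call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "example_tool", "arguments": {}}}
frame = encode_frame(call)
# One complete frame followed by the first three bytes of the next one.
decoded, rest = decode_frames(frame + frame[:3])
```

The decoder returns the incomplete tail unchanged, which is what lets a proxy of this kind keep reading from the socket until the next frame is whole.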

At the bottom is the backend server itself — a Delphi application that loads a collection of server modules as DLLs, manages sessions, routes tool calls, and maintains shared state. Each module is a self-contained DLL that exposes a defined interface. The server knows how to load them, call them, hot-swap them, and in some cases run them autonomously on a timer.

There are three kinds of modules. Agent-facing modules expose tools that the AI can call directly — everything the agent needs to read, write, compile, debug, search, or communicate. Backend modules run autonomously without direct agent involvement, executing scheduled work independently in the background. Infrastructure modules provide shared services that other modules consume, such as the database connector or the source code formatter. This three-way classification is not cosmetic; it shapes how each module is loaded, called, and lifecycle-managed by the server. The individual modules — and what each one enables — are described in the sections that follow.

One Backend, Many Agents — The Efficiency Argument

In a conventional MCP setup, every agent session starts its own MCP server process. If you have three Claude Code windows open, you have three instances of every MCP server running — three times the memory, three independent states, no shared knowledge between them. For simple tools, this is fine. For a module like the Delphi code analyzer, which parses and indexes source files, loads an abstract syntax tree into memory, and maintains a registry of nodes across a codebase of 1,600 units and 2.6 million lines of code, spinning up a fresh instance per session is simply not practical.

With the backend server, every module loads exactly once. All agent sessions connect to the same backend via TCP, each gets its own session context, but the heavy shared resources — loaded ASTs, database connections, job queues — exist only once. A new session costs a TCP handshake and a session registration record, not a full process with gigabytes of loaded state. This alone changes what is practically possible.

The server also handles thread safety automatically. Each module declares whether it is thread-safe. Thread-safe modules handle concurrent requests directly from multiple sessions. Modules that maintain internal state are protected by a request queue — calls are serialized through a single worker thread transparently, without any change needed in the module itself. The agent never sees any of this; it just calls tools and gets results.
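The serialization idea can be sketched in a few lines of Python. The class and handler below are hypothetical stand-ins, not the server's actual code: calls from many session threads are pushed onto a queue and executed one at a time by a single worker, so the wrapped handler never runs concurrently with itself.

```python
import queue
import threading

class SerializedModule:
    """Runs every call to a non-thread-safe handler on one worker thread."""

    def __init__(self, handler):
        self._handler = handler
        self._requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            args, reply = self._requests.get()
            try:
                reply.put(("ok", self._handler(*args)))
            except Exception as exc:  # hand failures back to the caller
                reply.put(("error", exc))

    def call(self, *args):
        """Called from any session thread; blocks until the worker answers."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((args, reply))
        status, value = reply.get()
        if status == "error":
            raise value
        return value

# A deliberately non-thread-safe handler: unguarded read-modify-write.
state = {"count": 0}

def increment(step):
    current = state["count"]
    state["count"] = current + step
    return state["count"]

module = SerializedModule(increment)
threads = [threading.Thread(target=module.call, args=(1,)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the handler itself is untouched, the same wrapper could sit in front of any stateful module, which matches the "no change needed in the module" property described above.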

PlanMCP: Persistent Memory Across Sessions

Anyone who has worked seriously with AI coding assistants will recognize this pattern: you start a session, the agent makes good progress, but the context window fills up. You need to start a new session. So you write a summary to a Markdown file, describe where you left off, note the open tasks, and hope you remembered everything important. Then you paste it into the next session and pick up from there. It works, sort of. But it is fragile, manual, and does not scale.

PlanMCP replaces this entirely. It is an agent-facing server module backed by a MySQL database with thirteen tables covering projects, tasks, decisions, knowledge gaps, sessions, artifacts, and events. When the agent starts a session, it calls the planning menu tool first and gets back a formatted overview of the current project state: open tasks, the next task ready to start, decisions already taken, and known gaps to work around. No Markdown file, no manual briefing, no hoping nothing was forgotten.

This makes the /clear command trivial. When the context window is getting full, you just clear it. The agent picks up exactly where it left off in the next session, because the actual state (the task list, the progress, the constraints, the open questions) lives in the database, not in the conversation. A restart hint mechanism makes this even smoother: before clearing, the agent writes a short note about what to do next. This note is stored in the database and automatically injected into the initialization response of the next session. The new session's first action is always: read the menu, check for a restart hint, and continue.

Tasks in PlanMCP are not just to-do entries. They have dependencies — a task can be blocked until another is completed. They carry context: code snippets, decisions, constraints, and links to artifacts. They have a full status history. The agent can query the next task that is actually ready to start, given what has already been completed. For larger projects spanning weeks and dozens of tasks, this is the difference between a genuinely productive agent and one that needs constant hand-holding to know where it is.
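The "next ready task" query is essentially a dependency check. The following Python sketch shows one way it could work, assuming a simplified in-memory task record; the field names and the lowest-id-first ordering are illustrative assumptions, not PlanMCP's actual schema or policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    task_id: int
    title: str
    status: str = "open"  # "open" or "done" in this simplified sketch
    depends_on: List[int] = field(default_factory=list)

def next_ready_task(tasks: List[Task]) -> Optional[Task]:
    """Return the lowest-numbered open task whose dependencies are all done."""
    done = {t.task_id for t in tasks if t.status == "done"}
    for task in sorted(tasks, key=lambda t: t.task_id):
        if task.status == "open" and all(dep in done for dep in task.depends_on):
            return task
    return None

tasks = [
    Task(1, "Define module interface", status="done"),
    Task(2, "Implement loader", depends_on=[1]),
    Task(3, "Write integration test", depends_on=[2]),
]
ready = next_ready_task(tasks)
```

Task 3 stays blocked until task 2 is marked done, so the agent is never offered work whose prerequisites are still open.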

DelphiMCP: Not Just a Compiler Wrapper

The most common reaction when people hear about an AI-Delphi integration is: "So it can compile code?" Yes, but that is the least interesting part. Let us start with project creation. The Delphi module can create a fully structured new Delphi project from scratch — DPR, DPROJ, all base structures — following the exact conventions the Delphi toolchain expects. The agent does not write a DPR file into a text buffer and hope for the best; it calls a tool that understands what a valid Delphi project looks like and produces one correctly.

The more significant capability is the Abstract Syntax Tree interface. When the agent works with a Delphi unit, it does not receive raw Pascal source code — it receives a structured JSON representation of the syntax tree, with every node carrying a stable identifier. A class declaration, a method body, an interface section, a conditional compilation block: each is a named, addressable node in a tree. The agent can navigate this tree, inspect individual nodes at three levels of detail (compact for orientation, standard for normal work, rich for deep analysis), and perform surgical edits: replace a node, delete a node, insert a node before or after another. These operations are precise and unambiguous. There is no "find line 847 and replace the third occurrence of this string." There is "replace the node with this identifier," and it works.
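As a rough illustration of identifier-based editing, the Python sketch below applies replace, insert, and delete operations to a small JSON-like tree. The node shape (`id`, `kind`, `name`, `children`), the identifiers, and the function names are assumptions made for this example; the module's real JSON format is not shown in this post.

```python
def _locate(node: dict, node_id: str):
    """Depth-first search; returns (parent, index) of the child with node_id."""
    children = node.get("children", [])
    for index, child in enumerate(children):
        if child["id"] == node_id:
            return node, index
        found = _locate(child, node_id)
        if found is not None:
            return found
    return None

def _require(root: dict, node_id: str):
    found = _locate(root, node_id)
    if found is None:
        raise KeyError(f"no node with id {node_id!r}")
    return found

def replace_node(root: dict, node_id: str, new_node: dict) -> None:
    parent, index = _require(root, node_id)
    parent["children"][index] = new_node

def insert_after(root: dict, node_id: str, new_node: dict) -> None:
    parent, index = _require(root, node_id)
    parent["children"].insert(index + 1, new_node)

def delete_node(root: dict, node_id: str) -> None:
    parent, index = _require(root, node_id)
    del parent["children"][index]

# Hypothetical tree for one unit containing a class with two methods.
unit = {"id": "n1", "kind": "unit", "children": [
    {"id": "n2", "kind": "class", "name": "TCustomer", "children": [
        {"id": "n3", "kind": "method", "name": "Load"},
        {"id": "n4", "kind": "method", "name": "Save"},
    ]},
]}

replace_node(unit, "n3", {"id": "n5", "kind": "method", "name": "LoadFromStream"})
insert_after(unit, "n5", {"id": "n6", "kind": "method", "name": "Validate"})
delete_node(unit, "n4")
method_names = [m["name"] for m in unit["children"][0]["children"]]
```

None of the operations mention a line number or a search string; an unknown identifier fails loudly with a `KeyError` instead of editing the wrong place.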

This matters enormously at scale. A codebase with 1,600 units and 2.6 million lines cannot be navigated by reading files. It can be navigated by querying a structured index: find all types that descend from TComponent and have no override of BeforeDestruction, find all methods that call a deprecated API, and find all units in the Uses chain of a given file. These are real queries that the module can answer without the agent ever seeing a single line of source code it did not ask for.

LSPHandler: The IDE's Own Semantic Engine

The AST interface gives the agent structural access to source code: navigate the tree, inspect nodes, and make surgical edits. What it does not provide is semantic understanding — the difference between knowing that a variable named Sender exists and knowing that it is of type TObject in this specific call context, or knowing not just where a method is declared but where it is actually defined in the inheritance chain. For that level of understanding, there is only one authoritative source: the Delphi compiler itself.

The LSPHandler module connects to Delphi's own language server — the same process the IDE uses internally to power code completion, hover tooltips, and go-to-definition navigation. When the agent opens a project through this module, the language server starts in the background and processes all project units using the real Delphi compiler. From that point on, every query the agent makes is answered by the same semantic engine that answers your questions when you hold Ctrl and hover over a symbol in the IDE.

The hover tool returns the type and symbol information for any position in a source file, identical to what appears in the IDE tooltip when you move the mouse over an identifier. The definition and declaration tools resolve a symbol at a given position and return the file path and line number where it is defined or declared, following the full inheritance and unit resolution chain. The symbols tool returns a hierarchical symbol tree for an entire source file (every class, method, property, field, constructor, and constant), structured the same way the IDE's structure view organizes them. The diagnostics tool retrieves the compiler's error and warning output after background compilation, with the same precision as a full build in the IDE.

The two modules complement each other cleanly. The AST interface is used when the agent needs to edit code: navigate the tree, find the right node, and replace it precisely. The language server is used when the agent needs to understand code: resolve a type, find the real definition, and confirm that a change is semantically correct. Together, they give the agent the same combination of capabilities a developer has in the IDE — structural editing through the code model, semantic understanding through the compiler.

DebuggerMCP: The Agent Steps Through Code

This is the capability that tends to produce the strongest reaction in experienced Delphi developers: the AI can debug. Not simulate debugging, not guess at what a debugger would say — actually control a live Delphi debug session through the IDE's own debugger interface, the same one you use when you press F9 to start the program, F7 to step into a call, F8 to step over a statement, and F4 to run to the cursor.

The architecture behind this involves two components. A small IDE package registers itself as a TCP server inside the running Delphi IDE, listening for commands from outside. The debugger module in the backend server acts as a TCP client to that package. When the agent calls a debugger tool, the backend relays the command through the TCP connection to the IDE plugin, which executes it against the live Delphi debugger API — the same IOTADebuggerServices interface that the IDE itself uses internally. The response travels back the same way.

The agent has access to the full range of debugger operations: open a project, set a breakpoint at a specific file and line, optionally with a condition or pass count, start the debugger, wait for execution to stop, read the value of any variable in scope, step over or step into the current statement, continue execution, and stop the debugger. It can also list all active IDE instances — if you have Delphi 2007 and Delphi XE open at the same time, the agent can discover both and choose which one to target. The IDE plugin supports both versions, because the underlying debugger interface has been available since Delphi 2007.

One scenario that illustrates the multi-IDE capability particularly well is cross-version debugging. Imagine a legacy Delphi 2007 executable that calls into a modern Delphi 13 DLL. Both are running in their respective IDEs simultaneously. The agent calls the session discovery tool, receives a list of both active IDE instances with their version identifiers, and can attach to either one independently. It sets a breakpoint in the Delphi 2007 EXE at the point where it calls into the DLL, steps into that call, and then switches its attention to the Delphi 13 IDE session to inspect the state inside the DLL. Following a call across a version boundary, something that is genuinely awkward to do manually, becomes a routine operation. The agent does not care that the two binaries were built with compilers released nearly two decades apart; it just follows the execution.

To make this concrete with a simpler example, we verified the full flow end-to-end. The agent opened a project, set a breakpoint with a condition that fires only when a specific variable has a specific value, started the debugger, waited for the breakpoint to be hit, read the variable value to confirm the condition, stepped over a statement, read the variable again to verify the updated value, and then stopped the session. Every step of that sequence is something a developer does manually today. The agent did it without being told which file to open or where the bug might be — it reasoned about the code via the AST module, formed a hypothesis, and verified it through the debugger.

Multi-Agent Coordination: The Intent Lock Protocol

When multiple agent sessions work on the same project simultaneously — one writing code, another running analysis, a third tracking tasks — they need a way to coordinate without stepping on each other. The backend provides a cooperative locking mechanism called the intent lock. An agent declares its intent for a resource (typically a project working directory), acquires the lock, does its work, and releases it. Other agents watching the same resource are notified when the lock state changes. If an agent crashes or its session ends abnormally, the lock expires automatically after a configurable timeout, so the resource never stays blocked indefinitely. This is agent-to-agent coordination at the protocol level, without any external orchestration tool.
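The acquire, release, and expiry behavior can be sketched as a small lock registry. This Python example is an illustration under assumptions: the class name, method names, and session identifiers are hypothetical, and the notification of watching agents is left out for brevity.

```python
import time

class IntentLocks:
    """Cooperative locks that expire if the holder never releases them."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._locks = {}  # resource -> (session_id, intent, expires_at)

    def acquire(self, resource: str, session_id: str, intent: str) -> bool:
        now = self._clock()
        holder = self._locks.get(resource)
        if holder and holder[0] != session_id and holder[2] > now:
            return False  # held by another session and not yet expired
        self._locks[resource] = (session_id, intent, now + self._timeout)
        return True

    def release(self, resource: str, session_id: str) -> bool:
        holder = self._locks.get(resource)
        if holder and holder[0] == session_id:
            del self._locks[resource]
            return True
        return False

# A fake clock makes the expiry deterministic for the demonstration.
now = [0.0]
locks = IntentLocks(timeout_seconds=30, clock=lambda: now[0])
first = locks.acquire("C:/projects/demo", "session-a", "refactor")
second = locks.acquire("C:/projects/demo", "session-b", "analysis")
now[0] = 31.0  # session-a crashed and never released
third = locks.acquire("C:/projects/demo", "session-b", "analysis")
```

The timeout is the piece that keeps a crashed session from blocking the resource: once the expiry time has passed, the next caller simply takes over the lock.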

Tool Names Optimized Per Model

One of the less obvious findings from our benchmarking work — described in the earlier posts in this series — is that different AI models respond differently to the same tool names and descriptions. A name that is intuitive to a large frontier model may be ambiguous to a smaller local model. A description optimized for Claude reads differently to an Ollama-hosted model. This is not a hypothetical concern: in our benchmark across 198 tools, we found 75 meaningful differences between what Opus and Sonnet considered the clearest way to name and describe the same tool.

The system handles this through a model identity mechanism. When a session starts, the agent identifies itself with its model name. From that point, every tool list request is answered with names and descriptions tuned for that specific model. The translations are defined inside each module DLL — the module author knows their tools best and maintains the per-model variants alongside the rest of the module code. The backend server passes the model identity through to the module and delivers whatever the module returns. No central mapping file, no configuration outside the codebase.
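A per-model lookup with a default fallback is one plausible shape for this mechanism. In the Python sketch below, the tool names, descriptions, and model names are all invented for illustration; only the idea of keeping the variants next to the tool definition comes from the text above.

```python
# Hypothetical per-model variants, kept alongside the tool definition itself.
TOOL_VARIANTS = {
    "replace_node": {
        "default": {
            "name": "ast_replace_node",
            "description": "Replace one syntax tree node, addressed by its identifier.",
        },
        "small-local-model": {
            "name": "replace_code_block_by_id",
            "description": "Swap a block of Delphi code for new code. "
                           "Pass the block id from the tree and the new source text.",
        },
    },
}

def tool_list(model_name: str) -> list:
    """Return declarations tuned for model_name, falling back to the defaults."""
    return [variants.get(model_name, variants["default"])
            for variants in TOOL_VARIANTS.values()]

frontier_view = tool_list("some-frontier-model")
local_view = tool_list("small-local-model")
```

A model the module has never heard of still gets a complete tool list, because the default variant is always present.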

The tool list itself is also structured differently from a flat list of tools. Tools are organized into groups. On first connection, the agent receives only the group overview — the categories of available functionality, not every individual tool. It can load a group when it needs those tools, and within a group, it can request either a minimal set of the most essential tools or the full extended set. This keeps the context window impact of tool discovery proportional to what the agent is actually doing.

Keeping the Context Window Free

A system with nearly 200 tools across a dozen modules could easily overwhelm an agent's context window before it has written a single line of code. The proxy architecture solves this through lazy loading. When a session starts, the agent receives only a compact group overview: the categories of available functionality, not the individual tools within them. The agent loads a group only when it actually needs those tools. Within a group, it can choose between a minimal set covering the most common operations and a full extended set. Context window impact scales with what the agent is doing, not with the total size of the toolchain.
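The two-step discovery (overview first, then one group at a minimal or extended level) might look like the following Python sketch. The catalog contents and tool names are placeholders, not the actual tool list.

```python
# Hypothetical catalog: two groups, each with a minimal and an extended tool set.
CATALOG = {
    "documents": {
        "summary": "Create and edit Word documents",
        "minimal": ["doc_open", "doc_replace_text"],
        "extended": ["doc_insert_table", "doc_set_header"],
    },
    "debugger": {
        "summary": "Control a live debug session",
        "minimal": ["dbg_set_breakpoint", "dbg_read_variable"],
        "extended": ["dbg_list_ide_instances"],
    },
}

def group_overview() -> dict:
    """What the agent sees at session start: one line per group, no tools."""
    return {name: group["summary"] for name, group in CATALOG.items()}

def load_group(name: str, extended: bool = False) -> list:
    """Tools of a single group; the extended set is added only on request."""
    group = CATALOG[name]
    return group["minimal"] + (group["extended"] if extended else [])

overview = group_overview()
doc_tools = load_group("documents")
all_doc_tools = load_group("documents", extended=True)
```

An agent that only calls `load_group("documents")` never receives a single debugger declaration, which is the property the paragraph above relies on.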

A practical example: an agent that only needs to edit a Word document loads the document tools and nothing else. It has no knowledge of the compiler, the debugger, or the Jira integration — and it does not need to. Those modules exist on the server, ready to be called, but they take up no space in the context window until they are needed. This design principle — expose what is necessary, hide what is not — is what makes it feasible to have a rich, deep toolchain without paying a constant context window tax on every session.

The same principle applies at the individual tool level. Every tool in the system has an explain function — a built-in mechanism that returns a detailed description of exactly how to call that tool, including concrete examples. Instead of front-loading the agent with exhaustive documentation at startup, the agent can query the explanation for any tool on demand, precisely when it is about to use it. This keeps the tool descriptions accurate, contextual, and out of the way until they are actually needed.

Local Models as First-Class Citizens

Cloud AI is excellent for interactive work: it is fast, capable, and handles complex reasoning well. But it is not free, it is not private, and it is not well-suited for batch processing thousands of items. For teams where no source code, business logic, or project content may leave the local network, local models running on your own hardware are the right answer. There is no external API call, nothing logged on a third-party server, and no dependency on an internet connection during inference. The same holds for bulk work such as populating a code knowledge base with summaries and analysis of 1,600 Delphi units: there is no per-token cost, and the job can run overnight while you sleep.

The Ollama integration module treats local inference as a proper backend service, not an afterthought. It connects to a local Ollama instance, maintains an asynchronous job queue for batch work, and includes a GPU guard that monitors actual GPU utilization before starting any inference. If the GPU is busy — because you decided to play a game or run another process — new batch jobs wait rather than starving your system. When GPU headroom becomes available, the queue resumes automatically. A separate model unload tool frees VRAM on demand when you need it for something else.
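The GPU guard boils down to checking a utilization reading before each job is started. The Python sketch below shows that control flow with an injected reading function; the class name, the threshold value, and the job strings are assumptions, and a real implementation would query the GPU driver and call Ollama instead of appending to a list.

```python
from collections import deque

class GuardedJobQueue:
    """Runs queued jobs only while GPU utilization is below a threshold."""

    def __init__(self, read_gpu_percent, threshold_percent, run_job):
        self._read_gpu_percent = read_gpu_percent  # callable returning 0..100
        self._threshold = threshold_percent
        self._run_job = run_job
        self._pending = deque()

    def submit(self, job) -> None:
        self._pending.append(job)

    def tick(self) -> int:
        """Called periodically; returns how many jobs were started."""
        ran = 0
        while self._pending and self._read_gpu_percent() < self._threshold:
            self._run_job(self._pending.popleft())
            ran += 1
        return ran

gpu = {"percent": 90}  # stand-in for a real utilization reading
results = []
jobs = GuardedJobQueue(lambda: gpu["percent"], threshold_percent=50,
                       run_job=results.append)
jobs.submit("summarize Unit1.pas")
jobs.submit("summarize Unit2.pas")
ran_while_busy = jobs.tick()  # GPU busy: nothing starts
gpu["percent"] = 10
ran_when_idle = jobs.tick()   # headroom available: queue drains
```

Re-reading the utilization before every job, rather than once per tick, means a batch stops as soon as something else claims the GPU.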

The split between interactive and batch inference is clean: the AI agent uses cloud models for real-time reasoning and tool calls, while the Ollama module handles background batch jobs asynchronously. The agent submits a job and moves on. The backend processes it when resources allow and stores the result. No blocking, no wasted context window, no timeout errors on long-running generations. The entire configuration — which model, which endpoint, GPU threshold, retry interval — lives in a single INI file section and can be changed without recompiling anything.

Remote Deployment and Direct Database Access

The proxy and the backend server do not have to run on the same machine. A relay interface module transparently forwards TCP connections to a backend server running on a different host — a more powerful build server on the local network, a team server accessible to multiple developers, or a cloud VM. From the agent's perspective, nothing changes; it connects to the same proxy and gets the same tool interface. The relay handles the routing. TLS support is built in for deployments where the connection crosses a network boundary that requires encryption.

Database access follows the same injection model as everything else in the system. A MySQL connector module is loaded once by the server, and its interface is injected into any other module that declares a need for it. The modules that use the database — the planning system, the blog publisher, the code knowledge base, the feature request tracker — all receive the same shared connection without knowing or caring how it was established. Schema management is self-contained: each module creates its own tables if they do not exist, using a consistent naming convention. There is no separate migration tool, no schema file to manually apply. Deploy the module, start the server, and the tables appear.
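Self-contained schema management usually comes down to idempotent `CREATE TABLE IF NOT EXISTS` statements run at module start. The Python sketch below uses the standard library's SQLite driver as a stand-in for MySQL so that it runs anywhere; the table names, columns, and the prefix-based naming convention are assumptions for illustration.

```python
import sqlite3

def ensure_schema(conn: sqlite3.Connection, prefix: str) -> None:
    """Create this module's tables if they are missing; safe to call repeatedly.

    The prefix is a fixed, module-defined constant, never user input.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {prefix}_tasks ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'open')"
    )
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {prefix}_events ("
        "id INTEGER PRIMARY KEY, task_id INTEGER NOT NULL, note TEXT)"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # shared connection handed to the module
ensure_schema(conn, "plan")         # first server start: tables are created
ensure_schema(conn, "plan")         # later starts: nothing to do, no error
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

Because the statements are idempotent, the module does not need to know whether it is running for the first time, which is what removes the separate migration step.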

Closing the Loop: Jira, Outlook, and Word

Software development does not happen in a code editor in isolation. There are tickets that need to be read and updated. There are emails from stakeholders that describe requirements or report problems. There are Word documents with specifications, release notes, and design decisions that need to stay in sync with the code. Every time these things live in separate tools that the AI cannot reach, the developer becomes the manual bridge: copying a ticket description into the chat, pasting the AI's output back into the document, summarizing the email by hand. The workflow integration modules eliminate that bridge.

The Jira module gives the agent direct access to your issue tracker. It can read tickets, analyze their content, assess scope and risk, create new issues, update status, and link related items — all without leaving the coding session. When the agent finishes implementing a feature, it can close the ticket that requested it. When it encounters a problem that should be tracked separately, it can open a new one.

The Outlook module integrates email into the same workflow. The agent can read incoming messages, understand their context in relation to the current project, compose replies, manage folders, and handle attachments. For developers who receive bug reports or requirements by email — which is most of us — this means the agent can act on that information directly rather than waiting for a human to relay it.

The Word module — the one used to write this very document — gives the agent structured access to Word files. It can create documents, add and edit paragraphs with full formatting control, insert tables, manage headers and footers, replace specific text ranges, and work with the document's paragraph structure by stable identifier rather than by line number. Specifications, release notes, API documentation, design decisions: anything that lives in a Word document becomes part of the same connected workflow. When code changes, documentation can change with it, in the same session, without a copy-paste step in between.

The combined workflow looks like this: an agent receives a Jira ticket describing a reported bug. It reads the ticket, queries the AST module to find the relevant code path, forms a hypothesis about the cause, sets a conditional breakpoint via the debugger module, steps through execution to confirm, makes the fix via the AST mutation interface, compiles and verifies, updates the Jira ticket as resolved, and appends a note to the release notes document in Word. That is an end-to-end development cycle, from bug report to resolved ticket and updated documentation, with the developer reviewing and approving, but not manually executing any of the individual steps.

What This Actually Enables

There is a pipeline running in this system right now that requires no human interaction to operate. A scheduler fires at a configured time. It triggers an orchestrator module, which picks up pending analysis jobs. The orchestrator dispatches those jobs to the Ollama inference module, which processes them against the local model and stores the results in the code knowledge database. The database grows. The next time the agent needs to understand a part of the codebase, the answers are already there. This does not involve a prompt. It does not require a developer to be at their desk. It is software doing what software should do: running reliably in the background, building something useful.

There is also a feedback channel built in at the tool level. When an agent encounters a situation where the right tool does not exist — the capability it needs is not exposed by any current module — it can file a feature request directly through a built-in mechanism. The request is stored in the database, categorized automatically, and appears in the planning system, where a developer will see it. The agent is not just using the toolchain; it is actively contributing to its own improvement.

None of this is achievable with prompts alone, no matter how carefully crafted. Prompts are ephemeral. They vanish when the context window clears. They cannot set a breakpoint. They cannot remember last week's architectural decision. They cannot trigger at 3 AM when no one is watching. The real shift in capability comes not from smarter prompts but from treating AI as a component in a proper software architecture — one with persistent state, typed interfaces, modularity, lifecycle management, and integration with the real tools that development actually depends on.

Getting Started

If you are new to AI-assisted Delphi development and want to understand the foundations, the standalone MCP server versions on our website are the right starting point. Each one is a self-contained server covering a specific area of functionality. Their source code is available for purchase, designed to be readable and instructive — a solid foundation for understanding how MCP servers work in a Delphi context and building your own.

The enterprise backend server with the full DLL module ecosystem — the system described in this post — will be available for purchase soon. If you are interested, feel free to get in touch via the contact form on this blog — I am happy to answer questions and discuss what fits your situation.

The benchmark series that started this blog explored which AI models perform best on Delphi tasks. This post describes the infrastructure that puts those findings to work. The next step — a code knowledge base that lets any agent navigate a million-line legacy codebase without reading a single file — is already in progress. We will write about that too, when it is ready.

Friday, June 5, 2026

Local LLMs for Delphi: A Production Benchmark — Part 3: What to Actually Use

This is the final post in a three-part series. Part 1 covered benchmark design and methodology. Part 2 covered what the numbers revealed. This post covers what you should actually do with those results.

After running 8 local models through 5 benchmark phases on 30 real Delphi production units, the most useful thing I can offer is not another table of scores — it is a set of concrete decisions. If you are planning to integrate local LLMs into any Delphi pipeline — migration, code review, documentation, or IDE integration — this post tells you which model to use for each job, which two to skip entirely, and where the remaining rough edges are.

Pick the Right Model for the Right Job

The clearest finding from this benchmark is that no single model dominates across all five task types. The right approach is task routing — matching each class of work to the model best suited for it.

Code Analysis / Risk Discovery (AT1)

Use gemma4:26b (score 0.96) for targeted fact extraction. Use qwen3.5:27b (0.86) or qwen3.6:27b (0.82) when you need coherent functional explanations.

Code Comprehension and Q&A (AT2)

Use qwen3.6:27b (score 0.70) or qwen3.5:27b (0.69). Both score significantly above the field.

Patch Generation / Code Writing (AT3)

Use gemma4:26b (score 0.88, 170 tok/s) as your primary patcher, and add a format validation step to catch the 70% of responses that are not format-compliant. Use qwen3.5:27b (0.77) as a fallback when compliance matters more than speed.

Routing / Complexity Classification (AT4)

Use qwen3.6:35b-a3b (score 1.00, 131 tok/s). Perfect routing accuracy combined with MoE speed. Avoid qwen3-coder:30b (0.71) — inconsistent classification defeats the router.

Tool Calling / IDE Integration (AT5)

Only: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b, gemma4:26b. Hard disqualified: qwen2.5-coder:14b, qwen3-coder:30b, deepseek-r1:8b — API-level failure, not fixable through prompting.
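The recommendations above can be collected into a small routing table. The Python sketch below uses the model names and task codes from this post; the table structure, the `pick_model` function, and the fallback for a model that is currently unavailable are illustrative assumptions, not part of the benchmark.

```python
# Ordered candidates per task type, taken from the recommendations above.
ROUTES = {
    "AT1": ["gemma4:26b", "qwen3.5:27b", "qwen3.6:27b"],  # analysis
    "AT2": ["qwen3.6:27b", "qwen3.5:27b"],                # comprehension
    "AT3": ["gemma4:26b", "qwen3.5:27b"],                 # patching
    "AT4": ["qwen3.6:35b-a3b"],                           # routing
    "AT5": ["qwen3.6:35b-a3b", "qwen3.5:27b", "qwen3.6:27b",
            "gemma4:26b", "devstral"],                    # tool calling
}

# Models that failed tool calling at the API level.
TOOL_CALL_DISQUALIFIED = {"qwen2.5-coder:14b", "qwen3-coder:30b", "deepseek-r1:8b"}

def pick_model(task_type: str, unavailable=frozenset()) -> str:
    """Return the first recommended model for the task that is available."""
    for model in ROUTES[task_type]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for task type {task_type}")

patcher = pick_model("AT3")
fallback_patcher = pick_model("AT3", unavailable={"gemma4:26b"})
router = pick_model("AT4")
```

Keeping the table as data rather than branching logic makes it easy to revise when a new benchmark run changes the ranking.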

The Routing Pipeline Architecture

The single most impactful design decision is the batching strategy. GPU model loading takes 30–60 seconds per swap. Batch by tier:

Incoming task
        |
   Router model (qwen3.6:35b-a3b, fast)
        |
   +--------------------+--------------------+
   |                    |                    |
 local              mid-tier              complex
gemma4:26b        qwen3.5:27b          qwen3.6:27b
(fast, high vol) (reliable, balanced)  (best understanding)

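For a batch of 50 units across three complexity tiers, batch-by-tier scheduling needs only three model loads. Compared with an interleaved run that swaps models on nearly every unit, that can eliminate 45+ minutes of pure idle time at 60 seconds per swap.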
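The idle-time estimate is simple arithmetic, shown here as a short Python sketch. It assumes the worst case for the unbatched run, namely that the model changes before every one of the 50 units; real interleaved workloads would fall somewhere below that bound.

```python
def idle_minutes(model_swaps: int, seconds_per_swap: int) -> float:
    """Time spent purely on loading models, in minutes."""
    return model_swaps * seconds_per_swap / 60

UNITS, TIERS = 50, 3
interleaved_swaps = UNITS  # assumption: worst case, a swap before every unit
batched_swaps = TIERS      # batch by tier: one load per tier

avoided = interleaved_swaps - batched_swaps
saved_at_30s = idle_minutes(avoided, 30)  # lower end of the 30-60 s range
saved_at_60s = idle_minutes(avoided, 60)  # upper end of the 30-60 s range
```

Under these assumptions the saving ranges from 23.5 minutes at 30 seconds per swap to 47 minutes at 60 seconds per swap.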

The Context Window Problem: Why a Proxy Layer Is Not Optional

Model selection and batching strategy are the two decisions this benchmark directly informs. But there is a third decision that matters at least as much for local deployments, and it has nothing to do with which model scores highest on any phase.

If you are running a real development pipeline, you are not running one MCP server. You are running several. A realistic Delphi development setup includes a source analysis server (DelphiMCP: ~40 tools), a document server (WordMCP: ~40 tools), a code indexer (PasIndexer: ~15 tools), a file editor (StrEditor: ~15 tools), and a handful of supporting services. That is well over 150 tool declarations present in context before a single line of code is analyzed.

Each tool declaration (name, parameter schema, description) costs between 150 and 300 tokens. At 150 tools, you are looking at 22,500 to 45,000 tokens of overhead at the start of every session. For a local model running with a 32k context window, that is between 70% and 140% of the available context consumed before the first user message arrives.
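The overhead figures follow directly from the per-tool token range. The Python sketch below reproduces the calculation, treating "32k" as 32,000 tokens, which is an assumption about how the window is counted.

```python
TOOLS = 150
TOKENS_PER_TOOL_LOW, TOKENS_PER_TOOL_HIGH = 150, 300
CONTEXT_WINDOW = 32_000  # assumption: "32k" taken as 32,000 tokens

overhead_low = TOOLS * TOKENS_PER_TOOL_LOW    # 22,500 tokens
overhead_high = TOOLS * TOKENS_PER_TOOL_HIGH  # 45,000 tokens

share_low = overhead_low / CONTEXT_WINDOW     # about 0.70 of the window
share_high = overhead_high / CONTEXT_WINDOW   # about 1.41, more than the window
```

A share above 1.0 means the tool declarations alone do not fit, before any source code or user message is added.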

VT5 makes this worse, not better. The benchmark showed that local models score only 1–34% on Phase 2 of the tool-name test — they cannot reliably interpret a tool's purpose from its name alone. They depend on the description field in the schema. Compressing tool declarations to save context is not an option; the descriptions are load-bearing for local models.

ProxyMCP addresses this directly. As a single MCP endpoint, it does not expose all tools from all servers to the model. It exposes only the tools relevant to the current task — typically 3 to 8 — and routes the call to the appropriate backend server. From the model's perspective, the tool surface is minimal and always task-scoped. From the pipeline's perspective, every server is still fully available.

The practical effect: a local model operating through ProxyMCP sees a context overhead of roughly 500–1,500 tokens for tool declarations rather than 22,500–45,000. That difference is the difference between a model that has room to reason and a model that is fighting for context from the first token. For cloud models, the same architecture translates directly into cost savings: fewer tokens declared means fewer tokens billed, on every single request.

Decision Table (32 GB VRAM Setup)

Use case	Recommended model	Speed	Notes
High-volume patching	gemma4:26b	170 tok/s	Add format validation layer
Code understanding	qwen3.6:27b	28 tok/s	Best comprehension overall
Routing decisions	qwen3.6:35b-a3b	131 tok/s	Perfect routing + fast
Balanced all-rounder	qwen3.5:27b	29 tok/s	Strong across all phases
Tool/IDE integration	qwen3.6:35b-a3b	131 tok/s	Best speed + tool support

What About Cloud?

I ran the full five-phase evaluation on Claude Sonnet 4.6 via the Anthropic API for comparison. Where does cloud actually outperform local?

Phase	Best local	Sonnet 4.6	Delta
AT1 Code Detective	0.96 (gemma4)	0.983	+0.02 ≈ 0
AT2 Comprehension	0.70 (qwen3.6:27b)	0.983	+0.28 ← gap
AT3 Patches	0.88 (gemma4)	0.955	+0.08
AT4 Routing	1.00 (multiple)	0.994	≈ 0
AT5 Tool Calling	1.00 (multiple)	1.00	0

Practical conclusion: Local models are at parity with cloud on extraction, routing, and tool calling. The 28-point comprehension gap (AT2) is the only argument for selective cloud use. For teams constrained by GDPR (DSGVO), local-only remains viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.

The Realistic Assessment

Local LLMs are genuinely useful for Delphi development — as a force multiplier for analysis and mechanical transformation phases. What works: breaking work into distinct task types, batching by model, adding format validation, using comprehension models for risk analysis first, and treating tool calling as a hard capability requirement.

The models that fit this architecture on a 32 GB setup — qwen3.6:27b for understanding, gemma4:26b for patching, qwen3.5:27b as the balanced mid-tier, qwen3.6:35b-a3b for routing and tool calls — are capable enough to make local-only, GPU-resident Delphi LLM pipelines a practical option today.
