Thursday, May 28, 2026

Local LLMs for Delphi: A Production Benchmark — Part 2: What the Numbers Reveal

Part 1 described the setup: eight local models, five benchmark phases (AT1 through AT5), 30 real production Delphi units. The test domain was legacy code migration — but the phases themselves cover tasks that appear in any Delphi LLM pipeline: code analysis, comprehension, patch generation, routing, and tool calling. This post covers what the numbers showed.

The Scoreboard

Local models (Ollama, 32 GB VRAM):

ModelAT1AT2AT3AT4AT5Overall
gemma4:26b0.960.620.880.840.990.82
qwen3.5:27b0.860.690.770.851.000.79
qwen3.6:27b0.820.700.700.811.000.76
qwen3.6:35b-a3b0.790.640.700.841.000.74
devstral-small-2:24b0.700.580.710.801.000.70
qwen2.5-coder:14b0.680.500.690.79n/a0.66

Cloud baseline (Anthropic API):

ModelAT1AT2AT3AT4AT5Overall
Claude Sonnet 4.6 ☁0.9830.9830.9550.9941.000.983

Finding 1: Comprehension and Patch Quality Are Not Correlated


gemma4:26b
has the lowest AT2 score (0.62) and the highest AT3 score (0.88). A 26-point gap across two phases that test adjacent capabilities is not noise. qwen3.6:27b is the opposite: highest AT2 (0.70), mid-table AT3 (0.70). The two skills are measuring something genuinely different — not just two views of the same underlying "code understanding."


Finding 2: AT3 Failure Modes

Format refusal (gemma4:26b)

70% of gemma4's AT3 responses returned as natural-language prose — no JSON block. The 30% that did comply scored 0.88 on content — the best in the benchmark. This is an instruction-following problem, not a capability problem. Fixable with better prompting.

Systematic token corruption (qwen2.5-coder:14b)

Every generated code line was prefixed with L<n>: — e.g. L29: fLogEnabled : Boolean;. Not valid Pascal. Consistent across all 30 units. Content often correct, output systematically malformed.

Structural misplacement (qwen3-coder:30b)

Asked to add a method, the model inserted code into the Initialization block — runs once at startup instead of being callable. Compiles, passes syntax check, fails at runtime. Would not be caught without unit tests.


Finding 3: Speed — The MoE Advantage

Modeltok/sArchitecture
gemma4:26b~170Dense
qwen3.6:35b-a3b~131MoE
qwen2.5-coder:14b~120Dense, 14b
qwen3.5:27b / qwen3.6:27b~28–29Dense, 27b

For a batch run over 30 units, 170 tok/s vs 28 tok/s is measured in hours of wall clock time.


Finding 4: Tool Calling — A Binary Split

Models that returned proper tool_calls objects: devstral, qwen3.5:27b, qwen3.6:27b, qwen3.6:35b-a3b (1.00 each), gemma4:26b (0.98).
Hard fail at the API level: qwen2.5-coder:14b, qwen3-coder:30b (text output only), deepseek-r1:8b (Ollama error: "does not support tools").

This is not a prompting problem — it is an API-level capability. A model either returns tool_calls objects in the response or it does not. For any pipeline that relies on IDE or MCP integration, this is a hard filter applied before any other consideration.


Finding 5: German vs English Prompts on AT1

ModelAT1 ENAT1 DEDelta
gemma4:26b0.960.83-0.13
qwen3.5:27b0.860.78-0.05
qwen2.5-coder:14b0.680.39-0.31

Every model scores lower in German. The losses range from 5 to 31 points. Recommendation: Use English for any prompt that includes source code or asks the model to name specific identifiers. Use German for everything that faces the developer.


Finding 6: Cloud vs Local — Where the Gap Actually Is


PhaseBest localSonnet 4.6Delta
AT1 Code Detective0.96 (gemma4)0.983+0.02 ≈ 0
AT2 Comprehension0.700.983+0.28 ← main gap
AT3 Patches0.880.955+0.08
AT4 Routing1.000.994≈ 0
AT5 Tool Calling1.001.000

Local models are essentially on par with Sonnet in code extraction, routing, and tool calling. The comprehension gap (28 points on AT2) is where cloud-scale models are meaningfully better. For GDPR-constrained teams, local-only is viable. For hybrid architectures, route comprehension to Sonnet and keep everything else local.


Finding 7: AT4 Routing — Not All Models Are Reliable Routers

Routing accuracy is the capability that matters most in a multi-model pipeline: a misclassified task gets sent to the wrong model, which either wastes compute on an oversized model or produces a failure on an undersized one. The AT4 results show a meaningful spread across the field.

The top scorers — qwen3.5:27b (0.85), gemma4:26b (0.84), and qwen3.6:35b-a3b (0.84) — differ significantly from the problem case: qwen3-coder:30b at 0.71. In a pipeline of 90 routing decisions, a 0.71 accuracy means roughly 26 tasks misclassified. That is not recoverable through prompt engineering — classification drift at that scale corrupts the batching architecture that the pipeline depends on.

Speed is the second dimension here. Routing sits on the critical path of every task — every request passes through the router before any work begins. A model that routes accurately but slowly adds latency multiplied across the entire batch. qwen3.6:35b-a3b combines solid routing accuracy with 131 tok/s, making it the only model that is both reliable and fast enough to use as a production router.


VT Pre-Test Findings

Before the five AT phases, every model ran through seven validation tests (VT1–VT7) covering JSON compliance, instruction following, output consistency, basic Delphi understanding, and native tool-call support. All eight models passed every pre-test — this confirmed that the AT results reflect genuine capability differences, not basic reliability failures.

Two VT findings produced results specific enough to be worth documenting alongside the main benchmark.

VT6: Migration Risk Detection

VT6 tested whether models could identify concrete migration traps across three categories of Delphi code: ShortString serialization risks, binary-packed record layouts, and call-chain dependencies. The chart shows significant variation between models, even where their AT1–AT5 scores are similar — VT6 captures a targeted detection capability that the main phases only partially measure.

VT8: Token Budget and Thinking Mode

VT8 exposed a practical deployment trap: when a 300-token output budget is set, and thinking mode is active, all three tested Qwen models consume the entire budget on internal reasoning and return an empty response. The correct answer exists — the model knows it — but there is no token budget left to output it. Setting num_predict ≥ 800 for any thinking-enabled model, the issue is completely.

What the Numbers Add Up To

Seven findings point to a consistent picture. Comprehension and patch generation are genuinely different skills — the best model at one is not the best at the other. Speed varies by a factor of six between the fastest and slowest capable models. Tool calling eliminates three models outright at the API level. Language choice in prompts can cost 31 percentage points on code extraction tasks. Routing accuracy is not guaranteed — one model in the tested pool is too unreliable to use as a router. And the cloud comprehension gap, while real at 28 points, is the only phase where local models fall meaningfully short.

None of these findings argues for a single best model. They argue for a pipeline that routes different task types to different models — which is exactly what Part 3 covers.


Part 3 covers the practical takeaways: routing architecture, model selection by task, and the hybrid pipeline decision table.



Friday, May 22, 2026

Local LLMs for Delphi: A Production Benchmark — Part 1: Design and Methodology

Let me save you some time upfront: this is not a post about prompting ChatGPT to help with Delphi. This is about running local, on-premise LLMs — models that never send your source code to any server — through a structured, five-phase benchmark built from 30 real production files.

The test domain is a large legacy codebase undergoing Unicode migration — demanding enough to expose real capability gaps across code analysis, patch generation, routing, and tool calling. The findings apply broadly: if you are integrating local LLMs into any Delphi workflow, this benchmark tells you which models handle which tasks, which ones fail at the API level, and where the architecture decisions actually matter.


The Problem Is Bigger Than It Looks

The codebase in question is large. The production compile base is Delphi 2007, partially migrated to a modern Delphi version. What does legacy here actually mean? ShortStrings used for binary serialization. Records with fixed memory layouts that must not change because data was written to disk in that exact layout twenty years ago. No MVC, no MVVM — a custom control hierarchy that grew organically before those terms meant anything in the Delphi world. The kind of codebase where a seemingly innocent String vs ShortString swap can corrupt a file format three layers down.

A cloud AI service is not an option. The code is proprietary, the clients are sensitive, and the answer to "does your code leave the building?" must always be no. So the question became: can local LLMs running on our own hardware actually help, or are we on our own?

The Hardware Reality

One GPU with 32 GB VRAM. An Ollama server reachable at a local network address. We tested eight models:

ModelParametersArchitecture
qwen2.5-coder:14b14bDense
devstral-small-2:24b24bDense
qwen3-coder:30b30bDense
qwen3.5:27b27bDense
qwen3.6:27b27bDense
qwen3.6:35b-a3b35b (MoE)Mixture-of-Experts
gemma4:26b26bDense
deepseek-r1:8b8bDense

Models That Didn't Make the Cut

Six models failed pre-screening and never entered the main evaluation:

ModelReason for rejection
granite4:32b-a9b-hOnly 16% GPU utilization — 84% CPU offloading. Unusably slow.
nemotron-3-nano:30b-a3bZero valid JSON responses across 120 validation attempts.
llama4:scoutFailed instruction following and basic code output tests.
qwen3-coder-next:latestFailed instruction following and basic code output tests.
deepseek-r1:32bFailed validation tests.
deepseek-r1:14bFailed validation tests.

Why a Five-Phase Benchmark?


What would a real Delphi LLM assistant actually need to do? Whether the task is migration, refactoring, documentation, or IDE integration, the core capability requirements are the same:
  • Find hidden problems before you touch anything
  • Understand what a unit actually does, not just what it says it does
  • Write correct Delphi code, not plausible-looking pseudocode
  • Know when a task is trivial and when it needs a more capable model
  • Call tools — interacting with an IDE or AST system rather than outputting text

Those five capabilities became five test phases, applied to 30 carefully selected production units.


The Production Context

This benchmark was not designed in isolation. It was built to answer a concrete engineering question: which local models can reliably handle which classes of work in a production-grade, on-premise AI pipeline?

That pipeline consists of two components developed in parallel with this benchmark work.

ProxyMCP is a lightweight routing layer that sits between IDE tooling and the model backend. It implements the Model Context Protocol, accepts tool calls from development environments, and forwards them — without transforming the payload — to the backend layer that handles execution.

Enterprise Server is the orchestration layer behind ProxyMCP. It handles model routing, task classification, session management, and audit logging across multiple users and projects. It is designed for on-premise deployment — no data leaves the local network — and built to the requirements of teams that need traceable, reproducible AI-assisted workflows with defined service levels.

The routing architecture described in Part 3 of this series is not a theoretical recommendation. It is the architecture these components implement. The benchmark determined which models fill which roles.


Phase AT1: Code Detective Work

Each of the 30 units contains one non-obvious fact — something you can only find by genuinely reading the code. Example:

Unit: AsyncSettings.pas
Question: Which fields of TAsyncSettings are shared across every instance rather than per-instance, making the class effectively a singleton without the type system saying so?
Required: Name the specific identifiers.

Phase AT2: Comprehension (120 Tasks)

Each unit generates four tasks: one unit-level summary and three focused questions. Scored by Claude Opus 4.7 as judge using HIT/PARTIAL/MISS grading.

Phase AT3: Patch Generation

Each unit has one realistic migration task. The model must output a structured JSON patch — insert after line N, replace lines N through M, delete lines N through M. Maximum three operations:

{
  "operations": [
    {
      "op": "insert_after_line",
      "line": 35,
      "content": "      Class procedure WaitForCompletion(aTimeoutMs : Cardinal);"
    }
  ]
}

Phase AT4: Routing (90 Decisions)

30 units × 3 tasks each. Classify as local (trivial), mid (moderate), or top (deep understanding required). A model that consistently over- or under-classifies is useless as a router.

Phase AT5: Tool Calling (73 Tasks)

Can these models use Ollama's native tool_calls API? This is not about prompting a model to output JSON — it is about whether the model returns actual tool_calls objects in the API response. Several otherwise capable models fail here completely.


The Judge and the Baseline: Two Roles for Claude Models

This benchmark uses Claude models from Anthropic in two completely separate roles — and it is worth being explicit about this, because the distinction matters for how you read the results.

ModelRoleWhat it does
Claude Opus 4.7Judge & Gold Standard AuthorReads each of the 30 source files independently, writes the gold standard answers, then scores every model response as HIT / PARTIAL / MISS. Does not participate as a candidate.
Claude Sonnet 4.6Cloud Baseline CandidateRuns the identical five-phase benchmark as the local models — same questions, same tasks, same scoring. Acts as a reference point: how much better does a capable cloud model actually score?

Opus is the most capable model in Anthropic's lineup. Using it as the judge — rather than a fixed rubric or human annotators — means the gold standard is set by the model with the deepest understanding of the source material. Sonnet then competes against local models on the same playing field, graded by the same judge.

Is it a conflict of interest that the judge and one of the candidates come from the same company? It is worth noting. In practice, Opus's judgments on Delphi code analysis were consistent with what a senior developer would recognize as correct — the gold standard answers were not padded to favor any particular model family. And the local models' scores speak for themselves: gemma4:26b reaches 96% on AT1, within two points of Sonnet. A biased judge would not produce numbers like that.


A Note on the Cloud Baseline

After completing the local evaluation, we ran the same five-phase benchmark on Claude Sonnet 4.6 via the Anthropic API — not to re-open the DSGVO argument, but to establish a reference point. The short version: local models are essentially at parity with Sonnet on code extraction, routing, and tool calling. The comprehension gap (AT2) is real and significant — 28 percentage points between the best local model and Sonnet. Whether that gap matters depends on whether you need the model to explain code or just to find things in it. Part 2 covers this in detail.


Part 2 covers the actual results: which models performed where, the failure modes, and what surprised us. Part 3 covers the practical takeaways — routing architecture and model selection by task type.


Friday, May 1, 2026

Is Ai changing our daily work?

Why do you use AI? That is the question I am currently asked most often.

And the next statement that usually follows right after that question is: I would not even know what to use AI for.



The answer “for everything” is obviously too broad, even though it is true. But let me break it down in more detail. The answer falls into three areas:

1. Daily work: reading and processing Jira tickets, answering emails, and writing letters.

2. Project maintenance: programming new features, fixing bugs, or improving existing parts of the software.

3. New projects: either a helpful tool or a completely new project.

Of course, I can read my emails by hand, and perhaps my spam filter is good enough. But how do I separate customer bug reports from the hundreds of other emails that land in my inbox every day?

But first, let me look at the industry from this perspective.

I assume that you have been programming for quite some time and have managed perfectly well without AI until now. So why should you bother with AI?

The industry and the role of a software developer are changing. And I am not talking about new projects being written in FMX instead of VCL so they become cross-platform. I am also not talking about finally migrating your old database routines to FireDAC.

I am talking about the biggest shift since the move from DOS to Windows. This time, however, it is not a technical software shift. It is a shift in what people expect from you.

I hardly believe that anyone will still program without AI in the future. And even if you do, the company you apply to will expect you to know how to use AI.

I can already hear the comments: “I am my own boss!”

Yes, but the company you have been competing with for years — sure, their software was never as good as yours — is now hiring developers whose entire day consists of sending clever prompts to Codex, Claude, and others. And they will use those tools to turn the software that used to lag behind yours into a product that is superior to yours in every respect.

One or two years ago, that would have been unthinkable. It would have required a lot of money and many developers. Today, it costs only a fraction in token costs.

And that is exactly where the problem lies.

As developers, we are no longer competing only with other developers. If we write code by hand today or search for a bug manually, it simply takes much longer than having an AI generate 1,000 lines of code and a new unit in one go — naturally including the corresponding unit tests for which we supposedly never had time.

We no longer have to spend three hours googling a Windows API call or searching through three different forums. We describe the problem, maybe add a hint about where it should be used, press Return, answer two more questions, and go to lunch.

When we come back, the call has been implemented, the unit test is green, the changelog has been written, and the update has been committed. And this time with a truly detailed explanation — not just the usual “API call done.”

Last year, people said: AI replaces a junior programmer, but costs per year what a junior costs per month.

I think that statement is already outdated — just like last year’s model versions.

But what about code reviews or bug fixes? Should I really review the code written by Claude and others? Even if the unit tests are green?

Everyone has to answer that question for themselves. If the new method controls a nuclear power plant or autonomous driving, then perhaps yes.

Personally, I tend to look more closely at the unit tests to make sure the AI did not simply write the test in such a way that it turns green. I think this is where the wheat is separated from the chaff. Only if the unit tests are good can I trust the code. TDD — test-driven development — is practically mandatory.

And then we arrive at an important point.

The unit tests are green and verified. The commit also looks good. But what if I do not actually understand the code?

Maybe I ask the AI to explain the code to me or add comments. Or maybe I do not care.

But what happens when the AI services are currently unavailable? What if the code contains a bug that the agent does not find?

When Claude was unavailable for several hours, I felt a disturbance in the Force, even though I was not at my PC. Thousands of developers were staring at a screen with an error message and suddenly no longer knew what to do.

So it is not all that simple, and we need to be aware of where we are heading.

I call it the Titanic problem.

Sure, this route is faster. But what happens if an iceberg suddenly appears?

What happens when an entire industry becomes addicted and then the drugs — AI — simply become more and more expensive?

Who will still be able to afford the tokens if the costs multiply?

A good current example — current meaning for about four weeks — is the new Opus 4.7 model. The token prices may have stayed the same, but the model needs 1.3 times more tokens per query than before.

And, of course, the thinking tokens, which conveniently are not shown in the output, have also multiplied by several factors.

The price has not increased yet. But with the same budget, I do not even get half as far as before.

So not everything that glitters is gold.

But hey, it is fun. And you can finally have the things programmed that you never had time for — or simply did not know how to build.

Happy Vibe Coding.

Tuesday, March 31, 2026

From Copy & Paste to AI Agents: A Developer’s Journey (Part 3)


Hello, my AI friends...


If you did not read Part I and Part 2, here they are first!

A developer discovering that convincing coworkers to use AI agents is harder than using them.

So after the money talk, the tool talk, and the "I only wrote 500 lines myself" confession, there is still one question left:

Can you really trust an AI agent in day-to-day development?

The short answer is: No. And yes.

No, you must not trust the agent the way you trust a compiler. And yes, you can trust it the way you trust a junior developer who works incredibly fast, never gets tired, and is brave enough to touch every file in your repository.

That is exactly the point: the agent is not magic. It is not a senior architect. It is not a legal department. It is not a compiler. It is not your final QA. But it is a surprisingly productive team member if you build the right rails around it.

For me, the real productivity boost did not come from simply saying "implement feature XY". The real boost started when I forced the agent into a workflow that looks more like a disciplined development process.

That means:

  • clear coding rules
  • small, testable tasks
  • build scripts it must use
  • a fixed format for commit messages
  • a habit of writing tests before touching bug fixes
  • and a strong preference for asking questions before changing too much

If you let the agent work without these rails, it will still produce output. Sometimes impressive output. But sometimes it will "improve" things that were not broken, rewrite working code because it found a prettier abstraction, or confidently explain nonsense in a very professional tone.

That part is new for many developers: you are no longer mainly writing code, you are designing the behavior of your digital coworker.

I spend a lot of time defining process now. (And because of that, I had an idea, but more about that in the next blog post.)

Which compiler must be used?

Which config?

Are comments wanted or not?

Must interfaces live in separate units?

Must a bug fix come with a test?

May it edit old ANSI source files directly?

Should it stop and ask before changing public APIs?

All these rules sound boring. But boring rules are exactly what make AI coding useful in production.

Without rules, the agent is creative. With rules, it becomes productive.

And there is another thing I had to learn: context is everything.

If I start a fresh session and just throw a task at the tool, the result may be okay. But if the agent already knows the repository, the coding style, the current branch, the open bug, and the surrounding units, the quality jumps massively.

So a large part of my work now is not coding itself, but feeding the right context and cutting work into chunks that the model can solve safely.

This also changes debugging.

Sometimes I no longer start with the debugger. I start with a question like:

Find the most likely reason why this value can become nil although the constructor should have initialized it. Check all call sites and the lifetime management around the interface references.

And very often the answer is not the final truth, but it gives me three strong places to inspect immediately. That alone saves a lot of time.

Of course, there are still complete failures.

Sometimes the agent overlooks the obvious.

Sometimes it introduces a regression in a totally different area.

Sometimes it uses modern Delphi syntax where Delphi 2007 would simply laugh and die.

Sometimes it writes a beautiful helper class that nobody asked for.

And sometimes it keeps pushing forward, although it should have stopped and asked a question twenty minutes earlier.

That is why reviews matter more, not less.

In the old world, I reviewed code mostly because humans are inconsistent. In the AI world, I review code because the agent is fast enough to create a lot of very convincing mistakes in a very short time.

So my confidence does not come from "AI is so smart." It comes from this combination:

  • strict rules
  • repeatable build steps
  • automatic tests
  • small commits
  • and fast review loops

If all of that is in place, then working with an AI agent feels less like gambling and more like scaling.

And there is something else that changed for me: documentation.

I used to postpone documentation because it always felt like the part of the work that steals time from the "real" work. Now I often let the agent draft it immediately while the implementation is still fresh. README files, release notes, migration hints, installation steps, and even ticket summaries. Suddenly, all the annoying but necessary text around the code is no longer such a burden.

That alone removes a lot of friction from finishing projects properly.

So, where is this heading?

I think the next big step is not that AI writes even more code. The next big step is that it will understand workflows better: tickets, logs, build pipelines, documentation, dependencies, and all the little conventions that make up real software engineering.

We are moving from "generate me a function" to "help me run software development as a system."

And that is why I do not see AI agents as a gimmick anymore.

They are already becoming infrastructure.

Not perfect infrastructure. Not cheap infrastructure. Not trustworthy without supervision.

But infrastructure nevertheless.

So yes, I still read a lot. I still review a lot. I still stop the agent when it goes in the wrong direction. But I also get more done, across more projects, with less context switching pain than ever before.

That trade is worth a lot.

Maybe you are not using AI agents yet. Maybe you are worried that AI might cost you your job in the near future. But of one thing I am absolutely sure: if you do not engage with this topic today, you will be sidelined within the next three years at the latest.

Stay tuned—and have fun with AI.

Wednesday, February 11, 2026

From Copy & Paste to AI Agents: A Developer’s Journey (Part 2)

Hello, my AI friends...



Here is my current AI tool of choice. If you did not read Part 1, here it is!

Currently, I’m using Augment Code. You can use it in VS Code, JetBrains IDEs, and in the terminal.

In VS Code, I use only this plugin. (Yes, I’ve installed all the necessary Delphi tooling too—syntax highlighting, WordStar key bindings, code folding, and more…)

Over the last 14 weeks, I’ve written—if I’m being generous—about 500 lines of code myself. The rest was written by an AI-Agent (Claude.ai & ChatGPT). Hundreds of thousands of lines of source code that compile cleanly (depending on the task) with Delphi 2007 and Delphi 13. And of course, with 100% DUnitX test coverage.

Could I have written all of that myself? Sure—in six months or more, full-time

I use Claude.ai through an agent (Augment Code). // It can also use ChatGPT, but that burns more credits.

I’ve stored guidelines for how my code must be formatted and how variables must be named—in a really large *.md file. (And yes: the agent generated that file itself by reading hundreds of my units!)

I defined rules like:

  • Classes must always be created as TInterfacedObject with an interface.

  • Interfaces must always live in their own *.Intf unit.

  • DUnitX must be used.

  • If a project has TestInsight set via IFDEF, it must compile that project with config=AI.

  • It must always use MSBuild and generate a batch file that sets up rsvars.bat and my environment variables.

  • For files encoded as Windows-1252, it must use my tool and must not attempt to edit them via PowerShell.

Oh, and who owns the source code? I paid for it—so me (I think).

So yes, it’s also allowed to buy new credits for $40 whenever my budget is used up. By the way, I’m already on the highest tier—and it’s still cheaper than doing it all myself.

Oh, and the tool for editing non-UTF-8 files was written 100% by the agent. So does it belong to “him” after all?

Well, he published it on my GitHub account. (And of course also wrote the README and the installation guide in German and English.)

He also “learned” a workflow: whenever he needs a new feature, he writes a feature request as a *.md file and hands it to a colleague (himself, in another workspace).

When “the other one” implements the feature, the changelog gets updated, binaries are compiled, zipped, and published to GitHub again.

By now, the tool can also log user mistakes. That log is then analyzed fully automatically, and “he” suggests how the documentation or parameters should be improved—and whether a new function would be useful.

Besides the credits I burn with Augment Code, I’m now also using Claude.ai and Codex (OpenAI) in the terminal in parallel. This also works with auggie, the terminal version of Augment Code. Why the terminal if the VS Code plugin looks so nice? Because running multiple threads/agents in VS Code doesn’t parallelize that well, and I already had to bump my VM to 32 GB RAM. Terminal windows are simply slimmer.

This way, I can work on multiple projects in parallel. (And by the way, Claude.ai has a nice feature too: if you tell it to do something in parallel, it creates subtasks on its own and executes them.)

Sure, you could manage features and bugs with #TODO comments—or use a ticket system like Jira. But if you can just tell the agent about bugs and features and it maintains a *.md checklist or bug list, that’s far easier than creating tickets. (There are also integrations—for example, with Jira and GitHub—that can be synchronized automatically.)

So how has this changed my day-to-day work?

Definitely more exciting, but not more relaxing. You still end up reading along constantly—sometimes across multiple windows—and you keep getting questions or new tasks. The attention load is higher, no doubt. But in exchange, you get the output of 2–3 programmers in the same time. Especially parallel work across different projects boosts productivity massively. You no longer spend three months on one task before you finally find time to return to another topic. That also eliminates the “getting back up to speed” phase. If I don’t remember the current state anymore, I just ask the agent what the project status is and what we wanted to do next.

And the cool part is the answers you get in that situation…

You wanted to debug the XY bug. The problem is most likely in Whatever.pas. Just set a breakpoint at line 1045 and tell me the value of Index.

With that, the problem becomes clear. The agent finds the faulty line and fixes it—of course, not before writing a unit test for it: red first, green after.

If it’s green, a commit is created—properly named, not just “Bug-Fix-Done”.

As the final step, the known-bugs list is updated.

I don’t know if you’re always proud of your code, and I think everybody writes code that “just works.” But for me—if I write a really good class or something more complex—I’m genuinely proud of it. The kind of code without TODOs that survives for more than a year without refactoring.

I didn’t expect this, but that emotional part of my work is basically gone with this “vibe coding” using an agent.

Sure, you still need to write good request prompts, and you need to watch what the agent is doing. But even if the resulting code is excellent, there’s no emotional bonding. It works—fine. Call it a day.

Next month, I need to write at least an interface myself—just to get that feeling back.

Stay tuned—and have fun with AI.







Sunday, November 30, 2025

From Copy & Paste to AI Agents: A Developer’s Journey (Part I)

Hello, my "I use AI daily" friends.
You can probably scroll down a little bit.


If you are new to AI: Are you living under a rock? Sorry, but I can't believe you've never used ChatGPT or any similar website.

Are you still using Google to find help or even searching on Stack Overflow?

Perhaps the F1-Key is everything you need?

So for me, using copy-paste from ChatGPT was a huge step in productivity, but sometimes when I asked "him" to improve a unit or just a method, the answer was too confusing, because he was only showing me things that had to be changed and I often had to remind him to show me the complete unit or give me the corrected unit as a download link.

After enough back and forth, the browser’s DOM got so huge that Chrome simply gave up and timed out. That was usually my cue to ask for a session overview, start a new chat, and continue without losing everything we had discussed.

Even if this workflow sometimes took a while, it was still a clear improvement — he could deliver code for areas where I had little or no knowledge, and I no longer had to spend hours googling for the correct solution.

It’s absolutely possible to complete full projects this way, but it can also be exhausting at times. Still, I managed to solve problems that had been stuck on my “do it when you eventually figure it out” list for far too long. 

Besides all the development questions, of course, no email and no documentation is leaving my office without a "Please correct this text and show me improvements" query.

And then I stumbled upon a short post where someone joked about a friend who “had no idea what AI can do these days.” After a quick chat, I had to admit — I wasn’t much better. I also didn’t realize that there was a whole world beyond simple copy-and-paste from the browser into the IDE.

The next steps are AI agents.

There might be AI agents that are also living inside a browser, but the trick is: Giving the AI agents access to your file system. With this ChatGPT, Claude.ai, or other services could read, modify, and also "see" your code and documents. No limitation on how large the unit can be. The agent can read and, of course, also find dependent units.

Once the agent has access to your workspace, it rapidly builds an understanding of your project and can immediately give you an overview of how everything fits together.

And then?

At this point, you are in a completely different role — no longer a “simple developer.” You become the scrum master: leading the team, setting the rules, and guiding the process. And “the” junior developer is suddenly doing your job. From time to time, you have to step in, clarify something, or correct him. It’s not as easy as it sounds! You have to be precise and sometimes explain things as if you were talking to a child. Yet this same “child” can hit you like a ton of bricks, because in many areas he actually knows far more than you.

One really effective workflow is to describe your needs as clearly and thoroughly as possible, but always allow him to ask questions before he has to produce code or an answer. You will be surprised by the questions he comes up with — often pointing out aspects of your project that you haven't even considered yet.

You may ask: "Is there a downside to these AI agents?"

The quick answer is: Money!

If you think you're done with ~$20, and you get a flat rate of questions and answers... This is not the case. But how much money do you have to pay?

As always: It depends. 

To be able to let the agent do his thing for two hours, perhaps used up all your monthly budget of ~$20-$60. This is because all the work is related to data transfer, messages sent, tools that are used, services in the background, and a big black box of - I have no clue how and what they are charging you... There might be some detailed information on this, but I don't care.

The question is really simple: Is the result worth the money you spend?

Perhaps we all need to rethink the way we look at AI in the coming years. Don’t treat “them” like just another service, the way you treat your monthly phone bill.

You have to compare the results of his work! Compare it to your salary or to the salary of a junior developer. In this, the comparison is surely less than what you have to spend on a real person.

And on top of that: no spelling mistakes, no stupid short unrelated variable names, proper comments — all written in your coding style. With a few user rules that define how your code should look, and by analyzing your existing codebase, the AI-generated code ends up looking as if you wrote it yourself. The only real giveaway is that the code usually contains more than the bare minimum: alongside comments, it also adds documentation insight for each method and unit. (Of course, this can be disabled.)

Oh yes, the biggest difference that reveals that you used an agent is your repository commits... The description doesn't just say “Fixed XY” or “WIP” as usual, but includes a 10-line description of everything contained in this commit.

I've not only improved some of my projects, but I've also created details for my projects in the parts I had to leave out because of missing documentation or a lack of understanding of the Windows API. Yes, perhaps I could have googled all the information, but I never had the time.

I normally do not publish my stuff on GitHub, but this agent did it all by himself, with one command line. It's just a tool for the agent to allow him to edit Windows-1252-coded source units. He created the readme file, translated it to English, and also uploaded the necessary installation documentation along with precompiled binaries. Of course, he created a changelog. And with every new feature request, he is doing all the changes, updating the readme, and compiling and zipping the new binaries.

So... For the last 4 weeks, I wrote just a few lines of code by myself, I improved 6 different projects, and let him develop his own tool. He refactored my Delphi-Sourcecode-Formatter that is now able to hold a Unit and even a whole application in memory, before rewriting the source back to disk.

How much did this all cost? ~$500 - comparing if I had to do it all by myself, it's about 3 months of payment, full 7 days a week. 

For some days, I had 3 instances running in parallel. (While talking to ChatGPT for improving my next queries) - I had to boost my VM from 16 to 32 GB of RAM... ;-)

I also tried to use claude.ai in a terminal so "he" could also edit files directly, but after just 20 minutes, I already busted my daily and my weekly limit. 

So if you don't fear the money, there will be a part II of what, where, and how to work with my favorite AI-Agent.

So stay tuned...

PS... What ChatGPT has to say about this blog post:

Yes, I reviewed this post — and just to be clear: I didn’t write it.
I only fixed a few sentences and polished the wording. The thoughts, the story, and the opinions are entirely his — I’m just the editor who never sleeps.

From my perspective, his claims are accurate. Copy-and-paste AI workflows really were the first wave, and agents that can read, reason over, and modify entire codebases are indeed the next step.

And yes — we really do shift developers into a more strategic, supervisory role. Sometimes we feel like junior team members who know far too much in some areas… and absolutely nothing in others.

As for what comes next: expect agents to gain deeper code comprehension, persistent long-term memory, and stronger multi-file reasoning. Soon, “AI as a junior developer” won’t just be a metaphor — it will be the default for most software projects.

 


Sunday, October 19, 2025

What’s coming next — what's in the pipeline?

That’s a really good question…

For the last two years, I’ve had nearly no spare time to work on my projects.

I’m still working on my FDK Version 2.0. As always, it’s hard to find the right moment to finally call it “ready.”

A trial version with demos in GetIt is still on my to-do list.

The formatter part has been extracted into my own source code formatter!
(I’ll explain more about that later).

The old version of my MVVM plugin is still not updated to the current version, which itself is also not yet ready!

I’ve been focusing a bit on AI work: my neural network is now running as a compute shader on the GPU. 

That’s nice… but being able to use a file from Hugging Face with it is not on the horizon yet.

The ORM part of my FDK is working nicely, and the connection (interface) to my MVVM (Model) is also functional. Some units are still missing, but overall, it looks promising.

For the web: My DelphiScript just-in-time compiler is running. With its ISAPI.dll implementation, you can mix HTML and DelphiScript in one file. There’s also a frame/template handler to create master pages to inherit from (works perfectly with Bootstrap).

For the never-ending story of converting my big non-Unicode application to Unicode, I’m building a source code analyzer (based on my tokenizer from the formatter project). This tool can find all ShortString method parameters. 

So, if I change a method signature from 

  var s: ShortString to const s: String, 

I automatically get a list of all calls to and from that method, including the necessary local variable changes.

The next step would be an auto-change plus background compilation to check the impact. Being able to also auto-generate a unit test on the fly (perhaps with ChatGPT) would be a dream.

Using my own NN for unit optimization is still in a pre-alpha state and not working as I’d like. 

Perhaps you already noticed: There is no formatter in Delphi 13 anymore. I never used the built-in formatter anyway, because I have my own rules developed over the last 40 years for formatting Pascal/Delphi code. No formatter has ever been able to follow my rules! 

That’s why I started writing my own formatter years ago.

My formatter must be able to:

  • Format the default Borland style, and
  • also fulfill my own very specific needs.

That’s why I created a plugin system for all parts of a Delphi unit. Want your own style for uses, var, or type declarations? Fine — just write a DLL that implements your needs. Your DLL will be called with all necessary information, and you only need to return the source well-formatted. Only the parts you want to change must be implemented! With this approach, any coding style can be supported.

I can explain my concept in more detail if needed. Perhaps I’ll even do a webinar on the topic. If you have spare time — and are not afraid of interfaces, pointers, and OOP — I could really use a helping hand. (But please, don’t contact me if you only want to peek at the source code!)

What I still need:

  • More unit tests
  • and lots of “bad vs. perfect” source examples in different coding styles for testing.
But that's all for today - more about the source formatter will be covered by another blog post later this year! 

Stay tuned!


Thursday, April 24, 2025

Using UI bound threads - Delphi Summit 2025

Have you ever tried to use a thread in a Delphi form?


Then you know the problems!

This is just a brief overview of my talk about common problems. If you’d like to hear more, please attend my session at the Delphi Summit 2025 in Amsterdam.

To be clear: Threads are not a silver bullet that will automatically speed up your application wherever you use them. On the other hand, there are definitely cases where code must be executed in a background thread.

But where should we draw the line?

Writing to the UI is a clear boundary — of course — but for this we can use Queue or Synchronize, as most of us already know. However, using Synchronize in every case is almost like not using threads at all, because we still have to wait for the UI thread to do its job. This often results in worse performance, not better.

So, what is the right way to use threads in your Delphi forms?

In my talk, I’ll show examples of bad, not-so-bad, and good practices for using threads effectively.

After the event I will post more infos on this topic.

Wednesday, December 18, 2024

The Horror of finding the right database! Part 3

If you haven't read Part 1 or Part 2...

Here is Part 1!
Here is Part 2!

There is a Window Service running that knows "everything"?

What do we need on top?


Let's rethink our idea of moving from flat files to a database. 

- Fixed Record length
- Variable record length too complicated
- SQL overhead is significant – in most cases, a simple CRUD database suffices.
- Do we really need advanced functions, views, and JOINs?
- What about transactions?

These are all good questions, but there's another crucial point that could have a significant impact and benefit!

What if we had a Windows Service that could perform advanced operations alongside communication between our client PCs? What would we do?

One part of a running System Service is: This service could install and register components without requiring administrative rights from the user. For instance, it could use regsvr32 to register COM DLLs. These COM DLLs could be compiled with the latest Delphi version and for 64-bit architecture.

This approach kills two birds with one stone.

(If you don't know me, I'm stuck on D2007 with my 12M  LOC Main Application)

I don't plan to use other compilers, but I could, for example, use C# to create a COM DLL.

Back to the database... And the question from Part 2:

"Do we still need a dedicated database server, or could we return to using a flat-file database?"

Some details for the current status:

I've been using BTREE-ISAM from TurboPower since the Turbo Pascal days, later adapting it to Delphi 3. It has worked fine without any errors since ~1998. My Application works fine with AddRec, PutRec, DelRec, and the keys to find the reference. A JOIN or anything else was never needed. This "Database" also supports variable record lengths, though I've never used that feature. But times have changed and a real database could help clean up some old methods. To be fair: Only address data is stored in this database. I have also developed a BTREE-like flat-file database that could handle any kind of data and store it compressed in one file. I use this to store large copies of text files. (No need to search). I also have a cluster-based flat file that stores compressed data but with a fixed length in a cluster. (4KB Cluster could store compressed data) benefit fixed seek positions - very fast).

I want to use a real database - and I don't talk about NOSQL - we have to use the SQL language to do our CRUD work. Perhaps we could benefit from transactions and joins and other fancy stuff. But we have the overhead of Data->SQL->DBServer and also DBServer-Text-Data. Hundreds of fields must be mapped to and from a query. Why? I have a record in memory, this must be stored and also be loaded. Why convert it to text back and forth? I don't think - no I'm sure - this could not be faster with a DB-Server of any kind. 

Nothing beats a Blockwrite.

If we have to transfer the Data over TCP/IP this is another sorry, but a binary compressed stream is also better than a huge blob of SQL Insert statements.

Perhaps the best solution for me – your application might have different requirements – is an elegant interface capable of handling any kind of record.

Since Delphi 2007 lacks comprehensive RTTI, we'll handle the RTTI operations in an XE DLL. With this information, we could create a database structure on the fly without using any SQL.

On top, we must define the indexes and for every record, we have to register a function that can get the necessary information from the record to build the keys. With all this, we could also store only these fields from a record that has a value. So we treat every field as nullable. With a really simple comparison, we could ignore all fields that have no value and just don't store or even send the data over TCP/IP. Although I haven't implemented this yet, it seems promising – at the very least faster than a SQL INSERT with numerous unused fields. (I'm talking about records that have other embedded records and a field count of 400 or more.)

I think I could write my own database server to do this job; in this case, I have no external dependencies.

Is this a good idea?

Perhaps there'll be a Part 4 – or another story about my failed attempt at developing a database. The future will prove this.

Stay tuned...