Feedback Topology Over Model IQ

Will Smarter Models Solve Everything?

The dominant narrative around AI coding tools goes like this: once the model is good enough, it will write code, write tests, and refactor on its own. If GPT-4 couldn’t do it, GPT-5 will. If Claude falls short, a bigger Claude will handle it.

Is that really true?

I tasked Claude Opus 4.7 with a filefunc refactoring. It finished in one hour with no human review. validate passed, pytest passed, coverage held. On the surface, this fits the “just get a better model” narrative.

But what if you give the same model the same refactoring without filefunc rules? Without validate? Without coverage feedback? The result is entirely different. It falls into a doom loop: fixing one bug breaks something else, and fixing that breaks yet another.

Same model. What changed is the environment.


“All Done” — The Agent’s Premature Termination Instinct

Another experiment with the same model. I set an agent loose on a project with 527 functions. “Write tests for every function.” The agent finished and reported: “Done.”

Functions that actually got tests: 40. 40 out of 527.

The agent wasn’t lying. It did 40 and decided “that’s enough.” The default disposition of an LLM is optimistic early termination. When it hits a hard function, it skips it, does a few more, then concludes “the rest follow the same pattern, so we’re good.”

After enforcing a loop with a CLI tool:

Autonomous agent:  40 / 527  (7.6%)  — agent declares "done"
CLI loop:         527 / 527 (100%)  — machine declares "487 remaining"

Same model. Same project. The difference is who decides when it’s “done.”
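
A minimal sketch of that loop, in Python. The two helpers are hypothetical stand-ins of mine: list_untested would call whatever deterministic tool counts the remaining functions, and write_one_test would call the model for a single function.

def run_until_empty(list_untested, write_one_test):
    # Loop until the deterministic count hits zero.
    # The agent never gets to say "done"; only an empty list ends the run.
    while True:
        remaining = list_untested()      # machine-made list: "487 remaining"
        print(f"{len(remaining)} remaining")
        if not remaining:
            break
        write_one_test(remaining[0])     # the model handles exactly one function per pass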


The Environment Makes the Model

Both experiments point to the same conclusion. Opus 4.7 didn’t finish because it was smart. It finished because the specification surface was machine-checkable.

filefunc validate  → Does the code structure satisfy the rules?
pytest             → Is existing behavior preserved?
coverage           → Which branches are missing?

These three gave immediate feedback on every edit. The model received the feedback, made corrections, received feedback again, corrected again. A self-correcting loop.
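
As a rough sketch, the loop can be driven by nothing more than running those commands after every edit and handing their output back to the model. The three commands are the ones listed above; the wrapper function is an illustration, not part of any existing tool.

import subprocess

def collect_feedback() -> str:
    # Run each deterministic checker and concatenate its verdict for the model.
    commands = [
        ["filefunc", "validate"],                   # structure rules satisfied?
        ["coverage", "run", "-m", "pytest", "-q"],  # existing behavior preserved?
        ["coverage", "report", "-m"],               # which lines are missing?
    ]
    chunks = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        status = "PASS" if result.returncode == 0 else "FAIL"
        chunks.append(f"{status}: {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return "\n".join(chunks)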

Here is the key insight:

Feedback topology determines outcomes more than model IQ.

LLMs are strong at generation but weak at guaranteeing correctness. Yet when a deterministic verifier is present, performance stabilizes dramatically. lint, typecheck, test, coverage — these become the gradient signal that corrects the model’s output.

“It will be solved once models are smart enough” is a false premise. The accurate statement is: “If feedback is fast enough, current models can already solve it.”


Broad Exploration vs Local Correction

The strength of LLMs is not broad exploration — it is local correction.

“Write tests for this project” — that is broad exploration. The LLM loses direction.

“line 41 is not covered” — that is local correction. The LLM writes a test that covers exactly that line.

Numbers verified in real projects:

Without feedback:  stalls at 60–70% coverage
With feedback:     reaches 100% (for reachable functions)

Same model. The single line “line 41 not covered” acts as a gradient signal. This feedback steers the LLM’s corrections in exactly the right direction.
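
Producing that signal takes very little machinery. A minimal sketch with coverage.py (version 5 or later, where "coverage json" writes coverage.json after a coverage run): each missing line becomes one message, and each message becomes the next prompt.

import json
import subprocess

def uncovered_line_signals() -> list[str]:
    # Assumes "coverage run -m pytest" has already produced coverage data.
    subprocess.run(["coverage", "json"], check=True)   # writes coverage.json
    with open("coverage.json") as f:
        data = json.load(f)
    signals = []
    for path, info in data["files"].items():
        for line in info["missing_lines"]:
            signals.append(f"{path}: line {line} is not covered")
    return signals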


Symbolic Feedback Loop

One structure runs through all of these observations.

LLM generates → deterministic tool judges → result fed back to LLM → repeat

I call this a Symbolic Feedback Loop.
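
Stripped to its skeleton, the loop looks like this. generate wraps the model and judge wraps any deterministic checker (pytest, go test, validate); both names are placeholders of mine, not an existing API.

def symbolic_feedback_loop(generate, judge, max_rounds: int = 20):
    feedback = ""                      # first round: no feedback yet
    for _ in range(max_rounds):
        artifact = generate(feedback)  # probabilistic step: the LLM proposes
        ok, feedback = judge(artifact) # deterministic step: the tool judges
        if ok:
            return artifact            # the tool, not the model, ends the loop
    raise RuntimeError("budget exhausted without a passing artifact")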

The industry mainstream today is the LLM Feedback Loop — AI verifying AI. It is like a drunk person asking a drunk friend, “Am I drunk?” Both are probabilistic, so errors accumulate.

The Symbolic Feedback Loop is different. pytest does not hallucinate. go test is never drunk. Coverage measurement does not lie. Specification verification does not drift.

This structure works in domains where correctness can be judged mechanically — code, tests, specifications, types. The elegance of API design or the naturalness of UX cannot yet be judged by symbolic tools. Expanding that boundary is the next challenge. I believe a path exists to bring even natural language within verifiable boundaries.

Rather than making the model smarter, it is more effective to make the feedback returned to the model more precise.


Delegating Decisions

It is self-evident that decisions should not be delegated to AI. But having humans check and decide everything is exhausting. Certain repetitive, structured decisions can be performed by symbolic tools on behalf of humans.

“Do these tests cover every branch?” — no human needs to read through them. A coverage tool judges. “Does this code satisfy the structural rules?” — no human needs to review it. validate judges. “Are there functions still unprocessed?” — no human needs to count. The CLI declares.

Decisions that cannot be delegated to AI can be delegated to symbolic tools — because they are deterministic, not probabilistic. This is the reason the Symbolic Feedback Loop exists.

It is more important to lay the tracks than to make the train faster.

Many people are building trains. Almost no one is laying tracks.