Don’t put a robot in a human office


One question

Open the longest file in your project. How many functions are in it?

Suppose the answer is 20. Tell an AI agent to modify one function in that file, and the agent reads the entire file. It opened the file for one function, but 19 unnecessary functions came along for the ride.

This is where the problem begins.

Code humans read vs. code agents work with

Until now, code was written for humans to read. Naming variables well, writing comments, producing documentation — all of it was about reducing human cognitive load.

In the age of agents, the question changes. Is code that’s easy for humans to read the same as code that’s easy for agents to work with?

It isn’t.

| | Human | AI Agent |
|---|---|---|
| Navigation | Scans directory trees visually | Searches with grep |
| Opening files | Scrolls in IDE | read file: loads everything |
| Judging context | Intuition + experience | Only knows what’s in the context |
| Irrelevant code | Ignores it | Consumes context budget |
| 2,000-line file | Looks at only what’s needed | Processes all of it |

A human scrolling through a 2,000-line file has an intuition that says “don’t touch this part.” An agent has no such intuition. If the function it needs is 50 lines long, the other 1,950 of the 2,000 lines it reads are context pollution.
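The arithmetic behind that figure can be made explicit. The sketch below is illustrative only: the function name `pollution_ratio` and the 50-line size of the needed function are assumptions, chosen so that the numbers match the 2,000 and 1,950 above.

```python
def pollution_ratio(lines_read: int, lines_needed: int) -> float:
    """Share of the lines an agent read that it did not need."""
    if lines_read <= 0:
        raise ValueError("lines_read must be positive")
    if not 0 <= lines_needed <= lines_read:
        raise ValueError("lines_needed must be between 0 and lines_read")
    return (lines_read - lines_needed) / lines_read


# Assumed example: a 2,000-line file, of which one 50-line function is needed.
ratio = pollution_ratio(2000, 50)
print(f"{ratio:.1%} of the context is noise")  # 97.5% of the context is noise
```

If the same 50-line function lived in its own file, `pollution_ratio(50, 50)` would be 0.0.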

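Research confirms this. In one study, model performance dropped by 30% or more when the relevant information was buried in the middle of a long context. In another, performance dropped by 13.9–85% as the input grew, even when the extra tokens were only whitespace (see References). That shorter context works better is not just intuition; it is an experimental result.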

Don’t put a robot in a human office. Build a factory where robots can work.

Three things agents need

For agents to work reliably in a codebase, three things must be in place.

1. It must be readable — without noise

One concept per file. The filename is the concept name.

before: read utils.go → 20 functions, 19 unnecessary
after:  read check_one_file_one_func.go → 1 function, exactly what's needed

filefunc solves this problem. It separates code into semantic units with 22 structural rules. Applied to the Hono framework (23k+ stars), it split 186 files into 626. All 4,419 tests passed. Files increased 3.4x, but not a single line of logic changed.
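As a rough illustration of what a one-concept-per-file rule could look like as a mechanical check, here is a minimal sketch. It is not filefunc's implementation and does not reproduce any of its 22 rules. The function `check_file` and its messages are hypothetical, and it inspects Python source with the standard `ast` module for convenience, although the example above uses Go files.

```python
import ast
from pathlib import PurePath


def check_file(filename: str, source: str) -> list[str]:
    """Return violation messages for one file; an empty list means it passes.

    Hypothetical rules: exactly one top-level function per file, and the
    function name must equal the filename without its extension.
    Methods inside classes are not counted.
    """
    funcs = [
        node.name
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    violations = []
    if len(funcs) != 1:
        violations.append(
            f"{filename}: {len(funcs)} top-level functions, expected 1"
        )
    elif funcs[0] != PurePath(filename).stem:
        violations.append(
            f"{filename}: function '{funcs[0]}' does not match filename"
        )
    return violations


utils_source = "def a():\n    pass\n\ndef b():\n    pass\n"
print(check_file("utils.py", utils_source))
print(check_file("parse_date.py", "def parse_date(s):\n    return s\n"))
```

The point of the sketch is that the rule produces a list of located violations rather than an opinion, so an agent can work the list down to empty.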

“Won’t there be too many files?” Agents don’t browse directories. They search. Whether there are 500 or 1,000 files, one grep is all it takes. If a task needs 5 files out of 300, never opening the other 295 matters more than the effort of picking out the 5.

2. It must be verifiable — mechanically

When you modify a function with no tests, nobody knows what breaks. The agent doesn’t know either, so it falls into a doom loop: each fix can silently break something else, and nothing tells it when to stop.

before: 0 tests, no way to know what breaks on modification
after:  527 functions with tests, behavior changes detected immediately

tsma solves this problem. It indexes every function in the project, detects test presence, measures coverage, and feeds back uncovered branches with line numbers.

Without feedback, asking an LLM to write tests plateaus at 60–70% coverage. Tell it “line 41, 44, 70 uncovered” and it reaches 100%. Same model. The only difference is the resolution of feedback.
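The difference in feedback resolution can be sketched as follows. This is a hypothetical formatter, not tsma's output format: the function `coverage_feedback`, its parameters, and the message layout are assumptions, and the line sets would come from whatever coverage tool the project already uses.

```python
def coverage_feedback(
    func_name: str, executable_lines: set[int], executed_lines: set[int]
) -> str:
    """Turn raw coverage data into a message that names the uncovered lines."""
    uncovered = sorted(executable_lines - executed_lines)
    covered = len(executable_lines & executed_lines)
    # An empty function body is treated as fully covered.
    pct = 100.0 * covered / len(executable_lines) if executable_lines else 100.0
    if not uncovered:
        return f"{func_name}: {pct:.1f}% covered"
    lines = ", ".join(str(n) for n in uncovered)
    return f"{func_name}: {pct:.1f}% covered; uncovered lines: {lines}"


# Assumed data: 10 executable lines, 3 of them never executed by the tests.
executable = {40, 41, 42, 44, 50, 70, 71, 72, 73, 74}
executed = executable - {41, 44, 70}
print(coverage_feedback("parse_config", executable, executed))
# parse_config: 70.0% covered; uncovered lines: 41, 44, 70
```

A bare percentage tells the model that something is missing. The list of line numbers tells it where.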

In an experiment on a project with 527 functions, an autonomous agent left to itself declared “all done” after 40 functions (7.6%). With the ratchet applied, all 527 were completed and the TODO count reached 0.
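One way to read the ratchet, as a sketch under assumptions: the TODO list is held by a tool rather than by the agent, an item leaves the list only when a deterministic check passes, and a claim of completion is accepted only at zero. The class `Ratchet` and its method names are hypothetical and are not taken from any of the tools named here.

```python
class Ratchet:
    """Tool-owned TODO list that only moves toward zero."""

    def __init__(self, all_functions):
        self.todo = set(all_functions)
        self.done = set()

    def mark_done(self, name: str, check_passed: bool) -> int:
        """Move `name` to done only if the deterministic check passed.

        Returns the number of items still remaining.
        """
        if check_passed and name in self.todo:
            self.todo.remove(name)
            self.done.add(name)
        return len(self.todo)

    def claim_complete(self) -> str:
        """Answer an agent's "all done" with the actual remaining count."""
        if self.todo:
            return f"rejected: {len(self.todo)} remaining"
        return "accepted: 0 remaining"


# Assumed scenario mirroring the numbers above: 527 functions, 40 finished.
ratchet = Ratchet(f"func_{i}" for i in range(527))
for i in range(40):
    ratchet.mark_done(f"func_{i}", check_passed=True)
print(ratchet.claim_complete())  # rejected: 487 remaining
```

The agent can still say “all done” whenever it likes. The count it is answered with does not depend on what it believes.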

3. Specifications must be cross-verifiable

It must be mechanically verifiable whether API schemas, DB schemas, security policies, and state transitions are consistent with each other. When one changes, you must know before compilation whether it conflicts with the others.

before: 200 endpoints, humans check spec consistency
after:  one operationId chains all layers, machines detect drift

yongol solves this problem. It chains 10 single sources of truth, or SSOTs (OpenAPI, DDL, sqlc, SSaC, Rego, Hurl, etc.), through a single operationId and cross-validates them with ~287 rules. For example, user_id may be a string in OpenAPI but BIGINT in DDL. Tools that check one layer at a time cannot catch a cross-layer contradiction like this.
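To make the user_id example concrete, here is a minimal sketch of a single cross-layer check. It is not one of yongol's rules: the function `find_type_drift`, the `COMPATIBLE` table, and the idea of passing each layer as a plain field-to-type dictionary are all assumptions made for illustration.

```python
# Hypothetical compatibility table: OpenAPI type -> acceptable SQL column types.
COMPATIBLE = {
    "string": {"TEXT", "VARCHAR", "UUID"},
    "integer": {"INTEGER", "BIGINT", "SMALLINT"},
    "boolean": {"BOOLEAN"},
}


def find_type_drift(
    openapi_fields: dict[str, str], ddl_columns: dict[str, str]
) -> list[str]:
    """Report fields whose OpenAPI type contradicts the DDL column type."""
    drift = []
    for name, api_type in sorted(openapi_fields.items()):
        sql_type = ddl_columns.get(name)
        if sql_type is None:
            drift.append(f"{name}: in OpenAPI but missing from DDL")
            continue
        # Drop a length suffix such as VARCHAR(255) before comparing.
        base_type = sql_type.split("(")[0].strip().upper()
        if base_type not in COMPATIBLE.get(api_type, set()):
            drift.append(f"{name}: OpenAPI {api_type} vs DDL {sql_type}")
    return drift


openapi = {"user_id": "string", "email": "string"}
ddl = {"user_id": "BIGINT", "email": "VARCHAR(255)"}
print(find_type_drift(openapi, ddl))  # ['user_id: OpenAPI string vs DDL BIGINT']
```

Each layer is valid on its own. The contradiction only appears when the two are compared field by field.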

One structure running through all three tools

filefunc, tsma, and yongol are independent tools, but they share a common structure.

filefunc:  22 structural rules → validate → fix → repeat
tsma:      measure coverage → feed back uncovered branches → fix → repeat
yongol:    cross-validate → detect drift → fix → repeat

All the same loop.

LLM generates → deterministic tool judges → result fed back to LLM → repeat

Symbolic Feedback Loop: a cyclic structure where deterministic tools correct the probabilistic generation of LLMs. Not AI verifying AI, but deterministic tools verifying AI.
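The loop itself is small enough to sketch. In the code below, `feedback_loop` is a hypothetical helper, `generate` stands in for an LLM call, and `verify` stands in for any deterministic tool that returns a list of located problems. The stub `fake_llm` only imitates a model that fixes what it is told about; no real model or tool API is used.

```python
from typing import Callable


def feedback_loop(
    generate: Callable[[list[str]], str],
    verify: Callable[[str], list[str]],
    max_rounds: int = 5,
) -> tuple[str, list[str]]:
    """Generate, verify deterministically, feed problems back, repeat.

    Returns the last candidate and the problems still open (empty on success).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate(feedback)
        feedback = verify(candidate)
        if not feedback:
            break
    return candidate, feedback


def verify_field_name(code: str) -> list[str]:
    """Deterministic check: flag every line that uses 'userId'."""
    return [
        f"line {n}: field name mismatch: expected 'user_id', got 'userId'"
        for n, line in enumerate(code.splitlines(), start=1)
        if "userId" in line
    ]


def fake_llm(feedback: list[str]) -> str:
    """Stub generator: wrong on the first try, corrected once given feedback."""
    return "user_id = 1" if feedback else "userId = 1"


print(feedback_loop(fake_llm, verify_field_name))  # ('user_id = 1', [])
```

The `max_rounds` cap is a design choice of this sketch: if the generator never converges, the loop returns the open problems instead of running forever.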

Give it opinions and it flatters. Give it facts and it fixes. Ask “is the code okay?” and it answers “yes, looks great.” Tell it “line 41: field name mismatch” and it fixes it immediately. A deterministic tool has no one to flatter: it reports numbers and locations, not feelings.

From legacy to agent-operable

You don’t need to change an existing codebase all at once. This isn’t foundation work — it’s seismic retrofitting. Reinforcing the building without closing the shop.

Step 1 — Make it readable

Start with the longest files. Run filefunc validate and drive violations to zero. All existing tests must pass.

Step 2 — Make it verifiable

Run tsma next repeatedly. Add tests to untested functions and fill uncovered branches. Even if the agent dies mid-run, progress is preserved. A new agent runs tsma next and picks up where it left off.
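One way such resumability could work, sketched under assumptions: progress is kept on disk rather than in the agent's memory, so any new process can ask for the next item. The functions `next_todo` and `record_done` and the JSON state file are hypothetical; how tsma actually stores its state is not described here.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_done(state_file: Path) -> set[str]:
    """Read the set of finished functions; a missing file means none yet."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def next_todo(all_functions: list[str], state_file: Path) -> Optional[str]:
    """Return the first function not yet recorded as done, or None."""
    done = load_done(state_file)
    for name in all_functions:
        if name not in done:
            return name
    return None


def record_done(name: str, state_file: Path) -> None:
    """Persist one finished function so a later process can see it."""
    done = load_done(state_file)
    done.add(name)
    state_file.write_text(json.dumps(sorted(done)))


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "progress.json"
    functions = ["parse", "validate", "render"]
    # A first agent finishes "parse" and then dies.
    record_done(next_todo(functions, state), state)
    # A new agent knows nothing, but the state file does.
    print(next_todo(functions, state))  # validate
```

Nothing in the second call depends on the first agent's conversation history, which is what makes the handoff safe.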

Step 3 — Cross-validate

Introduce SSOTs and run yongol validate. Machines catch cross-layer contradictions.

Each step is independent. You can do step 2 without step 1, or step 1 without step 2. But when all three combine, the scope of autonomous agent work expands dramatically.

Changing the operating system

An agent-operable codebase isn’t just linting or tooling. It’s changing the operating system of the codebase.

| | human-readable | agent-operable |
|---|---|---|
| File size | Scrollable range for humans | One concept |
| Tests | Nice to have; intuition fills the gap | Required for every function |
| Specs | Docs, wikis, verbal handoffs | Declarative, cross-verifiable, machine-readable |
| Feedback | PR review (hours) | Verifier run (seconds) |
| Completion check | Human says “looks good” | Machine says “487 remaining” |

Many people are making the train faster. Bigger models, smarter agents, better prompts.

The faster the train goes, the more the tracks matter. Almost no one is laying tracks yet.



References

  • Stanford, “Lost in the Middle: How Language Models Use Long Contexts” (2024) — 30%+ performance drop when relevant info is buried in the middle of the context
  • Amazon, “Context Length Alone Hurts LLM Performance” (2025) — 13.9–85% performance drop even when unnecessary tokens are whitespace
  • Hono framework case study — 186 files → 626 files, all 4,419 tests passed
  • tsma 527-function case study — PASS 246 (46.7%), DONE 281 (53.3%), TODO 0
  • Ratchet Pattern experiment — autonomous agent 40/527 (7.6%) vs ratchet CLI 527/527 (100%)