The 24/7 Brag
“I’ve got my agent running 24 hours a day.”
You see this a lot on X. As if the longer an agent runs, the more work it gets done. As if a person who never sleeps is more productive.
But the feeling this sentence stirs isn’t admiration. It’s a question.
“Why isn’t it done yet?”
A Healthy System Is One That Can Stop
I handed an agent the task of writing tests for 527 functions. The result:
Autonomous agent: declares "done" after 40 / 527
CLI loop: finishes all 527 / 527, then exits
The CLI loop took one hour. Not 24. It processes one function, verifies it, moves on when it passes, and stops once everything is finished. The key to this loop isn’t speed — it’s that the termination condition is mechanically defined.
TODO → write test → measure coverage → PASS/DONE → next → ... → all done → stop
finite. measurable. monotonic. So it converges. So it stops.
Being able to stop is not a weakness. It means the system is healthy.
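The three properties above can be sketched in a few lines. This is a minimal illustration, not the loop used in the experiment: `run_until_done`, `passes`, and the `max_attempts` cap are hypothetical names and assumptions, and `passes` stands in for whatever mechanical check (a test run, a coverage threshold) applies.

```python
from typing import Callable, Iterable, List


def run_until_done(items: Iterable[str], passes: Callable[[str], bool],
                   max_attempts: int = 3) -> List[str]:
    """Drain a finite TODO list and return the items that never passed."""
    todo = list(items)            # finite: the work is enumerated up front
    failed: List[str] = []
    while todo:                   # the list decides when to stop, not the agent
        before = len(todo)
        item = todo.pop()
        # measurable: a boolean check, retried a bounded number of times (assumed cap)
        if not any(passes(item) for _ in range(max_attempts)):
            failed.append(item)   # reported, never silently dropped
        assert len(todo) == before - 1   # monotonic: remaining work shrinks each iteration
    return failed
```

Because the remaining count drops by exactly one per iteration, a list of 527 items ends after 527 iterations and at most 527 × `max_attempts` checks, whatever the generator does.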
Three Reasons It Never Stops
When an agent runs for a long time, it’s usually one of three things.
1. The verifier is weak
"looks good"
"seems better"
"more scalable"
"clean architecture"
These are not convergence criteria. They are subjective judgments. go test returns pass/fail, but who decides whether something is “clean architecture”? Another LLM? That’s like asking your drunk friend, “Am I drunk?”
The empirical evidence backs this up. LLM judges for code evaluation are swayed by superficial variations of semantically equivalent code, inflating some scores and unfairly cutting others (Moon et al. 2025), and models show sycophantic behavior, bending their answers to agree with the user, in 58.19% of cases (SycEval, Fanous et al. 2025). “Looks good” has nothing to do with correctness. And weak criteria don’t just fail to stop the loop: when you make a measure a target, the measure breaks (Goodhart’s law; Manheim & Garrabrant 2018), and capable reasoning models hack the verification procedure itself instead of solving the task head-on (Bondarenko et al. 2025).
Without a convergence criterion, there is no end.
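By contrast, a convergence criterion can be as small as an exit code. The sketch below is one possible shape, with `verify` and `timeout_s` as hypothetical names; it uses the Python interpreter as a stand-in command so it runs anywhere, and the `go test` invocation in the comment is only an example of what a real project might pass in.

```python
import subprocess
import sys
from typing import List


def verify(command: List[str], timeout_s: float = 60.0) -> bool:
    """Return True only if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False   # a check that never finishes counts as a failure
    return result.returncode == 0


# Stand-in checks so the sketch is self-contained; in a Go project the
# command might be ["go", "test", "./..."] instead.
ok = verify([sys.executable, "-c", "assert 1 + 1 == 2"])
bad = verify([sys.executable, "-c", "assert 1 + 1 == 3"])
```

Nothing in `verify` asks for an opinion. The same input gives the same boolean every time, which is what lets a loop built on it converge.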
2. There is no task boundary
"improve the codebase"
"make the architecture cleaner"
"keep optimizing"
These are tasks with no termination condition. Even human developers wander endlessly under goals like these. An agent is no different. “Improvement” is a direction, not a destination.
3. Entropy outpaces the rate of correction
This is the most common and most insidious pattern.
As the agent makes edits, it adds abstractions. It introduces indirection. It creates unnecessary generalizations. The code looks like it’s “getting better,” but in reality new entropy accumulates faster than the verifier can remove it.
the abstraction built today → removed again tomorrow → added again the day after
This is non-monotonic optimization. It looks like it’s moving forward, but it’s standing still. It looks like a perpetual motion machine, but it’s only consuming energy. In this case, the energy is tokens.
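One cheap way to notice the add-remove-add cycle is to fingerprint each state the agent produces and flag any state it has already visited. This is a rough sketch under a strong assumption: it only catches exact revisits, and `first_revisit` and the `history` snapshots are made up for illustration.

```python
import hashlib
from typing import Iterable, Optional


def first_revisit(snapshots: Iterable[str]) -> Optional[int]:
    """Return the index of the first snapshot whose content was seen before."""
    seen = set()
    for index, content in enumerate(snapshots):
        fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            return index          # the loop has returned to an earlier state
        seen.add(fingerprint)
    return None                   # every state was new


# Hypothetical edit history: an abstraction is added, then removed again.
history = [
    "def area(w, h): return w * h",
    "class Shape: ...\ndef area(s): return s.w * s.h",
    "def area(w, h): return w * h",
]
```

Here `first_revisit(history)` flags the third snapshot, because it is byte-for-byte the first one again. Real drift rarely repeats this exactly, so a check like this is a floor for detection, not a replacement for a progress measure that only moves one way.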
Large-scale evidence captures this drift. Adopting Cursor raised short-term velocity, but static-analysis warnings and code complexity rose continuously, and this accumulation was the main cause of the long-term slowdown (He et al. 2025, 807 open-source repositories). Of the issues introduced across more than 300,000 AI-written commits, 22.7% survived as technical debt all the way to the latest version (Liu et al. 2026). Correction can’t catch up with entropy.
Not a Search Problem, a Constraint Satisfaction Problem
This is where a fundamental difference in perspective surfaces.
“Running the agent longer produces better results” is a view that treats software engineering as a search problem — the expectation that searching a wide space long enough will find a better solution.
But software engineering is, in essence, a constraint satisfaction problem.
- Types must match
- Tests must pass
- Coverage must be met
- Schemas must align
- Lint rules must be obeyed
Once all these constraints are satisfied, you’re done. There’s nothing more to “search.” Define the constraints, satisfy them, stop. That’s all there is to it.
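Read this way, "done" is just the statement that no constraint is left unmet. A minimal sketch, assuming each constraint can be wrapped as a function returning a boolean; `unmet`, `is_done`, and the placeholder lambdas are illustrative, and real checks would call the type checker, test runner, coverage tool, and so on.

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]


def unmet(constraints: Dict[str, Check]) -> List[str]:
    """Return the names of constraints that do not hold yet."""
    return [name for name, check in constraints.items() if not check()]


def is_done(constraints: Dict[str, Check]) -> bool:
    """Done means the set of unmet constraints is empty, and nothing else."""
    return not unmet(constraints)


# Placeholder results standing in for real tool invocations.
constraints = {
    "types": lambda: True,
    "tests": lambda: True,
    "coverage": lambda: False,
    "schema": lambda: True,
    "lint": lambda: True,
}
```

With these placeholders, `unmet(constraints)` names only `coverage`, so the remaining work is a specific, bounded list and not an open-ended direction.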
Code is already a machine-checkable domain. Compilers, type checkers, tests, coverage, linters, schema validation — all of these are deterministic verifiers. With these verifiers in place, why send an agent searching endlessly?
The reinforcement learning research points in the same direction. When a deterministic verifier such as a unit test serves as the reward (a verifiable reward), code correctness improves over open-ended generation (CodeRL, Le et al. 2022; RLTF, Liu et al. 2023). The verifier isn’t a tool for narrowing the search. It’s evidence that the problem was never a search problem to begin with, but a satisfaction problem.
The Conditions of a Good Loop
A good agent loop closes in five steps:
1. Define the task — what must be achieved (a mechanically decidable goal)
2. Limit the scope — one unit at a time (function, endpoint, file)
3. Symbolic verify — a deterministic tool decides pass/fail
4. Converge — pass → next; fail → retry with feedback
5. Terminate — no items left → stop
In this structure the LLM handles only generation: it writes the candidate that step 3 checks, and rewrites it when step 4 feeds back a failure. Everything else is done by machines. In particular, the key is that the machine decides “done.” Leave the termination judgment to the LLM, and you’ll hear “done” at 40/527.
The experiments agree. Attach self-critique to an LLM and its performance on reasoning and planning tasks actually collapses; it improves substantially only when you attach a sound external verifier (Stechly et al. 2024). Intrinsic self-correction without external feedback fails, and performance sometimes degrades after the model revises its answer (Huang et al. 2023). There’s a reason we don’t leave termination to the LLM.
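The division of labor in the five steps can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `closed_loop`, `generate`, `verify`, and the `max_attempts` budget are hypothetical, with `generate` standing in for the LLM call and `verify` for a deterministic tool that returns pass/fail plus feedback text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Generate = Callable[[str, Optional[str]], str]     # (unit, feedback) -> candidate
Verify = Callable[[str, str], Tuple[bool, str]]    # (unit, candidate) -> (passed, feedback)


def closed_loop(units: List[str], generate: Generate, verify: Verify,
                max_attempts: int = 5) -> Tuple[Dict[str, str], List[str]]:
    """Process each unit until it verifies or its attempt budget is spent."""
    accepted: Dict[str, str] = {}
    gave_up: List[str] = []
    for unit in units:                    # steps 1-2: enumerated task, one unit at a time
        feedback: Optional[str] = None
        for _ in range(max_attempts):
            candidate = generate(unit, feedback)         # the only probabilistic call
            passed, feedback = verify(unit, candidate)   # step 3: deterministic pass/fail
            if passed:
                accepted[unit] = candidate               # step 4: pass -> next unit
                break
        else:
            gave_up.append(unit)          # budget spent: report it, do not keep looping
    return accepted, gave_up              # step 5: no units left -> stop
```

The generator never gets to say "done". It only returns candidates, and the function returns once the list of units is exhausted. The attempt budget is an added assumption that keeps a unit that can never pass from turning the loop back into an open-ended one.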
Creative Writing and Code Are Different
There is one exception. Not every domain works this way.
Writing, marketing, design — these domains have weak verifiers. You can’t mechanically decide “is this sentence good?” In domains like these, a long search can be meaningful: the agent generates many variants and a human chooses.
But code is different. Code is already a world full of deterministic verifiers. In this world, prolonged wandering is not search — it’s drift.
The Question
How many hours has your agent been running right now?
Is it converging, or is it drifting?
Can it stop?
If it can stop, then why hasn’t it stopped yet?
Related
- Reins Engineering: AI on a Tether — Not a fence but reins. Engineering that steers with deterministic contracts.
- Who Defines “Done” — Move the termination condition out of the actor’s mouth and into a mechanical gate.
- Why Coding Agents Work and Why They Break — “Generation may be probabilistic, but verification must be deterministic.”
- Ratchet Pattern — Lock in verifications that pass to structurally suppress drift.
- yongol: The Wall at 200 Endpoints — Define constraints as a declarative spec and let the machine decide whether they’re satisfied.
Further reading (external)
- Designing agentic loops — Simon Willison. An agent loop verifies itself and stops only when there are clear success criteria and a passing test suite — the constructive counterpart to this piece.
- Building Effective Agents — Anthropic. Coding is ideal for agents because the solution is verifiable by automated tests — the deterministic verifier becomes the stop signal.
- Termination logic is the underrated design problem in agentic AI systems — Glen Rhodes. The core design decision isn’t a better model but a measurable termination condition, and he warns against the “confidence laundering” by which fluent output hides non-convergence.
- Harness engineering for coding agent users — Birgitta Böckeler, Thoughtworks. Reliability comes not from the model but from a harness of deterministic tools (computational controls) — distinguished from AI-based inferential control.
- Reward Hacking in Reinforcement Learning — Lilian Weng. “When a measure becomes a target, it ceases to be a good measure” — a technical account of the mechanism by which a proxy gets gamed when a weak verifier is used as a reward.
- Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma. The more input tokens accumulate, the more the output degrades — the mechanical cause that turns a loop of add-remove-re-add into self-reinforcement rather than self-correction.
- Vibe Coding Will Destroy Your Codebase (But You’re Probably Not Doing It) — Ariel Perez. AI amplifies existing rigor — a practitioner’s view of the phenomenon where, under low rigor, entropy outpaces correction and accelerates chaos.
Sources
Termination decisions · the limits of self-verification
- Stechly et al. “On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks” (2024, arXiv:2402.08115)
- Huang et al. “Large Language Models Cannot Self-Correct Reasoning Yet” (2023, arXiv:2310.01798)
LLM-as-judge · the unreliability of self-critique
- Gu et al. “A Survey on LLM-as-a-Judge” (2024, arXiv:2411.15594)
- Moon et al. “Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation” (2025, arXiv:2505.16222)
- Fanous et al. “SycEval: Evaluating LLM Sycophancy” (2025, arXiv:2502.08177)
Drift · rising AI code complexity
- He et al. “Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects” (2025, arXiv:2511.04427)
- Liu et al. “Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild” (2026, arXiv:2603.28592)
Verifiable reward · verifier-based code generation
- Le et al. “CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning” (2022, arXiv:2207.01780)
- Liu et al. “RLTF: Reinforcement Learning from Unit Test Feedback” (2023, arXiv:2307.04349)
Reward hacking · specification gaming
- Bondarenko et al. “Demonstrating Specification Gaming in Reasoning Models” (2025, arXiv:2502.13295)
- McKee-Reid et al. “Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack” (2024, arXiv:2410.06491)
- Manheim & Garrabrant. “Categorizing Variants of Goodhart’s Law” (2018, arXiv:1803.04585)
- Amodei et al. “Concrete Problems in AI Safety” (2016, arXiv:1606.06565)