Hurl Stops Vibe Coding Drift

The Three-Month Wall

You build a SaaS with vibe coding. It starts fast. “Build login” — 30 seconds. “Add payments” — 2 minutes. An MVP ships in three weeks.

Three months later, strange things happen. The AI “cleans up” payment logic and silently changes discount calculations. Adding a new endpoint breaks existing authentication. A refactoring request changes public API field names, killing every client.

This is called logic drift: the AI unintentionally modifies existing business logic. Traditional development has regression bugs too, but logic drift is different. Every prompt starts in a fresh context window, so the AI carries no memory of what the existing code was meant to guarantee. Changes the developer never asked for happen invisibly, anywhere in the codebase.

Drift in Numbers

This is not sentiment. There is data.

Speed costs complexity. A Carnegie Mellon research team compared 807 GitHub repositories before and after Cursor adoption (MSR 2026). Code additions increased 3–5x in the first month. After two months, the speed advantage had vanished. What remained were the costs: static analysis warnings up 30% and code complexity up 41%, and both increases persisted.

It didn’t get faster. It got slower. The nonprofit AI research organization METR ran a randomized controlled trial with 16 experienced open-source developers (2025). On projects they already knew well, tasks where AI tools were allowed took 19% longer to complete. Yet the developers themselves believed AI had sped them up by 20%. That is a 39-percentage-point gap between perception and reality. Results may differ on new projects, but the assumption “AI = always faster” is broken.

At scale, stability collapses. According to the Google DORA Report (2025), every 25% increase in AI adoption correlates with a 7.2% decrease in software delivery stability.

It actually collapsed. Amazon mandated company-wide AI coding tool usage in 2025 and deployed 21,000 AI agents. During the same period, roughly 30,000 employees were laid off, drastically reducing review capacity. The combination of rapid AI-generated code and reduced review staff resulted in 4 Sev-1 incidents in 90 days. On March 5, 2026, a 6-hour outage caused an estimated 6.3 million lost orders. Internal documents stated: “GenAI’s rapid code generation is inadvertently exposing vulnerabilities, and current safeguards are wholly inadequate.”

“Do TDD” Is Not the Answer

The common advice for vibe coding drift is “write tests.” The direction is right, but how you provide tests determines the outcome.

The TDAD study (arXiv, 2026) tested this precisely. Qwen3-Coder 30B was given 100 instances from SWE-bench Verified under three prompting conditions.

Condition	Regression Rate
Baseline (no test instructions)	6.08%
Procedural “do TDD” instruction	9.94% (worse)
Affected test files provided in context	1.82% (70% reduction)

Telling the agent “do TDD” makes things worse. The agent derails from the original task trying to follow procedural instructions. But providing “these test files must pass” as concrete context reduces regressions by 70%.

The difference is clear. Not “how to test” instructions, but “what must pass” contracts.
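The percentages in the table follow from plain relative-change arithmetic. The sketch below recomputes them from the three regression rates quoted above; the constant and function names are illustrative only and do not come from the study.

```python
# Regression rates quoted in the table above (percent of instances).
BASELINE = 6.08
PROCEDURAL_TDD = 9.94
TEST_FILES_IN_CONTEXT = 1.82


def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Negative means fewer regressions than baseline, positive means more.
    print(f"procedural TDD: {relative_change(PROCEDURAL_TDD, BASELINE):+.1f}%")
    print(f"test files in context: {relative_change(TEST_FILES_IN_CONTEXT, BASELINE):+.1f}%")
```

Relative to the baseline, the procedural instruction raises the regression rate by roughly 63%, while supplying the affected test files cuts it by roughly 70%.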

Hurl: Contracts in Plain Text

Hurl is a testing tool that declares HTTP requests and expected responses in plain text. Maintained by Orange (formerly France Télécom), it is a single command-line binary written in Rust on top of libcurl, with no language runtime to install, and it has 18.7k GitHub stars. It is fast enough to run on every commit in CI.

# Login succeeds
POST http://localhost:8080/api/auth/login
{
  "email": "test@example.com",
  "password": "secret123"
}
HTTP 200
[Asserts]
jsonpath "$.token" exists
jsonpath "$.user.email" == "test@example.com"

# Unauthenticated access returns 401
GET http://localhost:8080/api/pages
HTTP 401

Two contracts. Login must return 200 with a token. Unauthenticated access must return 401.

Commit this file to git and run it in CI on every commit. The moment the AI “cleans up” auth logic and the 401 becomes a 200, the check fails and the commit is rejected. Drift is caught before it reaches production.

Why Hurl

Unit tests can catch drift too — if you don’t give the AI permission to modify test files. But unit tests verify internal functions, making them structurally coupled to implementation. When function names change, tests break. Every refactoring requires test updates.

Hurl sits at the HTTP boundary. It declares only requests and responses. It knows nothing about code internals. No matter how the AI changes the code, if externally observable behavior stays the same, tests pass; if it differs, tests fail. It is naturally independent of implementation.

Aspect	Unit Tests	Hurl
Verifies	Function internals	HTTP contract
On AI refactoring	Must be updated with the code	Unchanged
Drift detection	Only if test files are locked	Built in
Code structure dependency	High	None
Human readability	Code-level	Plain text
LLM generation	Requires code structure understanding	Only needs HTTP

What Hurl verifies is not code but behavior. Code can be freely changed by AI. Behavior must not change. This distinction is the key to catching drift.

Ratchet Lock

When Hurl tests pass, lock them. This is the ratchet.

1. Write Hurl tests for current API (or auto-extract)
2. Run on every commit in CI
3. Passing tests cannot be deleted or modified
4. New features require new Hurl tests
5. All existing + all new tests must pass to merge

Tell the agent “refactor this code” and it freely changes the code. But if Hurl tests break, the commit is rejected. The agent must preserve all existing behavior while refactoring. Drift in edge cases not covered by Hurl is still possible, but for covered behavior, drift is structurally suppressed.

This aligns exactly with the TDAD study’s finding. Not a procedural “write tests” instruction, but a concrete “these Hurl files must pass” contract. The agent can choose the method, but cannot violate the contract.
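Rule 3 of the ratchet ("passing tests cannot be deleted or modified") has to be enforced by something outside the agent. One possible sketch, assuming the CI job can obtain the output of `git diff --name-status` against the target branch: a small guard that flags any existing `.hurl` file that was modified, deleted, or renamed, while still allowing new ones to be added. The function name and the file paths in the sample are hypothetical.

```python
# Statuses from `git diff --name-status` that alter an existing file:
# M = modified, D = deleted, R = renamed (reported as e.g. "R100").
LOCKED_STATUSES = ("M", "D", "R")


def ratchet_violations(name_status: str, suffix: str = ".hurl") -> list[str]:
    """Return paths of existing contract files that a change touches.

    `name_status` is the text printed by `git diff --name-status <base>`.
    Added files (status A) are allowed, so new contracts can still land.
    """
    violations = []
    for line in name_status.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        status, old_path = fields[0], fields[1]
        if status.startswith(LOCKED_STATUSES) and old_path.endswith(suffix):
            violations.append(old_path)
    return violations


if __name__ == "__main__":
    # Hypothetical diff: one contract edited, one contract added, one source file edited.
    sample = "M\ttests/api/auth.hurl\nA\ttests/api/billing.hurl\nM\tsrc/auth.py\n"
    print(ratchet_violations(sample))
```

In this sample only `tests/api/auth.hurl` is reported: the added contract and the edited source file are allowed. A CI step could fail the job whenever the returned list is non-empty. Intentional contract changes would then need a separate, human-approved path, which this sketch does not cover.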

Works on Legacy Too

Already running production on vibe-coded software? No need to start over.

Step 1: Capture current behavior in Hurl.

If API docs exist, translate them directly to Hurl. If not, have an agent read the existing code and write Hurl tests. The goal is to declare “this is how it currently works” in plain text for every endpoint.
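Capturing "how it currently works" can be as mechanical as turning each observed request and response into a Hurl entry. The sketch below is one way to do that, under the assumption that the observations already exist (from API docs, recorded traffic, or an agent reading the code). The function and parameter names are hypothetical. It pins only the status code and the presence of response fields, so volatile values such as tokens do not cause false alarms.

```python
import json
from typing import Optional, Sequence


def render_contract(
    title: str,
    method: str,
    url: str,
    status: int,
    request_json: Optional[dict] = None,
    response_fields: Sequence[str] = (),
) -> str:
    """Render one observed HTTP exchange as a Hurl entry.

    Assumes each response field is a dotted path of simple identifiers,
    so that it can be written as a `$.name` JSONPath.
    """
    lines = [f"# {title}", f"{method} {url}"]
    if request_json is not None:
        # A JSON request body goes directly after the request line.
        lines.append(json.dumps(request_json, indent=2))
    lines.append(f"HTTP {status}")
    if response_fields:
        lines.append("[Asserts]")
        lines.extend(f'jsonpath "$.{name}" exists' for name in response_fields)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_contract(
        "Login succeeds",
        "POST",
        "http://localhost:8080/api/auth/login",
        200,
        request_json={"email": "test@example.com", "password": "secret123"},
        response_fields=["token", "user.email"],
    ))
```

The generated entry has the same shape as the login example earlier in this article, except that it asserts field presence rather than exact values. Exact-value asserts can be added by hand where a value is part of the contract.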

Step 2: Wire it into CI.

Confirm that every Hurl test passes against the current build, then make the suite a required check for merging.
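A minimal sketch of such a CI step, assuming the `hurl` binary is on the PATH and the contracts live under a directory such as `tests/api` (both are assumptions, as are the function names). It relies on `hurl --test` exiting with a non-zero code when any assertion fails.

```python
import subprocess
from pathlib import Path


def find_contracts(root: str) -> list[str]:
    """Return every .hurl file under root, sorted for a stable run order."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(str(p) for p in base.rglob("*.hurl"))


def build_command(files: list[str]) -> list[str]:
    """Build the hurl invocation; --test runs the files in test mode."""
    return ["hurl", "--test", *files]


def run_contracts(root: str) -> int:
    """Run all contracts and return an exit code suitable for a CI step."""
    files = find_contracts(root)
    if not files:
        # An empty suite would pass trivially, so treat it as a failure.
        print(f"no .hurl files under {root}; refusing to pass an empty suite")
        return 1
    # hurl exits non-zero when any assertion fails, which fails the CI job.
    return subprocess.run(build_command(files)).returncode
```

A CI job could call `sys.exit(run_contracts("tests/api"))` once the application under test is up. Starting the application is left out of the sketch.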

Step 3: You’re safe now.

Whether the AI refactors or adds features, the Hurl suite protects every behavior you captured. If that behavior drifts, CI catches it immediately.

Not foundation work — seismic retrofitting. Reinforcing the building without closing the shop.

Not the End of Vibe Coding — Its Evolution

Andrej Karpathy, who coined “vibe coding” in February 2025, declared one year later, in February 2026, that “the era of vibe coding is over.” The new paradigm is agentic engineering: humans don’t write code; they orchestrate agents that autonomously plan, implement, and test.

Thoughtworks Technology Radar (2025) placed Spec-Driven Development at “Assess” level. Martin Fowler’s team published an SDD tools analysis. The industry is converging in the same direction.

Hurl tests are the smallest unit of this transition. You don’t need 10 specs. You don’t need to learn OpenAPI. One Hurl file is one contract. And that contract structurally prevents drift without constraining the agent’s freedom.

Don’t change the model. Add a contract.

yongol — The Keel of AI Coding SaaS — Forces full-stack consistency with 10 SSOTs. Hurl is one of them.
Ratchet Pattern — How to Make Agents Go All the Way — The theory behind deterministic verification and ratchet locking.
IFEval-Exploiting Ratchet Code — Feedback loops using sycophancy bias and Reins.

References

Cursino, D. et al. (2026). “Speed at the Cost of Quality? The Impact of AI Coding on Software.” MSR 2026. arxiv.org/abs/2511.04427
METR (2025). “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” arxiv.org/abs/2507.09089
Google Cloud (2025). DORA Report 2025. cloud.google.com
Wang, Z. et al. (2026). “TDAD: Test-Driven Agentic Development.” ACM AIWare 2026. arxiv.org/abs/2603.17973
Autonoma (2026). “Amazon Vibe Coding Failures: 4 Sev-1s in 90 Days.” getautonoma.com
CNBC (2026). “Amazon convenes ‘deep dive’ internal meeting to address AI-related outages.” cnbc.com
Thoughtworks (2025). “Spec-Driven Development.” Technology Radar Vol.33. thoughtworks.com
Karpathy, A. (2026). “From Vibe Coding to Agentic Engineering.” thenewstack.io
Fowler, M. et al. (2025). “SDD Tools.” martinfowler.com
Hurl. hurl.dev | github.com/Orange-OpenSource/hurl