
Quick Tips — Just Know This and You Can Command AI
Four phrases are all you need.
To the agent: “Create a Hurl test.” AI writes a contract that verifies the feature works correctly, in plain text that anyone can read without knowing code.
To the agent: “Add this feature. But existing Hurl tests must pass.” This one phrase prevents drift. If AI breaks existing features while adding a new one, Hurl tells you in red text.
To the agent: “Commit.” Save the working state, like a save point in a game. If the next task goes wrong, you return here.
To the agent: “Revert.” Return to the last save point and undo what AI broke.
The pattern of these four phrases
Feature complete → “Create Hurl test” → Verify pass → “Commit” → Next feature → “Existing Hurl must pass” → Problem? “Revert”
This is a ratchet. A cogwheel that only goes forward and never back. Whether 5 or 50 features, existing ones don’t break.
Why does this work?
We learned this in Class 2: give AI opinions and it flatters; give it facts and it fixes. What Hurl returns isn’t an opinion; it’s a fact. With “test failed: status 401, expected 200,” there is nothing to flatter.
Hands-on Try
Create one Hurl test for the todo app from Class 2. Takes 3 minutes.
To the agent: “Write a Hurl test to verify that the current todo add feature works correctly”
AI creates a .hurl file.
To the agent: “Run the Hurl test”
If it passes, green. Now intentionally break it.
To the agent: “Change id to todo_id in the todo add API response”
To the agent: “Run the Hurl test”
Red text shows failure. That’s drift detection.
To the agent: “Revert”
Green again. This is the essence of ratcheting.
Why You Need to Command This Way
In Class 2 we saw the problems. Logic drift, context evaporation, sycophancy bias. Past 5 features, existing ones break, and AI falsely declares “it works.”
This class teaches three tools that prevent these problems. The practices behind them (automated tests, version control, continuous integration) have been used by software engineers for decades. You don’t need to read code. AI writes and AI runs. You just check “did it pass?”
The three tools’ roles:
| Tool | Analogy | What it does |
|---|---|---|
| Hurl | Contract | Declares “this feature must work this way” |
| Git | Save point | Guarantees “I can return to this moment” |
| CI/CD | Automatic surveillance camera | Mechanizes “automatically check every time” |
Hurl — Declare API Contracts in Plain Text
What is Hurl
Hurl is a command-line tool that sends the HTTP requests written in a plain-text file and checks the responses. A Hurl file records “how this API should behave.”
In game terms: in an RPG, buying a potion from an NPC follows a rule such as “1 potion → -50 gold, +100 HP.” Hurl checks that this rule has not changed after a patch.
Let’s look at an actual Hurl file:
# Add todo
POST http://localhost:8080/api/todos
{
  "title": "Buy milk",
  "priority": "high"
}
HTTP 201
[Asserts]
jsonpath "$.id" exists
jsonpath "$.title" == "Buy milk"
jsonpath "$.priority" == "high"
jsonpath "$.completed" == false
Even someone who doesn’t know code can read it:
- POST: ask the server to “add” something
- http://localhost:8080/api/todos: the address of the todo list
- { "title": "Buy milk", "priority": "high" }: the data to send
- HTTP 201: a successful add must return status 201 (Created)
- jsonpath "$.title" == "Buy milk": the returned data must contain the title “Buy milk”
This is a contract. “When you add a todo, 201 comes back, and the title and priority are returned as-is.” If this contract breaks, Hurl tells you in red text.
One more:
# Unauthenticated access should be denied
GET http://localhost:8080/api/todos
HTTP 401
“Accessing the todo list without logging in should return 401 (authentication required).” This too is a contract. If AI “tidies up” the authentication code and breaks this rule, Hurl catches it immediately.
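To run a contract like this by hand, it can be saved to a file and passed to `hurl --test`. The following is a minimal sketch, not a required layout: the path `tests/auth_required.hurl` is a hypothetical name, and it assumes the app is already listening on localhost:8080.

```shell
# Save the contract as a plain-text file (the path is a hypothetical example).
mkdir -p tests
cat > tests/auth_required.hurl <<'EOF'
# Unauthenticated access should be denied
GET http://localhost:8080/api/todos
HTTP 401
EOF

# Run it only if Hurl is installed; a non-zero exit means the contract broke
# (or the server is not running).
if command -v hurl >/dev/null 2>&1; then
  hurl --test tests/auth_required.hurl || echo "contract failed (or server not running)"
else
  echo "hurl is not installed; the file was written but not run"
fi
```

The file itself is the contract. The shell around it only stores and runs it, which is the part the agent normally does for you.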
Why Hurl — The Difference from Unit Tests
“There are many test tools, why Hurl?” There’s a special reason for vibe coders.
Unit tests inspect functions inside the code. In car terms, unit tests disassemble the engine and inspect each part, while Hurl is a road test that drives the car on actual roads. Because unit tests are tied to internals, a renamed function breaks the test, so whenever AI refactors, the tests must be modified along with the code. If you give AI permission to modify both code and tests, AI changes the tests to match the code. The tests pass, but the original rules are gone.
Hurl is different. It inspects at the server’s front door. It sends requests and checks responses. It doesn’t know the code’s internal structure. No matter how AI changes the code, if the externally observable behavior is the same it passes; if different, it fails.
| | Unit Tests | Hurl |
|---|---|---|
| Car analogy | Engine part disassembly | Road driving test |
| When AI changes code | Tests might change too | Tests stay the same, only results are judged |
| Reading difficulty | Must know code | Reads like normal text |
| Drift detection | Missed if AI changes tests too | Naturally detected since independent from code |
What Hurl verifies is not code but behavior. AI can freely change code. But behavior must not change. This distinction is the key to catching drift.
Why this approach works — research proves it
We learned about sycophancy bias in Class 2. The advice “write tests” also produces completely different results depending on how you give it.
The TDAD (Test-Driven AI Development) study (2026) experimented with exactly this. They asked AI to fix bugs with different test conditions:
| Condition | Regression Rate (rate of existing features breaking) |
|---|---|
| Baseline (no test instruction) | 6.08% |
| “Do TDD” procedural instruction | 9.94% (worsened!) |
| Provide affected test files as context | 1.82% (70% reduction) |
Surprising results. Instructing “do TDD” actually makes things worse. AI gets sidetracked trying to follow procedural instructions. But giving “this test file must pass” as specific context reduces regression by 70%.
The difference is clear:
- “Develop with tests” → Procedural instruction → AI gets confused
- “This Hurl file must pass” → Specific contract → 70% regression reduction
The point is not to instruct a method but to give a contract for what must pass. The second of the four phrases at the top, “Add this feature. But existing Hurl tests must pass,” does exactly this.
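The percentages in the table can be checked with a few lines of arithmetic. This sketch only recomputes relative changes from the three reported rates. The helper name `relative_change` is an illustration, not something from the study.

```shell
# Regression rates reported in the table above (percent).
baseline=6.08
tdd_instruction=9.94
test_context=1.82

# Relative change versus the baseline, in percent (negative = fewer regressions).
relative_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

reduction="$(relative_change "$baseline" "$test_context")"
increase="$(relative_change "$baseline" "$tdd_instruction")"

echo "test files as context: ${reduction}%"   # prints -70.1%
echo "procedural instruction: ${increase}%"   # prints 63.5%
```

So the “70% reduction” is relative to the baseline rate. By the same measure, the procedural instruction raised the regression rate by roughly 63%.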
Git — Reversible Save Points
What is Git
When playing games, you save. Save before the boss fight, and reload if you die.
Git is the save feature for code. “This state works well right now” → save (commit). Things went wrong in the next task → return to previous save.
Without Git in vibe coding:
Add feature 1 → Works
Add feature 2 → Works
Add feature 3 → Feature 1 breaks!
→ Want to revert but... what was the feature 2 state?
→ Tell AI "go back to before" → AI doesn't know what "before" is
→ Start from scratch
With Git:
Add feature 1 → Works → Commit (Save 1)
Add feature 2 → Works → Commit (Save 2)
Add feature 3 → Feature 1 breaks!
→ "Go back to Save 2" → Return to state where features 1 and 2 work
→ Try feature 3 a different way
Git Usage: Two Words Suffice
No need to learn Git’s dozens of commands. Vibe coders need just two things.
“Commit” — Save the current state
"Commit the current state. Message: 'todo add feature complete'"
Command AI runs:
git add .
git commit -m "todo add feature complete"
“Revert” — Restore previous state
"Revert to the last commit"
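Command AI runs (this discards edits that have not been staged or committed, in files Git already tracks; brand-new files stay in place):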
git checkout .
Or to go further back:
"Revert to the 'todo add feature complete' commit"
When to Commit
The rule is simple:
- When a feature is complete and works → Commit
- When all Hurl tests pass → Commit
- Before starting the next feature → Always commit
Proceeding without committing means there’s nowhere to return when problems arise. Like playing a game for 3 hours without saving.
Good pattern:
Feature complete → Hurl passes → Commit → Next feature
Bad pattern:
Feature 1 → Feature 2 → Feature 3 → ... → Something breaks → Nowhere to return
Git Analogy: Climbing Camps
When climbing Everest, you don’t go straight to the summit. Base camp → Camp 1 → Camp 2 → … → Summit. At each camp you pitch a tent and stock supplies. If weather worsens, you descend to the previous camp. Without camps, you die when a storm hits.
Git commits are camps. Set up a camp every time a feature is complete. Even if AI breaks things in the next feature, you can return to the previous camp.
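Returning to an earlier camp by name, as in the “go further back” request, can be sketched as follows. This is one possible approach, not the only one. It runs in a throwaway directory, and the file name `app.txt`, the demo identity, and the second commit message are all made up for illustration. It also assumes the commit message is unique enough to search for. Note that `git reset --hard` removes the later commits from the current branch.

```shell
# Work in a throwaway repository so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir" || exit 1
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

# Save point 1: feature 1 works.
echo "feature 1 works" > app.txt
git add -A && git commit -q -m "todo add feature complete"

# Save point 2: a later change that turns out to be broken.
echo "feature 1 broken" > app.txt
git add -A && git commit -q -m "add priority feature"

# Find the save point by its commit message, then move back to it.
target="$(git log --format=%H --grep="todo add feature complete" -n 1)"
git reset -q --hard "$target"

cat app.txt   # prints: feature 1 works
```

In practice the agent chooses the exact command. What matters is that every commit message names a state you might want to return to.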
CI/CD — Machines Guard Automatically
What is CI/CD
CI (Continuous Integration) is “automatically running tests every time code is uploaded.” CD (Continuous Deployment) is “automatically deploying when tests pass.”
For now, just know CI. CD comes later.
Without CI:
You: "Add the feature"
AI: "Done!"
You: (only checks the new feature on screen) "Looks good!"
→ Unaware that existing features broke
With CI:
You: "Add the feature"
AI: (writes code)
Machine: (automatically runs all Hurl tests)
Machine: "Existing login test failed!"
You: "Login is broken. Fix it."
AI: (fixes)
Machine: "All tests pass"
You: "Commit"
You don’t need to manually run Hurl tests every time. Machines run them automatically every time.
Creating CI with GitHub Actions
When you push code to GitHub, GitHub Actions automatically runs tests. Just one config file.
Let’s have AI do it:
"Set up CI with GitHub Actions.
- Automatically run Hurl tests on every code push
- Server must start first, then run Hurl tests
- Block merging if tests fail
(A PR is a 'may I merge this code?' request, and merge is actually combining it)"
AI creates the .github/workflows/ci.yml file. You don’t need to understand the content exactly. AI creates it, and you just need to know the key points:
- It runs automatically every time code is pushed
- It starts the server and runs Hurl tests
- If any fail, a red light turns on
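Actually blocking a merge also requires marking this check as required in the repository’s branch protection settings; the workflow itself only reports pass or fail. The workflow file roughly looks like this: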
name: CI                            # Name of this automation
on: [push, pull_request]            # Run every time code is uploaded
jobs:
  test:
    runs-on: ubuntu-latest          # Run on a cloud server
    steps:
      - uses: actions/checkout@v4   # Get the code
      - name: Start server          # Start the app server
        run: |
          docker compose up -d
          sleep 5
      # Assumes Hurl is available on the runner; add an install step here if it is not
      - name: Run Hurl tests        # Run all tests
        run: |
          hurl --test tests/*.hurl
      - name: Stop server           # Stop the server, even if tests failed
        if: always()
        run: docker compose down
CI Analogy: Building Fire Alarm
Buildings have fire alarms. They sound automatically when there’s a fire. No need for a guard to patrol 24/7.
CI is the fire alarm for code. When Hurl tests break, it automatically alerts you. No need to check manually every time.
The difference:
| | Manual Check | CI |
|---|---|---|
| When | When you remember | Every time, automatic |
| Scope | Only new feature | Everything |
| Missed checks | Often | Never |
| Cost | Time and energy | Free (GitHub Actions free plan) |
When Three Tools Combine: Ratchet Lock
Hurl + Git + CI combine to become a ratchet. A ratchet is a cogwheel that only turns in one direction. Turn it and it goes forward; release it and it stops but doesn’t go backward.
Feature 1 complete → Write Hurl test → All pass → Commit → Lock
Feature 2 complete → Add Hurl test → All existing + new pass → Commit → Lock
Feature 3 work → Existing Hurl test fails → Commit rejected → Fix → All pass → Commit → Lock
The rules are simple:
- When Hurl tests pass, lock them
- Locked tests cannot be deleted/modified
- When adding new features, add new Hurl tests too
- All existing + all new tests must pass to commit
When you tell AI “refactor the code,” AI freely changes code. But if Hurl tests break, the commit is rejected. AI must work while preserving all existing behavior.
This matches the TDAD study results above. The agent gets not the procedural instruction “do TDD” but the specific contract “this Hurl file must pass.” It can choose its method, but it cannot violate the contract.
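The “all tests must pass to commit” rule can be enforced mechanically instead of by promise. Below is a minimal sketch: the function name `ratchet_commit` and the `TEST_CMD` variable are hypothetical, and the default test command assumes the contracts live in `tests/*.hurl`.

```shell
# TEST_CMD is an assumption about project layout; override it as needed.
TEST_CMD="${TEST_CMD:-hurl --test tests/*.hurl}"

# ratchet_commit MESSAGE: commit only when the whole contract suite passes.
ratchet_commit() {
  message="$1"
  if ! sh -c "$TEST_CMD"; then
    echo "ratchet: contract tests failed, commit rejected" >&2
    return 1
  fi
  git add -A && git commit -m "$message"
}

# Hypothetical usage:
#   ratchet_commit "Add priority feature + Hurl tests"
```

The same check could live in a Git pre-commit hook (`.git/hooks/pre-commit`), which aborts the commit whenever the script exits non-zero. Either way, the decision to lock is made by the test result, not by the agent.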
How Class 2’s Problems Are Solved
| Class 2 Problem | Class 3 Solution |
|---|---|
| Logic drift | Hurl protects existing behavior. Even if AI changes code, behavior changes trigger failure |
| Context evaporation | Hurl files permanently preserve decisions. Contracts persist even when sessions change |
| Sycophancy bias (“all done!”) | CI judges mechanically. Not AI’s self-report but pass/fail |
| Decision-implementation mixing | Hurl declares decisions (behavior) in files separated from code |
| Multiplicative degradation | Ratchet-locking at each step resets degradation |
Let’s revisit the key experimental results from Class 2:
Autonomous agent: 40 / 527 (7.6%) — Agent declares "done"
Ratchet CLI: 527 / 527 (100%) — Machine declares "487 still remaining"
Same model. The difference is who decides “done.” When AI decides, it stops at 40; when a machine decides, it goes to 527. Hurl + CI play exactly that “machine” role.
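The idea of “who decides done” can be shown with a toy loop. This is a simulation, not the Ratchet CLI: the counter file stands in for a real test suite, and `agent_step` stands in for one round of AI work.

```shell
# The checker: reports how many items still fail.
# Simulated with a counter file; a real checker would run the tests.
work_dir="$(mktemp -d)"
echo 5 > "$work_dir/remaining"

remaining() { cat "$work_dir/remaining"; }

# Stand-in for one round of agent work that fixes one item.
agent_step() {
  n="$(remaining)"
  echo $((n - 1)) > "$work_dir/remaining"
}

rounds=0
# The loop ends only when the checker reports zero,
# never because the agent announces that it is done.
while [ "$(remaining)" -gt 0 ]; do
  agent_step
  rounds=$((rounds + 1))
done

echo "finished after $rounds rounds, $(remaining) remaining"
# prints: finished after 5 rounds, 0 remaining
```

The exit condition belongs to the checker. Swapping the simulated counter for a real pass/fail count is what turns this into the “machine” role described above.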
Retrofitting to an Existing App
If you haven’t built an app yet, skip this section. Come back when needed.
“I already built an app with vibe coding and it’s running — do I need to start over?”
No. No need to start over. It’s not foundation work — it’s seismic retrofitting. Reinforcing a building without closing the store for business.
Step 1: Capture current behavior with Hurl
Write down in Hurl how the app currently works. If there’s API documentation, transcribe it; if not, have AI do it:
"Analyze all API endpoints of the current app and write Hurl tests.
You need to capture exactly how things work right now as tests."
Goal: Declare “this is how it currently works” in plain text.
No need to do everything at once. Start with the most important things:
- Login/signup — nothing works if this breaks
- Payment — anything involving money is top priority
- Core business logic — the main thing your app does
Priority:
1. Login API → Write Hurl test → Verify pass
2. Payment API → Write Hurl test → Verify pass
3. Core CRUD → Write Hurl test → Verify pass
... (rest when you have time)
Step 2: Save current state with Git
If you’re not using Git yet:
"Initialize this project as a Git repository and commit the current state.
Message: 'preserve existing app state'"
If already using Git, commit when all Hurl tests pass.
Step 3: Set up CI
If code is on GitHub:
"Set up CI with GitHub Actions. Automatically run Hurl tests on every push."
Step 4: Now you’re safe
From here, whatever you tell AI to do, Hurl protects existing behavior:
"Add this feature. But all existing Hurl tests must pass."
When drift occurs, CI catches it immediately. Before it reaches production.
The Power of Feedback: Opinions vs Facts
Remember the sycophancy bias from Class 2. Give AI opinions and it flatters; give facts and it fixes.
What Hurl returns to AI isn’t an opinion but a fact:
Opinion: "The login feature seems a bit off"
→ AI: "I checked and it works fine!" (sycophancy)
Fact: "test failed: status 401 ≠ expected 200"
→ AI: (fixes precisely at the line level)
Asking “are you sure?” makes AI reverse a correct answer. But telling it “line 41: expected user_id, got userId” leaves nothing to flatter about. Numbers and locations aren’t emotions.
This is the fundamental reason deterministic tools (tools that always produce the same result for the same input) like Hurl, Git, and CI work. These tools don’t flatter. Pass is pass and fail is fail.
FAQ
Q: How do I know if the Hurl file is correct? AI might write it wrong.
A Hurl file that passes on its first run captures the behavior at that point in time. If code changes later and Hurl fails, that is a signal that behavior changed. In that sense the file is a snapshot: it detects change rather than judging whether the original behavior was right.
To check that the snapshot matches what you expect, run it and read the results yourself. “Adding a todo should return 201 but returns 200” is something you can judge without reading code.
Q: Won’t too many Hurl tests become hard to manage?
Starting with 3-5 is enough. Login, core features, the most important business rules. Add more one at a time when needed later. No need for perfection at once.
Q: Do I need to memorize Git commands?
No. Two phrases — “commit” and “revert” — are enough. AI runs the appropriate Git commands.
Q: Does GitHub Actions cost money?
Public repositories are free. Private repositories also get 2,000 free minutes per month (Free plan). Sufficient for small projects.
Q: If I have an existing app, will applying Hurl change the current code?
No. Hurl doesn’t touch code. You just add .hurl files next to the code. It only captures current behavior; it doesn’t modify code.
Exercise: Building a Hurl + Git + CI Pipeline
Use the todo list app from the Class 2 exercise.
Prerequisite: Install Hurl
Just ask Claude Code:
"Install Hurl for me"
Or install directly (if this command fails, check the installation page on hurl.dev for your platform):
# Ubuntu/WSL
curl --proto '=https' --tlsv1.2 -sSf https://hurl.dev/install.sh | bash
Verify installation:
hurl --version
If a version number appears, done.
Step 1: Capture Current State with Hurl (15 min)
Ask AI:
"Write Hurl tests for the current todo app's API.
Include at least these scenarios:
1. Add todo → 201 response, title returned as-is
2. List todos → 200 response, array returned
3. Complete todo → 200 response, completed changes to true
4. Delete todo → 200 or 204 response
5. Unauthenticated access → 401 response (if authentication exists)"
Run tests:
"Run the Hurl tests"
Check if all pass. If any fail, have AI fix them.
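For orientation, a file covering scenarios 1, 2, and 4 might look roughly like the sketch below. The agent’s output will differ. The path `tests/todo_crud.hurl`, the `/api/todos/{id}` delete route, and the 204 status are assumptions about the app, and authentication headers are left out.

```shell
mkdir -p tests
cat > tests/todo_crud.hurl <<'EOF'
# 1. Add a todo and remember its id for later requests
POST http://localhost:8080/api/todos
{
  "title": "Buy milk"
}
HTTP 201
[Captures]
todo_id: jsonpath "$.id"
[Asserts]
jsonpath "$.title" == "Buy milk"

# 2. List todos: expect an array
GET http://localhost:8080/api/todos
HTTP 200
[Asserts]
jsonpath "$" isCollection

# 4. Delete the todo created in step 1 (use 200 if your API returns a body)
DELETE http://localhost:8080/api/todos/{{todo_id}}
HTTP 204
EOF

echo "Run with: hurl --test tests/todo_crud.hurl"
```

The `[Captures]` section stores the id from the first response, and `{{todo_id}}` reuses it, so the requests in one file form a single scenario.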
Step 2: Git Commit (5 min)
"Git commit the current state. Message: 'Add Hurl tests — protect basic CRUD'"
This is the first save point.
Step 3: Add Feature + Ratchet Verification (20 min)
Remember the feature that broke in the Class 2 exercise? This time we add it with Hurl protection.
"Add priority (High/Medium/Low) feature to todos.
But all existing Hurl tests must pass.
Also add Hurl tests for the new feature."
Check points:
- Do all existing Hurl tests pass?
- Do new Hurl tests also pass?
- Is what broke in Class 2 now protected?
If it passes, commit:
"Commit. Message: 'Add priority feature + Hurl tests'"
Add one more:
"Add due date feature.
All existing Hurl tests pass + add new feature Hurl tests."
If it passes, commit. This is ratcheting. Only forward. Never backward.
Step 4: GitHub Actions CI Setup (10 min, optional)
Skip this step if you don’t have a GitHub account. Steps 1-3 alone are enough to experience the ratchet’s essence. You can make a GitHub account later.
If you have a GitHub repository:
"Set up CI with GitHub Actions.
- Automatically start server and run Hurl tests on every push
- Block code merging if tests fail"
Push to GitHub and verify tests run automatically in the Actions tab.
Step 5: Intentional Drift Experiment (10 min)
Once CI is confirmed working, intentionally break things:
"Change the todo add API response format.
Rename the todo number field from id to todo_id."
Verify Hurl test failure. Verify CI shows red. This is drift detection.
"Revert. Back to normal."
Verify the green light returns.
What to Record:
- Class 2 exercise vs Class 3 exercise: when adding the same feature, did existing features break?
- How many times did Hurl catch drift?
- Were there cases where AI’s “Done!” and Hurl’s verdict differed?
Summary
What we learned in this class:
- Hurl — Declares “must work this way” as a contract in plain text. Verifies behavior, not code
- Git — Creates save points — “can return to this moment”
- CI/CD — Installs mechanical verification — “automatically check every time”
- Ratchet — When the three combine, a cogwheel that “locks when passed, never goes backward”
Core principle:
Don’t instruct AI on methods. Give a contract for what must pass.
“Do TDD” → regression worsens. “This Hurl must pass” → 70% regression reduction. The difference is instruction vs contract.
Don’t change the model — add contracts.
Next Class Preview
In Class 3 we learned to protect each API with Hurl. But as projects grow, APIs aren’t the only thing needing protection. Database structure, security policies, UI components — all must be consistent with each other.
In Class 4, we learn yongol: managing API, DB, security, and UI in a single declarative specification, and moving AI’s work target from code to specifications. It is the method for breaking through the wall where vibe coding crumbles at 200 endpoints.
Related Articles
- How Hurl Prevents Vibe Coding Drift — Detailed analysis of how API contract verification with Hurl prevents vibe coding drift
- Ratchet Pattern — Why AI stopped at 40 of 527 function tests, and the pattern of making it go all the way with mechanical verifiers
Sources
- TDAD (Test-Driven AI Development) 2026 — Procedural instruction “do TDD” worsened regression to 9.94%, providing test files as context reduced regression to 1.82% (70% reduction)
- Ratchet Pattern experiment — Autonomous agent 40/527 (7.6%) vs Ratchet CLI 527/527 (100%), same model with different completion judgment authority