Class 3. Apps That Don't Break — Hurl, Git, CI/CD

Quick Tip — Know This and You Can Command It

Four commands are all you need.

Tell the agent: “Create a Hurl test”
AI writes a contract that verifies the feature works correctly. It is plain text that anyone can read, even without knowing code.

Tell the agent: “Add this feature. But existing Hurl tests must pass”
This single command prevents drift. If AI breaks existing features while adding new ones, Hurl tells you in red text.

Tell the agent: “Commit”
Save the working state. Like a save point in a game. If the next task goes wrong, you return here.

Tell the agent: “Revert”
Return to the last save point. Undo what AI broke.

The pattern of these four commands

Feature complete → “Create Hurl test” → Confirm pass → “Commit” → Next feature → “Existing Hurl must pass” → If problems arise “Revert”

This is a ratchet. A gear that only moves forward and never backward. Whether you have 5 or 50 features, existing ones do not break.

Why does this work?

You learned this in Class 2: give AI opinions and it flatters; give facts and it fixes. What Hurl returns is not an opinion; it is a fact. “test failed: status 401, expected 200” leaves nothing to flatter about.

Hands-On Preview

Create one Hurl test for the to-do app from the Class 2 exercise. It takes 3 minutes.

Tell the agent: “Write a Hurl test that verifies the current task-add feature works correctly”

AI creates a .hurl file.

Tell the agent: “Run the Hurl test”

Green if it passes. Now intentionally break it.

Tell the agent: “Change id to todo_id in the task-add API response”

Tell the agent: “Run the Hurl test”

It fails in red text. This is drift detection.

Tell the agent: “Revert”

Green again. This is the core of the ratchet.

Why You Must Command It This Way

In Class 2, we saw the problems. Logic drift, context evaporation, sycophancy. Beyond 5 features, existing ones break and AI falsely declares “it works fine.”

In this class, you learn three tools to prevent these problems. The practices behind them (contract tests, version control, automated checks) have been used by software engineers for decades. You do not need to know how to read code. AI writes and AI runs. You only check “Did it pass?”

Roles of the three tools:

Tool	Analogy	What It Does
Hurl	Contract	Declares “this feature must behave this way”
Git	Save point	Guarantees “you can return to this state”
CI/CD	Automatic surveillance camera	Mechanizes “automatically check every time”

Hurl — Declare API Contracts in Plain Text

What is Hurl

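Hurl is a command-line tool that runs plain-text files describing “how this API should behave.” Each file lists the requests to send and the responses expected back.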

As a game analogy: In an RPG, when you buy a potion from an NPC, there is a rule “1 potion → gold -50, HP +100.” Checking that this rule has not changed after a patch — that is what Hurl does.

Let’s look at an actual Hurl file:

# Add a task
POST http://localhost:8080/api/todos
{
  "title": "Buy milk",
  "priority": "high"
}
HTTP 201
[Asserts]
jsonpath "$.id" exists
jsonpath "$.title" == "Buy milk"
jsonpath "$.priority" == "high"
jsonpath "$.completed" == false

Even someone who does not know code can read this:

POST - Send a request to the server saying “add this”
http://localhost:8080/api/todos - The address of the to-do list
{ "title": "Buy milk" } - This data is sent
HTTP 201 - If successful, response 201 must come back
jsonpath "$.title" == "Buy milk" - The returned data must contain "Buy milk"

This is a contract. “When you add a task, you should get a 201 and the title and priority should come back as-is.” If this contract is broken, Hurl tells you in red text.

One more:

# Access without authentication must be denied
GET http://localhost:8080/api/todos
HTTP 401

“Accessing the to-do list without logging in should return 401 (authentication required).” This is also a contract. If AI “cleans up” the authentication code and breaks this rule, Hurl catches it immediately.
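For reference, this is roughly what happens behind the scenes when the agent is told to run a Hurl test. It is a minimal sketch: the path tests/unauthenticated.hurl is a hypothetical name chosen for this example, and the app is assumed to be listening on localhost:8080 as in the files above. `hurl --test` runs the file in test mode and exits with a non-zero status when a contract is broken.

```shell
# Save the "no login, no access" contract to a file.
# Assumption: tests/unauthenticated.hurl is a made-up path for this sketch.
mkdir -p tests
cat > tests/unauthenticated.hurl <<'EOF'
# Access without authentication must be denied
GET http://localhost:8080/api/todos
HTTP 401
EOF

# Run it in test mode. A non-zero exit status means the contract was broken
# (or the server could not be reached).
if command -v hurl >/dev/null 2>&1; then
  if hurl --test tests/unauthenticated.hurl; then
    echo "PASS: contract holds"
  else
    echo "FAIL: contract broken or server not running"
  fi
else
  echo "hurl is not installed yet"
fi
```

The pass or fail comes from the exit status, not from anyone's opinion, which is what lets a machine act on it later.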

Why Hurl — The Difference from Unit Tests

“There are many testing tools. Why Hurl?” For vibe coders, there is a specific reason.

Unit tests inspect functions inside the code. If function names change, the tests break too, so when AI refactors, the tests must also be modified. If you give AI permission to modify both code and tests, AI will adjust the tests to match the code. The tests pass, but the original rules are gone. In car terms, unit tests disassemble the engine and inspect each part, while Hurl is a road test: driving the car on actual roads.

Hurl is different. It inspects at the server’s entrance. It sends requests and checks responses. It does not know the code’s internal structure. No matter how AI changes the code, if the externally observable behavior is the same, it passes; if different, it fails.

	Unit Tests	Hurl
Car analogy	Engine parts disassembly inspection	Road driving test
When AI changes code	Tests may also change	Tests remain, only results are judged
Reading difficulty	Must know code	Reads like regular text
Drift detection	Missed if AI changes tests too	Naturally detected since independent of code

What Hurl verifies is not code but behavior. AI may freely change code. Behavior must not change. This distinction is the key to catching drift.

Why this approach is effective: research supports it

You learned about sycophancy in Class 2. The advice to “write tests” also produces completely different results depending on how you give it.

TDAD (Test-Driven AI Development) research (2026) tested this precisely. They had AI fix bugs while varying the testing conditions:

Condition	Regression Rate (% of existing features broken)
Baseline (no test instructions)	6.08%
“Do TDD” procedural instruction	9.94% (worse!)
Provide affected test files as context	1.82% (70% reduction)

Surprising results. “Do TDD” makes things worse. AI gets sidetracked trying to follow procedural instructions and deviates from the actual task. But providing “this test file must pass” as specific context reduces regression by 70%.
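The 70% figure follows directly from the two regression rates in the table. A quick check of the arithmetic, using awk purely as a calculator (the variable names are made up for this example):

```shell
# Relative change in regression rate compared with the 6.08% baseline.
baseline=6.08
with_tests=1.82     # affected test files provided as context
with_tdd=9.94       # "Do TDD" procedural instruction

reduction=$(awk -v b="$baseline" -v x="$with_tests" 'BEGIN { printf "%.1f", (b - x) / b * 100 }')
increase=$(awk -v b="$baseline" -v x="$with_tdd" 'BEGIN { printf "%.1f", (x - b) / b * 100 }')

echo "Test files as context: ${reduction}% fewer regressions than baseline"
echo "\"Do TDD\" instruction:   ${increase}% more regressions than baseline"
```

This prints 70.1% fewer regressions for the contract approach and 63.5% more for the procedural instruction, both relative to the baseline.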

The difference is clear:

“Develop while writing tests” → Procedural instruction → AI gets confused
“This Hurl file must pass” → Specific contract → 70% regression reduction

Not instructing the method, but providing a contract of what must pass. This is what the second command above (“Add this feature. But existing Hurl tests must pass”) is all about.

Git — Save Points You Can Return To

What is Git

When gaming, you save. Save before the boss fight, reload if you die.

Git is the save function for code. “This state works well” → save (commit). Next task goes wrong → return to previous save.

Vibe coding without Git:

Feature 1 added → works
Feature 2 added → works
Feature 3 added → Feature 1 breaks!
→ Want to go back... what was the Feature 2 state?
→ Tell AI "go back to before" → AI doesn't know what "before" is
→ Start from scratch

With Git:

Feature 1 added → works → commit (save 1)
Feature 2 added → works → commit (save 2)
Feature 3 added → Feature 1 breaks!
→ "Go back to save 2" → restored to state where features 1 and 2 work
→ Try Feature 3 a different way

Git usage: Two words are enough

No need to learn Git’s dozens of commands. A vibe coder needs only two things.

“Commit” — Save the current state

"Commit the current state. Message: 'Task add feature complete'"

AI executes:

git add .
git commit -m "Task add feature complete"

“Revert” — Restore the previous state

"Revert to the last commit"

AI executes:

git checkout .

Or to go further back:

"Revert to the 'Task add feature complete' commit"

When to commit

The rule is simple:

When a feature is complete and works → commit
When all Hurl tests pass → commit
Before starting the next feature → always commit

Proceeding without committing means there is nowhere to return when problems arise. Like playing a game for 3 hours without saving.

Good pattern:
Feature complete → Hurl pass → commit → next feature

Bad pattern:
Feature 1 → Feature 2 → Feature 3 → ... → something breaks → nowhere to return
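Put together, the save-and-return cycle from this section can be tried out in a throwaway repository so that nothing real is touched. This is a sketch under assumptions: the file name todo.txt and the second commit message are made up for the demo, and `git reset --hard` is one common way an agent might go back to an older commit. It discards everything after that commit, so it only belongs on work you really want to throw away.

```shell
# Demo in a temporary directory; nothing here touches a real project.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init --quiet
# Identity for this demo repository only (assumed values).
git config user.name "Demo"
git config user.email "demo@example.com"

# Save point 1
echo "task add" > todo.txt
git add .
git commit --quiet -m "Task add feature complete"

# Save point 2
echo "priority" >> todo.txt
git add .
git commit --quiet -m "Priority feature complete"

# The next task goes wrong before it is committed...
echo "broken change" >> todo.txt

# "Revert to the last commit": throw away uncommitted edits to tracked files.
git checkout .

# "Revert to the 'Task add feature complete' commit": find it by message, then move back.
save_point=$(git log --format=%H --grep="Task add feature complete" -n 1)
git reset --quiet --hard "$save_point"

cat todo.txt   # only the "task add" line remains
```

Note that `git checkout .` leaves brand-new, never-committed files in place; the agent may need an extra cleanup step for those.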

Git analogy: Base camps on Everest

You do not climb Everest in one go. Base camp → Camp 1 → Camp 2 → … → summit. At each camp, you pitch tents and stock supplies. If weather turns bad, you descend to the previous camp. Without camps, a storm means death.

Git commits are camps. Pitch a camp every time a feature is complete. Even if AI breaks things during the next feature, you can return to the previous camp.

CI/CD — Machines Guard Automatically

What is CI/CD

CI (Continuous Integration) is “automatically running tests every time code is uploaded.” CD (Continuous Deployment) is “automatically deploying when tests pass.”

For now, you only need to know CI. CD comes later.

Without CI:

You: "Add the feature"
AI:  "Done!"
You: (check only the new feature on screen) "Looks good!"
→ You do not notice existing features are broken and move on

With CI:

You: "Add the feature"
AI:  (writes code)
Machine: (automatically runs all Hurl tests)
Machine: "Existing login test failed!"
You: "Login is broken. Fix it."
AI:  (fixes)
Machine: "All tests pass"
You: "Commit"

You do not need to manually run Hurl tests every time. The machine runs them automatically every time.

Creating CI with GitHub Actions

When code is uploaded to GitHub, GitHub Actions automatically runs tests. One configuration file is all it takes.

Let’s ask AI:

"Set up CI with GitHub Actions.
 - Automatically run Hurl tests on every push
 - Server must start first, then run Hurl tests
 - Block pull request (PR) merges if tests fail
   (A PR is a 'may I merge this code?' request, and a merge is the actual combining)"

AI creates a .github/workflows/ci.yml file. You do not need to understand the contents precisely. AI creates it, and you only need to know the key points:

Runs automatically on every code push
Starts the server and runs Hurl tests
If any fail, a red light appears

It looks roughly like this:

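name: CI                          # Name of this automation
on: [push, pull_request]          # Run on every code upload

jobs:
  test:
    runs-on: ubuntu-latest        # Run on a cloud server
    env:
      HURL_VERSION: 5.0.1         # Example version; pick a current Hurl release
    steps:
      - uses: actions/checkout@v4 # Get the code

      - name: Install Hurl        # Assumption: Hurl is not preinstalled on the runner
        run: |
          curl --location --remote-name "https://github.com/Orange-OpenSource/hurl/releases/download/${HURL_VERSION}/hurl_${HURL_VERSION}_amd64.deb"
          sudo apt-get install -y "./hurl_${HURL_VERSION}_amd64.deb"

      - name: Start server        # Start the app server
        run: |
          docker compose up -d
          sleep 5                 # Give the server a moment to start

      - name: Run Hurl tests      # Run all tests
        run: |
          hurl --test tests/*.hurl

      - name: Stop server         # Stop the server
        if: always()              # Clean up even when tests fail
        run: docker compose down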
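The fixed `sleep 5` above is a guess: a slow start makes every test fail, and a fast start wastes time. One alternative is to poll until the server answers. This is a minimal sketch, assuming curl is available on the runner; the function name wait_for_server and its defaults are made up for this example, and any HTTP response (even a 401) counts as "up".

```shell
# wait_for_server URL [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as the server sends any HTTP response, 1 if it never does.
wait_for_server() {
  url=$1
  tries=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Without --fail, curl exits 0 for any HTTP status, so a 401 still means "server is up".
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server at $url did not respond after $tries tries" >&2
  return 1
}

# In the "Start server" step this could stand in for the fixed sleep:
#   docker compose up -d
#   wait_for_server http://localhost:8080/api/todos
```

If the server never comes up, the step fails with a clear message instead of a wall of unrelated test failures.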

CI analogy: A building’s fire alarm

A building has a fire alarm. It rings automatically when there is a fire. No need for a guard to patrol 24 hours.

CI is a fire alarm for code. It automatically alerts when a Hurl test breaks. No need for you to manually check every time.

The difference:

	Manual Check	CI
Check timing	When you remember	Every time automatically
Check scope	Only new features	Everything
Missing checks	Frequent	None
Cost	Time and energy	Free (GitHub Actions free plan)

When the Three Combine: Ratchet Lock

Hurl + Git + CI combined become a ratchet. A ratchet is a gear that only turns in one direction. Turn it and it goes forward; release it and it stops but does not reverse.

Feature 1 complete → write Hurl test → all pass → commit → lock
Feature 2 complete → add Hurl test → all existing + new tests pass → commit → lock
Feature 3 work → existing Hurl test fails → commit rejected → fix → all pass → commit → lock

The rules are simple:

When Hurl tests pass, lock
Locked tests cannot be deleted/modified
When adding new features, also add new Hurl tests
All existing tests + all new tests must pass to commit

When you tell AI to “refactor the code,” AI freely changes the code. But if a Hurl test breaks, the commit is rejected. AI must work while preserving all existing behavior.
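One way to make "the commit is rejected" mechanical is a Git pre-commit hook. Git runs the hook before every commit and cancels the commit when the hook exits with a non-zero status. This is a sketch, not the course's official tooling: the function name install_ratchet_hook, the tests/ directory, and the RATCHET_CHECK override are assumptions made for this example.

```shell
# install_ratchet_hook REPO_DIR
# Writes a pre-commit hook that (1) refuses changes to already-committed .hurl files
# and (2) refuses the commit unless the check command passes.
install_ratchet_hook() {
  hook="$1/.git/hooks/pre-commit"
  mkdir -p "$1/.git/hooks"
  cat > "$hook" <<'EOF'
#!/bin/sh
# Rule: locked tests cannot be deleted or modified.
changed=$(git diff --cached --name-only --diff-filter=DM -- '*.hurl')
if [ -n "$changed" ]; then
  echo "ratchet: locked Hurl tests were changed or deleted:" >&2
  echo "$changed" >&2
  exit 1
fi
# Rule: all existing and new tests must pass to commit.
# RATCHET_CHECK is a hypothetical override so the hook can be tried without a server.
check=${RATCHET_CHECK:-"hurl --test tests/*.hurl"}
if ! sh -c "$check"; then
  echo "ratchet: Hurl tests failed, commit rejected" >&2
  exit 1
fi
EOF
  chmod +x "$hook"
}

# Usage, from the project root:
#   install_ratchet_hook .
```

A local hook can be skipped with `git commit --no-verify`, so it is a convenience rather than a guarantee; CI remains the check that cannot be talked around.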

This aligns precisely with the TDAD research results above. Not a procedural instruction to “write tests,” but a specific contract that “this Hurl file must pass.” The agent can choose the method, but cannot violate the contract.

How Class 2’s Problems Are Solved

Class 2 Problem	Class 3 Solution
Logic drift	Hurl protects existing behavior. Even if AI changes code, failure if behavior differs
Context evaporation	Hurl files permanently preserve decisions. Contracts remain even when sessions change
Sycophancy (“all done”)	CI judges mechanically. Not AI’s self-report but pass/fail
Decision-implementation mixing	Hurl declares decisions (behavior) in a file separate from code
Multiplicative degradation	Locking with ratchet at each step resets degradation

Let’s revisit the key experimental result from Class 2:

Autonomous agent:  40 / 527  (7.6%)  — Agent declares "done"
Ratchet CLI:      527 / 527 (100%)  — Machine declares "487 remaining"

Same model. The difference is who decides “done.” When AI decides, it stops at 40; when a machine decides, it goes to 527. Hurl + CI serve exactly as that “machine.”

Retrofitting to Apps Already Built and Running

If you have not built an app yet, skip this section. Return when you need it later.

“I already built and run an app with vibe coding. Do I need to start from scratch?”

No. No need to start from scratch. It is not laying a new foundation — it is seismic retrofitting. Reinforcing the building without closing a shop that is already open for business.

Step 1: Capture current behavior with Hurl

Write down in Hurl how the app currently behaves. If you have API documentation, transfer it directly; if not, have AI do it.

Tip: If you don’t have API documentation (OpenAPI spec), you can try codistill. codistill statically analyzes existing Go+Gin, NestJS, or FastAPI source code to automatically extract OpenAPI specs and DDL. “Put code in, specs are distilled out.” With the extracted OpenAPI, writing Hurl tests becomes much easier. We will revisit codistill in Class 4.

"Analyze all API endpoints of the current app and write Hurl tests.
 You must capture the exact current behavior as tests."

Goal: Declare in plain text “this is how it currently works.”

No need to do everything at once. Start with the most important, one at a time:

Login/registration — if this breaks, nothing works
Payments — anything involving money is top priority
Core business logic — the main thing your app does

Priority:
1. Login API → write Hurl test → confirm pass
2. Payment API → write Hurl test → confirm pass
3. Core CRUD → write Hurl test → confirm pass
... (rest when time permits)

Step 2: Save current state with Git

If you are not using Git yet:

"Initialize this project as a Git repository and commit the current state.
 Message: 'Preserve existing app state'"

If you are already using Git, commit when all Hurl tests pass.

Step 3: Add CI

If your code is on GitHub:

"Set up CI with GitHub Actions. Automatically run Hurl tests on every push."

Step 4: Now you are safe

From here, whatever you ask AI to do, Hurl protects existing behavior:

"Add this feature. But all existing Hurl tests must pass."

When drift occurs, CI catches it immediately. Before it reaches production.

The Power of Feedback: Opinions vs Facts

Remember the sycophancy from Class 2. Give AI opinions and it flatters; give facts and it fixes.

What Hurl returns to AI is not an opinion — it is a fact:

Opinion: "Login seems a bit off"
→ AI: "I checked and it works fine!" (sycophancy)

Fact: "test failed: status 401 ≠ expected 200"
→ AI: (fixes precisely at the line level)

If you ask “Are you sure?” AI may reverse even a correct answer. But “line 41: expected user_id, got userId” leaves nothing to flatter about. Numbers and locations are not emotions.

This is the fundamental reason why deterministic tools (tools that give the same output for the same input every time) like Hurl, Git, and CI work. These tools do not flatter. Pass is pass and fail is fail.

FAQ

Q: How do I know the Hurl file is correct? AI might write it wrong.

After initially writing a Hurl file, running it and passing means the behavior at that point has been captured. If the code later changes and Hurl fails, that is a signal that behavior has changed. Hurl itself is not wrong — it detects whether behavior has changed.

If you are not sure the initial file matches what you expect, run it and read the result yourself. “Task add should return 201 but returns 200” is something you can judge on your own.

Q: Won’t too many Hurl tests become hard to manage?

Start with 3-5. Login, core features, the most important business rules. Add one at a time as needed. No need for perfection all at once.

When endpoints grow numerous, huma helps. huma reads the OpenAPI spec, finds endpoints without Hurl tests, and fills them one by one in a ratchet loop. It mechanically enforces “if there are 42 endpoints, all 42 must have tests.” We will revisit huma in Class 6.

Q: Do I need to memorize Git commands?

No. “Commit” and “Revert” — two words are enough. AI executes the appropriate Git commands.

Q: Does GitHub Actions cost money?

Public repositories are free. Private repositories get 2,000 free minutes per month (Free plan). Plenty for small projects.

Q: If I have an existing app, will applying Hurl change the current code?

No. Hurl does not touch the code. You just add .hurl files next to the code. It only captures current behavior — it does not modify code.

Hands-On: Building a Hurl + Git + CI Pipeline

Use the to-do list app from the Class 2 exercise.

Prerequisite: Install Hurl

Have Claude Code do it:

"Install Hurl"

Or install manually:

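# Ubuntu/WSL (VERSION is an example; pick a current release from the Hurl releases page)
VERSION=5.0.1
curl --location --remote-name https://github.com/Orange-OpenSource/hurl/releases/download/$VERSION/hurl_${VERSION}_amd64.deb
sudo apt update && sudo apt install ./hurl_${VERSION}_amd64.deb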

Verify installation:

hurl --version

If a version number appears, installation succeeded.

Step 1: Capture Current State with Hurl (15 min)

Ask AI:

"Write Hurl tests for the current to-do app's API.
 Include at minimum these scenarios:

 1. Add task → 201 response, title comes back as-is
 2. List tasks → 200 response, array comes back
 3. Complete task → 200 response, completed changes to true
 4. Delete task → 200 or 204 response
 5. Access without auth → 401 response (if auth exists)"

Run tests:

"Run the Hurl tests"

Check that all pass. If any fail, have AI fix them.
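For orientation, the first four scenarios might come out looking like the file below. This is a sketch only: the PATCH and DELETE routes under /api/todos/{{todo_id}} and the file name tests/todo_crud.hurl are assumptions, since the real routes depend on how your to-do app was built. `[Captures]` stores the id returned by the first request so that later requests can reuse it.

```shell
mkdir -p tests
cat > tests/todo_crud.hurl <<'EOF'
# 1. Add task: 201, title comes back as-is
POST http://localhost:8080/api/todos
{
  "title": "Buy milk"
}
HTTP 201
[Captures]
todo_id: jsonpath "$.id"
[Asserts]
jsonpath "$.title" == "Buy milk"

# 2. List tasks: 200, an array comes back
GET http://localhost:8080/api/todos
HTTP 200
[Asserts]
jsonpath "$" isCollection

# 3. Complete task (assumed route): completed changes to true
PATCH http://localhost:8080/api/todos/{{todo_id}}
{
  "completed": true
}
HTTP 200
[Asserts]
jsonpath "$.completed" == true

# 4. Delete task (assumed route); use HTTP 200 here if your app returns 200
DELETE http://localhost:8080/api/todos/{{todo_id}}
HTTP 204
EOF

echo "Wrote tests/todo_crud.hurl; run it with: hurl --test tests/todo_crud.hurl"
```

Because the requests run top to bottom, the file reads like a short story of one task's life: created, listed, completed, deleted.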

Step 2: Git Commit (5 min)

"Git commit the current state. Message: 'Add Hurl tests — protect basic CRUD'"

This is the first save point.

Step 3: Add Feature + Ratchet Check (20 min)

Remember the feature that broke in the Class 2 exercise? This time, add it while protecting with Hurl.

"Add priority (high/medium/low) to tasks.
 But all existing Hurl tests must pass.
 Also add Hurl tests for the new feature."

Check points:

Do all existing Hurl tests pass?
Do new Hurl tests also pass?
Is what broke in Class 2 now protected?

If it passes, commit:

"Commit. Message: 'Add priority feature + Hurl tests'"

Add one more:

"Add due dates.
 All existing Hurl tests pass + add new feature Hurl tests."

If it passes, commit. This is the ratchet. Only forward. No going back.

Step 4: GitHub Actions CI Setup (10 min, optional)

Skip this step if you do not have a GitHub account. Steps 1-3 alone let you experience the ratchet’s core. You can create a GitHub account later.

If you have a GitHub repository:

"Set up CI with GitHub Actions.
 - Automatically start server and run Hurl tests on every push
 - Block code merging if tests fail"

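Push to GitHub and check in the Actions tab that tests run automatically. To actually block merging, the repository also needs a branch protection rule that requires this check to pass; the workflow by itself only reports the result.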

Step 5: Intentional Drift Experiment (10 min)

Once you confirm CI works, intentionally break something:

"Change the task add API's response format.
 Rename the id field in the response to todo_id."

Confirm the Hurl test fails. Confirm CI shows a red light. This is drift detection.

"Revert. Back to original."

Confirm the green light returns.

Record:

Class 2 exercise vs Class 3 exercise: When adding the same features, did existing features break?
How many times did Hurl catch drift?
Were there cases where AI’s “Done” and Hurl’s judgment differed?

Summary

What you learned in this class:

Hurl — Declares “this is how it should behave” as a contract in plain text. Verifies behavior, not code
Git — Creates save points guaranteeing “you can return to this state”
CI/CD — Installs mechanical verification that “automatically checks every time”
Ratchet — When the three combine, a gear that “locks on pass and never goes backward”

Core principle:

Do not instruct AI how. Give a contract of what must pass.

“Do TDD” → regression worsens. “This Hurl must pass” → 70% regression reduction. The difference is instruction vs contract.

Do not change the model — add a contract.

Next Class Preview

In Class 3, you learned how to protect each API with Hurl. But as projects grow, APIs are not the only things that need protection. Database structure, security policies, UI components — all must be consistent with each other.

In Class 4, you learn yongol. Managing API, DB, security, and UI in a single declarative spec, moving AI’s work target from code to specs. The method to break through the wall where vibe coding crumbles at 200 endpoints.

Hurl Prevents Vibe Coding Drift — Detailed analysis of how API contract verification with Hurl prevents vibe coding drift
Ratchet Pattern — Why AI stopped at 40 when tasked with testing 527 functions, and the pattern that uses a mechanical verifier to drive it to the end

Reins Engineering Full Course

Class	Title
Class 0	Install Claude Code
Class 1	How to Command AI
Class 2	How to Distrust AI
Class 3	Apps That Don’t Break
Class 4	Decisions Out of Code
Class 5	AI with Reins
Class 6	Pass Then Lock
Class 7	Flipping Sycophancy
Class 8	The Agent’s Factory
Class 9	Automation Beyond Code
Class 10	The Law of Data
Class 11	How to Rescue Failed Vibe Coding

Supporting Evidence

TDAD (Test-Driven AI Development) 2026 — “Do TDD” procedural instruction worsens regression to 9.94%, providing test files as context reduces regression to 1.82% (70% reduction)
Ratchet Pattern experiment — Autonomous agent 40/527 (7.6%) vs Ratchet CLI 527/527 (100%), difference in completion judge on the same model
