filefunc × Hono — From 60 Lines to 18: Code an Agent Reads in One Pass

filefunc × Hono — From 60 Lines to 18: Code an Agent Reads in One Pass Image: AI generated

If your coding agent keeps fixing the wrong thing in a large codebase — if reading one file drags in 19 unrelated functions and poisons the context — if you’re skeptical that a convention like “1 file 1 concept” actually makes a difference — here are the measurements from a 23k-star production framework.

“Won’t You End Up With Too Many Files?”

This is the most common question about filefunc. Splitting 186 files into 626 — won’t that become unmanageable?

The answer is in Hono. An ultralight web framework that runs on Cloudflare Workers, Deno, Bun, and Node.js. 23k+ stars, 1M+ weekly npm downloads. Battle-tested production code. I refactored it using filefunc. 4,419 tests — all passing. (Verifiable directly on the park-jun-woo/hono fork.)

The Numbers

Metric	Original	After refactoring
Source files	186	626
Total lines	24,653	30,244
filefunc violations	397	0
vitest passing	4419	4419
vitest failing	4	4 (pre-existing defects)
vitest skipped	33	33

Files grew 3.4×. Lines grew 23%. Violations went from 397 to 0. Not a single test broke — more precisely, exactly 4 tests fail, identical to the original (defects that existed before). The 23% line increase comes from annotations (//ff:func, //ff:what) and re-export hubs. Not one line of logic changed. Pure structural refactoring.

The Point Is Read Length, Not File Count

“Zero violations, 626 files” is actually a proxy. filefunc’s real goal isn’t chopping files into pieces — it’s ensuring that when an agent reads one concept, it reads only that concept, without excessive length. So the number that actually needs proving isn’t violation count — it’s lines read per file. I measured that.

Lines per file	Original	After refactoring
median	60.0	17.5 (−71%)
p90	305	119 (−61%)
max	2,778	1,051 (−62%)
files ≤20 lines	26%	54%

When an agent opens one concept, it used to swallow an average of 60 lines; now it swallows 18. Even the worst case (p90) dropped from 305 to 120 lines. The functions themselves didn’t get shorter (median 11→12 lines) — obviously, since they were rearranged, not rewritten. What shrank was “the surrounding code you inevitably read just to reach one concept.”

Why does this matter? Long context is not free. LLMs systematically miss information buried in the middle of long inputs (Liu et al., Lost in the Middle, TACL 2023, arXiv:2307.03172). In coding tasks, performance collapses as context grows longer — in one benchmark, Claude 3.5 Sonnet’s accuracy fell from 29% to 3% (Rando et al., LongCodeBench, 2025, arXiv:2505.07897). And feeding a model precisely the relevant fragment at the concept level — rather than a whole file — yields higher code-completion quality (Yusuf et al., 2025, arXiv:2510.06606). Reducing read length is not a style preference; it is an accuracy defense.

The types.ts Problem

Let’s make it concrete rather than abstract. Hono’s original src/types.ts contained more than 20 interfaces and types.

When an AI agent reads this file just to find the HonoRequest type? Nineteen unnecessary types tag along. Context pollution.

After refactoring, each type lives in its own file. Need just HonoRequest? Read only hono_request.ts. The original types.ts remains as a re-export hub to preserve existing import paths.

# original
import { HonoRequest } from './types'  // 20+ types tag along

# after refactoring
import { HonoRequest } from './types'  // same path, same behavior
// internally: types.ts → hono_request.ts re-export

From the outside, nothing changed. From the AI agent’s perspective, everything changed.

Depth 6 to 2

Hono’s router algorithm is complex. The trie-router’s Node.search had a nesting depth of 6.

for → if → if → for → if → if   // depth 6

Is depth 6 bad code? No. Trie traversal is naturally deeply nested. But for an AI agent to understand this function, it has to hold 6 levels of nesting in its head at once. Same for a human. filefunc extracted the inner logic into private methods and module-level arrow functions. Depth 6 → 2. Each piece has only one control flow. The overall algorithm is identical.

# original: monolithic search
Node.search()   // depth 6, 100+ lines

# refactored: decomposed into pieces
Node.search()           // depth 2, handles composition only
  → matchParam()        // depth 1, parameter matching
  → matchWildcard()     // depth 1, wildcard handling
  → mergeHandlers()     // depth 1, handler merging

TypeScript’s F1, and an Honest Tail

filefunc’s core rule F1 is “one file, one function.” In Go, this is intuitive. But in TypeScript, splitting files breaks the module system — extract a class method to an external file and this binding disappears. So filefunc’s TypeScript parser (ts_ast.js) counts only function declarations, not const arrow functions. The principle is “one file, one concept” — not “exactly one syntactic function.”

Here honesty is required. This approach cleanly separated the easy cases (types, single helpers), but it did not separate everything. Measuring the refactored result again:

90% of 626 files (566) contain ≤1 function — satisfying “1 file 1 concept.” (The original was 70%.)
But 60 files (9.6%) still co-locate 2 or more functions. And crucially, those files tend to be long — the median line count for these 60 is 151. For example, src/utils/url.ts packs 14 functions into 319 lines.

The const arrow technique passes the counter but only partially achieves the goal. If multiple arrow functions remain in one file, an agent opening it still reads multiple concepts. The moment a metric becomes the target, the metric breaks (Goodhart). filefunc is no exception — most of the remaining read-length risk is concentrated in this 10% tail. Not elevating “zero violations” to the status of a complete answer, but measuring what still isn’t done — that is verification.

“So What Actually Improves?”

With 626 files, humans may find it inconvenient — open a directory and files spill out. But AI agents don’t open directories. They grep.

rg '//ff:func' --glob '*.ts' -l | head -20    # extract candidate files
rg '//ff:what.*router' --glob '*.ts'            # router-related functions only

With 186 files averaging 3–4 functions each, grep finds the file but reading it drags in unnecessary functions. With 626 files each containing 1 concept, the file grep finds is the concept needed. The intermediate step disappears. Code localization — finding the relevant location — is a bottleneck for downstream problem-solving in agent workflows (Chen et al., LocAgent, 2025, arXiv:2503.09089); filefunc makes this search deterministic by aligning concepts with file boundaries.

Is Function Granularity Always the Right Answer?

Let’s also look at the counter-evidence honestly. One controlled experiment reports that “function-level chunking is 3.6–5.6pp lower than other strategies and is not Pareto-optimal” for RAG code completion (Wu et al., 2026, arXiv:2605.04763). Function granularity is not a silver bullet.

That said, this is a different layer of discussion. That experiment concerns autocomplete context, where a retriever slices code and stuffs it into a prompt. filefunc is about the operational unit — files an agent directly selects and reads. Chunking strategy (retrieval chunk) and operational unit (file an agent opens) are different layers. Still, making this distinction explicit matters — “smaller is always better” is not filefunc’s claim. The claim is “when the unit an agent reads aligns with a concept, read length decreases” — and the numbers above show that.

Constraint Is Freedom

There is one thing confirmed by refactoring Hono with filefunc.

Structural constraints do not restrict code. They liberate navigation.

More files is a cost. But when each file holds exactly one concept, the agent reads precisely what it needs and is not polluted by unnecessary context — the measurement showing read length dropping from 60 to 18 lines is the evidence. The same is true for humans. When the function name is the file name, the directory is the table of contents.

397 violations became 0, and 4,419 tests passed identically to the original. And anyone can re-run that result from the refactoring report in the README. This is the evidence that “1 file 1 concept” is not theory — it is production. Including the remaining 9.6% tail.

The Codebase an Agent Can Operate — “20 functions per file drops agent performance 30–85%”: the parent proposition for this article
filefunc: One File, One Concept — The convention itself, defined. This article is its large-scale empirical validation
Building Agent-Operable Systems — The macro narrative of making locked-up legacy code accessible to agents
Ratchet Pattern — The deterministic gate that makes 4,419 tests “the machine stops until it’s done”

Sources

Liu et al. “Lost in the Middle: How Language Models Use Long Contexts” (TACL 2023, arXiv:2307.03172)
Rando et al. “LongCodeBench: Evaluating Coding LLMs at 1M Context Windows” (2025, arXiv:2505.07897)
Yusuf et al. “Beyond More Context: How Granularity and Order Drive Code Completion Quality” (2025, arXiv:2510.06606)
Chen et al. “LocAgent: Graph-Guided LLM Agents for Code Localization” (2025, arXiv:2503.09089)
Wu et al. “How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study” (2026, arXiv:2605.04763)
Refactoring validation: park-jun-woo/hono (filefunc Refactoring Report in README) · Convention: filefunc
Hero image: AI-generated (Google Gemini)