The Preconditions for Improving LLM Multi-Agent Accuracy

The Problem

There’s an intuition that “running several agents makes you more accurate.” It’s only half true.

What we should be aiming at is not multi-agent itself but multi-agent voting without independence. Spin up N agents built from the same model, the same data, the same alignment, and take a majority vote — and you don’t get more accurate. You fail together.

  • Empirical LLM ensemble sentiment analysis: adding a bigger, more accurate model brought almost no gain. The independence that Condorcet’s theorem presupposes was already broken (arXiv:2409.00094).
  • Multi-agent debate (MAD): adding debate doesn’t reliably beat the self-consistency of a single agent (ICML 2024, arXiv:2311.17371).
  • My anecdotal observation (sample of 1, uncontrolled): on the ZenFlow task, running Grok Build as 8 concurrent agents stalled on 3 of 10 endpoints and failed to pass validate. It’s just an anecdote, so don’t weight it like the two studies above.

A majority vote is not magic. Condorcet’s jury theorem spelled out the preconditions 200 years ago. And when you meet those preconditions, multi-agent actually works. This essay is about what those preconditions are and how to meet them.

Condorcet’s Two Preconditions

In 1785, Condorcet nailed down, as a formula, the conditions under which a majority vote converges to the truth.

  1. Each voter’s accuracy > 50%
  2. Errors across voters are independent

(Strictly, there’s a third — a uniformity assumption that all share the same accuracy. I set it aside for simplicity.)

Number 2 is the heart of it. Models aligned with the same training data, the same architecture, the same RLHF fail in the same places. Vote, and “the answer they got wrong together” becomes the majority.

This isn’t just intuition. One study analyzing more than 350 LLMs reports that when two models are wrong at the same time, they converge on the identical wrong answer with 60% probability (ICML 2025, arXiv:2506.07962). The same study observed an even sharper paradox — the bigger and more accurate the model, the higher the error correlation. It held even across different architectures. (It’s a single large-scale analysis, and broad replication is still pending. But the direction is exactly what Condorcet foretold.)

The Math of Correlated Errors

If errors are independent, the ensemble shaves off wrong answers. If they’re correlated, there’s nothing to shave.

  • When independent: P(both wrong) = 0.1 × 0.1 = 0.01
  • When fully correlated: P(both wrong) ≈ 0.1 (if one is wrong, the other is too)

This intuition is rooted in a 30-year-old theorem. Krogh and Vedelsby’s ambiguity decomposition (NeurIPS 1994): ensemble error = average member error − ensemble diversity. The more correlated the member errors, the more the diversity term converges to zero, and adding models stops yielding any gain. A 2023 JMLR unified theory generalized this — diversity is not a separate knob but a dimension hidden inside the bias-variance decomposition (arXiv:2301.03962).

To summarize:

  • The condition for an ensemble to raise accuracy: the lower the error correlation, the bigger the gain (maximal under negative correlation).
  • The condition for ensemble gain to converge to zero: error correlation → 1 (same data, same bias).

The form of the vote matters too. A majority vote, if independent, lifts accuracy just as Condorcet describes. But if you bind it as consensus (unanimity, an AND gate) where “everyone must pass it,” accuracy collapses multiplicatively — if a classifier’s accuracy is 0.977 and you bind n of them in unanimity, you get 0.977ⁿ. Design the gate wrong, and more agents produce lower accuracy.

That’s the diagnosis. Now the prescription splits two ways — reduce the error correlation (Axis 1), or bypass it (Axis 2).

Axis 1 — Secure Independence and Multi-Agent Works

Let’s be clear. Multi-agent isn’t wrong. Voting without independence is what’s wrong. Meet Condorcet’s second precondition — make the agents’ errors uncorrelated — and the majority vote lifts accuracy as promised. There are two roads to manufacturing independence.

(a) Decompose the problem — this is the strongest.

Don’t give the agents the same problem and have them vote; give them different sub-problems. When inputs differ, errors become structurally independent — even with the same model. Two agents reading different documents cannot fail in the same place. They’re looking at different places.

That Anthropic’s multi-agent research system reported a 90.2% improvement over a single agent is precisely this principle. A lead agent decomposes the problem, distributes it across parallel sub-agents, and merges the results of each independent search. No verifier was needed. Decomposition made independence free.

But there’s a condition. The problem has to be decomposable. For work where sub-tasks depend on one another and require constant coordination — like several agents editing one block of code at once — parallel sub-agents collide instead. The context fragments and they make mutually contradictory decisions (Cognition, “Don’t Build Multi-Agents”). The independence of decomposition is free only when the sub-problems are genuinely independent.

(b) Diversify the models — it works, but there’s a ceiling.

Even for the same problem, have different models (GPT, Claude, Gemini) solve it and, because their weights differ, error correlation drops. Multi-agent debate also beats the single baseline only once you mix heterogeneous models (arXiv:2502.08788) — I don’t dispute this. The key is that what matters is correlation, not individual accuracy. There’s an information-theoretic result that even when choosing which models to put in an ensemble, you should pick the least-correlated combination rather than the strongest model — weak but diverse beats the single strongest model (arXiv:2602.08003). But this knob has a low ceiling. Internet corpora overlap, and as we saw, the bigger the model the more it fails together again (arXiv:2506.07962). Diversity reduces correlation but can’t drive it to zero.

Third, self-consistency, which scatters reasoning paths within the same model, also yields gains by decorrelating surface errors (GSM8K +17.9pp, arXiv:2203.11171). But that gain stops at the point where the model is systematically wrong — the same bias carved by the same data. However much you diversify the paths, there’s only one way the model doesn’t know what it doesn’t know.

Source of independenceHow it worksLimit
Problem decomposition (different inputs)Different inputs make errors structurally independentDecomposable problems only. Backfires on dependent, coordination-heavy work
Heterogeneous models (GPT+Claude+Gemini)Different weights → lower correlationCorpus overlap + bigger models → higher correlation
Reasoning-path diversification (self-consistency)Sample paths within one model, then majority voteStops at systematic errors

Conclusion for Axis 1: multi-agent works if you design for independence. And the surest independence comes not from finding a different model but from splitting the problem into independent pieces.

Axis 2 — Verifiers Bypass Independence

The third knob is of a different kind. Axis 1 reduces error correlation to rescue the vote. A verifier bypasses the correlation — even if the agents are all wrong together, an external criterion unrelated to their errors blocks the pass. It’s a gate, not a vote. So it works even where you can’t secure independence, as long as it’s a verifiable domain.

This diagnosis isn’t mine alone. “Consensus is Not Verification” (arXiv:2603.06612) nailed the same conclusion first — consensus-based aggregation has no consistent gain over a single sample and amplifies shared misconceptions, and inference-time scaling works in verifiable domains (math) but fails in non-verifiable ones. It’s not that consensus is a truth signal that works in math; it’s that the verifier filters the candidates. I accept that diagnosis and go one step further — into prescription. The strongest source of independence is decomposition; independence and verification are complements, not competitors; and there are three points where a deterministic verifier diverges from an LLM judge (below).

Yet the industry hands even this verification to an LLM — LLM-as-Judge.

Let’s start fair. LLM judges often work well. On MT-Bench, a GPT-4 judge agreed with human preference over 80% of the time, on par with the agreement rate among humans (arXiv:2306.05685). For vague preference evaluation, an LLM judge is serviceable. The question is where it breaks.

The judge breaks when it shares the same trap as the generator. A judging LLM rates outputs familiar to itself (low perplexity) higher than humans do (self-preference bias, NeurIPS 2024 WS, arXiv:2410.21819). When the judge shares the same distribution as the generator, it passes a hallucination made by the same model “because it’s familiar.” The 80% agreement rate is no comfort, because the 20% it gets wrong is clustered exactly where the generator is also wrong — the problem is error correlation, not average accuracy. And the judgment is swayed even by irrelevant variables like a candidate’s position rather than its correctness (position bias, arXiv:2406.07791).

One supporting point. LLM judgment wobbles even at the hardware layer. Even with the same input and T=0 greedy decoding, floating-point non-associativity and dynamic batching split the result depending on GPU configuration — accuracy varied by up to 9pp under BF16 (arXiv:2506.09501). This is a reproducibility issue, not a validity one, so I won’t make it a main argument. It’s just that seating something that can’t even guarantee the same answer to the same question on the final judge’s bench feels off.

So there’s the opposite direction. A weak generator + a strong verifier. Even a weak model approaches a strong one once you attach the same verifier, and a weak model’s errors are actually easier to detect (arXiv:2509.17995). You can also weight-combine several weak verifiers into a strong one (Weaver, arXiv:2506.18203), or refine LLM output with feedback from a formal verifier to guarantee consistency (AlphaVerus, arXiv:2412.06176). This isn’t a fringe claim — reasoning models and coding agents trained with verifiable rewards are the fastest-advancing area right now, and Jason Wei summed it up as verifier’s law: the degree to which AI gets stronger is proportional to the verifiability of the task.

Here we must be honest. A verifier is not a magic oracle. Tests can have gaps; specs can be wrong. More sharply — if an LLM writes the verifier, the critique I just leveled at LLM-as-Judge comes right back to life. If the generator and the verifier are the same model, a test that’s wrong in the same place passes code that’s wrong in the same place. The error correlation merely relocates to the verification layer; it doesn’t disappear.

So how do we prevent the resurrection? By raising the verifier’s reliability from outside the generator. Three things go together.

  • Human review. A person reviews the verification criteria (spec, tests, properties) once and freezes them. Even if an LLM drafts them, the pass criterion is finalized by a person outside the generator’s distribution. The cost is one-time, and a criterion frozen once is reused infinitely — the decisive difference from LLM-as-Judge, which re-judges with every generation.
  • Reduction to math and logic. As far as possible, move verification into a mechanically decidable form — type checks, invariants, formal proofs, mathematical properties. There’s no room here for an LLM’s “judgment.” True/false is decided by rules, not by model opinion.
  • Repeated testing. Because a verifier’s errors are reproducible, they accumulate into improvement. Widen coverage with regression tests and property-based testing, and a hole the verifier missed once gets pinned down by a test, never leaking in the same place again. An LLM judge wobbles even on the same input, making this accumulation impossible.

These three make the verifier a criterion independent of the generator’s bias. The way to cut error correlation at the verification layer too is to nail the verifier down not inside the model but outside it — in people, math, and test suites.

So where does a deterministic verifier’s difference lie? Not in being error-free. It’s three things. First, the verification criterion lives outside the generator’s weights — whether written by a person or made by another procedure, you can stand up a criterion independent of the generator’s bias (structurally impossible for an LLM judge). Second, the verifier’s errors surface not as confident hallucinations but as detectable, reproducible failures — giving the same verdict to the same input, so they get debugged and accumulate into improvement. Third, trust moves to a small, auditable surface (spec, tests), so once a person reviews it, it’s reused infinitely. The verifier doesn’t guarantee accuracy; the verifier’s quality becomes the ceiling on accuracy — not the generator’s size.

The Core Insight

The multi-agent accuracy formula:

accuracy = f(individual accuracy, error independence, verification mechanism)

The industry invests only in the first (bigger models). It doesn’t design the second (independence), and it hands the third (verification) to an LLM. And the invest-only-in-the-first strategy hits a paradox — the bigger the model, the higher the error correlation, so the more smart agents you gather, the more amicably they fail together.

The second and third are the real knobs. And the two don’t compete. Independence (Axis 1) rescues the vote; the verifier (Axis 2) cuts where the vote can’t reach. Have both and you’re strongest.

  • Anthropic’s research system: Axis 1’s decomposition taken to the extreme — split the problem into independent parallel searches. 90.2% improvement without a verifier.
  • SciencePedia (China, 2026): independent solvers each solve their own (Axis 1), and only model-consensus answers are kept (cross-model consensus, arXiv:2510.26854). But because the final filter is “model consensus,” it only half-catches Axis 2 — consensus is not deterministic verification. That’s why it can be trusted only when restricted to verifiable domains like math and logic.
  • Why 8 agents of the same model fail: both axes absent. Zero independence, zero verification loop. The 8 stall together in one place.
  • Why yongol works even with Haiku: a direct implementation of Axis 2. Even with low model accuracy, a deterministic verifier filters at every step — as long as the verifier’s quality holds up.

The Democracy Analogy

Just as democracy becomes mob rule if it’s a majority vote of voters who watched the same news, a majority vote of LLMs trained on the same data is a consensus of hallucination. Headcount doesn’t make truth. Independent headcount does. And where headcount can’t reach, a criterion outside the headcount makes it.

The same intuition reads through in learning algorithms too. In backpropagation, gradient directions are correlated; in evolution, mutations scatter independently. There’s a report that genetic algorithms, which use no gradient at all, explore a different solution space than gradient-based methods in deep reinforcement learning (Deep Neuroevolution, arXiv:1712.06567). Independent search reaches where correlated search can’t — the same principle we saw in ensembles, in the same shape, in optimization. That said, “it’s better because of independence” is still a post hoc interpretation — I leave it as a hypothesis, not a proof.

Conclusion

Multi-agent is not “more is more accurate.” The target of attack is not multi-agent but voting without independence. Gathering N copies of the same model for a majority vote is raising a choir that sings the wrong note together.

The prescription is two, and both are real. First, design for independence — split the problem into independent pieces (the surest way) and multi-agent works even with the same model. Second, if it’s a verifiable domain, stand up a verifier outside the LLM — raising the accuracy ceiling regardless of independence.

Let’s nail the scope honestly. The verifier axis (Axis 2) is the answer only in verifiable domains — code, math, formal specs, where you can cut the correct answer with an external criterion. In areas without such a criterion — open-ended generation, summarization, counseling, creative work, strategic judgment — Axis 1, designing for independence, is the only knob left. The knob to turn is not model size but — the independence of errors, and, where possible, an external verifier.

(Conflict-of-interest disclosure: I build yongol, a tool that takes a deterministic verifier as its keystone. So I lean toward the verifier axis. Read the argument above accounting for that bias — if the spine is wrong, the tool is wrong too.)

References

Condorcet and ensemble theory

  • Condorcet’s jury theorem (1785) — the two preconditions for majority-vote convergence: individual accuracy >50%, error independence
  • Krogh & Vedelsby, Neural Network Ensembles, Cross Validation, and Active Learning (NeurIPS 1994) — ambiguity decomposition
  • Wood et al., A Unified Theory of Diversity in Ensemble Learning (JMLR 2023, arXiv:2301.03962) — bias-variance-diversity decomposition

LLM error correlation / limits of consensus

  • Kim et al., Correlated Errors in Large Language Models (ICML 2025, arXiv:2506.07962) — 60% identical wrong answers when two models err together; bigger models, higher correlation
  • Lefort et al., Examining Independence in Ensemble Sentiment Analysis … Condorcet Jury Theorem (arXiv:2409.00094) — Condorcet’s independence assumption breaks for LLMs
  • Denisov-Blanch et al., Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for LLM Truthfulness (2026, arXiv:2603.06612) — consensus aggregation amplifies shared misconceptions; inference-time scaling works only in verifiable domains (same diagnosis as this essay — differentiated into prescription in the body)

Multi-agent: independence and decomposition

  • Cemri et al., Why Do Multi-Agent LLM Systems Fail? (UC Berkeley, 2025, arXiv:2503.13657) — analysis of 1,600+ execution traces across 7 frameworks. 14 failure modes classified into 3 categories: system design, inter-agent misalignment, task verification
  • Smit et al., Should we be going MAD? (ICML 2024, arXiv:2311.17371) — debate doesn’t reliably beat a simple baseline
  • Zhang et al., Stop Overvaluing Multi-Agent Debate (arXiv:2502.08788) — heterogeneity is the antidote (works once independence is restored)
  • Du et al., Improving Factuality and Reasoning through Multiagent Debate (ICML 2024, arXiv:2305.14325) — the original positive claim for MAD
  • Wang et al., Self-Consistency Improves Chain of Thought Reasoning (ICLR 2023, arXiv:2203.11171) — the gain from path diversification
  • Turkmen et al., Don’t Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection (2026, arXiv:2602.08003) — the criterion for ensemble selection is not individual performance but lower correlation (maximizing mutual information). Weak but diverse wins

LLM-as-Judge reliability

  • Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023, arXiv:2306.05685) — GPT-4 judge agrees 80%+ with humans (positive evidence)
  • Wataoka et al., Self-Preference Bias in LLM-as-a-Judge (NeurIPS 2024 WS, arXiv:2410.21819)
  • Shi et al., Judging the Judges: Position Bias in LLM-as-a-Judge (arXiv:2406.07791)
  • Numerical Sources of Nondeterminism in LLM Inference (arXiv:2506.09501) — output wobbles even at T=0

Weak generator + strong verifier

  • Saad-Falcon et al., Shrinking the Generation-Verification Gap with Weak Verifiers (Weaver, arXiv:2506.18203)
  • Aggarwal et al., AlphaVerus: Formally Verified Code Generation (arXiv:2412.06176)
  • Variation in Verification: Understanding Verification Dynamics in LLMs (arXiv:2509.17995)

Cases of verifiable generation

  • SciencePedia (DP Technology / DeepModeling, 2026) — Inverse Knowledge Search over Verifiable Reasoning (arXiv:2510.26854). Independent solvers + cross-model consensus filter

Evolution vs gradient

  • Such et al., Deep Neuroevolution (arXiv:1712.06567) — GA explores a different solution space than gradient

First-hand measurement (the author)

  • ZenFlow / Grok Build: 8 concurrent agents, 3 of 10 endpoints unfinished (failed validate)
  • ZenFlow / yongol: Haiku finished; Sonnet 131 min; Opus 76 min

Further Reading

  • Don’t Build Multi-Agents — Cognition (makers of Devin), 2025. A field-tested classic that flatly argues it’s better not to build multi-agents. When context fragments, agents collide — the trap of work that can’t be decomposed. (Read alongside the sequel Multi-Agents: What’s Actually Working, 2026.)
  • How we built our multi-agent research system — Anthropic, 2025. Read as a pair with the above. It shows the condition under which multi-agent works — when sub-tasks can be independently parallelized (Axis 1’s decomposition) — as a 90.2% improvement.
  • Asymmetry of verification and verifier’s law — Jason Wei, 2025. “The degree to which AI gets stronger is proportional to the verifiability of the task.” The theoretical spine of Axis 2 (weak generator + strong verifier).
  • Hallucinations in code are the least dangerous form of LLM mistakes — Simon Willison, 2025. Code blows the hallucination’s cover the moment you run it. The most intuitive case for why deterministic verification is the decisive lever.
  • Using LLM-as-a-Judge For Evaluation — Hamel Husain, 2024. Why you shouldn’t trust an LLM judge as-is, and the practical procedure for scaling to automation only after aligning with humans.
  • Defeating Nondeterminism in LLM Inference — Thinking Machines Lab, 2025. The real cause of LLMs wobbling even at temperature=0. The infrastructural case for keeping the verifier outside the model.
  • The Wisdom of Crowds — the wisdom of crowds evaporates once diversity and independence break down. An accessible introduction that unpacks Condorcet’s independence precondition in a non-AI context.

  • Cover image: AI-generated (Google Gemini)