Why you want to stop early

Liang et al.'s Multi-Agent Debate paper (arXiv:2305.19118) is the clearest published evidence that multi-model debate beats single-model reflection. It's also the source of a subtler finding: debate helps in the first two to four rounds and then stops helping. In some tasks, quality actively degrades past that.

The mechanism is easy to describe. A debate prompt asks models to review the prior round's findings and respond — agree, challenge, or revise. Once the real disagreements have been resolved, the prompt is still there. Models don't refuse it. They produce output that satisfies the shape of the prompt: fresh-sounding disagreement on smaller and smaller issues, nitpicks promoted to the level of the original findings, speculative edge cases that were not raised when they would have mattered.

The output looks like debate. It isn't. It's the panel dutifully generating disagreement because disagreement was asked for, not because disagreement exists.

Three things go wrong if you let it run:

  • Quality degrades: invented disagreements dilute the real findings.
  • Latency grows: every forced round waits on a full panel pass.
  • Cost goes up: every extra round is another set of model calls.

Adaptive early break detects real convergence and stops the loop. That's the whole mechanism, and it's sufficient.


What "convergence" means here

Convergence doesn't mean the models produced identical text. It means their positions stopped moving — the same set of findings, in the same rough shape, survives from one round to the next.

This distinction matters. Models will never produce byte-identical responses across rounds; sampling variance alone prevents it. Demanding exact agreement would mean convergence never fires and the debate always runs to max rounds, defeating the point.

What you want is substantive agreement on individual findings. Panel A flagged a null-pointer risk in handleAuth(). Panel B flagged the same thing, slightly reworded. That's convergence on that finding. Multiply across the full list, and when most findings look stable round over round, the debate is done.

The hard part is detecting "most findings look stable" cheaply, without asking the moderator to read both rounds and decide — which is itself another round of expensive inference.


The title-similarity heuristic

Joint Chiefs uses a title-similarity heuristic. Each finding has a short title ("null dereference in handleAuth", "unbounded recursion in walkTree"). Between rounds, Joint Chiefs compares the sets of titles across providers. If the sets are sufficiently similar — most titles match or near-match — positions are treated as converged and the debate breaks.

This is a deliberately cheap check. No extra LLM call, no embedding pass at runtime, just string-level comparison on a small set of short strings. It runs in milliseconds between rounds and has no impact on the debate budget.
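The shipped check isn't reproduced here, but its shape can be sketched with stdlib string matching. The function names and both thresholds below are illustrative, not Joint Chiefs' actual values:

```python
from difflib import SequenceMatcher

def title_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Two titles 'near-match' when their normalized similarity clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def titles_converged(prev: list[str], curr: list[str], min_overlap: float = 0.8) -> bool:
    """Treat positions as converged when most titles in the current round
    match or near-match some title from the prior round."""
    if not curr:
        return True
    matched = sum(1 for t in curr if any(title_match(t, p) for p in prev))
    return matched / len(curr) >= min_overlap

prev = ["null dereference in handleAuth", "unbounded recursion in walkTree"]
curr = ["null deref in handleAuth", "unbounded recursion in walkTree"]
print(titles_converged(prev, curr))  # True: both titles match or near-match
```

The whole check is two nested loops over a handful of short strings, which is why it costs effectively nothing between rounds.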

It is also — and we want to be honest about this — a heuristic. It gets two classes of cases wrong. These are in KNOWN-ISSUES because we haven't solved them.

Case 1: Titles diverge while substance converges. Three providers agree that handleAuth() has a null-pointer risk, but each titles their finding differently — "null deref", "missing nil check", "unsafe force-unwrap in auth path". The substance is the same. The titles are not similar enough to trigger early break. The debate runs an extra round or two past real convergence. Cost: wasted calls. Quality impact: zero to slightly negative.

Case 2: Titles converge while substance diverges. Two providers flag "concurrency issue in cache" and mean two different things — one is flagging a race on the write path, the other is flagging lock ordering on the read path. The titles match. The heuristic breaks early. The moderator still reads both bodies during synthesis and can often merge or distinguish them, but a real disagreement got compressed into one apparent agreement. Quality impact: small but non-zero.
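Both failure cases show up directly in the similarity scores. A small illustration, using difflib's ratio as a stand-in for whatever matcher the heuristic actually uses:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Case 1: same substance, divergent titles. The scores stay below a
# 0.8 near-match threshold, so early break does not fire.
print(similarity("null deref", "missing nil check"))
print(similarity("null deref", "unsafe force-unwrap in auth path"))

# Case 2: divergent substance, identical titles. The score is 1.0,
# so the heuristic wrongly treats two different findings as one.
print(similarity("concurrency issue in cache", "concurrency issue in cache"))  # 1.0
```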

A future version will likely use embedding-based similarity on the full finding bodies, which handles both cases better at the cost of an extra inference pass per round. We haven't shipped it because the current heuristic is right often enough that the added cost isn't justified yet. That will change when we have evidence that the noise from these edge cases is material.
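If that replacement ships, the check would keep the same shape, just on vectors instead of strings. A sketch under stated assumptions: the embedding model is unspecified (the vectors here are stand-ins) and both thresholds are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def bodies_converged(prev_vecs: list[list[float]],
                     curr_vecs: list[list[float]],
                     threshold: float = 0.9,
                     min_overlap: float = 0.8) -> bool:
    """Same structure as the title check, but on embeddings of full finding
    bodies: converged when most current vectors sit close to a prior one."""
    matched = sum(1 for c in curr_vecs
                  if any(cosine(c, p) >= threshold for p in prev_vecs))
    return matched / len(curr_vecs) >= min_overlap
```

Divergent titles over the same substance embed close together, and identical titles over different substance embed apart, which is why this fixes both cases the string heuristic misses.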


Tuning the round cap

Max rounds is the hard ceiling. Default is 5. Adaptive break fires somewhere between 2 and the ceiling.

Lower the cap when the review is routine and you want the fastest pass possible: 2 or 3 rounds is usually enough for small, well-understood changes.

Raise the cap when the code is ambiguous, security-sensitive, or touches subtle state: 6 or 7 rounds gives the panel more pressure to resolve genuine disagreements.

Past 7 rounds, the MAD paper's finding about degradation starts to bite. We do not recommend going higher. If your debate is still unresolved at round 7, the models are disagreeing about something genuinely ambiguous, and you want a human to look at it, not another round.
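In practice this tuning amounts to two profiles. The key names below are hypothetical, not the actual strategy-config schema:

```python
# Illustrative profiles -- key names are hypothetical, not the shipped schema.
routine_review = {
    "max_rounds": 2,           # fastest pass; early break may end it sooner anyway
    "provider_timeout_s": 60,  # drop slow providers aggressively
}

high_stakes_review = {
    "max_rounds": 7,            # more room to resolve genuine disagreements;
                                # past 7, escalate to a human instead
    "provider_timeout_s": 120,  # keep slow providers in the panel
}
```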


The separate question of per-provider timeout

Max rounds and per-provider timeout are different controls for different failure modes. Both matter for total latency; they are not interchangeable.

Max rounds bounds the number of debate passes end-to-end. A lower cap means a shorter debate and a faster result. The adaptive early break can fire earlier, but never after the cap.

Per-provider timeout (default 120 seconds) bounds how long any single model call can take within a round. If a provider is slow or stuck, it gets dropped from that round and the debate continues with the remaining panel.

The interaction matters. A round doesn't complete until the slowest provider finishes or times out. A 120-second timeout on a 5-round debate means worst-case wall-clock is roughly 10 minutes before counting moderator synthesis — though in practice most rounds complete in under 30 seconds because streaming starts immediately.
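The worst-case arithmetic above is just rounds times timeout, since each round blocks on its slowest provider:

```python
def worst_case_debate_seconds(max_rounds: int, provider_timeout_s: int) -> int:
    """Upper bound on debate wall-clock before moderator synthesis:
    every round waits for its slowest provider to finish or time out."""
    return max_rounds * provider_timeout_s

print(worst_case_debate_seconds(5, 120) / 60)  # 10.0 -- minutes at the defaults
```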

If you're latency-sensitive, pull both levers: lower max rounds and lower the per-provider timeout. If you're quality-sensitive, keep rounds at 5+ and keep the timeout high enough that slow providers aren't dropped.


Key takeaways

  • The MAD paper (Liang et al., 2023) shows debate quality peaks at 2-4 rounds and degrades past that. Forced rounds past convergence invent disagreements that aren't real.
  • Convergence means substantive agreement on individual findings, not identical text. Sampling variance alone prevents byte-identical responses.
  • Joint Chiefs detects convergence with a cheap title-similarity heuristic between rounds. It is right often enough to justify the cost savings.
  • The heuristic has known limits: titles can diverge while substance converges, and vice versa. Both edge cases are documented; an embedding-based replacement is on the roadmap.
  • Max rounds defaults to 5. Lower for routine changes; raise for ambiguous or high-stakes code. Past 7, you want a human, not another round.
  • Max rounds and per-provider timeout are separate controls. Max rounds bounds the debate length; timeout bounds a single model call. Both affect latency differently.

Frequently asked questions

How many rounds of debate are actually useful?

The Multi-Agent Debate paper reports useful convergence in two to four rounds across the tasks they measured. Past that, models frequently reach for new disagreements that are not substantively real. Joint Chiefs caps at five rounds by default and breaks earlier when convergence is detected.

Why does forced debate past convergence hurt?

Once panels agree on the real findings, additional rounds pressure models to produce something new. The new output is not new signal — it is models satisfying the prompt by generating fresh-sounding disagreement on smaller and smaller issues. Quality degrades, latency grows, cost goes up. Stopping at real convergence avoids all three.

What counts as convergence in Joint Chiefs?

A title-similarity heuristic: if the set of finding titles from a round is sufficiently similar to the prior round, positions are treated as converged and the debate breaks. It is a heuristic — known to miss cases where titles diverge while substance converges (and the opposite). It is cheap, fast, and usually right.

Can I change the maximum number of rounds?

Yes. Max rounds lives in the strategy config, default 5. Lower it to 2 or 3 for routine changes where you want a fast pass. Raise it to 6 or 7 for ambiguous or high-stakes code where you want more pressure on the panel to resolve disagreements. The adaptive early break still fires regardless of the ceiling.

Is the per-provider timeout the same as max rounds?

No — they are separate controls. Max rounds bounds the total number of debate passes. The per-provider timeout (default 120 seconds) bounds how long any single model call can take within a round. Both matter for total latency but they fail differently: a round cap produces a shorter debate, a timeout drops a model from a round.

Does adaptive termination change the final output?

Usually not. If positions have stabilized, breaking at round three versus round five produces the same synthesis in most cases — the moderator reads the same converged findings either way. The savings are latency and cost, not a different answer. The cases where it matters are ambiguous reviews that would have continued to shift, where a too-eager break freezes the panel before real convergence.