Why you want to stop early

Liang et al.'s Multi-Agent Debate paper (arXiv:2305.19118) is the clearest published evidence that multi-model debate beats single-model reflection. It's also the source of a subtler finding: debate helps in the first two to four rounds and then stops helping. In some tasks, quality actively degrades past that.

The mechanism is easy to describe. A debate prompt asks models to review the prior round's findings and respond — agree, challenge, or revise. Once the real disagreements have been resolved, the prompt is still there. Models don't refuse it. They produce output that satisfies the shape of the prompt: fresh-sounding disagreement on smaller and smaller issues, nitpicks promoted to the level of the original findings, speculative edge cases that were not raised when they would have mattered.

The output looks like debate. It isn't. It's the panel dutifully generating disagreement because disagreement was asked for, not because disagreement exists.

Three things go wrong if you let it run:

  • Quality degrades: invented disagreements dilute the real findings.
  • Latency grows: every forced round waits on a full panel pass.
  • Cost goes up: every extra round is another set of model calls.

Adaptive early break detects real convergence and stops the loop. That's the whole mechanism, and it's sufficient.


What "convergence" means here

Convergence doesn't mean the models produced identical text. It means their positions stopped moving — the same set of findings, in the same rough shape, survives from one round to the next.

This distinction matters. Models will never produce byte-identical responses across rounds; sampling variance alone prevents it. Demanding exact agreement would mean convergence never fires and the debate always runs to max rounds, defeating the point.

What you want is substantive agreement on individual findings. Panel A flagged a null-pointer risk in handleAuth(). Panel B flagged the same thing, slightly reworded. That's convergence on that finding. Multiply across the full list, and when most findings look stable round over round, the debate is done.

The hard part is detecting "most findings look stable" cheaply, without asking the moderator to read both rounds and decide — which is itself another round of expensive inference.


The title-similarity heuristic

Joint Chiefs uses a title-similarity heuristic. Each finding has a short title ("null dereference in handleAuth", "unbounded recursion in walkTree"). Between rounds, Joint Chiefs compares the sets of titles across providers. If the sets are sufficiently similar — most titles match or near-match — positions are treated as converged and the debate breaks.

This is a deliberately cheap check. No extra LLM call, no embedding pass at runtime, just string-level comparison on a small set of short strings. It runs in milliseconds between rounds and has no impact on the debate budget.
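The shipped check isn't reproduced here, but its shape can be sketched with stdlib string matching. The function names and both thresholds below are illustrative, not Joint Chiefs' actual values:

```python
from difflib import SequenceMatcher

def title_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Two titles 'near-match' when their normalized similarity clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def titles_converged(prev: list[str], curr: list[str], min_overlap: float = 0.8) -> bool:
    """Treat positions as converged when most titles in the current round
    match or near-match some title from the prior round."""
    if not curr:
        return True
    matched = sum(1 for t in curr if any(title_match(t, p) for p in prev))
    return matched / len(curr) >= min_overlap

prev = ["null dereference in handleAuth", "unbounded recursion in walkTree"]
curr = ["null deref in handleAuth", "unbounded recursion in walkTree"]
print(titles_converged(prev, curr))  # True: both titles match or near-match
```

The whole check is two nested loops over a handful of short strings, which is why it costs effectively nothing between rounds.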

It is also — and we want to be honest about this — a heuristic. It gets two classes of cases wrong. These are in KNOWN-ISSUES because we haven't solved them.

Case 1: Titles diverge while substance converges. Three providers agree that handleAuth() has a null-pointer risk, but each titles their finding differently — "null deref", "missing nil check", "unsafe force-unwrap in auth path". The substance is the same. The titles are not similar enough to trigger early break. The debate runs an extra round or two past real convergence. Cost: wasted calls. Quality impact: zero to slightly negative.

Case 2: Titles converge while substance diverges. Two providers flag "concurrency issue in cache" and mean two different things — one is flagging a race on the write path, the other is flagging lock ordering on the read path. The titles match. The heuristic breaks early. The moderator still reads both bodies during synthesis and can often merge or distinguish them, but a real disagreement got compressed into one apparent agreement. Quality impact: small but non-zero.
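Both failure cases show up directly in the similarity scores. A small illustration, using difflib's ratio as a stand-in for whatever matcher the heuristic actually uses:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Case 1: same substance, divergent titles. The scores stay below a
# 0.8 near-match threshold, so early break does not fire.
print(similarity("null deref", "missing nil check"))
print(similarity("null deref", "unsafe force-unwrap in auth path"))

# Case 2: divergent substance, identical titles. The score is 1.0,
# so the heuristic wrongly treats two different findings as one.
print(similarity("concurrency issue in cache", "concurrency issue in cache"))  # 1.0
```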

A future version will likely use embedding-based similarity on the full finding bodies, which handles both cases better at the cost of an extra inference pass per round. We haven't shipped it because the current heuristic is right often enough that the added cost isn't justified yet. That will change when we have evidence that the noise from these edge cases is material.
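If that replacement ships, the check would keep the same shape, just on vectors instead of strings. A sketch under stated assumptions: the embedding model is unspecified (the vectors here are stand-ins) and both thresholds are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def bodies_converged(prev_vecs: list[list[float]],
                     curr_vecs: list[list[float]],
                     threshold: float = 0.9,
                     min_overlap: float = 0.8) -> bool:
    """Same structure as the title check, but on embeddings of full finding
    bodies: converged when most current vectors sit close to a prior one."""
    matched = sum(1 for c in curr_vecs
                  if any(cosine(c, p) >= threshold for p in prev_vecs))
    return matched / len(curr_vecs) >= min_overlap
```

Divergent titles over the same substance embed close together, and identical titles over different substance embed apart, which is why this fixes both cases the string heuristic misses.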


Tuning the round cap

Max rounds is the hard ceiling. Default is 5. Adaptive break fires somewhere between 2 and the ceiling.

Lower the cap when the review is routine and you want the fastest pass possible: 2 or 3 rounds is usually enough for small, well-understood changes.

Raise the cap when the code is ambiguous, security-sensitive, or touches subtle state: 6 or 7 rounds gives the panel more pressure to resolve genuine disagreements.

Past 7 rounds, the MAD paper's finding about degradation starts to bite. We do not recommend going higher. If your debate is still unresolved at round 7, the models are disagreeing about something genuinely ambiguous, and you want a human to look at it, not another round.
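In practice this tuning amounts to two profiles. The key names below are hypothetical, not the actual strategy-config schema:

```python
# Illustrative profiles -- key names are hypothetical, not the shipped schema.
routine_review = {
    "max_rounds": 2,           # fastest pass; early break may end it sooner anyway
    "provider_timeout_s": 60,  # drop slow providers aggressively
}

high_stakes_review = {
    "max_rounds": 7,            # more room to resolve genuine disagreements;
                                # past 7, escalate to a human instead
    "provider_timeout_s": 120,  # keep slow providers in the panel
}
```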


The separate question of per-provider timeout

Max rounds and per-provider timeout are different controls for different failure modes. Both matter for total latency; they are not interchangeable.

Max rounds bounds the number of debate passes end-to-end. A lower cap means a shorter debate and a faster result. The adaptive early break can fire earlier, but never after the cap.

Per-provider timeout (default 120 seconds) bounds how long any single model call can take within a round. If a provider is slow or stuck, it gets dropped from that round and the debate continues with the remaining panel.

The interaction matters. A round doesn't complete until the slowest provider finishes or times out. A 120-second timeout on a 5-round debate means worst-case wall-clock is roughly 10 minutes before counting moderator synthesis — though in practice most rounds complete in under 30 seconds because streaming starts immediately.
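The worst-case arithmetic above is just rounds times timeout, since each round blocks on its slowest provider:

```python
def worst_case_debate_seconds(max_rounds: int, provider_timeout_s: int) -> int:
    """Upper bound on debate wall-clock before moderator synthesis:
    every round waits for its slowest provider to finish or time out."""
    return max_rounds * provider_timeout_s

print(worst_case_debate_seconds(5, 120) / 60)  # 10.0 -- minutes at the defaults
```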

If you're latency-sensitive, pull both levers: lower max rounds and lower the per-provider timeout. If you're quality-sensitive, keep rounds at 5+ and keep the timeout high enough that slow providers aren't dropped.


Key takeaways

  • The MAD paper (Liang et al., 2023) shows debate quality peaks at 2-4 rounds and degrades past that. Forced rounds past convergence invent disagreements that aren't real.
  • Convergence means substantive agreement on individual findings, not identical text. Sampling variance alone prevents byte-identical responses.
  • Joint Chiefs detects convergence with a cheap title-similarity heuristic between rounds. It is right often enough to justify the cost savings.
  • The heuristic has known limits: titles can diverge while substance converges, and vice versa. Both edge cases are documented; an embedding-based replacement is on the roadmap.
  • Max rounds defaults to 5. Lower for routine changes; raise for ambiguous or high-stakes code. Past 7, you want a human, not another round.
  • Max rounds and per-provider timeout are separate controls. Max rounds bounds the debate length; timeout bounds a single model call. Both affect latency differently.

Frequently asked questions

How many rounds of debate are actually useful?

The Multi-Agent Debate paper reports useful convergence in two to four rounds across the tasks they measured. Past that, models frequently reach for new disagreements that are not substantively real. Joint Chiefs caps at five rounds by default and breaks earlier when convergence is detected.

Why does forced debate past convergence hurt?

Once panels agree on the real findings, additional rounds pressure models to produce something new. The new output is not new signal — it is models satisfying the prompt by generating fresh-sounding disagreement on smaller and smaller issues. Quality degrades, latency grows, cost goes up. Stopping at real convergence avoids all three.

What counts as convergence in Joint Chiefs?

A title-similarity heuristic: if the set of finding titles from a round is sufficiently similar to the prior round, positions are treated as converged and the debate breaks. It is a heuristic — known to miss cases where titles diverge while substance converges (and the opposite). It is cheap, fast, and usually right.

Can I change the maximum number of rounds?

Yes. Max rounds lives in the strategy config, default 5. Lower it to 2 or 3 for routine changes where you want a fast pass. Raise it to 6 or 7 for ambiguous or high-stakes code where you want more pressure on the panel to resolve disagreements. The adaptive early break still fires regardless of the ceiling.

Is the per-provider timeout the same as max rounds?

No — they are separate controls. Max rounds bounds the total number of debate passes. The per-provider timeout (default 120 seconds) bounds how long any single model call can take within a round. Both matter for total latency but they fail differently: a round cap produces a shorter debate, a timeout drops a model from a round.

Does adaptive termination change the final output?

Usually not. If positions have stabilized, breaking at round three versus round five produces the same synthesis in most cases — the moderator reads the same converged findings either way. The savings are latency and cost, not a different answer. The cases where it matters are ambiguous reviews that would have continued to shift, where a too-eager break freezes the panel before real convergence.