The paper in one paragraph

"Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate" by Liang et al., published on arXiv in May 2023 (arXiv:2305.19118, reference implementation on GitHub), proposed and evaluated a protocol they called Multi-Agent Debate (MAD). Two or more LLMs answer a question independently, then exchange and challenge each other's answers across multiple rounds. A judge — either a separate model or a deterministic aggregation rule — writes the final answer based on the full transcript. The paper demonstrates that MAD outperforms both a single LLM and a single LLM doing self-reflection on reasoning and factuality benchmarks, and names the core failure mode of self-reflection: Degeneration of Thought.

That's the claim. The rest of this article explains the experimental setup, the quantitative results, the mechanism, what Joint Chiefs inherits from it, and the important caveats that the paper itself flags.


The experimental setup

The paper runs MAD on two kinds of tasks that are hard for single-model chain-of-thought: commonsense machine translation (Common MT, where the literal translation of an ambiguous source sentence is wrong in context) and counter-intuitive arithmetic reasoning (a small set of math problems whose obvious answer is incorrect). The choice is deliberate. Easy tasks don't discriminate between protocols because every model gets them right. Hard tasks — where models differ — are where a debate layer has room to help.

Each task gets run under three conditions:

  • Direct: a single model answers the question once.
  • Self-reflection: the same model answers, then critiques and revises its own answer.
  • MAD: two debaters exchange and challenge answers across rounds, with a judge writing the final answer from the transcript.

The judge's job in MAD is not to vote. It's to read the full debate and pick the answer with the stronger argument — or synthesize a third answer if neither side is fully right. That detail matters more than the two-debater setup, and it's the part that carries over most directly to code review.
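The round structure is simple enough to sketch. Below is a minimal, hypothetical Python version of the protocol: `run_debate`, the debater callables, and the judge are all stand-ins for chat-completion calls, not the paper's reference implementation.

```python
from typing import Callable

# A debater takes (question, peer answers from the prior round) -> its answer.
Debater = Callable[[str, dict[str, str]], str]

def run_debate(question: str,
               debaters: dict[str, Debater],
               judge: Callable[[str, list[dict[str, str]]], str],
               rounds: int = 3) -> str:
    """One MAD run: independent answers, rounds of exchange, then a judge."""
    # Round 1: each debater answers independently, seeing no peer answers.
    answers = {name: fn(question, {}) for name, fn in debaters.items()}
    transcript = [dict(answers)]
    # Later rounds: each debater sees (and may challenge) the others' answers.
    for _ in range(rounds - 1):
        answers = {
            name: fn(question, {k: v for k, v in answers.items() if k != name})
            for name, fn in debaters.items()
        }
        transcript.append(dict(answers))
    # The judge reads the full transcript and writes the final answer;
    # it does not tally votes.
    return judge(question, transcript)
```

The one structural decision worth noticing: the judge receives the whole transcript, not just the last round, which is what lets it weigh how each position held up under challenge.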


What the paper found

Multi-Agent Debate beats single-model self-reflection

Across the benchmarks the paper tests, MAD produces higher accuracy than either direct answering or self-reflection. The effect is consistent enough that the paper characterizes self-reflection as harmful relative to MAD, not just weaker. A single model that reflects on its own output gets more confident without getting more correct, so self-reflection is worse than useless when a better alternative is available.

The size of the gain depends on the task. On tasks where the base model is already near-ceiling, there's little room to improve. On tasks where the base model is making systematic reasoning errors, MAD can recover a meaningful fraction of those errors by bringing a second viewpoint into the loop. Code review, empirically, behaves like the second kind of task — there's a long tail of subtle issues that single-model review misses.

Convergence is usually fast

The paper notes that the number of useful debate rounds is small — typically two to four before positions either agree or stop moving. Continuing to debate past that point does not improve the final answer and can degrade it by introducing noise. Joint Chiefs' adaptive early break on convergence comes directly from this observation.

The judge matters

The judge's reasoning — not its majority count — is what produces the gain. A weak judge that just picks whichever side wrote more text underperforms a strong judge that reads the arguments and picks the answer with better grounding. This is why MAD is not equivalent to voting, and why Joint Chiefs picks its moderator model explicitly rather than tallying.


Degeneration of Thought, the central mechanism

The paper's core diagnostic contribution is the concept of Degeneration of Thought (DoT). The claim: when a single LLM reflects on its own previous output, the reflection is sampled from the same distribution as the original answer, so the reflection shares the same biases, the same training data priors, and the same mistakes. The model cannot notice what it was not trained to notice. Asking it to "double-check" produces more confident agreement with its earlier self, not a genuine second opinion.

DoT is what makes single-model code review fundamentally limited. The first pass and the self-review pass are drawn from the same well. They share blind spots. The failure mode is not a bug in any one model; it's a property of self-reflection as a mechanism.

The MAD protocol breaks DoT by putting two models with different weights — ideally different architectures and different training corpora — in dialogue. Their blind-spot distributions don't overlap perfectly, so each model can notice what the other missed. The diversity of the participants is what produces the gain, not the participation count.


The four design principles Joint Chiefs inherits

Joint Chiefs implements four principles drawn directly from the paper:

1. Adaptive break on convergence

Debate stops when positions stop moving. The paper observed that forcing continued debate past convergence degrades output quality. Joint Chiefs monitors cross-round agreement on individual findings and terminates when the panel reaches unanimous or near-unanimous positions on every active finding. Max rounds is a safety net, not a target — real debates often converge in two or three rounds.
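As a sketch, the break condition can be as simple as comparing consecutive rounds of positions. The dict-of-stances representation below is an assumption; a real system would compare normalized findings rather than raw strings.

```python
def has_converged(prev_positions: dict[str, str],
                  curr_positions: dict[str, str]) -> bool:
    """True when no debater changed its stance between rounds."""
    return prev_positions == curr_positions

def debate_rounds(step, initial: dict[str, str], max_rounds: int = 5):
    """Run `step` (one debate round) until positions stop moving.

    `max_rounds` is a safety net, not a target: returns (positions, rounds_used).
    """
    positions = initial
    for rnd in range(1, max_rounds + 1):
        nxt = step(positions)
        if has_converged(positions, nxt):
            return nxt, rnd  # adaptive early break
        positions = nxt
    return positions, max_rounds
```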

2. Tit-for-tat engagement

Each round's responses must address the prior round's findings by title — agree, challenge, or revise — not just restate the model's original position. Without this, a "debate" turns into parallel monologues that ignore each other's arguments. The MAD paper's protocol enforces direct engagement, and Joint Chiefs' prompts structure responses so that skipping a prior finding is a visible omission.
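A sketch of the omission check, assuming findings are identified by title. The substring match is a simplification; real prompts would demand structured output and match on it.

```python
def missing_engagements(prior_titles: list[str], response: str) -> list[str]:
    """Prior-round finding titles that a response never mentions.

    Anything returned here is a visible omission the protocol can flag,
    forcing the debater to agree, challenge, or revise instead of monologuing.
    """
    lowered = response.lower()
    return [title for title in prior_titles if title.lower() not in lowered]
```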

3. Multiple independent models (DoT prevention)

The debate panel must include models from different labs and ideally different architectures. A panel of OpenAI + OpenAI fails in the same way a single model reflecting on itself does. Joint Chiefs ships with OpenAI, Gemini, Grok, and Anthropic providers, plus optional Ollama for locally hosted open-weight models: four independent labs, each with its own training data blend, and a fifth slot for whatever open model you run yourself. Per-provider weighting (including weight 0 to exclude a provider) lets you tune which model families dominate without sacrificing the independence property.
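A sketch of the weighting and exclusion rule. The provider names and the flat-weight scheme are illustrative, not Joint Chiefs' actual configuration schema.

```python
def active_panel(weights: dict[str, float]) -> dict[str, float]:
    """Drop weight-0 providers; refuse panels that reintroduce DoT.

    A one-model panel is just self-reflection with extra steps, so it is
    rejected rather than silently allowed.
    """
    panel = {provider: w for provider, w in weights.items() if w > 0}
    if len(panel) < 2:
        raise ValueError("debate needs at least two independent models")
    return panel
```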

4. Judge arbitration

A deciding model reads the full debate transcript and writes the final synthesis. The judge evaluates reasoning quality — a well-argued minority position can override a weakly-justified majority. This is why Joint Chiefs' default consensus mode is moderator decides, and why the anonymized synthesis step strips model identities before the moderator reads the transcript: the judge should weigh arguments, not brands.
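The anonymization step can be sketched in a few lines. The `Debater A`/`Debater B` label format is an assumption, not Joint Chiefs' documented output.

```python
def anonymize(transcript: list[dict[str, str]]) -> list[dict[str, str]]:
    """Replace provider names with stable neutral labels before the judge
    reads the transcript, so arguments are weighed, not brands."""
    names = sorted({name for rnd in transcript for name in rnd})
    alias = {name: f"Debater {chr(ord('A') + i)}" for i, name in enumerate(names)}
    return [{alias[name]: text for name, text in rnd.items()} for rnd in transcript]
```

Sorting the names first makes the labels stable across rounds, so "Debater A" always refers to the same model within one transcript even though the judge never learns which one.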


What MAD does not prove

A few caveats the paper itself names, which matter if you're building on this work:

  • The benchmarks are reasoning and factuality tasks, not code review; the mechanism generalizes, but the specific accuracy numbers don't transfer to other domains.
  • Gains shrink on tasks where the base model is already near-ceiling; debate helps most where there is headroom.
  • More rounds are not more quality: forcing debate past convergence degrades the final answer rather than improving it.

Key takeaways

  • The MAD paper (Liang et al., 2023, arXiv:2305.19118) shows multi-model debate beats both direct answering and single-model self-reflection on reasoning and factuality benchmarks.
  • Degeneration of Thought — the mechanism behind self-reflection's failure — is a property of the protocol, not the model. Fixing it requires independent participants, not better prompts.
  • Joint Chiefs implements the paper's four design principles: adaptive break, tit-for-tat engagement, DoT prevention via diverse models, and judge arbitration on the full transcript.
  • The paper's benchmarks are reasoning-heavy, not code-specific. The mechanism generalizes, but specific accuracy numbers don't transfer directly — caveats apply to any code-review product that claims MAD as evidence.
  • The judge reads arguments, not votes. Majority-voting variants of multi-model review miss the part of the protocol that does the work.

Frequently asked questions

What is Multi-Agent Debate in LLMs?

A protocol where two or more LLMs produce independent answers to the same question, then exchange and challenge each other's answers across multiple rounds. A judge model (or aggregation rule) writes the final answer based on the full debate transcript. Introduced and evaluated in Liang et al.'s 2023 paper.

What did the MAD paper actually prove?

That MAD improves answer accuracy on reasoning and factuality benchmarks compared to a single LLM answering once and a single LLM reflecting on its own answer. It also identified Degeneration of Thought — single-model self-reflection increases confidence without increasing correctness.

Does MAD apply to code review specifically?

The paper's benchmarks are reasoning and factuality, not code review. The mechanism generalizes to any task with non-trivial verifiable correctness, and follow-up work has applied similar debate protocols to programming tasks with comparable gains. But the specific accuracy numbers in the paper don't transfer directly.

How does Joint Chiefs implement MAD?

Four principles: adaptive break (stop when positions converge), tit-for-tat engagement (address prior findings by title), DoT prevention (multiple independent models), judge arbitration (moderator reads the full transcript and synthesizes, weighing reasoning not votes).

Is MAD the same as self-consistency or chain-of-thought?

No. Self-consistency samples multiple chains from the same model and takes a majority. Chain-of-thought asks a single model to reason step by step. MAD uses multiple independent models — different architectures, different training data — that see and challenge each other's outputs. The diversity of participants is what distinguishes it.
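The contrast is visible in code: self-consistency is a majority vote over samples from one model, which is exactly the tally a MAD judge avoids. A minimal sketch:

```python
from collections import Counter

def self_consistency(samples: list[str]) -> str:
    """Self-consistency: sample the SAME model several times, take the mode.

    No sample ever sees, or argues against, another sample, so shared
    blind spots vote together instead of being challenged.
    """
    return Counter(samples).most_common(1)[0][0]
```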

How many debate rounds should I configure?

The paper observed useful convergence in two to four rounds. More does not help and can hurt. Joint Chiefs defaults to a max of five with an adaptive early break — the panel usually stops in round two or three once positions agree.
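As a configuration sketch, the defaults described above look like the fragment below. The key names are assumptions for illustration, not Joint Chiefs' documented schema.

```python
# Hypothetical debate configuration; the values mirror the defaults
# described in the text, the key names do not come from Joint Chiefs.
DEBATE_CONFIG = {
    "max_rounds": 5,               # safety net, rarely reached
    "early_break": True,           # stop when positions converge
    "consensus": "moderator_decides",
    "anonymize_transcript": True,  # judge weighs arguments, not brands
}
```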