Direct answer
Degeneration of Thought (DoT) is the failure mode Liang et al. named in their 2023 Multi-Agent Debate paper (arXiv:2305.19118). When a large language model reflects on its own output, confidence goes up and correctness does not. The model sounds surer without being more right. This is why "ask the model to double-check" does not work, and why single-model code review inherits the writing model's blind spots no matter how many review passes you stack.
The fix is not a better prompt. It is a second model from a different lab, with different training data, looking at the same code. Prompts cannot generate information the model does not have.
What DoT means precisely
The MAD paper introduces DoT as the tendency of a single agent, when asked to reflect on and revise its own output, to converge toward its initial answer rather than improve it. The paper frames this as a core limitation of self-reflection approaches: the reviewer and the reviewed share everything, so the review has nowhere new to stand.
Liang et al. contrast this with a multi-agent debate setup, where two models argue with each other and a third judge reads the exchange. In that setup, a finding that was missed by one model can be raised by the other, and the judge can weigh both. The debate structure provides the diversity that self-reflection cannot.
DoT is not a bug in any specific model. It is a property of single-distribution sampling. You cannot train it away in one model because the problem is the absence of a second one.
The mechanism: same distribution, same priors
A language model produces an answer by sampling from a probability distribution over tokens. That distribution is shaped by training data, architecture, tokenizer, and RLHF preferences. When you ask the same model to review its answer, the review is sampled from the same distribution, shaped by the same things.
Imagine a model that was trained on a corpus where a specific concurrency primitive was rare. It writes a function using that primitive incorrectly. You ask it to review the function. The review is sampled from the same distribution in which that primitive was rare. The pattern the model would need to recognize the bug is not strongly represented. The model shrugs and approves.
Now imagine a second model, trained on a different corpus where that primitive is common. It sees the bug immediately. The first model could not get to this answer no matter how many times you asked it to reconsider, because the information required to get there is not in its distribution. The second model's distribution contains it.
That is the concrete reason self-reflection fails. Not laziness, not prompt design, not inadequate reasoning steps. The reviewer does not have the information needed to find what it missed.
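The arithmetic behind this is simple enough to write down. The sketch below is a toy model, not a measurement: the two probabilities are invented for illustration, standing in for how strongly a bug pattern is represented in each model's training distribution.

```python
# Toy numbers, invented for illustration: the chance that a model's
# review distribution flags one specific bug pattern on a single pass.
P_FLAG_SAME_MODEL = 0.05   # the pattern was rare in the writer's corpus
P_FLAG_OTHER_MODEL = 0.90  # the pattern was common in the other lab's corpus

def miss_probability(p_flag: float, rounds: int) -> float:
    """Chance that `rounds` independent draws from the same review
    distribution all fail to flag the bug: (1 - p_flag) ** rounds."""
    return (1.0 - p_flag) ** rounds

# Ten rounds of self-reflection still miss the bug about 60% of the time...
print(f"self-review x10 misses: {miss_probability(P_FLAG_SAME_MODEL, 10):.2f}")
# ...while one pass by the independently trained model misses it 10% of the time.
print(f"other model x1 misses:  {miss_probability(P_FLAG_OTHER_MODEL, 1):.2f}")
```

Stacking review rounds only shrinks the miss rate at the rate the distribution allows. When the per-pass flag probability is near zero, no number of rounds rescues it; one draw from a distribution that contains the pattern does.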
Empirical signatures of DoT
DoT has two visible fingerprints in real usage.
First: confidence rises across reflection rounds. Ask a model to check its work and its language shifts from "I think" to "I'm confident." Ask it again and it shifts to "this is definitely correct." The language tracks the model's internal sense that it has reviewed the material — but that sense is triggered by the review happening, not by the review finding anything.
Second: correctness stays flat or declines. The MAD paper reports this directly: across factuality and reasoning tasks, self-reflection does not meaningfully improve answers, and sometimes makes them worse when the model talks itself out of a correct initial response. Multi-agent debate, by contrast, measurably improves accuracy in the paper's evaluations.
These two signatures combine into the most dangerous pattern in AI code review: a model that sounds certain about a wrong answer. The certainty is not evidence. It is the signature of DoT.
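The two fingerprints can be caricatured in a few lines of code. This is a deliberately crude toy, not an experiment: reported confidence is a function of how many review rounds have run, while correctness is fixed at the start and never consulted, because every round draws on the same distribution that produced the answer.

```python
def reflect(answer_correct: bool, rounds: int) -> list[tuple[str, bool]]:
    """Toy self-reflection loop: each round escalates the confidence
    language, but correctness never changes."""
    confidence_words = ["I think", "I'm confident", "This is definitely correct"]
    history = []
    for r in range(rounds):
        phrase = confidence_words[min(r, len(confidence_words) - 1)]
        history.append((phrase, answer_correct))
    return history

# A wrong answer, reviewed three times, ends up sounding certain.
for phrase, correct in reflect(answer_correct=False, rounds=3):
    print(f"{phrase!r:38} correct={correct}")
```

The point of the caricature: confidence is driven by the round counter, correctness by an input the loop never touches. Reading the output, only the first column changes.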
Why sibling models inherit it
People sometimes try to work around DoT by using a second model from the same lab — GPT-4 reviewing GPT-5's output, or Claude 3.5 reviewing Claude 4. This is better than literal self-reflection, but not by much.
Sibling models share:
- Most of their pretraining corpus.
- Most of their tokenizer and architectural choices.
- The RLHF signal from the same lab, often from the same annotators.
- The house preferences of the lab — what constitutes "good writing," "safe refusal," "proper hedging."
What varies: scale, some fine-tuning data, post-training. Useful, but not enough to escape shared blind spots. If the corpus under-represented a specific concurrency primitive, both the small sibling and the large one under-represent it. The blind spot is a property of the corpus, not the scale.
The real fix is a model from a different lab, trained on a different corpus, with different RLHF pipelines. Different labs get different things wrong. That's the whole value.
What this means for daily AI coding
Most AI coding workflows today contain a latent DoT loop. You ask a model to write code. The model writes it. You ask it to review it, or you accept it and later ask the same model to debug it when something breaks. Every step of that loop is the same distribution reviewing its own output.
The symptom is familiar. The model confidently approves a diff. The bug ships. You return to the model with the bug report. It confidently proposes a fix that changes the wrong thing. You iterate for forty minutes. Eventually you paste the code into a different tool and a different model sees the issue in thirty seconds.
That second model was not smarter. It was drawing from a different distribution. That is the only thing it had over the first one, and it was sufficient.
DoT is not an edge case. It is the default outcome of any workflow where a single model writes, reviews, and debugs its own code. If you are using AI to help you ship, you are running this loop. Knowing it exists is the first step to getting out of it.
The fix is structural, not prompt-level
Prompts cannot fix DoT. You can tell a model to be more critical, to reconsider, to think step by step, to imagine it's a senior engineer — none of it introduces new information. It rearranges the model's existing probability mass.
The structural fix is independent models. Not bigger models, not better-prompted models, not fine-tuned reviewer models. Models from different labs with different training corpora reviewing the same code, then surfacing disagreement.
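The shape of that structural fix fits in a short harness. The two reviewer functions below are stand-ins for calls to independently trained providers (a real integration would use each provider's own SDK; the names and findings here are hypothetical). What matters is the structure: collect findings from each reviewer, then surface the disagreement instead of averaging it away.

```python
from typing import Callable

Finding = str
Reviewer = Callable[[str], set[Finding]]

# Hypothetical stand-ins for two independently trained models.
# In practice these would be API calls to different providers.
def reviewer_a(diff: str) -> set[Finding]:
    findings = set()
    if "eval(" in diff:
        findings.add("use of eval on untrusted input")
    return findings

def reviewer_b(diff: str) -> set[Finding]:
    findings = set()
    if "eval(" in diff:
        findings.add("use of eval on untrusted input")
    if "threading" in diff and "Lock" not in diff:
        findings.add("shared state mutated without a lock")
    return findings

def cross_review(diff: str, reviewers: list[Reviewer]) -> dict:
    """Split findings into consensus (every reviewer flagged it) and
    disagreement (only some did). Disagreement is the signal worth
    a human's attention, because it marks where distributions differ."""
    all_findings = [review(diff) for review in reviewers]
    consensus = set.intersection(*all_findings)
    union = set.union(*all_findings)
    return {"consensus": consensus, "disagreement": union - consensus}

diff = "import threading\nresult = eval(user_input)\ncounter += 1"
print(cross_review(diff, [reviewer_a, reviewer_b]))
```

Note the design choice: the harness never resolves disagreement itself. A finding only one reviewer raised is exactly the kind of thing self-reflection could never surface, so it gets escalated rather than discarded.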
Joint Chiefs implements this directly. Five providers — OpenAI, Gemini, Grok, Anthropic, and optional Ollama — review in parallel, then debate through up to five rounds with an adaptive early break when positions converge. Findings are anonymized before the final synthesis so the moderator judges arguments, not brands. Four consensus modes — moderator-decides, strict majority, best-of-all, voting threshold with per-provider weighting — let you tune how disagreement resolves. The moderator defaults to Claude and reads the full debate before writing the final output.
None of this is exotic. It is the straightforward implementation of what the MAD paper argued for in 2023. The research is settled. The remaining question is how much friction you're willing to tolerate on the way to using it. Download Joint Chiefs or read the MCP setup guide to drop it into any MCP-aware host.
Key takeaways
- Degeneration of Thought is the MAD paper's name for what happens when a model reviews its own output: confidence rises, correctness doesn't.
- The cause is structural. Self-review is a second draw from the same distribution; it cannot reach information the model was not trained on.
- Sibling models from the same lab share most of their blind spots. They are a weak substitute for genuine independence.
- Most AI coding workflows today contain a DoT loop by default — the same model writes, reviews, and debugs.
- The fix is a second independent model, not a better prompt. Prompts cannot generate information the model does not have.
Frequently asked questions
What is Degeneration of Thought?
Degeneration of Thought (DoT) is a failure mode named in Liang et al.'s 2023 Multi-Agent Debate paper. It describes what happens when a single large language model reflects on its own output: confidence tends to increase, but correctness does not. The model sounds surer of itself without becoming more correct.
Why does asking a model to double-check not work?
Because the self-critique is sampled from the same probability distribution that produced the original answer. Same priors, same training data, same blind spots. The model cannot notice what it was not trained to notice. A second pass with the same model is a second draw from the same well.
Does DoT also apply across sibling models from the same lab?
Mostly yes. Sibling models share training data, architecture choices, and RLHF signal. They tend to share blind spots. A sibling review catches more than literal self-review, but less than a review by a model from an independent lab with an independent training corpus.
What does DoT mean for my day-to-day AI coding workflow?
If you ask the same model to write code and then to review it, you are performing self-reflection and inheriting DoT. The output will sound confident. That confidence is not evidence. To get a real check, you need a second model from a different lab reviewing the same diff, with a mechanism for surfacing disagreement.
Can better prompting fix DoT?
No. Prompts can nudge a model to hedge more or to reconsider more aggressively, but they cannot generate information the model does not have. The fix is structural, not linguistic — a second independent model, not a better instruction.
How does Joint Chiefs avoid DoT?
By running five independent providers — OpenAI, Gemini, Grok, Anthropic, and optional Ollama — in parallel, then putting their findings into a structured debate across multiple rounds. Disagreement is the signal. A moderator reads the full debate and writes the final synthesis based on reasoning, not on a single model's self-assessment.