Contents
Direct answer
A model can't review its own code. Not reliably. It just agrees with itself louder. Liang et al. named the failure in their 2023 Multi-Agent Debate paper (arXiv:2305.19118) and called it Degeneration of Thought. When a model reflects on its own output, confidence goes up and correctness doesn't. The model sounds surer without being more right.
That's why "ask the model to double-check" doesn't work. And it's why stacking more review passes from the same model doesn't help. The fix is not a better prompt. The fix is a second model from a different lab, trained on different data, looking at the same code. Prompts can't generate information the model doesn't have.
What DoT means precisely
The MAD paper describes DoT as what happens when a single model reflects on its own output and drifts toward its original answer instead of improving it. The reviewer and the reviewed share everything. The review has nowhere new to stand.
Liang et al. contrast this with a debate setup. Two models argue. A third judge reads the exchange. A finding one model missed can be raised by the other, and the judge weighs both. The debate structure provides the diversity that self-reflection can't.
DoT is not a bug in any specific model. It's a property of sampling one distribution twice. You can't train it out of one model because the problem is the absence of a second one.
The mechanism: same distribution, same priors
A model produces an answer by sampling from a probability distribution over tokens. That distribution is shaped by training data, architecture, tokenizer, and RLHF. Ask the same model to review its answer and the review is sampled from the same distribution, shaped by the same things.
Imagine a model trained on a corpus where a specific concurrency primitive was rare. It writes a function using that primitive incorrectly. You ask it to review the function. The review is sampled from the same thin distribution. The pattern the model would need to spot the bug is barely represented. The model shrugs and approves.
Now imagine a second model — trained on a different corpus where that primitive is common. It sees the bug immediately. The first model could not get to this answer no matter how many times you asked it to reconsider. The information isn't in its distribution. It is in the second model's.
That's the concrete reason self-reflection fails. Not laziness. Not prompt design. Not inadequate reasoning steps. The reviewer doesn't have the information needed to find what it missed.
Empirical signatures of DoT
DoT has two visible fingerprints in real usage.
First: confidence rises across reflection rounds. Ask a model to check its work and it shifts from "I think" to "I'm confident." Ask again and it shifts to "this is definitely correct." The language tracks the model's sense that it has reviewed the material. That sense is triggered by the review happening, not by the review finding anything.
Second: correctness stays flat or declines. The MAD paper reports this directly. Across factuality and reasoning tasks, self-reflection does not meaningfully improve answers — and sometimes makes them worse when the model talks itself out of a correct initial response. Multi-agent debate measurably improves accuracy in the same evaluations.
These two combine into the most dangerous pattern in AI code review: a model that sounds certain about a wrong answer. The certainty is not evidence. It's the signature of DoT.
Why sibling models inherit it
People try to work around DoT by using a second model from the same lab. GPT-4 reviewing GPT-5. Claude 3.5 reviewing Claude 4. That's better than literal self-review. Not by much.
Sibling models share:
- Most of their pretraining corpus.
- Most of their tokenizer and architectural choices.
- The RLHF signal from the same labs and often the same annotators.
- The house preferences of the lab — what counts as "good writing," "safe refusal," "proper hedging."
What varies: scale, some fine-tuning data, post-training. Useful, but not enough to escape shared blind spots. If the corpus under-represented a specific concurrency primitive, the small sibling and the large one both under-represent it. The blind spot is a property of the corpus, not the scale.
The real fix is a model from a different lab, trained on a different corpus, with a different RLHF pipeline. Different labs get different things wrong. That's the whole value.
What this means for daily AI coding
Most AI coding workflows contain a hidden DoT loop. You ask a model to write code. The model writes it. You ask the same model to review it — or you ship it and later ask the same model to debug it when something breaks. Every step is the same distribution reviewing its own output.
The symptom is familiar. The model approves the diff. The bug ships. You come back with the bug report. The model confidently proposes a fix that changes the wrong thing. You iterate for forty minutes. Eventually you paste the code into a different tool, a different model reads it, and the issue is obvious in thirty seconds.
That second model wasn't smarter. It was drawing from a different distribution. That's the only thing it had on the first one, and it was enough.
DoT is not an edge case. It's the default outcome of any workflow where one model writes, reviews, and debugs its own code. If you ship with AI, you are running this loop. Knowing it exists is the first step to getting out of it.
The fix is structural, not prompt-level
Prompts can't fix DoT. You can tell a model to be more critical, to reconsider, to think step by step, to imagine it's a senior engineer. None of that introduces new information. It just rearranges the model's existing probability mass.
The fix is independent models. Not bigger models. Not better-prompted models. Not fine-tuned reviewer models. Models from different labs with different training data, reviewing the same code, and surfacing disagreement.
Joint Chiefs does this directly. Five providers — OpenAI, Gemini, Grok, Anthropic, and optional Ollama — review in parallel, then debate through up to five rounds with an adaptive early break when positions converge. Findings are anonymized before the final synthesis so the moderator judges arguments, not brands. Four consensus modes — moderatorDecides, strictMajority, bestOfAll, and votingThreshold with per-provider weights from 0.0 to 3.0 — let you tune how disagreement resolves. Claude is the default moderator and reads the full debate before writing the final output.
None of this is exotic. It is the straightforward version of what the MAD paper argued for in 2023. The research is settled. The only open question is how much friction you'll put up with on the way to using it. Download Joint Chiefs or read the MCP setup guide to drop it into any MCP-aware host.
Key takeaways
- Degeneration of Thought is the MAD paper's name for what happens when a model reviews its own output. Confidence rises. Correctness doesn't.
- The cause is structural. Self-review is a second draw from the same distribution. It can't reach information the model was never trained on.
- Sibling models from the same lab share most of their blind spots. A weak substitute for real independence.
- Most AI coding workflows today contain a DoT loop by default. One model writes, reviews, and debugs.
- The fix is a second independent model. Not a better prompt. Prompts can't generate information the model doesn't have.
Frequently asked questions
What is Degeneration of Thought?
It's the name Liang et al. gave to a specific failure in their 2023 Multi-Agent Debate paper. When you ask one model to reflect on its own answer, its confidence goes up. Its correctness does not. The model sounds surer of itself without being surer on the things that actually matter.
Why does asking a model to double-check not work?
Because the second pass is drawn from the same distribution as the first. Same training data. Same priors. Same blind spots. A model can't notice what it was never trained to notice. Re-asking isn't a review — it's a second sample from the same well.
Does DoT also apply across sibling models from the same lab?
Mostly, yes. Sibling models share training data, architecture, and most of the RLHF signal. They inherit the same blind spots. A sibling review catches more than literal self-review, but nowhere near what a model from a different lab catches.
What does DoT mean for my day-to-day AI coding workflow?
If one model writes your code and reviews it, you are running a DoT loop. The output will sound confident. That confidence is not evidence. To get a real check you need a second model from a different lab on the same diff — and a way to surface disagreement.
Can better prompting fix DoT?
No. A prompt can make a model hedge more or reconsider harder. It can't give the model information it was never trained on. The fix is structural — a second independent model. Not a better instruction.
How does Joint Chiefs avoid DoT?
Five independent providers — OpenAI, Gemini, Grok, Anthropic, and optional Ollama — review the same code in parallel. Their findings go into a structured debate across up to five rounds. Disagreement is the signal. A moderator reads the full anonymized transcript and writes the final answer based on arguments, not on any one model's self-assessment.