Why there is more than one mode

A multi-model debate produces a pile of findings with overlaps, disagreements, and near-duplicates. Somebody has to collapse that pile into one report. How you collapse it is the consensus mode.

There is no neutral answer. Every rule trades something — precision for recall, confidence for coverage, speed for nuance. The right choice depends on what you're doing with the output. A pre-merge check on a small PR wants a short, high-confidence list. A security audit on a payments module wants every angle, even the one only one model mentioned.

Joint Chiefs ships four modes. They cover the useful ground. Set one as the default and override per-review when the stakes change.


moderatorDecides — the default

The judge reads everything and writes the answer. Default. Usually right.

The moderator — Claude by default — reads the full anonymized transcript across all rounds and writes the final synthesis. Findings get kept, merged, reworded, or dropped based on how strong the argument is. Not the count of votes.

What it does well. It matches what the MAD literature (Liang et al., 2023) shows beats both single-model review and plain voting. A well-argued minority finding can survive. A three-model majority built on hand-waving can be trimmed. Reasoning wins.

When it's right. Default reviews. Anything where you want the reviewer's judgment, not a mechanical tally. Most PR feedback sits here.

When it's wrong. When you don't trust the moderator's priors for the specific file type. Or when you need a report that can be explained mechanically to a compliance auditor. The reasoning is in the transcript, but the final call is still one model's judgment.

Failure mode. If the moderator is miscalibrated for a domain, its synthesis inherits that miscalibration. Asking Claude to moderate a review of a language Claude trains weakly on can leave you worse off than strict majority. Weighting and the occasional moderator swap are the answers.


strictMajority — quiet and ruthless

A finding only sticks if most models flagged it. Safe. Misses real one-model catches.

More than half of the panel has to raise a finding (or a mergeable variant of it) in the final round for it to survive. Equal weights. No moderator override.

What it does well. Short, high-precision reports. Every surviving finding has cross-model corroboration, so false positives are rare. Easy to explain to a human reviewer: "three of four models flagged this."

When it's right. Routine PRs where you want a tight pass/fail signal. Smoke tests. Noise-averse contexts where you'd rather miss a rare bug than investigate a false alarm.

When it's wrong. Security-sensitive code. Novel codebases. Anything with subtle concurrency or state. These are the exact cases where the important bug is the one only one model noticed — because only that model's training distribution covered the territory. Strict majority drops that finding by construction.

Failure mode. Silent false negatives. The report looks clean because the minority-caught real bug never appears. You don't know what you missed unless you also run another mode and diff the outputs.


bestOfAll — the exhaustive union

Union of findings. Noisy. Good when you want to eyeball everything.

Every finding from every model is kept. The moderator's only job is to de-duplicate near-identical entries by title and merge their reasoning so the report isn't repetitive. No voting. No filter on count.

What it does well. Maximum recall. If any model thought it was worth flagging, you see it. Useful when you already suspect a bug exists and want every angle.

When it's right. Debugging where the bug is real but the cause is unknown. Security reviews on critical code paths. First pass on legacy code you've never seen before. Anywhere reading a long list is cheaper than missing an issue.

When it's wrong. Day-to-day PR review. The noise floor is high and human reviewers get fatigued scrolling through four models' worth of nits. You lose the signal of cross-model agreement because everything survives regardless.

Failure mode. Alert fatigue. After a few long bestOfAll reports with many false positives, reviewers start skimming and miss the real finding buried in the middle. Use it on purpose, not by default.


votingThreshold — tunable weighted vote

Like majority but with weights. Use it when you trust some models more than others.

Each finding gets a weighted vote share — the sum of the weights of providers that raised it, divided by the total panel weight. Findings at or above a configured threshold survive. Threshold and per-provider weights are both configurable.

Weights run 0.0 to 3.0. A 0.0 weight means "keep this provider in the debate, but don't count its vote." A 2.0 weight doubles its contribution. A 3.0 is close to a soft veto in favor.

What it does well. Tunable precision/recall. A 0.3 threshold is close to bestOfAll. A 0.5 threshold with equal weights is close to strictMajority. A 0.4 threshold with Claude at 2.0 on security files gives security-sensitive catches a louder voice without removing the other models from the conversation.

When it's right. When you have an informed opinion about which models you trust for which code. When the filter needs to be explainable numerically. When you want to slide the cutoff without swapping the whole mode.

When it's wrong. When you don't have those priors. Miscalibrated threshold plus miscalibrated weights produces a filter that looks principled but isn't. It's easier to defend "the moderator said so" than to defend "Claude at 1.8 and Grok at 1.2 with a 0.45 threshold said so" when you picked the numbers by feel.

Failure mode. False precision. The fact that you can express a weighting in decimals doesn't mean your decimals are right. See the guide to per-provider weighting for sensible defaults.


Comparison table

Mode Best for Worst for Failure mode
moderatorDecides Default PR review. Judgment calls. Argument quality over vote counts. Domains the moderator is weakly trained on. Compliance reports needing mechanical explanation. Inherits moderator miscalibration on unfamiliar territory.
strictMajority Routine PRs. Smoke tests. High-precision, noise-averse workflows. Security-sensitive code. Concurrency. Novel codebases with rare bugs. Silent false negatives — the minority-caught real bug disappears.
bestOfAll Debugging sessions. First pass on legacy code. Maximum recall. Day-to-day PRs. Anywhere reviewer fatigue matters. Alert fatigue — the real finding buried in noise.
votingThreshold Informed weighting. Tunable precision/recall. Explainable numeric filters. Teams without priors on which models to trust where. False precision — principled-looking numbers picked by feel.

How to pick

Start with moderatorDecides. It's the default because it's the right default.

Switch to strictMajority if your daily reviews are too noisy and you'd rather miss occasional rare catches than wade through nits. Switch to bestOfAll when you're actively debugging or auditing and you want every model's full list. Switch to votingThreshold only when you have a real opinion about why certain providers should outweigh others on certain files — otherwise you're reinventing strictMajority with worse documentation.

A pattern that works: moderatorDecides in CI, bestOfAll when you're about to ship something risky, strictMajority when you're triaging a backlog. The CLI flag and the MCP tool both take the mode as an argument, so overriding per-invocation is cheap.


Key takeaways

  • Consensus mode is the rule that collapses a multi-model debate into one report. Every rule trades precision for recall differently.
  • moderatorDecides is the default. The judge reads everything and writes the answer. Argument quality, not vote count.
  • strictMajority is precision-first. Findings survive only with majority corroboration. Quiet. Drops minority-caught real bugs by construction.
  • bestOfAll is recall-first. Keeps every finding, de-dupes by title. Exhaustive and noisy — right for debugging and audits, wrong for daily review.
  • votingThreshold is the tunable option. Weighted per-provider votes against a configurable cutoff. Powerful when you have priors. Deceptive when you don't.
  • The mode is per-invocation. Change it on the CLI or via the MCP tool when the stakes change. Don't lock yourself into one setting for every review.

Frequently asked questions

What is the default consensus mode in Joint Chiefs?

moderatorDecides. Claude reads the full anonymized debate transcript and writes the final answer based on argument quality, not vote count. It's the default because it matches what the MAD literature shows beats both single-model review and vote-based review.

When should I use strict majority instead of moderator-decides?

When you want a quiet, high-precision report and you can live with dropping rare findings. Strict majority only keeps things most of the panel raised. False positives are rare. Minority-caught real bugs disappear. Good for end-of-PR sanity checks. Bad for security-sensitive code.

What does best-of-all actually do?

It keeps every finding from every model. The moderator only de-duplicates near-identical entries by title and merges their reasoning. You see everything any model thought was worth flagging. Exhaustive by design. Noisy by consequence. Use it when you'd rather read a long list than miss something.

How does voting threshold differ from strict majority?

Voting threshold is tunable. You set the survival cutoff anywhere between 0 and 1 and apply per-provider weights from 0.0 to 3.0 before the vote. Strict majority is a fixed 50%+ rule with equal weights. A 0.4 threshold with Claude at 2.0 on security files is a very different filter from a plain majority.

Can I change consensus modes per review?

Yes. Consensus mode lives in the strategy config and can be overridden per-invocation from the CLI or via the MCP tool arguments. No rebuild. No restart. Just pass a different mode when the stakes change.

Which consensus mode is correct?

No mode is universally correct. Each trades precision for recall differently. moderatorDecides is the right default for most reviews. Swap to strict majority when you want a quiet report and can afford to miss rare catches. Swap to best-of-all when you're chasing a real bug and want every angle. Swap to voting threshold when you trust some models more than others for specific files.