Why there is more than one mode

A multi-model debate produces a stack of findings with overlaps, disagreements, and near-duplicates. Somebody has to collapse that stack into one report. The rule for how you collapse it is the consensus mode.

There is no neutral answer. Every collapse rule trades something: precision for recall, confidence for coverage, speed for nuance. The choice depends on what you plan to do with the output. A pre-merge sanity check on a small PR wants a short, high-confidence list. A security audit on a payments module wants every angle, even the ones only one model mentioned.

Joint Chiefs ships four modes because those four cover the useful ground. You set one as the default and override per-review when the stakes change.


moderatorDecides — the default

The moderator — Claude by default — reads the full anonymized debate transcript across all rounds and writes the final synthesis. Findings are kept, merged, reworded, or dropped based on the strength of the argument, not the count of votes.

What it does well. It most closely matches the outcome the Multi-Agent Debate literature (Liang et al., 2023) shows beating both single-model review and simple voting. A well-argued minority finding can survive; a three-model majority built on hand-waving can be trimmed. Reasoning wins.

When it's right. Default reviews. Anything where you want the reviewer's judgment, not a mechanical tally. Most PR feedback sits here.

When it's wrong. When you don't trust the moderator's priors for the specific file type, or when you need a report that can be explained mechanically to a compliance auditor. The moderator's reasoning is visible in the transcript, but the final decision is, ultimately, one model's judgment.

Failure mode. If the moderator is miscalibrated for a given domain, its synthesis inherits that miscalibration. Asking Claude to moderate a review of a language or framework Claude trains weakly on can leave you worse off than strict majority. Weighting and the occasional moderator swap are the answers.


strictMajority — quiet and ruthless

A finding survives only if more than half of the panel raised it (or a mergeable variant of it) in the final round. Equal weights across providers. No moderator override.
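The rule above can be sketched in a few lines. This is an illustrative sketch, not Joint Chiefs' actual internals; the type and function names are assumptions.

```typescript
// Illustrative sketch of the strictMajority rule; names are assumptions,
// not Joint Chiefs' real implementation.
type Finding = { title: string; raisedBy: string[] };

function strictMajority(findings: Finding[], panelSize: number): Finding[] {
  // A finding survives only with strictly more than half the panel behind it.
  return findings.filter((f) => f.raisedBy.length > panelSize / 2);
}

// On a four-model panel, 2 of 4 is not "more than half": a finding needs 3 votes.
const report = strictMajority(
  [
    { title: "SQL injection in query builder", raisedBy: ["a", "b", "c"] },
    { title: "Race condition in cache warm-up", raisedBy: ["d"] },
  ],
  4,
);
// report keeps only the corroborated SQL injection finding
```

Note the strict inequality: on an even-sized panel, a tie is a drop.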

What it does well. Produces short, high-precision reports. Every surviving finding has cross-model corroboration, so false positives are rare. Easy to explain to a human reviewer: "three of four models flagged this."

When it's right. Routine PRs where you want a tight pass/fail signal. Smoke tests. Noise-averse contexts where you'd rather miss a rare bug than investigate a false alarm.

When it's wrong. Security-sensitive code, novel codebases, and anything with subtle concurrency or state. These are exactly the cases where the important bug is often the one only one model noticed — because only one model's training distribution covered the territory where it lives. Strict majority drops that finding by construction.

Failure mode. Silent false negatives. The report looks clean because the minority-caught real bug never appears. You do not know what you missed unless you also run another mode and diff the outputs.


bestOfAll — the exhaustive union

Every finding from every model is kept. The moderator's only job is to de-duplicate near-identical entries by title and merge their reasoning so the report isn't repetitive. No voting, no filtering on count.
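The union-and-merge step can be sketched as below. In the real mode the moderator model judges what counts as a near-duplicate; a normalized title key stands in for that judgment here, and all names are illustrative.

```typescript
type RawFinding = { title: string; reasoning: string };

// Union of everything, merged by near-identical title. Nothing is dropped:
// duplicates pool their reasoning instead of being filtered out.
function bestOfAll(findings: RawFinding[]): { title: string; reasoning: string[] }[] {
  const byTitle = new Map<string, { title: string; reasoning: string[] }>();
  for (const f of findings) {
    // Stand-in for the moderator's duplicate judgment.
    const key = f.title.toLowerCase().trim();
    const entry = byTitle.get(key);
    if (entry) entry.reasoning.push(f.reasoning);
    else byTitle.set(key, { title: f.title, reasoning: [f.reasoning] });
  }
  return [...byTitle.values()];
}

const union = bestOfAll([
  { title: "Unchecked null deref in parser", reasoning: "model A's argument" },
  { title: "unchecked null deref in parser", reasoning: "model B's argument" },
  { title: "Off-by-one in pagination", reasoning: "model C's argument" },
]);
// union has two entries; the null-deref entry carries both models' reasoning
```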

What it does well. Maximum recall. If any model thought it was worth flagging, you see it. Useful when you are already suspicious that a bug exists and want every angle.

When it's right. Debugging sessions where the bug is real but the cause is unknown. Security reviews on critical code paths. First pass on legacy code you've never seen before. Contexts where reading a long list is cheaper than missing an issue.

When it's wrong. Day-to-day PR review. The noise floor is high, and human reviewers get fatigued scrolling through four models' worth of nits. You lose the signal of cross-model agreement because everything survives regardless.

Failure mode. Alert fatigue. After a few long bestOfAll reports with many false positives, reviewers start skimming and miss the real finding buried in the middle. Use it intentionally, not by default.


votingThreshold — tunable weighted vote

Each finding's weighted vote share is computed — the sum of the weights of providers that raised it, divided by the total panel weight. Findings with a share at or above a configured threshold survive. Threshold and per-provider weights are both configurable.

Weights are per-provider scalars from 0.0 to 3.0. A 0.0 weight means a provider's vote doesn't count toward consensus (without removing it from the debate). A 2.0 weight doubles its contribution, and a 3.0 weight is close to a soft veto in favor.
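The share computation is simple enough to sketch. Provider names and the function below are illustrative, not Joint Chiefs' actual code.

```typescript
// Weighted vote share as described above: sum of the weights of providers
// that raised the finding, divided by total panel weight. Names illustrative.
function voteShare(raisedBy: string[], weights: Record<string, number>): number {
  const totalWeight = Object.values(weights).reduce((a, b) => a + b, 0);
  const raisedWeight = raisedBy.reduce((a, p) => a + (weights[p] ?? 0), 0);
  return raisedWeight / totalWeight;
}

// Equal weights: one vote out of four is a 0.25 share, below a 0.5 cutoff.
const equal = { claude: 1.0, gpt: 1.0, gemini: 1.0, grok: 1.0 };
voteShare(["claude"], equal); // 0.25

// Claude at 2.0: the same lone vote is 2 / 5 = 0.4, enough at a 0.4 threshold.
const boosted = { claude: 2.0, gpt: 1.0, gemini: 1.0, grok: 1.0 };
voteShare(["claude"], boosted); // 0.4
```

Boosting a provider's weight raises the denominator too, which is why 2.0 on a four-model panel yields a 0.4 share rather than 0.5.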

What it does well. Tunable precision-recall trade-off. A 0.3 threshold is close to bestOfAll. A 0.5 threshold with equal weights is close to strictMajority. A 0.4 threshold with Claude at 2.0 on security files gives security-sensitive catches a louder voice without removing the other models from the conversation.

When it's right. When you have an informed opinion about which models you trust for which kinds of code. When you need the filter to be explainable numerically. When you want the flexibility to move the cutoff without changing the mode.

When it's wrong. When you don't have those priors. A miscalibrated threshold plus miscalibrated weights produces a filter that looks principled but isn't. It's easier to justify "the moderator said so" than to justify "Claude at 1.8 and Grok at 1.2 with a 0.45 threshold said so" if you picked the numbers by feel.

Failure mode. False precision. The fact that you can express a weighting in decimals doesn't mean your decimals are right. See our guide to per-provider weighting for sensible defaults.


Comparison table

moderatorDecides
  Best for: default PR review, judgment calls, weighing argument quality over vote counts.
  Worst for: domains the moderator is weakly trained on; compliance reports that need mechanical explanations.
  Failure mode: inherits the moderator's miscalibration on unfamiliar territory.

strictMajority
  Best for: routine PRs, smoke tests, high-precision noise-averse workflows.
  Worst for: security-sensitive code, concurrency, novel codebases with rare bugs.
  Failure mode: silent false negatives — the minority-caught real bug disappears.

bestOfAll
  Best for: debugging sessions, first pass on legacy code, maximum recall.
  Worst for: day-to-day PRs, contexts where reviewer fatigue matters.
  Failure mode: alert fatigue — the real finding buried in noise.

votingThreshold
  Best for: informed weighting, tunable precision-recall, explainable numeric filters.
  Worst for: teams without priors on which models to trust where.
  Failure mode: false precision — principled-looking numbers picked by feel.

How to pick

Start with moderatorDecides. It's the default because it's the right default.

Switch to strictMajority if your daily reviews are too noisy and you'd rather miss occasional rare catches than wade through nits. Switch to bestOfAll when you're actively debugging or auditing and you want every model's full list. Switch to votingThreshold only when you have a real opinion about why certain providers should outweigh others on certain files — otherwise you're reinventing strictMajority with worse documentation.

A pattern that works: moderatorDecides in CI, bestOfAll when you're about to ship something risky, strictMajority when you're triaging a backlog. The CLI flag and the MCP tool both take the mode as an argument, so overriding per-invocation is cheap.


Key takeaways

  • Consensus mode is the rule that collapses a multi-model debate into one final report. Every rule trades precision for recall differently.
  • moderatorDecides is the default. The moderator reads the full anonymized transcript and writes the synthesis based on argument quality, not vote count.
  • strictMajority is precision-first. Findings survive only with majority corroboration. Quiet and ruthless — drops minority-caught real bugs by construction.
  • bestOfAll is recall-first. Keeps every finding, de-dupes by title. Exhaustive and noisy — best for debugging and audits, worst for daily review.
  • votingThreshold is the tunable option. Weighted per-provider votes against a configurable cutoff. Powerful when you have priors, deceptive when you don't.
  • The mode is per-invocation. Change it on the CLI or via the MCP tool when the stakes change — don't lock yourself into one setting for every review.

Frequently asked questions

What is the default consensus mode in Joint Chiefs?

moderatorDecides. The moderator (Claude by default) reads the full anonymized debate transcript and writes the final synthesis based on argument quality rather than vote count. It is the default because it most closely matches the outcome the Multi-Agent Debate literature shows to beat single-model review and vote-based review.

When should I use strict majority instead of moderator-decides?

When you want a high-precision, low-noise report and can afford to miss rare findings. Strict majority only keeps findings raised by more than half of the panel, so false positives are rare but minority-caught real bugs get filtered out. Good for end-of-PR sanity checks. Bad for security-sensitive code where the rare catch matters most.

What does best-of-all actually do?

It produces the union of every finding from every model, with the moderator de-duplicating near-identical entries by title and merging their reasoning. You see everything any model thought was worth flagging. Exhaustive by design and noisy by consequence — use it when you would rather read a long list than miss anything.

How does voting threshold differ from strict majority?

Voting threshold lets you set the survival cutoff anywhere between 0 and 1 and applies per-provider weights before the vote. Strict majority is a fixed 50%+ rule with equal weights. A 0.4 threshold with Claude weighted at 2.0 for security files is a very different filter from a plain majority vote.

Can I change consensus modes per review?

Yes. Consensus mode lives in the strategy config and can be overridden per-invocation from the CLI or via the MCP tool arguments. You do not have to rebuild the app or restart the MCP server to switch between modes.
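As a shape sketch only: the field names below are assumptions chosen for illustration, not Joint Chiefs' documented config schema.

```typescript
// Hypothetical strategy-config shape; every field name here is an
// assumption, not the tool's documented schema.
const strategy = {
  consensusMode: "moderatorDecides", // default; overridable per-invocation
  providerWeights: { claude: 1.0, gpt: 1.0, gemini: 1.0, grok: 1.0 },
  votingThreshold: 0.5, // read only when consensusMode is "votingThreshold"
};
```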

Which consensus mode is correct?

There is no universally correct mode — each trades precision for recall differently. moderatorDecides is the right default for most reviews. Swap to strict majority when you want a quiet report and can afford to miss rare catches. Swap to best-of-all when you are investigating a real bug and want every angle. Swap to voting threshold when you have strong priors about which models to trust for which files.