The Lab · AI systems

A council, not a single model.

Most AI tools trust whatever one model says first. This one runs a council of models — then rigorously checks their work, with the authorship hidden, before it commits to a single answer.

01

Ask

The same question goes to every model.

02

The council answers

Four models answer independently, each with its own blind spots.

Claude · GPT · Gemini · Grok

03

Anonymize

Authorship stripped — so no model can favor its own.

A · B · C · D

04

Peer review & rank

Every model grades every answer; the ranks are aggregated.

05

Chair synthesizes

A chair reconciles them into one reply.

One answer out
The actual workflow — not a mockup
[Figure: the n8n consensus workflow canvas]
The live build in n8n: parallel answers, anonymization, peer review & ranking, then the chair’s synthesis.
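
The first stage on that canvas, the parallel fan-out, can be sketched outside n8n. This is a minimal illustration and not the workflow itself: `ask_council` and `make_stub` are hypothetical names, and the four "models" are offline stubs standing in for real calls made through OpenRouter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in type: a "model" is any function from question to answer.
ModelFn = Callable[[str], str]


def ask_council(question: str, council: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same question to every model in parallel; return answers by model name."""
    with ThreadPoolExecutor(max_workers=max(1, len(council))) as pool:
        futures = {name: pool.submit(fn, question) for name, fn in council.items()}
        # result() blocks until that model has answered, or re-raises its error.
        return {name: future.result() for name, future in futures.items()}


def make_stub(name: str) -> ModelFn:
    """Offline stub so the sketch runs without network access or API keys."""
    return lambda question: f"[{name}] draft answer to: {question}"


council = {name: make_stub(name) for name in ("Claude", "GPT", "Gemini", "Grok")}
answers = ask_council("Which accounts deserve attention this month?", council)
```

No model sees another's draft at this stage, which is what keeps the four answers independent.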

The secondary check is the whole point.

The easy version is one model that double-checks the answer. But a lone reviewer carries its own blind spots, and models tend to rate their own output more favorably than anyone else’s.

So the answers are anonymized first, then the entire council ranks them and the scores are aggregated. No single judge dominates, and no model can play favorites — because it can’t tell which answer is its own.
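
A minimal sketch of the anonymization step, assuming answers arrive as a mapping from model name to text. The function name `anonymize` and the private `key` mapping are illustrative choices, not taken from the n8n build.

```python
import random
import string
from typing import Dict, Optional, Tuple


def anonymize(
    answers: Dict[str, str], seed: Optional[int] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle the answers and relabel them A, B, C, ...

    Returns (labeled, key). `labeled` maps label -> answer text and is what the
    reviewers see. `key` maps label -> author and stays private until ranking ends.
    """
    authors = list(answers)
    # Random order, so a label's position says nothing about who wrote the answer.
    random.Random(seed).shuffle(authors)
    labels = string.ascii_uppercase[: len(authors)]
    labeled = {label: answers[author] for label, author in zip(labels, authors)}
    key = dict(zip(labels, authors))
    return labeled, key


drafts = {"Claude": "draft 1", "GPT": "draft 2", "Gemini": "draft 3", "Grok": "draft 4"}
labeled, key = anonymize(drafts, seed=7)
```

This only hides the labels. An answer that names its own author in the text would still leak, so a real pipeline would likely need to scrub such self-references as well.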

Rows grade columns · authors hidden

          A   B   C   D
Claude    2   1   3   4
GPT       3   1   2   4
Gemini    2   1   4   3
Grok      3   2   1   4
Aggregated rank → Answer B wins on the council’s collective judgment, not any one model’s. (Scores illustrative.)
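
The page does not say how the ranks are combined, so the sketch below assumes the simplest rule: a rank sum (Borda-style), where 1 is best and the lowest total wins. It uses the illustrative grid above; `aggregate` is a hypothetical helper name.

```python
from typing import Dict

# Illustrative ranks from the grid above: rows are reviewers, 1 = best.
ballots: Dict[str, Dict[str, int]] = {
    "Claude": {"A": 2, "B": 1, "C": 3, "D": 4},
    "GPT":    {"A": 3, "B": 1, "C": 2, "D": 4},
    "Gemini": {"A": 2, "B": 1, "C": 4, "D": 3},
    "Grok":   {"A": 3, "B": 2, "C": 1, "D": 4},
}


def aggregate(ballots: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum each answer's rank across all reviewers (assumed rule: lower is better)."""
    totals: Dict[str, int] = {}
    for ranks in ballots.values():
        for label, rank in ranks.items():
            totals[label] = totals.get(label, 0) + rank
    return totals


totals = aggregate(ballots)
winner = min(totals, key=totals.get)
print(totals)  # {'A': 10, 'B': 5, 'C': 10, 'D': 15}
print(winner)  # B
```

Answers A and C tie at 10 under this rule. How the real workflow breaks ties is not stated, so the sketch leaves that open.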
One real run — condensed
The brief I gave the council

Act as the Strategy & Ops partner to the President of a specialty-ingredients distributor, post-acquisition: build this month’s account-priority system. Not SaaS — accounts are manufacturers and brands, every ingredient has its own margins, lead times, and launch horizons.

4 presidential visits · 8 structured calls · 5 formulation pushes · 3 supplier-line pushes

Anchor on risk-adjusted gross profit — est. revenue × margin × close probability — then decide where human attention changes the outcome.
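
As a worked instance of that anchor, here is a small sketch of the calculation. The function name is illustrative, and the inputs are the randomized scenario figures for the top-ranked opportunity quoted in the council's reply ($190k, 28% margin, 40% close probability).

```python
def risk_adjusted_gp(est_revenue: float, margin: float, close_probability: float) -> float:
    """Risk-adjusted gross profit = est. revenue x margin x close probability."""
    if not (0.0 <= margin <= 1.0 and 0.0 <= close_probability <= 1.0):
        raise ValueError("margin and close_probability must be fractions between 0 and 1")
    return est_revenue * margin * close_probability


ragp = risk_adjusted_gp(190_000, 0.28, 0.40)
print(round(ragp))  # 21280, i.e. the roughly $21k expected GP in the reply
```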

“A pure revenue ranking fails. A pure pipeline ranking fails. A pure sample-count ranking fails.”

Attached: 12 anonymized accounts and 18 live opportunities — stages, margins, lead-time risk, communication signals.

What the council returned
priority = (RAGP × headroom × momentum × signal) × (1 + 0.25 × attention-leverage)
override: broken-signal opportunities are barred from scarce technical pushes until sales repairs the signal
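
The formula and its override translate into a few lines of code. This is a sketch under stated assumptions: the excerpt does not give the scales for headroom, momentum, signal, or attention-leverage, so the values below are hypothetical placeholders, as are the opportunity names and the helper `allocate_technical_pushes`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opportunity:
    name: str
    ragp: float                  # risk-adjusted gross profit, in dollars
    headroom: float              # assumed multiplier scale (not given in the excerpt)
    momentum: float              # assumed multiplier scale
    signal: float                # assumed multiplier scale
    attention_leverage: float    # assumed to lie between 0 and 1
    signal_broken: bool = False  # the override flag


def priority(opp: Opportunity) -> float:
    """priority = (RAGP x headroom x momentum x signal) x (1 + 0.25 x attention-leverage)."""
    base = opp.ragp * opp.headroom * opp.momentum * opp.signal
    return base * (1 + 0.25 * opp.attention_leverage)


def allocate_technical_pushes(opps: List[Opportunity], slots: int) -> List[Opportunity]:
    """Override: broken-signal opportunities get no scarce technical push, whatever their score."""
    eligible = [o for o in opps if not o.signal_broken]
    return sorted(eligible, key=priority, reverse=True)[:slots]


# Hypothetical inputs, for illustration only.
opps = [
    Opportunity("opp-1", 21_280, 1.5, 1.0, 1.0, 1.0),
    Opportunity("opp-2", 60_000, 1.0, 1.0, 1.0, 0.0, signal_broken=True),
    Opportunity("opp-3", 10_000, 1.0, 1.0, 1.0, 0.5),
]
picked = [o.name for o in allocate_technical_pushes(opps, slots=2)]
```

In this made-up example, opp-2 has the highest raw score but is still left out, because the override is applied before the ranking rather than folded into the score.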

Top of the ranking: a new incubator account’s ferment-based active — $190k × 28% × 40% ≈ $21k expected GP, lifted by very-high headroom and a maximum leverage flag: strong R&D pull, no executive sponsor yet → presidential visit.

“Notably excluded despite raw value: a $400k cost-down opportunity — 12% margin, procurement-driven, high lead-time risk. A strategic trap for senior attention.”
“Sample count alone is a vanity metric — it mistakes activity for progress.”
“This account went silent after a competitor meeting; your visit is the only way to know in one meeting if it’s a save or a loss.”
Council of four · the synthesized answer ran ~1,800 words; excerpts above · account names and figures are randomized scenario data.
Why a council?

One model is one set of blind spots. Four independent answers surface more, and disagreement is signal — it flags where the question is genuinely hard.

Why anonymize?

Evaluators favor their own writing. Hiding authorship turns peer review into a fair test of the answer, not a popularity contest between brands.

Why a chair?

Aggregated ranks pick the strongest answer; the chair then reconciles the council’s points of agreement and disagreement into one clean, usable reply.

What it shows

An AI-native operator: I stand up working AI systems and engineer them for reliability — not just call an API.

Built in n8n · multi-model via OpenRouter