Project Meridian: Implementing the LLM Council
When Andrej Karpathy tweeted about the concept of an "LLM Council," it struck a chord. The idea is simple yet profound: no single model is perfect, but a committee of diverse models can check each other's work, debate, and converge on a truth that is more reliable than any individual member.
Project Meridian is my implementation of this architecture. It's a system designed to orchestrate a council of heterogeneous frontier models to solve complex reasoning tasks with a level of robustness that a single model cannot achieve.
The source code is currently closed source, but I am planning to release it soon.
The Core Problem
We are moving from a world of "Prompt Engineering" to "System Engineering." The stochastic nature of LLMs means that for any non-trivial task, a single inference pass is a roll of the dice. Even the best models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) have blind spots, biases, and occasional hallucinations.
Meridian treats individual models as unreliable nodes in a distributed system. By aggregating their outputs and forcing a consensus process, we can filter out noise and amplify signal.
The Architecture
Meridian implements a 3-stage consensus pipeline. This isn't just a simple "vote"; it's a deliberative process that mimics human peer review.
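Before walking through the stages, it helps to pin down the shape of the data flowing between them. The snippets below assume these minimal result types (the field names are my own shorthand for illustration, not necessarily Meridian's exact definitions):

// Shared result shapes assumed by the snippets in this post (illustrative only)
interface Stage1Result {
  model: string;      // which council member produced the answer
  response: string;   // the raw answer text
}

interface Stage2Result {
  model: string;      // the reviewing council member
  response: string;   // its full critique text
  ranking: string[];  // anonymized labels, best to worst, e.g. ["Response C", "Response A"]
}

interface Stage3Result {
  model: string;      // the Chairman model
  response: string;   // the final synthesized answer
}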
Stage 1: Diversity Generation
First, we broadcast the user's query to a diverse set of council members. Diversity is critical here; using three instances of GPT-4 is less effective than using GPT-4, Claude, and Gemini because different model families have different error distributions.
In Meridian, the council consists of:
- OpenAI GPT-5.1 (for raw reasoning power)
- Google Gemini 3 Pro (for long-context understanding)
- Anthropic Claude Sonnet 4.5 (for nuance and safety)
- xAI Grok 4 (for uninhibited perspective)
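Concretely, the roster can live in a small config module. A sketch of what that might look like (the exact model identifiers are assumptions and will depend on your provider or router):

// Hypothetical council configuration; adjust the identifiers to your provider
export const COUNCIL_MODELS = [
  "openai/gpt-5.1",
  "google/gemini-3-pro-preview",
  "anthropic/claude-sonnet-4.5",
  "x-ai/grok-4",
];

// The Chairman role used in Stage 3
export const CHAIRMAN_MODEL = "google/gemini-3-pro-preview";

Stage 1 then simply broadcasts the user's query to every member on this roster: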
// From council.ts
export async function stage1CollectResponses(
  userQuery: string
): Promise<Stage1Result[]> {
  const messages = [{ role: "user", content: userQuery }];

  // Query all models in parallel for maximum efficiency
  const responses = await queryModelsParallel(COUNCIL_MODELS, messages);

  // ... formatting logic ...
  return stage1Results;
}
This parallel fan-out means the latency of the stage is bounded by the slowest council member, not the sum of all of them.
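queryModelsParallel itself is a thin wrapper around that fan-out. A minimal sketch, assuming a queryModel(model, messages) helper that wraps your chat-completions client and resolves to a { model, response } pair:

// Sketch of the parallel fan-out (not the exact Meridian implementation)
type Message = { role: "system" | "user" | "assistant"; content: string };

async function queryModelsParallel(
  models: string[],
  messages: Message[]
): Promise<Stage1Result[]> {
  const settled = await Promise.allSettled(
    models.map((model) => queryModel(model, messages))
  );
  // Keep whichever members answered; the council tolerates an empty seat
  return settled
    .filter((r): r is PromiseFulfilledResult<Stage1Result> => r.status === "fulfilled")
    .map((r) => r.value);
}

Using Promise.allSettled rather than Promise.all means a slow or failing member degrades the council gracefully instead of sinking the whole stage.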
Stage 2: Blind Peer Review
This is where the magic happens. We don't just take the Stage 1 outputs and average them. We feed them back into the council for evaluation.
Crucially, the responses are anonymized as "Response A", "Response B", and so on. This prevents bias (e.g., a model favoring its own output or favoring a specific provider).
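A straightforward way to do this (a sketch; Meridian's actual shuffling and labeling may differ) is to shuffle the Stage 1 results, assign letter labels, and keep the label-to-model mapping private so it never appears in the review prompt:

// Sketch: build the anonymized text block used in the Stage 2 prompt
function anonymizeResponses(results: Stage1Result[]) {
  const shuffled = [...results].sort(() => Math.random() - 0.5); // crude shuffle is enough here
  const labelToModel = new Map<string, string>();
  const responsesText = shuffled
    .map((r, i) => {
      const label = `Response ${String.fromCharCode(65 + i)}`; // A, B, C, ...
      labelToModel.set(label, r.model);
      return `${label}:\n${r.response}`;
    })
    .join("\n\n");
  return { responsesText, labelToModel };
}

The responsesText string is what gets interpolated into the ranking prompt below; labelToModel stays server-side so rankings can later be mapped back to real models.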
Each council member is asked to:
- Critique each response.
- Rank them from best to worst.
// The prompt sent to each model in Stage 2
const rankingPrompt = `
You are evaluating different responses to the following question:

${userQuery}

Here are the responses from different models (anonymized):

${responsesText}

Your task:
1. First, evaluate each response individually.
2. Then, at the very end, provide your final ranking in exactly this format:

FINAL RANKING:
1. Response C
2. Response A
...
`;
We then aggregate these rankings to determine a "consensus score" for each response.
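The aggregation itself can be as simple as a Borda count: each reviewer's ranking awards more points to responses placed higher, and the totals become the consensus scores. A sketch, assuming each Stage2Result carries its ranking as an ordered list of anonymized labels:

// Sketch: Borda-count aggregation of the peer rankings
function consensusScores(stage2Results: Stage2Result[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const review of stage2Results) {
    const n = review.ranking.length;
    review.ranking.forEach((label, position) => {
      const points = n - position; // 1st place earns n points, last place earns 1
      scores.set(label, (scores.get(label) ?? 0) + points);
    });
  }
  return scores;
}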
Stage 3: The Chairman's Synthesis
Finally, we introduce a specialized role: The Chairman. In Meridian, this role is currently held by google/gemini-3-pro-preview due to its large context window and ability to synthesize disparate information.
The Chairman is given:
- The original query.
- All individual responses (Stage 1).
- All peer reviews and rankings (Stage 2).
Its job is not to generate a new answer from scratch, but to synthesize the collective wisdom of the council into a single, definitive response. It can cherry-pick the code snippet from Claude, the explanation from GPT-5, and the edge-case handling from Gemini.
export async function stage3SynthesizeFinal(
  userQuery: string,
  stage1Results: Stage1Result[],
  stage2Results: Stage2Result[]
): Promise<Stage3Result> {
  // Flatten the structured results into plain text for the prompt
  // (simplified here; the real formatting helpers are omitted)
  const stage1Text = stage1Results.map((r) => r.response).join("\n\n");
  const stage2Text = stage2Results.map((r) => r.response).join("\n\n");

  const chairmanPrompt = `
You are the Chairman of an LLM Council.

ORIGINAL QUESTION:
${userQuery}

STAGE 1 - Individual Responses:
${stage1Text}

STAGE 2 - Peer Rankings:
${stage2Text}

Synthesize all of this information into a single, comprehensive answer.
Consider the peer rankings to weight the quality of insights.
`;

  return queryModel(CHAIRMAN_MODEL, [{ role: "user", content: chairmanPrompt }]);
}
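Wired together, the whole pipeline is a short async chain. A sketch of the overall flow (stage2PeerReview is a hypothetical wrapper around the ranking prompt shown earlier):

// Sketch of the end-to-end council run
export async function runCouncil(userQuery: string): Promise<Stage3Result> {
  const stage1Results = await stage1CollectResponses(userQuery);            // fan out
  const stage2Results = await stage2PeerReview(userQuery, stage1Results);   // blind peer review
  return stage3SynthesizeFinal(userQuery, stage1Results, stage2Results);    // Chairman synthesis
}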
Why This Matters
This architecture shifts the burden of correctness from the model to the system.
For high-stakes decisions (generating infrastructure code, analyzing legal documents, or making a medical diagnosis), latency is often less important than accuracy. We are trading compute (running 4+ models per query) and time (3 sequential stages) for a significant reduction in hallucination rates.
Conclusion
Project Meridian demonstrates that we can build "Super-Models" today, simply by orchestrating the models we already have. By treating LLMs as components in a larger deliberative system, we can achieve levels of reliability and reasoning that individual models cannot yet reach.