Project Meridian: Implementing the LLM Council
When Andrej Karpathy tweeted about the concept of an "LLM Council," it struck a chord. The idea is simple yet profound: no single model is perfect, but a committee of diverse models can check each other's work, debate, and converge on a truth that is more reliable than any individual member.
Project Meridian is my implementation of this architecture. It's a system designed to orchestrate a council of heterogeneous frontier models to solve complex reasoning tasks with a level of robustness that a single model cannot achieve.
The source code is currently closed source, but I am planning to release it soon.
The Core Problem
We are moving from a world of "Prompt Engineering" to "System Engineering." The stochastic nature of LLMs means that for any non-trivial task, a single inference pass is a roll of the dice. Even the best models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) have blind spots, biases, and occasional hallucinations.
Meridian treats individual models as unreliable nodes in a distributed system. By aggregating their outputs and forcing a consensus process, we can filter out noise and amplify signal.
The Architecture
Meridian implements a 3-stage consensus pipeline. This isn't just a simple "vote"; it's a deliberative process that mimics human peer review.
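Before walking through the stages, it helps to pin down the shape of the data flowing between them. The snippets below assume these minimal result types (the field names are my own shorthand for illustration, not necessarily Meridian's exact definitions):

// Shared result shapes assumed by the snippets in this post (illustrative only)
interface Stage1Result {
  model: string;      // which council member produced the answer
  response: string;   // the raw answer text
}

interface Stage2Result {
  model: string;      // the reviewing council member
  response: string;   // its full critique text
  ranking: string[];  // anonymized labels, best to worst, e.g. ["Response C", "Response A"]
}

interface Stage3Result {
  model: string;      // the Chairman model
  response: string;   // the final synthesized answer
}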
Stage 1: Diversity Generation
First, we broadcast the user's query to a diverse set of council members. Diversity is critical here; using three instances of GPT-4 is less effective than using GPT-4, Claude, and Gemini because different model families have different error distributions.
In Meridian, the council consists of:
- OpenAI GPT-5.1 (for raw reasoning power)
- Google Gemini 3 Pro (for long-context understanding)
- Anthropic Claude Sonnet 4.5 (for nuance and safety)
- xAI Grok 4 (for uninhibited perspective)
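Concretely, the roster can live in a small config module. A sketch of what that might look like (the exact model identifiers are assumptions and will depend on your provider or router):

// Hypothetical council configuration; adjust the identifiers to your provider
export const COUNCIL_MODELS = [
  "openai/gpt-5.1",
  "google/gemini-3-pro-preview",
  "anthropic/claude-sonnet-4.5",
  "x-ai/grok-4",
];

// The Chairman role used in Stage 3
export const CHAIRMAN_MODEL = "google/gemini-3-pro-preview";

Stage 1 then simply broadcasts the user's query to every member on this roster: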
// From council.ts
export async function stage1CollectResponses(
  userQuery: string
): Promise<Stage1Result[]> {
  const messages = [{ role: "user", content: userQuery }];

  // Query all models in parallel for maximum efficiency
  const responses = await queryModelsParallel(COUNCIL_MODELS, messages);

  // ... formatting logic ...
  return stage1Results;
}
This parallel fan-out means the latency of the stage is bounded by the slowest council member, not the sum of all of them.
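queryModelsParallel itself is a thin wrapper around that fan-out. A minimal sketch, assuming a queryModel(model, messages) helper that wraps your chat-completions client and resolves to a { model, response } pair:

// Sketch of the parallel fan-out (not the exact Meridian implementation)
type Message = { role: "system" | "user" | "assistant"; content: string };

async function queryModelsParallel(
  models: string[],
  messages: Message[]
): Promise<Stage1Result[]> {
  const settled = await Promise.allSettled(
    models.map((model) => queryModel(model, messages))
  );
  // Keep whichever members answered; the council tolerates an empty seat
  return settled
    .filter((r): r is PromiseFulfilledResult<Stage1Result> => r.status === "fulfilled")
    .map((r) => r.value);
}

Using Promise.allSettled rather than Promise.all means a slow or failing member degrades the council gracefully instead of sinking the whole stage.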
Stage 2: Blind Peer Review
This is where the magic happens. We don't just take the Stage 1 outputs and average them. We feed them back into the council for evaluation.
Crucially, the responses are anonymized as "Response A", "Response B", and so on. This prevents bias (e.g., a model favoring its own output or favoring a specific provider).
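A straightforward way to do this (a sketch; Meridian's actual shuffling and labeling may differ) is to shuffle the Stage 1 results, assign letter labels, and keep the label-to-model mapping private so it never appears in the review prompt:

// Sketch: build the anonymized text block used in the Stage 2 prompt
function anonymizeResponses(results: Stage1Result[]) {
  const shuffled = [...results].sort(() => Math.random() - 0.5); // crude shuffle is enough here
  const labelToModel = new Map<string, string>();
  const responsesText = shuffled
    .map((r, i) => {
      const label = `Response ${String.fromCharCode(65 + i)}`; // A, B, C, ...
      labelToModel.set(label, r.model);
      return `${label}:\n${r.response}`;
    })
    .join("\n\n");
  return { responsesText, labelToModel };
}

The responsesText string is what gets interpolated into the ranking prompt below; labelToModel stays server-side so rankings can later be mapped back to real models.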
Each council member is asked to:
- Critique each response.
- Rank them from best to worst.
// The prompt sent to each model in Stage 2
const rankingPrompt = `
You are evaluating different responses to the following question:

${userQuery}

Here are the responses from different models (anonymized):

${responsesText}

Your task:
1. First, evaluate each response individually.
2. Then, at the very end, provide your final ranking in exactly this format:

FINAL RANKING:
1. Response C
2. Response A
...
`;
We then aggregate these rankings to determine a "consensus score" for each response.
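The aggregation itself can be as simple as a Borda count: each reviewer's ranking awards more points to responses placed higher, and the totals become the consensus scores. A sketch, assuming each Stage2Result carries its ranking as an ordered list of anonymized labels:

// Sketch: Borda-count aggregation of the peer rankings
function consensusScores(stage2Results: Stage2Result[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const review of stage2Results) {
    const n = review.ranking.length;
    review.ranking.forEach((label, position) => {
      const points = n - position; // 1st place earns n points, last place earns 1
      scores.set(label, (scores.get(label) ?? 0) + points);
    });
  }
  return scores;
}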
Stage 3: The Chairman's Synthesis
Finally, we introduce a specialized role: The Chairman. In Meridian, this role is currently held by google/gemini-3-pro-preview due to its large context window and ability to synthesize disparate information.
The Chairman is given:
- The original query.
- All individual responses (Stage 1).
- All peer reviews and rankings (Stage 2).
Its job is not to generate a new answer from scratch, but to synthesize the collective wisdom of the council into a single, definitive response. It can cherry-pick the code snippet from Claude, the explanation from GPT-5, and the edge-case handling from Gemini.
export async function stage3SynthesizeFinal(
  userQuery: string,
  stage1Results: Stage1Result[],
  stage2Results: Stage2Result[]
): Promise<Stage3Result> {
  // Flatten the structured results into plain text for the prompt
  // (simplified here; the real formatting helpers are omitted)
  const stage1Text = stage1Results.map((r) => r.response).join("\n\n");
  const stage2Text = stage2Results.map((r) => r.response).join("\n\n");

  const chairmanPrompt = `
You are the Chairman of an LLM Council.

ORIGINAL QUESTION:
${userQuery}

STAGE 1 - Individual Responses:
${stage1Text}

STAGE 2 - Peer Rankings:
${stage2Text}

Synthesize all of this information into a single, comprehensive answer.
Consider the peer rankings to weight the quality of insights.
`;

  return queryModel(CHAIRMAN_MODEL, [{ role: "user", content: chairmanPrompt }]);
}
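Wired together, the whole pipeline is a short async chain. A sketch of the overall flow (stage2PeerReview is a hypothetical wrapper around the ranking prompt shown earlier):

// Sketch of the end-to-end council run
export async function runCouncil(userQuery: string): Promise<Stage3Result> {
  const stage1Results = await stage1CollectResponses(userQuery);            // fan out
  const stage2Results = await stage2PeerReview(userQuery, stage1Results);   // blind peer review
  return stage3SynthesizeFinal(userQuery, stage1Results, stage2Results);    // Chairman synthesis
}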
Why This Matters
This architecture shifts the burden of correctness from the model to the system.
For high-stakes decisions (generating infrastructure code, analyzing legal documents, or making a medical diagnosis), latency is often less important than accuracy. We are trading compute (running 4+ models per query) and time (3 sequential stages) for a significant reduction in hallucination rates.
Conclusion
Project Meridian demonstrates that we can build "Super-Models" today, simply by orchestrating the models we already have. By treating LLMs as components in a larger deliberative system, we can achieve levels of reliability and reasoning that individual models cannot yet reach.