Multi-model verification: what does it mean when models disagree 72.1% on finance questions?

Posted on 2026-05-18 05:41:30

If you have spent any time in the RAG (Retrieval-Augmented Generation) trenches, you have likely seen a vendor pitch a "near-zero hallucination rate" or a "99% factual accuracy" score. I’ve spent nine years building knowledge systems for regulated industries, and I have learned one immutable truth: if you see a single percentage point representing the "truthfulness" of an LLM, you are being sold a fairy tale.

Recently, the industry has been buzzing about the 72.1% finance disagreement figure tied to the Suprmind divergence index. When you hear that models disagree on nearly three-quarters of financial queries, the immediate reaction is panic. "Are they hallucinating 72% of the time?" The answer is almost certainly no. To understand why, we have to stop treating "hallucination" as a binary state and start treating model output as the product of a multi-stage process of retrieval, reasoning, and synthesis.

What are we actually measuring? The Suprmind Divergence Index

Before we panic about the 72.1% figure, we need to clarify what the Suprmind divergence index actually measures. It does not measure "lying." It measures output variance across multiple models when tasked with interpreting complex financial documents.

In a financial context, that "disagreement" often arises not from a hallucination, but from:

- Interpretation variance: Model A interprets a tax clause as applicable to state-level filings, while Model B restricts it to federal.
- Contextual weighting: The models prioritized different snippets from a 100-page SEC filing.
- Implicit bias in training: The underlying datasets for base models differ in how they prioritize specific regulatory precedents.

The 72.1% isn’t a failure rate. It is a divergence rate. It tells us that for high-stakes financial questions, relying on a single model is a gamble: the answer you get depends heavily on which model you happened to ask.
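
To make the distinction concrete, here is a minimal sketch of how a divergence rate could be computed. This is my own illustrative definition (the fraction of questions on which the models did not all give the same normalized answer), not the Suprmind methodology, which I am not reproducing here. The function names and sample answers are hypothetical.

```python
def normalize(answer: str) -> str:
    """Crude normalization so case and whitespace differences do not count as disagreement."""
    return " ".join(answer.lower().split())


def question_diverges(answers: list[str]) -> bool:
    """True if at least two models gave different normalized answers."""
    return len({normalize(a) for a in answers}) > 1


def divergence_rate(answers_by_question: dict[str, list[str]]) -> float:
    """Fraction of questions on which the models did not all agree."""
    if not answers_by_question:
        return 0.0
    diverging = sum(question_diverges(a) for a in answers_by_question.values())
    return diverging / len(answers_by_question)


# Hypothetical answers from three models to four questions.
sample = {
    "q1": ["Federal only", "federal only", "Federal  only"],   # same after normalization
    "q2": ["State and federal", "Federal only", "Federal only"],
    "q3": ["$4.2M", "$4.2M", "$4.2M"],
    "q4": ["2023", "2024", "2023"],
}

rate = divergence_rate(sample)
print(f"{rate:.1%}")  # 50.0% (q2 and q4 diverge)
```

Notice that nothing in this calculation knows which answer is correct. A question counts as divergent even when two of three models are right, which is exactly why the number cannot be read as an error rate.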

Table 1: Why "Divergence" is not "Hallucination"

| Metric | What it actually measures | Why it’s misused |
| --- | --- | --- |
| Divergence Index | The delta between model outputs on identical inputs. | Mistaken for a raw error rate. |
| Hallucination Rate | Generation of text unsupported by the source. | Usually measured on synthetic data, not real-world edge cases. |
| Faithfulness | Does the answer strictly follow the retrieved context? | Fails to account for "external knowledge" leakage. |

So what? If your RAG pipeline produces a 72% divergence rate, you have a verification problem, not a model quality problem. You need a multi-model verification architecture, not a search for a "better" model.

The Fallacy of the "Single Hallucination Rate"

People love a single number. Executives want a "Hallucination KPI." The problem is that hallucination is not a monolithic event. It is a spectrum. In my years of building for legal and medical firms, I’ve categorized these failures into distinct modes:

- Extrinsic Hallucinations: The model ignores the retrieved context and fills in the blanks with training data garbage.
- Intrinsic Hallucinations: The model contradicts the retrieved context (e.g., changing a date from 2023 to 2024).
- Citation Hallucinations: The model makes a correct statement but invents a source or page number to "justify" it.
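
One way to stop reporting a single number is to tag every checked claim with its failure mode and report the tally. The sketch below assumes some upstream checker has already produced three booleans per claim; those flags, the `classify` function, and the precedence order between modes are all my own illustrative choices.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    EXTRINSIC = "extrinsic"  # claim is not found in the retrieved context
    INTRINSIC = "intrinsic"  # claim contradicts the retrieved context
    CITATION = "citation"    # claim is fine, but the cited source does not resolve


def classify(supported: bool, contradicted: bool, citation_resolves: bool) -> Optional[FailureMode]:
    """Map per-claim check results to a failure mode, or None if the claim passes."""
    if contradicted:
        return FailureMode.INTRINSIC
    if not supported:
        return FailureMode.EXTRINSIC
    if not citation_resolves:
        return FailureMode.CITATION
    return None


# Hypothetical check results: (supported, contradicted, citation_resolves)
checks = [
    (True, False, True),    # passes
    (False, False, True),   # extrinsic
    (False, True, True),    # intrinsic
    (True, False, False),   # citation
    (True, False, True),    # passes
]

tally = Counter(classify(*c) for c in checks)
for mode, count in tally.items():
    print(mode.value if mode else "ok", count)
```

A report of "one extrinsic, one intrinsic, one citation" tells an auditor far more than "60% accurate" does.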

When someone claims a "near-zero hallucination rate," they are almost always filtering their test set to avoid complex reasoning tasks. If you throw a nuanced financial query, such as reconciling two conflicting balance sheets, at a model, the "hallucination" isn't a factual error; it’s an abduction error. The model is trying to reason its way out of a contradiction that the data itself provided.

Definitions Matter: Faithfulness vs. Factuality vs. Citation

If we want to build systems that survive audits, we have to stop conflating these three terms:

- Faithfulness: Does the model output stick to the provided document?
- Factuality: Is the output true in the real world? (Note: A model can be faithful to a document containing a typo, which makes the output faithful but not factual.)
- Citation: Can the model correctly map its output back to a specific byte-offset in the source document?

In finance, you need all three. A system that is faithful but factually wrong is useless. A system that is factual but unfaithful (using external data it wasn't supposed to) is a compliance nightmare.
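
Of the three, citation is the easiest to check mechanically. Here is a minimal sketch, assuming the model returns a byte range into the UTF-8 encoded source together with the text it claims to be quoting. The `Citation` structure and the sample document are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    start: int  # byte offset into the UTF-8 encoded source (inclusive)
    end: int    # byte offset into the UTF-8 encoded source (exclusive)
    quote: str  # text the model claims sits at that span


def citation_resolves(source: str, citation: Citation) -> bool:
    """True only if the quoted text sits at exactly the claimed byte range."""
    raw = source.encode("utf-8")
    if not (0 <= citation.start < citation.end <= len(raw)):
        return False
    # Compare bytes directly so a range that splits a multi-byte character
    # simply fails to match instead of raising a decode error.
    return raw[citation.start:citation.end] == citation.quote.encode("utf-8")


source = "Net revenue for fiscal 2023 was $4.2M. Operating costs were $3.1M."
quote = "Net revenue for fiscal 2023 was $4.2M"
span_end = len(quote.encode("utf-8"))

good = Citation(0, span_end, quote)
altered = Citation(0, span_end, "Net revenue for fiscal 2024 was $4.2M")
out_of_range = Citation(0, 10_000, quote)

print(citation_resolves(source, good))          # True
print(citation_resolves(source, altered))       # False
print(citation_resolves(source, out_of_range))  # False
```

A passing check here establishes only that the quoted text exists where the model says it does. Whether that text supports the claim (faithfulness) and whether the document itself is right (factuality) still need their own checks.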

Benchmarks vs. Reality: Why Models Disagree

We often treat benchmarks like the "truth." They are not. Benchmarks are audit trails for developers, not performance guarantees for users. Benchmarks disagree with each other because they measure different failure modes, and models disagree with each other because they were tuned against different ones.

Some benchmarks evaluate Instruction Following (Did the model format the response as JSON?). Others evaluate Retrieval Utility (Did it find the right document?). When models disagree 72.1% on finance questions, it's often because they were optimized for different objectives. A model optimized for "conciseness" will look very different from a model optimized for "thoroughness."

Table 2: Reasoning Tax on Grounded Summarization

| Complexity Level | Reasoning Tax | Expected Output |
| --- | --- | --- |
| Simple Extraction | Low | High consistency across models. |
| Comparative Analysis | Moderate | High divergence; models choose different metrics to compare. |
| Regulatory Synthesis | High | Extreme divergence; requires chain-of-thought verification. |

So what? When you ask a model to do "grounded summarization," you are imposing a reasoning tax. The more synthesis you require, the more you invite the model to move away from pure extraction and into creative reasoning. If your pipeline demands high-precision synthesis, you cannot rely on a single model run. You must implement Self-Consistency Prompting or Cross-Model Consensus.
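
Self-consistency is simple to sketch: sample the same prompt several times, take the majority answer, and refuse to answer when the majority is too thin. In the example below, `sample_fn` stands in for whatever client call you use to sample a model, and the `min_agreement` default is an arbitrary illustrative threshold; both are assumptions, not a recommendation. Passing a different model's sampler on each call turns the same idea into cross-model consensus.

```python
import itertools
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_fn: Callable[[str], str],
    prompt: str,
    n: int = 5,
    min_agreement: float = 0.6,
) -> tuple[Optional[str], float]:
    """Sample n answers and return (majority answer or None, agreement ratio)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_fn(prompt).strip() for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    # Below the threshold, return None so the caller can escalate or abstain.
    return (top if agreement >= min_agreement else None), agreement


# Stand-in for a real model call: replays canned answers in order.
canned = itertools.cycle(["3.1M", "3.1M", "4.2M", "3.1M", "3.1M"])


def fake_model(prompt: str) -> str:
    return next(canned)


answer, agreement = self_consistent_answer(fake_model, "What were operating costs?")
print(answer, agreement)  # 3.1M 0.8
```

Exact string matching is the weakest part of this sketch. For free-text answers you would need to compare extracted values or normalized claims instead, otherwise two phrasings of the same answer count as disagreement.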

How to build for "Multi-model Verification"

If you are an enterprise lead, stop asking "Which model is the best?" Start asking "How do I verify the consensus?"

Think about it: if you see a 72.1% disagreement, don't try to find the "one true model." Instead, adopt these three architectural patterns:

1. Consensus Scoring: Run the same prompt through three distinct model architectures (e.g., a high-reasoning model, a fast-extraction model, and a smaller, highly-tuned model). If they don't reach a consensus, trigger a human-in-the-loop review.
2. Divergence Alarms: Use the Suprmind divergence index as an observability metric. If your production queries cross a divergence threshold, route those queries to your most expensive, highest-reasoning models automatically.
3. Evidence-Based Guardrails: Do not let the model generate the answer. Make the model select the answer from the source text and provide a citation string. If the model cannot perform extraction with a valid citation, it should abstain from answering.
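
The three patterns above can be wired together in a few lines. This is a rough sketch under my own assumptions: the thresholds, the routing labels, and the decision to fold consensus scoring and divergence alarms into one `route` function are illustrative, and the guardrail uses a plain verbatim-substring check as the simplest possible form of "select, don't generate."

```python
from collections import Counter
from typing import Optional


def consensus(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of models that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    top, count = Counter(a.strip() for a in answers).most_common(1)[0]
    return top, count / len(answers)


def route(answers: list[str], accept_at: float = 1.0, escalate_at: float = 0.5) -> str:
    """Decide what to do with a query based on cross-model agreement."""
    _, agreement = consensus(answers)
    if agreement >= accept_at:
        return "accept"
    if agreement >= escalate_at:
        return "escalate_to_stronger_model"  # divergence alarm tripped
    return "human_review"                    # no consensus at all


def guarded_answer(source: str, selected_span: str) -> Optional[str]:
    """Return the span only if it appears verbatim in the source; otherwise abstain."""
    span = selected_span.strip()
    if span and span in source:
        return span
    return None


source = "Net revenue for fiscal 2023 was $4.2M. Operating costs were $3.1M."

print(route(["$3.1M", "$3.1M", "$3.1M"]))  # accept
print(route(["$3.1M", "$3.1M", "$4.2M"]))  # escalate_to_stronger_model
print(route(["$3.1M", "$4.2M", "$5.0M"]))  # human_review

print(guarded_answer(source, "Operating costs were $3.1M"))  # Operating costs were $3.1M
print(guarded_answer(source, "Operating costs were $3.5M"))  # None
```

With three models the agreement ratio can only be 1/3, 2/3, or 1, so the two thresholds map cleanly onto "all agree," "two agree," and "nobody agrees." With more models you would tune them against your own review capacity.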

Final Thoughts: Don't treat citations as proof

I see teams treat LLM-generated citations as "proof" of accuracy. This is a fatal mistake, and I remember a project where we learned this lesson the hard way. A citation is not proof; it is an audit trail. The system must be able to present the citation, but the human or the verification engine must independently verify that the citation actually supports the claim.

The 72.1% divergence in finance questions isn't a sign that models are "broken." It’s a sign that they are probabilistic systems performing deterministic work. If you expect them to behave like a SQL database, you will always be disappointed. If you build systems that treat them like a committee of highly intelligent but occasionally distractible analysts, you might actually build something worth shipping.