Detecting Hallucinations in LLM Summaries

LLMs write convincingly but fabricate facts. A practical tour of automated detection techniques: BERTScore, embedding similarity, ROUGE/n-gram overlap, NER-based cross-referencing, and QAEVAL.

11 min read

Large language models are remarkable writers. Ask one to summarise a 10-page document and it will produce something that reads like a confident, well-structured précis — often more fluent than what a human would dash off under time pressure. That fluency is precisely the danger. In our work on document summarisation we found that 10–20 % of facts in LLM-generated summaries are wrong: wrong numbers, invented dates, misattributed quotes, or subtly reversed cause-and-effect.

This post is a practical tour of techniques we use to catch those errors automatically.


Why this is hard

A language model is not a retrieval system. It does not look up facts; it predicts the next token. When context runs thin or two plausible facts compete, it picks whichever continuation fits the probability distribution it learned during training. The result reads fine. The numbers might not be.

Fluency and correctness are orthogonal. A summary can be grammatically perfect and completely fabricated.

The goal of hallucination detection is to measure faithfulness: does every claim in the summary follow from the source document? No single metric answers that question perfectly, so we use a layered approach.


1. BERTScore

BERTScore embeds each token in both the reference text and the candidate summary using a pre-trained language model (typically a BERT variant), then computes pairwise cosine similarities and takes a greedy-matched F1.

$$\text{BERTScore}_F = \frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$

Because it operates in semantic embedding space rather than on raw tokens, it rewards paraphrases that preserve meaning. A summary that replaces "constructed" with "built" should not be penalised — BERTScore handles this correctly.

Strengths: tolerant of legitimate paraphrase; correlates better with human judgement than token-overlap metrics on many benchmarks.

Limitations: two texts can have high BERTScore while disagreeing on specific numeric facts ("the bridge is 200 m long" vs. "the bridge is 800 m long" embed similarly). BERTScore is a semantic proximity measure, not a fact-verification tool.
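Under the hood this is a greedy bipartite matching over cosine similarities. A minimal numpy sketch of that matching, with toy token embeddings standing in for the contextual BERT vectors (the real `bert-score` package additionally handles tokenisation, IDF weighting, and model selection):

```python
import numpy as np

def greedy_bertscore_f1(ref_emb: np.ndarray, cand_emb: np.ndarray) -> float:
    """BERTScore-style F1 from token embeddings (one row per token)."""
    # Normalise rows so dot products become cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # pairwise cosine similarity matrix
    precision = sim.max(axis=1).mean()      # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()         # each reference token -> best candidate match
    return 2 * precision * recall / (precision + recall)
```

Identical token embeddings score exactly 1.0; a fabricated phrase lowers precision because its tokens have no close reference match.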


2. Vector Embedding Similarity

A related approach is to compute dense embeddings — not at the token level, but at larger spans — and compare them between the source and the summary.

An embedding can be computed for a word, a sentence, a paragraph, or even an entire document. This choice matters enormously:

  • Document-level similarity tells you whether the summary is broadly on-topic, but a summary that covers 80 % of the document faithfully and invents the remaining 20 % will still score well.
  • Sentence-level similarity is more sensitive: embed each sentence in the summary, find its closest match in the source, and flag sentences whose nearest-neighbour similarity falls below a threshold.
  • Paragraph / sliding-window chunking can help when the source is long and a sentence in the summary draws on multiple scattered source passages.
💡 If you are trying to detect unfaithful spans, experiment with different chunk sizes on both the source and summary sides. A claim that looks fine at document level can be exposed as unsupported when you tighten the window.

Embedding similarity works best as a coarse filter: flag candidates for closer inspection rather than making a binary pass/fail decision.
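A minimal sketch of that sentence-level filter, assuming sentence embeddings have already been computed upstream (e.g. with a sentence-transformer model); the 0.7 threshold is an illustrative placeholder to tune on your own data:

```python
import numpy as np

def flag_unsupported(summary_embs: np.ndarray, source_embs: np.ndarray,
                     threshold: float = 0.7) -> list[int]:
    """Return indices of summary sentences whose nearest source
    sentence falls below the cosine-similarity threshold."""
    src = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    summ = summary_embs / np.linalg.norm(summary_embs, axis=1, keepdims=True)
    best = (summ @ src.T).max(axis=1)   # nearest-neighbour similarity per sentence
    return [i for i, s in enumerate(best) if s < threshold]
```

The output is a candidate list for closer inspection, in keeping with the coarse-filter framing above.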

From cosine similarity to Earth Mover's Distance

Comparing a single summary embedding against a single document embedding collapses everything into one number and loses all spatial structure. A more principled framing treats the set of chunk embeddings as a probability distribution over semantic space and asks: how much work would it take to "transport" the reference distribution onto the summary distribution?

This is the Earth Mover's Distance (EMD), also known as the Wasserstein-1 distance. Intuitively:

  • Embed every chunk from the reference document as a point in high-dimensional space.
  • Embed every chunk from the summary.
  • EMD is the minimum total work (mass × distance) needed to rearrange the reference embedding cloud into the summary embedding cloud.

A faithful summary produces embeddings that sit close to their source counterparts — transport cost is low. A hallucinated sentence lands far away in embedding space, contributing a large spike to the EMD even if the rest of the summary is fine.

$$\text{EMD}(\mu, \nu) = \inf_{\gamma \,\in\, \Gamma(\mu,\nu)} \int d(x, y)\, \mathrm{d}\gamma(x, y)$$

where $\mu$ is the reference chunk distribution, $\nu$ the summary chunk distribution, $\Gamma(\mu, \nu)$ the set of all joint distributions (transport plans) with marginals $\mu$ and $\nu$, and $d(x, y)$ the Euclidean distance between embeddings.
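For equally sized, uniformly weighted chunk sets the optimal transport plan reduces to a one-to-one matching, which makes a small self-contained sketch possible (a real pipeline would use `scipy.optimize.linear_sum_assignment` or the POT library instead of brute force, and would support unequal chunk counts):

```python
from itertools import permutations

import numpy as np

def emd_equal_clouds(ref_embs: np.ndarray, summ_embs: np.ndarray) -> float:
    """Exact EMD between two equally sized, uniformly weighted embedding
    clouds: with equal sizes and uniform mass, the optimal transport plan
    is a one-to-one matching, so we brute-force the assignment here."""
    # Pairwise Euclidean distances between reference and summary chunks.
    cost = np.linalg.norm(ref_embs[:, None, :] - summ_embs[None, :, :], axis=-1)
    n = len(ref_embs)
    return min(cost[list(range(n)), list(p)].mean()
               for p in permutations(range(n)))
```

A faithful summary keeps every matched distance small; one displaced (hallucinated) chunk raises the average transport cost.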

Varying chunk sizes and overlapping windows

The EMD score depends heavily on how you partition the text. Three strategies worth running in parallel:

  • Sentence-level chunks are the finest grain — each sentence gets its own embedding. Hallucinated sentences stand out clearly because they have no nearby reference embedding. However, very short sentences embed noisily and boundary effects can split a claim across two chunks.
  • Paragraph-level chunks are more stable but can bury a hallucinated sentence inside an otherwise faithful paragraph, dragging the per-chunk score toward the faithful majority.
  • Sliding windows (typically 50 % overlap) give robustness: every portion of text appears in at least two windows, so a hallucinated phrase is likely captured even when it crosses a sentence boundary.
💡 Run EMD at two or three granularities and take the maximum score. A claim that looks faithful at paragraph level can be exposed at sentence level. The sliding-window variant is particularly effective for hallucinations that span sentence boundaries.
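A sliding-window chunker along those lines is a few lines of Python (sentence splitting is assumed done upstream):

```python
def sliding_windows(sentences: list[str], window: int = 3,
                    overlap: float = 0.5):
    """Yield overlapping windows of sentences (50 % overlap by default),
    so every phrase appears in at least two chunks."""
    step = max(1, int(window * (1 - overlap)))
    for start in range(0, len(sentences), step):
        chunk = sentences[start:start + window]
        if chunk:
            yield " ".join(chunk)
        if start + window >= len(sentences):
            break
```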

Interactive exploration

The visualisation below shows a toy example with reference text (indigo arrows) and summary text (green = faithful, red = hallucinated). Each arrow's position and direction represent one chunk's high-dimensional embedding after dimensionality reduction.

Faithful summaries cluster near their reference counterparts (short transport lines); hallucinated sentences are displaced far away, driving up the EMD. Toggle transport plan to see the coupling lines, and switch between chunk-size modes to see how the score changes.

Earth Mover's Distance · Embedding Space (interactive figure: each chunk maps to an arrow in a 2-D embedding projection; red arrows mark hallucinated claims)

Reference chunks:

  • r1 — The company was founded in 1998 by two engineers from Stanford.
  • r2 — Revenue grew from $4 M to $6 M over the last fiscal year.
  • r3 — The CEO has over 20 years of experience in the semiconductor industry.
  • r4 — They currently operate across 12 countries in Europe and Asia.

Summary chunks:

  • s1 — The company was established in 1998 by Stanford graduates.
  • s2 — Revenue surged 75 % year-on-year, reaching $12 M. ✗ hallucinated
  • s3 — The CEO brings two decades of semiconductor expertise.
  • s4 — Operations span 15 countries, including Latin America. ✗ hallucinated

In this example the normalised EMD comes out at 0.53, with 2 of 4 summary chunks flagged as hallucinated.

3. BLEU, ROUGE, and N-gram Overlap

Before neural metrics existed, evaluation relied on n-gram overlap. These metrics are blunt but fast, interpretable, and — crucially — surprisingly actionable.

What they measure

An n-gram is a contiguous sequence of n tokens. ROUGE-N recall asks: of all n-grams in the reference, what fraction also appear in the candidate?

$$\text{ROUGE-}N_{\text{recall}} = \frac{\sum_{g \in \text{ref}} \min\big(\text{count}_{\text{ref}}(g),\, \text{count}_{\text{hyp}}(g)\big)}{\sum_{g \in \text{ref}} \text{count}_{\text{ref}}(g)}$$

BLEU flips the direction (precision: how much of the candidate appears in the reference) and adds a brevity penalty.
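The clipped-count computation maps directly onto `collections.Counter` intersection; a minimal sketch:

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 2) -> dict:
    """ROUGE-N recall, precision, and F1 via clipped n-gram counts."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, hyp = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & hyp).values())        # Counter & clips counts at the minimum
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```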

Interactive playground

Try it yourself — paste any reference text and summary, choose n, and see exactly which n-grams overlap:

N-gram Overlap Playground (interactive: compare reference and summary, choose n, hover an n-gram to highlight it)

With n = 2 on the example below: ROUGE-2 recall 31.6 %, precision 35.3 %, F1 33.3 %.

Reference (overlapping 2-grams highlighted in green):

the eiffel tower was built between 1887 and 1889 as the entrance arch for the 1889 worlds fair it stands 330 metres tall and was designed by gustave eiffel for 41 years it was the worlds tallest manmade structure

Summary (overlapping 2-grams highlighted in blue):

the eiffel tower designed by gustave eiffel was constructed for the 1889 worlds fair it is approximately 330 metres in height and held the record as the tallest structure on earth for over four decades

The two texts share 12 unique 2-grams.

Low ROUGE scores do not automatically mean hallucination — a high-quality abstractive summary will legitimately paraphrase. But very low scores (especially ROUGE-1 below ~0.3) are a warning sign worth investigating.

LLMs are surprisingly steerable

One practical finding: if you explicitly instruct the model not to rephrase and to use the same wording as the source where possible, ROUGE scores can jump dramatically. In our experiments ROUGE-2 went from ~30 % to ~60 % just by adding that instruction. Your mileage will vary depending on the model and task, but it suggests that low n-gram overlap is sometimes a stylistic choice the model is making, not an inherent limitation of abstractive summarisation.


4. NER-Based Cross-Referencing

N-gram overlap treats every token equally. Named entity recognition (NER) lets you focus specifically on the tokens that carry factual content: names, dates, locations, organisations, and numbers.

Using a library like spaCy, extract all entities from the source document and from the summary, then check:

  • Does every entity in the summary appear (or have a clear antecedent) in the source?
  • Are all numbers in the summary also present in the source?
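Given entity lists already extracted by spaCy (or any NER model), the cross-reference itself is simple set logic. A sketch — the case-insensitive substring matching is a deliberately loose heuristic for "has a clear antecedent":

```python
def unsupported_entities(source_ents: list[str],
                         summary_ents: list[str]) -> list[str]:
    """Entities in the summary with no match in the source.
    Matching allows the summary entity to be a substring of a source
    entity ('Stanford' matches 'Stanford University') or vice versa."""
    src = [e.lower() for e in source_ents]
    flagged = []
    for ent in summary_ents:
        needle = ent.lower()
        if not any(needle in s or s in needle for s in src):
            flagged.append(ent)
    return flagged
```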

Numbers are a red flag

Hallucinated numbers are common and pernicious. An LLM might silently round a figure, invert a ratio, or invent a statistic entirely. A simple rule — flag any number in the summary that does not appear verbatim in the source — catches a surprisingly high fraction of numeric hallucinations.
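A minimal version of that rule (the regex is a simplification — it ignores spelled-out numbers and locale-specific formats):

```python
import re

# Matches integers with optional thousands separators and a decimal part.
NUM = re.compile(r"\d[\d,]*(?:\.\d+)?")

def unverified_numbers(source: str, summary: str) -> list[str]:
    """Numbers in the summary that never appear verbatim in the source."""
    src_nums = set(NUM.findall(source))
    return [n for n in NUM.findall(summary) if n not in src_nums]
```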

Prompt the model explicitly to avoid calculations. If the source says "revenue grew from $4 M to $6 M", the model should not write "revenue grew by 50 %" even if the arithmetic is correct — that derived figure is one more thing that can go wrong.

Informational density

A complementary NER-based signal is informational density: the ratio of unique named entities (or content words) to total tokens. If the summary has substantially lower entity density than the source, the model may be:

  • Padding with generic filler ("It is important to note that…")
  • Looping — a well-known failure mode where the model starts repeating itself as the context window fills up

Track density across many summaries and treat a sharp drop as a quality alert.
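A toy version of the density signal (the stop-word list here is illustrative; a production version would count named entities or POS-tagged content words instead):

```python
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "it", "to",
             "of", "and", "that", "in", "on", "for", "with", "as"}

def informational_density(text: str) -> float:
    """Ratio of unique content words to total tokens. Low values suggest
    generic filler or looping (repetition shrinks the unique set)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    content = {t for t in tokens if t not in STOPWORDS}
    return len(content) / len(tokens)
```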

Language detection

One edge case worth guarding against: the model outputting in the wrong language. In our production pipeline we saw this roughly once in every 5,000 summaries — not enough to be alarming, but enough to reach a user. A simple language-detection check (e.g. langdetect or lingua) costs almost nothing and catches it.


5. QAEVAL

The most principled automatic faithfulness metric we have found is QAEVAL (question-answering evaluation). The idea: if a summary faithfully captures the source, a model that reads only the summary should be able to answer comprehension questions about the source.

Generate questions from the source

Use an LLM to generate a single-choice comprehension test from the source document. A few practical guidelines:

  • Use single-choice (one correct answer from four options), not free-form. This lets you evaluate answers without running another LLM as a judge.
  • Randomise the position of the correct answer across questions — otherwise models learn a positional bias.
  • Always include "I don't know" as an option. A model that has not seen the relevant information should abstain rather than guess.
  • Steer the LLM to draw questions from the most information-dense, fact-rich parts of the source. Generic questions ("What is the document about?") add little signal.

Have the model take the test using only the summary

Feed the question set to the LLM with only the summary as context (no source). For each answer, ask the model to cite the sentence(s) in the summary it used to reach its conclusion.

  • Correct answer + cited sentence: that sentence is likely faithful.
  • Correct answer + no plausible cited sentence: the model may be relying on world knowledge, not the summary — treat as uncertain.
  • Wrong answer: the summary may be missing or contradicting the relevant fact.
  • "I don't know": the summary does not cover that fact (which may or may not be a problem depending on expected coverage).

QAEVAL turns faithfulness evaluation into a structured, auditable process. You can inspect exactly which questions failed and which summary sentences were implicated — far more useful than a single scalar score.

Why not just ask an LLM "is this summary correct?"

We experimented with LLM-as-judge approaches — prompting a model to rate faithfulness on a 1–5 scale or produce a binary pass/fail. We were not convinced. The core problem is circularity: how do you know the judge's judgement is correct? Larger models appear to judge more reliably than smaller ones, but they also produce better summaries in the first place. At some point you have to ask whether you are paying for LLM-as-judge or just paying for a better summariser.


Conclusion

No single metric captures faithfulness. Our current stack layers several signals:

| Signal | What it catches | Blind spots |
|---|---|---|
| BERTScore | Semantic drift | Numeric errors, subtle inversions |
| Embedding similarity | Off-topic passages | Paraphrase of false claims |
| Earth Mover's Distance | Hallucinated spans (semantic displacement) | Short texts; near-boundary hallucinations |
| ROUGE / n-gram overlap | Verbatim deviation | Legitimate paraphrase |
| NER cross-reference | Wrong entities / numbers | Implicit claims |
| Language detection | Wrong-language output | — |
| QAEVAL | Fact-level errors | High setup cost |

A few things we believe with some confidence:

Similarity is not correctness. High BERTScore or cosine similarity tells you the summary is in the right neighbourhood semantically. It does not tell you the facts are right.

LLM-as-judge is better suited to style than to correctness. Asking a model "does this text flow well?" is reasonable. Asking "is every fact in this text supported by the source?" puts the model in the position of needing to do the very thing we are trying to verify.

LLMs are surprisingly steerable — but context length is the enemy. Adding explicit instructions ("do not rephrase", "do not calculate", "use the same numbers as the source") measurably improves faithfulness metrics. The catch is that every instruction consumes context, and we consistently observe quality degrading as the context window fills up. If you are struggling with quality, the most cost-effective intervention may simply be using a model with a larger context window or a higher capacity — rather than investing in an elaborate LLM-as-judge pipeline.
