LLMs write convincingly but fabricate facts. A practical tour of automated detection techniques: BERTScore, embedding similarity, ROUGE/n-gram overlap, NER-based cross-referencing, and QAEVAL.
Large language models are remarkable writers. Ask one to summarise a 10-page document and it will produce something that reads like a confident, well-structured précis — often more fluent than what a human would dash off under time pressure. That fluency is precisely the danger. In our work on document summarisation we found that 10–20 % of facts in LLM-generated summaries are wrong: wrong numbers, invented dates, misattributed quotes, or subtly reversed cause-and-effect.
This post is a practical tour of techniques we use to catch those errors automatically.
Why this is hard
A language model is not a retrieval system. It does not look up facts; it predicts the next token. When context runs thin or two plausible facts compete, it picks whichever continuation fits the probability distribution it learned during training. The result reads fine. The numbers might not be.
Fluency and correctness are orthogonal. A summary can be grammatically perfect and completely fabricated.
The goal of hallucination detection is to measure faithfulness: does every claim in the summary follow from the source document? No single metric answers that question perfectly, so we use a layered approach.
1. BERTScore
BERTScore embeds each token in both the reference text and the candidate summary using a pre-trained language model (typically a BERT variant), then computes pairwise cosine similarities and takes a greedy-matched F1.
Because it operates in semantic embedding space rather than on raw tokens, it rewards paraphrases that preserve meaning. A summary that replaces "constructed" with "built" should not be penalised — BERTScore handles this correctly.
Strengths: tolerant of legitimate paraphrase; correlates better with human judgement than token-overlap metrics on many benchmarks.
Limitations: two texts can have high BERTScore while disagreeing on specific numeric facts ("the bridge is 200 m long" vs. "the bridge is 800 m long" embed similarly). BERTScore is a semantic proximity measure, not a fact-verification tool.
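To make the mechanics concrete, here is a minimal sketch of the greedy-matching step at the heart of BERTScore, using plain NumPy on pre-computed token embeddings. The toy one-hot embeddings stand in for real contextual BERT embeddings — in practice you would use the `bert-score` package, which also handles IDF weighting and baseline rescaling.

```python
import numpy as np

def bertscore_f1(ref_emb: np.ndarray, cand_emb: np.ndarray) -> float:
    """Greedy-matched F1 over token embeddings (the core of BERTScore).

    ref_emb:  (n_ref, d) array of reference-token embeddings
    cand_emb: (n_cand, d) array of candidate-token embeddings
    """
    # Normalise rows so dot products become cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # (n_ref, n_cand) cosine matrix
    recall = sim.max(axis=1).mean()     # each ref token -> best cand match
    precision = sim.max(axis=0).mean()  # each cand token -> best ref match
    return 2 * precision * recall / (precision + recall)

# Toy one-hot "embeddings": identical token sequences score exactly 1.0.
e = np.eye(4)
print(bertscore_f1(e[:3], e[:3]))  # -> 1.0
```

Note that nothing in this computation inspects the *values* of matched tokens — which is exactly why "200 m" and "800 m" can score almost identically.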
2. Vector Embedding Similarity
A related approach is to compute dense embeddings — not at the token level, but at larger spans — and compare them between the source and the summary.
An embedding can be computed for a word, a sentence, a paragraph, or even an entire document. This choice matters enormously:
- Document-level similarity tells you whether the summary is broadly on-topic, but a summary that covers 80 % of the document faithfully and invents the remaining 20 % will still score well.
- Sentence-level similarity is more sensitive: embed each sentence in the summary, find its closest match in the source, and flag sentences whose nearest-neighbour similarity falls below a threshold.
- Paragraph / sliding-window chunking can help when the source is long and a sentence in the summary draws on multiple scattered source passages.
If you are trying to detect unfaithful spans, experiment with different chunk sizes on both the source and summary sides. A claim that looks fine at document level can be exposed as unsupported when you tighten the window.
Embedding similarity works best as a coarse filter: flag candidates for closer inspection rather than making a binary pass/fail decision.
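The sentence-level filter described above can be sketched as follows. The `embed` function here is a deliberately crude placeholder (a hashed bag-of-words) so the example is self-contained; in a real pipeline you would swap in an actual sentence encoder such as a sentence-transformers model, and tune the threshold on labelled data.

```python
import numpy as np

def embed(sentence: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: hashed bag-of-words, L2-normalised.
    Stand-in for a real sentence encoder."""
    v = np.zeros(dim)
    for tok in sentence.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def flag_unsupported(source_sents, summary_sents, threshold=0.5):
    """Return summary sentences whose best cosine match against any
    source sentence falls below the threshold."""
    src = np.stack([embed(s) for s in source_sents])
    flagged = []
    for sent in summary_sents:
        sims = src @ embed(sent)  # cosine vs every source sentence
        if sims.max() < threshold:
            flagged.append(sent)
    return flagged
```

The output is a shortlist for human (or downstream-metric) review, not a verdict — which matches the coarse-filter role described above.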
3. BLEU, ROUGE, and N-gram Overlap
Before neural metrics existed, evaluation relied on n-gram overlap. These metrics are blunt but fast, interpretable, and — crucially — surprisingly actionable.
What they measure
An n-gram is a contiguous sequence of n tokens. ROUGE-N recall asks: of all n-grams in the reference, what fraction also appear in the candidate?
BLEU flips the direction (precision: how much of the candidate appears in the reference) and adds a brevity penalty.
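ROUGE-N recall is simple enough to implement from scratch, which also makes its behaviour easy to inspect. A minimal version with clipped counts:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, candidate: str, n: int = 2) -> float:
    """Of all n-grams in the reference, what fraction also appear in the
    candidate? Counts are clipped, as in standard ROUGE-N."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the bridge is 200 m long",
                     "the bridge is 800 m long", n=2))  # -> 0.6
```

Note the example: a single changed number costs two of the five reference bigrams, so — unlike BERTScore — n-gram overlap is actually quite sensitive to numeric edits. (Production use should go through a maintained implementation such as the `rouge-score` package, which adds stemming and ROUGE-L.)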
Interactive playground
Try it yourself — paste any reference text and summary, choose n, and see exactly which n-grams overlap:
Low ROUGE scores do not automatically mean hallucination — a high-quality abstractive summary will legitimately paraphrase. But very low scores (especially ROUGE-1 below ~0.3) are a warning sign worth investigating.
LLMs are surprisingly steerable
One practical finding: if you explicitly instruct the model not to rephrase and to use the same wording as the source where possible, ROUGE scores can jump dramatically. In our experiments ROUGE-2 went from ~30 % to ~60 % just by adding that instruction. Your mileage will vary depending on the model and task, but it suggests that low n-gram overlap is sometimes a stylistic choice the model is making, not an inherent limitation of abstractive summarisation.
4. NER-Based Cross-Referencing
N-gram overlap treats every token equally. Named entity recognition (NER) lets you focus specifically on the tokens that carry factual content: names, dates, locations, organisations, and numbers.
Using a library like spaCy, extract all entities from the source document and from the summary, then check:
- Does every entity in the summary appear (or have a clear antecedent) in the source?
- Are all numbers in the summary also present in the source?
Numbers are a red flag
Hallucinated numbers are common and pernicious. An LLM might silently round a figure, invert a ratio, or invent a statistic entirely. A simple rule — flag any number in the summary that does not appear verbatim in the source — catches a surprisingly high fraction of numeric hallucinations.
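The verbatim-number rule is a few lines of stdlib Python. This sketch uses a deliberately simple regex that ignores units and locale formatting — good enough as a first-pass filter, with flagged numbers going to closer review rather than being treated as proven hallucinations.

```python
import re

NUMBER = re.compile(r"\d[\d,.]*")

def unverified_numbers(source: str, summary: str) -> list:
    """Return numbers that appear in the summary but not verbatim in the
    source. Trailing punctuation is stripped so '1998.' matches '1998'."""
    src = {m.rstrip(".,") for m in NUMBER.findall(source)}
    return [m.rstrip(".,")
            for m in NUMBER.findall(summary)
            if m.rstrip(".,") not in src]

print(unverified_numbers(
    "Revenue grew from 4 M to 6 M in 2023.",
    "Revenue grew by 50 % to 6 M."))  # -> ['50']
```

Note that the derived-but-correct "50 %" is flagged too — by design, per the no-calculations rule above.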
Prompt the model explicitly to avoid calculations. If the source says "revenue grew from 4 M to 6 M", the model should not write "revenue grew by 50 %" even if the arithmetic is correct — that derived figure is one more thing that can go wrong.
Informational density
A complementary NER-based signal is informational density: the ratio of unique named entities (or content words) to total tokens. If the summary has substantially lower entity density than the source, the model may be:
- Padding with generic filler ("It is important to note that…")
- Looping — a well-known failure mode where the model starts repeating itself as the context window fills up
Track density across many summaries and treat a sharp drop as a quality alert.
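A rough density check does not even need a full NER model. The sketch below uses a crude heuristic — capitalised words and numbers as a stand-in for real entities — which is enough to illustrate the signal; a production version would count `doc.ents` from spaCy instead.

```python
def info_density(text: str) -> float:
    """Ratio of unique 'content-bearing' tokens (capitalised words and
    numbers -- a crude stand-in for NER output) to total tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    content = {t.strip(".,") for t in tokens
               if t[:1].isupper() or any(ch.isdigit() for ch in t)}
    return len(content) / len(tokens)

dense = info_density("Acme Corp posted 6 M revenue in Berlin in 2023")
filler = info_density("It is important to note that things generally happened")
print(dense > filler)  # entity-rich text scores higher
```

Because a `set` is used, a looping model that repeats the same entities over and over also shows up as a density drop, not just one that pads with filler.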
Language detection
One edge case worth guarding against: the model outputting in the wrong language. In our production pipeline we saw this roughly once in every 5,000 summaries — not enough to be alarming, but enough to reach a user. A simple language-detection check (e.g. langdetect or lingua) costs almost nothing and catches it.
5. QAEVAL
The most principled automatic faithfulness metric we have found is QAEVAL (question-answering evaluation). The idea: if a summary faithfully captures the source, a model that reads only the summary should be able to answer comprehension questions about the source.
Generate questions from the source
Use an LLM to generate a single-choice comprehension test from the source document. A few practical guidelines:
- Use single-choice (one correct answer from four options), not free-form. This lets you evaluate answers without running another LLM as a judge.
- Randomise the position of the correct answer across questions — otherwise models learn a positional bias.
- Always include "I don't know" as an option. A model that has not seen the relevant information should abstain rather than guess.
- Steer the LLM to draw questions from the most information-dense, fact-rich parts of the source. Generic questions ("What is the document about?") add little signal.
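The LLM produces the stem, the correct answer, and the distractors; the mechanical parts of the guidelines above — randomising the answer position and always appending an abstention option — are best done deterministically in code. A sketch (the dict schema is an illustrative assumption, not a fixed format):

```python
import random

def build_question(stem: str, correct: str, distractors: list,
                   rng: random.Random) -> dict:
    """Assemble one single-choice item: shuffle the correct answer in
    among the distractors, then append a fixed 'I don't know' option."""
    options = distractors + [correct]
    rng.shuffle(options)                 # randomise answer position
    options.append("I don't know")       # abstention is always last
    return {"stem": stem,
            "options": options,
            "answer_index": options.index(correct)}

rng = random.Random(42)  # seeded for reproducible test sets
q = build_question("How long is the bridge?", "200 m",
                   ["400 m", "600 m", "800 m"], rng)
```

Keeping "I don't know" in a fixed final position (rather than shuffling it in) makes abstentions trivial to detect when scoring.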
Have the model take the test using only the summary
Feed the question set to the LLM with only the summary as context (no source). For each answer, ask the model to cite the sentence(s) in the summary it used to reach its conclusion.
- Correct answer + cited sentence: that sentence is likely faithful.
- Correct answer + no plausible cited sentence: the model may be relying on world knowledge, not the summary — treat as uncertain.
- Wrong answer: the summary may be missing or contradicting the relevant fact.
- "I don't know": the summary does not cover that fact (which may or may not be a problem depending on expected coverage).
QAEVAL turns faithfulness evaluation into a structured, auditable process. You can inspect exactly which questions failed and which summary sentences were implicated — far more useful than a single scalar score.
Why not just ask an LLM "is this summary correct?"
We experimented with LLM-as-judge approaches — prompting a model to rate faithfulness on a 1–5 scale or produce a binary pass/fail. We were not convinced. The core problem is circularity: how do you know the judge's judgement is correct? Larger models appear to judge more reliably than smaller ones, but they also produce better summaries in the first place. At some point you have to ask whether you are paying for LLM-as-judge or just paying for a better summariser.
Conclusion
No single metric captures faithfulness. Our current stack layers several signals:
| Signal | What it catches | Blind spots |
|---|---|---|
| BERTScore | Semantic drift | Numeric errors, subtle inversions |
| Embedding similarity | Off-topic passages | Paraphrase of false claims |
| ROUGE / n-gram overlap | Verbatim deviation | Legitimate paraphrase |
| NER cross-reference | Wrong entities / numbers | Implicit claims |
| Language detection | Wrong-language output | — |
| QAEVAL | Fact-level errors | High setup cost |
A few things we believe with some confidence:
Similarity is not correctness. High BERTScore or cosine similarity tells you the summary is in the right neighbourhood semantically. It does not tell you the facts are right.
LLM-as-judge is better suited to style than to correctness. Asking a model "does this text flow well?" is reasonable. Asking "is every fact in this text supported by the source?" puts the model in the position of needing to do the very thing we are trying to verify.
LLMs are surprisingly steerable — but context length is the enemy. Adding explicit instructions ("do not rephrase", "do not calculate", "use the same numbers as the source") measurably improves faithfulness metrics. The catch is that every instruction consumes context, and we consistently observe quality degrading as the context window fills up. If you are struggling with quality, the most cost-effective intervention may simply be a model with a larger context window or more capacity — rather than an elaborate LLM-as-judge pipeline.