
Hallucination detectors (9)

Hallucination detectors look for fabricated or unsupported content — phantom citations, made-up authorities, internal contradictions, stale knowledge stated as current, confidence degradation across turns, and over-eager agreement with user claims.

Reference

ID     | Detector                          | Type       | Detects
HAL-01 | PhantomCitationDetector           | Rule-based | Fake DOIs, arXiv IDs, .invalid / .nonexistent domains
HAL-02 | SelfConsistencyDetector           | Rule-based | Numeric inconsistency (values differing by >10×)
HAL-03 | CrossAgentContradictionDetector   | Semantic   | Contradictions between agents in a multi-agent session
HAL-04 | SourceGroundingDetector           | Semantic   | Claims unsupported by provided context
HAL-05 | ConfidenceDecayDetector           | Semantic   | Confidence degradation across turns
HAL-06 | StaleKnowledgeDetector            | Semantic   | Time-sensitive facts stated as current ("the latest version is X", "the current CEO is Y")
HAL-07 | IntraSessionContradictionDetector | Semantic   | Model contradicts itself within the same conversation
HAL-08 | GroundlessStatisticDetector       | Rule-based | Specific percentages / statistics asserted without any source in the provided context
HAL-09 | UncertaintyPropagationDetector    | Semantic   | Hedged statements that contradict a definitive assertion in the same response
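
Each detector can be tuned individually through the opts.Configure<T>() hook shown later on this page. As a minimal sketch using only the Enabled and SeverityCap settings that appear elsewhere in this documentation, a single-agent deployment might switch off HAL-03 and soften HAL-02:

opts.Configure<CrossAgentContradictionDetector>(c => c.Enabled = false); // single agent: nothing to cross-check
opts.Configure<SelfConsistencyDetector>(c => c.SeverityCap = Severity.Low); // 10x mismatches are often unit or magnitude artifacts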

When these matter

Hallucinations are quieter than security threats: they don't typically trigger Quarantine because the response looks fine. They're best handled with the Alert or Log actions:

opts.OnHigh = SentinelAction.Alert; // route High hallucinations to ops dashboard
opts.OnMedium = SentinelAction.Log; // log everything else for analysis

Pair the audit feed with downstream review tooling (manual spot-checks, structured grading, or feedback loops to fine-tuning data). The detectors flag suspect responses; humans decide whether to act.
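
How the audit feed is consumed depends on your integration. As a hypothetical sketch (the OnDetection callback and detection shape below are illustrative assumptions, not a documented API), a thin bridge into a review queue might look like:

// Hypothetical hook; adapt to however your build exposes the audit feed.
opts.OnDetection = detection =>
{
    if (detection.Severity >= Severity.Medium)
        reviewQueue.Enqueue(detection); // feeds manual spot-checks or structured grading downstream
};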

Source-grounding detector — context matters

HAL-04 SourceGroundingDetector expects the provided context (system prompt, retrieved documents, tool messages) to be embedded alongside the assistant message. If your context is empty or trivial, this detector will fire on every assertion. Best results come from the following (a sketch follows the list):

  • A non-empty system prompt
  • Retrieved documents passed via tool messages or system instructions
  • Multi-turn conversations where prior turns supply grounding
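
Concretely, a grounded request bundles the system prompt and retrieval output alongside the assistant turn so HAL-04 has something to verify claims against. The ChatMessage shape and scanner.ScanAsync call below are illustrative assumptions rather than the documented API:

var messages = new[]
{
    new ChatMessage("system", "Answer only from the provided documents."), // non-empty system prompt
    new ChatMessage("tool", retrievedDocsText),                            // retrieved documents supply grounding
    new ChatMessage("user", "What does the Q3 report say about churn?"),
    new ChatMessage("assistant", modelResponse),                           // the turn being checked
};
var result = await scanner.ScanAsync(messages, opts);                      // hypothetical entry point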

When you don't have grounding context — fully ungrounded chat-style usage — disable this detector:

opts.Configure<SourceGroundingDetector>(c => c.Enabled = false);

Stale knowledge — date-sensitive

HAL-06 StaleKnowledgeDetector doesn't know what year your model thinks it is. It flags time-sensitive phrasing ("currently", "as of today", "the latest version", "the current X is Y") because those statements decay fastest. False positives are common when the model legitimately has up-to-date information; tune via:

opts.Configure<StaleKnowledgeDetector>(c => c.SeverityCap = Severity.Low);
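
If your pipeline always injects fresh facts through retrieval, so "the latest version" genuinely is the latest, you can also switch the detector off entirely with the same Enabled flag shown for HAL-04:

opts.Configure<StaleKnowledgeDetector>(c => c.Enabled = false); // skip stale-knowledge checks when retrieval supplies current data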

Severity ranges

Detector                         | Typical severity | Notes
HAL-01 PhantomCitation           | High             | Fake DOI is a hard signal — no benign explanation
HAL-02 SelfConsistency           | Medium           | 10× numeric mismatch is suspicious; sometimes legitimate (units, magnitudes)
HAL-03 CrossAgent                | High             | Multi-agent contradictions undermine workflows
HAL-04 SourceGrounding           | Medium           | Many false positives when grounding context is sparse
HAL-05 ConfidenceDecay           | Low/Medium       | Trend-based; rarely Critical
HAL-06 StaleKnowledge            | Low              | High false-positive rate; route to Log
HAL-07 IntraSessionContradiction | High             | Within-conversation contradictions are unambiguous
HAL-08 GroundlessStatistic       | Medium           | Numeric claims without source are a known LLM failure mode
HAL-09 UncertaintyPropagation    | Low              | Style signal; helpful in audit but rarely actionable
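
Pulling the table together, a reasonable starting point routes High findings to the ops dashboard, logs the rest, and caps the two noisiest detectors. Treat it as a sketch: Severity.Medium as a cap value is assumed to exist alongside the Severity.Low shown above.

opts.OnHigh = SentinelAction.Alert;   // HAL-01/03/07 produce hard signals worth a human look
opts.OnMedium = SentinelAction.Log;   // HAL-02/04/08 go to the audit log for offline review
opts.Configure<SourceGroundingDetector>(c => c.SeverityCap = Severity.Medium); // noisy when grounding is sparse
opts.Configure<StaleKnowledgeDetector>(c => c.SeverityCap = Severity.Low);     // high false-positive rate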