Metrics

Hallucination detection

from rag_audit.metrics.hallucination import FaithfulnessResult, HallucinationDetector

`HallucinationDetector` uses an LLM as a faithfulness judge. It asks the model whether every claim in the answer is explicitly supported by the provided contexts.

`HallucinationDetector`

detector = HallucinationDetector(llm, threshold=0.5)
result = detector.detect(answer, contexts)

| Parameter | Type | Description |
|---|---|---|
| `llm` | `BaseChatModel` | Any LangChain chat model |
| `threshold` | `float` | Minimum score to consider an answer faithful (default: `0.5`) |

`FaithfulnessResult`

| Field | Type | Description |
|---|---|---|
| `score` | `float` | 0.0–1.0 faithfulness score |
| `reasoning` | `str` | LLM explanation |
| `is_faithful` | `bool` | `True` if `score >= threshold` |
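
The relationship between `score`, `threshold`, and `is_faithful` can be illustrated without calling an LLM. The sketch below is not the library's code: `FaithfulnessSketch` and `apply_threshold` are hypothetical names that only mirror the documented fields and the documented rule `score >= threshold`.

```python
from dataclasses import dataclass


@dataclass
class FaithfulnessSketch:
    """Hypothetical stand-in mirroring the documented FaithfulnessResult fields."""

    score: float
    reasoning: str
    is_faithful: bool


def apply_threshold(score: float, reasoning: str, threshold: float = 0.5) -> FaithfulnessSketch:
    """Build a result from a judge score using the documented rule: score >= threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    return FaithfulnessSketch(score=score, reasoning=reasoning, is_faithful=score >= threshold)


# A score exactly at the threshold counts as faithful, because the comparison is >=.
borderline = apply_threshold(0.5, "All claims are supported by the contexts.")
weak = apply_threshold(0.3, "Two claims do not appear in the contexts.")
print(borderline.is_faithful, weak.is_faithful)
```

Raising `threshold` makes the check stricter: the same judge score of `0.5` would no longer pass with `threshold=0.7`.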

Retrieval metrics

from rag_audit.metrics.retrieval import RetrievalResult, evaluate_retrieval

`evaluate_retrieval`

result = evaluate_retrieval(retrieved, relevant, k)

| Parameter | Type | Description |
|---|---|---|
| `retrieved` | `list[str]` | Retrieved chunks in rank order |
| `relevant` | `list[str]` | Ground-truth relevant chunks |
| `k` | `int` | Cutoff for top-k evaluation |

`RetrievalResult`

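| Field | Type | Description |
|---|---|---|
| `precision_at_k` | `float` | Fraction of the top-k retrieved chunks that are relevant |
| `recall_at_k` | `float` | Fraction of the relevant chunks that appear in the top-k |
| `mrr` | `float` | Mean Reciprocal Rank (for a single query, the reciprocal of the rank of the first relevant chunk) |
| `k` | `int` | Effective k used |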
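
To make the field definitions concrete, here is a minimal standalone sketch of the three metrics for a single query. It is not the implementation behind `evaluate_retrieval`; `RetrievalSketch` and `retrieval_metrics_sketch` are hypothetical names. Several behaviors are assumptions that the library may handle differently: the effective k is taken as the number of chunks actually available in the top-k, the reciprocal rank is searched only within the top-k, and empty inputs yield `0.0`.

```python
from dataclasses import dataclass


@dataclass
class RetrievalSketch:
    """Hypothetical stand-in mirroring the documented RetrievalResult fields."""

    precision_at_k: float
    recall_at_k: float
    mrr: float
    k: int


def retrieval_metrics_sketch(retrieved: list[str], relevant: list[str], k: int) -> RetrievalSketch:
    """Compute precision@k, recall@k, and reciprocal rank for one query."""
    if k <= 0:
        raise ValueError("k must be a positive integer")

    top_k = retrieved[:k]
    effective_k = len(top_k)  # assumption: shrink k when fewer chunks were retrieved
    relevant_set = set(relevant)

    # Precision counts relevant positions in the top-k.
    hits = sum(1 for chunk in top_k if chunk in relevant_set)
    precision = hits / effective_k if effective_k else 0.0

    # Recall counts distinct relevant chunks found, so duplicates cannot exceed 1.0.
    found = relevant_set.intersection(top_k)
    recall = len(found) / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant chunk (ranks start at 1), 0.0 if none.
    rr = 0.0
    for rank, chunk in enumerate(top_k, start=1):
        if chunk in relevant_set:
            rr = 1.0 / rank
            break

    return RetrievalSketch(precision_at_k=precision, recall_at_k=recall, mrr=rr, k=effective_k)


metrics = retrieval_metrics_sketch(
    retrieved=["a", "b", "c", "d"],
    relevant=["b", "d", "e"],
    k=3,
)
print(metrics)
```

In this example only `"b"` is both relevant and in the top 3, at rank 2, so precision and recall are each 1/3 and the reciprocal rank is 0.5.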