Metrics

Hallucination detection

from rag_audit.metrics.hallucination import FaithfulnessResult, HallucinationDetector

HallucinationDetector uses an LLM as a faithfulness judge. It asks the model whether every claim in the answer is explicitly supported by the provided contexts.

HallucinationDetector

detector = HallucinationDetector(llm, threshold=0.5)
result = detector.detect(answer, contexts)

| Parameter | Type | Description |
|---|---|---|
| llm | BaseChatModel | Any LangChain chat model |
| threshold | float | Minimum score for an answer to count as faithful (default: 0.5) |

FaithfulnessResult

| Field | Type | Description |
|---|---|---|
| score | float | Faithfulness score from 0.0 to 1.0 |
| reasoning | str | LLM explanation |
| is_faithful | bool | True if score >= threshold |
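
The relationship between score, threshold, and is_faithful can be illustrated with a small standalone sketch. It does not use rag_audit itself: `FaithfulnessSketch` and `to_result` are hypothetical names that only mirror the three fields listed above, and clamping the score to the 0.0–1.0 range is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class FaithfulnessSketch:
    """Hypothetical stand-in mirroring the documented FaithfulnessResult fields."""
    score: float
    reasoning: str
    is_faithful: bool


def to_result(score: float, reasoning: str, threshold: float = 0.5) -> FaithfulnessSketch:
    # Assumption: scores outside 0.0-1.0 are clamped before comparison.
    clamped = min(1.0, max(0.0, score))
    # Documented rule: an answer is faithful when score >= threshold.
    return FaithfulnessSketch(clamped, reasoning, clamped >= threshold)


borderline = to_result(0.5, "All claims supported, one only weakly.")
low = to_result(0.3, "Two claims have no supporting context.")
print(borderline.is_faithful, low.is_faithful)
```

Because the comparison is inclusive, a score exactly equal to the threshold counts as faithful. Raising the threshold makes the check stricter without changing the score itself.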

Retrieval metrics

from rag_audit.metrics.retrieval import RetrievalResult, evaluate_retrieval

evaluate_retrieval

result = evaluate_retrieval(retrieved, relevant, k)

| Parameter | Type | Description |
|---|---|---|
| retrieved | list[str] | Retrieved chunks in rank order |
| relevant | list[str] | Ground-truth relevant chunks |
| k | int | Cutoff for top-k evaluation |

RetrievalResult

| Field | Type | Description |
|---|---|---|
| precision_at_k | float | Fraction of top-k that are relevant |
| recall_at_k | float | Fraction of relevant chunks in top-k |
| mrr | float | Mean Reciprocal Rank |
| k | int | Effective k used |
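
For readers who want to check the numbers by hand, the three metrics can be sketched in plain Python. This is not the rag_audit implementation: `RetrievalSketch` and `evaluate_retrieval_sketch` are hypothetical names, and several details are assumptions (chunks are matched by exact string equality, the effective k is clamped to the number of retrieved chunks, and the reciprocal rank is taken from the first relevant chunk within the top-k).

```python
from dataclasses import dataclass


@dataclass
class RetrievalSketch:
    """Hypothetical stand-in mirroring the documented RetrievalResult fields."""
    precision_at_k: float
    recall_at_k: float
    mrr: float
    k: int


def evaluate_retrieval_sketch(retrieved: list[str], relevant: list[str], k: int) -> RetrievalSketch:
    # Assumption: k is clamped to the number of retrieved chunks.
    effective_k = max(0, min(k, len(retrieved)))
    top_k = retrieved[:effective_k]
    relevant_set = set(relevant)

    # Precision counts relevant positions in the top-k.
    hits = sum(1 for chunk in top_k if chunk in relevant_set)
    precision = hits / effective_k if effective_k else 0.0

    # Recall counts distinct relevant chunks found, so duplicates cannot push it above 1.
    found = len(set(top_k) & relevant_set)
    recall = found / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant chunk in the top-k, 0.0 if none.
    mrr = 0.0
    for rank, chunk in enumerate(top_k, start=1):
        if chunk in relevant_set:
            mrr = 1.0 / rank
            break

    return RetrievalSketch(precision, recall, mrr, effective_k)


result = evaluate_retrieval_sketch(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
print(result)
```

In this example only "b" falls inside the top 3, so precision and recall are both 1/3 and the reciprocal rank is 1/2. For a single query the value is a reciprocal rank; the mean in the metric's name applies when it is averaged over many queries.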