Metrics
Hallucination detection
from rag_audit.metrics.hallucination import FaithfulnessResult, HallucinationDetector
HallucinationDetector uses an LLM as a faithfulness judge. It asks the model whether every claim in the answer is explicitly supported by the provided contexts.
HallucinationDetector
detector = HallucinationDetector(llm, threshold=0.5)
result = detector.detect(answer, contexts)
| Parameter | Type | Description |
| --- | --- | --- |
| llm | BaseChatModel | Any LangChain chat model |
| threshold | float | Minimum score to consider an answer faithful (default: 0.5) |
FaithfulnessResult
| Field | Type | Description |
| --- | --- | --- |
| score | float | 0.0–1.0 faithfulness score |
| reasoning | str | LLM explanation |
| is_faithful | bool | True if score >= threshold |
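
The relationship between `score`, `threshold`, and `is_faithful` can be sketched without the library. The snippet below is an illustration only and should not be read as rag_audit's implementation: `SketchResult`, `judge_faithfulness`, the prompt wording, and the `SCORE:` reply format are all hypothetical, and the judge is a plain callable standing in for a chat model.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchResult:
    """Hypothetical stand-in mirroring the FaithfulnessResult fields."""
    score: float
    reasoning: str
    is_faithful: bool


def judge_faithfulness(
    judge: Callable[[str], str],
    answer: str,
    contexts: list[str],
    threshold: float = 0.5,
) -> SketchResult:
    # Assumed prompt wording; the real detector's prompt is not documented here.
    prompt = (
        "Contexts:\n" + "\n".join(f"- {c}" for c in contexts)
        + f"\n\nAnswer:\n{answer}\n\n"
        "Is every claim in the answer explicitly supported by the contexts? "
        "Reply as 'SCORE: <0.0-1.0>' followed by a short explanation."
    )
    reply = judge(prompt)

    # Assumed reply format: take the first number after 'SCORE:'.
    match = re.search(r"SCORE:\s*([0-9]*\.?[0-9]+)", reply)
    raw = float(match.group(1)) if match else 0.0
    score = min(1.0, max(0.0, raw))  # clamp into the documented 0.0-1.0 range

    reasoning = reply[match.end():].strip() if match else reply.strip()
    return SketchResult(score, reasoning, is_faithful=score >= threshold)


# A fake judge so the sketch runs without any model or API key.
def fake_judge(prompt: str) -> str:
    return "SCORE: 0.8 The answer restates the first context."


result = judge_faithfulness(
    fake_judge,
    answer="Paris is in France.",
    contexts=["Paris is the capital of France."],
)
print(result.score, result.is_faithful)
```

In this sketch an unparseable reply falls back to a score of 0.0, so the answer is treated as unfaithful rather than silently passing. Whether the library makes the same choice is not stated in this reference.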
Retrieval metrics
from rag_audit.metrics.retrieval import RetrievalResult, evaluate_retrieval
evaluate_retrieval
result = evaluate_retrieval(retrieved, relevant, k)
| Parameter | Type | Description |
| --- | --- | --- |
| retrieved | list[str] | Retrieved chunks in rank order |
| relevant | list[str] | Ground-truth relevant chunks |
| k | int | Cutoff for top-k evaluation |
RetrievalResult
| Field | Type | Description |
| --- | --- | --- |
| precision_at_k | float | Fraction of top-k that are relevant |
| recall_at_k | float | Fraction of relevant chunks in top-k |
| mrr | float | Mean Reciprocal Rank |
| k | int | Effective k used |
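
To make the field definitions concrete, here is a minimal sketch of how these numbers are conventionally computed for a single query. It is not the library's code: `SketchRetrievalResult` and `sketch_evaluate_retrieval` are hypothetical names, "effective k" is assumed to mean `k` capped at the number of retrieved chunks, and `mrr` is assumed to be the reciprocal rank of the first relevant chunk anywhere in the ranked list.

```python
from dataclasses import dataclass


@dataclass
class SketchRetrievalResult:
    """Hypothetical stand-in mirroring the RetrievalResult fields."""
    precision_at_k: float
    recall_at_k: float
    mrr: float
    k: int


def sketch_evaluate_retrieval(
    retrieved: list[str], relevant: list[str], k: int
) -> SketchRetrievalResult:
    relevant_set = set(relevant)

    # Assumption: "effective k" is k capped at the number of retrieved chunks.
    effective_k = min(k, len(retrieved))
    top_k = retrieved[:effective_k]

    # Assumes retrieved chunks are unique; duplicates would inflate the hit count.
    hits = sum(1 for chunk in top_k if chunk in relevant_set)
    precision = hits / effective_k if effective_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant chunk (1-based); 0.0 if none is found.
    # Assumption: the search covers the full ranked list, not only the top k.
    mrr = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            mrr = 1.0 / rank
            break

    return SketchRetrievalResult(precision, recall, mrr, effective_k)


result = sketch_evaluate_retrieval(
    retrieved=["c3", "c1", "c7", "c2"],
    relevant=["c1", "c2"],
    k=3,
)
print(result)
```

With these inputs the top three chunks contain one of the two relevant chunks, so precision is 1/3 and recall is 1/2. The first relevant chunk sits at rank 2, giving a reciprocal rank of 0.5.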