Chunker
```python
from rag_audit.chunker import (
    ChunkingEvaluator,
    ChunkingReport,
    ChunkingStrategyReport,
    FixedSizeChunker,
    RecursiveChunker,
    SemanticChunker,
)
```
Strategies
All strategies implement BaseChunker and expose a single method:
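The method itself is not shown in this section. As a rough sketch only, a strategy interface of this kind might look like the following. The class names and the method name `chunk` with its signature are assumptions for illustration, not the confirmed `rag_audit` API.

```python
from abc import ABC, abstractmethod


class BaseChunkerSketch(ABC):
    """Hypothetical stand-in for BaseChunker; the method name is an assumption."""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into an ordered list of chunks."""


class ParagraphChunker(BaseChunkerSketch):
    """Toy strategy: one chunk per blank-line-separated paragraph."""

    def chunk(self, text: str) -> list[str]:
        # Drop empty pieces left over from leading or trailing blank lines.
        return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Under this assumption, any object with the same single method could be passed wherever a chunker is expected.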
FixedSizeChunker
Splits text into fixed-length character windows with optional overlap.
| Parameter | Default | Description |
|---|---|---|
| `chunk_size` | `500` | Maximum characters per chunk |
| `overlap` | `50` | Characters shared between consecutive chunks |
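To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal standalone sketch of fixed-size windowing. The function name `fixed_size_chunks` and the validation rules are assumptions; the library's own implementation may handle edge cases differently.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Slice text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous one,
    so consecutive windows share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing duplicate tail.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.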
RecursiveChunker
Splits text hierarchically, trying separators in order from coarsest to finest: `\n\n` → `\n` → `.` → space → individual characters.
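The separator hierarchy can be sketched as a short recursive function. This is an illustration of the general technique, not the library's code: the name `recursive_split` and the `chunk_size` parameter are assumptions (this section does not list parameters for `RecursiveChunker`), and the sketch drops separators and does not merge small pieces back together.

```python
SEPARATORS = ["\n\n", "\n", ".", " "]  # after these, fall back to raw characters


def recursive_split(text: str, chunk_size: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        # Small enough already; skip pieces that are empty or whitespace-only.
        return [text] if text.strip() else []
    if not separators:
        # No separators left: cut into plain character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

For example, `recursive_split("aaa bbb\n\nccc", 7)` splits once on the blank line and stops, because both resulting pieces already fit within 7 characters.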
SemanticChunker
Groups consecutive sentences into chunks based on embedding similarity. Creates a new chunk when similarity between adjacent sentences drops below the threshold.
| Parameter | Default | Description |
|---|---|---|
| `embeddings` | — | Any `langchain_core.embeddings.Embeddings` |
| `similarity_threshold` | `0.8` | Minimum cosine similarity to keep sentences in the same chunk |
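The grouping rule can be illustrated without any real embedding model. The sketch below assumes a regex-based sentence splitter and uses a toy `KeywordEmbeddings` class that only mimics the `embed_documents` method of the LangChain interface; both are hypothetical stand-ins, and the library's sentence splitting may differ.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class KeywordEmbeddings:
    """Toy embedder (illustration only): counts a few keywords per text."""

    vocab = ("cat", "dog", "stock", "market")

    def embed_documents(self, texts):
        return [[float(t.lower().count(w)) for w in self.vocab] for t in texts]


def semantic_chunks(text, embeddings, similarity_threshold=0.8):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = embeddings.embed_documents(sentences)

    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < similarity_threshold:
            groups.append([sentence])    # similarity dropped: open a new chunk
        else:
            groups[-1].append(sentence)  # still similar: extend the current chunk
    return [" ".join(group) for group in groups]
```

Calling `semantic_chunks("The cat sat. The cat slept. The stock market fell.", KeywordEmbeddings())` keeps the two cat sentences together and places the third sentence in its own chunk.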
ChunkingEvaluator
Measures the semantic cohesion of each strategy as the average cosine similarity between each chunk and the full document.
```python
evaluator = ChunkingEvaluator(embeddings)
report = evaluator.evaluate(text, {"fixed": chunker_a, "recursive": chunker_b})
```
| Parameter | Description |
|---|---|
| `embeddings` | Any `langchain_core.embeddings.Embeddings` |
| `text` | Document to chunk |
| `chunkers` | Dict mapping strategy names to chunker instances |
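The cohesion metric reduces to a handful of cosine similarities. The following sketch shows the arithmetic on pre-computed vectors; the function name `cohesion_stats` and the all-zero result for an empty chunk list are assumptions, and a real run would obtain the vectors from the supplied embeddings object.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cohesion_stats(chunk_vectors, document_vector):
    """Return avg/min/max chunk-to-document similarity for one strategy."""
    scores = [cosine(v, document_vector) for v in chunk_vectors]
    if not scores:
        # Assumption: a strategy that produced no chunks scores zero.
        return {"avg": 0.0, "min": 0.0, "max": 0.0}
    return {"avg": sum(scores) / len(scores), "min": min(scores), "max": max(scores)}


# Two chunk vectors compared against one document vector.
stats = cohesion_stats([[1.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
```

Here the second chunk points in the same direction as the document (similarity 1), while the first covers only one of its two dimensions (similarity 1/√2), so the average lands between the two.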
Models
ChunkingStrategyReport
| Field | Type | Description |
|---|---|---|
| `strategy` | `str` | Strategy name |
| `chunk_count` | `int` | Number of chunks produced |
| `avg_cohesion` | `float` | Average cosine similarity (chunk ↔ document) |
| `min_cohesion` | `float` | Minimum cohesion across all chunks |
| `max_cohesion` | `float` | Maximum cohesion across all chunks |
| `avg_chunk_length` | `float` | Average character length per chunk |
ChunkingReport
| Field | Type | Description |
|---|---|---|
| `strategies` | `list[ChunkingStrategyReport]` | Per-strategy results |
| `best_strategy` | `str` | Name of the strategy with the highest `avg_cohesion` |
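As a rough picture of how the two models fit together, the sketch below mirrors the documented fields with plain dataclasses and picks `best_strategy` by the highest `avg_cohesion`. The class names, the use of dataclasses, and the `build_report` helper are assumptions; the library may use a different model base class and tie-breaking rule.

```python
from dataclasses import dataclass


@dataclass
class StrategyReportSketch:
    """Hypothetical mirror of ChunkingStrategyReport."""
    strategy: str
    chunk_count: int
    avg_cohesion: float
    min_cohesion: float
    max_cohesion: float
    avg_chunk_length: float


@dataclass
class ReportSketch:
    """Hypothetical mirror of ChunkingReport."""
    strategies: list[StrategyReportSketch]
    best_strategy: str


def build_report(results: list[StrategyReportSketch]) -> ReportSketch:
    """Pick the strategy with the highest avg_cohesion (first one wins a tie)."""
    best = max(results, key=lambda r: r.avg_cohesion)  # raises ValueError if empty
    return ReportSketch(strategies=list(results), best_strategy=best.strategy)


report = build_report([
    StrategyReportSketch("fixed", 12, 0.71, 0.55, 0.83, 480.0),
    StrategyReportSketch("recursive", 9, 0.78, 0.60, 0.90, 610.5),
])
```

In this made-up example `report.best_strategy` is `"recursive"`, since 0.78 is the larger `avg_cohesion`.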