
Chunker

from rag_audit.chunker import (
    ChunkingEvaluator,
    ChunkingReport,
    ChunkingStrategyReport,
    FixedSizeChunker,
    RecursiveChunker,
    SemanticChunker,
)

Strategies

All strategies implement BaseChunker and expose a single method:

chunks: list[str] = chunker.chunk(text)
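As an illustration of the single-method interface, the sketch below shows what a custom strategy could look like. The `Chunker` protocol and `ParagraphChunker` class are hypothetical stand-ins written for this example; they are not part of `rag_audit`, and the real `BaseChunker` may declare more than is shown here.

```python
import re
from typing import Protocol


class Chunker(Protocol):
    """Hypothetical stand-in for the BaseChunker interface: one chunk() method."""

    def chunk(self, text: str) -> list[str]: ...


class ParagraphChunker:
    """Illustrative strategy (not in rag_audit): one chunk per paragraph."""

    def chunk(self, text: str) -> list[str]:
        # Split on blank lines and drop empty or whitespace-only pieces.
        parts = re.split(r"\n\s*\n", text)
        return [p.strip() for p in parts if p.strip()]


def count_chunks(chunker: Chunker, text: str) -> int:
    # Any object exposing chunk(text) -> list[str] can be passed here.
    return len(chunker.chunk(text))
```

Because callers only rely on `chunk(text)`, strategies written this way can be swapped without changing the calling code.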

FixedSizeChunker

Splits text into fixed-length character windows with optional overlap.

FixedSizeChunker(chunk_size=500, overlap=50)

| Parameter | Default | Description |
| --- | --- | --- |
| chunk_size | 500 | Maximum characters per chunk |
| overlap | 50 | Characters shared between consecutive chunks |
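The windowing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the function name `fixed_size_chunks` is made up here, and it assumes each window starts `chunk_size - overlap` characters after the previous one.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Assumption of this sketch: consecutive windows start
    (chunk_size - overlap) characters apart.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text, so the tail
        # is not emitted again as a shorter duplicate.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the one before it.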

RecursiveChunker

Splits text hierarchically, trying separators in order: \n\n (paragraph) → \n (line) → . (sentence) → character.

RecursiveChunker(chunk_size=500, overlap=0)
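The hierarchical fallback can be illustrated with a short recursive function. This is a sketch under stated assumptions rather than the library's code: `recursive_chunks` and `SEPARATORS` are names invented for the example, the sentence separator is assumed to be a period followed by a space, overlap is omitted for brevity, and separators at chunk boundaries are dropped.

```python
SEPARATORS = ("\n\n", "\n", ". ")  # assumed order: paragraph, line, sentence


def recursive_chunks(text: str, chunk_size: int = 500,
                     separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: plain character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Small neighbouring pieces are merged back together up to `chunk_size`, so the coarser boundaries are preferred whenever they fit.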

SemanticChunker

Groups consecutive sentences into chunks based on embedding similarity. Creates a new chunk when similarity between adjacent sentences drops below the threshold.

SemanticChunker(embeddings, similarity_threshold=0.8)

| Parameter | Default | Description |
| --- | --- | --- |
| embeddings | (required) | Any langchain_core.embeddings.Embeddings |
| similarity_threshold | 0.8 | Minimum cosine similarity to keep sentences in the same chunk |
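The grouping rule described above can be sketched as follows. Everything here is illustrative: `semantic_chunks`, `cosine`, and the toy `KeywordEmbeddings` class are made up for the example, and the regex-based sentence split is an assumption. The only piece borrowed from the real interface is the `embed_documents` method, which `langchain_core.embeddings.Embeddings` implementations provide.

```python
import math
import re


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class KeywordEmbeddings:
    """Toy stand-in for an Embeddings object: counts a few keywords."""

    VOCAB = ("cat", "dog", "stock", "market")

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [[float(t.lower().count(w)) for w in self.VOCAB] for t in texts]


def semantic_chunks(text: str, embeddings, similarity_threshold: float = 0.8) -> list[str]:
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = embeddings.embed_documents(sentences)
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < similarity_threshold:
            groups.append([sentence])    # similarity dropped: start a new chunk
        else:
            groups[-1].append(sentence)  # still on topic: extend current chunk
    return [" ".join(group) for group in groups]
```

A higher threshold produces more, smaller chunks, because adjacent sentences must be more similar to stay together.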

ChunkingEvaluator

Scores each strategy by semantic cohesion: the average cosine similarity between the embedding of each chunk and the embedding of the full document.

evaluator = ChunkingEvaluator(embeddings)
report = evaluator.evaluate(text, {"fixed": chunker_a, "recursive": chunker_b})

| Parameter | Description |
| --- | --- |
| embeddings | Any langchain_core.embeddings.Embeddings |
| text | Document to chunk |
| chunkers | Dict mapping strategy names to chunker instances |
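To make the cohesion metric concrete, here is a minimal sketch of how the per-strategy numbers could be computed. It is not the evaluator's actual code: `cohesion_summary`, `cosine`, and the toy `KeywordEmbeddings` class are hypothetical, and the sketch assumes the full document is embedded with the same `embed_documents` call as the chunks.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class KeywordEmbeddings:
    """Toy stand-in for an Embeddings object: counts two keywords."""

    VOCAB = ("cat", "dog")

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [[float(t.lower().count(w)) for w in self.VOCAB] for t in texts]


def cohesion_summary(text: str, chunks: list[str], embeddings) -> dict:
    """Compare every chunk embedding with the whole-document embedding."""
    if not chunks:
        raise ValueError("need at least one chunk")
    doc_vec = embeddings.embed_documents([text])[0]
    scores = [cosine(vec, doc_vec) for vec in embeddings.embed_documents(chunks)]
    return {
        "chunk_count": len(chunks),
        "avg_cohesion": sum(scores) / len(scores),
        "min_cohesion": min(scores),
        "max_cohesion": max(scores),
        "avg_chunk_length": sum(len(c) for c in chunks) / len(chunks),
    }
```

The dictionary keys mirror the report fields so the connection to the models is easy to follow.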

Models

ChunkingStrategyReport

| Field | Type | Description |
| --- | --- | --- |
| strategy | str | Strategy name |
| chunk_count | int | Number of chunks produced |
| avg_cohesion | float | Average cosine similarity (chunk ↔ document) |
| min_cohesion | float | Minimum cohesion across all chunks |
| max_cohesion | float | Maximum cohesion across all chunks |
| avg_chunk_length | float | Average character length per chunk |

ChunkingReport

| Field | Type | Description |
| --- | --- | --- |
| strategies | list[ChunkingStrategyReport] | Per-strategy results |
| best_strategy | str | Name of the strategy with highest avg_cohesion |
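The selection rule for best_strategy can be sketched with a small dataclass. `StrategyResult` and `pick_best` are hypothetical names that mirror only part of the real report models, the cohesion values are placeholders rather than measurements, and returning the first entry on a tie is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class StrategyResult:
    """Hypothetical, trimmed-down mirror of ChunkingStrategyReport."""

    strategy: str
    chunk_count: int
    avg_cohesion: float


def pick_best(results: list[StrategyResult]) -> str:
    """Return the name of the strategy with the highest avg_cohesion."""
    if not results:
        raise ValueError("no strategies to compare")
    # max() keeps the first entry on ties; that tie-break is an assumption.
    return max(results, key=lambda r: r.avg_cohesion).strategy


# Placeholder values for illustration only.
results = [
    StrategyResult("fixed", chunk_count=12, avg_cohesion=0.71),
    StrategyResult("recursive", chunk_count=9, avg_cohesion=0.84),
    StrategyResult("semantic", chunk_count=7, avg_cohesion=0.79),
]
best = pick_best(results)
```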