Chunking Strategies
Why Chunking Matters
Documents must be split into smaller chunks before embedding, and the chunking strategy dramatically affects retrieval quality: chunks that are too large pull in irrelevant text and produce noisy results, while chunks that are too small lose the surrounding context needed to interpret them.
Common Strategies
1. Fixed-Size Chunking
Split by character/token count with overlap:
```python
chunk_size = 512     # tokens per chunk
chunk_overlap = 50   # tokens shared between adjacent chunks
```
Simple but can split mid-sentence or mid-thought.
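A minimal sketch of fixed-size chunking over a pre-tokenized document (the tokenizer itself is assumed; here a "token" is just an element of a list, and the function names are illustrative):

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks
```

For a 1,200-token document this yields three chunks, and each chunk's first 50 tokens repeat the previous chunk's last 50.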
2. Recursive Character Splitting
Split by hierarchy of separators: paragraphs → sentences → words. Tries to keep semantic units together.
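The separator hierarchy can be sketched as follows (a simplified character-based version; the separator list and length limit are assumptions, and production splitters handle more edge cases):

```python
def recursive_split(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    """Split text by a hierarchy of separators, recursing to a finer
    separator only for pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No separators left: fall back to a hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate          # still fits: keep accumulating
        elif len(part) > max_len:
            if current:
                chunks.append(current)
            chunks.extend(recursive_split(part, max_len, rest))
            current = ""                 # piece too big even alone: recurse
        else:
            if current:
                chunks.append(current)
            current = part               # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraphs first means a chunk boundary falls mid-sentence only when a single sentence already exceeds the limit.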
3. Semantic Chunking
Use embeddings to detect topic shifts. Group sentences with high similarity together. More expensive but produces coherent chunks.
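The core loop can be sketched like this, with `embed` standing in for a real embedding model (an assumption of this sketch) and a similarity threshold marking topic shifts:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences; start a new chunk whenever
    similarity to the previous sentence drops below the threshold."""
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])        # topic shift: open a new chunk
        else:
            chunks[-1].append(sent)      # same topic: extend current chunk
        prev_vec = vec
    return [" ".join(c) for c in chunks]
```

Variants compare each sentence to the running average of the current chunk rather than only its neighbor; the threshold is a tunable that depends on the embedding model.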
4. Document-Structure-Aware
Use document structure (headers, sections, paragraphs) as natural boundaries. Best for well-structured content like documentation or articles.
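For markdown-style documentation, a minimal version just cuts at header lines (this sketch ignores edge cases such as `#` characters inside fenced code blocks):

```python
def split_by_headers(markdown_text):
    """Split a markdown document into sections at each header line,
    keeping the header together with its body as one chunk."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Each chunk carries its own header, which doubles as useful metadata for retrieval.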
Best Practices
- Chunk size: 256-1024 tokens is typical. Smaller for Q&A, larger for summarization.
- Overlap: 10-20% overlap prevents losing context at boundaries
- Metadata: Always attach source document, section, page number, and other metadata to chunks
- Parent-child: Retrieve small chunks but return the parent section for more context
- Test empirically: The best strategy depends on your specific data and use case
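The metadata and parent-child points combine naturally: index small child chunks that each point back to their parent section, then return the parent on a hit. A toy sketch (the word-overlap scoring stands in for real vector search, and all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    parent_id: str                                 # section this chunk came from
    metadata: dict = field(default_factory=dict)   # source, section, page, etc.

def build_index(sections, child_size=100):
    """Split each parent section into small child chunks that point
    back to the parent, so retrieval can return whole sections."""
    parents, children = {}, []
    for i, section in enumerate(sections):
        pid = f"sec-{i}"
        parents[pid] = section
        for start in range(0, len(section), child_size):
            children.append(Chunk(section[start:start + child_size], pid,
                                  {"source": "doc", "section": i}))
    return parents, children

def retrieve_parent(parents, children, query):
    """Toy retrieval: pick the child with the most query-word overlap,
    then return its parent section for fuller context."""
    def score(chunk):
        return len(set(query.lower().split()) & set(chunk.text.lower().split()))
    best = max(children, key=score)
    return parents[best.parent_id]
```

Matching on small chunks keeps retrieval precise; returning the parent restores the context the generator needs.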
Advanced: Contextual Retrieval
Anthropic's contextual retrieval technique prepends a short context summary to each chunk before embedding. This significantly improves retrieval accuracy by giving each chunk awareness of the document it came from.
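The mechanism is just string concatenation before embedding; in practice the context blurb comes from an LLM prompted with the full document, which the `summarize` callable stands in for here (an assumption of this sketch):

```python
def contextualize(doc_title, section, chunk_text, summarize):
    """Prepend a short, generated context blurb to a chunk before
    embedding, so the chunk 'knows' which document and section it
    belongs to. `summarize` is a placeholder for an LLM call."""
    context = summarize(doc_title, section, chunk_text)
    return f"{context}\n\n{chunk_text}"
```

The augmented text is what gets embedded and indexed; the original chunk text is typically still what gets passed to the generator.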
🌼 Daisy+ in Action: Smart Document Processing
When processing long documents (contracts, RFPs, policy documents) for the sign_oca digital signing workflow, Daisy+ chunks content intelligently — by section headers and semantic boundaries rather than fixed token counts — to maintain context for AI analysis. This ensures that when a digital employee summarizes a 50-page contract, no critical clause gets split across chunks and lost.