Chunking Strategies

Why Chunking Matters

Documents must be split into smaller chunks before embedding. The chunking strategy has a major effect on retrieval quality: chunks that are too large dilute the embedding and return noisy matches, while chunks that are too small lose the surrounding context.

Common Strategies

1. Fixed-Size Chunking

Split by character/token count with overlap:

chunk_size = 512     # tokens
chunk_overlap = 50   # tokens

Simple but can split mid-sentence or mid-thought.
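
A minimal sketch of fixed-size chunking with overlap. It assumes the tiktoken package for token counting; the encoding name is an illustrative choice, and any tokenizer would work:

import tiktoken

def fixed_size_chunks(text, chunk_size=512, chunk_overlap=50):
    # Encode to tokens so the size limit is measured in tokens, not characters.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

Each chunk shares its last 50 tokens with the start of the next one, which is what the overlap setting above controls.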

2. Recursive Character Splitting

Split by hierarchy of separators: paragraphs → sentences → words. Tries to keep semantic units together.
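
A sketch of the idea in plain Python; production splitters also merge small pieces back together up to the size limit, which is omitted here for brevity:

SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraphs -> lines -> sentences -> words

def recursive_split(text, max_chars=1000, separators=SEPARATORS):
    # A piece that already fits is returned as-is.
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_chars:
            chunks.append(piece)
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
    return chunks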

3. Semantic Chunking

Use embeddings to detect topic shifts. Group sentences with high similarity together. More expensive but produces coherent chunks.
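
A sketch of the approach; embed() stands in for any sentence-embedding model, and the similarity threshold is an illustrative value you would tune on your own data:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.75):
    # Start a new chunk whenever similarity to the previous sentence drops,
    # which is taken as a sign of a topic shift.
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks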

4. Document-Structure-Aware

Use document structure (headers, sections, paragraphs) as natural boundaries. Best for well-structured content like documentation or articles.
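
For Markdown-style content, a sketch of header-based splitting might look like this; documents with other structure (HTML headings, numbered sections) would need a matching parser:

import re

HEADER = re.compile(r"^#{1,6}\s+.*$", re.MULTILINE)

def split_by_headers(markdown_text):
    # Each header starts a new chunk and stays attached to the text below it.
    starts = [m.start() for m in HEADER.finditer(markdown_text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first header
    bounds = starts + [len(markdown_text)]
    sections = [markdown_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]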

Best Practices

  • Chunk size: 256-1024 tokens is typical. Use smaller chunks for Q&A, larger ones for summarization.
  • Overlap: An overlap of 10-20% preserves context across chunk boundaries.
  • Metadata: Always attach metadata such as the source document, section, and page number to each chunk.
  • Parent-child: Retrieve small chunks but return the parent section for more context (see the sketch after this list).
  • Test empirically: The best strategy depends on your specific data and use case.
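
A sketch of the parent-child pattern referenced above. The search_children callable stands in for your vector-store query, and the record field names are illustrative assumptions:

def build_index(sections, child_splitter):
    # Embed and search the small child chunks, but keep a map back to the
    # larger parent section each one came from.
    parents, children = {}, []
    for parent_id, section in enumerate(sections):
        parents[parent_id] = section
        for chunk in child_splitter(section):
            children.append({"text": chunk, "parent_id": parent_id})
    return parents, children  # the children records go into the vector store

def retrieve_with_parents(query, search_children, parents, k=4):
    hits = search_children(query, k=k)  # returns matching child records
    seen, results = set(), []
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in seen:  # de-duplicate while preserving ranking order
            seen.add(pid)
            results.append(parents[pid])
    return results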

Advanced: Contextual Retrieval

Anthropic's contextual retrieval technique prepends a short context summary to each chunk before embedding. This significantly improves retrieval accuracy by giving each chunk awareness of the document it came from.
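
A sketch of the idea; the generate() helper stands in for whatever LLM call you use, and the prompt wording is a paraphrase of the published technique rather than the exact prompt:

CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write one short sentence situating this chunk within the overall "
    "document, to improve search retrieval. Answer with only that sentence."
)

def contextualize(document, chunks, generate):
    contextualized = []
    for chunk in chunks:
        # Prepend the generated context so it is embedded together with the chunk.
        context = generate(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized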

🌼 Daisy+ in Action: Smart Document Processing

When processing long documents (contracts, RFPs, policy documents) for the sign_oca digital signing workflow, Daisy+ chunks content intelligently — by section headers and semantic boundaries rather than fixed token counts — to maintain context for AI analysis. This ensures that when a digital employee summarizes a 50-page contract, no critical clause gets split across chunks and lost.
