Chunking Strategies
Why Chunking Matters
Documents must be split into smaller chunks before embedding, and the chunking strategy dramatically affects retrieval quality: chunks that are too large pull in irrelevant text and produce noisy results, while chunks that are too small lose the surrounding context needed to interpret them.
Common Strategies
1. Fixed-Size Chunking
Split by character/token count with overlap:
```python
chunk_size = 512     # tokens per chunk
chunk_overlap = 50   # tokens shared between adjacent chunks
```
Simple but can split mid-sentence or mid-thought.
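A minimal sketch of fixed-size chunking over a pre-tokenized document (the tokenizer itself is assumed; here a "token" is just an element of a list, and the function names are illustrative):

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks
```

For a 1,200-token document this yields three chunks, and each chunk's first 50 tokens repeat the previous chunk's last 50.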
2. Recursive Character Splitting
Split by hierarchy of separators: paragraphs → sentences → words. Tries to keep semantic units together.
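The separator hierarchy can be sketched as follows (a simplified character-based version; the separator list and length limit are assumptions, and production splitters handle more edge cases):

```python
def recursive_split(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    """Split text by a hierarchy of separators, recursing to a finer
    separator only for pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No separators left: fall back to a hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate          # still fits: keep accumulating
        elif len(part) > max_len:
            if current:
                chunks.append(current)
            chunks.extend(recursive_split(part, max_len, rest))
            current = ""                 # piece too big even alone: recurse
        else:
            if current:
                chunks.append(current)
            current = part               # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraphs first means a chunk boundary falls mid-sentence only when a single sentence already exceeds the limit.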
3. Semantic Chunking
Use embeddings to detect topic shifts. Group sentences with high similarity together. More expensive but produces coherent chunks.
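The core loop can be sketched like this, with `embed` standing in for a real embedding model (an assumption of this sketch) and a similarity threshold marking topic shifts:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences; start a new chunk whenever
    similarity to the previous sentence drops below the threshold."""
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sent])        # topic shift: open a new chunk
        else:
            chunks[-1].append(sent)      # same topic: extend current chunk
        prev_vec = vec
    return [" ".join(c) for c in chunks]
```

Variants compare each sentence to the running average of the current chunk rather than only its neighbor; the threshold is a tunable that depends on the embedding model.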
4. Document-Structure-Aware
Use document structure (headers, sections, paragraphs) as natural boundaries. Best for well-structured content like documentation or articles.
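For markdown-style documentation, a minimal version just cuts at header lines (this sketch ignores edge cases such as `#` characters inside fenced code blocks):

```python
def split_by_headers(markdown_text):
    """Split a markdown document into sections at each header line,
    keeping the header together with its body as one chunk."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Each chunk carries its own header, which doubles as useful metadata for retrieval.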
Best Practices
- Chunk size: 256-1024 tokens is typical. Smaller for Q&A, larger for summarization.
- Overlap: 10-20% overlap prevents losing context at boundaries
- Metadata: Always attach source document, section, page number, and other metadata to chunks
- Parent-child: Retrieve small chunks but return the parent section for more context
- Test empirically: The best strategy depends on your specific data and use case
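The metadata and parent-child points combine naturally: index small child chunks that each point back to their parent section, then return the parent on a hit. A toy sketch (the word-overlap scoring stands in for real vector search, and all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    parent_id: str                                 # section this chunk came from
    metadata: dict = field(default_factory=dict)   # source, section, page, etc.

def build_index(sections, child_size=100):
    """Split each parent section into small child chunks that point
    back to the parent, so retrieval can return whole sections."""
    parents, children = {}, []
    for i, section in enumerate(sections):
        pid = f"sec-{i}"
        parents[pid] = section
        for start in range(0, len(section), child_size):
            children.append(Chunk(section[start:start + child_size], pid,
                                  {"source": "doc", "section": i}))
    return parents, children

def retrieve_parent(parents, children, query):
    """Toy retrieval: pick the child with the most query-word overlap,
    then return its parent section for fuller context."""
    def score(chunk):
        return len(set(query.lower().split()) & set(chunk.text.lower().split()))
    best = max(children, key=score)
    return parents[best.parent_id]
```

Matching on small chunks keeps retrieval precise; returning the parent restores the context the generator needs.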
Advanced: Contextual Retrieval
Anthropic's contextual retrieval technique prepends a short context summary to each chunk before embedding. This significantly improves retrieval accuracy by giving each chunk awareness of the document it came from.
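The mechanism is just string concatenation before embedding; in practice the context blurb comes from an LLM prompted with the full document, which the `summarize` callable stands in for here (an assumption of this sketch):

```python
def contextualize(doc_title, section, chunk_text, summarize):
    """Prepend a short, generated context blurb to a chunk before
    embedding, so the chunk 'knows' which document and section it
    belongs to. `summarize` is a placeholder for an LLM call."""
    context = summarize(doc_title, section, chunk_text)
    return f"{context}\n\n{chunk_text}"
```

The augmented text is what gets embedded and indexed; the original chunk text is typically still what gets passed to the generator.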
🌼 Daisy+ in Action: Smart Document Processing
When processing long documents (contracts, RFPs, policy documents) for the sign_oca digital signing workflow, Daisy+ chunks content intelligently — by section headers and semantic boundaries rather than fixed token counts — to maintain context for AI analysis. This ensures that when a digital employee summarizes a 50-page contract, no critical clause gets split across chunks and lost.