
Key Architectures: Transformer, GPT, and Claude

The Transformer Architecture

The Transformer is the foundation of all modern LLMs. Its key components are:

  • Self-Attention: Allows each token to attend to every other token, capturing long-range dependencies (see the attention sketch after this list)
  • Multi-Head Attention: Multiple attention heads learn different types of relationships
  • Feed-Forward Networks: Process the attended representations through dense layers
  • Layer Normalization & Residual Connections: Stabilize training of very deep networks
  • Positional Encoding: Since attention is permutation-invariant, position information must be added explicitly
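
To make the first two bullets concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. The shapes, the head count, and the toy inputs are illustrative assumptions for the example, not any specific model's configuration.

```python
# Minimal sketch of scaled dot-product attention and multi-head attention.
# Dimensions and inputs are toy values chosen for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., seq, seq) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # block disallowed positions
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split the model dimension into heads, attend per head, then recombine."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape to (num_heads, seq_len, d_head) so each head attends independently.
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: 4 tokens, model dimension 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 8)
```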

GPT Architecture (Decoder-Only)

GPT (Generative Pre-trained Transformer) uses only the decoder half of the original Transformer:

  • Autoregressive: Generates text left-to-right, one token at a time
  • Causal masking: Each token can only attend to previous tokens (no "peeking" ahead); see the mask and decoding sketch after this list
  • Pre-training: Next token prediction on massive text corpora
  • Fine-tuning: Instruction tuning and RLHF for alignment with human preferences
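
The sketch below shows the two generation-time ideas from this list: a lower-triangular causal mask and a greedy autoregressive decoding loop. The `toy_logits` function is a stand-in assumption for a real decoder-only forward pass, and the vocabulary size and token values are arbitrary.

```python
# Sketch of causal masking and greedy autoregressive decoding.
# `toy_logits` is a placeholder assumption, not a real transformer.
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def toy_logits(tokens, vocab_size=10):
    # Placeholder for a decoder-only forward pass: one logit vector per position.
    rng = np.random.default_rng(sum(tokens))
    return rng.normal(size=(len(tokens), vocab_size))

def generate(prompt, max_new_tokens=5):
    """Greedy loop: feed the whole sequence, append the argmax of the last position."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)
        tokens.append(int(np.argmax(logits[-1])))
    return tokens

print(causal_mask(4).astype(int))  # 1 = allowed to attend, 0 = masked out
print(generate([1, 2, 3]))
```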

Claude's Architecture Principles

While specific architecture details are proprietary, Anthropic's Claude models are built on these principles:

  • Constitutional AI (CAI): Training methodology that uses a set of principles to guide model behavior
  • RLHF + RLAIF: Combines human feedback with AI-generated feedback for alignment
  • Long context: Designed to handle very long documents (200K+ tokens)
  • Safety-first design: Trained to be helpful, harmless, and honest

Encoder-Only vs Decoder-Only vs Encoder-Decoder

Type            | Examples           | Best For
Encoder-Only    | BERT, RoBERTa      | Classification, NER, embeddings
Decoder-Only    | GPT, Claude, Llama | Text generation, chat, reasoning
Encoder-Decoder | T5, BART           | Translation, summarization

🌼 Daisy+ in Action: Transformers in Practice

Daisy+ leverages Claude's extended thinking and tool-use capabilities through its FastAPI gateway. The MCP (Model Context Protocol) server lets any Claude-powered application interact with ERP data natively — reading invoices, creating tasks, searching products — all through structured tool calls. The transformer architecture's ability to attend to long contexts means a digital employee can review an entire customer history before composing a response.
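
As a hypothetical illustration of what such a structured tool call can look like, the sketch below defines one tool and a matching call and result. The tool name `search_products`, its parameters, and the result fields are assumptions made for this example, not Daisy+'s actual interface.

```python
# Hypothetical sketch of a tool definition an MCP-style server could expose
# to a Claude-powered client, plus an example call and result.
# All names and fields below are illustrative assumptions.
search_products_tool = {
    "name": "search_products",
    "description": "Search the ERP product catalog by free-text query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "limit": {"type": "integer", "description": "Maximum results to return"},
        },
        "required": ["query"],
    },
}

# A structured call the model might emit, and the structured result the
# gateway would return for the model to read on its next turn.
example_call = {"tool": "search_products", "input": {"query": "laptop stand", "limit": 5}}
example_result = {"products": [{"id": 1042, "name": "Laptop stand, aluminium", "price": 39.0}]}
```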
