LoRA and QLoRA: Efficient Fine-Tuning
The Problem
Full fine-tuning of a 7B-parameter model requires roughly 28GB of GPU memory just for the model weights in 32-bit precision (4 bytes per parameter), before counting gradients and optimizer states. A 70B model? Forget about it on consumer hardware.
LoRA (Low-Rank Adaptation)
Instead of updating all model weights, LoRA freezes the original weights and adds small trainable "adapter" matrices:
- Original weight matrix: W (d × d, frozen)
- LoRA adapters: A (d × r) and B (r × d), where r << d
- Effective weight: W + A×B (see the code sketch below)
- Typical rank r = 8-64, reducing trainable parameters by 100-1000x
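To make the mechanics concrete, here is a minimal sketch of a LoRA-adapted linear layer in plain PyTorch, using the A (d × r) / B (r × d) convention from the list above. It is illustrative only, not the PEFT library's actual implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update A @ B, scaled by alpha / r."""
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad = False                  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # d x r, small random init
        self.B = nn.Parameter(torch.zeros(r, d_out))         # r x d, zero init so W is unchanged at step 0
        self.scaling = alpha / r

    def forward(self, x):
        # Same result as multiplying by the effective weight W + A @ B,
        # but without ever materializing the full d x d update.
        return self.W(x) + (x @ self.A @ self.B) * self.scaling

Only A and B receive gradients during training, and because A @ B can be merged into W afterwards, the adapted model adds no inference latency.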
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantization:
- Base model loaded in 4-bit precision (NF4 quantization)
- LoRA adapters trained in 16-bit precision
- Fine-tune a 65B model on a single 48GB GPU!
- Virtually no quality loss: the QLoRA paper reports results matching 16-bit full fine-tuning on its benchmarks
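In code, the 4-bit base model can be loaded with Hugging Face transformers and bitsandbytes before the adapters are attached. A minimal sketch, assuming a Llama-style checkpoint (the model id is a placeholder); the result is what gets passed as base_model in the example below.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in 16-bit
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)  # casts norms to fp32, enables input grads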
Practical Example with Hugging Face
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # Rank
    lora_alpha=32,                         # Scaling factor
    target_modules=["q_proj", "v_proj"],   # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# → trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.124
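After training, only the adapter weights need to be saved, and they can later be merged back into the base model for deployment. A brief sketch with the PEFT API (the directory name is a placeholder, and merging assumes the base model is loaded in full or half precision rather than 4-bit):

model.save_pretrained("lora-adapter")   # writes only the adapter weights, typically a few MB

from peft import PeftModel
reloaded = PeftModel.from_pretrained(base_model, "lora-adapter")
merged = reloaded.merge_and_unload()    # folds A @ B into W, returning a plain model with no extra latency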
Key Hyperparameters
- Rank (r): Higher = more capacity but more parameters. Start with 16.
- Alpha: Scaling factor. Common rule: alpha = 2 × rank
- Target modules: Usually attention layers (q_proj, v_proj, k_proj, o_proj)
- Learning rate: Higher than full fine-tuning (1e-4 to 3e-4); see the training sketch below
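Putting the rules of thumb into context, here is a minimal training sketch with the Hugging Face Trainer, reusing the PEFT-wrapped model from the earlier example. The output path, batch sizes, and train_dataset are assumptions, and tokenization/collation details are omitted.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="lora-out",              # placeholder path
    learning_rate=2e-4,                 # higher than typical full fine-tuning rates
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size of 16
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,                        # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,        # assumed: a tokenized causal-LM dataset
)
trainer.train()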
Daisy+ in Action: Future Fine-Tuning Ready
While Daisy+ currently relies on prompt engineering rather than fine-tuning, the architecture supports future fine-tuning workflows: conversation logs from digital employees could be used to train specialized LoRA adapters for industry-specific language patterns. Imagine a LoRA adapter trained on thousands of successful customer support conversations, making responses faster and more domain-accurate.