Dataset Preparation for Fine-Tuning

Data Format

Most fine-tuning frameworks expect data in a conversational format, where each training example is a list of role-tagged messages:

[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of diabetes?"},
      {"role": "assistant", "content": "Common symptoms of diabetes include..."}
    ]
  }
]
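Malformed examples are cheaper to catch before training than after. The following is a minimal sketch of a format check, assuming the dataset is a JSON array shaped like the example above and that only the roles `system`, `user`, and `assistant` are allowed. The function names `validate_example` and `validate_dataset` are illustrative and do not belong to any particular framework.

```python
import json

# Assumption: only these three roles appear in the training data.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one example (empty list if it looks fine)."""
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # The final turn should be the assistant response the model learns to produce.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_dataset(json_text):
    """Map example index -> list of problems, for examples that have any."""
    data = json.loads(json_text)
    report = {}
    for i, example in enumerate(data):
        problems = validate_example(example)
        if problems:
            report[i] = problems
    return report
```

An empty report means every example passed these structural checks. It says nothing about whether the responses are correct.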

Data Quality Principles

  1. Quality over quantity: 1,000 excellent examples often beat 100,000 mediocre ones
  2. Diversity: Cover the range of inputs the model will see in production
  3. Consistency: All examples should follow the same format and style
  4. Accuracy: Every response must be factually correct, because the model will learn errors too
  5. Edge cases: Include examples of how to handle unusual or difficult inputs

Data Sources

  • Manual creation: Highest quality, most expensive. Best for specialized tasks.
  • Synthetic generation: Use a strong model (Claude/GPT-4) to generate training data. Very effective for distillation.
  • Existing data: Convert logs, documentation, support tickets into training format.
  • Public datasets: Hugging Face Hub has thousands of instruction-tuning datasets.
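As an illustration of the "existing data" route, here is a minimal sketch that converts support tickets into the conversational format. The ticket fields `question` and `resolution` are hypothetical, so substitute whatever your ticketing export actually provides. The system prompt is also supplied by you.

```python
def ticket_to_example(ticket, system_prompt):
    """Convert one ticket dict into a training example, or None if it is unusable.

    Assumption: tickets carry hypothetical 'question' and 'resolution' fields.
    """
    question = (ticket.get("question") or "").strip()
    resolution = (ticket.get("resolution") or "").strip()
    if not question or not resolution:
        return None  # skip unresolved or empty tickets
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": resolution},
        ]
    }


def tickets_to_dataset(tickets, system_prompt):
    """Convert a list of tickets, dropping those that cannot be converted."""
    examples = (ticket_to_example(t, system_prompt) for t in tickets)
    return [ex for ex in examples if ex is not None]
```

Converted tickets still need review: a resolution written for an internal audience may not read like the response you want the model to produce.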

Data Cleaning Checklist

  • Remove duplicates and near-duplicates
  • Fix formatting inconsistencies
  • Remove PII (personally identifiable information)
  • Validate that responses are actually correct
  • Balance the dataset across different task types
  • Split into train/validation/test sets (e.g., 80/10/10), after deduplication so the same example cannot land in both training and evaluation data
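Several checklist items can be automated. The sketch below covers three of them under simplifying assumptions: it removes only exact duplicates after normalizing case and whitespace (near-duplicate detection needs a fuzzier method), it redacts only email addresses with a simple regex (real PII removal covers far more), and it performs a seeded random 80/10/10 split. All function names are illustrative.

```python
import hashlib
import random
import re

# Assumption: a deliberately simple email pattern; it is not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def example_key(example):
    """Hash of the normalized conversation, used to detect duplicates."""
    joined = "\n".join(
        f"{m['role']}:{normalize(m['content'])}" for m in example["messages"]
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def clean(examples):
    """Redact emails, then drop duplicates, keeping the first occurrence."""
    seen, cleaned = set(), []
    for ex in examples:
        redacted = {
            "messages": [
                {"role": m["role"], "content": redact_emails(m["content"])}
                for m in ex["messages"]
            ]
        }
        key = example_key(redacted)
        if key not in seen:
            seen.add(key)
            cleaned.append(redacted)
    return cleaned


def split_dataset(examples, seed=0, train=0.8, val=0.1):
    """Shuffle with a fixed seed and split; the remainder becomes the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Calling `clean` before `split_dataset` matters: splitting first would let duplicate examples end up on both sides of the train/test boundary.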

How Much Data?

Rough guidelines by goal (actual needs vary with the base model and the task):

  • Style/format transfer: 50-200 examples
  • Domain adaptation: 500-5,000 examples
  • New capability: 5,000-50,000+ examples

🌼 Daisy+ in Action: Built-In Training Data Pipeline

Every interaction with Daisy+ digital employees generates structured training data: user messages, agent responses, tool calls made, and outcomes. This data pipeline is designed to support future fine-tuning while respecting privacy: all data stays within the organization's own ERP instance, and no customer data ever leaves the system for training purposes.
