Dataset Preparation for Fine-Tuning
Data Format
Most fine-tuning frameworks expect data in a conversational format, where each example is a list of messages with a role and content:
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful medical assistant."},
      {"role": "user", "content": "What are the symptoms of diabetes?"},
      {"role": "assistant", "content": "Common symptoms of diabetes include..."}
    ]
  }
]
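Before training, it can help to check that every example actually follows this structure. The sketch below is one possible validator, not part of any particular framework: the function name `validate_example`, the allowed role set, and the rule that the last message must come from the assistant are all assumptions made for illustration.

```python
# Minimal sketch of a format check for conversational training examples.
# ALLOWED_ROLES and the "last message is from the assistant" rule are
# assumptions; adjust them to whatever your framework expects.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one example (empty if none)."""
    if not isinstance(example, dict):
        return ["example must be an object"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems
```

Running this over the whole dataset and printing the index of each failing example gives a quick list of records to fix or drop.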
Data Quality Principles
- Quality over quantity: 1,000 excellent examples often beat 100,000 mediocre ones
- Diversity: Cover the range of inputs the model will see in production
- Consistency: All examples should follow the same format and style
- Accuracy: Every response must be factually correct, because the model will learn errors too
- Edge cases: Include examples of how to handle unusual or difficult inputs
Data Sources
- Manual creation: Highest quality, most expensive. Best for specialized tasks.
- Synthetic generation: Use a strong model (such as Claude or GPT-4) to generate training data. Very effective for distillation.
- Existing data: Convert logs, documentation, support tickets into training format.
- Public datasets: Hugging Face Hub has thousands of instruction-tuning datasets.
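As an illustration of the "existing data" route, the sketch below converts support tickets into the conversational format shown earlier. The ticket field names (`question`, `resolution`), the system prompt, and the function names are hypothetical; real ticket exports will have their own schema.

```python
# Minimal sketch: turn support tickets into conversational training examples.
# The "question" and "resolution" field names and SYSTEM_PROMPT are
# hypothetical placeholders, not a real ticketing-system schema.
SYSTEM_PROMPT = "You are a helpful support assistant."


def ticket_to_example(ticket, system_prompt=SYSTEM_PROMPT):
    """Build one training example, or return None if the ticket is unusable."""
    question = (ticket.get("question") or "").strip()
    answer = (ticket.get("resolution") or "").strip()
    if not question or not answer:
        return None  # skip tickets with no question or no resolution
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def convert_tickets(tickets):
    """Convert a list of tickets, dropping the ones that cannot be used."""
    examples = (ticket_to_example(t) for t in tickets)
    return [e for e in examples if e is not None]
```

The resulting list can be written out with `json.dump` to produce a file in the format shown under Data Format. Converted examples still need the review and cleaning steps described below, since raw tickets often contain incorrect answers or personal data.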
Data Cleaning Checklist
- Remove duplicates and near-duplicates
- Fix formatting inconsistencies
- Remove PII (personally identifiable information)
- Validate that responses are actually correct
- Balance the dataset across different task types
- Split into train/validation/test sets (80/10/10 is a common starting point)
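Several items on this checklist can be partly automated. The sketch below covers three of them under simplifying assumptions: duplicates are detected by exact match after lowercasing and collapsing whitespace (true near-duplicate detection needs fuzzier methods), PII removal is limited to email addresses matched by a simple regular expression, and the split is a seeded random shuffle. All function names are illustrative.

```python
import hashlib
import random
import re

# Simple email pattern; real PII removal needs far broader coverage
# (names, phone numbers, addresses, IDs) and usually a dedicated tool.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def fingerprint(example):
    """Hash the normalized roles and contents of one example."""
    joined = "\n".join(
        f"{m['role']}:{normalize(m['content'])}" for m in example["messages"]
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(examples):
    """Keep the first occurrence of each distinct fingerprint."""
    seen = set()
    unique = []
    for ex in examples:
        key = fingerprint(ex)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def split_dataset(examples, train=0.8, val=0.1, seed=0):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Deduplicating before splitting matters: otherwise copies of the same example can land in both the training and test sets and inflate evaluation scores. Checking that responses are correct and balancing task types still require human review or task-specific tooling.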
How Much Data?
- Style/format transfer: roughly 50-200 examples
- Domain adaptation: roughly 500-5,000 examples
- New capability: roughly 5,000-50,000+ examples
🌼 Daisy+ in Action: Built-In Training Data Pipeline
Every interaction with Daisy+ digital employees generates structured training data: user messages, agent responses, tool calls made, and outcomes. This data pipeline is designed to support future fine-tuning while respecting privacy: all data stays within the organization's own ERP instance, and no customer data ever leaves the system for training purposes.