This tool provides a simple and efficient way to validate JSONL-formatted training and validation datasets for Claude-3 Haiku Fine-Tuning. It supports both local files and data stored in Amazon S3.
- `data_validator.py`: The core Python script containing the validation logic
- `Claude-3 Haiku Fine-Tuning Training and Validation Data Validator.ipynb`: A Jupyter notebook interface for easy use of the validation tool
- Validates JSONL format for training and validation datasets
- Supports both local files and S3-stored data (a loading sketch follows this list)
- Checks file size limits (see the sketches after this list):
  - Training data: max 10 GB
  - Validation data: max 1 GB
- Validates line counts:
  - Training data: 32 to 10,000 lines
  - Validation data: 32 to 1,000 lines
  - Total (training + validation): max 10,000 lines
- Validates the data structure and content of each entry
- Estimates and checks the token count per entry: max 32,000 tokens
- Checks for Anthropic's reserved keywords in prompts:
  - Ensures "\nHuman:" and "\nAssistant:" do not appear in prompts
  - Note: variations without colons (e.g., "\nHuman" or "\nAssistant") are allowed
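To make the local-versus-S3 support concrete, here is a minimal sketch of reading a JSONL file from either a local path or an `s3://` URI with `boto3`. The `load_jsonl_lines` helper is illustrative only and is not necessarily how `data_validator.py` implements loading.

```python
import json
import boto3


def load_jsonl_lines(path: str) -> list[dict]:
    """Read a JSONL file from a local path or an s3://bucket/key URI (illustrative helper)."""
    if path.startswith("s3://"):
        # Split "s3://bucket/key" into bucket and key, then stream the object body.
        bucket, _, key = path[len("s3://"):].partition("/")
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        text = body.read().decode("utf-8")
    else:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    # Each non-empty line must be a standalone JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```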
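The file-size and line-count limits above can be expressed roughly as follows. This sketch assumes local files and binary gigabytes (GiB); the constant and function names are illustrative, not the actual identifiers used in `data_validator.py`.

```python
import os

# Limits taken from the feature list above; names are illustrative.
MAX_TRAIN_BYTES = 10 * 1024**3   # 10 GB training file size cap (GiB assumed here)
MAX_VAL_BYTES = 1 * 1024**3      # 1 GB validation file size cap
TRAIN_LINES = (32, 10_000)       # allowed training line-count range
VAL_LINES = (32, 1_000)          # allowed validation line-count range
MAX_TOTAL_LINES = 10_000         # training + validation combined


def check_size_and_lines(train_path: str, val_path: str) -> list[str]:
    """Return a list of human-readable violations; an empty list means all checks pass."""
    errors = []
    if os.path.getsize(train_path) > MAX_TRAIN_BYTES:
        errors.append("training file exceeds 10 GB")
    if os.path.getsize(val_path) > MAX_VAL_BYTES:
        errors.append("validation file exceeds 1 GB")

    # Count non-empty lines (one JSONL record per line).
    with open(train_path, encoding="utf-8") as f:
        n_train = sum(1 for line in f if line.strip())
    with open(val_path, encoding="utf-8") as f:
        n_val = sum(1 for line in f if line.strip())

    if not TRAIN_LINES[0] <= n_train <= TRAIN_LINES[1]:
        errors.append(f"training line count {n_train} is outside 32 to 10,000")
    if not VAL_LINES[0] <= n_val <= VAL_LINES[1]:
        errors.append(f"validation line count {n_val} is outside 32 to 1,000")
    if n_train + n_val > MAX_TOTAL_LINES:
        errors.append("combined line count exceeds 10,000")
    return errors
```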
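The per-entry token and reserved-keyword checks can be sketched in the same spirit. The 4-characters-per-token estimate and the `prompt`/`completion` field names are assumptions made for illustration; the actual script may use a real tokenizer and a different record layout.

```python
RESERVED_KEYWORDS = ("\nHuman:", "\nAssistant:")  # only the colon variants are disallowed
MAX_TOKENS_PER_ENTRY = 32_000


def check_entry_text(prompt: str, completion: str) -> list[str]:
    """Rough per-entry checks (illustrative); returns a list of violations."""
    errors = []
    # Crude token estimate: ~4 characters per token is a common rule of thumb,
    # used here as an assumption rather than the script's actual estimation method.
    estimated_tokens = (len(prompt) + len(completion)) // 4
    if estimated_tokens > MAX_TOKENS_PER_ENTRY:
        errors.append(f"estimated token count {estimated_tokens} exceeds 32,000")
    for keyword in RESERVED_KEYWORDS:
        if keyword in prompt:
            errors.append(f"reserved keyword {keyword!r} found in prompt")
    return errors
```

Note that, as in the feature list, only the colon-terminated forms are flagged; "\nHuman" or "\nAssistant" without a colon would pass this check.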