Before you create a fine-tuning job in the Amazon Bedrock console, utilize the provided script to validate your dataset first, which would allow you to identify formatting errors (if any) faster and save costs.
Install the latest version of python here if you haven't already.
Download the dataset_validation
folder, cd
into the root directory, then run the following command to install the necessary dependencies:
python3 -m venv .venv && source .venv/bin/activate && pip install jsonschema
Then, use the following command to validate your dataset:
python3 dataset_validation.py -d <dataset type> -f <file path> -m <model name>
-
Dataset type options
- train
- validation
-
Model name options
- llama3-1-8b
- llama3-1-70b
- llama3-2-1b
- llama3-2-3b
- llama3-2-11b
- llama3-2-90b
- Validates the
JSONL
format - Checks that the
train
dataset has$\leq$ 10k rows andvalidation
dataset has$\leq$ 1k rows- Each conversation should only take up 1 row
- For each row
- Validates conversation format for models using conversational input
- Checks if roles are supported
- Prevents assistant messages from containing images
- Validates prompt-completion format for models using prompt-completion input
- Validates conversation format for models using conversational input
- Images
- Size
$\leq$ 10 MB - Format must be one of
png
,jpeg
,gif
,webp
- Dimensions
$\leq$ 8192 x 8192 pixels
- Size
- Input token length of each dataset row
$\leq$ 16K (10K for Llama 3.2 90B model)