These steps walk you through downloading the prepared data from Hugging Face and uploading it to the data GCS bucket for use within the guide for the respective use case.
These prerequisites were developed to be run on the playground AI/ML platform. If you are using a different environment, the scripts and manifests will need to be modified for that environment.
Select a path between the Full dataset and the Smaller dataset (subset). The smaller dataset is a quicker way to experience the pipeline, but if it is used for fine-tuning, the resulting fine-tuned model will be less than ideal.
- Ensure that your `MLP_ENVIRONMENT_FILE` is configured

  ```sh
  cat ${MLP_ENVIRONMENT_FILE} && \
  source ${MLP_ENVIRONMENT_FILE}
  ```

  You should see the various variables populated with the information specific to your environment.
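  As a quick sanity check, you can echo one of the variables referenced later in these steps, for example the data bucket (this assumes your environment file defines `MLP_DATA_BUCKET`):

  ```sh
  # Print the data bucket name sourced from the environment file
  echo ${MLP_DATA_BUCKET}
  ```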
- If you would like to use the Smaller dataset (subset), set the variable below.

  ```sh
  DATASET_SUBSET=-subset
  ```
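  This value is appended to the Hugging Face dataset repository name in the download step below; a quick illustration, assuming the variable is either left unset or set as above:

  ```sh
  # With DATASET_SUBSET unset:    gcp-acp/flipkart-dataprep
  # With DATASET_SUBSET=-subset:  gcp-acp/flipkart-dataprep-subset
  echo "gcp-acp/flipkart-dataprep${DATASET_SUBSET}"
  ```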
- Install the Hugging Face CLI

  ```sh
  pip3 install -U "huggingface_hub[cli]==0.27.1"
  ```
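  To confirm the CLI is available, you can print its help output using the same path referenced in the next step:

  ```sh
  # Verify the CLI installed under ~/.local/bin responds
  ${HOME}/.local/bin/huggingface-cli --help
  ```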
- Download the prepared dataset from Hugging Face and copy it into the GCS bucket

  ```sh
  DATAPREP_REPO=gcp-acp/flipkart-dataprep${DATASET_SUBSET}

  ${HOME}/.local/bin/huggingface-cli download --repo-type dataset ${DATAPREP_REPO} --local-dir ./temp

  gcloud storage cp -R ./temp/* \
  gs://${MLP_DATA_BUCKET}/dataset/output && \
  rm -rf ./temp
  ```
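  To verify the copy succeeded, you can list the destination path in the bucket (an optional check, not part of the original steps):

  ```sh
  # List the uploaded dataset files in the data bucket
  gcloud storage ls gs://${MLP_DATA_BUCKET}/dataset/output/
  ```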
NOTE: Return to the respective use case instructions you were following.