Description
Hi,
I'm running the code from the final chapter of the Hands-On LLM book, specifically this snippet that loads the ultrachat_200k dataset and formats it using the TinyLlama chat template:
Code:
from transformers import AutoTokenizer
from datasets import load_dataset

# Load the TinyLlama tokenizer so its chat template can be reused
template_tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

def format_prompt(example):
    # Render the list of messages with TinyLlama's chat template,
    # returning the formatted string under a "text" column
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

# Load the SFT test split, shuffle it, and keep a 3,000-example subset
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42)
    .select(range(3_000))
)
dataset = dataset.map(format_prompt)
Error:
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.
Environment:
Python 3.11
datasets version: [please specify, e.g., 2.19.0]
Platform: Google Colab
I believe the error is caused by the datasets library failing to handle its cache in certain environments (e.g., the ephemeral filesystem in Colab or Docker).
Could you suggest a workaround, or update the code snippet to handle this?
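In case it helps with triage: my working guess (not confirmed) is that this is the known clash between older datasets releases and the newer fsspec that Colab preinstalls. A minimal sketch of the commonly suggested fix, assuming that diagnosis is right (the fsspec pin below is my assumption, not from the book):

!pip install -U datasets
# If upgrading is not an option, pinning fsspec to a pre-2023.10
# release is the other commonly suggested route:
# !pip install "fsspec<=2023.9.2"

(After upgrading, the Colab runtime has to be restarted so the new packages are actually imported.)

If the local cache itself is the problem, another hedged option is to stream the split so nothing is written to disk; on the resulting IterableDataset, shuffle works over a buffer and select becomes take:

from datasets import load_dataset

dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft", streaming=True)
    .shuffle(seed=42, buffer_size=10_000)  # buffered shuffle for streamed data
    .take(3_000)                           # streaming equivalent of select
)
dataset = dataset.map(format_prompt)       # format_prompt as defined above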
Thanks for the excellent book and guidance!