NotImplementedError when loading ultrachat_200k dataset from HuggingFace in final chapter code #77

@Devpatel1012

Description

Hi,

I'm running the code from the final chapter of the Hands-On LLM book, specifically this snippet that loads the ultrachat_200k dataset and formats it using the TinyLlama chat template:

Code:
from transformers import AutoTokenizer
from datasets import load_dataset

template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {"text": prompt}

dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42)
    .select(range(3_000))
)
dataset = dataset.map(format_prompt)

Error:
NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

Environment:
Python 3.11

datasets version: [please specify, e.g., 2.19.0]

Platform: Google Colab

I believe the error occurs because the datasets library fails to handle its cache in certain environments (e.g., the ephemeral filesystem in Colab or Docker containers).

Could you suggest a workaround or update the code snippet to handle this?
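For reference, one workaround I've seen suggested for this error elsewhere (assumption on my part: the error comes from an incompatibility between older datasets releases and newer fsspec versions, not from the snippet itself) is to upgrade both packages and then restart the runtime before re-running the code:

```shell
# Hypothetical fix, assuming a datasets/fsspec version mismatch:
# upgrade both packages, then restart the Colab runtime before
# re-running the snippet above.
pip install -U datasets fsspec

# Confirm which versions ended up installed
pip show datasets fsspec
```

I haven't confirmed this is the root cause here, so an authoritative fix or an updated snippet in the book's repo would still be appreciated.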

Thanks for the excellent book and guidance!
