A lightweight dataset builder for AI projects. MasterChat-Datasets-Builder downloads, filters, and converts large code datasets into optimized formats such as .parquet. It is designed for research, fine-tuning, and startups that want to scale AI models with efficient, clean data pipelines.

MasterAI-projects/MasterChat-Datasets-builder

MasterChat-Datasets-Builder

🚀 Features

  • 📥 Download datasets directly from Hugging Face.
  • 🔍 Automatic filtering (length check, comments, TODO removal, import detection).
  • 🧩 Tokenization with Hugging Face tokenizers.
  • 💾 Save data in .parquet, .jsonl, .arrow, or raw .py formats.
  • ⚡ Lightweight and fast: perfect for startups or researchers.
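
The filtering step is only described by name here, so the following is a minimal sketch of what length checks, TODO removal, and import detection could look like for Python samples. It is not the project's implementation: the thresholds `MIN_CHARS` and `MAX_CHARS` and the helper names `has_import`, `strip_todo_lines`, and `keep_sample` are assumptions made for illustration, and only the standard library is used.

```python
import ast
import re

# Hypothetical thresholds; the project's actual limits are not documented here.
MIN_CHARS = 50
MAX_CHARS = 20_000

# Matches lines that consist only of a comment mentioning TODO or FIXME.
TODO_LINE = re.compile(r"^\s*#.*\b(TODO|FIXME)\b", re.IGNORECASE)


def has_import(source: str) -> bool:
    """Return True if the source parses and contains at least one import."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False  # unparsable samples are treated as having no imports
    return any(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
    )


def strip_todo_lines(source: str) -> str:
    """Drop comment-only lines that mention TODO or FIXME."""
    kept = [line for line in source.splitlines() if not TODO_LINE.match(line)]
    return "\n".join(kept)


def keep_sample(source: str) -> bool:
    """Apply the length check, then require at least one import."""
    if not (MIN_CHARS <= len(source) <= MAX_CHARS):
        return False
    return has_import(source)
```

A sample would typically be cleaned with `strip_todo_lines` first and then kept or dropped with `keep_sample`. Only comment-only lines are removed, so code that carries a trailing TODO comment stays intact and still parses.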

📦 Installation

pip install datasets transformers pyarrow

Quickstart

from downloader import download_and_filter_dataset
from tokenised import create_tokenizer, tokenize_dataset
from convert import save_raw_py_files, save_tokenized_parquet

# Step 1: Download + Filter
dataset = download_and_filter_dataset()

# Step 2: Tokenize
tokenizer = create_tokenizer("gpt2")
tokenized_dataset = tokenize_dataset(dataset, tokenizer)

# Step 3: Save
save_raw_py_files(dataset)
save_tokenized_parquet(tokenized_dataset)
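
The Quickstart saves raw .py files and tokenized .parquet output, while the feature list also mentions .jsonl. As a rough sketch of that third format, the helper below writes one JSON object per line using only the standard library. The name `save_jsonl` and its signature are hypothetical and are not part of the `convert` module shown above.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def save_jsonl(records: Iterable[Mapping[str, object]], path: str) -> int:
    """Write one JSON object per line; return the number of records written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create output folder if missing
    count = 0
    with out.open("w", encoding="utf-8") as fh:
        for record in records:
            # ensure_ascii=False keeps non-ASCII source text readable in the file
            fh.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because records are written as they are iterated, the full dataset never has to be held in memory at once, which matters for large code corpora.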
