# MyGPT

A fully custom implementation of the GPT (Generative Pre-trained Transformer) architecture in PyTorch. This project includes everything from scratch-built tokenizers to advanced parameter-efficient fine-tuning, supporting both educational and research workflows for large language models.
- Overview
- Directory Structure
- Major Features
- How It Works
- Usage Examples
- Configuration
- Advanced Highlights
- Requirements
- License
- References
## Overview

MyGPT is designed for deep learning practitioners who want to understand, customize, or extend transformer-based language models. It includes:
- Modular tokenization (including BPE)
- Full transformer stack construction
- Robust training & evaluation pipelines
- Native integration with OpenAI GPT-2 pretrained weights
- LoRA-based fine-tuning for efficient adaptation
## Directory Structure

```
MyGPT/
├── MyGPT.ipynb                                    # Main implementation notebook (model logic & experiments)
├── gpt_download3.py                               # Helper for GPT-2 weights download & parsing
├── 01 Harry Potter and the Sorcerers Stone.txt    # Sample dataset for training/testing
└── .gitignore                                     # Standard ignore file
```

## Major Features

- Tokenizer Construction: word-level & BPE (via tiktoken)
- Custom Dataset Loader: sliding windows, overlapping sequences, padding
- Embeddings: token + positional embeddings (GPT-2 spec)
- Transformer Blocks: multi-head attention, pre-LN, residuals, GELU activations
- Text Generation: temperature, top-k, top-p, greedy, EOS-aware sampling
- Pretrained Weights: load GPT-2 Small/Medium/Large/XL
- Fine-Tuning (LoRA): efficient parameter adaptation via PEFT
- Training Pipeline: AdamW optimizer, live sampling, evaluation support
## How It Works

### Tokenization & Data Pipeline

- Simple Tokenizer: regex splitting, vocabulary mapping, encode/decode
- Advanced BPE: GPT-compatible subword tokenization using tiktoken
- Custom PyTorch Dataset: generates input/target pairs with a sliding window
- Flexible DataLoader Factory: configurable batch size, stride, and workers (sketched below)
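The loader itself lives in MyGPT.ipynb; the sketch below illustrates the sliding-window idea, assuming it mirrors the `create_dataloader` call shown in the usage examples. The class name `GPTDataset` and the `stride`/`num_workers` defaults here are illustrative, not the notebook's exact code.

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    """Sliding-window dataset: each sample is (input_ids, target_ids), where the
    target is the input window shifted one token to the right."""
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            chunk = token_ids[i : i + max_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

def create_dataloader(text, batch_size=8, max_length=1024, stride=512,
                      shuffle=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")   # GPT-2 BPE via tiktoken
    dataset = GPTDataset(text, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      drop_last=True, num_workers=num_workers)
```

Each window of `max_length` tokens is paired with the same window shifted by one position, so the model is always trained to predict the next token; a `stride` smaller than `max_length` produces overlapping sequences.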
### Model Architecture

- Embeddings: token + positional
- Transformer Blocks (see the sketch after this list):
  - LayerNorm (pre-norm)
  - Multi-Head Attention (causal)
  - FeedForward (Linear → GELU → Linear)
  - Residuals & dropout
- Output Layers: LayerNorm + Linear → vocabulary logits
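For orientation, here is a condensed sketch of the pre-LN block described above. The notebook builds its own multi-head attention module; this sketch substitutes PyTorch's `nn.MultiheadAttention` with an explicit causal mask, so treat it as an illustration of the layout rather than the notebook's implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm GPT block: x + MHA(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, emb_dim, n_heads, drop_rate, qkv_bias=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, dropout=drop_rate,
                                          bias=qkv_bias, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ffn = nn.Sequential(                       # Linear -> GELU -> Linear
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )
        self.drop = nn.Dropout(drop_rate)

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True positions may not be attended to.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + self.drop(attn_out)                     # residual connection 1
        x = x + self.drop(self.ffn(self.ln2(x)))        # residual connection 2
        return x
```

Normalizing before attention and the feed-forward network (pre-norm) keeps activations well-scaled through deep stacks, which is the stability benefit noted under Advanced Highlights.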
### Training

- Optimizer: AdamW
- Loss: CrossEntropyLoss
- Real-time sampling during training (a minimal loop is sketched below)
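`train_model_simple` (imported in the usage examples) wraps a loop roughly like the one below; evaluation and the live-sampling hook are omitted here, and the learning rate shown is illustrative rather than the notebook's value.

```python
import torch

def train_epoch(model, loader, optimizer, device):
    """One pass over the data: next-token cross-entropy on shifted targets."""
    model.train()
    total_loss = 0.0
    for input_ids, target_ids in loader:
        input_ids, target_ids = input_ids.to(device), target_ids.to(device)
        logits = model(input_ids)                       # (batch, seq_len, vocab_size)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), target_ids.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)  # illustrative hyperparameters
```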
### Pretrained Weight Loading

- GPT-2 weights downloader and parser (gpt_download3.py)
- Automatic layer mapping and shape alignment (see the helper sketched below)
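The mapping step boils down to shape-checked assignments from the downloaded parameter arrays into the model's modules. A minimal helper in that spirit looks like the following; the attribute and dictionary key names are illustrative, not the notebook's exact ones.

```python
import numpy as np
import torch

def assign(param: torch.nn.Parameter, array: np.ndarray) -> torch.nn.Parameter:
    """Wrap a pretrained array as a Parameter, refusing mismatched shapes
    (catches transposed projection matrices and config mismatches early)."""
    if tuple(param.shape) != tuple(array.shape):
        raise ValueError(f"Shape mismatch: {tuple(param.shape)} vs {tuple(array.shape)}")
    return torch.nn.Parameter(torch.tensor(array, dtype=torch.float32))

# Illustrative use: copy token embeddings from the downloaded params dict.
# model.tok_emb.weight = assign(model.tok_emb.weight, params["wte"])
```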
### LoRA Fine-Tuning

- PEFT integration for low-rank adaptation
- Compatible with Stanford Alpaca and similar datasets
## Usage Examples

### Install dependencies

```bash
pip install torch torchvision torchaudio tiktoken transformers peft numpy tqdm requests
```

### Train from scratch

```python
import torch
from MyGPT import GPTModel, create_dataloader, train_model_simple

raw_text = open("01 Harry Potter and the Sorcerers Stone.txt").read()
train_loader = create_dataloader(raw_text, batch_size=8, max_length=1024)

model = GPTModel(config=GPT_CONFIG)                          # GPT_CONFIG: see Configuration below
device = "cuda" if torch.cuda.is_available() else "cpu"
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)   # AdamW optimizer; learning rate is illustrative
train_model_simple(model, train_loader, None, optimizer, device, num_epochs=10)
```

### Generate text

```python
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")   # GPT-2 BPE tokenizer

context = "The wizard entered the room"
ids = tokenizer.encode(context)
out = generate(model, torch.tensor([ids]), max_new_tokens=50, temperature=0.8)
print(tokenizer.decode(out[0].tolist()))
```

### Load pretrained GPT-2 weights

```python
from gpt_download3 import download_and_load_gpt2

settings, params = download_and_load_gpt2("124M", "gpt2")
load_weights_into_gpt(model, params)
```

### Fine-tune with LoRA

```python
from peft import LoraConfig, get_peft_model, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "c_fc"]
)
model = get_peft_model(model, peft_config)
```

## Configuration

```python
# GPT-2 Small (124M) configuration
GPT_CONFIG = {
    "vocab_size": 50257,      # GPT-2 BPE vocabulary size
    "context_length": 1024,   # maximum sequence length
    "emb_dim": 768,           # embedding / hidden dimension
    "n_heads": 12,            # attention heads per block
    "n_layers": 12,           # transformer blocks
    "drop_rate": 0.1,         # dropout rate
    "qkv_bias": False         # bias terms in the QKV projections
}

# Reduced configuration for quick experiments and training from scratch
GPT_CONFIG_SMALL = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 384,
    "n_heads": 6,
    "n_layers": 6,
    "drop_rate": 0.1,
    "qkv_bias": False
}
```
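As a quick sanity check, the size implied by a config can be estimated from these fields alone. The sketch below assumes GPT-2-style weight tying between the token embedding and the output head and ignores biases and LayerNorm parameters, so it is an approximation rather than the notebook's exact count.

```python
def approx_param_count(cfg: dict) -> int:
    """Back-of-the-envelope GPT parameter count from a config dict."""
    d, n_layers = cfg["emb_dim"], cfg["n_layers"]
    embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d   # token + positional embeddings
    attention = 4 * d * d             # Q, K, V and output projections per block
    feed_forward = 2 * d * (4 * d)    # d -> 4d -> d per block
    return embeddings + n_layers * (attention + feed_forward)

print(f"{approx_param_count(GPT_CONFIG) / 1e6:.0f}M")         # ~124M (GPT-2 Small)
print(f"{approx_param_count(GPT_CONFIG_SMALL) / 1e6:.0f}M")   # ~30M for the reduced config
```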
## Advanced Highlights

- Causal Masking: enforces the autoregressive constraint
- Learned Positional Encoding: absolute position vectors up to 1024 tokens
- Pre-LayerNorm Architecture: improves stability and training convergence
- Dropout & Regularization: applied to embeddings and internal blocks
- Flexible Sampling: top-k, top-p, temperature, and greedy modes (see the sketch after this list)
- Dynamic Collate Functions: handle padding & variable-length sequences
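These sampling modes come together in a single generation loop. The sketch below shows temperature scaling, top-k filtering, greedy decoding (temperature 0), and EOS-aware stopping; the argument names follow the `generate` call in the usage examples, but the notebook's signature may differ, and top-p filtering is omitted for brevity.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, context_length=1024,
             temperature=1.0, top_k=None, eos_id=None):
    """Autoregressive sampling: greedy when temperature == 0, otherwise
    temperature-scaled multinomial sampling with optional top-k filtering."""
    for _ in range(max_new_tokens):
        logits = model(idx[:, -context_length:])[:, -1, :]    # logits for the last position
        if top_k is not None:
            kth_value = torch.topk(logits, top_k).values[:, -1, None]
            logits = torch.where(logits < kth_value,
                                 torch.full_like(logits, float("-inf")), logits)
        if temperature > 0:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        else:
            next_id = torch.argmax(logits, dim=-1, keepdim=True)
        if eos_id is not None and (next_id == eos_id).all():
            break                                              # stop once EOS is produced
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

A call like `generate(model, torch.tensor([ids]), max_new_tokens=50, temperature=0.8)`, as in the usage examples, is compatible with this signature.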
## Requirements

- Python ≥ 3.7
- PyTorch ≥ 1.12
- tiktoken
- transformers
- peft
- numpy
- tqdm
- requests
## License

MIT License
## References

- Attention Is All You Need (Transformer)
- Stanford Alpaca Dataset
- LoRA: Low-Rank Adaptation of Large Language Models
- OpenAI GPT-2
