Releases: mosh3eb/TrainKeeper
v0.3.0: High-Performance Distributed Training & Global Observability
🚀 TrainKeeper v0.3.0: The Distributed & Performance Era
We are thrilled to announce the release of TrainKeeper v0.3.0. This milestone transforms TrainKeeper from a simple utility into a production-grade orchestration layer for deep learning experiments.
Designed for researchers and ML engineers who demand both speed and safety, v0.3.0 introduces zero-boilerplate distributed scaling and deep system observability.
🌟 Major Highlights
⚡️ Zero-Boilerplate Scaling (FSDP & DDP)
Scaling from one GPU to a thousand should not require a code rewrite. With our new FSDP (Fully Sharded Data Parallel) integration, it doesn't.
- One-Line Sharding: `wrap_model_fsdp` handles parameter sharding and mixed precision automatically.
- Smart Defaults: Pre-configured for backward prefetching and sensible auto-wrap policies.
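Size-based auto-wrap policies like the one mentioned above typically shard a submodule only when its parameter count crosses a threshold, since sharding tiny layers costs more in communication than it saves in memory. A minimal, framework-free sketch of that decision logic (the function name, threshold, and module names below are illustrative, not TrainKeeper's API):

```python
def should_wrap(param_count: int, min_params: int = 100_000_000) -> bool:
    """Size-based auto-wrap rule: shard a submodule only if it is
    large enough that sharding pays for the extra communication."""
    return param_count >= min_params

# Example: decide which submodules of a model get their own FSDP unit.
submodule_params = {
    "embeddings": 385_000_000,  # large  -> wrap (shard)
    "layer_norm": 2_048,        # tiny   -> keep replicated
    "mlp_block": 120_000_000,   # large  -> wrap (shard)
}
wrapped = {name for name, n in submodule_params.items() if should_wrap(n)}
print(sorted(wrapped))  # → ['embeddings', 'mlp_block']
```

PyTorch's own `size_based_auto_wrap_policy` follows the same idea; a "sensible default" threshold is a trade-off between peak memory and all-gather overhead.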
🎨 Next-Gen Interactive Dashboard
Experience your experiments in high fidelity.
- Live Telemetry: Monitor GPU thermal and memory state in real-time.
- Drift Analytics: Visual schema comparison and data quality scoring built-in.
- Launch Command: Just run `tk dashboard` and see your training in a whole new light.
🔍 Scientific GPU Profiling
Stop guessing why you're getting OutOfMemory errors.
- Recommendation Engine: Get actionable advice like "Fragmentation detected (35%). Try periodic cache clearing."
- Leak Detection: Automatically identifies tensors that aren't being freed between steps.
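A fragmentation figure like the "35%" in the recommendation above is commonly computed as one minus the largest contiguous free block divided by total free memory: memory that is free but cannot serve a single large allocation. A small sketch of that metric (function name and block sizes are illustrative, not TrainKeeper's internals):

```python
def fragmentation_ratio(free_blocks: list[int]) -> float:
    """Fraction of free memory unusable for one large allocation:
    1 - (largest free block / total free). 0.0 means fully contiguous."""
    total = sum(free_blocks)
    if total == 0:
        return 0.0
    return 1.0 - max(free_blocks) / total

# Free-block sizes in MB: the largest contiguous run is 650 of 1000 MB free.
ratio = fragmentation_ratio([650, 200, 100, 50])
print(f"Fragmentation detected ({ratio:.0%}).")  # → Fragmentation detected (35%).
```

When this ratio climbs, periodically releasing cached allocations (e.g. `torch.cuda.empty_cache()` in PyTorch) lets the driver coalesce free blocks, which is the kind of advice the recommendation engine surfaces.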
🛡️ Production-Grade Checkpointing
Never lose a state, never waste a byte.
- Metric-Aware Retention: Keep only the top $N$ best models and the last $M$ states automatically.
- Distributed Safety: Multi-process locking ensures zero corruption on shared storage.
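A metric-aware retention rule of this shape usually keeps the union of the top-$N$ checkpoints by validation metric and the last $M$ by step, and marks everything else deletable. A self-contained sketch under that assumption (the function name and checkpoint records are illustrative, not TrainKeeper's API):

```python
def select_retained(checkpoints, top_n=2, last_m=2, higher_is_better=True):
    """Return the set of checkpoint steps to keep: the union of the
    top-N by metric and the last-M by step; all others may be pruned."""
    by_metric = sorted(checkpoints, key=lambda c: c["metric"],
                       reverse=higher_is_better)
    by_step = sorted(checkpoints, key=lambda c: c["step"])
    keep = {c["step"] for c in by_metric[:top_n]}
    keep |= {c["step"] for c in by_step[-last_m:]}
    return keep

ckpts = [
    {"step": 100, "metric": 0.71},
    {"step": 200, "metric": 0.85},  # best metric
    {"step": 300, "metric": 0.78},
    {"step": 400, "metric": 0.80},  # 2nd-best metric, also recent
    {"step": 500, "metric": 0.74},  # most recent
]
print(sorted(select_retained(ckpts)))  # → [200, 400, 500]
```

Taking the union (rather than intersection) guarantees both resumability from a recent state and access to the best-scoring model, while bounding disk usage at $N + M$ checkpoints.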
🛠️ Performance & Stability
- v0.3.0 Core: The `experiment`, `distributed`, and `gpu_profiler` modules are officially production-ready.
- Verified Integrations: Integration-tested against HuggingFace Transformers, TIMM, and native PyTorch.
- Full Test Suite: 57 comprehensive tests ensuring 100% logic coverage of core primitives.
📦 Getting Started
pip install trainkeeper==0.3.0
Full Changelog: https://github.com/mosh3eb/TrainKeeper/commits/0.2.3