Releases: mosh3eb/TrainKeeper

v0.3.0: High-Performance Distributed Training & Global Observability

18 Feb 22:58

🚀 TrainKeeper v0.3.0: The Distributed & Performance Era

We are thrilled to announce the release of TrainKeeper v0.3.0. This milestone transforms TrainKeeper from a simple utility into a production-grade orchestration layer for deep learning experiments.

Designed for researchers and ML engineers who demand both speed and safety, v0.3.0 introduces zero-boilerplate distributed scaling and deep system observability.


🌟 Major Highlights

⚡️ Zero-Boilerplate Scaling (FSDP & DDP)

Scaling from one GPU to a thousand should not require a code rewrite. With our new FSDP (Fully Sharded Data Parallel) integration, it doesn't.

  • One-Line Sharding: wrap_model_fsdp handles parameter sharding and mixed-precision automatically.
  • Smart Defaults: Pre-configured for backward prefetching and sensible auto-wrap policies.
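TrainKeeper's actual wrapping code isn't shown in these notes, but the "sensible auto-wrap policies" it pre-configures typically boil down to a size threshold: modules with enough parameters become their own FSDP shard. A minimal pure-Python sketch of that idea (all names here are illustrative, not TrainKeeper's API):

```python
# Hypothetical sketch of a size-based auto-wrap policy, in the spirit of
# the pre-configured defaults described above. Not TrainKeeper's real API.

def size_based_autowrap_policy(param_count: int, min_params: int = 1_000_000) -> bool:
    """Decide whether a module is large enough to become its own FSDP shard."""
    return param_count >= min_params

# A tiny stand-in "model" as (layer_name, parameter_count) pairs:
layers = [
    ("embedding", 30_000_000),
    ("layer_norm", 2_048),
    ("transformer_block", 12_000_000),
]

# Only the large layers get wrapped; small ones are sharded with their parent.
wrapped = [name for name, n in layers if size_based_autowrap_policy(n)]
```

A threshold like this keeps communication overhead low: sharding tiny modules (e.g. layer norms) individually would cost more in all-gather latency than it saves in memory.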

🎨 Next-Gen Interactive Dashboard

Experience your experiments in high-fidelity.

  • Live Telemetry: Monitor GPU temperature and memory usage in real time.
  • Drift Analytics: Visual schema comparison and data quality scoring built-in.
  • Launch Command: Just run tk dashboard and see your training in a whole new light.
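The drift analytics above rest on a simple primitive: diffing two dataset schemas. A minimal sketch of what such a comparison could compute, assuming schemas are plain column-name-to-dtype mappings (the function name and shape are hypothetical, not TrainKeeper's API):

```python
# Hypothetical schema diff, sketching the kind of comparison a drift
# dashboard panel might render. Schemas are {column_name: dtype} dicts.

def compare_schemas(old: dict, new: dict) -> dict:
    """Report columns added, removed, or retyped between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

report = compare_schemas(
    {"user_id": "int64", "score": "float32"},
    {"user_id": "int64", "score": "str", "region": "category"},
)
```

A data quality score could then be derived from the size of each bucket relative to the total column count.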

🔍 Scientific GPU Profiling

Stop guessing why you're getting OutOfMemory errors.

  • Recommendation Engine: Get actionable advice like "Fragmentation detected (35%). Try periodic cache clearing."
  • Leak Detection: Automatically identifies tensors that aren't being freed between steps.
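The fragmentation recommendation quoted above can be understood as a ratio between the memory an allocator has reserved and the memory live tensors actually occupy. A minimal sketch of that arithmetic, assuming byte counts are supplied by the caller (the function names and the 30% threshold are illustrative assumptions, not TrainKeeper internals):

```python
# Hypothetical fragmentation check: reserved memory that no live tensor
# is backing is "fragmented" and can often be reclaimed by clearing caches.

def fragmentation_ratio(reserved_bytes: int, allocated_bytes: int) -> float:
    """Fraction of the reserved pool not backing live tensors."""
    if reserved_bytes == 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes

def recommend(reserved_bytes: int, allocated_bytes: int, threshold: float = 0.30):
    """Return advice like the engine's sample message when fragmentation is high."""
    frag = fragmentation_ratio(reserved_bytes, allocated_bytes)
    if frag >= threshold:
        return f"Fragmentation detected ({frag:.0%}). Try periodic cache clearing."
    return None
```

For example, 6.5 GB allocated out of 10 GB reserved gives a 35% fragmentation ratio, matching the sample recommendation in the notes.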

🛡️ Production-Grade Checkpointing

Never lose a state, never waste a byte.

  • Metric-Aware Retention: Keep only the top $N$ best models and last $M$ states automatically.
  • Distributed Safety: Multi-process locking ensures zero corruption on shared storage.
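Metric-aware retention as described above amounts to keeping the union of two sets: the N checkpoints with the best metric and the M most recent. A pure-Python sketch of that selection logic, assuming checkpoints are (step, metric) pairs with higher metrics better (names and signature are hypothetical):

```python
# Hypothetical retention policy: keep the top-N best checkpoints by metric
# plus the last-M most recent, deleting everything else.

def select_keep(checkpoints, top_n=3, last_m=2):
    """checkpoints: list of (step, metric) pairs; returns steps to retain."""
    best = sorted(checkpoints, key=lambda c: c[1], reverse=True)[:top_n]
    recent = sorted(checkpoints, key=lambda c: c[0])[-last_m:]
    keep = {step for step, _ in best} | {step for step, _ in recent}
    return sorted(keep)

history = [(100, 0.80), (200, 0.85), (300, 0.75), (400, 0.90), (500, 0.70)]
retained = select_keep(history, top_n=2, last_m=2)
```

Taking the union (rather than two separate quotas) means a checkpoint that is both recent and high-scoring only counts once, so disk usage stays bounded by at most N + M files.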

🛠️ Performance & Stability

  • v0.3.0 Core: experiment, distributed, and gpu_profiler modules are officially production-ready.
  • Verified Integrations: Integration tests pass against HuggingFace Transformers, TIMM, and native PyTorch.
  • Full Test Suite: 57 comprehensive tests ensuring 100% logic coverage of core primitives.

📦 Getting Started

pip install trainkeeper==0.3.0

Previous release: 0.2.3 (29 Jan 02:56, commit 0a54d23)