Aurora is a unified training-serving system for online speculative decoding. It closes the loop between speculator training and serving by continuously learning a draft model directly from live inference traces — treating online speculator learning as an asynchronous reinforcement-learning problem. Aurora is built on top of TorchSpec.
Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax-M2.1 and Qwen3-Coder-Next), and adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3).
| Mode | Description |
|---|---|
| Online | Training and inference co-located via Ray controller. Draft model updated continuously from live serving traces with hot-swapped weight sync. |
| External with draft | Standalone SGLang server with EAGLE3 speculative decoding. Training improves the draft and syncs weights back periodically. |
| External without draft | Standalone SGLang server runs target-only inference. Draft model trained from scratch — no pre-existing data or speculator required. |
```shell
./tools/build_conda.sh
micromamba activate aurora
```

To install into your current environment instead:

```shell
./tools/build_conda.sh current
```

Optional Flash Attention extras:

```shell
pip install -e ".[fa]"
```

```shell
# Start training + external SGLang server (Qwen3-4B, day-0 from scratch)
bash examples/qwen3-4b-external-no-draft/run.sh

# In another terminal, send requests to generate training samples
bash examples/qwen3-4b-external-no-draft/send_requests.sh
```

See examples/README.md for the full example catalog, per-model training curves, GPU layout, and config overrides.
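If you want to hand-roll requests instead of using `send_requests.sh`, the external SGLang server exposes an OpenAI-compatible chat endpoint. The sketch below builds a request body of the kind the script might send; the port and model name are assumptions, so check the example scripts for the real values.

```shell
# Assumed defaults; run.sh sets the actual port and served model.
PORT=30000
BODY='{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "max_tokens": 64}'
echo "$BODY"
# With the server from run.sh up, send it with:
# curl -s "http://localhost:$PORT/v1/chat/completions" \
#      -H 'Content-Type: application/json' -d "$BODY"
```

Every request served this way also becomes a training sample for the draft model, which is how day-0 adaptation bootstraps without pre-existing data.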
- The example `run.sh` scripts are single-node oriented — they manage their own local Ray cluster. For multi-node or Kubernetes deployments, start Ray manually and invoke `python3 -m aurora.train_entry` directly. See docs/ray.md.
- External with-draft mode requires a shared filesystem between training and the SGLang server for draft weight sync.
- `online_serving.hidden_states_dtype` must match the serving model's dtype (e.g., set `float16` when serving an FP8 model).
- Training and inference GPU sets (`CUDA_VISIBLE_DEVICES` vs `SGLANG_GPUS`) must not overlap.
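The last constraint is easy to get wrong when editing example scripts. A minimal preflight sketch, assuming the `CUDA_VISIBLE_DEVICES` and `SGLANG_GPUS` variable names from the examples (the check itself is generic bash):

```shell
# Fail fast if the training and serving GPU sets share any device ID.
CUDA_VISIBLE_DEVICES="0,1,2,3"   # training GPUs (example values)
SGLANG_GPUS="4,5,6,7"            # serving GPUs (example values)
overlap=$(comm -12 \
  <(tr ',' '\n' <<<"$CUDA_VISIBLE_DEVICES" | sort) \
  <(tr ',' '\n' <<<"$SGLANG_GPUS" | sort))
if [ -n "$overlap" ]; then
  echo "ERROR: GPU sets overlap on: $overlap" >&2
  exit 1
fi
echo "GPU sets are disjoint"
```

Running such a check at the top of a custom launch script catches misconfigured GPU splits before either process allocates memory.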
Convert an Aurora checkpoint to Hugging Face format:

```shell
python tools/convert_to_hf.py --input-dir ./outputs/my_experiment/iter_0010000/
```

Vocabulary pruning can be applied either during training (`draft_vocab_size` in config) or at conversion time:

```shell
python tools/convert_to_hf.py \
  --input-dir ./outputs/my_experiment/iter_0010000/ \
  --prune-vocab \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --draft-vocab-size 32000 \
  --tokenizer Qwen/Qwen3-8B \
  --chat-template qwen \
  --prompt-key conversations
```

W&B logging is disabled by default (`report_to: none`). To enable it, set `report_to: wandb` in your config and supply your API key.
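A sketch of that change, using a stand-in config file (the config path is hypothetical; `WANDB_API_KEY` is the standard variable the wandb client reads):

```shell
# Stand-in for your real Aurora config; edit yours in place the same way.
CONFIG="demo_config.yaml"
printf 'report_to: none\n' > "$CONFIG"
sed -i 's/^report_to: none$/report_to: wandb/' "$CONFIG"
export WANDB_API_KEY="your-key-here"   # supply your actual key
cat "$CONFIG"                          # -> report_to: wandb
```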
| Issue | Reference |
|---|---|
| Stuck or failing distributed runs, Ray actor errors | docs/debugging_ray_jobs.md |
| Ray cluster setup, actor hierarchy, placement groups | docs/ray.md |
| Pipeline bottlenecks, slow steps, throughput analysis | docs/performance_metrics.md |
Enable verbose logging:
```shell
AURORA_LOG_LEVEL=DEBUG bash examples/qwen3-4b-external-with-draft/run.sh
```

```bibtex
@article{wang2026aurora,
  title={When RL Meets Adaptive Speculative Training: A Unified Training--Serving System},
  author={Wang, Junxiong and Bie, Fengxiang and Li, Jisen and Zhou, Zhongzhu and Shao, Zelei and Wang, Yubo and Liu, Yinghui and Wu, Qingyang and May, Avner and Yanamandra, Sri and Zhang, Yineng and Zhang, Ce and Dao, Tri and Liang, Percy and Athiwaratkun, Ben and Song, Shuaiwen Leon and Xu, Chenfeng and Wu, Xiaoxia},
  journal={arXiv preprint arXiv:2602.06932},
  year={2026}
}
```