
v0.5.4

@github-actions github-actions released this 05 Aug 22:38
· 1222 commits to main since this release
4db5176

Highlights

Model Support

  • Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
  • Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), and MiniCPM-V (#4087, #7122)
  • Added H2O Danube3-4b (#6451)
  • Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)

Hardware Support

  • TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
  • Intel CPU: enable multiprocessing and tensor parallelism (#6125)

Performance

We are continuing our push to improve performance quickly. Each of the following PRs contributed improvements, and we anticipate more in the next release.

  • Separated the OpenAI server's HTTP request handling from the model inference loop using ZeroMQ. This brought a 20% speedup in time to first token and a 2x speedup in inter-token latency. (#6883)
  • Used Python's native array data structure to speed up padding. This brings a 15% throughput improvement in large-batch-size scenarios. (#6779)
  • Reduced unnecessary compute when logprobs=None. This reduced the latency of getting log probs from ~30ms to ~5ms in large-batch-size scenarios. (#6532)
  • Optimized the get_seqs function, bringing a 2% throughput improvement. (#7051)
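
To illustrate the padding item above, here is a minimal sketch of padding token-id sequences with Python's standard-library `array` module. It is not the code from #6779; the function name `pad_token_ids`, its parameters, and the choice of typecode `"l"` are assumptions made for illustration only.

```python
from array import array


def pad_token_ids(seqs, max_len, pad_id=0):
    """Right-pad each sequence of token ids to max_len.

    Hypothetical helper, not vLLM's implementation. It stores ids in
    array('l'), a compact typed buffer, instead of a list of int objects.
    """
    padded = []
    for seq in seqs:
        if len(seq) > max_len:
            raise ValueError("sequence longer than max_len")
        buf = array("l", seq)
        # Repeating a one-element array builds the padding in a single step,
        # and extend() appends it without a per-element Python loop.
        buf.extend(array("l", [pad_id]) * (max_len - len(seq)))
        padded.append(buf)
    return padded


batch = pad_token_ids([[1, 2, 3], [4]], max_len=4)
print([list(row) for row in batch])
```

A typed array keeps the ids in a contiguous buffer, so the padding step avoids creating and appending individual Python objects. Whether that yields a measurable gain depends on batch size and on how the buffer is consumed downstream.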

Production Features

  • Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964)
  • Refactor the punica kernel based on Triton (#5036)
  • Support for guided decoding for offline LLM (#6878)

Quantization

  • Support W4A8 quantization for vLLM (#5218)
  • Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
  • Support reading bitsandbytes pre-quantized model (#5753)

What's Changed

New Contributors

Full Changelog: v0.5.3...v0.5.4