📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc.
Light-field imaging application for plenoptic cameras
📚FFPA (Split-D): extends FlashAttention with Split-D for large head dims, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA.
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Light field geometry estimator for plenoptic cameras
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference; a sketch of the computation follows.
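For intuition, here is a minimal PyTorch sketch of what decode-stage attention computes: a single new query token attends over the cached keys and values, where `num_kv_heads` covers MHA (equal to the query-head count), MQA (one), and GQA (in between). The function name and shapes are illustrative assumptions, not the repo's API; the actual project fuses this computation into CUDA kernels.

```python
# Illustrative sketch only; names and shapes are assumptions, not the repo's API.
import torch
import torch.nn.functional as F

def decode_attention(q, k_cache, v_cache):
    """One decoding step: a single query token attends over the KV cache.

    q:        [batch, num_q_heads, head_dim]            (the new token's query)
    k_cache:  [batch, num_kv_heads, seq_len, head_dim]
    v_cache:  [batch, num_kv_heads, seq_len, head_dim]

    num_kv_heads == num_q_heads recovers MHA, num_kv_heads == 1 recovers MQA,
    and anything in between is GQA.
    """
    b, hq, d = q.shape
    hkv = k_cache.shape[1]
    group = hq // hkv
    # Broadcast each KV head across its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)      # [b, hq, s, d]
    v = v_cache.repeat_interleave(group, dim=1)      # [b, hq, s, d]
    scores = torch.einsum("bhd,bhsd->bhs", q, k) / d**0.5
    probs = F.softmax(scores, dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v)   # [b, hq, d]
```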
xKV: Cross-Layer SVD for KV-Cache Compression
🍺 CLI for quickly generating citations for websites and books
Predicting the energy efficiency of buildings with machine learning algorithms.
A Google Apps Script whose sole purpose is to make MLA-format writing easier.
An application that helps you create and manage citations for a research paper or other project. Named after Lt. Hawkeye, personal adjutant and bodyguard to Col. Mustang in the Fullmetal Alchemist manga and anime series.
An efficient and scalable attention module designed to reduce memory usage and improve inference speed in large language models: a Multi-Head Latent Attention (MLA) module implemented as a drop-in replacement for traditional multi-head attention (MHA). A sketch of the idea follows.
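As a rough illustration of the MLA idea described above, here is a minimal PyTorch sketch: keys and values are reconstructed from a shared low-rank latent, so only the small per-token latent needs to be cached instead of full per-head K/V. All names (`kv_down`, `k_up`, `d_latent`) are assumptions for illustration, and the sketch omits RoPE and the decoupled positional path used in DeepSeek-V2-style MLA.

```python
# Minimal MLA sketch under simplifying assumptions; not the project's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Keys/values are expanded from a shared low-rank latent, so the KV cache
    only needs d_latent values per token rather than full per-head K and V."""

    def __init__(self, d_model, num_heads, d_latent):
        super().__init__()
        self.h = num_heads
        self.dh = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress to latent (cached)
        self.k_up = nn.Linear(d_latent, d_model)      # expand latent -> keys
        self.v_up = nn.Linear(d_latent, d_model)      # expand latent -> values
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.h, self.dh).transpose(1, 2)
        c_kv = self.kv_down(x)                        # [b, s, d_latent] -- cache this
        k = self.k_up(c_kv).view(b, s, self.h, self.dh).transpose(1, 2)
        v = self.v_up(c_kv).view(b, s, self.h, self.dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v) # [b, h, s, dh]
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))
```

At inference time only `c_kv` is stored per token, which is where the memory saving over MHA comes from; the up-projections can also be folded into the query and output projections to avoid materializing full K/V.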
An APA citation helper website (without ads!)