[NeurIPS 2025] MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees
We use `uv` to manage dependencies. You can install it by following the UV Installation Guide.
Setting up the project is straightforward:
```bash
git clone https://github.com/laminair/mess-plus.git
cd mess-plus
uv sync
```
And you are ready to go.
This project uses Weights & Biases for logging. You may have to configure your machine to use W&B before launching any experiment.
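If W&B is not configured yet, logging in once is usually sufficient. A minimal sketch (the CLI command `wandb login` works just as well):

```python
import wandb

# Prompts for your W&B API key on first use and caches it locally.
wandb.login()
```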
Note that we have used `enroot` and the `nvcr.io/nvidia/pytorch:24.09-py3` container for our experiments to improve reproducibility. The container is available from the Nvidia NGC Catalog.
We have divided the setup for our paper experiments into two major parts:
- Data capture (since inference at scale is time-consuming and expensive)
- Evaluation of MESS+ (for fast iteration and prototyping)
Important note: You can also run MESS+ directly when doing inference. This is useful when you want to explore our code base. Keep this in mind for the configuration steps below.
You can run MESS+ with any model zoo you want. To do so, pick a config YAML file from any of the sub-folders under `config/` and modify it.
Here are some important notes (a config sanity-check sketch follows this list):

- You must order the models by increasing operating cost (for local inference, this is likely determined by the parameter count).
- You must specify the GPU on which each model should run (note: models are deployed in parallel!).
- You can specify a quantization mechanism for faster inference (if the model config supports it).
- The classifier model is set to ModernBERT (HuggingFace), but you may choose any model you like. This can be adjusted directly in the config file.
- There are two variables where you can set alpha: `alpha` and `alpha_values`. `alpha` is used in the online inference script and `alpha_values` in the simulator.
- The LM Eval Harness config can be used to store the inference results in a CSV file and use them directly for the MESS+ evaluation. This can be done by setting `write_to_disk` to `true`.
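To illustrate the points above, here is a minimal, hypothetical sanity check for such a config. The key names (`models`, `cost`, `gpu_id`) are illustrative assumptions rather than the project's exact schema; the provided YAML files are authoritative.

```python
import yaml

# Hypothetical schema: adjust the key names to match the actual config files.
with open("config/llama3/arc_challenge.yaml") as f:
    cfg = yaml.safe_load(f)

models = cfg["models"]  # assumed: a list of model entries, cheapest first

# MESS+ expects models ordered by increasing operating cost.
costs = [m["cost"] for m in models]
assert costs == sorted(costs), "Models must be ordered by increasing cost."

# Every model needs a GPU assignment, since all models are deployed in parallel.
assert all("gpu_id" in m for m in models), "Each model needs a GPU assignment."
```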
You can use any pre-generated inference outputs for the evaluation of MESS+. The `mess_plus_simulator.py` script expects input data of shape:

```
doc_id,input_text,benchmark_name,label_xsmall,acc_norm_xsmall,energy_consumption_xsmall,inference_time_xsmall,label_small,acc_norm_small,energy_consumption_small,inference_time_small,label_medium,acc_norm_medium,energy_consumption_medium,inference_time_medium,label_large,acc_norm_large,energy_consumption_large,inference_time_large
```
For all scenarios we provide `slurm` scripts with the full execution commands. Please refer to those files when running experiments.
Note that `acc_norm` can be replaced by `acc` for benchmarks where no normalized accuracy is available. Also, the model labels (e.g., `xsmall`) must match the labels in your YAML config file.
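As a quick sanity check before launching the simulator, a minimal sketch (the file name is a placeholder and the `xsmall`/`small`/`medium`/`large` labels are examples; use the labels from your own config):

```python
import pandas as pd

# Example labels; these must match the model labels in your YAML config.
labels = ["xsmall", "small", "medium", "large"]

df = pd.read_csv("inference_outputs.csv")  # hypothetical file name

# Each model label should contribute a full column group. For benchmarks
# without normalized accuracy, `acc_norm_*` becomes `acc_*`.
for label in labels:
    for prefix in ("label", "acc_norm", "energy_consumption", "inference_time"):
        col = f"{prefix}_{label}"
        assert col in df.columns, f"Missing column: {col}"
```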
A minimal example to run an experiment:

```bash
VLLM_USE_V1=0 python3 main.py --config config/llama3/arc_challenge.yaml --wandb-entity test --wandb-project test
```
You can find the algorithm's core logic inside the `algorithm/` package. While we chose a stochastic process to switch between exploration and exploitation, the two phases can also be applied in different ways to better fit certain applications.
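For intuition, here is an illustrative sketch of such a stochastic switch. The decay schedule and the constant are placeholders, not the exact schedule MESS+ uses (see the `algorithm/` package for that):

```python
import random

def should_explore(step: int, c: float = 1.0) -> bool:
    """Flip a biased coin whose exploration probability decays over time.

    Illustrative only: this decay rate is a placeholder, not the schedule
    implemented in the `algorithm/` package.
    """
    p_explore = min(1.0, c / (step ** 0.5))
    return random.random() < p_explore

# Exploration gathers labels from multiple models; exploitation routes
# each request to the cheapest model the classifier deems sufficient.
for t in range(1, 6):
    print(t, "explore" if should_explore(t) else "exploit")
```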
vLLM ships a legacy (V0) and a newer (V1) engine, and the newer engine's memory management does not work well when deploying multiple models onto a single GPU. You can get around this by specifying `VLLM_USE_V1=0 python3 mess_plus.py ...`.
The Zeus energy monitoring library sometimes reports RAPL capabilities for CPUs that do not actually support them, which can lead to an error. To solve this, disable RAPL at startup and monitor the GPUs only.
As a hotfix for `AttributeError: 'ZeusMonitor' object has no attribute 'cpus'. Did you mean: 'gpus'?`, you can disable CPU energy monitoring as follows:

- Navigate to `.venv/lib/python3.12/site-packages/zeus/device/cpu/rapl.py` and change line 137 to `return False`.
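If you prefer not to edit the installed package, the same effect may be achievable with a monkeypatch at startup. This sketch assumes the availability check in `zeus.device.cpu.rapl` is named `rapl_is_available` and is resolved at call time; if your Zeus version binds it at import time, the in-place edit above remains the reliable route.

```python
# Hypothetical monkeypatch: force Zeus to treat RAPL as unavailable.
# The helper name below is an assumption; verify it against your
# installed Zeus version before relying on this.
import zeus.device.cpu.rapl as rapl

rapl.rapl_is_available = lambda: False  # assumed helper name

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])  # monitor GPUs only
```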
We have used the Language Model Evaluation Harness
(LM-Eval Harness) as a basis for our code.
Thanks to the EleutherAI Institute for making such a comprehensive evaluation suite available.