Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
Official code for the ACL 2026 Main Conference paper "Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality".
XBridge leverages a compositional encoder-LLM-decoder architecture that offloads multilingual capability to the composed NMT model while preserving the LLM as an English-centric core for general knowledge processing. XBridge brings low-resource and unseen language performance close to that of composed NMT models, substantially narrowing the gap across languages without retraining the LLM.
- Compositional multilinguality: separates responsibilities across modules: encoder for multilingual understanding, LLM for general knowledge processing, and decoder for multilingual generation.
- Strong cross-lingual generalization: the cross-model mapping layers are language-agnostic and generalize well even to languages not seen during tuning.
- Controllable language generation: the target language token of the decoder determines the output language.
- Lossless language switching: supports arbitrary language-to-language generation through the LLM pivot without degrading performance.
- Mitigating catastrophic forgetting in multilingual extension: boosts the LLM's understanding and generation in low-resource or unseen languages to near-NMT performance while maintaining or improving performance in high-resource languages, avoiding the common new–old language trade-off in multilingual extension.
- Efficient training: requires only minimal additional parameters, limited training data (mostly bilingual pairs), and modest overhead.
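As an illustration of controllable language generation, the sketch below maps the two-letter codes used in the evaluation scripts of this README to NLLB-200 language tokens. The dictionary `NLLB_CODES` and the helper `target_language_token` are hypothetical names introduced here; how the repository actually selects the decoder token may differ.

```python
# Assumption: the decoder is an NLLB-200 model, which marks the target
# language with FLORES-200 style tokens such as "zho_Hans".
# NLLB_CODES and target_language_token are illustrative names, not repo APIs.
NLLB_CODES = {
    "bn": "ben_Beng",
    "de": "deu_Latn",
    "en": "eng_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "ja": "jpn_Jpan",
    "ru": "rus_Cyrl",
    "sw": "swh_Latn",
    "th": "tha_Thai",
    "zh": "zho_Hans",
}


def target_language_token(lang: str) -> str:
    """Return the NLLB-style target language token for a two-letter code."""
    key = lang.strip().lower()
    if key not in NLLB_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return NLLB_CODES[key]


if __name__ == "__main__":
    # Same input, different output languages: only the token changes.
    for code in ("de", "sw", "zh"):
        print(code, "->", target_language_token(code))
```

Under this assumption, switching the output language only requires changing the token passed to the decoder; the encoder and the LLM see the same input.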
git clone https://github.com/ictnlp/XBridge.git
conda create -n xbridge python=3.9.12
conda activate xbridge
pip install -r requirements.txt

For evaluation, we use MMT-LLM for the translation task of base LLMs.

git clone https://github.com/NJUNLP/MMT-LLM.git

For training, we extract multilingual translation data from OPUS-100, multilingual mathematical reasoning data from MultilingualMath, and multilingual abstractive summarization data from XL-Sum. Please refer to the paper for detailed data construction procedures.
For evaluation, we test cross-model mapping quality (stage 1) on FLORES-101, multilingual mathematical reasoning on MGSM, and multilingual abstractive summarization on the XL-Sum test set.
XBridge composes LLMs with NMT models in three stages:
- Stage 1: Cross-Model Mapping. Establish coarse-grained semantic alignment among the multilingual encoder, the LLM, and the multilingual decoder using trilingual translation data (x, en, y).
- Stage 2: Encoder-Side Adaptation. Adapt multilingual input representations to downstream instruction-following tasks.
- Stage 3: Decoder-Side Adaptation. Adapt the LLM-decoder interface for robust multilingual generation.
See our paper for details about the training strategy.
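To make the trilingual (x, en, y) format of stage 1 concrete, here is a minimal sketch that joins two English-centric bilingual corpora on their shared English sentence. This is only one plausible construction, assuming both corpora contain identical English sides; `build_triples` is a hypothetical helper, and the paper describes the actual data construction procedure.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def build_triples(
    x_en_pairs: Iterable[Tuple[str, str]],
    en_y_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Join (x, en) and (en, y) sentence pairs into (x, en, y) triples.

    Assumption: the English side matches exactly (after stripping
    whitespace) in both corpora. Unmatched sentences are dropped.
    """
    # Index the en->y corpus by its English sentence.
    by_english = defaultdict(list)
    for en, y in en_y_pairs:
        by_english[en.strip()].append(y)

    # Look up each English pivot from the x->en corpus.
    triples = []
    for x, en in x_en_pairs:
        pivot = en.strip()
        for y in by_english.get(pivot, []):
            triples.append((x, pivot, y))
    return triples


if __name__ == "__main__":
    de_en = [("Guten Morgen.", "Good morning."), ("Danke.", "Thank you.")]
    en_fr = [("Good morning.", "Bonjour."), ("See you.", "À bientôt.")]
    print(build_triples(de_en, en_fr))
    # [('Guten Morgen.', 'Good morning.', 'Bonjour.')]
```

Each resulting triple supplies a source sentence for the encoder, an English pivot for the LLM, and a target sentence for the decoder.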
Below is an example evaluation script.
# evaluation on stage1
generate_batch_from_file=inference_xbridge_stage1.py
mt_tokenizer_path=/path/to/your/NMT/model
llm_tokenizer_path=/path/to/your/LLM
base_model=/path/to/your/stage1/checkpoint
testset_dir=/path/to/your/FLORES-101
output_dir=/path/to/your/output/dir
test_langs=en,bn,de,es,fr,ja,ru,sw,th,zh
mkdir -p $output_dir
CUDA_VISIBLE_DEVICES=0 python $generate_batch_from_file \
--mt_tokenizer_path $mt_tokenizer_path --llm_tokenizer_path $llm_tokenizer_path \
--base_model $base_model \
--batch_size 12 \
--testset_dir $testset_dir --output_dir $output_dir \
--test_langs $test_langs --max_new_tokens 512
# evaluation on stage2&3
generate_batch_from_file=inference_xbridge_stage2_and_3.py
mt_tokenizer_path=/path/to/your/NMT/model
llm_tokenizer_path=/path/to/your/LLM
base_model=/path/to/your/stage3/checkpoint
testset_dir=/path/to/your/MGSM
output_dir=/path/to/your/output/dir
test_langs=en,bn,de,es,fr,ja,ru,sw,th,zh
mkdir -p $output_dir
CUDA_VISIBLE_DEVICES=0 python $generate_batch_from_file \
--mt_tokenizer_path $mt_tokenizer_path --llm_tokenizer_path $llm_tokenizer_path \
--base_model $base_model \
--batch_size 12 \
--testset_dir $testset_dir --output_dir $output_dir \
--test_langs $test_langs --max_new_tokens 512

We release XBridge-base and XBridge-SFT in the Hugging Face collection:
- XBridge-base is trained with stage 1 (cross-model alignment) using trilingual translation data, composing LLaMA3-8B with NLLB-200-1.3B. Training is conducted on 10 languages: Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh.
- XBridge-SFT further extends XBridge-base by training stage 2 (encoder-side adaptation) and stage 3 (decoder-side adaptation) for instruction-following tasks on the Bactrian-X dataset. We expand to the following additional languages: Af, Ar, Az, Cs, El, Et, Fa, Fi, Gl, Gu, He, Hi, Hr, Id, It, Ka, Kk, Km, Lt, Lv, Mk, Ml, Mn, Mr, My, Ne, Nl, Pl, Ps, Pt, Ro, Sl, Sv, Ta, Te, Tr, Uk, Ur, Vi, Xh.
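The language coverage of the two released models can be written down as a small lookup, which also confirms that the 10 base languages plus the 40 additional ones give 50 in total. This is a sketch that assumes XBridge-SFT covers the base languages as well as the additional ones; `supported_languages` is a hypothetical helper, not part of the repository.

```python
# Language lists copied from the model descriptions in this README.
BASE_LANGS = "Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh".split(", ")
SFT_EXTRA_LANGS = (
    "Af, Ar, Az, Cs, El, Et, Fa, Fi, Gl, Gu, He, Hi, Hr, Id, It, Ka, Kk, "
    "Km, Lt, Lv, Mk, Ml, Mn, Mr, My, Ne, Nl, Pl, Ps, Pt, Ro, Sl, Sv, Ta, "
    "Te, Tr, Uk, Ur, Vi, Xh"
).split(", ")


def supported_languages(model: str) -> list:
    """Return the sorted language codes a released model was trained on.

    Assumption: XBridge-SFT covers the base languages plus the extras.
    """
    if model == "XBridge-base":
        return sorted(BASE_LANGS)
    if model == "XBridge-SFT":
        return sorted(BASE_LANGS + SFT_EXTRA_LANGS)
    raise ValueError(f"unknown model: {model!r}")


if __name__ == "__main__":
    print(len(supported_languages("XBridge-base")))  # 10
    print(len(supported_languages("XBridge-SFT")))   # 50
```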
Try our Gradio demo for general QA across 50 languages!
gradio_demo=demo.py
mt_tokenizer_path=/path/to/your/NMT/model
llm_tokenizer_path=/path/to/your/LLM
model_path=/path/to/our/hf/model
CUDA_VISIBLE_DEVICES=0 python $gradio_demo \
--model_path $model_path \
--mt_tokenizer_path $mt_tokenizer_path --llm_tokenizer_path $llm_tokenizer_path \
--max_gen_len 256

Our code is released under the Apache-2.0 License. Our model is intended for academic research purposes only and may NOT be used for commercial purposes.
You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:
- Non-commercial use: The model may not be used for any commercial purposes.
- Citation: If you use this model in your research, please cite the original work.
For any commercial use inquiries or to obtain a commercial license, please contact fengyang@ict.ac.cn.
If you have any questions, please feel free to submit an issue or contact bumengyu23z@ict.ac.cn.
If you find this repository useful, please star this repository and cite our paper:
@misc{bu2026languagedemandknowledgecore,
title={Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality},
author={Mengyu Bu and Yang Feng},
year={2026},
eprint={2603.17512},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.17512},
}