This repository provides the code to systematically investigate the impact of adding parallel data on the multilingual capabilities of large language models (LLMs), as reported in the following publication:
Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models
Muhammad Reza Qorib, Junyi Li, and Hwee Tou Ng
The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), pages 33411–33424.
The codebase is built upon TinyLlama. The model checkpoints for each training configuration are available on the Hugging Face Hub:
- No Parallel: nusnlp/JGP-No-Parallel
- Multilingual: nusnlp/JGP-Multilingual
- Parallel Non-Adjacent: nusnlp/JGP-Parallel-Non-Adjacent
- Parallel First: nusnlp/JGP-Parallel-First
- Parallel Distributed: nusnlp/JGP-Parallel-Distributed
- Parallel Last (all): nusnlp/JGP-Parallel-Last-all
- Parallel Last (uni):
  - EN→ID: nusnlp/JGP-Parallel-Last-EN-ID
  - ID→EN: nusnlp/JGP-Parallel-Last-ID-EN
  - EN→ZH: nusnlp/JGP-Parallel-Last-EN-ZH
  - ZH→EN: nusnlp/JGP-Parallel-Last-ZH-EN
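The checkpoints above should be loadable with Hugging Face Transformers. The snippet below is a minimal sketch, assuming the repositories host standard TinyLlama/Llama-style weights; the chosen repository ID and the prompt are only illustrative.

```python
# Minimal sketch (not an official inference script): load a released checkpoint
# with Hugging Face Transformers, assuming TinyLlama/Llama-style weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nusnlp/JGP-Parallel-Last-all"  # any checkpoint listed above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# These are base (pretrained-only) models; the prompt below is illustrative only.
prompt = "English: Good morning.\nIndonesian:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```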
The datasets used to train each model are listed below:

| Experiment | Datasets |
|---|---|
| No-Parallel | nusnlp/JGP-SlimPajama |
| Multilingual | nusnlp/JGP-SlimPajama + nusnlp/JGP-Multilingual |
| Parallel Non-Adjacent | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-Non-Adjacent |
| Parallel First, Parallel Distributed, Parallel Last (all) | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel |
| Parallel Last (uni): EN→ID | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-EN-ID |
| Parallel Last (uni): ID→EN | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-ID-EN |
| Parallel Last (uni): EN→ZH | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-EN-ZH |
| Parallel Last (uni): ZH→EN | nusnlp/JGP-SlimPajama + nusnlp/JGP-Parallel-ZH-EN |
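Each dataset can likewise be pulled from the Hugging Face Hub. Below is a minimal sketch using the `datasets` library; the split name and field layout are assumptions, so check the dataset card for the actual schema.

```python
# Minimal sketch: stream a pretraining dataset from the Hub with the `datasets` library.
# The "train" split and the field layout are assumptions; see the dataset card.
from datasets import load_dataset

ds = load_dataset("nusnlp/JGP-SlimPajama", split="train", streaming=True)
first = next(iter(ds))
print(first.keys())  # inspect the available fields
```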
We assume that you have CUDA >= 11.8 installed.
Follow the official instructions to install a PyTorch version that matches your installed CUDA toolkit.
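Once PyTorch is installed, a quick sanity check such as the one below confirms that it was built against CUDA >= 11.8 and can see your GPU:

```python
# Sanity check: PyTorch version, the CUDA version it was built with, and GPU visibility.
import torch

print(torch.__version__)
print(torch.version.cuda)         # should report a version >= 11.8
print(torch.cuda.is_available())  # should print True on a GPU machine
```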
You can install a pre-built version of xFormers or build it from source as shown below:
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Likewise, you can install a pre-built version of FlashAttention or build it from source as shown below:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../.. && rm -rf flash-attention
Install the remaining dependencies:
pip install -r requirements.txt tokenizers sentencepiece
Building xFormers and FlashAttention may take five minutes or more, so don't worry if the process appears to stall or if the terminal prints many warnings.
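As a final sanity check, the minimal snippet below verifies that the compiled packages import correctly:

```python
# Verify that xFormers and FlashAttention were built and installed successfully.
import xformers
import flash_attn

print(xformers.__version__)
print(flash_attn.__version__)
```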
Then you are ready to go 🎉!
Please refer to PRETRAIN.md for instructions on reproducing the pretraining of our models.
Please use ALMA to evaluate translation performance and LM-Evaluation-Harness to evaluate commonsense reasoning.
This repository is licensed under the Apache-2.0 license.
If you find our work useful, we kindly ask that you cite our paper:
@inproceedings{qorib-etal-2025-just,
title = "Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models",
author = "Qorib, Muhammad Reza and
Li, Junyi and
Ng, Hwee Tou",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1602/",
doi = "10.18653/v1/2025.acl-long.1602",
pages = "33411--33424",
ISBN = "979-8-89176-251-0",
}
This repository builds on TinyLlama, which was developed with lit-gpt and flash-attention.