llm-jp-eval-mm is a lightweight framework for evaluating vision-language models across various benchmark tasks, with a focus on Japanese tasks.
You can install llm-jp-eval-mm from GitHub or via PyPI.
- Option 1: Clone from GitHub (Recommended)

```bash
git clone [email protected]:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```

- Option 2: Install via PyPI

```bash
pip install eval_mm
```

To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
- For Azure: Set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: Set `OPENAI_API_KEY`
If you are not using LLM-as-a-Judge, you can set these variables to any dummy value in the `.env` file to bypass the error.
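For example, a minimal `.env` might look like the following; the values are placeholders, and you only need the variables for the backend you actually use:

```
# OpenAI backend (placeholder value)
OPENAI_API_KEY=your-openai-api-key

# Azure backend (placeholder values)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-azure-api-key
```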
To evaluate a model on a task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics heron-bench \
  --judge_model gpt-4o-2024-11-20 \
  --overwrite
```

The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
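Both output files are in JSON Lines format, so they are easy to inspect programmatically. The sketch below assumes only that each line is a JSON record; the exact field names depend on the task and metric:

```python
import json
from pathlib import Path

# Path taken from the directory layout above.
pred_path = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf/prediction.jsonl")

# Each line of a .jsonl file is one JSON record.
with pred_path.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"loaded {len(records)} predictions")
print(records[0])  # inspect the fields of the first record
```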
To evaluate multiple models on multiple tasks, please check eval_all.sh.
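For reference, such a script boils down to a loop over model and task IDs around the examples/sample.py invocation shown above. The sketch below reuses the same example IDs and flags and is not a substitute for eval_all.sh, which also selects the correct dependency group per model:

```bash
#!/usr/bin/env bash
# Sketch only: model/task IDs and metrics are the examples from above.
for model_id in llava-hf/llava-1.5-7b-hf; do
  for task_id in japanese-heron-bench; do
    uv run --group normal python examples/sample.py \
      --model_id "$model_id" \
      --task_id "$task_id" \
      --result_dir result \
      --metrics heron-bench \
      --judge_model gpt-4o-2024-11-20
  done
done
```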
You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig


# A stand-in model that always returns the same answer ("Hayao Miyazaki").
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"


# Load a task and pull the prompt, images, and reference answer for one example.
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

# Generate a prediction and score it with the ROUGE-L scorer.
model = MockVLM()
prediction = model.generate(images, input_text)
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```

To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```

This will create a leaderboard.md file summarizing your models' performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge | 
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 | 
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 | 
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 | 
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 | 
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 | 
The official leaderboard is available here
Japanese Tasks:
- Japanese Heron Bench
- JA-VG-VQA500
- JA-VLM-Bench-In-the-Wild
- JA-Multi-Image-VQA
- JDocQA
- JMMMU
- JIC-VQA
- MECHA-ja
- CC-OCR (multi_lan_ocr split, ja subset)
- CVQA (ja subset)
English Tasks:
- MMMU
We use uv’s dependency groups to manage each model’s dependencies.
For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```

See eval_all.sh for the complete list of model dependencies.
When adding a new group, remember to declare its conflicts with the existing groups.
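As a rough sketch of what that looks like in pyproject.toml (the group names and packages here are illustrative, not the project's actual configuration), uv lets you declare mutually exclusive groups under tool.uv.conflicts:

```toml
# Illustrative only: the real groups and pins live in the project's pyproject.toml.
[dependency-groups]
normal = ["transformers"]
vilaja = ["transformers"]  # e.g. a different, incompatible pin

[tool.uv]
# Tell uv that these groups must never be resolved or installed together.
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```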
To browse model predictions in a web UI, run:

```bash
uv run streamlit run scripts/browse_prediction.py -- --task_id japanese-heron-bench --result_dir result --model_list llava-hf/llava-1.5-7b-hf
```

To add a new task, implement the Task class in src/eval_mm/tasks/task.py.
To add a new metric, implement the Scorer class in src/eval_mm/metrics/scorer.py.
To add a new model, implement the VLM class in examples/base_vlm.py.
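As a minimal sketch of the model interface (only the generate signature is taken from the MockVLM example above; the constructor and model-loading details are illustrative assumptions, and the actual base class lives in examples/base_vlm.py):

```python
from PIL import Image


class MyVLM:
    """Hypothetical wrapper following the generate(images, text) -> str interface."""

    def __init__(self, model_name: str) -> None:
        # Load your model and processor here; details depend on the model.
        self.model_name = model_name

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Run inference on the images and prompt, then return the answer text.
        # A constant is returned here so the sketch stays runnable.
        return "placeholder answer"
```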
Install a new dependency using the following command:

```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```

Run the following commands to test tasks, metrics, and models:
```bash
bash test.sh
bash test_model.sh
```

Ensure code consistency with:
```bash
uv run ruff format src
uv run ruff check --fix src
```

To release a new version:
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```

For website updates, see github_pages/README.md.
To update leaderboard data:
```bash
python scripts/make_leaderboard.py --update_pages
```

- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.

