# llm-jp-eval-mm
[![pypi](https://img.shields.io/pypi/v/eval-mm.svg)](https://pypi.python.org/pypi/eval-mm) [![Test workflow](https://github.com/llm-jp/llm-jp-eval-mm/actions/workflows/test.yml/badge.svg)](https://github.com/llm-jp/llm-jp-eval-mm/actions/workflows/test.yml) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.

![Overview of llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/assets/teaser.png)

## Getting Started

Clone the repository and set up the environment with [uv](https://docs.astral.sh/uv/):

```bash
git clone https://github.com/llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```

Alternatively, install the package from PyPI:

```bash
pip install eval_mm
```

To use LLM-as-a-Judge, which sends requests to a judge model such as GPT-4o via the OpenAI API, configure your API keys in a `.env` file, as shown in the example below:
- For Azure: set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
- For OpenAI: set `OPENAI_API_KEY`

If you are not using LLM-as-a-Judge, you can set any placeholder value in the `.env` file to bypass the error.
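
For example, a minimal `.env` might look like this (placeholder values; set only the variables for the backend you use):

```
# For OpenAI
OPENAI_API_KEY=sk-...
# For Azure
AZURE_OPENAI_ENDPOINT=https://<your-endpoint>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-key>
```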

## Usage

To evaluate a model on a task, run the example script `examples/sample.py`:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics "llm_as_a_judge_heron_bench" \
  --judge_model "gpt-4o-2024-11-20" \
  --overwrite
```

The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```

To evaluate multiple models on multiple tasks, please check `eval_all.sh`.
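
A minimal sketch of such a loop, reusing the `examples/sample.py` flags shown above (the model and task lists are placeholders; `eval_all.sh` is the authoritative reference):

```bash
# Hypothetical lists: fill in the model IDs and task IDs you want to run
MODELS=("llava-hf/llava-1.5-7b-hf")
TASKS=("japanese-heron-bench")

for model_id in "${MODELS[@]}"; do
  for task_id in "${TASKS[@]}"; do
    uv run --group normal python examples/sample.py \
      --model_id "$model_id" \
      --task_id "$task_id" \
      --result_dir result
  done
done
```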

## Hello World Example

You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

# A mock VLM that always returns the same answer; replace it with your own model
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駅"

# Load a benchmark task and pick one example
task_id = "japanese-heron-bench"
task = TaskRegistry.load_task(task_id)
example = task.dataset[0]

# Build the model inputs and the reference answer from the example
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

# Generate a prediction and score it with the ROUGE-L scorer
model = MockVLM()
prediction = model.generate(images, input_text)
scorer = ScorerRegistry.load_scorer("rougel", ScorerConfig(docs=task.dataset))
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```


## Leaderboard

To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py
```

The official leaderboard is available [here](https://llm-jp.github.io/llm-jp-eval-mm/).

## Supported Tasks

Japanese Tasks:
- [Japanese Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
- [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)

English Tasks:
- [MMMU](https://huggingface.co/datasets/MMMU/MMMU)
- [LlaVA-Bench-In-the-Wild](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild)

## Managing Dependencies

We use uv’s dependency groups to manage each model’s dependencies.

For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```

See `eval_all.sh` for the complete list of model dependencies.

When adding a new group, remember to configure [conflicting dependencies](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).

## Browse Predictions with Streamlit

Visualize your model’s predictions with the following Streamlit app:
```bash
uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
```

![Streamlit](./assets/streamlit_visualization.png)


## Contribution

If you encounter issues or have suggestions or improvements, please open an issue or submit a pull request.

### Adding a New Task

To add a new task, implement the Task class in `src/eval_mm/tasks/task.py`.
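
As a rough sketch only (the methods below mirror those used in the Hello World example above; the base class in `src/eval_mm/tasks/task.py` is the authoritative interface), a new task might look like this:

```python
from eval_mm.tasks.task import Task  # hypothetical import path

class MyNewTask(Task):
    def doc_to_text(self, doc) -> str:
        # Build the text prompt from one dataset example
        return doc["question"]

    def doc_to_visual(self, doc) -> list:
        # Return the image(s) for the example
        return [doc["image"]]

    def doc_to_answer(self, doc) -> str:
        # Return the reference answer used for scoring
        return doc["answer"]
```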

### Adding a New Metric

To add a new metric, implement the Scorer class in `src/eval_mm/metrics/scorer.py`.
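
As a rough sketch (assuming a scorer exposes the `score` and `aggregate` steps used in the Hello World example; see `src/eval_mm/metrics/scorer.py` for the real base class):

```python
from eval_mm.metrics.scorer import Scorer, AggregateOutput  # hypothetical import path

class ExactMatchScorer(Scorer):
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # One score per (reference, prediction) pair
        return [float(r == p) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> AggregateOutput:
        # Average the per-example scores into a single overall score
        mean = sum(scores) / len(scores) if scores else 0.0
        return AggregateOutput(overall_score=mean, details={"exact_match": mean})
```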

### Adding a New Model

To add a new model, implement the VLM class in `examples/base_vlm.py`.
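
A minimal sketch, assuming the `generate(images, text)` interface used by `MockVLM` in the Hello World example above (check `examples/base_vlm.py` for the actual base class):

```python
from PIL import Image

class MyVLM:
    def __init__(self, model_id: str):
        self.model_id = model_id
        # Load your model and processor here

    def generate(self, images: list[Image.Image], text: str) -> str:
        # Run inference on the images and prompt, and return the answer text
        return "..."
```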

### Adding a New Dependency

Add a new dependency (optionally to a specific group) with:
```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```


### Testing

Run the following commands to test the tasks, metrics, and models:
```bash
bash test.sh
bash test_model.sh
```

### Formatting and Linting

Ensure code consistency with:
```bash
uv run ruff format src
uv run ruff check --fix src
```

### Releasing to PyPI

To release a new version:
```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```


### Updating the Website

For website updates, see [github_pages/README.md](./github_pages/README.md).

To update the leaderboard data on the website, run:
```bash
python scripts/make_leaderboard.py --update_pages
```