
Commit e5f49ac

Update README
1 parent cf394ff commit e5f49ac

File tree: 1 file changed (+42, -69 lines)


README.md

Lines changed: 42 additions & 69 deletions
@@ -1,35 +1,9 @@
# llm-jp-eval-mm
[![pypi](https://img.shields.io/pypi/v/eval-mm.svg)](https://pypi.python.org/pypi/eval-mm) [![Test workflow](https://github.com/llm-jp/llm-jp-eval-mm/actions/workflows/test.yml/badge.svg)](https://github.com/llm-jp/llm-jp-eval-mm/actions/workflows/test.yml) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

-[ [**Japanese**](./README_ja.md) | English ]
-
llm-jp-eval-mm is a lightweight framework for evaluating visual-language models across various benchmark tasks, mainly focusing on Japanese tasks.

-![What llm-jp-eval-mm provides](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/assets/teaser.png)
-
-## Table of Contents
-
-- [llm-jp-eval-mm](#llm-jp-eval-mm)
-- [Table of Contents](#table-of-contents)
-- [Getting Started](#getting-started)
-- [How to Evaluate](#how-to-evaluate)
-- [Running an Evaluation](#running-an-evaluation)
-- [Use llm-jp-eval-mm as a Library](#use-llm-jp-eval-mm-as-a-library)
-- [Leaderboard](#leaderboard)
-- [Supported Tasks](#supported-tasks)
-- [Required Libraries for Each VLM Model Inference](#required-libraries-for-each-vlm-model-inference)
-- [Benchmark-Specific Required Libraries](#benchmark-specific-required-libraries)
-- [Analyze VLMs Prediction](#analyze-vlms-prediction)
-- [Contribution](#contribution)
-- [How to Add a Benchmark Task](#how-to-add-a-benchmark-task)
-- [How to Add a Metric](#how-to-add-a-metric)
-- [How to Add Inference Code for a VLM Model](#how-to-add-inference-code-for-a-vlm-model)
-- [How to Add Dependencies](#how-to-add-dependencies)
-- [Testing](#testing)
-- [Formatting and Linting with Ruff](#formatting-and-linting-with-ruff)
-- [Releasing to PyPI](#releasing-to-pypi)
-- [Updating the Website](#updating-the-website)
-- [Acknowledgements](#acknowledgements)
+![Overview of llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm/blob/master/assets/teaser.png)

## Getting Started

@@ -47,20 +21,15 @@ uv sync
pip install eval_mm
```

-This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API.
-You need to configure the API keys in a .env file:
-- For Azure:`AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
-- For OpenAI: `OPENAI_API_KEY`
+To use LLM-as-a-Judge, configure your OpenAI API keys in a `.env` file:
+- For Azure: Set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`
+- For OpenAI: Set `OPENAI_API_KEY`

-If you're not using the LLM-as-a-judge method, you can set any value in the .env file to bypass the error.
+If you are not using LLM-as-a-Judge, you can assign any value in the `.env` file to bypass the error.
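
For reference, a minimal `.env` sketch using the variable names above (values are placeholders; keep only the entries for the provider you actually use):

```
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-openai-key>
OPENAI_API_KEY=<your-openai-api-key>
```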

+## Usage

-## How to Evaluate
-
-### Running an Evaluation
-
-To evaluate a model on a task, we provide an example script: `examples/sample.py`.
-
+To evaluate a model on a task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
@@ -73,7 +42,7 @@ uv run --group normal python examples/sample.py \
```

The evaluation results will be saved in the result directory:
-```
+```
result
├── japanese-heron-bench
│   ├── llava-hf
@@ -82,11 +51,11 @@ result
│   │   │   └── prediction.jsonl
```
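
To inspect these files programmatically, a minimal sketch along these lines should work; it assumes only standard JSON Lines formatting (one JSON object per line), and the path is a hypothetical example of the layout above:

```python
import json
from pathlib import Path

# Hypothetical path following the result layout above; adjust to your actual run.
pred_path = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf/prediction.jsonl")

with pred_path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)      # one JSON object per line
        print(sorted(record.keys()))   # peek at the fields each prediction carries
        break
```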

-If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
+To evaluate multiple models on multiple tasks, please check `eval_all.sh`.

-### Use llm-jp-eval-mm as a Library
+## Hello World Example

-You can also integrate llm-jp-eval-mm into your own code. Here's an example:
+You can integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig
@@ -114,7 +83,8 @@ print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```

-### Leaderboard
+
+## Leaderboard

To generate a leaderboard from your evaluation results, run:
```bash
@@ -137,8 +107,6 @@ The official leaderboard is available [here](https://llm-jp.github.io/llm-jp-eva

## Supported Tasks

-Currently, the following benchmark tasks are supported:
-
Japanese Tasks:
- [Japanese Heron Bench](https://huggingface.co/datasets/turing-motors/Japanese-Heron-Bench)
- [JA-VG-VQA500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
@@ -153,77 +121,82 @@ English Tasks:
- [MMMU](https://huggingface.co/datasets/MMMU/MMMU)
- [LlaVA-Bench-In-the-Wild](https://huggingface.co/datasets/lmms-lab/llava-bench-in-the-wild)

-## Required Dependencies for Each Model
+## Managing Dependencies

-We use uv’s dependency groups to manage each model’s dependencies.
+We use uv’s dependency groups to manage each model’s dependencies.

For example, to use llm-jp/llm-jp-3-vila-14b, run:
```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```

-See eval_all.sh for the complete list of model dependencies.
+See `eval_all.sh` for the complete list of model dependencies.

-When adding a new group, remember to configure [conflict](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).
+When adding a new group, remember to configure [conflict](https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies).
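
As a rough sketch of such a conflict declaration in `pyproject.toml` (the group names follow the examples above; whether any two groups actually conflict is project-specific):

```toml
[tool.uv]
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]
```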

-## Analyze Model Predictions
+## Browse Predictions with Streamlit

-Visualize your model’s predictions with the following Streamlit app:
```bash
uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
```
-You can view the visualized predictions below:
+
![Streamlit](./assets/streamlit_visualization.png)


## Contribution

-If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.
+### Adding a new task

-### How to Add a Benchmark Task
-Refer to the `src/eval_mm/tasks` directory to implement new benchmark tasks.
+To add a new task, implement the Task class in `src/eval_mm/tasks/task.py`.

-### How to Add a Metric
-To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in `src/eval_mm/metrics`.
+### Adding a new metric

-### How to Add Inference Code for a VLM Model
-Implement the inference code for VLM models in the VLM class. For reference, check `examples/base_vlm.py`.
+To add a new metric, implement the Scorer class in `src/eval_mm/metrics/scorer.py`.

-### How to Add Dependencies
-To add a new dependency, run:
-```
+### Adding a new model
+
+To add a new model, implement the VLM class in `examples/base_vlm.py`.
+
+### Adding a new dependency
+
+Install a new dependency using the following command:
+```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```


### Testing

-Run the following commands to test the task classes and metrics and to test the VLM models:
+Run the following commands to test tasks, metrics, and models:
```bash
bash test.sh
bash test_model.sh
```

-### Formatting and Linting with Ruff
-```
+### Formatting and Linting
+
+Ensure code consistency with:
+```bash
uv run ruff format src
uv run ruff check --fix src
```

### Releasing to PyPI
-To release a new version to PyPI:
-```
+
+To release a new version:
+```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```


### Updating the Website
-For website updates, refer to the [github_pages/README.md](./github_pages/README.md).

-To update the leaderboard data on the website, run:
+For website updates, see [github_pages/README.md](./github_pages/README.md).
+
+To update leaderboard data:
```bash
python scripts/make_leaderboard.py --update_pages
```