Commit 9a478f4

Merge pull request #173 from llm-jp/dev-update-v1.0
dev-update-v1.0: clean history for v1.1 content
2 parents: f0998c3 + 867ea1d


63 files changed: +3056 / −221 lines

.github/workflows/test.yml

Lines changed: 14 additions & 4 deletions
@@ -7,16 +7,26 @@ jobs:
  uv-example:
    name: python
    runs-on: ubuntu-latest
+    env:
+      HF_TOKEN: ${{ secrets.HF_TOKEN }}
+      HUGGINGFACE_HUB_TOKEN: ${{ secrets.HF_TOKEN }}

    steps:
      - uses: actions/checkout@v4
+        with:
+          submodules: recursive

      - name: Install uv
        uses: astral-sh/setup-uv@v5

-      - name: Install the project
-        run: uv sync --dev
+      - name: Prepare submodules
+        run: git submodule update --init --recursive

-      - name: Run tests
-        # For example, using `pytest`
+      - name: Install the project (dev deps)
+        run: uv sync --group dev
+
+      - name: Run tests (metrics)
        run: uv run pytest src/eval_mm/metrics/*.py
+
+      - name: Run tests (tasks)
+        run: bash test_git_workflow.sh
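The new `env:` block exports the Hugging Face token to every CI step so that gated model downloads can authenticate. As a rough sketch of how a wrapper is expected to pick the token up (mirroring the `os.getenv("HF_TOKEN", ...)` pattern in `examples/aya_vision.py` added by this commit; the model id is simply the one that file uses):

```python
# Sketch only: consuming the CI-provided token in a wrapper,
# following the pattern in examples/aya_vision.py below.
import os

from transformers import AutoProcessor

token = os.getenv("HF_TOKEN")  # exported via the workflow's `env:` block
processor = AutoProcessor.from_pretrained("CohereLabs/aya-vision-8b", token=token)
```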

.gitignore

Lines changed: 40 additions & 1 deletion
@@ -101,6 +101,9 @@ venv.bak/
# mypy
.mypy_cache/

+# Ruff
+.ruff_cache/
+
# weights and biases
wandb/
outputs/
@@ -123,7 +126,9 @@ tmp/
*verbose.jsonl
__depr__/

-# examples/llava for evaluating LLM-jp-3 VILA
+# examples/vila for evaluating LLM-jp-3 VILA
+examples/vila/*
+# optional official LLaVA submodule
examples/llava/*

# experiments
@@ -142,3 +147,37 @@ uv.lock

# cache
.cache/
+tmp*/
+
+# pytest cache
+.pytest_cache/
+
+# OS/editor files
+.DS_Store
+Thumbs.db
+.idea/
+
+# Streamlit local config
+.streamlit/
+
+# Local artifacts
+artifact/
+artifact/*_tmp_*/
+outputs/*_tmp_*/
+result/*_tmp_*/
+
+# Personal dev settings
+.python-version
+.uv/
+.uv
+.direnv/
+.envrc
+.devcontainer/
+*.code-workspace
+*.swp
+*.swo
+*~
+
+# HPC scripts (ignored per request)
+scripts/nvlink/
+scripts/tsubame/

.gitmodules

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[submodule "examples/llava"]
+  path = examples/llava
+  url = https://github.com/haotian-liu/LLaVA.git
+[submodule "examples/vila"]
+  path = examples/vila
+  url = https://github.com/llm-jp/llm-jp-VILA.git
+  branch = dep-relax

AGENTS.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# Repository Guidelines
+
+This is a concise, coding‑agent–friendly guide for contributing and extending the llm-jp-eval-mm evaluation framework.
+
+## Project Structure
+
+- `src/eval_mm/`: Core library
+  - `tasks/`: Task loaders/adapters; register in `task_registry.py`
+  - `metrics/`: Scorers and aggregation utilities; register in `scorer_registry.py`
+  - `utils/`: Helpers (e.g., Azure/OpenAI client)
+- `examples/`: Reference VLM wrappers and runnable samples
+  - `vila/`: llm-jp VILA wrapper (submodule)
+  - `llava/`: official LLaVA (optional submodule)
+- `scripts/`: Leaderboard, Streamlit browser, dataset prep
+- `assets/`, `data/`, `dataset/`: Static assets and datasets (not committed)
+- `result/`, `outputs/`: Evaluation artifacts written by runs
+
+## Key Commands
+
+- Setup: `uv sync` (model deps via groups, e.g., `uv sync --group normal`)
+- Run sample eval: `uv run --group normal python examples/sample.py ...`
+- Tests: `bash test.sh` (tasks/metrics), `bash test_model.sh` (model smoke)
+- Lint/format: `uv run ruff format src && uv run ruff check --fix src`
+- Type check: `uv run mypy src`
+- Browse predictions: `uv run streamlit run scripts/browse_prediction.py -- --task_id <id> --result_dir result --model_list <model>`
+- Leaderboard: `python scripts/make_leaderboard.py --result_dir result`
+
+## Development Playbook (for Agents)
+
+- Add a task: implement `Task` in `src/eval_mm/tasks/<name>.py`; import it in `src/eval_mm/tasks/__init__.py`; register with `@register_task` in `task_registry.py`.
+- Add a scorer: implement in `src/eval_mm/metrics/<name>_scorer.py`; import in `metrics/__init__.py`; register in `scorer_registry.py`.
+- Add a model: wrap in `examples/` (see existing VLM wrappers) and map via `examples/model_table.py`.
+- Import pattern: `from eval_mm import TaskRegistry, ScorerRegistry` (avoid `src.` prefixes).
+- Tests: include `def test_*` near tasks/metrics; prefer `bash test.sh` (tasks/metrics) and `bash test_model.sh` (model smoke). For a single file, you may optionally run `uv run --group dev pytest <path> -v`, but CI expects the scripts.
+
+## Plan-First Workflow
+
+- Before any change, prepare a short checklist: objective, source of truth, inventory, diff policy, implementation steps, and acceptance criteria.
+- After alignment, implement the minimum needed for the agreed scope.
+- Example (naming unification):
+  - Source of truth: treat `scripts/nvlink/config.sh` entries (e.g., task IDs and metric map) as canonical.
+  - Inventory: compare identifiers used across code and configuration, and list discrepancies.
+  - Implementation: adopt the canonical identifiers in public-facing interfaces; keep backward-compatible aliases only if necessary.
+  - Validation: run `uv run python scripts/validate_config_consistency.py` and `bash test.sh`.
+
+## Coding Style & Conventions
+
+- Python ≥ 3.12, 4‑space indentation, type hints required
+- Names: packages/modules `lower_snake_case`; classes `CamelCase`; functions/vars `lower_snake_case`
+- Keep functions focused; prefer dataclasses/typed types for structured data
+- Use Ruff + pre-commit; follow existing import order and ignore rules
+
+## Commit & PR Guidelines
+
+- Prefix commits with `feat:`, `fix:`, `chore:`, `docs:` (see `git log`)
+- PRs include: clear description, linked issues, repro commands, sample outputs (e.g., `result/<task>/<model>/evaluation.jsonl`); CI must pass
+
+## Security & Config
+
+- LLM‑as‑a‑Judge: set `.env` with `AZURE_OPENAI_ENDPOINT`/`AZURE_OPENAI_KEY` or `OPENAI_API_KEY`
+- Do not commit secrets or large datasets; use `.env.sample`
+- Add model deps via `uv` groups and update conflicts in `pyproject.toml`
+
+## Temporary Validation (_tmp_ Policy)
+
+- Name temporary files/dirs with `_tmp_` (e.g., `result/<task>/<model>_tmp_/<run>`).
+- Keep them under `result/`, `outputs/`, or `artifact/` and remove after validation.
+- Avoid committing `_tmp_` artifacts; they are ignored by `.gitignore`.
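The Development Playbook in the file above names the registration hooks without showing them together; a minimal sketch of the "add a task" flow follows. The import paths for `Task` and `register_task`, the decorator signature, and the method names are assumptions for illustration only and should be checked against `src/eval_mm/tasks/`.

```python
# Hypothetical sketch of the "add a task" flow described in AGENTS.md.
# File: src/eval_mm/tasks/my_new_task.py -- the base-class import path,
# decorator signature, and hook names below are assumptions, not taken from this diff.
from eval_mm.tasks.task import Task
from eval_mm.tasks.task_registry import register_task


@register_task("my-new-task")  # task id as it would appear in task_registry.py
class MyNewTask(Task):
    def dataset(self):
        """Load and return the evaluation examples (assumed hook name)."""
        ...

    def doc_to_text(self, doc) -> str:
        """Build the prompt for a single example (assumed hook name)."""
        ...
```

Per the playbook, the new module then has to be imported in `src/eval_mm/tasks/__init__.py` so that the registration actually runs.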

CONTRIBUTORS.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+# Contributors (CRediT Taxonomy)
+
+This file lists provisional contributor information following the CRediT Taxonomy. As requested, the contributions "programming, experiments, testing, verification, and documentation" are mapped to the following CRediT roles.
+
+- Programming → Software
+- Experiments → Investigation
+- Testing → Software (including testing) / Validation (verification and reproducibility checks)
+- Verification → Validation
+- Documentation → Writing – Original Draft
+
+- Koki Maeda (@Silviase)
+  - Software (programming/testing)
+  - Investigation (experiments)
+  - Validation (verification/reproducibility checks)
+  - Writing – Original Draft
+
+- Issa Sugiura (@speed1313)
+  - Software
+  - Investigation
+  - Validation
+  - Writing – Original Draft
+
+- Yusuke Oda (@odashi)
+  - Conceptualization (overall structure and agreement on the framework concept)
+  - Methodology (design of the evaluation framework and its policies)
+  - Supervision (advice and progress oversight)
+
+- Shuhei Kurita (@shuheikurita)
+  - Supervision (advice and progress oversight)
+  - Writing – Review & Editing (document review/editing)
+
+- Naoaki Okazaki (@chokkanorg)
+  - Supervision (advice and progress oversight)
+  - Writing – Review & Editing (document review/editing)
+
+Notes:
+
+- In a real project, add other CRediT roles as needed (e.g., Conceptualization, Methodology, Data Curation, Resources, Supervision, Writing – Review & Editing).

examples/vila.py renamed to examples/_depr_vila.py

Lines changed: 4 additions & 3 deletions
@@ -1,3 +1,4 @@
+# -------- DEPRECATED --------
# This file is modified from https://github.com/haotian-liu/LLaVA/

# rye add protobuf
@@ -8,13 +9,13 @@

import torch

-from llava_vila.conversation import SeparatorStyle, conv_templates
-from llava_vila.mm_utils import (
+from vila.conversation import SeparatorStyle, conv_templates
+from vila.mm_utils import (
    get_model_name_from_path,
    process_images,
    tokenizer_image_token,
)
-from llava_vila.model.builder import load_pretrained_model
+from vila.model.builder import load_pretrained_model
from PIL import Image

examples/aya_vision.py

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForImageTextToText
+from base_vlm import BaseVLM
+from utils import GenerationConfig
+
+
+class VLM(BaseVLM):
+    def __init__(self, model_id: str = "CohereLabs/aya-vision-8b") -> None:
+        self.model_id = model_id
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+
+        # Check for HuggingFace token
+        import os
+        token = os.getenv("HF_TOKEN", None)
+
+        self.processor = AutoProcessor.from_pretrained(self.model_id, token=token)
+        self.model = AutoModelForImageTextToText.from_pretrained(
+            self.model_id,
+            device_map="auto",
+            torch_dtype=torch.float16,
+            low_cpu_mem_usage=True,
+            token=token
+        )
+
+    def generate(
+        self,
+        images: list[Image.Image] | None,
+        text: str,
+        gen_kwargs: GenerationConfig = GenerationConfig(),
+    ) -> str:
+        if images is None:
+            images = []
+
+        # Build the prompt with special tokens (based on VLLM implementation)
+        # For multiple images, we need to add <image> token for each image
+        image_tokens = "<image>" * len(images) if images else ""
+        prompt = f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{image_tokens}{text}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
+
+        # Process the inputs
+        inputs = self.processor(
+            text=prompt,
+            images=images if images else None,
+            return_tensors="pt",
+            padding=True
+        )
+
+        # Move inputs to device
+        inputs = {k: v.to(self.device) if torch.is_tensor(v) else v for k, v in inputs.items()}
+
+        # Generate response
+        with torch.no_grad():
+            outputs = self.model.generate(
+                **inputs,
+                max_new_tokens=gen_kwargs.max_new_tokens,
+                temperature=gen_kwargs.temperature,
+                top_p=gen_kwargs.top_p,
+                do_sample=gen_kwargs.do_sample if gen_kwargs.temperature > 0 else False,
+            )
+
+        # Decode only the generated tokens
+        generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
+        generated_text = self.processor.decode(generated_ids, skip_special_tokens=True)
+
+        return generated_text.strip()
+
+    def batch_generate(
+        self,
+        images_list: list[list[Image.Image]] | None,
+        text_list: list[str],
+        gen_kwargs: GenerationConfig = GenerationConfig(),
+    ) -> list[str]:
+        # For batch processing, we'll process each item sequentially
+        # as the model may not support true batch processing with different image counts
+        results = []
+        for images, text in zip(images_list or [[] for _ in text_list], text_list):
+            result = self.generate(images, text, gen_kwargs)
+            results.append(result)
+        return results
+
+
+if __name__ == "__main__":
+    vlm = VLM()
+    vlm.test_vlm()

examples/base_vllm.py

Lines changed: 8 additions & 2 deletions
@@ -7,7 +7,12 @@


class VLLM(BaseVLM):
-    def __init__(self, model_id: str = "google/gemma-3-4b-it") -> None:
+    def __init__(self,
+        model_id: str,
+        gpu_memory_utilization: float = 0.9,
+        max_model_len: int = None,
+        tensor_parallel_size: int = 1,
+    ) -> None:
        self.model_id = model_id
        self.registry = VLLMModelRegistry(self.model_id)
        self.processor = self.registry.processor
@@ -16,7 +21,8 @@ def __init__(self, model_id: str = "google/gemma-3-4b-it") -> None:
        engine_config = self.registry.get_engine_config(self.model_id)
        self.engine_args_dict = {
            "model": self.model_id,
-            "tensor_parallel_size": 2,  # number of GPUs of the machine, but 40 should be divisible by tensor_parallel_size
+            "tensor_parallel_size": tensor_parallel_size,  # number of GPUs of the machine, but 40 should be divisible by tensor_parallel_size
+            "gpu_memory_utilization": gpu_memory_utilization,
            "download_dir": "./.cache/vllm",
            **engine_config,
        }
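With the hard-coded `tensor_parallel_size` removed, callers now size the vLLM engine explicitly. A minimal usage sketch with illustrative values, using the model id that used to be the default:

```python
# Illustrative only: constructing the generalized wrapper from examples/base_vllm.py.
from base_vllm import VLLM

vlm = VLLM(
    model_id="google/gemma-3-4b-it",   # former default, now passed explicitly
    gpu_memory_utilization=0.9,
    max_model_len=None,
    tensor_parallel_size=1,            # set to the number of visible GPUs
)
```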
