# Repository Guidelines

This is a concise, coding‑agent–friendly guide to contributing to and extending the llm-jp-eval-mm evaluation framework.

## Project Structure

- `src/eval_mm/`: Core library
  - `tasks/`: Task loaders/adapters; register in `task_registry.py`
  - `metrics/`: Scorers and aggregation utilities; register in `scorer_registry.py`
  - `utils/`: Helpers (e.g., Azure/OpenAI client)
- `examples/`: Reference VLM wrappers and runnable samples
  - `vila/`: llm-jp VILA wrapper (submodule)
  - `llava/`: official LLaVA (optional submodule)
- `scripts/`: Leaderboard, Streamlit browser, dataset prep
- `assets/`, `data/`, `dataset/`: Static assets and datasets (not committed)
- `result/`, `outputs/`: Evaluation artifacts written by runs

## Key Commands

- Setup: `uv sync` (model deps via groups, e.g., `uv sync --group normal`)
- Run sample eval: `uv run --group normal python examples/sample.py ...`
- Tests: `bash test.sh` (tasks/metrics), `bash test_model.sh` (model smoke)
- Lint/format: `uv run ruff format src && uv run ruff check --fix src`
- Type check: `uv run mypy src`
- Browse predictions: `uv run streamlit run scripts/browse_prediction.py -- --task_id <id> --result_dir result --model_list <model>`
- Leaderboard: `python scripts/make_leaderboard.py --result_dir result`

## Development Playbook (for Agents)

- Add a task: implement `Task` in `src/eval_mm/tasks/<name>.py`; import it in `src/eval_mm/tasks/__init__.py`; register with `@register_task` in `task_registry.py` (see the first sketch after this list).
- Add a scorer: implement it in `src/eval_mm/metrics/<name>_scorer.py`; import it in `metrics/__init__.py`; register it in `scorer_registry.py` (see the second sketch after this list).
- Add a model: wrap it in `examples/` (see the existing VLM wrappers) and map it via `examples/model_table.py`.
- Import pattern: `from eval_mm import TaskRegistry, ScorerRegistry` (avoid `src.` prefixes).
- Tests: include `def test_*` functions near tasks/metrics; prefer `bash test.sh` (tasks/metrics) and `bash test_model.sh` (model smoke). For a single file you may run `uv run --group dev pytest <path> -v`, but CI expects the scripts.

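A minimal sketch of the task-registration flow above. The `@register_task` decorator, the `Task` base class, the registry module, and the import pattern come from this guide; the exact module paths inside `eval_mm.tasks`, the task ID, and the method names are illustrative assumptions, so mirror an existing file under `src/eval_mm/tasks/` for the real interface.

```python
# src/eval_mm/tasks/my_new_task.py -- hypothetical file; names below are assumptions
from eval_mm.tasks.task import Task                     # base-class location is a guess
from eval_mm.tasks.task_registry import register_task   # decorator import path is a guess


@register_task("my-new-task")  # illustrative task ID
class MyNewTask(Task):
    def load_dataset(self):
        # Load or prepare the benchmark split here (method name is an assumption).
        ...

    def doc_to_text(self, doc: dict) -> str:
        # Build the prompt shown to the model for one example.
        return doc["question"]

    def doc_to_answer(self, doc: dict) -> str:
        # Return the gold answer consumed by the scorer.
        return doc["answer"]
```

After defining the class, import it in `src/eval_mm/tasks/__init__.py` so registration runs, and access it through the preferred import pattern (`from eval_mm import TaskRegistry`), not through `src.`-prefixed paths.
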
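A corresponding sketch for a new scorer, including a co-located `def test_*` function as the Tests bullet above recommends. The `Scorer` base class, the registration decorator, and the `score` signature are assumptions; registration may instead be an explicit entry in `scorer_registry.py`, so follow whichever pattern the existing `*_scorer.py` files use.

```python
# src/eval_mm/metrics/my_metric_scorer.py -- hypothetical file; names below are assumptions
from eval_mm.metrics.scorer import Scorer                    # base-class location is a guess
from eval_mm.metrics.scorer_registry import register_scorer  # registration hook is a guess


@register_scorer("my-metric")  # illustrative metric ID
class MyMetricScorer(Scorer):
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # Per-example scores; aggregation helpers live alongside the scorers.
        return [float(r.strip() == p.strip()) for r, p in zip(refs, preds)]


def test_my_metric_scorer():
    # Co-located test picked up by `bash test.sh` / pytest.
    assert MyMetricScorer().score(["a"], ["a"]) == [1.0]
```
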
## Plan-First Workflow

- Before any change, prepare a short checklist: objective, source of truth, inventory, diff policy, implementation steps, and acceptance criteria.
- After alignment, implement the minimum needed for the agreed scope.
- Example (naming unification):
  - Source of truth: treat `scripts/nvlink/config.sh` entries (e.g., task IDs and metric map) as canonical.
  - Inventory: compare identifiers used across code and configuration, and list discrepancies.
  - Implementation: adopt the canonical identifiers in public-facing interfaces; keep backward-compatible aliases only if necessary.
  - Validation: run `uv run python scripts/validate_config_consistency.py` and `bash test.sh`.

## Coding Style & Conventions

- Python ≥ 3.12, 4‑space indentation, type hints required
- Names: packages/modules `lower_snake_case`; classes `CamelCase`; functions/vars `lower_snake_case`
- Keep functions focused; prefer dataclasses/typed types for structured data (see the example below)
- Use Ruff + pre-commit; follow the existing import order and ignore rules

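A small illustration of the dataclass preference; the field names are made up for the example and do not correspond to a real library type.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    """One evaluated example (illustrative fields only)."""

    question_id: str
    prediction: str
    score: float
```
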
## Commit & PR Guidelines

- Prefix commits with `feat:`, `fix:`, `chore:`, `docs:` (see `git log`)
- PRs include: clear description, linked issues, repro commands, sample outputs (e.g., `result/<task>/<model>/evaluation.jsonl`); CI must pass

## Security & Config

- LLM‑as‑a‑Judge: set `.env` with `AZURE_OPENAI_ENDPOINT`/`AZURE_OPENAI_KEY` or `OPENAI_API_KEY`
- Do not commit secrets or large datasets; use `.env.sample`
- Add model deps via `uv` groups and update conflicts in `pyproject.toml`

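A minimal sketch of checking those judge credentials before an LLM-as-a-Judge run. The selection logic here is illustrative only; the actual client lives in `src/eval_mm/utils/`.

```python
import os

# Either the Azure pair or a plain OpenAI key must be present in the environment
# (loaded from .env); never hard-code or commit these values.
if os.getenv("AZURE_OPENAI_ENDPOINT") and os.getenv("AZURE_OPENAI_KEY"):
    judge_backend = "azure"
elif os.getenv("OPENAI_API_KEY"):
    judge_backend = "openai"
else:
    raise RuntimeError("Set judge credentials in .env (see .env.sample).")
```
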
## Temporary Validation (_tmp_ Policy)

- Name temporary files/dirs with `_tmp_` (e.g., `result/<task>/<model>_tmp_/<run>`).
- Keep them under `result/`, `outputs/`, or `artifact/` and remove them after validation.
- Avoid committing `_tmp_` artifacts; they are ignored by `.gitignore`.

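A quick sketch of building a compliant temporary output path; the task and model names are placeholders.

```python
from pathlib import Path

task_id = "example-task"    # placeholder
model_id = "example-model"  # placeholder

# `_tmp_` in the directory name marks this run as disposable validation output.
tmp_dir = Path("result") / task_id / f"{model_id}_tmp_" / "run-001"
tmp_dir.mkdir(parents=True, exist_ok=True)
```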