
Conversation

@NathanHB (Member) commented on May 7, 2025

What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

Highlights

  • Prompt Manager Overhaul: Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
    • system-prompt: now part of the model config
    • use-chat-template: now part of the model config
  • Metrics Slimdown: Metrics now only care about SamplingMethod (generative or loglikelihood). Say goodbye to use_case and all those old request types.
  • Request Layer Gone: Models get the raw Doc directly; no more unnecessary request wrappers bloating the code.
  • Unified ModelResponse: All models return a single ModelResponse type, whether generative or loglikelihood. This means simpler logging and metric computation.
  • Consistent Metric Signatures: Every metric now uses the same function signature: compute(doc: Doc, model_response: ModelResponse). See the sketch after this list.
  • Standardized Details: Each sample's details now always include three fields: doc, metric, and model_response.
  • Generative Metrics Unified: All generative metrics now work the same way. Users who want greedy generation set temperature to 0; an exception is raised if they try to run a sampling metric with temperature = 0.
  • Removed Loglikelihood Single Token: it was bloated and rarely used.
  • Tests: All tests pass, and no changes were needed to expected values.
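
To make the unified signature concrete, here is a minimal, self-contained sketch; the fields on Doc and ModelResponse are assumptions for illustration, not the actual lighteval definitions:

```python
# Toy sketch of the unified metric signature. Field names are assumed, not
# copied from lighteval.
from dataclasses import dataclass, field


@dataclass
class Doc:
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0


@dataclass
class ModelResponse:
    text: list[str] = field(default_factory=list)        # generative outputs
    logprobs: list[float] = field(default_factory=list)  # loglikelihood outputs


def compute(doc: Doc, model_response: ModelResponse) -> float:
    """Toy exact-match metric written against the unified signature."""
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold.strip())


doc = Doc(query="2 + 2 = ?", choices=["3", "4"], gold_index=1)
print(compute(doc, ModelResponse(text=["4"])))  # 1.0
```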

Why?

  • Less code, fewer headaches.
  • Easier to add new benchmarks (including weird and wonderful ones).
  • More user-friendly inspection tools.
  • A single, unified way to handle prompts, responses, and metrics.

[Image: architecture of lighteval]

[Image: example details dataset]

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB changed the title from "refacto prompt building" to "Refacto and remove bloated code" on Jun 23, 2025.
@NathanHB requested a review from Copilot on Jun 23, 2025 at 15:51.
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR refactors the prompt-building and evaluation logic in lighteval by removing legacy request wrappers, unifying data structures (Doc and ModelResponse), and simplifying pipeline and registry handling.

  • Introduces a single Doc dataclass for all task inputs and a unified ModelResponse
  • Replaces multiple request types and response classes with SamplingMethod and ModelResponse
  • Updates Pipeline, Registry, and prompt management to work with the new structures (a rough sketch of the resulting model interface follows)
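
Reusing the toy Doc and ModelResponse dataclasses from the sketch above, the model side of the refactor could look roughly like this; FakeModel comes from tests/utils.py, but the greedy_until name and signature here are assumptions:

```python
# Hypothetical backend after the refactor: it consumes Docs directly and
# returns unified ModelResponses, with no request-wrapper layer in between.
class FakeModel:
    def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
        return [ModelResponse(text=["fake output"]) for _ in docs]


responses = FakeModel().greedy_until([Doc(query="hello")])
print(responses[0].text)  # ['fake output']
```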

Reviewed Changes

Copilot reviewed 84 out of 89 changed files in this pull request and generated 3 comments.

Files reviewed:
  • tests/utils.py: Update FakeModel to return ModelResponse and use Doc
  • src/lighteval/tasks/default_prompts.py: Changed default prompt construction, removed instructions
  • src/lighteval/tasks/requests.py: Replaced old request classes with a large Doc dataclass
  • src/lighteval/models/model_output.py: Consolidated response types into a single, expanded ModelResponse
Comments suppressed due to low confidence (1)

src/lighteval/tasks/default_prompts.py:64

  • The instructions variable was removed from the default prompt, so any task-specific instructions will no longer appear. Consider restoring instructions (e.g. f"{instructions}\n{question}\n{formatted_choices}") or explicitly handling when instructions is empty.
    prompt = f"\n{question}\n{formatted_choices}"
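
One possible shape for that fix, as a sketch only; the build_prompt wrapper is hypothetical, and the convention that instructions may be an empty string is assumed:

```python
def build_prompt(question: str, formatted_choices: str, instructions: str = "") -> str:
    # Prepend task-specific instructions only when present, so tasks without
    # instructions don't get a stray leading newline.
    if instructions:
        return f"{instructions}\n{question}\n{formatted_choices}"
    return f"{question}\n{formatted_choices}"
```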

Co-authored-by: Copilot <[email protected]>
@@ -102,6 +110,51 @@ def prepare(self, gold_ixs: list[int], choices_logprob: list[float], **kwargs) -
return LogprobCorpusMetricInput(golds=gold_ixs, preds=np.argmax(choices_logprob))


class TargetPerplexityPreparator:
A reviewer (Member) commented:

Why introduce a new class instead of adding an is_target parameter (False by default) to the next one, especially when so much of the code is the same?
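
A minimal sketch of the suggested merge; the class and method bodies are assumptions, intended only to show the flag-parameter pattern:

```python
class PerplexityPreparator:
    """Sketch: one preparator with an is_target flag replacing two classes."""

    def __init__(self, is_target: bool = False):
        self.is_target = is_target

    def prepare(self, logprobs: list[float], reference_texts: list[str], **kwargs):
        # Shared preparation logic lives here; branch on self.is_target only
        # at the points where TargetPerplexityPreparator actually differed.
        if self.is_target:
            ...  # target-conditioned handling
        ...  # common return path
```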

Comment on lines +651 to +655
if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
raise ValueError(
"You cannot generate multiple samples with temperature=0. Please set temperature > 0. Or use a non sampling metric."
)

A reviewer (Member) commented:

I wonder if we couldn't put this one in the abstract class.
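
For instance, the guard could live once in the shared base class; LightevalModel is the abstract model class in lighteval, but the helper name and placement here are assumptions:

```python
from abc import ABC


class LightevalModel(ABC):
    # Hypothetical shared guard: hoisted into the abstract base so every
    # backend validates sampling parameters identically.
    def _validate_sampling(self, num_samples: int, temperature: float) -> None:
        if num_samples > 1 and temperature == 0:
            raise ValueError(
                "You cannot generate multiple samples with temperature=0. "
                "Please set temperature > 0, or use a non-sampling metric."
            )
```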

@NathanHB merged commit 9288bd8 into main on Jun 25, 2025; 5 checks passed.
Comment on lines +888 to +889
pad_amount = global_max_choices - cont_batch.shape[0]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)

A reviewer commented:

Shouldn't it be

                        pad_amount = global_max_choices - cont_batch.shape[1]
                        padded = F.pad(cont_batch, (0, pad_amount), value=-1)

here?

Reply:

Hmm, then I get other shape errors in torch.stack; something looks wrong here.
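
For readers following the exchange, a standalone demonstration of the F.pad semantics in question (the tensor shape and global_max_choices value are invented). Which axis should be padded is exactly what is being debated above; this only shows which axis a two-element pad tuple touches:

```python
import torch
import torch.nn.functional as F

# With a two-element pad tuple, F.pad extends the LAST dimension, so for a
# 2D tensor the amount must be computed from shape[1]; shape[0] measures the
# other axis. Padding dim 0 instead would need a 4-tuple, e.g.
# (0, 0, 0, pad_amount).
cont_batch = torch.zeros(4, 7)  # invented shape
global_max_choices = 10

pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
print(padded.shape)  # torch.Size([4, 10])
```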
