
Conversation

@NathanHB (Member) commented on May 7, 2025

What does this PR do?

This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.

The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥

Highlights

  • Prompt Manager Overhaul: Each model now owns its own PromptManager instance, with custom params for every flavor of prompt (multimodal, API, multiturn, you name it).
    • system-prompt: now part of the model config
    • use-chat-template: now part of the model config
  • Metrics Slimdown: Metrics now only care about SamplingMethod (generative or loglikelihood). Say goodbye to use_case and all those old request types.
  • Request Layer Gone: Models get the raw Doc directly; no more unnecessary request wrappers bloating the code.
  • Unified ModelResponse: All models return a single ModelResponse type, whether generative or loglikelihood. This means simpler logging and metric computation.
  • Consistent Metric Signatures: Every metric now uses the same function signature: compute(doc: Doc, model_response: ModelResponse). See the sketch after this list.
  • Standardized Details: Each sample's details now always include three fields: doc, metric, and model_response.
  • Generative Metrics Unified: All generative metrics now work the same way. Users who want greedy generation set temperature to 0; an exception is raised if they try to run a sampling metric with temperature = 0.
  • Removed Loglikelihood Single Token: it was bloated and rarely used.
  • Tests: All tests pass, and no changes were needed to expected values.
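
To make the unified signature concrete, here is a minimal, self-contained sketch; the fields on Doc and ModelResponse are assumptions for illustration, not the actual lighteval definitions:

```python
# Toy sketch of the unified metric signature. Field names are assumed, not
# copied from lighteval.
from dataclasses import dataclass, field


@dataclass
class Doc:
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0


@dataclass
class ModelResponse:
    text: list[str] = field(default_factory=list)        # generative outputs
    logprobs: list[float] = field(default_factory=list)  # loglikelihood outputs


def compute(doc: Doc, model_response: ModelResponse) -> float:
    """Toy exact-match metric written against the unified signature."""
    gold = doc.choices[doc.gold_index]
    return float(model_response.text[0].strip() == gold.strip())


doc = Doc(query="2 + 2 = ?", choices=["3", "4"], gold_index=1)
print(compute(doc, ModelResponse(text=["4"])))  # 1.0
```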

Why?

  • Less code, fewer headaches.
  • Easier to add new benchmarks (including weird and wonderful ones).
  • More user-friendly inspection tools.
  • A single, unified way to handle prompts, responses, and metrics.

[Image: architecture of lighteval]

[Image: example details dataset]

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB changed the title from "refacto prompt building" to "Refacto and remove bloated code" on Jun 23, 2025.
@NathanHB requested a review from Copilot on Jun 23, 2025 at 15:51.
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR refactors the prompt-building and evaluation logic in lighteval by removing legacy request wrappers, unifying data structures (Doc and ModelResponse), and simplifying pipeline and registry handling.

  • Introduces a single Doc dataclass for all task inputs and a unified ModelResponse
  • Replaces multiple request types and response classes with SamplingMethod and ModelResponse
  • Updates Pipeline, Registry, and prompt management to work with the new structures (a rough sketch of the resulting model interface follows)
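
Reusing the toy Doc and ModelResponse dataclasses from the sketch above, the model side of the refactor could look roughly like this; FakeModel comes from tests/utils.py, but the greedy_until name and signature here are assumptions:

```python
# Hypothetical backend after the refactor: it consumes Docs directly and
# returns unified ModelResponses, with no request-wrapper layer in between.
class FakeModel:
    def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
        return [ModelResponse(text=["fake output"]) for _ in docs]


responses = FakeModel().greedy_until([Doc(query="hello")])
print(responses[0].text)  # ['fake output']
```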

Reviewed Changes

Copilot reviewed 84 out of 89 changed files in this pull request and generated 3 comments.

Files reviewed:
  • tests/utils.py: Update FakeModel to return ModelResponse and use Doc
  • src/lighteval/tasks/default_prompts.py: Changed default prompt construction, removed instructions
  • src/lighteval/tasks/requests.py: Replaced old request classes with a large Doc dataclass
  • src/lighteval/models/model_output.py: Consolidated response types into a single, expanded ModelResponse
Comments suppressed due to low confidence (1)

src/lighteval/tasks/default_prompts.py:64

  • The instructions variable was removed from the default prompt, so any task-specific instructions will no longer appear. Consider restoring instructions (e.g. f"{instructions}\n{question}\n{formatted_choices}") or explicitly handling when instructions is empty.
    prompt = f"\n{question}\n{formatted_choices}"
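
One possible shape for that fix, as a sketch only; the build_prompt wrapper is hypothetical, and the convention that instructions may be an empty string is assumed:

```python
def build_prompt(question: str, formatted_choices: str, instructions: str = "") -> str:
    # Prepend task-specific instructions only when present, so tasks without
    # instructions don't get a stray leading newline.
    if instructions:
        return f"{instructions}\n{question}\n{formatted_choices}"
    return f"{question}\n{formatted_choices}"
```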

Co-authored-by: Copilot <[email protected]>
@@ -102,6 +110,51 @@ def prepare(self, gold_ixs: list[int], choices_logprob: list[float], **kwargs) -
return LogprobCorpusMetricInput(golds=gold_ixs, preds=np.argmax(choices_logprob))


class TargetPerplexityPreparator:
A reviewer (Member) commented:

Why introduce a new class instead of adding an is_target parameter (False by default) to the next one, especially when so much of the code is the same?
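
A minimal sketch of the suggested merge; the class and method bodies are assumptions, intended only to show the flag-parameter pattern:

```python
class PerplexityPreparator:
    """Sketch: one preparator with an is_target flag replacing two classes."""

    def __init__(self, is_target: bool = False):
        self.is_target = is_target

    def prepare(self, logprobs: list[float], reference_texts: list[str], **kwargs):
        # Shared preparation logic lives here; branch on self.is_target only
        # at the points where TargetPerplexityPreparator actually differed.
        if self.is_target:
            ...  # target-conditioned handling
        ...  # common return path
```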

Comment on lines +651 to +655
if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
raise ValueError(
"You cannot generate multiple samples with temperature=0. Please set temperature > 0. Or use a non sampling metric."
)

A reviewer (Member) commented:

I wonder if we couldn't put this one in the abstract class.
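
For instance, the guard could live once in the shared base class; LightevalModel is the abstract model class in lighteval, but the helper name and placement here are assumptions:

```python
from abc import ABC


class LightevalModel(ABC):
    # Hypothetical shared guard: hoisted into the abstract base so every
    # backend validates sampling parameters identically.
    def _validate_sampling(self, num_samples: int, temperature: float) -> None:
        if num_samples > 1 and temperature == 0:
            raise ValueError(
                "You cannot generate multiple samples with temperature=0. "
                "Please set temperature > 0, or use a non-sampling metric."
            )
```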

@NathanHB merged commit 9288bd8 into main on Jun 25, 2025; 5 checks passed.
Comment on lines +888 to +889
pad_amount = global_max_choices - cont_batch.shape[0]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)

A reviewer commented:

Shouldn't it be

                        pad_amount = global_max_choices - cont_batch.shape[1]
                        padded = F.pad(cont_batch, (0, pad_amount), value=-1)

here?

Reply:

Hmm, then I get other shape errors in torch.stack; something looks wrong here.
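
For readers following the exchange, a standalone demonstration of the F.pad semantics in question (the tensor shape and global_max_choices value are invented). Which axis should be padded is exactly what is being debated above; this only shows which axis a two-element pad tuple touches:

```python
import torch
import torch.nn.functional as F

# With a two-element pad tuple, F.pad extends the LAST dimension, so for a
# 2D tensor the amount must be computed from shape[1]; shape[0] measures the
# other axis. Padding dim 0 instead would need a 4-tuple, e.g.
# (0, 0, 0, pad_amount).
cont_batch = torch.zeros(4, 7)  # invented shape
global_max_choices = 10

pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
print(padded.shape)  # torch.Size([4, 10])
```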
