Refacto and remove bloated code #709
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview
This PR refactors the prompt-building and evaluation logic in lighteval by removing legacy request wrappers, unifying data structures (Doc and ModelResponse), and simplifying pipeline and registry handling.
- Introduces a single `Doc` dataclass for all task inputs and a unified `ModelResponse`
- Replaces multiple request types and response classes with `SamplingMethod` and `ModelResponse`
- Updates `Pipeline`, `Registry`, and prompt management to work with the new structures
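To make the shape of the refactor concrete, here is a minimal sketch of what the unified structures could look like. `Doc`, `ModelResponse`, and `SamplingMethod` are the names from this PR, but the specific fields below are illustrative assumptions, not lighteval's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SamplingMethod(Enum):
    """Hypothetical sampling modes a task can declare."""
    GENERATIVE = auto()
    LOGLIKELIHOOD = auto()


@dataclass
class Doc:
    """Single input structure for all tasks (fields are illustrative)."""
    query: str
    choices: list[str] = field(default_factory=list)
    gold_index: int = 0
    instruction: str = ""


@dataclass
class ModelResponse:
    """Single output structure for all models (fields are illustrative)."""
    text: list[str] = field(default_factory=list)        # generative outputs
    logprobs: list[float] = field(default_factory=list)  # loglikelihood outputs


doc = Doc(query="What is 2 + 2?", choices=["3", "4"], gold_index=1)
resp = ModelResponse(logprobs=[-2.3, -0.1])
```

The point of the design is that every task produces a `Doc` and every backend produces a `ModelResponse`, so the pipeline no longer branches on request type.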
Reviewed Changes
Copilot reviewed 84 out of 89 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/utils.py | Update FakeModel to return ModelResponse and use Doc |
| src/lighteval/tasks/default_prompts.py | Changed default prompt construction, removed instructions |
| src/lighteval/tasks/requests.py | Replaced old request classes with a large Doc dataclass |
| src/lighteval/models/model_output.py | Consolidated response types into a single, expanded ModelResponse |
Comments suppressed due to low confidence (1)
src/lighteval/tasks/default_prompts.py:64
The `instructions` variable was removed from the default prompt, so any task-specific instructions will no longer appear. Consider restoring `instructions` (e.g. `f"{instructions}\n{question}\n{formatted_choices}"`) or explicitly handling when `instructions` is empty.

```python
prompt = f"\n{question}\n{formatted_choices}"
```
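One hedged way to address this review comment, assuming a hypothetical `build_prompt` helper (this function is not in the PR, it just illustrates the suggested fix): only prepend `instructions` when it is non-empty, which also avoids the stray leading newline in the current f-string.

```python
def build_prompt(question: str, formatted_choices: str, instructions: str = "") -> str:
    """Join the non-empty prompt parts with newlines, so an empty
    `instructions` neither appears nor leaves a leading blank line."""
    parts = [instructions, question, formatted_choices]
    return "\n".join(part for part in parts if part)
```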
Co-authored-by: Copilot <[email protected]>
```diff
@@ -102,6 +110,51 @@ def prepare(self, gold_ixs: list[int], choices_logprob: list[float], **kwargs) -
         return LogprobCorpusMetricInput(golds=gold_ixs, preds=np.argmax(choices_logprob))


+class TargetPerplexityPreparator:
```
Why introduce a new class instead of adding an `is_target` parameter (`False` by default) to the next one? (esp. when so much of the code is the same)
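A sketch of the reviewer's suggestion: fold the two preparators into one class behind an `is_target` flag. The class name matches the diff, but the fields and method here are assumptions for illustration only:

```python
class PerplexityPreparator:
    """Illustrative merge of the two preparators via an `is_target` flag;
    the shared counting logic lives in one place (fields are assumptions)."""

    def __init__(self, units_type: str, is_target: bool = False):
        if units_type not in ("words", "bytes"):
            raise ValueError(f"Unknown units_type: {units_type!r}")
        self.units_type = units_type
        # The only behavioral difference would be *which* text gets measured
        # (full context vs. target only); the counting itself is identical.
        self.is_target = is_target

    def count_units(self, text: str) -> int:
        if self.units_type == "words":
            return len(text.split())
        return len(text.encode("utf-8"))
```

Keeping one class avoids duplicating the counting logic the reviewer points out is the same in both.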
```python
if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
    raise ValueError(
        "You cannot generate multiple samples with temperature=0. Please set temperature > 0. Or use a non sampling metric."
    )
```
I wonder if we could not put this one in the abstract class
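Hoisting the check into the abstract class could look like the sketch below. The base-class and method names are assumptions (lighteval's actual model base class differs); only the validation body comes from the diff above:

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Sketch: every backend inherits the sampling validation instead of
    each one duplicating it (class and method names are hypothetical)."""

    def __init__(self, generation_config_dict: dict):
        self.generation_config_dict = generation_config_dict

    def validate_sampling(self, num_samples: int) -> None:
        # Shared check, identical to the one in the diff above.
        if num_samples > 1 and self.generation_config_dict["temperature"] == 0:
            raise ValueError(
                "You cannot generate multiple samples with temperature=0. "
                "Please set temperature > 0. Or use a non sampling metric."
            )

    @abstractmethod
    def greedy_until(self, docs):
        ...
```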
…ace/lighteval into nathan-refactor-prompt-building
```python
pad_amount = global_max_choices - cont_batch.shape[0]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```
Shouldn't it be

```python
pad_amount = global_max_choices - cont_batch.shape[1]
padded = F.pad(cont_batch, (0, pad_amount), value=-1)
```

here?
Hmm, then I get other shape errors in `torch.stack`; something still looks wrong here.
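The disagreement above hinges on the fact that `F.pad(x, (0, n))` pads the *last* dimension of `x`. A torch-free sketch of that behavior shows why, for a 2-D `cont_batch`, the pad amount has to come from `shape[1]` for the rows to reach equal length before stacking (the helper below is mine, not lighteval code):

```python
def pad_last_dim(batch: list[list[int]], target_len: int, value: int = -1) -> list[list[int]]:
    """Right-pad each row (the last dimension) to target_len, mimicking
    F.pad(cont_batch, (0, pad_amount), value=-1) on a 2-D tensor."""
    return [row + [value] * (target_len - len(row)) for row in batch]


# cont_batch has shape (num_choices, seq_len) = (2, 3)
cont_batch = [[1, 2, 3], [4, 5, 6]]
global_max_choices = 5

# F.pad with a (0, pad_amount) tuple pads the LAST dimension, so the
# amount must be computed from shape[1] (seq_len), not shape[0].
pad_amount = global_max_choices - len(cont_batch[0])  # shape[1] -> 5 - 3 = 2
padded = pad_last_dim(cont_batch, global_max_choices)
```

With `shape[0]` the rows would be padded by the wrong amount whenever `num_choices != seq_len`, which is exactly the kind of mismatch that later surfaces as a `torch.stack` shape error.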
What does this PR do?
This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.
The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥
Highlights
- Tasks now declare a `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- Models consume `Doc` directly, no more unnecessary `request` wrappers that were bloating the code.
- All models return a unified `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- Metrics are computed via `compute(doc: Doc, model_response: ModelResponse)`.
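The `compute(doc, model_response)` signature can be sketched as follows. The simplified `Doc`/`ModelResponse` fields and the metric name are illustrative assumptions, not lighteval's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    choices: list[str]
    gold_index: int


@dataclass
class ModelResponse:
    logprobs: list[float]


def loglikelihood_acc(doc: Doc, model_response: ModelResponse) -> int:
    """Illustrative metric using the new signature: 1 when the
    highest-logprob choice is the gold one, else 0."""
    pred = max(range(len(model_response.logprobs)),
               key=model_response.logprobs.__getitem__)
    return int(pred == doc.gold_index)
```

Since every metric receives the same two objects, the pipeline can call all metrics uniformly instead of dispatching on request type.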
Why?
architecture of lighteval
Example details dataset