combined columns to create a new field in prepare datasets #248
Config for concat with loss masking spans. (Drop
Can you please add a short description of what this PR does?
Hi @nitsanluke, I'm having a bit of trouble reconciling this change with the existing design.
- Old behaviour: one text column selected via `field`; optional mask spans read from a second column (`loss_masking_spans`).
- New need: two text columns, `prompt` and `completion`, concatenated as `f"{prompt}{completion}"` (maybe with a delimiter). The loss-mask spans should then be derived from the `completion` slice inside that joined string.
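To make the "new need" concrete, here is a minimal sketch of joining the two columns and locating the `completion` slice inside the joined string. The function name `join_prompt_completion` is hypothetical, and the half-open character span `(start, end)` is an assumption; whether the project expects inclusive ends, or uses the span for the region to mask rather than the region to train on, would need to be checked against the existing `loss_masking_spans` handling.

```python
def join_prompt_completion(
    prompt: str, completion: str, delimiter: str = ""
) -> tuple[str, tuple[int, int]]:
    """Join prompt and completion, returning the text and the completion's span.

    The span is a half-open character range (start, end) into the joined text.
    This convention is an assumption made for illustration only.
    """
    text = f"{prompt}{delimiter}{completion}"
    # The completion starts right after the prompt and the delimiter.
    start = len(prompt) + len(delimiter)
    end = start + len(completion)
    return text, (start, end)


text, (start, end) = join_prompt_completion("Q: 2+2?\n", "A: 4")
# Slicing the joined text with the span recovers the completion exactly.
assert text[start:end] == "A: 4"
```

Because the span is computed from lengths rather than by searching the joined string, it stays correct even when the completion text also occurs inside the prompt.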
Those two modes are mutually exclusive, so the config should make that clear. I'd expect something like:
class SourceSchemaConfig(Config):
    pass

class TextColumnConfig(SourceSchemaConfig):
    text_col: str = "text"
    mask_col: str | None = None

class PromptCompletionConfig(SourceSchemaConfig):
    prompt_col: str
    completion_col: str
    delimiter: str = ""  # likely not necessary
`GPTHuggingfaceDatasetConfig` would contain exactly one of these (maybe via a `source_schema: SourceSchemaConfig` and using dynamic config classes via #245). The preparator can branch cleanly on which variant is present.
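As a rough illustration of what "branching on the variant" could look like, the sketch below uses plain dataclasses as a stand-in for the project's `Config` base class. The class names, the `extract_text_and_spans` helper, and the half-open span convention are all hypothetical; the real preparator and the dynamic config mechanism from #245 may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TextColumnSchema:
    # Stand-in for the single-column mode (hypothetical name).
    text_col: str = "text"
    mask_col: str | None = None


@dataclass(frozen=True)
class PromptCompletionSchema:
    # Stand-in for the two-column mode (hypothetical name).
    prompt_col: str
    completion_col: str
    delimiter: str = ""


def extract_text_and_spans(
    row: dict, schema: TextColumnSchema | PromptCompletionSchema
) -> tuple[str, list[tuple[int, int]]]:
    """Return the text to tokenize and its spans for one dataset row."""
    if isinstance(schema, TextColumnSchema):
        # Spans, if any, are read as-is from the second column.
        if schema.mask_col is None:
            return row[schema.text_col], []
        return row[schema.text_col], [tuple(s) for s in row[schema.mask_col]]
    if isinstance(schema, PromptCompletionSchema):
        prompt, completion = row[schema.prompt_col], row[schema.completion_col]
        text = f"{prompt}{schema.delimiter}{completion}"
        # The span is derived from the completion slice, never read from a column.
        start = len(prompt) + len(schema.delimiter)
        return text, [(start, start + len(completion))]
    raise TypeError(f"Unsupported source schema: {type(schema).__name__}")


schema = PromptCompletionSchema(prompt_col="prompt", completion_col="completion")
text, spans = extract_text_and_spans({"prompt": "Hi ", "completion": "there"}, schema)
assert text == "Hi there" and spans == [(3, 8)]
```

With exactly one schema object per dataset config, a state where both a mask column and a pair of columns to combine are set cannot be expressed at all, which seems to be the point of the suggestion.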
In the patch, however, `combine_fields` bolted onto the existing `dataset.field` leaves us in a half-way state: we can still set `field`/`loss_masking_spans`, yet also specify columns to combine. That feels ambiguous and invites invalid combos.
Am I reading this right, or is there a use‑case I'm missing?
The new combine feature reuses the two (old) variables for the rest of the execution. Your suggestion to bring this back to datasets looks good; we can branch from there depending on the config.
Closing this PR; follow the feature at
✨ Description
This PR adds functionality to combine Hugging Face dataset columns into a new column and tokenize it. It also provides a way to add loss-masking spans.