combined columns to create a new field in prepare datasets #248

nitsanluke · 2025-05-01T14:57:27Z

✨ Description

This PR includes functionality to combine HF dataset columns into a new column and tokenize. It also provides means to add a loss-mask-span.

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

Combining columns when preparing datasets
Adding loss masking spans while combining columns

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

📜 I have read and followed the contributing guidelines.
[x ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
[x ] 🎉 The functionality is complete, and I have tested the changes.
📝 I have updated the documentation if needed.
[x ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
[x ] 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

🧪 I have added or updated tests to cover my changes.
[x ] ✔️ New and existing tests pass locally with my changes.
🚦 I have tested these changes on GPUs and verified training stability.
🏋️ I have tested the changes on realistic training workloads, if applicable.

nitsanluke · 2025-05-02T15:31:00Z

Config for concat with loss masking pans. (Drop set_masking_span for w/o loss masking)

loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./test_output/gsm8k_masked
dataset:
  path: openai/gsm8k 
  config_name: main
  split: train
  trust_remote_code: true
combine_fields:
  col_names:
  - question
  - answer
  delimiter: "<s>"
  set_masking_span: 
    masking_column: question

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/

jlamypoirier · 2025-05-02T15:42:34Z

Can you please add a short description of what this PR does?

tscholak

Hi @nitsanluke, I'm having a bit of trouble reconciling this change with the existing design.

Old behaviour: one text column selected via field; optional mask spans read from a second column (loss_masking_spans).
New need: two text columns, prompt and completion, concatenated as f"{prompt}{completion}" (maybe with a delimiter). The loss‑mask spans should then be derived from the completion slice inside that joined string.

Those two modes are mutually exclusive, so the config should make that clear. I'd expect something like:

class SourceSchemaConfig(Config):
    pass

class TextColumnConfig(SourceSchemaConfig):
    text_col: str = "text"
    mask_col: str | None = None

class PromptCompletionConfig(SourceSchemaConfig):
    prompt_col: str
    completion_col: str
    delimiter: str = ""  # likely not necessary

GPTHuggingfaceDatasetConfig would contain exactly one of these (maybe via a source_schema: SourceSchemaConfig and using dynamic config classes via #245). The preparator can branch cleanly on which variant is present.

In the patch, however, combine_fields bolted onto the existing dataset.field leaves us in a half‑way state: we can still set field/loss_masking_spans, yet also specify columns to combine. That feels ambiguous and invites invalid combos.

Am I reading this right, or is there a use‑case I'm missing?

nitsanluke · 2025-05-02T20:15:01Z

Hi @nitsanluke, I'm having a bit of trouble reconciling this change with the existing design.

Old behaviour: one text column selected via field; optional mask spans read from a second column (loss_masking_spans).

New need: two text columns, prompt and completion, concatenated as f"{prompt}{completion}" (maybe with a delimiter). The loss‑mask spans should then be derived from the completion slice inside that joined string.

Those two modes are mutually exclusive, so the config should make that clear. I'd expect something like:
class SourceSchemaConfig(Config):
    pass

class TextColumnConfig(SourceSchemaConfig):
    text_col: str = "text"
    mask_col: str | None = None

class PromptCompletionConfig(SourceSchemaConfig):
    prompt_col: str
    completion_col: str
    delimiter: str = ""  # likely not necessary
GPTHuggingfaceDatasetConfig would contain exactly one of these (maybe via a source_schema: SourceSchemaConfig and using dynamic config classes via #245). The preparator can branch cleanly on which variant is present.

In the patch, however, combine_fields bolted onto the existing dataset.field leaves us in a half‑way state: we can still set field/loss_masking_spans, yet also specify columns to combine. That feels ambiguous and invites invalid combos.

Am I reading this right, or is there a use‑case I'm missing?

The new combine feature uses the (old) two variables for rest of the execution. i.e if combine_fields is activated the final value for dataset.field is set to this new combined col. (Even if you had field in our config). I imagined when combine_fields is set concat process takes precedence. Similarly when the set_masking_span is set loss_masking_spans is updated to the spans column which will be created. The main reason for this re-set is to re-use the remaining parts of the code with the same variables since they are using these config variables all over the prepare source. I don't expect both old and new functions to be needed for a single dataset.

Your suggestion to bring this back to datasets looks good we can branch from there depending on the config.

nitsanluke · 2025-05-08T22:54:27Z

Closing PR follow feature at

Concat prompt and completion cols for tokenizing #257

nitsanluke force-pushed the prepare_combine_ds branch 2 times, most recently from d25aefa to 0cb9506 Compare May 1, 2025 16:02

nitsanluke changed the title ~~combined cols to create a new col~~ combined columns to create a new field May 1, 2025

nitsanluke changed the title ~~combined columns to create a new field~~ combined columns to create a new field in prepare datasets May 1, 2025

combined cols to create a new col

9c371ab

nitsanluke force-pushed the prepare_combine_ds branch from 0cb9506 to 9c371ab Compare May 1, 2025 16:07

nitsanluke added 4 commits May 1, 2025 17:19

rename fields to colnames name re-used

52b9179

set dataset.field to the new field

7d57592

add loss masking col when combining cols

80584e7

updates to config names and desc.

2f4df5d

nitsanluke force-pushed the prepare_combine_ds branch from 6af01e4 to 2f4df5d Compare May 2, 2025 15:16

nitsanluke marked this pull request as ready for review May 2, 2025 15:31

tscholak requested a review from sohamparikh May 2, 2025 17:53

tscholak requested changes May 2, 2025

View reviewed changes

nitsanluke mentioned this pull request May 8, 2025

Concat prompt and completion cols for tokenizing #257

Draft

14 tasks

nitsanluke closed this May 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

combined columns to create a new field in prepare datasets #248

combined columns to create a new field in prepare datasets #248

nitsanluke commented May 1, 2025 •

edited

Loading

nitsanluke commented May 2, 2025 •

edited

Loading

jlamypoirier commented May 2, 2025

tscholak left a comment •

edited

Loading

nitsanluke commented May 2, 2025 •

edited

Loading

nitsanluke commented May 8, 2025 •

edited

Loading

combined columns to create a new field in prepare datasets #248

combined columns to create a new field in prepare datasets #248

Conversation

nitsanluke commented May 1, 2025 • edited Loading

✨ Description

🔍 Type of change

📝 Changes

✅ Checklist

General

Testing

nitsanluke commented May 2, 2025 • edited Loading

jlamypoirier commented May 2, 2025

tscholak left a comment • edited Loading

Choose a reason for hiding this comment

nitsanluke commented May 2, 2025 • edited Loading

nitsanluke commented May 8, 2025 • edited Loading

nitsanluke commented May 1, 2025 •

edited

Loading

nitsanluke commented May 2, 2025 •

edited

Loading

tscholak left a comment •

edited

Loading

nitsanluke commented May 2, 2025 •

edited

Loading

nitsanluke commented May 8, 2025 •

edited

Loading