Skip to content

combined columns to create a new field in prepare datasets #248

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

nitsanluke
Copy link
Contributor

@nitsanluke nitsanluke commented May 1, 2025

✨ Description

This PR includes functionality to combine HF dataset columns into a new column and tokenize. It also provides means to add a loss-mask-span.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Combining columns when preparing datasets
  2. Adding loss masking spans while combining columns

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • [x ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • [x ] 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • [x ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • [x ] 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • [x ] ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

@nitsanluke nitsanluke force-pushed the prepare_combine_ds branch 2 times, most recently from d25aefa to 0cb9506 Compare May 1, 2025 16:02
@nitsanluke nitsanluke changed the title combined cols to create a new col combined columns to create a new field May 1, 2025
@nitsanluke nitsanluke changed the title combined columns to create a new field combined columns to create a new field in prepare datasets May 1, 2025
@nitsanluke nitsanluke force-pushed the prepare_combine_ds branch from 0cb9506 to 9c371ab Compare May 1, 2025 16:07
@nitsanluke nitsanluke force-pushed the prepare_combine_ds branch from 6af01e4 to 2f4df5d Compare May 2, 2025 15:16
@nitsanluke
Copy link
Contributor Author

nitsanluke commented May 2, 2025

Config for concat with loss masking pans. (Drop set_masking_span for w/o loss masking)

loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./test_output/gsm8k_masked
dataset:
  path: openai/gsm8k 
  config_name: main
  split: train
  trust_remote_code: true
combine_fields:
  col_names:
  - question
  - answer
  delimiter: "<s>"
  set_masking_span: 
    masking_column: question

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/

@nitsanluke nitsanluke marked this pull request as ready for review May 2, 2025 15:31
@jlamypoirier
Copy link
Collaborator

Can you please add a short description of what this PR does?

@tscholak tscholak requested a review from sohamparikh May 2, 2025 17:53
Copy link
Collaborator

@tscholak tscholak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nitsanluke, I'm having a bit of trouble reconciling this change with the existing design.

  • Old behaviour: one text column selected via field; optional mask spans read from a second column (loss_masking_spans).

  • New need: two text columns, prompt and completion, concatenated as f"{prompt}{completion}" (maybe with a delimiter). The loss‑mask spans should then be derived from the completion slice inside that joined string.

Those two modes are mutually exclusive, so the config should make that clear. I'd expect something like:

class SourceSchemaConfig(Config):
    pass

class TextColumnConfig(SourceSchemaConfig):
    text_col: str = "text"
    mask_col: str | None = None

class PromptCompletionConfig(SourceSchemaConfig):
    prompt_col: str
    completion_col: str
    delimiter: str = ""  # likely not necessary

GPTHuggingfaceDatasetConfig would contain exactly one of these (maybe via a source_schema: SourceSchemaConfig and using dynamic config classes via #245). The preparator can branch cleanly on which variant is present.

In the patch, however, combine_fields bolted onto the existing dataset.field leaves us in a half‑way state: we can still set field/loss_masking_spans, yet also specify columns to combine. That feels ambiguous and invites invalid combos.

Am I reading this right, or is there a use‑case I'm missing?

@nitsanluke
Copy link
Contributor Author

nitsanluke commented May 2, 2025

Hi @nitsanluke, I'm having a bit of trouble reconciling this change with the existing design.

  • Old behaviour: one text column selected via field; optional mask spans read from a second column (loss_masking_spans).
  • New need: two text columns, prompt and completion, concatenated as f"{prompt}{completion}" (maybe with a delimiter). The loss‑mask spans should then be derived from the completion slice inside that joined string.

Those two modes are mutually exclusive, so the config should make that clear. I'd expect something like:

class SourceSchemaConfig(Config):
    pass

class TextColumnConfig(SourceSchemaConfig):
    text_col: str = "text"
    mask_col: str | None = None

class PromptCompletionConfig(SourceSchemaConfig):
    prompt_col: str
    completion_col: str
    delimiter: str = ""  # likely not necessary

GPTHuggingfaceDatasetConfig would contain exactly one of these (maybe via a source_schema: SourceSchemaConfig and using dynamic config classes via #245). The preparator can branch cleanly on which variant is present.

In the patch, however, combine_fields bolted onto the existing dataset.field leaves us in a half‑way state: we can still set field/loss_masking_spans, yet also specify columns to combine. That feels ambiguous and invites invalid combos.

Am I reading this right, or is there a use‑case I'm missing?

The new combine feature uses the (old) two variables for rest of the execution. i.e if combine_fields is activated the final value for dataset.field is set to this new combined col. (Even if you had field in our config). I imagined when combine_fields is set concat process takes precedence. Similarly when the set_masking_span is set loss_masking_spans is updated to the spans column which will be created. The main reason for this re-set is to re-use the remaining parts of the code with the same variables since they are using these config variables all over the prepare source. I don't expect both old and new functions to be needed for a single dataset.

Your suggestion to bring this back to datasets looks good we can branch from there depending on the config.

@nitsanluke
Copy link
Contributor Author

nitsanluke commented May 8, 2025

Closing PR follow feature at

@nitsanluke nitsanluke closed this May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants