
Conversation

jlamypoirier (Collaborator) commented on Oct 16, 2025

✨ Description

  • Replace GPTSample and GPTBatch with LanguageModelSample and LanguageModelBatch, encapsulating much of the related functionality and simplifying much of the code that uses them (ex. SampledIndexedDataset, GPTBaseModel.preprocess_batch, gpt_data_collate_fn). See the first sketch after this list.
  • Make SampledIndexedDataset agnostic of the sample type.
  • Change spans to use a standard half-open range format, i.e. (first, last + 1) instead of (first, last). (Spans are still stored in (first, last) format in the binary file; this will change with the new binary format.) See the second sketch after this list.
  • Redo much of the code for preference spans. SampledIndexedDataset was using an entirely separate code path for preference spans that avoided multi-document samples (for historical reasons, I think); I dropped it and routed preference spans through the common code path. This does change behavior (ex. multi-document samples are now possible), but that's an improvement. I still have doubts about the DPO loss function, though: I suspect the log softmax needs to be computed separately for each document.
  • Datasets now always provide sequence lengths, and cross_document_attention moves from BatchConfig to AttentionConfig, so the attention layer decides itself whether to use varlen or not (see the third sketch after this list). Replace use_flash_attention with a more generic implementation enum. (See discussion in Base model interface review #370.) Add a separate LanguageModelEmbeddingsConfig.cross_document_position_embeddings, since absolute position embeddings may also use the sequence lengths.
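
For illustration, here is a minimal sketch of the kind of encapsulation the first bullet describes; the class and field names below are placeholders, not the actual LanguageModelSample/LanguageModelBatch interface:

```python
import dataclasses

import torch


@dataclasses.dataclass
class Sample:
    """One training sample, possibly spanning several packed documents."""

    tokens: torch.Tensor  # (sequence_length,)
    loss_masking_spans: list[tuple[int, int]]  # half-open (begin, end) ranges
    sequence_lengths: list[int]  # per-document lengths within the sample


def collate(samples: list[Sample]) -> dict:
    """Stack samples into a batch; spans and lengths stay per-sample."""
    return {
        "tokens": torch.stack([sample.tokens for sample in samples]),
        "loss_masking_spans": [sample.loss_masking_spans for sample in samples],
        "sequence_lengths": [sample.sequence_lengths for sample in samples],
    }
```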
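
A minimal sketch of the span-format conversion from the third bullet, assuming inclusive (first, last) pairs are read from the legacy binary file; the function name is illustrative:

```python
import numpy as np


def legacy_spans_to_ranges(spans: np.ndarray) -> np.ndarray:
    """Convert inclusive (first, last) spans to half-open (begin, end) ranges."""
    ranges = spans.copy()
    ranges[:, 1] += 1  # (first, last) -> (first, last + 1)
    return ranges


# A span covering tokens 3..5 inclusive becomes the range [3, 6).
assert legacy_spans_to_ranges(np.array([[3, 5]])).tolist() == [[3, 6]]
```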
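
And a sketch of how the per-sample sequence lengths from the last bullet could drive varlen attention, using the cu_seqlens convention of FlashAttention's varlen kernels; the helper below is an assumption for illustration, not the actual AttentionConfig logic:

```python
import torch


def cu_seqlens_from_lengths(sequence_lengths: list[list[int]]) -> torch.Tensor:
    """Cumulative document boundaries for all documents in the batch, flattened."""
    flat = [length for sample in sequence_lengths for length in sample]
    lengths = torch.tensor(flat, dtype=torch.int32)
    return torch.nn.functional.pad(lengths.cumsum(0, dtype=torch.int32), (1, 0))


# With cross_document_attention disabled, the attention layer would hand these
# boundaries to a varlen kernel so attention never crosses document edges:
print(cu_seqlens_from_lengths([[4, 2], [6]]))  # tensor([0, 4, 6, 12], dtype=torch.int32)
```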

jlamypoirier marked this pull request as ready for review on October 17, 2025 03:20