Language model sample #378
✨ Description
- Replace `GPTSample` and `GPTBatch` with `LanguageModelSample` and `LanguageModelBatch`, encapsulating much of the related functionality and simplifying much of the code that uses them (ex. `SampledIndexedDataset`, `GPTBaseModel.preprocess_batch`, `gpt_data_collate_fn`). A rough sketch of the encapsulation follows this list.
- Make `SampledIndexedDataset` agnostic of the sample type.
- `SampledIndexedDataset` was using an entirely different code path for preference spans that avoided multi-document samples (for historical reasons, I think?). I dropped it and made it use the common code path. This does mean a change in behavior (ex. multi-document samples), but that's an improvement. I have some doubts about the DPO loss function though; I suspect the log softmax needs to be calculated separately for each document (see the per-document sketch below).
- Move `cross_document_attention` to `AttentionConfig` (from `BatchConfig`) so the attention layer decides itself whether to use varlen or not.
- Replace `use_flash_attention` with a more generic `implementation` enum. (See discussion in Base model interface review #370.)
- Add a separate `LanguageModelEmbeddingsConfig.cross_document_position_embeddings`, since absolute position embeddings may also use the sequence lengths. (The config sketch below covers the last three points.)
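
For concreteness, here is a minimal sketch of the kind of encapsulation the first two points describe. The class names match the PR; the fields, the `from_samples` classmethod, and its behavior are illustrative assumptions, not the actual definitions in this repo.

```python
import dataclasses

import torch


@dataclasses.dataclass
class LanguageModelSample:
    # Token ids for one (possibly multi-document) sample.
    token_ids: torch.Tensor
    # Lengths of the documents packed into this sample, if more than one.
    sequence_lengths: torch.Tensor | None = None
    # Optional annotations, present only for datasets that need them.
    loss_masking_spans: torch.Tensor | None = None
    chosen_spans: torch.Tensor | None = None
    rejected_spans: torch.Tensor | None = None


@dataclasses.dataclass
class LanguageModelBatch:
    token_ids: torch.Tensor
    sequence_lengths: list[torch.Tensor] | None = None

    @classmethod
    def from_samples(cls, samples: list[LanguageModelSample]) -> "LanguageModelBatch":
        # Collation that previously lived in gpt_data_collate_fn moves behind
        # the batch type, so SampledIndexedDataset never inspects sample fields.
        return cls(
            token_ids=torch.stack([sample.token_ids for sample in samples]),
            sequence_lengths=(
                [sample.sequence_lengths for sample in samples]
                if samples[0].sequence_lengths is not None
                else None
            ),
        )
```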
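
On the DPO doubt, a minimal sketch of what "separately for each document" could mean in practice, assuming a packed sample with known sequence lengths. The function name and signature are hypothetical; the point is that both the log-softmax and the log-probability sum stop at document boundaries instead of crossing them.

```python
import torch


def per_document_target_logps(
    logits: torch.Tensor,  # (seq_len, vocab_size), one packed multi-document sample
    targets: torch.Tensor,  # (seq_len,) next-token targets
    sequence_lengths: list[int],  # document lengths, summing to seq_len
) -> list[torch.Tensor]:
    """Sum of target log-probabilities, accumulated independently per document."""
    logps = []
    for document_logits, document_targets in zip(
        logits.split(sequence_lengths), targets.split(sequence_lengths)
    ):
        # Log-softmax over the vocabulary for this document's tokens only, and
        # a sum over this document's positions only.
        document_logps = torch.log_softmax(document_logits.float(), dim=-1)
        logps.append(
            document_logps.gather(-1, document_targets.unsqueeze(-1)).sum()
        )
    return logps
```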
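
And a sketch of roughly how the reshaped config surface could look. The field names (`cross_document_attention`, `implementation`, `cross_document_position_embeddings`) come from the PR text; the enum members, defaults, and dataclass form are assumptions for illustration.

```python
import dataclasses
import enum


class AttentionImplementation(str, enum.Enum):
    # Replaces the old use_flash_attention boolean with a more generic choice.
    auto = "auto"
    flash = "flash"  # flash attention; varlen variant when documents are isolated
    backup = "backup"


@dataclasses.dataclass
class AttentionConfig:
    # Moved here from BatchConfig: the attention layer itself decides whether
    # to pass sequence lengths to a varlen kernel.
    cross_document_attention: bool = True
    implementation: AttentionImplementation = AttentionImplementation.auto


@dataclasses.dataclass
class LanguageModelEmbeddingsConfig:
    # Kept separate from the attention flag, since absolute position
    # embeddings may also want to restart positions at document boundaries.
    cross_document_position_embeddings: bool = True
```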