@jlamypoirier
Collaborator

@jlamypoirier jlamypoirier commented Oct 30, 2025

✨ Description

  • Add a patch sample type to handle image patches and similar objects.
  • Add optional image patches to language model samples.
  • Update the GPT preparator to support images and perform the necessary offline preprocessing.
  • Image normalization can't be done offline because of file size issues, so it is now done in the language model reader. This works, but it is not ideal and lacks configurability.
  • Add tests for images.
  • TODO: Add image separators for patches (lengths?) so we can calculate cu_seqlens.
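
For the TODO on patch separators, here is a minimal sketch of how per-image patch lengths could be turned into `cu_seqlens`, assuming the usual convention of variable-length attention kernels (cumulative boundaries starting at 0, one more entry than there are sequences). The function name `cu_seqlens_from_lengths` and the example lengths are hypothetical and not taken from this PR.

```python
from itertools import accumulate


def cu_seqlens_from_lengths(patch_lengths: list[int]) -> list[int]:
    """Return cumulative boundaries [0, l0, l0 + l1, ...] for per-image patch counts.

    Hypothetical helper: the PR does not define how lengths are stored.
    """
    if any(length < 0 for length in patch_lengths):
        raise ValueError("patch lengths must be non-negative")
    # initial=0 prepends the leading boundary, so image i spans
    # cu_seqlens[i]:cu_seqlens[i + 1] in the flattened patch sequence.
    return list(accumulate(patch_lengths, initial=0))


# Hypothetical sample: three images contributing 4, 9 and 6 patches.
patch_lengths = [4, 9, 6]
cu_seqlens = cu_seqlens_from_lengths(patch_lengths)
print(cu_seqlens)  # [0, 4, 13, 19]
```

Storing lengths rather than separator tokens would keep the patch tensor contiguous, and the boundaries can be recomputed cheaply at read time. Whether that fits the sample format here is an open question.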

@jlamypoirier jlamypoirier marked this pull request as ready for review November 6, 2025 21:30
@jlamypoirier jlamypoirier changed the title from "Vsion dataset" to "Vsion dataset and preprocessing" Nov 7, 2025


2 participants