Data Preparation

Long-VITA Training Data

The data configurations are defined by the YAML files in the configs folder.

An example YAML file of the training data:

dataset:
  ...
  LLaVA-ReCap:
    ratio: 1
    data_paths:
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-558K/data.jsonl
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-118K/data.jsonl
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-CC3M/data.jsonl
  ...
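To make the layout of this config concrete, here is a minimal sketch that walks the parsed structure and lists every JSONL path with its dataset name and ratio. It assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`). The helper names `list_data_paths` and `missing_paths` are hypothetical and not part of the Long-VITA code. The sketch does not interpret `ratio`; it only passes the value through.

```python
from pathlib import Path

# Hypothetical in-memory form of the YAML example above, i.e. roughly what
# a YAML parser would return for the LLaVA-ReCap entry.
config = {
    "dataset": {
        "LLaVA-ReCap": {
            "ratio": 1,
            "data_paths": [
                "datasets/jsonl/lmms-lab/LLaVA-ReCap-558K/data.jsonl",
                "datasets/jsonl/lmms-lab/LLaVA-ReCap-118K/data.jsonl",
                "datasets/jsonl/lmms-lab/LLaVA-ReCap-CC3M/data.jsonl",
            ],
        },
    },
}


def list_data_paths(config):
    """Return (dataset_name, ratio, path) for every path under `dataset`."""
    entries = []
    for name, spec in config["dataset"].items():
        for path in spec.get("data_paths", []):
            # `ratio` is passed through unchanged; its meaning is not assumed here.
            entries.append((name, spec.get("ratio"), path))
    return entries


def missing_paths(config, root="."):
    """Return the listed JSONL paths that do not exist as files under `root`."""
    return [
        path
        for _, _, path in list_data_paths(config)
        if not (Path(root) / path).is_file()
    ]
```

Running `missing_paths(config)` from the repository root before training is a quick way to spot JSONL files that have not been downloaded yet.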

Our processed JSONL files can be downloaded from Long-VITA-Training-Data.

The images and videos can be downloaded by following the instructions on their original websites.

We list the data used in Long-VITA:

Custom Data

  • An example JSONL file of the training data:
[
    ...
    {
        "messages": [
            {
                "role": "user",
                "content": "...<image><image>..."
            },
            {
                "role": "assistant",
                "content": "..."
            }
        ],
        "images": ["path/to/first/image", "path/to/second/image", ...]
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "...<video><video>..."
            },
            {
                "role": "assistant",
                "content": "..."
            }
        ],
        "videos": ["path/to/first/video", "path/to/second/video", ...]
    },
    ...
]
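When preparing custom data, it can help to sanity-check each record before training. The sketch below checks one record against the structure shown in the example: a `messages` list with `user` and `assistant` roles, plus optional `images` and `videos` lists. It assumes that the number of `<image>` and `<video>` placeholders in the message text should equal the number of paths in the matching list. That rule is inferred from the example, not stated by the Long-VITA documentation, and `check_record` is a hypothetical helper.

```python
def check_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    messages = record.get("messages", [])
    if not messages:
        problems.append("no messages")

    # Concatenate all message text so placeholders can be counted in one pass.
    text = ""
    for msg in messages:
        role = msg.get("role")
        if role not in ("user", "assistant"):
            problems.append(f"unexpected role: {role!r}")
        content = msg.get("content", "")
        if isinstance(content, str):
            text += content

    # Assumption: one placeholder per media path, as the example suggests.
    for tag, key in (("<image>", "images"), ("<video>", "videos")):
        n_tags = text.count(tag)
        n_paths = len(record.get(key, []))
        if n_tags != n_paths:
            problems.append(
                f"{n_tags} {tag} placeholder(s) but {n_paths} path(s) in {key!r}"
            )
    return problems


# Illustrative records with made-up paths.
good = {
    "messages": [
        {"role": "user", "content": "Compare <image> and <image>."},
        {"role": "assistant", "content": "They differ."},
    ],
    "images": ["a.jpg", "b.jpg"],
}
bad = dict(good, images=["a.jpg"])

print(check_record(good))
print(check_record(bad))
```

The first call reports no problems. The second reports a mismatch, since the text holds two `<image>` placeholders while only one path is listed.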