The data configurations are defined by the YAML files in the `configs` folder.
An example YAML file for the training data:
```yaml
dataset:
  ...
  LLaVA-ReCap:
    ratio: 1
    data_paths:
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-558K/data.jsonl
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-118K/data.jsonl
      - datasets/jsonl/lmms-lab/LLaVA-ReCap-CC3M/data.jsonl
  ...
```
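As a reference, below is a minimal sketch of how such a config could be consumed: it parses the YAML and repeats each dataset's JSONL paths according to its `ratio` to build a weighted mixture. Only the `dataset`, `ratio`, and `data_paths` keys come from the example above; the function name and loading logic are illustrative assumptions, not the Long-VITA implementation.

```python
# Minimal sketch (assumed, not the Long-VITA loader): expand a data config
# into a flat list of JSONL paths, weighting each dataset by its ratio.
import yaml


def expand_data_paths(config_path: str) -> list[str]:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    jsonl_paths = []
    for name, spec in config["dataset"].items():
        # Repeat every listed JSONL file `ratio` times to weight the mixture.
        jsonl_paths.extend(spec["data_paths"] * int(spec.get("ratio", 1)))
    return jsonl_paths


# Example usage (hypothetical path):
# print(expand_data_paths("configs/train_data.yaml"))
```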
Our processed JSONL files can be downloaded from Long-VITA-Training-Data.
The images and videos can be downloaded by following the instructions on their original websites.
We list the datasets used in Long-VITA:
- LLaVA
- LLaVA-ReCap
- ALLaVA
- LVIS
- ShareGPT4V
- the cauldron
- Docmatix
- LLaVA-OneVision-Mid-Data
- LLaVA-OneVision-Data
- M4-Instruct-Data
- OpenHermes
- lima
- databricks-dolly-15k
- MetaMathQA
- MathInstruct
- orca-math-word-problems-200k
- atlas-math-sets
- goat
- camel-ai
- Long-Instruction-with-Paraphrasing
- Long
  - https://huggingface.co/datasets/akoksal/LongForm
  - https://huggingface.co/datasets/THUDM/LongAlign-10k
  - https://huggingface.co/datasets/THUDM/LongCite-45k
  - https://huggingface.co/datasets/THUDM/LongWriter-6k
  - https://huggingface.co/datasets/YeungNLP/LongQLoRA-Dataset
  - https://huggingface.co/datasets/Yukang/LongAlpaca-12k
  - https://huggingface.co/datasets/togethercomputer/Long-Data-Collections
- VideoGPT-plus_Training_Dataset
- ShareGemini
- Movie
- Comic-9K
- lmms-lab/LLaVA-Video-178K
An example JSONL file of the training data:
```json
[
    ...
    {
        "messages": [
            {
                "role": "user",
                "content": "...<image><image>..."
            },
            {
                "role": "assistant",
                "content": "..."
            }
        ],
        "images": ["path/to/first/image", "path/to/second/image", ...]
    },
    {
        "messages": [
            {
                "role": "user",
                "content": "...<video><video>..."
            },
            {
                "role": "assistant",
                "content": "..."
            }
        ],
        "videos": ["path/to/first/video", "path/to/second/video", ...]
    },
    ...
]
```
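For reference, here is a minimal sketch of reading and sanity-checking such a file, assuming one JSON object per line with the `messages` plus `images`/`videos` fields shown above. The helper name and the placeholder-count check are illustrative assumptions, not part of the Long-VITA code.

```python
# Minimal sketch (assumed): verify that the number of <image>/<video>
# placeholders in each sample's user turns matches its media list length.
import json


def check_jsonl(path: str) -> None:
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            sample = json.loads(line)
            user_text = "".join(
                m["content"] for m in sample["messages"] if m["role"] == "user"
            )
            for tag, key in (("<image>", "images"), ("<video>", "videos")):
                media = sample.get(key, [])
                if user_text.count(tag) != len(media):
                    print(f"line {line_no}: {key} count mismatch")


# Example usage (hypothetical path):
# check_jsonl("datasets/jsonl/lmms-lab/LLaVA-ReCap-558K/data.jsonl")
```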