We present DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details.
DetailFlow encodes tokens with an inherent semantic ordering, where each subsequent token contributes additional high-resolution information. On the ImageNet 256×256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), both of which require 680 tokens in their AR models. Moreover, thanks to the significantly reduced token count and a parallel inference mechanism, our method achieves nearly 2× faster inference than VAR and FlexVAR.
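As a rough illustration of next-detail prediction at inference time, the sketch below decodes growing prefixes of the ordered 1D token sequence; the `decode` callable is a hypothetical stand-in, not the actual DetailFlow API.

```python
# Minimal sketch, assuming a decoder that maps a 1D token prefix to an image.
# `decode` is hypothetical; it stands in for the tokenizer's decoder.
import torch

def coarse_to_fine_preview(decode, tokens_1d: torch.Tensor, prefix_lens=(32, 64, 128)):
    """Decode growing prefixes of the ordered 1D token sequence.

    Early tokens carry global structure; each later token adds
    high-resolution detail, so previews sharpen as the prefix grows.
    """
    return [decode(tokens_1d[:, :n]) for n in prefix_lens]
```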
- 2025.06.24: The code and model weights of DetailFlow have been open-sourced.
- 2025.06.16: The code and model weights are finalized and currently undergoing legal review. We expect to release them soon.
- 2025.05.28: DetailFlow is released! See our paper.
| Tokenizer | Reso. | rFID | AR model | gFID |
|---|---|---|---|---|
| DetailFlow-16 | 256 | 1.13 | DetailFlow-16-GPT-L | 2.96 |
| DetailFlow-32 | 256 | 0.80 | DetailFlow-32-GPT-L | 2.75 |
| DetailFlow-64 | 256 | 0.52 | DetailFlow-64-GPT-L | 2.59 |
- Install `torch>=2.1.2`.
- Install the other pip packages via `bash init.sh`.
- Prepare the ImageNet dataset. Assume ImageNet is in `/path/to/imagenet`; the layout should look like this (a quick layout check is sketched after this list):

  ```
  /path/to/imagenet/:
      train/:
          n01440764:
              many_images.JPEG ...
          n01443537:
              many_images.JPEG ...
      val/:
          n01440764:
              ILSVRC2012_val_00000293.JPEG ...
          n01443537:
              ILSVRC2012_val_00000236.JPEG ...
  ```

- Modify the path-related environment variables in the `init.sh` file.
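If you want to verify the layout, a standalone check along these lines works (an illustrative helper, not part of this repo):

```python
# Count classes and JPEG images per split to sanity-check the layout above.
from pathlib import Path

root = Path("/path/to/imagenet")
for split in ("train", "val"):
    class_dirs = [d for d in (root / split).iterdir() if d.is_dir()]
    n_images = sum(len(list(d.glob("*.JPEG"))) for d in class_dirs)
    print(f"{split}: {len(class_dirs)} classes, {n_images} JPEG images")
```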
Training the tokenizer for 50 epochs can already yield good results and is worth a try.
```bash
# 128 tokens, group size 8
bash scripts/train_tokenizer.sh detailflow_128token --config configs/vq_8k_siglip_b_res_p02_pw15_enc.yaml --num-latent-tokens 128 --group-size 8 --epochs 250 --global-token-loss-weight 1

# 256 tokens, group size 8
bash scripts/train_tokenizer.sh detailflow_256token --config configs/vq_8k_siglip_b_res_p02_pw15_enc.yaml --num-latent-tokens 256 --group-size 8 --epochs 250 --global-token-loss-weight 1

# 512 tokens, group size 8
bash scripts/train_tokenizer.sh detailflow_512token --config configs/vq_8k_siglip_b_res_p02_pw15_enc.yaml --num-latent-tokens 512 --group-size 8 --epochs 250 --global-token-loss-weight 1
```
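One hedged reading of the naming scheme (our inference, not stated explicitly here): tokens are emitted in groups of `--group-size`, so `--num-latent-tokens / --group-size` gives the number of sequential AR steps, which matches the 16/32/64 suffixes of the released models.

```python
# Assumed relation between the flags above and the DetailFlow-N model names:
# N = num_latent_tokens / group_size, the number of sequential AR steps.
group_size = 8
for num_latent_tokens in (128, 256, 512):
    ar_steps = num_latent_tokens // group_size
    print(f"{num_latent_tokens} tokens / group size {group_size} -> DetailFlow-{ar_steps}")
```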
```bash
# DetailFlow-16 and DetailFlow-32 can better utilize GPU resources with a global batch size of 512.
bash scripts/train_c2i_token.sh /xxx/DetailFlow-16/checkpoints/128.pt demo_task_name --global-batch-size 512 --epochs 300
```
The trained model is stored in the directory `./logs/task/job_name`, which contains:

- `config.json` and `config.yaml`: configuration files for the model, required for loading it during subsequent inference (see the loading sketch below).
- `checkpoints`: the checkpoint files; only the 10 most recent checkpoints are retained.
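A minimal loading sketch under the layout above; the paths are placeholders and `build_model` is a hypothetical stand-in for the repo's actual model constructor:

```python
# Load the saved config and checkpoint for inference (sketch only).
import torch
import yaml

log_dir = "./logs/task/job_name"          # placeholder path
with open(f"{log_dir}/config.yaml") as f:
    cfg = yaml.safe_load(f)               # hyperparameters needed to rebuild the model

state = torch.load(f"{log_dir}/checkpoints/128.pt", map_location="cpu")
# model = build_model(cfg); model.load_state_dict(state)  # build_model is hypothetical
```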
- If you encounter the error `torch._dynamo.exc.CacheLimitExceeded: cache_size_limit reached`, try setting `args.compile = False` (see the sketch below), though this will slow down inference.
- Due to randomness, evaluation metrics may vary by up to 0.3; we recommend running with multiple seeds and reporting the average.
```bash
# Tokenizer
bash scripts/eval_tokenizer.sh /path_to_ckpt/xxx.pt

# AR model, 128 tokens
bash scripts/eval_c2i.sh /xxx/DetailFlow-16/checkpoints/128.pt /xxx/DetailFlow-16-GPT-L/checkpoints/128.pt 1.45

# AR model, 256 tokens
bash scripts/eval_c2i.sh /xxx/DetailFlow-32/checkpoints/256.pt /xxx/DetailFlow-32-GPT-L/checkpoints/256.pt 1.5

# AR model, 512 tokens
bash scripts/eval_c2i.sh /xxx/DetailFlow-64/checkpoints/512.pt /xxx/DetailFlow-64-GPT-L/checkpoints/512.pt 1.32
```
We thank the authors of SoftVQ-VAE, LlamaGen, and PAR for their great work.
If our work assists your research, feel free to give us a star ⭐ or cite us using:

```bibtex
@article{liu2025detailflow,
  title={DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction},
  author={Liu, Yiheng and Qu, Liao and Zhang, Huichao and Wang, Xu and Jiang, Yi and Gao, Yiming and Ye, Hu and Li, Xian and Wang, Shuai and Du, Daniel K and others},
  journal={arXiv preprint arXiv:2505.21473},
  year={2025}
}
```
We are hiring interns and full-time researchers in the ByteFlow Group at ByteDance, with a focus on multimodal understanding and generation (preferred locations: Beijing, Shenzhen, and Hangzhou). If you are interested, please contact [email protected].