MammothModa

Unified Multimodal Understanding, Generation, and Editing

🌐 Project Page   |   📑 Mamoda2 Tech Report   |   📑 Mamoda2.5 Tech Report

Introduction

Mamoda is a family of unified AR-Diffusion models that seamlessly integrate multimodal understanding and generation within a single architecture. One model handles text-to-image, text-to-video, image editing, video editing, and multimodal understanding.
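
As a rough illustration of the AR-Diffusion pattern, the sketch below pairs an autoregressive multimodal backbone (which encodes the prompt into conditioning states) with a diffusion transformer that iteratively denoises visual latents under that conditioning. Every name, size, and the toy denoising update here is a hypothetical placeholder, not the repo's actual API.

import torch
import torch.nn as nn

D = 256  # toy hidden size

class ARDiffusionSketch(nn.Module):
    """Hypothetical AR-Diffusion split: AR backbone for understanding,
    DiT-style denoiser for generation. Illustrative only."""
    def __init__(self):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.ar_backbone = nn.TransformerEncoder(make(), num_layers=2)
        self.denoiser = nn.TransformerEncoder(make(), num_layers=2)

    @torch.no_grad()
    def generate(self, prompt_tokens, n_latents=16, steps=8):
        cond = self.ar_backbone(prompt_tokens)                 # (B, T, D) conditioning states
        z = torch.randn(prompt_tokens.size(0), n_latents, D)   # start from pure noise
        for _ in range(steps):                                 # iterative denoising
            joint = torch.cat([cond, z], dim=1)                # latents attend to the condition
            eps = self.denoiser(joint)[:, cond.size(1):]       # noise prediction on latent slots
            z = z - eps / steps                                # crude Euler-style update
        return z                                               # a VAE would decode these in practice

model = ARDiffusionSketch()
prompt = torch.randn(1, 12, D)        # stand-in for embedded prompt tokens
print(model.generate(prompt).shape)   # torch.Size([1, 16, 256])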

🎉 News

  • 2026-05-06: 🔥The Mamoda2.5 technical report is now online! It achieves SOTA on video editing benchmarks; open-source model weights are under internal review.
  • 2026-02-15: 🔥Released Mamoda2.5 inference code for Video Generation and Video Editing! Check out our Project Page.
  • 2025-12-10: 🔥Mamoda2-Dev, built upon Qwen3-VL-8B and supporting Image Editing, is now available on HuggingFace.
  • 2025-10-01: 🔥Mamoda2-Preview models are now available on HuggingFace. Note: to use the Preview version, please switch to the qwen25vl branch.

Highlights

MoE Architecture

  • Fine-Grained MoE: 128 routed experts with Top-8 routing; 25B total parameters, of which only ~3B (~12%) are active per forward pass, yielding 12x faster inference than dense models of comparable capacity (see the routing sketch after this list).
  • Unified Generation & Editing: A single model for text-to-image, text-to-video, image editing, and video editing — no separate task-specific models needed.
  • SOTA Video Editing: #1 on OpenVE-Bench (3.86), #1 on FiVE-Bench (87.41), best overall on Reco-Bench.
  • Top-Tier Video Generation: 61.64 on VBench 2.0, on par with HunyuanVideo 1.5 and LongCat-Video, with only 110s latency.
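
To make the activation figures concrete, here is a minimal sketch of softmax-gated Top-K expert routing using the numbers above (128 routed experts, Top-8). The class name, layer sizes, and plain-MLP experts are illustrative assumptions rather than Mamoda's actual implementation; the point is that only 8 of the 128 expert MLPs execute for any given token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen 8
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # only top_k of n_experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e                   # tokens whose slot-th pick is expert e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(32, 256)).shape)   # torch.Size([32, 256]); 8 of 128 experts per token

With Top-8 of 128 routed experts, each token runs 1/16 of the expert MLPs; attention, embeddings, and any shared parameters presumably account for the remainder of the ~3B active out of 25B total.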

Benchmark Results

Showcases

Text-to-Video

  • Cinematic Shots (video_60.mp4)
  • Animal Interaction (video_11_.mp4)
  • Motion (video_42.mp4)
  • Scenery (video_29.mp4)

Video Editing

  • Add Backpack (022_add_backpack_merged.mp4)
  • Transform Hand into Robotic Hand (043_creative_merged.mp4)
  • Ghibli Style (001_style_ghibli_merged.mp4)
  • Remove Right Person (014_remove_moving_person_merged.mp4)

Model Family

Version     Architecture                            Capabilities                           Details
Mamoda2.5   Qwen3-VL + 25B-A3B MoE DiT (E128A8)     Video Gen, Video Edit, Image Edit      → mamoda25/
Mamoda2     Qwen3-VL-8B + 3B experts + 2B DiT       Image Gen, Image Edit, Understanding   → mamoda2/

Citation

@article{shen2025mammothmoda2,
    title={MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation},
    author={Shen, Tao and Wan, Xin and Chen, Taicai and Zhang, Rui and Pan, Junwen and Lu, Dawei and Lei, Fanding and Lu, Zhilin and Yang, Yunfei and Cheng, Chen and She, Qi and Liu, Chang and Sun, Zhenbang},
    journal={arXiv preprint arXiv:2511.18262},
    year={2025},
    url={https://arxiv.org/abs/2511.18262}
}

@article{mamoda25,
    title={Mamoda2.5: Unified Visual Generation and Editing with Fine-Grained MoE DiT},
    journal={arXiv preprint arXiv:2605.02641},
    year={2026},
    url={https://arxiv.org/abs/2605.02641}
}

🎯 Join Our Team

Moderation LLM Team @ ByteDance — We're hiring! Passionate about multimodal AI, computer vision, and MLLM development?

We develop leading MLLMs for content moderation, building infrastructure including model benchmarking, data pipelines, efficient architectures, and training methodologies.

Recent Publications (2024–2026)
  • Pan, J., Zhang, Q., Zhang, R., Lu, M., Wan, X., Zhang, Y., Liu, C., & She, Q. (2025). TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning. ICLR 26.
  • Li, Y., Wang, Y., Zhu, Y., Zhao, Z., Lu, M., She, Q., & Zhang, S. (2025). BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models. ICLR 26.
  • Li, Z., Qian, D., Su, K., Diao, Q., Xia, X., Liu, C., ... & Yuan, Z. (2025). BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration. ICLR 26.
  • Zhang, Q., Cheng, A., Lu, M., Zhuo, Z., Wang, M., Cao, J., Guo, S., She, Q., & Zhang, S. (2025). Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. ICCV 25.
  • Xie, R., Du, C., Song, P., & Liu, C. (2025). MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. ICCV 25.
  • Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., & Zhang, S. (2025). Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs. NeurIPS 25.
  • Lin, L., Shi, D., Han, A., Chen, F., Chen, Q., Li, J., ... & Gao, J. (2025). ACT as human: Multimodal large language model data annotation with critical thinking. NeurIPS 25.
  • Yu, S., Jin, C., Wang, H., Chen, Z., Jin, S., Zuo, Z., ... & Sun, Q. (2024). Frame-voyager: Learning to query frames for video large language models. ICLR 25.
  • Pan, J., Zhang, R., Wan, X., Zhang, Y., Lu, M., & She, Q. (2025). TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-Like Long Video Understanding. arXiv preprint arXiv:2504.01407.
  • Liu, Z., Pan, J., She, Q., Gao, Y., & Xia, G. (2025). On the Faithfulness of Visual Thinking: Measurement and Enhancement. arXiv preprint arXiv:2510.23482.
  • Zhang, Y., Fan, C.-K., Huang, T., Lu, M., Yu, S., Pan, J., Cheng, K., She, Q., & Zhang, S. (2025). AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models. arXiv preprint arXiv:2506.16112.
  • Zhang, Y., Lu, M., Pan, J., Huang, T., Cheng, K., Liu, C., She, Q., & Zhang, S. (2025). ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better. arXiv preprint arXiv:2511.17106.
  • Shi, H., Liang, J., Xie, R., Wu, X., Chen, C., & Liu, C. (2025). Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios. arXiv preprint arXiv:2505.10584.
  • Shen, T., Wan, X., Chen, T., Zhang, R., Pan, J., Lu, D., Lei, F., Lu, Z., Yang, Y., & Cheng, C. (2025). MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation. arXiv preprint arXiv:2511.18262.
  • She, Q., Pan, J., Wan, X., Zhang, R., Lu, D., & Huang, K. (2024). MammothModa: Multi-Modal Large Language Model. arXiv preprint.

Contact: liuchang.lab@bytedance.com
