MammothModa

Unified Multimodal Understanding, Generation, and Editing

🌐 Project Page   |   📑 Mamoda2 Tech Report   |   📑 Mamoda2.5 Tech Report

Introduction

Mamoda is a family of unified AR-Diffusion models that seamlessly integrate multimodal understanding and generation within a single architecture. One model handles text-to-image, text-to-video, image editing, video editing, and multimodal understanding.
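
As a rough illustration of the AR-Diffusion pattern, the sketch below pairs an autoregressive multimodal backbone (which encodes the prompt into conditioning states) with a diffusion transformer that iteratively denoises visual latents under that conditioning. Every name, size, and the toy denoising update here is a hypothetical placeholder, not the repo's actual API.

import torch
import torch.nn as nn

D = 256  # toy hidden size

class ARDiffusionSketch(nn.Module):
    """Hypothetical AR-Diffusion split: AR backbone for understanding,
    DiT-style denoiser for generation. Illustrative only."""
    def __init__(self):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.ar_backbone = nn.TransformerEncoder(make(), num_layers=2)
        self.denoiser = nn.TransformerEncoder(make(), num_layers=2)

    @torch.no_grad()
    def generate(self, prompt_tokens, n_latents=16, steps=8):
        cond = self.ar_backbone(prompt_tokens)                 # (B, T, D) conditioning states
        z = torch.randn(prompt_tokens.size(0), n_latents, D)   # start from pure noise
        for _ in range(steps):                                 # iterative denoising
            joint = torch.cat([cond, z], dim=1)                # latents attend to the condition
            eps = self.denoiser(joint)[:, cond.size(1):]       # noise prediction on latent slots
            z = z - eps / steps                                # crude Euler-style update
        return z                                               # a VAE would decode these in practice

model = ARDiffusionSketch()
prompt = torch.randn(1, 12, D)        # stand-in for embedded prompt tokens
print(model.generate(prompt).shape)   # torch.Size([1, 16, 256])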

🎉 News

  • 2026-05-06: 🔥The Mamoda2.5 technical report is now online! It achieves SOTA on video editing benchmarks; open-source model weights are under internal review.
  • 2026-02-15: 🔥Released Mamoda2.5 inference code for Video Generation and Video Editing! Check out our Project Page.
  • 2025-12-10: 🔥Mamoda2-Dev, built upon Qwen3-VL-8B and supporting Image Editing, is now available on HuggingFace.
  • 2025-10-01: 🔥Mamoda2-Preview models are now available on HuggingFace. Note: to use the Preview version, please switch to the qwen25vl branch.

Highlights

MoE Architecture

  • Fine-Grained MoE: 128 routed experts with Top-8 routing; 25B total parameters, of which only ~3B (~12%) are active per forward pass, yielding 12x faster inference than dense models of comparable capacity (see the routing sketch after this list).
  • Unified Generation & Editing: A single model for text-to-image, text-to-video, image editing, and video editing — no separate task-specific models needed.
  • SOTA Video Editing: #1 on OpenVE-Bench (3.86), #1 on FiVE-Bench (87.41), best overall on Reco-Bench.
  • Top-Tier Video Generation: 61.64 on VBench 2.0, on par with HunyuanVideo 1.5 and LongCat-Video, with only 110s latency.
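
To make the activation figures concrete, here is a minimal sketch of softmax-gated Top-K expert routing using the numbers above (128 routed experts, Top-8). The class name, layer sizes, and plain-MLP experts are illustrative assumptions rather than Mamoda's actual implementation; the point is that only 8 of the 128 expert MLPs execute for any given token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen 8
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # only top_k of n_experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e                   # tokens whose slot-th pick is expert e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(32, 256)).shape)   # torch.Size([32, 256]); 8 of 128 experts per token

With Top-8 of 128 routed experts, each token runs 1/16 of the expert MLPs; attention, embeddings, and any shared parameters presumably account for the remainder of the ~3B active out of 25B total.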

Benchmark Results

Showcases

Text-to-Video

  • Cinematic Shots (video_60.mp4)
  • Animal Interaction (video_11_.mp4)
  • Motion (video_42.mp4)
  • Scenery (video_29.mp4)

Video Editing

  • Add Backpack (022_add_backpack_merged.mp4)
  • Transform Hand into Robotic Hand (043_creative_merged.mp4)
  • Ghibli Style (001_style_ghibli_merged.mp4)
  • Remove Right Person (014_remove_moving_person_merged.mp4)

Model Family

Version     Architecture                            Capabilities                           Details
Mamoda2.5   Qwen3-VL + 25B-A3B MoE DiT (E128A8)     Video Gen, Video Edit, Image Edit      → mamoda25/
Mamoda2     Qwen3-VL-8B + 3B experts + 2B DiT       Image Gen, Image Edit, Understanding   → mamoda2/

Citation

@article{shen2025mammothmoda2,
    title={MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation},
    author={Shen, Tao and Wan, Xin and Chen, Taicai and Zhang, Rui and Pan, Junwen and Lu, Dawei and Lei, Fanding and Lu, Zhilin and Yang, Yunfei and Cheng, Chen and She, Qi and Liu, Chang and Sun, Zhenbang},
    journal={arXiv preprint arXiv:2511.18262},
    year={2025},
    url={https://arxiv.org/abs/2511.18262}
}

@article{mamoda25,
    title={Mamoda2.5: Unified Visual Generation and Editing with Fine-Grained MoE DiT},
    journal={arXiv preprint arXiv:2605.02641},
    year={2026},
    url={https://arxiv.org/abs/2605.02641}
}

🎯 Join Our Team

Moderation LLM Team @ ByteDance — We're hiring! Passionate about multimodal AI, computer vision, and MLLM development?

We develop leading MLLMs for content moderation, building infrastructure including model benchmarking, data pipelines, efficient architectures, and training methodologies.

Recent Publications (2024–2026)
  • Pan, J., Zhang, Q., Zhang, R., Lu, M., Wan, X., Zhang, Y., Liu, C., & She, Q. (2025). TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning. ICLR 26.
  • Li, Y., Wang, Y., Zhu, Y., Zhao, Z., Lu, M., She, Q., & Zhang, S. (2025). BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models. ICLR 26.
  • Li, Z., Qian, D., Su, K., Diao, Q., Xia, X., Liu, C., ... & Yuan, Z. (2025). BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration. ICLR 26.
  • Zhang, Q., Cheng, A., Lu, M., Zhuo, Z., Wang, M., Cao, J., Guo, S., She, Q., & Zhang, S. (2025). Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. ICCV 25.
  • Xie, R., Du, C., Song, P., & Liu, C. (2025). MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. ICCV 25.
  • Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., & Zhang, S. (2025). Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs. NeurIPS 25.
  • Lin, L., Shi, D., Han, A., Chen, F., Chen, Q., Li, J., ... & Gao, J. (2025). ACT as human: Multimodal large language model data annotation with critical thinking. NeurIPS 25.
  • Yu, S., Jin, C., Wang, H., Chen, Z., Jin, S., Zuo, Z., ... & Sun, Q. (2024). Frame-voyager: Learning to query frames for video large language models. ICLR 25.
  • Pan, J., Zhang, R., Wan, X., Zhang, Y., Lu, M., & She, Q. (2025). TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-Like Long Video Understanding. arXiv preprint arXiv:2504.01407.
  • Liu, Z., Pan, J., She, Q., Gao, Y., & Xia, G. (2025). On the Faithfulness of Visual Thinking: Measurement and Enhancement. arXiv preprint arXiv:2510.23482.
  • Zhang, Y., Fan, C.-K., Huang, T., Lu, M., Yu, S., Pan, J., Cheng, K., She, Q., & Zhang, S. (2025). AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models. arXiv preprint arXiv:2506.16112.
  • Zhang, Y., Lu, M., Pan, J., Huang, T., Cheng, K., Liu, C., She, Q., & Zhang, S. (2025). ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better. arXiv preprint arXiv:2511.17106.
  • Shi, H., Liang, J., Xie, R., Wu, X., Chen, C., & Liu, C. (2025). Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios. arXiv preprint arXiv:2505.10584.
  • Shen, T., Wan, X., Chen, T., Zhang, R., Pan, J., Lu, D., Lei, F., Lu, Z., Yang, Y., & Cheng, C. (2025). MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation. arXiv preprint arXiv:2511.18262.
  • She, Q., Pan, J., Wan, X., Zhang, R., Lu, D., & Huang, K. (2024). MammothModa: Multi-Modal Large Language Model. arXiv preprint.

Contact: liuchang.lab@bytedance.com
