Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Abstract
Omni-Diffusion introduces the first any-to-any multimodal language model built on masked discrete diffusion, unifying text, speech, and image understanding and generation in a single framework.
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, demonstrating their promise as a backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
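To make the core mechanism concrete, below is a minimal PyTorch sketch of one masked discrete diffusion training step over a shared discrete token vocabulary. This illustrates the general technique the abstract describes, not the paper's actual implementation: `model`, `MASK_ID`, and `VOCAB_SIZE` are hypothetical placeholders, and the `1/t` reweighting is one common form of the discrete-diffusion objective, which may differ from Omni-Diffusion's exact loss.

```python
# Sketch of a masked discrete diffusion training step (assumptions labeled).
import torch
import torch.nn.functional as F

MASK_ID = 0          # reserved [MASK] token id (assumption)
VOCAB_SIZE = 65536   # shared text/speech/image token vocabulary size (assumption)

def masked_diffusion_loss(model, x0):
    """x0: (batch, seq_len) clean discrete tokens from any mix of modalities."""
    b, n = x0.shape
    # Sample a corruption level t ~ U(0, 1) per sequence.
    t = torch.rand(b, 1, device=x0.device)
    # Forward process: independently replace each token with [MASK] w.p. t.
    masked = torch.rand(b, n, device=x0.device) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    # The denoiser predicts the clean token at every position.
    logits = model(xt)  # (batch, seq_len, VOCAB_SIZE)
    ce = F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), x0.reshape(-1), reduction="none"
    ).reshape(b, n)
    # Score only the masked positions, with a common 1/t ELBO reweighting.
    per_seq = (ce * masked).sum(dim=1) / masked.sum(dim=1).clamp(min=1)
    return (per_seq / t.squeeze(1)).mean()
```

At inference, generation would run this in reverse: start from an all-`[MASK]` sequence (or condition on clean tokens of the input modalities) and iteratively unmask high-confidence positions until no masks remain.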
Community
Omni-Diffusion, the first any-to-any multimodal language model built on a mask-based discrete diffusion model.
Excited to see more progress in diffusion-based multimodal modeling!
This line of work is also related to our earlier paper Dream-VL, where we study vision-language models built on the masked diffusion language model Dream 7B. https://huggingface.co/papers/2512.22615
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation (2026)
- CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models (2026)
- Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space (2026)
- LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model (2026)
- Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device (2026)
- LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens (2026)
- Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization (2026)