AI & ML interests

Omni-modal Large Language Models, Multi-modal Large Language Models (MLLMs), Emotional spoken dialogue

KaiChen1998 posted an update 7 months ago
🤔 Advanced reasoning LLMs keep getting released before your current MLLM alignment is even done? Try our RACRO! Train once, then flexibly switch to new LLM reasoners at inference time!

📢 RACRO is a novel methodology for building multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks while supporting flexible swaps to any advanced reasoning model at inference time. We further propose CRO, a novel GRPO variant that reinforces query-conditioned captioning using only verifiable data for multi-modal mathematical questions.
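
To make the decoupling concrete, here is a minimal sketch of the two-stage inference described above. The function names and prompt wording are illustrative assumptions, not the actual RACRO implementation (see the GitHub repo for that).

```python
from typing import Callable

def racro_infer(
    image,                                    # e.g. a PIL.Image
    question: str,
    captioner: Callable[[object, str], str],  # MLLM: (image, prompt) -> caption text
    reasoner: Callable[[str], str],           # any text-only LLM: prompt -> answer
) -> str:
    # Stage 1: query-conditioned captioning (the part RACRO reinforces with CRO).
    caption_prompt = (
        "Describe everything in the image that is needed to answer the "
        f"following question, without answering it:\n{question}"
    )
    caption = captioner(image, caption_prompt)

    # Stage 2: text-only reasoning over the caption. Because this stage never
    # sees pixels, the reasoner can be swapped for any newer LLM at inference
    # time without re-running multi-modal alignment.
    reasoning_prompt = (
        f"Image description:\n{caption}\n\n"
        f"Question: {question}\n"
        "Reason step by step and give the final answer."
    )
    return reasoner(reasoning_prompt)
```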

✨ Highlights
✅ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models such as Claude-3.7-Sonnet and Gemini-2.0-Flash.
✅ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly swap LLM reasoners at inference time, providing unique inference-time scalability for multi-modal reasoning.
✅ Highly efficient: with only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO eliminates the burdensome multi-modal alignment stage (e.g., the 4.1T tokens used for Qwen2.5-VL).

🔥 You are all welcome to try it out and give it a star!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (2506.04559)
- Github: https://github.com/gyhdog99/RACRO2
- Demo: Emova-ollm/RACRO-demo
KaiChen1998 posted an update 10 months ago
📢 Our EMOVA paper has been accepted to CVPR 2025, and we are glad to release all resources, including code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)!

🤗 EMOVA is a novel end-to-end omni-modal LLM that can see, hear, and speak. Given omni-modal (i.e., textual, visual, and speech) inputs, EMOVA generates both textual and speech responses with vivid emotional control by utilizing its speech decoder and style controller.
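
For intuition, here is a rough sketch of that response flow. The class, method, and parameter names are hypothetical placeholders for illustration only; the actual interfaces live in the GitHub repo.

```python
from dataclasses import dataclass

@dataclass
class OmniResponse:
    text: str
    audio: bytes  # synthesized speech waveform for the spoken reply

def emova_respond(llm, speech_decoder, style_controller,
                  text=None, image=None, speech=None) -> OmniResponse:
    # 1) The omni-modal LLM consumes any mix of textual / visual / speech
    #    inputs and emits a text reply plus an emotion/style tag
    #    (hypothetical interface).
    reply_text, style_tag = llm.generate(text=text, image=image, speech=speech)

    # 2) The style controller maps the tag to decoder conditioning, and the
    #    speech decoder synthesizes the spoken response in that style.
    style = style_controller(style_tag)
    audio = speech_decoder.synthesize(reply_text, style=style)

    return OmniResponse(text=reply_text, audio=audio)
```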

✨ EMOVA Highlights
✅ State-of-the-art omni-modality: EMOVA achieves results comparable to the state of the art on vision-language and speech benchmarks simultaneously.
✅ Device adaptation: our codebase supports training/inference on both NVIDIA GPUs (e.g., A800 & H20) and Ascend NPUs (e.g., 910B3)!
✅ Modular design: we integrate multiple implementations of the vision encoder, vision projector, and language model, even including the most recent DeepSeekMoE-tiny (see the config sketch below)!
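
Purely as an illustration of what this modular design enables, here is a hypothetical configuration sketch; the keys and placeholder values are assumptions for illustration and do not reflect the repo's actual config schema.

```python
# Hypothetical component selection: swap any entry without touching the others.
emova_model_config = {
    "vision_encoder":   "<any of the integrated vision encoders>",
    "vision_projector": "<e.g. an MLP-style projector>",
    "language_model":   "<any of the integrated LLMs, incl. DeepSeekMoE-tiny>",
}
```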

🔥 You are all welcome to try it out and give it a star!
- Project page: https://emova-ollm.github.io/
- Github: https://github.com/emova-ollm/EMOVA
- Demo: Emova-ollm/EMOVA-demo