AI & ML interests

Omni-modal Large Language Models, Multi-modal Large Language Models (MLLMs), Emotional spoken dialogue

KaiChen1998 posted an update 7 months ago
🤔 Advanced reasoning LLMs keep getting released before your current MLLM alignment is even done? Try our RACRO! Train once, then flexibly switch to new LLM reasoners at inference time!

📢 RACRO is a novel methodology for building multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks while supporting flexible swaps to any advanced reasoning model at inference time. We further propose CRO, a novel GRPO variant that reinforces query-conditioned captioning using only verifiable data for multi-modal mathematical questions.
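
To make the decoupling concrete, here is a minimal sketch of the two-stage inference described above. The function names and prompt wording are illustrative assumptions, not the actual RACRO implementation (see the GitHub repo for that).

```python
from typing import Callable

def racro_infer(
    image,                                    # e.g. a PIL.Image
    question: str,
    captioner: Callable[[object, str], str],  # MLLM: (image, prompt) -> caption text
    reasoner: Callable[[str], str],           # any text-only LLM: prompt -> answer
) -> str:
    # Stage 1: query-conditioned captioning (the part RACRO reinforces with CRO).
    caption_prompt = (
        "Describe everything in the image that is needed to answer the "
        f"following question, without answering it:\n{question}"
    )
    caption = captioner(image, caption_prompt)

    # Stage 2: text-only reasoning over the caption. Because this stage never
    # sees pixels, the reasoner can be swapped for any newer LLM at inference
    # time without re-running multi-modal alignment.
    reasoning_prompt = (
        f"Image description:\n{caption}\n\n"
        f"Question: {question}\n"
        "Reason step by step and give the final answer."
    )
    return reasoner(reasoning_prompt)
```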

✨ Highlights
✅ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models such as Claude-3.7-Sonnet and Gemini-2.0-Flash.
✅ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly swap LLM reasoners at inference time, providing unique inference-time scalability for multi-modal reasoning.
✅ Highly efficient: with only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO eliminates the burdensome multi-modal alignment stage (e.g., the 4.1T tokens used for Qwen2.5-VL).

🔥 You are all welcome to try it out and give it a star!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (2506.04559)
- Github: https://github.com/gyhdog99/RACRO2
- Demo: Emova-ollm/RACRO-demo
KaiChen1998 posted an update 10 months ago
📢 Our EMOVA paper has been accepted to CVPR 2025, and we are glad to release all resources, including code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)!

🤗 EMOVA is a novel end-to-end omni-modal LLM that can see, hear, and speak. Given omni-modal (i.e., textual, visual, and speech) inputs, EMOVA generates both textual and speech responses with vivid emotional control by utilizing its speech decoder and style controller.
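
For intuition, here is a rough sketch of that response flow. The class, method, and parameter names are hypothetical placeholders for illustration only; the actual interfaces live in the GitHub repo.

```python
from dataclasses import dataclass

@dataclass
class OmniResponse:
    text: str
    audio: bytes  # synthesized speech waveform for the spoken reply

def emova_respond(llm, speech_decoder, style_controller,
                  text=None, image=None, speech=None) -> OmniResponse:
    # 1) The omni-modal LLM consumes any mix of textual / visual / speech
    #    inputs and emits a text reply plus an emotion/style tag
    #    (hypothetical interface).
    reply_text, style_tag = llm.generate(text=text, image=image, speech=speech)

    # 2) The style controller maps the tag to decoder conditioning, and the
    #    speech decoder synthesizes the spoken response in that style.
    style = style_controller(style_tag)
    audio = speech_decoder.synthesize(reply_text, style=style)

    return OmniResponse(text=reply_text, audio=audio)
```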

✨ EMOVA Highlights
✅ State-of-the-art omni-modality: EMOVA achieves results comparable to the state of the art on vision-language and speech benchmarks simultaneously.
✅ Device adaptation: our codebase supports training/inference on both NVIDIA GPUs (e.g., A800 & H20) and Ascend NPUs (e.g., 910B3)!
✅ Modular design: we integrate multiple implementations of the vision encoder, vision projector, and language model, even including the most recent DeepSeekMoE-tiny (see the config sketch below)!
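
Purely as an illustration of what this modular design enables, here is a hypothetical configuration sketch; the keys and placeholder values are assumptions for illustration and do not reflect the repo's actual config schema.

```python
# Hypothetical component selection: swap any entry without touching the others.
emova_model_config = {
    "vision_encoder":   "<any of the integrated vision encoders>",
    "vision_projector": "<e.g. an MLP-style projector>",
    "language_model":   "<any of the integrated LLMs, incl. DeepSeekMoE-tiny>",
}
```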

🔥 You are all welcome to try it out and give it a star!
- Project page: https://emova-ollm.github.io/
- Github: https://github.com/emova-ollm/EMOVA
- Demo: Emova-ollm/EMOVA-demo