ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Abstract
ARM-Thinker, an agentic reward model, uses external tools for verification, improving accuracy and interpretability in complex multimodal reasoning tasks.
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground its judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities absent from existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, a suite of three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves a +16.2% average improvement on reward modeling benchmarks and +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly improve both the accuracy and the interpretability of reward models.
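The abstract describes a judge that interleaves reasoning with tool calls (e.g., image cropping, page retrieval) before committing to a score. As a rough illustration of that control flow only, below is a minimal sketch of an agentic judging loop; the tool names, the message format, and the `policy` callable are hypothetical assumptions for illustration, not the released ARM-Thinker interface.

```python
# Illustrative sketch of an agentic reward-judging loop (assumed interface,
# not the authors' released code). The policy emits either a tool call or a
# final judgment; tool results are appended to the context and the loop repeats.
from typing import Callable, Dict, List

# Hypothetical tools the judge may invoke to gather verifiable evidence.
def crop_image(region: str) -> str:
    """Stub: return a description/encoding of the requested image region."""
    return f"<cropped image: {region}>"

def retrieve_page(query: str) -> str:
    """Stub: return the text of the most relevant document page."""
    return f"<retrieved page for: {query}>"

TOOLS: Dict[str, Callable[[str], str]] = {
    "crop_image": crop_image,
    "retrieve_page": retrieve_page,
}

def judge(policy: Callable[[List[dict]], dict],
          question: str, answer: str, max_steps: int = 4) -> dict:
    """Run the judge for up to `max_steps` rounds of tool use, then score."""
    context: List[dict] = [
        {"role": "user", "content": f"Question: {question}\nCandidate answer: {answer}"}
    ]
    for _ in range(max_steps):
        step = policy(context)  # {"tool": ..., "arg": ...} or {"score": ..., "rationale": ...}
        if "tool" in step:
            evidence = TOOLS[step["tool"]](step["arg"])
            context.append({"role": "tool", "content": evidence})
        else:
            return step  # final judgment grounded in the gathered evidence
    return policy(context + [{"role": "user", "content": "Give the final judgment now."}])

# Minimal usage with a dummy policy that crops once, then scores.
if __name__ == "__main__":
    calls = {"n": 0}
    def dummy_policy(ctx: List[dict]) -> dict:
        if calls["n"] == 0:
            calls["n"] += 1
            return {"tool": "crop_image", "arg": "upper-left quadrant"}
        return {"score": 0.8, "rationale": "Claim matches the cropped region."}
    print(judge(dummy_policy, "What is shown in the corner?", "A red stop sign."))
```

In the paper's framing, the policy in such a loop is the multimodal reward model itself, and the tool-calling decisions inside the loop are what the multi-stage RL objective optimizes jointly with judgment accuracy.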
Community
GitHub Repo: https://github.com/InternLM/ARM-Thinker
For Agent Evaluation: https://github.com/open-compass/VLMEvalKit/pull/1334 (We added a new feature to VLMEvalKit that supports evaluating diverse models within the ARM-Thinker agent flow, including full tool use and multi-step reasoning.)
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning (2025)
- DeepEyesV2: Toward Agentic Multimodal Model (2025)
- VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning (2025)
- Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch (2025)
- MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning (2025)
- OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning (2025)
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning (2025)