CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models Paper • 2412.12932 • Published Dec 17, 2024 • 2
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Paper • 2412.10342 • Published Dec 13, 2024
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark Paper • 2502.04976 • Published Feb 7, 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology Paper • 2503.14911 • Published Mar 19, 2025 • 3
Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning Paper • 2602.00971 • Published Feb 28
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models Paper • 2604.12617 • Published 30 days ago • 6
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment Paper • 2604.19548 • Published 23 days ago • 16
Reasoning Implicit Sentiment with Chain-of-Thought Prompting Paper • 2305.11255 • Published May 18, 2023 • 2
CMNER: A Chinese Multimodal NER Dataset based on Social Media Paper • 2402.13693 • Published Feb 21, 2024
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis Paper • 2408.09481 • Published Aug 18, 2024 • 1
LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model Paper • 2304.06248 • Published Apr 13, 2023
NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations Paper • 2501.17261 • Published Aug 22, 2024
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7, 2025 • 83
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Paper • 2511.08521 • Published Nov 11, 2025 • 39