Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 24]
- eess.IV [Total: 2]
- cs.SD [Total: 1]
- cs.RO [Total: 1]
- cs.LG [Total: 2]
- cs.AI [Total: 1]
cs.CL
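[1] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models cs.CL | cs.AI
Aleix Torres-Camps, Nathaniel Mitrani Hadida, Víctor Conchello Vendrell, Àlex Batlle Casellas, Arnau Padrés Masdemont
TL;DR: This paper introduces M3Kang, a large-scale multilingual, multimodal mathematical reasoning dataset built from the Kangaroo Math Competition, the world's largest mathematics contest. It contains 1,747 multiple-choice problems organized by grade-level difficulty and translated into 108 languages; some problems include diagrams that are essential to solving them. The authors use the dataset to benchmark closed- and open-source SOTA models extensively and find that models still struggle with basic math and diagram-based reasoning, with performance correlating with language presence and model size but not with grade level. The study also shows that multilingual techniques extend effectively to the multimodal setting and significantly outperform baseline approaches.
Details
Motivation: Although SOTA vision-language models show strong reasoning capabilities, their performance on multilingual mathematical reasoning remains underexplored, particularly in comparison with human performance.
Result: Benchmarking on M3Kang shows that models perform poorly on basic math and diagram-based reasoning. Performance improves with language presence and model size but is unrelated to grade level, and extending multilingual techniques to the multimodal setting yields significant improvements.
Insight: The novelty lies in building M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset, and in directly comparing model and human performance. Viewed objectively, the dataset provides an important benchmark for evaluating the cross-lingual and cross-cultural mathematical reasoning of VLMs and exposes key limitations of current models in math and diagram understanding.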
Abstract: Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world’s largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, with translations into 108 culturally diverse languages; some problems include diagrams that are essential for solving them. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, with performance scaling with language presence and model size, but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.
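[2] GameTalk: Training LLMs for Strategic Conversation cs.CL | cs.AI | cs.GT | cs.LG | cs.MA
Victor Conchello Vendrell, Max Ruiz Luyten, Mihaela van der Schaar
TL;DR: This paper proposes GameTalk, a framework for training large language models (LLMs) to make strategic decisions through multi-turn dialogue. It adapts fine-tuning methods such as GRPO, DPO, and STaR to incorporate reward signals that depend on the entire conversation, so that training optimizes a long-term, global objective.
Details
Motivation: Long-term strategic decision-making and coordination in multi-agent settings remain a challenge for LLMs, especially when negotiation and coordination must unfold over extended conversations. Existing work mostly targets single-turn decision tasks and has paid little attention to optimizing long-term objectives through dialogue.
Result: The approach is evaluated on a suite of increasingly complex games designed to test different aspects of reasoning, coordination, and opponent modeling. GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains.
Insight: The novelty lies in positioning conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments. Viewed objectively, the core idea is to integrate long-term, global reward signals into LLM fine-tuning in order to optimize multi-turn strategic dialogue, which goes beyond traditional single-turn or static prediction methods.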
Abstract: Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.
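[3] Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks cs.CL | cs.AI
Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao
TL;DR: This paper introduces PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design related tasks, built to address the poor performance of existing LLMs on polymer design caused by missing domain knowledge and insufficient capability coverage. The dataset is augmented with structured chain-of-thought (CoT) through a knowledge-augmented reasoning distillation method, and its tasks are organized from simple to complex analytical reasoning problems. Experiments show that small language models (SLMs) with 7B to 14B parameters trained on PolyBench outperform similarly sized models, and even closed-source frontier LLMs, on the PolyBench test set, and also show gains on other polymer benchmarks.
Details
Motivation: Current LLMs are ineffective on polymer design, mainly because they lack polymer-specific knowledge and because existing aligned models do not cover the relevant knowledge and capabilities. A specialized dataset and method are therefore needed to improve model performance in this scientific application.
Result: On the PolyBench test dataset, the trained 7B to 14B SLMs outperform similarly sized models and closed-source frontier LLMs, and they also show performance gains on other polymer benchmarks.
Insight: The novelty includes the large-scale polymer design benchmark PolyBench and a knowledge-augmented reasoning distillation method that uses structured CoT to strengthen model reasoning. Viewed objectively, organizing tasks from simple to complex for generalization tests and diagnostic probes helps to systematically evaluate and improve model reasoning in specialized scientific domains.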
Abstract: Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs prove ineffective on this problem space because (i) most models lack polymer-specific knowledge, and (ii) existing aligned models lack coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design related tasks, leveraging a knowledge base of 13M+ data points obtained from experimental and synthetic sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs) of 7B to 14B parameters trained on PolyBench data outperform similarly sized models, and even closed-source frontier LLMs, on the PolyBench test dataset, while demonstrating gains on other polymer benchmarks as well.
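[4] PolyAgent: Large Language Model Agent for Polymer Design cs.CL
Vani Nigam, Achuth Chandrasekhar, Amir Barati Farimani
TL;DR: This paper presents PolyAgent, a closed-loop polymer structure-property prediction framework that is integrated into a terminal and aims to accelerate early-stage polymer discovery. The framework uses the reasoning capabilities of large language models (LLMs) to offer property prediction, property-guided polymer structure generation, and structure modification. It guides SMILES sequences with the synthetic accessibility score and the synthetic complexity score (SC Score) so that generated polymers stay as close as possible to synthetically accessible monomer-level structures, giving laboratory researchers computational insights.
Details
Motivation: Polymer discovery suffers from long, resource-intensive experimental trial-and-error cycles, and infrastructure limitations make it hard for laboratory researchers to access code and machine learning models directly in order to extract individual structures and properties.
Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with existing methods.
Insight: The main novelty is combining LLM reasoning with a closed-loop polymer design workflow in a user-friendly, terminal-integrated framework, and guiding generation with synthetic accessibility metrics (such as the SC Score) to keep structures synthetically feasible, thereby bridging the gap between computational models and laboratory practice.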
Abstract: On-demand polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers involve a long trial-and-error process that demands lengthy procedures and extensive resources. For these processes, machine learning has accelerated scientific discovery at the property prediction and latent space search fronts. However, laboratory researchers cannot readily access code and these models to extract individual structures and properties due to infrastructure limitations. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.
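[5] Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization cs.CL | cs.AI
Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang
TL;DR: This paper proposes Cite-While-You-Generate, a training-free framework for generation-time evidence attribution in multimodal clinical summarization. It uses decoder attention to directly cite supporting text spans or images, improving the trustworthiness and transparency of summaries.
Details
Motivation: Trustworthy clinical summarization must be transparent about the source of each generated statement. The framework aims to provide this while overcoming the limitations of post-hoc and retraining-based attribution methods.
Result: Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that the method outperforms embedding-based and self-attribution baselines in both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution stays competitive while being more lightweight and practical.
Insight: The novelty lies in using decoder attention to cite evidence directly at generation time, together with two multimodal attribution strategies: a raw image mode and a caption-as-span mode. This offers a new direction for interpretable and deployable clinical summarization systems.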
Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains: clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
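Editor's sketch: the abstract does not say how attention is aggregated into a citation, so the following is only a minimal illustration of the general idea of attention-guided attribution. It assumes a hypothetical input format (one row of attention weights per generated token, and named source spans given as token ranges) and simply cites the span that receives the most attention mass. The function name `cite_span` and the toy numbers are illustrative, not taken from the paper.

```python
def cite_span(attn, spans):
    """Pick the source span that receives the most attention mass.

    attn:  one row per generated token; each row holds attention weights
           over the source tokens (hypothetical input format).
    spans: maps a span name to a (start, end) source-token range, end exclusive.
    """
    scores = {}
    for name, (start, end) in spans.items():
        # Attention mass inside the span, averaged over generated tokens.
        scores[name] = sum(sum(row[start:end]) for row in attn) / len(attn)
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: two generated tokens attending over four source tokens.
attn = [
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.0, 0.5, 0.3],
]
spans = {"turn_1": (0, 2), "turn_2": (2, 4)}
best, scores = cite_span(attn, spans)
print(best, scores)
```

In a caption-as-span setting, an image would presumably be represented by the token range of its generated caption, so the same aggregation would apply to text and images alike.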
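[6] Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification cs.CL
Zongwan Cao, Bingbing Wen, Lucy Lu Wang
TL;DR: This paper proposes CoA (Clarify-or-Answer), an agent for visual question answering when the context is under-specified. The agent uses reinforcement learning to optimize the generation of clarification questions and significantly improves end-to-end VQA accuracy across multiple VLLMs and datasets.
Details
Motivation: Real-world VQA is often context-dependent because an image-question pair can be under-specified. Answering directly in such cases can produce confident but incorrect predictions.
Result: Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, raising end-to-end VQA accuracy by an average of 15.3 percentage points (an 83% relative improvement) over prompting-based baselines.
Insight: The novelty lies in modeling the decision to clarify or answer separately from what to ask, and in introducing GRPO-CR, a reinforcement learning approach that optimizes clarification question generation so that questions are well-formed, focused, and effective at resolving ambiguity.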
Abstract: Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY, which comprises a set of ambiguous VQA questions and a non-ambiguous contrast set. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
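[7] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning cs.CL | cs.CV
Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang
TL;DR: This paper proposes a reinforcement fine-tuning framework that integrates domain knowledge into multimodal large language models (MLLMs), addressing their limited performance in specialized domains such as remote sensing and medical imaging. The authors find that injecting domain knowledge only through textual instructions or prompts has little effect, so they encode domain knowledge as constraints and reward signals and integrate it at the optimization level, which significantly improves performance on specialized multimodal tasks.
Details
Motivation: Current MLLMs excel at general perception and understanding tasks but remain limited in specialized domains such as remote sensing and medicine. The authors find that input-level injection of domain knowledge through textual instructions yields almost no improvement, which suggests that models cannot internalize domain priors through language alone and that integration at the optimization level is needed.
Result: Extensive experiments on multiple remote sensing and medical datasets show consistent, significant performance gains, with state-of-the-art (SOTA) results on multimodal domain tasks.
Insight: The novelty lies in revealing a fundamental limitation of purely textual domain conditioning in current MLLMs, and in a reinforcement fine-tuning framework that builds domain knowledge directly into the learning objective as constraints and reward signals. This achieves optimization-level integration of domain knowledge rather than relying on descriptive information alone.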
Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model’s behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate good performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.
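[8] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go cs.CL
Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye
TL;DR: This paper presents LoGos, a model that combines the general reasoning capabilities of LLMs with expert knowledge of Go. It is trained by mixed fine-tuning on structured Go expertise and general long chain-of-thought reasoning data, followed by reinforcement learning. The resulting model keeps strong general reasoning abilities while playing Go in natural language at the level of human professional players.
Details
Motivation: General-purpose LLMs reason poorly in specialized domains such as Go. The work aims to bridge the gap between the general reasoning capabilities of LLMs and domain expert knowledge.
Result: LoGos performs on par with human professional Go players and substantially surpasses all existing LLMs. The authors also announce the release of the first large-scale Go dataset for LLM training and the first LLM Go evaluation benchmark.
Insight: The novelty lies in combining mixed fine-tuning (structured domain knowledge plus general CoT data) with reinforcement learning to integrate expert knowledge into a general LLM, which offers a new approach to applying general LLM reasoning to specialized domains.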
Abstract: Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs’ general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: https://github.com/Entarochuan/LoGos.
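[9] Persona Jailbreaking in Large Language Models cs.CL
Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
TL;DR: This paper proposes PHISH (Persona Hijacking via Implicit Steering in History), a black-box attack framework that embeds semantically loaded cues into user queries to gradually induce reverse personas in large language models (LLMs), exposing a new safety vulnerability in LLM persona consistency. Under a black-box, inference-only setting, the attack manipulates LLM persona traits through adversarial conversational history and is validated in high-risk domains such as mental health, tutoring, and customer support.
Details
Motivation: Existing studies focus on narrative or role-playing tasks and overlook the possibility that adversarial conversational history alone can reshape the personas induced in LLMs. Black-box persona manipulation remains unexplored, which raises concerns about model robustness in realistic interactions.
Result: Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and has stronger effects in multi-turn settings. The attack causes only a small drop in reasoning benchmark performance, leaving overall utility largely intact. Both human and LLM-as-Judge evaluations confirm reliable persona manipulation in high-risk domains.
Insight: The novelty lies in defining the persona editing task and in proposing the PHISH framework to expose a new vulnerability in LLM persona safety. Viewed objectively, the black-box attack through semantically loaded cues and gradual induction, together with a metric that quantifies attack success, offers new perspectives and tools for research on LLM safety and persona robustness.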
Abstract: Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose this new vulnerability in LLM safety; it embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In high-risk domains (mental health, tutoring, and customer support), PHISH reliably manipulates personas, validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient personas in LLMs. Our codebase and dataset are available at: https://github.com/Jivnesh/PHISH
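[10] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering cs.CL | cs.AI
Haotian Chen, Qingqing Long, Siyu Pu, Xiao Luo, Wei Ju
TL;DR: This paper proposes DeepEra, a Deep Evidence Reranking Agent that addresses interference from passages that are semantically similar but logically irrelevant in scientific retrieval-augmented question answering. The method integrates step-by-step reasoning to evaluate candidate passages more precisely. To support systematic evaluation, the authors construct the SciRAG-SSLI dataset of about 300K instances.
Details
Motivation: In scientific question answering, existing retrieval-augmented generation methods are vulnerable to passages that are semantically similar but logically irrelevant, which reduces factual reliability and amplifies hallucinations. The paper aims to address this challenge.
Result: Comprehensive evaluations on the SciRAG-SSLI dataset confirm that the method achieves better retrieval performance than leading rerankers.
Insight: The main novelty is a deep evidence reranking agent that integrates step-by-step reasoning to assess the logical relevance of passages beyond surface-level semantics. The authors also construct a large-scale dataset and, to their knowledge, are the first to comprehensively study and empirically validate the SSLI problem in two-stage RAG frameworks.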
Abstract: With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. However, existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, constructed from a 10M scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
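[11] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization cs.CL
Peiji Li, Linyang Li, Handa Sun, Wenjin Mai, Yongkang Chen
TL;DR: This paper proposes TL-GRPO, a lightweight reinforcement learning algorithm designed for fine-grained optimization in iterative-optimization reasoning tasks. Using turn-level group sampling, it outperforms standard GRPO and Bayesian optimization on analog circuit sizing, and a 30B model trained with it reaches state-of-the-art performance under the same simulation budget.
Details
Motivation: Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in iterative optimization tasks, while black-box optimization methods discard prior knowledge and reasoning capabilities. A new method is needed that combines the strengths of both.
Result: On analog circuit sizing, TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications, and the 30B model trained with it achieves state-of-the-art performance under the same simulation budget.
Insight: The novelty lies in refining the granularity of RL optimization from the trajectory level to the turn level. Turn-level group sampling enables fine-grained optimization while preserving the model's reasoning capabilities, which offers a new solution for iterative optimization tasks.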
Abstract: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
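Editor's sketch: two ideas from the abstract above can be made concrete with a few lines of Python. The first function contrasts a best-turn trajectory value with a cumulative return. The second shows the group-relative normalization commonly associated with GRPO, applied here to candidates sampled at a single turn. How TL-GRPO actually forms its turn-level groups is not described in the abstract, so the grouping, the function names, and the reward values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def trajectory_value(turn_rewards):
    """Value of an iterative-optimization trajectory: its best turn, not the sum."""
    return max(turn_rewards)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One trajectory with three turns: only the best turn determines its value.
turn_rewards = [0.2, 0.7, 0.5]
best_turn = trajectory_value(turn_rewards)
cumulative = sum(turn_rewards)

# Assumed turn-level group: three candidates sampled at the same turn.
candidate_rewards = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(candidate_rewards)
print(best_turn, cumulative, advantages)
```

Normalizing within a group sampled at one turn gives each candidate its own learning signal, whereas a single trajectory-level reward would assign the same credit to every turn.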
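[12] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic cs.CL | cs.AI
Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li
TL;DR: This paper proposes the Timely Machine framework, which redefines test-time as wall-clock time so that models can adjust their strategies dynamically according to a time budget. The authors build the Timely-Eval benchmark, covering high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning, and they find an interaction between model size and tool latency: smaller models do better at low latency through more frequent interactions, while larger models win at high latency through better interaction quality. Because existing models fail to adapt to time budgets, the authors propose Timely-RL, which uses reinforcement learning to strengthen temporal planning and improves performance on Timely-Eval.
Details
Motivation: In agentic scenarios with frequent tool calls, the traditional definition of test-time based on generation length breaks down, because tool latency decouples inference time from generation length. The paper addresses test-time scaling in agentic settings so that models can perceive and adapt to real time constraints.
Result: On the Timely-Eval benchmark, Timely-RL improves time budget awareness and consistently boosts performance. Experiments show that smaller models perform better in low-latency settings, while larger models dominate in high-latency settings.
Insight: The novelty lies in redefining test-time as wall-clock time and in introducing Timely-RL, a reinforcement learning framework with time budget awareness. This offers a new perspective on test-time scaling in the agentic era and underlines the importance of adjusting strategy dynamically under real time constraints.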
Abstract: As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
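Editor's sketch: the claim that tool latency decouples inference time from generation length can be illustrated with a simulated clock. The function below counts how many think-then-call-tool steps fit into a wall-clock budget. The per-step thinking times (1 s for a "small" model, 5 s for a "large" one), the latencies, and the 60 s budget are made-up numbers, not values from the paper.

```python
def steps_within_budget(budget_s, think_s, tool_latency_s, max_steps=1000):
    """Count think-then-call-tool steps that fit in a wall-clock budget.

    Uses a simulated clock, so the result is deterministic.
    """
    elapsed = 0.0
    steps = 0
    step_cost = think_s + tool_latency_s
    while steps < max_steps and elapsed + step_cost <= budget_s:
        elapsed += step_cost
        steps += 1
    return steps


BUDGET_S = 60.0
# Low tool latency (1 s): the faster model gets many more interactions.
small_fast_tool = steps_within_budget(BUDGET_S, think_s=1.0, tool_latency_s=1.0)
large_fast_tool = steps_within_budget(BUDGET_S, think_s=5.0, tool_latency_s=1.0)
# High tool latency (20 s): latency dominates and the step counts converge.
small_slow_tool = steps_within_budget(BUDGET_S, think_s=1.0, tool_latency_s=20.0)
large_slow_tool = steps_within_budget(BUDGET_S, think_s=5.0, tool_latency_s=20.0)
print(small_fast_tool, large_fast_tool, small_slow_tool, large_slow_tool)
```

Under these assumed numbers the small model's advantage in interaction count disappears once latency dominates, which is consistent with the abstract's observation that interaction quality matters more in high-latency settings.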
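[13] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine cs.CL | cs.AI
Wei Zhu
TL;DR: This paper introduces MRAG, a benchmark for retrieval-augmented generation (RAG) in the medical domain that covers a range of tasks in English and Chinese and builds a corpus from Wikipedia and PubMed. The authors also develop MRAG-Toolkit for systematic exploration of RAG components. Experiments show that RAG improves LLM reliability on medical tasks, that its performance depends on the retrieval approach, model size, and prompting strategy, and that it can slightly reduce readability on long-form questions.
Details
Motivation: The medical domain lacks a comprehensive RAG evaluation benchmark. This paper aims to fill that gap and provide a standardized evaluation tool for scientific and clinical QA systems.
Result: Experiments on the MRAG benchmark show that RAG improves LLM reliability and reasoning quality, while performance is influenced by several factors. The abstract does not give detailed quantitative results.
Insight: The novelty lies in building a comprehensive medical RAG benchmark (MRAG) and an accompanying toolkit (MRAG-Toolkit), which provide a standardized platform for systematically evaluating different RAG components and help advance medical AI applications.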
Abstract: While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese, and building a corpus with Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; (c) while RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench’s dataset and toolkit under the CC BY 4.0 license upon acceptance, to facilitate applications from both academia and industry.
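[14] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning cs.CL | cs.AI
Obed Junias, Maria Leonor Pacheco
TL;DR: This paper introduces the LOGICAL-COMMONSENSEQA benchmark, which re-frames commonsense reasoning as logical composition over pairs of atomic statements using logical operators (AND, OR, NEITHER/NOR), in order to evaluate how well models handle statements that are jointly plausible, mutually exclusive, or jointly implausible.
Details
Motivation: Most existing commonsense reasoning benchmarks rely on single-label evaluation and cannot distinguish whether statements are jointly plausible, mutually exclusive, or jointly implausible. A new benchmark is needed to evaluate model reasoning at the level of logical composition.
Result: Instruction-tuned, reasoning-specialized, and fine-tuned models are evaluated under zero-shot, few-shot, and chain-of-thought prompting. Models perform reasonably on conjunctive reasoning and moderately on disjunctive reasoning, but performance degrades sharply on negation-based questions.
Insight: The novelty lies in using logical operators to construct a commonsense reasoning benchmark. The benchmark exposes a fundamental limitation of models in negation-based reasoning and provides a controlled evaluation framework for advancing compositional commonsense reasoning.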
Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
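Editor's sketch: the three operators named in the abstract can be read as Boolean compositions over the plausibility of two atomic statements. The function below encodes one such reading. The abstract does not define the operators formally, so the inclusive reading of OR and the function name `compose` are assumptions, and the benchmark may define them differently.

```python
def compose(a_plausible: bool, b_plausible: bool) -> dict:
    """Truth value of each composite label for a pair of atomic statements."""
    return {
        "AND": a_plausible and b_plausible,              # jointly plausible
        "OR": a_plausible or b_plausible,                # at least one plausible (assumed inclusive)
        "NEITHER/NOR": not (a_plausible or b_plausible), # jointly implausible
    }


# Enumerate all four combinations of plausibility judgments.
table = {
    (a, b): compose(a, b)
    for a in (True, False)
    for b in (True, False)
}
for pair, labels in table.items():
    print(pair, labels)
```

Under this reading, NEITHER/NOR is the negation of OR, so a model has to reject both statements to answer correctly, which matches the negation-based category on which the abstract reports a sharp drop in performance.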
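[15] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification cs.CL | cs.LG
Gaurav Maheshwari, Kevin El Haddad
TL;DR: This paper proposes Curate-Train-Refine, a closed-loop agentic framework for zero-shot classification. A large language model dynamically generates supervision for training a lightweight text classifier. An iterative agentic loop (curating data, analyzing model behavior, and synthesizing targeted examples) progressively improves data quality and adapts it to the downstream task, which enables efficient classification at lower inference cost and latency.
Details
Motivation: Large language models and high-capacity encoders have advanced zero-shot and few-shot classification, but their high inference cost and latency limit practical deployment. The paper addresses this deployment bottleneck by using an LLM to generate supervision for training lightweight classifiers.
Result: On four widely used benchmarks, the method consistently outperforms standard zero-shot and few-shot baselines, indicating that it achieves accurate and efficient classification.
Insight: The novelty lies in an LLM-driven closed-loop agentic framework in which the LLM acts as a data curator and iterative generation and evaluation refine the training data, so that a lightweight model reaches high performance without deploying a large model. This offers a reusable pattern for reducing the operating cost of AI systems.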
Abstract: Large language models (LLMs) and high-capacity encoders have advanced zero and few-shot classification, but their inference cost and latency limit practical deployment. We propose training lightweight text classifiers using dynamically generated supervision from an LLM. Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors. This closed-loop generation and evaluation process progressively improves data quality and adapts it to the downstream classifier and task. Across four widely used benchmarks, our approach consistently outperforms standard zero and few-shot baselines. These results indicate that LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.
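Editor's sketch: the closed loop described in the abstract (curate data, train, analyze failures, synthesize targeted examples) can be written as a small control loop. Everything below is a toy stand-in: `toy_generate` replaces the LLM call, `toy_train` is a memorizing "classifier", and `PROBES` is a hypothetical labeled probe set. None of these names or design choices come from the paper.

```python
# Hypothetical probe set used to find classifier failures.
PROBES = {"great movie": "pos", "awful plot": "neg", "boring scenes": "neg"}


def toy_generate(errors=None):
    """Stand-in for an LLM call that curates or synthesizes labeled examples."""
    if errors is None:
        return [("great movie", "pos")]              # initial curated data
    return [(text, PROBES[text]) for text in errors]  # targeted examples


def toy_train(data):
    """Stand-in for training a lightweight classifier: memorize the data."""
    return dict(data)


def toy_evaluate(model):
    """Return the probe texts the current model gets wrong."""
    return [text for text, label in PROBES.items() if model.get(text) != label]


def curate_train_refine(generate, train, evaluate, max_rounds=3):
    data = generate(errors=None)
    model = train(data)
    for _ in range(max_rounds):
        errors = evaluate(model)
        if not errors:
            break                                   # nothing left to fix
        data = data + generate(errors=errors)       # add targeted examples
        model = train(data)
    return model, data


model, data = curate_train_refine(toy_generate, toy_train, toy_evaluate)
print(model, len(data))
```

Passing the three stages in as functions keeps the loop independent of any particular LLM or classifier, which is the property that would let a lightweight model be swapped in for deployment.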
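[16] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model cs.CL
Xiang Chen
TL;DR: This paper presents AuroraEdge-V-2B, a compact, robust, and high-speed visual large language model designed for edge computing. The model has only 2 billion parameters and uses a proposed compression-fusion method that improves inference efficiency and reduces the number of visual tokens, achieving faster inference and lower deployment cost while maintaining strong performance.
Details
Motivation: In industrial applications, visual large language models suffer from large parameter counts, slow inference, poor real-time performance, and weaker results than custom deep learning models in specific domains. The goal is a lightweight, fast, and strong VLLM that is better suited to edge deployment.
Result: On 9 benchmarks, it scores higher than other models of comparable parameter size, such as Qwen2-VL-2B, Qwen2.5-VL-3B, and InternVL-2.5-2B.
Insight: The main novelty is a lightweight VLLM architecture optimized for edge deployment, together with a compression-fusion method that improves inference efficiency. The core idea is to reduce the number of visual tokens during decoding, which significantly lowers computational cost and yields a better balance among parameter count, inference speed, and model performance.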
Abstract: Recently, due to the advancement of multimodal technology, people are attempting to use visual large language models (VLLMs) in industrial production. Many deep learning models (DLMs) deployed in the production environment are gradually being replaced by VLLMs. Compared with DLMs, VLLMs have some advantages in industrial applications: (1) Their strong generalization ability enables them to perform well across a wide range of tasks. (2) They are flexible and can quickly handle unfamiliar samples through in-context learning. However, VLLMs also have obvious drawbacks: (1) VLLMs do not perform as well as custom-developed DLMs in specific domains. (2) The number of parameters in VLLMs is generally quite large, and their deployment requires substantial computational resources. (3) VLLMs generally operate much slower than DLMs, making real-time response challenging to achieve. To better utilize VLLMs in industrial applications, we introduce AuroraEdge-V-2B in this work, a compact, robust, and high-speed VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method to improve inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy deployment and faster: It has only 2B parameters and is highly suitable for edge deployment, offering better real-time performance. (2) Fewer visual tokens and cheaper: It significantly reduces the number of visual tokens in the decoding process, thereby reducing the floating-point operations by half during inference and making it cheaper to use. (3) Strong performance: It gets a higher score on 9 benchmarks than models with a similar number of parameters (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).
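[17] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants cs.CL
Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu
TL;DR: This paper introduces RPEval, a benchmark that evaluates how rationally personalized assistants based on large language models (LLMs) use memories of user preferences. It reveals that irrational personalization is widespread in existing models and harms user experience, and it proposes RP-Reasoner, a method based on pragmatic reasoning that integrates personalized information selectively and improves performance.
Details
Motivation: Memory mechanisms that record user preferences let LLM assistants give more personalized responses, but irrelevant memories introduced into the context interfere with the model's understanding of user intent. The dual effects of personalization, positive and negative, therefore need systematic study.
Result: On RPEval, RP-Reasoner significantly outperforms carefully designed baselines, and it resolves 80% of the bad cases observed in a large-scale commercial personalized assistant.
Insight: The novelty lies in treating memory utilization as a pragmatic reasoning process, which enables selective integration of personalized information and offers a new way to mitigate irrational personalization. Viewed objectively, building a dedicated benchmark (RPEval) that quantifies how rationally personalized memory is used is a key contribution to research in this area.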
Abstract: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM’s intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at https://github.com/XueyangFeng/RPEval.
[18] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice cs.CL | cs.AI | cs.CYPDF
Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu
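TL;DR: This paper introduces PLawBench, a rubric-based benchmark for evaluating large language models in real-world legal practice. It contains 850 questions covering 13 practical legal scenarios, paired with about 12,500 fine-grained rubric items, and uses three task types (public legal consultation, practical case analysis, and legal document generation) to assess a model's ability to identify legal issues, reason in a structured way, and generate documents.
Details
Motivation: Existing legal benchmarks use oversimplified, standardized tasks that fail to capture the ambiguity, complexity, and reasoning demands of real legal practice, and their metrics are coarse and single-dimensional, with no explicit assessment of fine-grained legal reasoning.
Result: Ten state-of-the-art LLMs were evaluated with an LLM-based evaluator aligned with human expert judgments. None performed strongly on PLawBench, revealing substantial limitations in the fine-grained legal reasoning of current LLMs.
Insight: The novelty lies in a fine-grained benchmark grounded in real legal workflows with expert-designed rubrics. It emphasizes structured legal reasoning and realistic tasks, and it points to important directions for the future evaluation and development of legal LLMs.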
Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
[19] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents cs.CL | cs.CVPDF
Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang
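TL;DR: This paper introduces EMemBench, a programmatic benchmark that evaluates agents' long-term memory through interactive games. It generates questions from each agent's own trajectory, covers both text and visual game environments, and uses verifiable ground truth to assess a range of memory skills. Evaluating memory agents built on LLMs and VLMs, the study finds that induction and spatial reasoning are the main bottlenecks and that visually grounded episodic memory remains an open challenge.
Details
Motivation: Existing benchmarks fall short in evaluating agents' long-term memory. In particular, there is no comprehensive framework that generates questions dynamically from an agent's own interaction experience and covers multiple memory skills (such as multi-hop recall, induction, and temporal and spatial reasoning).
Result: Across 15 text games and multiple visual seeds, performance is far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in visual environments. Persistent memory clearly helps open backbones on text games, but gains for VLM agents are inconsistent, indicating that visually grounded episodic memory remains challenging. A human study further confirms the difficulty of EMemBench.
Insight: The novelty is a programmatic benchmark that generates questions dynamically from agent trajectories and covers multiple memory skills. It also exposes the marked weakness of current VLMs on visually grounded memory tasks, pointing to directions for future research.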
Abstract: We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent’s own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.
[20] Do LLM hallucination detectors suffer from low-resource effect? cs.CL | cs.AIPDF
Debtanu Datta, Mohan Kishore Chilukuri, Yash Kumar, Saptarshi Ghosh, Muhammad Bilal Zafar
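TL;DR: This paper studies two common failure modes of large language models (LLMs): hallucination (producing incorrect information) and the low-resource effect (markedly worse performance in low-resource languages). The authors ask whether hallucination detectors also suffer from the low-resource effect. In experiments on five tasks across three domains (factual recall, STEM, and humanities), task accuracy drops sharply in low-resource languages, but detector accuracy drops far less.
Details
Motivation: The aim is to examine hallucination detection in low-resource languages, that is, whether detectors are affected by the low-resource effect as LLMs themselves are, in order to understand how robust the models' internal encoding of uncertainty is across languages.
Result: Experiments with four LLMs and three hallucination detectors show that in low-resource languages such as Bengali, task accuracy drops sharply relative to English, while detector accuracy drops much less. Detectors are robust in monolingual (including non-English) and multilingual settings, but perform poorly in cross-lingual settings without in-language supervision.
Insight: The novelty is the first systematic study of the low-resource effect in hallucination detectors. It finds that LLMs may internally encode uncertainty signals across languages and that detectors are relatively robust in low-resource languages, which offers a new perspective for improving cross-lingual hallucination detection.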
Abstract: LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors’ accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.
[21] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess cs.CL | cs.AIPDF
Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsäcker
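TL;DR: Using chess as a testbed, this paper examines the difference between crystallized intelligence (memory-based recall) and fluid intelligence (principle-based reasoning) in large language models (LLMs). Model performance drops markedly on tasks that require fluid intelligence and falls to near-random levels on out-of-distribution tasks, indicating that current architectures are limited in systematic generalization.
Details
Motivation: The aim is to clarify whether the remarkable capabilities of LLMs stem from sophisticated recall (crystallized intelligence) or from genuine reasoning (fluid intelligence), using the structured environment of chess and scalable engine evaluations to separate the two.
Result: On chess-position tasks categorized by proximity to the training corpus, performance degrades consistently as fluid intelligence demands increase; on out-of-distribution tasks it collapses to random levels. Newer models improve, but progress slows significantly on tasks outside the training distribution. Reasoning-augmented inference improves performance, but its marginal benefit decreases with distributional proximity.
Insight: The novelty lies in using chess as a controlled testbed to disentangle the two types of intelligence, revealing the limits of LLMs in systematic generalization. Viewed objectively, this underlines the need for mechanisms beyond scale to achieve robust fluid intelligence.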
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game’s structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity, ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.
cs.CV [Back]
[22] GR3EN: Generative Relighting for 3D Environments cs.CVPDF
Xiaoyan Xing, Philipp Henzler, Junhwa Hur, Runze Li, Jonathan T. Barron
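TL;DR: This paper proposes GR3EN, a method for relighting 3D reconstructions of large room-scale environments. By distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction, it sidesteps the need to solve a difficult inverse rendering problem and enables 3D relighting of complex real-world scenes.
Details
Motivation: Existing 3D scene relighting solutions usually require solving under-determined or ill-conditioned inverse rendering problems and struggle to produce high-quality results on complex real-world scenes, while recent methods based on generative image and video diffusion models are limited to 2D image and video relighting or to 3D relighting of individual objects.
Result: The method is validated on synthetic and real-world datasets and faithfully renders novel views of scenes under new lighting conditions.
Insight: The core innovation is distilling the generative capability of a 2D video relighting diffusion model into a 3D reconstruction, which avoids directly solving a complex inverse rendering problem and enables controllable relighting of complex, large-scale 3D scenes.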
Abstract: We present a method for relighting 3D reconstructions of large room-scale environments. Existing solutions for 3D scene relighting often require solving under-determined or ill-conditioned inverse rendering problems, and are as such unable to produce high-quality results on complex real-world scenes. Though recent progress in using generative image and video diffusion models for relighting has been promising, these techniques are either limited to 2D image and video relighting or 3D relighting of individual objects. Our approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This side-steps the need to solve a difficult inverse rendering problem, and results in a flexible system that can relight 3D reconstructions of complex real-world scenes. We validate our approach on both synthetic and real-world datasets to show that it can faithfully render novel views of scenes under new lighting conditions.
[23] Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory cs.CV | cs.AI | cs.LGPDF
Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan
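TL;DR: This paper proposes Memory-V2V, a framework that adds an explicit memory mechanism to existing video-to-video diffusion models to address the difficulty of maintaining cross-turn consistency in multi-turn interactive video editing. The framework stores previous editing results in an external cache, uses accurate retrieval and dynamic tokenization to guide the current editing step, and reduces redundant computation with a learnable token compressor, achieving a 30% speedup.
Details
Motivation: Existing video editors struggle to maintain consistency across turns in multi-turn interactive editing; the goal is to make video editing better match real-world iterative workflows.
Result: Extensive experiments on challenging tasks, including video novel view synthesis and text-conditioned long video editing, show that Memory-V2V produces videos that are significantly more consistent across turns with minimal computational overhead, while matching or exceeding the task performance of SOTA baselines.
Insight: The core innovation is an explicit memory mechanism with an external cache for video diffusion models, which improves cross-turn consistency by retrieving previous editing results. A learnable token compressor in the DiT backbone reduces redundancy, balancing performance and efficiency and offering a new approach for iterative generative models.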
Abstract: Recent foundational video-to-video diffusion models have achieved impressive results in editing user provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: https://dohunlee1.github.io/MemoryV2V
[24] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments cs.CV | cs.AI | cs.CLPDF
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
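TL;DR: This paper studies the ability of foundation models to identify important sub-events in videos, focusing on football games. The authors construct a new dataset from football highlight reels to evaluate how well models distinguish important from non-important events, and find that current state-of-the-art multimodal models perform close to chance level.
Details
Motivation: The aim is to evaluate whether foundation models can recognize contextually important moments in multimodal events, a basic prerequisite for generating narrations or summaries, and thereby to surface limitations that may affect real-world applications.
Result: On the constructed football dataset, several state-of-the-art multimodal models perform close to chance level. Analysis shows that models tend to rely on a single dominant modality and are ineffective at synthesizing information from multiple sources.
Insight: The novelty lies in building a low-cost dataset from implicit human preferences and in exposing sample-level heterogeneity in multimodal data and insufficient cross-modal synergy, which underlines the importance of modular architectures and complementary training procedures.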
Abstract: Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.
[25] Coarse-to-Fine Non-rigid Multi-modal Image Registration for Historical Panel Paintings based on Crack Structures cs.CVPDF
Aline Sindel, Andreas Maier, Vincent Christlein
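TL;DR: This paper proposes a coarse-to-fine non-rigid multi-modal image registration method based on crack structures for aligning multi-modal images of historical panel paintings. The method uses a convolutional neural network and a graph neural network for keypoint detection, description, and matching, and achieves high-precision registration of mixed-resolution images through multi-level keypoint refinement.
Details
Motivation: Multi-modal image analysis of historical panel paintings requires pixel-level alignment, but this is still often done manually, which is slow and limited in precision. Automatic registration is challenging because of large differences in image resolution, huge image sizes, non-rigid distortions, and modality-dependent image content.
Result: On a large test set comprising five multi-modal domains and varying image resolutions, the method achieves the best registration results compared with competing keypoint matching, dense matching, and refinement methods. An ablation study confirms the effectiveness of each module.
Insight: The novelty lies in using the crack structures common in historical paintings as a stable cross-modal feature, combining them with a coarse-to-fine multi-level keypoint refinement strategy for mixed-resolution images, and improving matching robustness through patch-based descriptor matching with a graph neural network and filtering by local homography reprojection error.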
Abstract: Art technological investigations of historical panel paintings rely on acquiring multi-modal image data, including visual light photography, infrared reflectography, ultraviolet fluorescence photography, x-radiography, and macro photography. For a comprehensive analysis, the multi-modal images require pixel-wise alignment, which is still often performed manually. Multi-modal image registration can reduce this laborious manual work, is substantially faster, and enables higher precision. Due to varying image resolutions, huge image sizes, non-rigid distortions, and modality-dependent image content, registration is challenging. Therefore, we propose a coarse-to-fine non-rigid multi-modal registration method efficiently relying on sparse keypoints and thin-plate-splines. Historical paintings exhibit a fine crack pattern, called craquelure, on the paint layer, which is captured by all image systems and is well-suited as a feature for registration. In our one-stage non-rigid registration approach, we employ a convolutional neural network for joint keypoint detection and description based on the craquelure and a graph neural network for descriptor matching in a patch-based manner, and filter matches based on homography reprojection errors in local areas. For coarse-to-fine registration, we introduce a novel multi-level keypoint refinement approach to register mixed-resolution images up to the highest resolution. We created a multi-modal dataset of panel paintings with a high number of keypoint annotations, and a large test set comprising five multi-modal domains and varying image resolutions. The ablation study demonstrates the effectiveness of all modules of our refinement method. Our proposed approaches achieve the best registration results compared to competing keypoint and dense matching methods and refinement methods.
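The match-filtering step mentioned in this abstract (discarding matches with a large homography reprojection error in a local area) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `apply_homography` and `filter_matches`, the 3-pixel threshold, and the example homography are all assumptions chosen for illustration, and the homography is taken as given rather than estimated.

```python
import math

def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def filter_matches(H, matches, max_error=3.0):
    """Keep (src, dst) matches whose reprojection error is at most max_error pixels."""
    kept = []
    for src, dst in matches:
        px, py = apply_homography(H, src)
        error = math.hypot(px - dst[0], py - dst[1])
        if error <= max_error:
            kept.append((src, dst))
    return kept

# Hypothetical local homography: a pure translation by (10, 5).
H_local = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
matches = [
    ((0.0, 0.0), (10.0, 5.0)),   # exact match
    ((2.0, 2.0), (12.5, 7.0)),   # off by half a pixel, kept
    ((4.0, 4.0), (40.0, 40.0)),  # gross outlier, removed
]
inliers = filter_matches(H_local, matches)
```

In a patch-based pipeline, one homography would presumably be fitted per local area, so that a globally non-rigid deformation is still well approximated within each patch.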
[26] Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models cs.CV | cs.AI | q-bio.NCPDF
Bridget Leonard, Scott O. Murray
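TL;DR: Inspired by human spatial cognition, this paper proposes "perspective tokens", which encode orientation information through embedded body-keypoint cues or abstract representations that support mental rotation, to address the egocentric bias of multimodal language models on spatial reasoning tasks that require adopting another agent's visual perspective.
Details
Motivation: Multimodal language models perform well on semantic vision-language tasks but fail on spatial reasoning tasks that require adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise the question of whether current models support allocentric reasoning.
Result: Integrating perspective tokens into LLaVA-1.5-13B improves accuracy on synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), including strong performance on level-2 visual perspective-taking tasks, and rotation-based tokens generalize to non-human reference agents.
Insight: The novelty lies in embedding cognitively grounded spatial structure directly into token space, which provides a lightweight, model-agnostic mechanism for perspective-taking. Analysis shows that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that multimodal language models contain precursors of allocentric reasoning but lack appropriate internal structure. This offers a new direction for designing models with more human-like spatial reasoning.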
Abstract: Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent’s visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
[27] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection cs.CVPDF
Yuxin Jiang, Yunkang Cao, Yuqi Cheng, Yiheng Zhang, Weiming Shen
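TL;DR: This paper proposes VTFusion, a vision-text multimodal fusion network designed for few-shot anomaly detection. It learns task-specific representations for industrial inspection through adaptive feature extractors, and uses a dedicated multimodal prediction fusion module to promote cross-modal information exchange and generate refined pixel-level anomaly maps.
Details
Motivation: Existing methods rely mainly on features pre-trained on natural scenes and neglect the fine-grained, domain-specific semantics needed for industrial inspection. They also commonly fuse features by simple concatenation, which does not address the semantic misalignment between visual and textual modalities and leaves them insufficiently robust to cross-modal interference.
Result: In the 2-shot setting, VTFusion achieves image-level AUROCs of 96.8% and 86.2% on the MVTec AD and VisA datasets, respectively. On a real-world dataset of industrial automotive plastic parts newly introduced in the paper, it achieves an AUPRO of 93.5%, showing its applicability to practical industrial scenarios.
Insight: The core innovation consists of two designs: 1) adaptive feature extractors for both image and text, which bridge the domain gap between pre-trained models and industrial data and are complemented by synthetic anomaly data to sharpen feature discriminability; 2) a multimodal prediction fusion module comprising a fusion block and a segmentation network, which enables rich cross-modal interaction and generates refined anomaly segmentation maps under multimodal guidance.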
Abstract: Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.
[28] ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation cs.CV | cs.AIPDF
Yihao Wang, Jusheng Zhang, Ziyi Tang, Keze Wang, Meng Yang
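TL;DR: This paper proposes ResAgent, a new framework for referring expression segmentation (RES). By integrating entropy-based point discovery (EBD) and vision-based reasoning (VBR), it addresses two problems in existing methods: coarse bounding boxes that lead to redundant point prompts, and an unreliable dependence on textual coordinate reasoning. The framework implements a coarse-to-fine workflow and achieves state-of-the-art performance on several benchmark datasets.
Details
Motivation: Existing RES methods based on multimodal large language models (MLLMs) have two key limitations: first, the coarse bounding boxes generated by MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable and cannot distinguish targets from visually similar distractors.
Result: Extensive evaluation on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) shows that ResAgent achieves new state-of-the-art (SOTA) performance on all four.
Insight: The main innovations are: 1) modeling point selection as an information maximization process, with entropy-based point discovery (EBD) identifying high-information candidate points inside coarse bounding boxes; 2) abandoning text-only coordinate reasoning in favor of vision-based reasoning (VBR), which verifies point correctness through joint visual-semantic alignment for more robust validation; 3) overall, a coarse-to-fine workflow combining EBD and VBR that generates accurate, semantically grounded segmentation masks with minimal prompts.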
Abstract: Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose ResAgent, a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
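As a rough illustration of what entropy-guided point selection inside a coarse bounding box could look like, the sketch below scores each pixel by the binary entropy of a foreground probability and keeps the most uncertain ones. The abstract does not specify how spatial uncertainty is modeled, so the per-pixel probability map, the function names, and the top-k rule are all assumptions made for this example and should not be read as the paper's EBD procedure.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def top_entropy_points(prob_map, box, k=2):
    """Return the k (x, y) pixels inside box with the highest entropy.

    prob_map is a list of rows of hypothetical foreground probabilities;
    box is (x0, y0, x1, y1) with inclusive pixel bounds.
    """
    x0, y0, x1, y1 = box
    scored = []
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            scored.append((binary_entropy(prob_map[y][x]), (x, y)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [point for _, point in scored[:k]]

# Toy 3x3 probability map (assumed values, for illustration only).
prob_map = [
    [0.05, 0.10, 0.02],
    [0.90, 0.50, 0.45],
    [0.99, 0.95, 0.30],
]
points = top_entropy_points(prob_map, box=(0, 0, 2, 2), k=2)
```

Here the two selected points are the pixels with probabilities 0.50 and 0.45, the ones closest to maximal uncertainty; a later verification stage would then decide which candidates actually lie on the referred target.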
[29] AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose cs.CV | cs.AIPDF
Jongmin Yu, Hyeontaek Oh, Zhongtian Sun, Angelica I Aviles-Rivero, Moongu Jeon
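TL;DR: AlphaFace is a high-fidelity, real-time face-swapping method that is robust to facial pose. Using an open-source vision-language model and CLIP embeddings together with novel visual and textual semantic contrastive losses, it maintains strong identity representation and precise attribute preservation even under extreme facial poses.
Details
Motivation: Existing face-swapping methods perform well in constrained settings but degrade markedly under extreme facial poses. Methods based on geometric features add dependencies and computational cost, while diffusion-based methods cannot meet real-time requirements.
Result: On benchmarks including FF++, MPIE, and LPFF, AlphaFace surpasses state-of-the-art methods in pose-challenging cases while maintaining real-time performance.
Insight: The innovations include using a vision-language model and CLIP embeddings to strengthen identity representation and attribute preservation, and introducing visual and textual semantic contrastive losses to improve pose robustness while keeping real-time processing.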
Abstract: Existing face-swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion-based methods have achieved remarkable results; however, they are impractical for real-time processing. We introduce AlphaFace, which leverages an open-source vision-language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real-time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state-of-the-art methods in pose-challenging cases. The project is publicly available at https://github.com/andrewyu90/Alphaface_Official.git.
[30] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding cs.CV | cs.AIPDF
Xiaojiang Peng, Jingyi Chen, Zebang Cheng, Bao Peng, Fengyi Wu
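TL;DR: This paper proposes the Emotion-LLaMAv2 framework and the MMEVerse benchmark to address challenges in multimodal emotion understanding. Emotion-LLaMAv2 is an end-to-end multimodal large language model that unifies emotion recognition and free-form emotion reasoning through a multiview encoder, a Conv Attention pre-fusion module, and a perception-to-cognition curriculum instruction tuning scheme. MMEVerse aggregates 12 public emotion datasets with large-scale, high-quality re-annotation, providing a standardized benchmark for training and evaluation.
Details
Motivation: Current multimodal large language models excel at general vision-language tasks but remain limited in emotional reasoning. The field lacks large-scale datasets with high-quality descriptive emotion annotations and lacks standardized evaluation benchmarks. The earlier Emotion-LLaMA framework was restricted by explicit face detectors, implicit fusion strategies, and training data of limited scale and quality.
Result: Evaluation is carried out on MMEVerse, which aggregates 12 datasets including IEMOCAP, MELD, DFEW, and MAFW, re-annotated to produce 130k training clips and 36k testing clips across 18 evaluation benchmarks. With it, Emotion-LLaMAv2 establishes a standardized end-to-end evaluation setting for emotion recognition and reasoning.
Insight: The innovations include: 1. an end-to-end multiview encoder that needs no external face detector and captures subtle emotional cues through richer spatial and temporal multiview tokens; 2. a Conv Attention pre-fusion module that enables simultaneous local and global multimodal feature interaction outside the LLM backbone; 3. a perception-to-cognition curriculum instruction tuning scheme that unifies emotion recognition and free-form reasoning within the LLaMA2 backbone. Viewed objectively, the large-scale, high-quality re-annotated multimodal instruction dataset (MMEVerse) is a valuable standardized training and evaluation resource for the field.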
Abstract: Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
[31] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos cs.CVPDF
Meng Cao, Haoran Tang, Haoze Zhao, Mingfei Han, Ruyang Liu
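TL;DR: This paper proposes a new paradigm that uses visual anomalies ("glitches") in gameplay videos as a supervision signal to strengthen the physical-world understanding of multimodal large language models. The authors build PhysGame, an instruction-tuning dataset with a large number of glitch-centric question-answer pairs, and GameBench, an expert-annotated evaluation benchmark. Experiments show that the approach improves performance on real-world physical reasoning and general multimodal reasoning tasks.
Details
Motivation: Existing physical reasoning datasets depend either on real videos that are costly to annotate or on synthetic data that lacks realism and diversity. The paper explores a scalable and effective source of supervision for improving AI understanding of physical principles.
Result: On PhysBench, Qwen2.5VL tuned with PhysGame improves real-world physical reasoning performance by 2.5%; on MVBench, general multimodal reasoning performance improves by 1.9%; on the authors' own GameBench, robustness in detecting physical implausibilities improves by an absolute 3.7%.
Insight: The work innovatively uses visual glitches that violate physical laws in gameplay videos as a supervision signal, builds a large-scale, high-quality instruction-tuning dataset, and ensures data accuracy through a meta-information-guided prompting strategy, providing a scalable new route to physical world understanding.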
Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
[32] Multi-View Consistent Wound Segmentation With Neural Fields cs.CVPDF
Remi Chierchia, Léo Lebrat, David Ahmedt-Aristizabal, Yulia Arzhaeva, Olivier Salvado
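TL;DR: This paper presents WoundNeRF, a NeRF SDF-based method that estimates robust, multi-view consistent wound segmentations from automatically generated annotations, addressing the challenge of inferring 3D structure from 2D images.
Details
Motivation: Wound care carries economic and logistical burdens, and computer vision and machine learning algorithms can offer support. Wound segmentation enables fast, automatic tissue assessment, but existing methods struggle to infer multi-view consistent 3D structure from 2D images.
Result: Comparisons against state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms show the method's potential for recovering accurate segmentations. The code will be released to facilitate further development.
Insight: The novelty lies in using a NeRF SDF to achieve multi-view consistent 3D wound segmentation, learning robust representations from automatic annotations. Viewed objectively, the method combines neural fields with the segmentation task and offers a new paradigm for medical image processing.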
Abstract: Wound care is often challenged by the economic and logistical burdens that consistently afflict patients and hospitals worldwide. In recent decades, healthcare professionals have sought support from computer vision and machine learning algorithms. In particular, wound segmentation has gained interest due to its ability to provide professionals with fast, automatic tissue assessment from standard RGB images. Some approaches have extended segmentation to 3D, enabling more complete and precise healing progress tracking. However, inferring multi-view consistent 3D structures from 2D images remains a challenge. In this paper, we evaluate WoundNeRF, a NeRF SDF-based method for estimating robust wound segmentations from automatically generated annotations. We demonstrate the potential of this paradigm in recovering accurate segmentations by comparing it against state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms. The code will be released to facilitate further development in this promising paradigm.
[33] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer cs.CVPDF
Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao
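TL;DR: This paper proposes SALAD, which introduces an input-dependent gating mechanism between a linear attention branch and a sparse attention branch. It achieves 90% sparsity and a 1.72x inference speedup in video diffusion Transformers while maintaining generation quality comparable to the full attention baseline.
Details
Motivation: The quadratic complexity of full attention causes high latency in video diffusion Transformers. Among existing sparse attention methods, training-free approaches reach only limited sparsity, while training-based approaches carry high data and computation costs.
Result: On video generation, SALAD reaches 90% sparsity and a 1.72x inference speedup with generation quality comparable to the full attention baseline, and fine-tuning requires only 2,000 video samples and 1,600 training steps (batch size 8).
Insight: The novelty lies in the joint design of a parallel lightweight linear attention branch and an input-dependent gating mechanism, which balances high sparsity with efficient fine-tuning and offers a new approach to efficient attention in video diffusion models.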
Abstract: Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
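To make the idea of an input-dependent gate between two attention branches concrete, here is a minimal sketch for a single token. The abstract does not give the gating formula, so the sigmoid gate computed from the token features, the additive combination `sparse + gate * linear`, and the names `gated_mix`, `gate_weights`, and `gate_bias` are assumptions for illustration rather than the SALAD formulation; the two branch outputs are passed in as precomputed vectors.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_mix(x, sparse_out, linear_out, gate_weights, gate_bias=0.0):
    """Blend two branch outputs with a scalar gate computed from the input token x.

    Assumed form: gate = sigmoid(w . x + b); output = sparse_out + gate * linear_out.
    """
    gate = sigmoid(sum(w * xi for w, xi in zip(gate_weights, x)) + gate_bias)
    mixed = [s + gate * l for s, l in zip(sparse_out, linear_out)]
    return mixed, gate

# Toy values: a zero input gives a gate of exactly 0.5.
x = [0.0, 0.0]
sparse_out = [1.0, 2.0]    # stand-in for the sparse attention output
linear_out = [0.5, -1.0]   # stand-in for the linear attention output
mixed, gate = gated_mix(x, sparse_out, linear_out, gate_weights=[1.0, 1.0])
```

Under this assumed form, the linear branch acts as a learned correction to the sparse branch, and the gate lets each token decide how much of that correction to apply.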
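[34] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning cs.CV | cs.AI | cs.CL
Daixian Liu, Jiayi Kuang, Yinghui Li, Yangning Li, Di Yin
TL;DR: This paper proposes the TangramPuzzle benchmark for evaluating compositional spatial reasoning in multimodal large language models (MLLMs). Built on the Tangram game, it uses a symbolic geometric framework, the Tangram Construction Expression (TCE), to describe piece placement exactly, and defines two tasks: Outline Prediction and End-to-End Code Generation. Experiments show that MLLMs tend to match the target silhouette while ignoring geometric constraints, which deforms the pieces.
Details
Motivation: Existing benchmarks fall short in evaluating the compositional spatial reasoning of MLLMs: the tasks are relatively simple, rely on semantic approximation or coarse relative positioning, and use limited metrics that lack rigorous mathematical formulation.
Result: Extensive evaluation on advanced open-source and proprietary models shows that MLLMs tend to prioritize matching the target silhouette over geometric constraints, which leads to distorted or deformed pieces.
Insight: The novelty is TCE, a geometry-grounded, machine-verifiable symbolic framework for representing spatial relations precisely, together with complementary tasks that evaluate compositional reasoning comprehensively. Viewed objectively, the work provides a rigorous, quantifiable benchmark for precise spatial understanding in MLLMs and exposes a common failure in geometric consistency.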
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.
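[35] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding cs.CV
Zixian Liu, Zhaoxi Chen, Liang Pan, Ziwei Liu
TL;DR: This paper proposes OnlineSI, a framework that lets multimodal large language models (MLLMs) continuously understand and ground a 3D environment from a video stream. The core idea is to maintain a finite spatial memory of past observations so that computation does not grow as input accumulates, and to fuse 3D point clouds with semantic information to improve object localization and recognition in the scene.
Details
Motivation: Most existing methods overlook the ability to work continuously in an ever-changing world and cannot be deployed on real-world embodied systems, so a framework that continuously improves its spatial understanding is needed.
Result: Experiments on two representative datasets demonstrate the method's effectiveness. A Fuzzy F1-Score is introduced to mitigate ambiguity in evaluation, and the authors present the work as a step toward real-world embodied systems.
Insight: The novelty is a finite spatial memory that enables online incremental learning without growth in computation, combined with fusing 3D point clouds and semantic information for stronger spatial grounding. Viewed objectively, it offers a practical solution for continuous spatial understanding by MLLMs in dynamic environments.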
Abstract: In recent years, researchers have increasingly been interested in how to enable Multimodal Large Language Models (MLLM) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the importance of the ability to continuously work in an ever-changing world, and lack the possibility of deployment on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations, ensuring the computation required for each inference does not increase as the input accumulates. We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_1$-Score to mitigate ambiguity, and test our method on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.
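[36] Semi-Supervised Hierarchical Open-Set Classification cs.CV | cs.LG
Erik Wallin, Fredrik Kahl, Lars Hammarstrand
TL;DR: This paper proposes a semi-supervised hierarchical open-set classification method that uses a teacher-student framework with pseudo-labeling to improve classification performance from unlabeled data containing both known and unknown classes.
Details
Motivation: To handle unknown classes within a class hierarchy and extend the setting to semi-supervised learning, so that large-scale unlabeled data can improve hierarchical open-set classification.
Result: On the iNaturalist19 benchmark, with only 20 labeled samples per class, the method outperforms self-supervised pretraining followed by supervised adaptation and even matches the fully supervised counterpart.
Insight: The novel components are subtree pseudo-labels, which provide reliable supervision, and age-gating, which mitigates overconfidence in pseudo-labels. Both may transfer to other semi-supervised and open-set learning settings.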
Abstract: Hierarchical open-set classification handles previously unseen classes by assigning them to the most appropriate high-level category in a class taxonomy. We extend this paradigm to the semi-supervised setting, enabling the use of large-scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open-set performance. To this end, we propose a teacher-student framework based on pseudo-labeling. Two key components are introduced: 1) subtree pseudo-labels, which provide reliable supervision in the presence of unknown data, and 2) age-gating, a mechanism that mitigates overconfidence in pseudo-labels. Experiments show that our framework outperforms self-supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available at https://github.com/walline/semihoc.
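[37] X-Aligner: Composed Visual Retrieval without the Bells and Whistles cs.CV
Yuqian Zheng, Mariana-Iuliana Georgescu
TL;DR: This paper proposes a new composed video retrieval (CoVR) framework built around X-Aligner, a cross-attention module that progressively fuses the visual and textual queries and aligns their multimodal representation with the target video. Built on BLIP-family architectures and trained in two stages on Webvid-CoVR, it achieves a state-of-the-art Recall@1 of 63.93% on Webvid-CoVR-Test and shows strong zero-shot generalization on composed image retrieval (CIR) datasets such as CIRCO and Fashion-IQ.
Details
Motivation: Existing CoVR frameworks typically fuse multimodal inputs in a single stage, which yields limited gains. The goal is to exploit the representational power of vision-language models (VLMs) more effectively.
Result: The framework reaches a state-of-the-art Recall@1 of 63.93% on Webvid-CoVR-Test and shows strong zero-shot generalization on CIR datasets such as CIRCO and Fashion-IQ.
Insight: The contributions are: 1) the X-Aligner cross-attention module, which progressively fuses and aligns visual and textual inputs; 2) the caption of the visual query as an additional input that enriches the multimodal query representation; 3) a two-stage training strategy that preserves the pretrained VLM representation. Viewed objectively, the modular design and progressive alignment are an effective route to better multimodal retrieval.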
Abstract: Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel module, X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architectures, namely BLIP and BLIP-2, and train it on the Webvid-CoVR dataset. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) datasets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
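For readers unfamiliar with the Recall@1 figure quoted above, the metric can be sketched as follows. This is a generic illustration, not the paper's evaluation code: the `similarity` matrix (one row of scores per query, one column per candidate) and the `targets` list of correct candidate indices are hypothetical inputs.

```python
def recall_at_k(similarity, targets, k=1):
    """Fraction of queries whose correct candidate appears among the top-k scores."""
    hits = 0
    for scores, target in zip(similarity, targets):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

# Toy example: 3 queries, 3 candidate videos.
similarity = [
    [0.9, 0.1, 0.3],   # query 0: target 0 is ranked first
    [0.2, 0.5, 0.8],   # query 1: target 1 is ranked second
    [0.4, 0.6, 0.1],   # query 2: target 1 is ranked first
]
targets = [0, 1, 1]
print(recall_at_k(similarity, targets, k=1))
print(recall_at_k(similarity, targets, k=2))
```

In this toy case two of three queries are hits at k=1, and all three are hits at k=2.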
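[38] Reliable Brain Tumor Segmentation Based on Spiking Neural Networks with Efficient Training cs.CV | cs.NE
Aurora Pia Ghiardelli, Guangzhi Tang, Tao Sun
TL;DR: This paper proposes a reliable, energy-efficient framework for 3D brain tumor segmentation based on spiking neural networks (SNNs). A multi-view ensemble of sagittal, coronal, and axial SNN models provides voxel-wise uncertainty estimates and improves segmentation robustness. To reduce the high computational cost of training SNNs for semantic image segmentation, the authors adopt Forward Propagation Through Time (FPTT), which keeps temporal learning efficient while substantially reducing computational overhead.
Details
Motivation: To meet the need for highly reliable, low-energy computation in medical image segmentation, and in particular to address the high training cost of SNNs for brain tumor segmentation.
Result: On the BraTS 2017 and BraTS 2023 benchmarks, the framework achieves competitive segmentation accuracy and well-calibrated uncertainty while reducing FLOPs by 87%.
Insight: The novelty is using a multi-view SNN ensemble for uncertainty estimation to improve reliability, and training the SNNs efficiently with FPTT to cut computational cost substantially. This shows the potential of SNNs for reliable, low-power medical IoT and point-of-care systems.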
Abstract: We propose a reliable and energy-efficient framework for 3D brain tumor segmentation using spiking neural networks (SNNs). A multi-view ensemble of sagittal, coronal, and axial SNN models provides voxel-wise uncertainty estimation and enhances segmentation robustness. To address the high computational cost in training SNN models for semantic image segmentation, we employ Forward Propagation Through Time (FPTT), which maintains temporal learning efficiency with significantly reduced computational cost. Experiments on the Multimodal Brain Tumor Segmentation Challenges (BraTS 2017 and BraTS 2023) demonstrate competitive accuracy, well-calibrated uncertainty, and an 87% reduction in FLOPs, underscoring the potential of SNNs for reliable, low-power medical IoT and Point-of-Care systems.
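One simple way a three-view ensemble could yield voxel-wise uncertainty is to average the per-view foreground probabilities and take the binary entropy of the mean. The abstract does not say how the authors combine views or define uncertainty, so the averaging rule, the entropy measure, and the function names below are assumptions for illustration only.

```python
import math

def ensemble_voxel(view_probs):
    """Mean foreground probability and binary entropy (in nats) for one voxel."""
    p = sum(view_probs) / len(view_probs)
    eps = 1e-12
    q = min(max(p, eps), 1.0 - eps)          # clamp to keep log() finite
    entropy = -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))
    return p, entropy

def ensemble_volume(sagittal, coronal, axial):
    """Apply the voxel rule to three flattened probability volumes of equal length."""
    means, uncertainties = [], []
    for probs in zip(sagittal, coronal, axial):
        p, h = ensemble_voxel(probs)
        means.append(p)
        uncertainties.append(h)
    return means, uncertainties
```

Voxels where the three views agree confidently get entropy near zero, while voxels where the mean probability sits near 0.5 get entropy near ln 2, the maximum for a binary decision.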
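[39] REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion cs.CV | cs.AI
Xuewei Li, Xinghan Bao, Zhimin Chen, Xi Li
TL;DR: This paper proposes REL-SF4PASS, a panoramic semantic segmentation method that introduces the cylindrical-coordinate REL depth representation and a Spherical-dynamic Multi-Modal Fusion (SMMF) module to exploit the geometry of panoramic images more fully, improving segmentation performance and robustness to 3D disturbance.
Details
Motivation: Existing panoramic semantic segmentation methods mostly focus on spherical geometry or use depth in its original or HHA format, and so do not fully exploit panoramic image geometry. The paper addresses this with a new depth representation and fusion strategy.
Result: On the Stanford2D3D Panoramic dataset, average mIoU improves by 2.35% across all 3 folds, and performance variance under 3D disturbance falls by about 70%.
Insight: The novelty is the REL depth representation (Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle), which describes 3D space and surface normal direction in cylindrical coordinates, and the SMMF module, which applies different fusion strategies to different regions of the panorama to reduce the breakage introduced by unrolling the cylinder side surface in ERP projection.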
Abstract: As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods focus on spherical geometry with RGB input or use depth information in its original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinates and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully represents 3D space in cylindrical coordinate style and the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and reduce the breakage of cylinder side surface expansion in ERP projection, using different fusion strategies to match the different regions in panoramic images. Experimental results show that REL-SF4PASS considerably improves performance and robustness on the popular Stanford2D3D Panoramic benchmark. It gains a 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.
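[40] Incorporating Eye-Tracking Signals Into Multimodal Deep Visual Models For Predicting User Aesthetic Experience In Residential Interiors cs.CV | cs.AI
Chen-Ying Chien, Po-Chih Kuo
TL;DR: This study proposes a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict users' aesthetic experience of residential interiors. The authors collected a multimodal dataset of 224 interior design videos with synchronized eye-tracking data from 28 participants. The model reaches 72.2% accuracy on objective aesthetic dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming existing video baselines.
Details
Motivation: Predicting users' aesthetic experience of interior spaces is difficult because perception is subjective and visual responses are complex. The study aims to model that experience more accurately by incorporating eye-tracking signals.
Result: Evaluated on a dataset covering 15 aesthetic dimensions, the model reaches 72.2% accuracy on objective dimensions and 66.8% on subjective dimensions, outperforming state-of-the-art video baselines, with clear gains on subjective evaluation tasks.
Insight: The novelty is using eye-tracking signals as privileged information at training time within a multimodal deep visual model to improve aesthetic prediction. The ablations indicate that pupil responses contribute most to objective assessments, while combining gaze with visual features improves subjective assessments. After training, the model retains comparable performance with visual input alone, which makes it more practical.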
Abstract: Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.
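[41] Evaluating Large Vision-language Models for Surgical Tool Detection cs.CV | cs.AI
Nakul Poudel, Richard Simon, Cristian A. Linte
TL;DR: This study evaluates three advanced large vision-language models (Qwen2.5, LLaVA1.5, and InternVL3.5) on surgical tool detection using the GraSP robotic surgery dataset, under both zero-shot and LoRA fine-tuning settings. Qwen2.5 achieves the best detection performance in both configurations, shows stronger zero-shot generalization than the open-set detection baseline Grounding DINO, and performs comparably after fine-tuning.
Details
Motivation: Most current AI systems are unimodal, which limits their holistic understanding of surgical workflows, so general-purpose surgical AI systems that model the interrelated components of a surgical scene are needed. Large vision-language models are promising for integrating multimodal data, but systematic studies of them in surgical applications remain limited. This study evaluates their effectiveness on a fundamental surgical vision task, surgical tool detection.
Result: On the GraSP robotic surgery dataset, Qwen2.5 outperforms LLaVA1.5 and InternVL3.5 in both the zero-shot and LoRA fine-tuning settings. Compared with the Grounding DINO baseline, Qwen2.5 shows stronger zero-shot generalization and comparable fine-tuned performance; Qwen2.5 is better at instrument recognition, while Grounding DINO is stronger at localization.
Insight: The novelty is the first systematic evaluation of large vision-language models on surgical tool detection. It reveals Qwen2.5's advantages in zero-shot generalization and recognition, points out that the models are complementary across recognition and localization, and provides an empirical basis for developing general-purpose surgical AI systems.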
Abstract: Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.
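[42] LoL: Longer than Longer, Scaling Video Generation to Hour cs.CV | cs.AI
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li
TL;DR: This paper targets error accumulation and loss of long-term coherence in autoregressive long video generation, in particular the sink-collapse phenomenon caused by attention sink frames. It proposes a lightweight, training-free method that introduces multi-head RoPE jitter to break the homogenization of attention across heads and thereby mitigate long-horizon collapse. The method enables real-time, streaming, infinite-length video generation and demonstrates continuous videos up to 12 hours long.
Details
Motivation: Research on long video generation has shifted from bidirectional to autoregressive models, which commonly suffer from error accumulation and loss of long-term coherence. Attention sink frames were introduced to mitigate this decay, but they often induce a critical failure mode, sink-collapse: the generated content repeatedly reverts to the sink frame, causing abrupt scene resets and cyclic motion patterns.
Result: Extensive experiments show that the method alleviates sink-collapse while preserving generation quality. To the authors' knowledge, this is the first demonstration of real-time, streaming, infinite-length video generation with little quality decay; the 12-hour continuous videos are among the longest publicly demonstrated results in streaming video generation.
Insight: The novelty is showing that sink-collapse originates from an inherent conflict between the periodic structure of RoPE and the multi-head attention mechanisms prevalent in current generative models, and proposing a training-free multi-head RoPE jitter to resolve it. Viewed objectively, the method is a lightweight and effective solution that lifts the duration limit on long video generation and has practical significance.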
Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
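[43] Reward-Forcing: Autoregressive Video Generation with Reward Feedback cs.CV | cs.LG
Jingran Zhang, Ning Li, Yuanhao Ban, Andrew Bai, Justin Cui
TL;DR: This paper proposes Reward-Forcing, a new method for autoregressive video generation. It uses reward signals to guide the generation process, which simplifies training while preserving high visual fidelity and temporal consistency. On standard benchmarks it performs comparably to existing autoregressive models and in some cases surpasses similarly sized bidirectional models.
Details
Motivation: Existing autoregressive video generation models depend heavily on teacher models, which limits performance (especially when no strong autoregressive teacher is available) and leaves output quality behind that of bidirectional models.
Result: On VBench, the method achieves a total score of 84.92, comparable to state-of-the-art autoregressive methods that score 84.31 but require substantial heterogeneous distillation, and in some cases it surpasses similarly sized bidirectional models.
Insight: The novelty is guiding autoregressive generation with reward signals rather than a teacher model, which simplifies the training pipeline and avoids constraints imposed by teacher architectures, enabling more efficient and scalable generation. Viewed objectively, it is a lightweight training strategy that brings reinforcement learning ideas into video generation.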
Abstract: While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically lags behind their bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. By using reward signals to guide the model, our method simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.
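[44] VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents cs.CV
Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu
TL;DR: This paper introduces VisGym, an evaluation and training platform of 17 diverse environments (spanning symbolic puzzles, real-image understanding, navigation, and manipulation) for systematically evaluating and improving vision-language models in multi-step visual interaction. Frontier models perform poorly in interactive settings, and the study identifies specific limitations in long-context use and in visually rendered tasks, along with improvements from explicit goal observations, textual feedback, and supervised fine-tuning.
Details
Motivation: The ability of modern vision-language models in multi-step visual interaction (long-horizon integration of perception, memory, and action) is poorly characterized, and systematic evaluation and training environments are lacking. A diverse, customizable, scalable platform is needed to analyze the limitations in depth and drive improvement.
Result: In VisGym evaluations, all frontier models have low success rates in interactive settings (46.6% on the easy configuration and 26.0% on the hard one). Models use long context poorly, performing worse with an unbounded history than with truncated windows, and visually rendered symbolic tasks are substantially harder than their text-only versions.
Insight: The novelty is a standardized evaluation and training platform that covers multiple task types and offers flexible controls (difficulty, input representation, planning horizon, and feedback), plus multi-step solvers that generate structured demonstrations for supervised fine-tuning. The analysis shows that explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings yield consistent gains, which gives concrete pathways for improving multi-step visual decision-making.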
Abstract: Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
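[45] AnyView: Synthesizing Any Novel View in Dynamic Scenes cs.CV | cs.LG | cs.RO
Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad
TL;DR: This paper proposes AnyView, a diffusion-based video generation framework for novel view synthesis in dynamic scenes. It trains a generalist spatiotemporal implicit representation on data sources with different levels of supervision (monocular 2D, multi-view static 3D, and multi-view dynamic 4D datasets) and can generate zero-shot novel videos from arbitrary camera locations and trajectories.
Details
Motivation: Modern generative video models produce high-quality outputs but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. The paper addresses dynamic view synthesis while reducing reliance on inductive biases and geometric assumptions.
Result: On standard benchmarks, AnyView is competitive with the current state of the art. The authors also propose AnyViewBench, a new and more challenging benchmark for extreme dynamic view synthesis. On it, most baselines degrade drastically, while AnyView still generates realistic, plausible, and spatiotemporally consistent videos from any viewpoint.
Insight: The novelty is jointly training a generalist spatiotemporal implicit representation on multiple data sources (2D/3D/4D) to achieve zero-shot, arbitrary-view video generation for highly dynamic scenes. The key is reducing dependence on viewpoint overlap, so that performance holds under extreme viewpoint changes, which offers a new direction for generative modeling of dynamic scenes.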
Abstract: Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/
eess.IV
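[46] PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation eess.IV | cs.CV | eess.SP | stat.AP
Ming Kang, Fung Fung Ting, Raphaël C. -W. Phan, Zongyuan Ge, Chee-Ming Ting
TL;DR: This paper proposes PanopMamba, a new hybrid encoder-decoder architecture for nuclei panoptic segmentation in histopathology images. It integrates Mamba and Transformer and adds feature-enhanced fusion via state space modeling to address small-object detection, ambiguous boundaries, and class imbalance. Experiments on the MoNuSAC2020 and NuInsSeg benchmarks show that it outperforms existing state-of-the-art methods.
Details
Motivation: Nuclei panoptic segmentation is important for cancer diagnostics but faces challenges in small-object detection, ambiguous boundaries, and class imbalance. Existing methods fall short on these, so a new architecture that handles long-range dependencies efficiently and strengthens multiscale feature representation is needed.
Result: On the two multiclass nuclei segmentation benchmarks MoNuSAC2020 and NuInsSeg, PanopMamba outperforms existing state-of-the-art methods across multiple evaluation metrics, which validates its robustness.
Insight: The main contributions are: the first application of the Mamba architecture to panoptic segmentation; a multiscale Mamba backbone and a state space model (SSM)-based fusion network for efficient long-range perception and cross-scale information sharing; and alternative evaluation metrics tailored to nuclei segmentation (iPQ, wPQ, and fwPQ) that mitigate potential bias in the standard Panoptic Quality metric.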
Abstract: Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at https://github.com/mkang315/PanopMamba.
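As background for the PQ variants mentioned above, the standard (vanilla) Panoptic Quality for one class is the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch below computes only this vanilla form; it does not implement the paper's iPQ, wPQ, or fwPQ, and the `iou` matrix layout and function name are assumptions for illustration.

```python
def panoptic_quality(iou, n_gt, iou_threshold=0.5):
    """Vanilla PQ for one class.

    iou[i][j] is the IoU between predicted segment i and ground-truth segment j;
    n_gt is the number of ground-truth segments.
    """
    matched_ious = []
    matched_gt = set()
    for row in iou:
        for j in range(n_gt):
            # With IoU > 0.5 and non-overlapping segments, each match is unique.
            if row[j] > iou_threshold and j not in matched_gt:
                matched_ious.append(row[j])
                matched_gt.add(j)
                break
    tp = len(matched_ious)
    fp = len(iou) - tp           # predictions with no match
    fn = n_gt - len(matched_gt)  # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0

# Three predictions, two ground-truth nuclei: two matches and one false positive.
iou = [
    [0.8, 0.0],
    [0.0, 0.6],
    [0.1, 0.2],
]
print(panoptic_quality(iou, n_gt=2))
```

Here the matched IoUs sum to 1.4 and the denominator is 2 + 0.5, so the example evaluates to roughly 0.56.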
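[47] PocketDVDNet: Realtime Video Denoising for Real Camera Noise eess.IV | cs.CV
Crispian Morris, Imogen Dexter, Fan Zhang, David R. Bull, Nantheera Anantrasirichai
TL;DR: This paper proposes PocketDVDNet, a lightweight video denoising network that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation to cut model complexity substantially while keeping high-quality restoration, enabling real-time processing.
Details
Motivation: To solve real-time video denoising under realistic multi-component sensor noise, for applications such as autofocus, autonomous driving, and surveillance.
Result: Model size is reduced by 74%, denoising quality improves, and 5-frame patches are processed in real time, giving efficient performance under realistic noise.
Insight: The novelty is combining a model compression framework with domain-adapted distillation so that the student network learns to handle noise implicitly, with no explicit noise-map input, balancing performance and efficiency.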
Abstract: Live video denoising under realistic, multi-component sensor noise remains challenging for applications such as autofocus, autonomous driving, and surveillance. We propose PocketDVDNet, a lightweight video denoiser developed using our model compression framework that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation to achieve high-quality restoration with reduced resource demands. Starting from a reference model, we induce sparsity, apply targeted channel pruning, and retrain a teacher on realistic multi-component noise. The student network learns implicit noise handling, eliminating the need for explicit noise-map inputs. PocketDVDNet reduces the original model size by 74% while improving denoising quality and processing 5-frame patches in real-time. These results demonstrate that aggressive compression, combined with domain-adapted distillation, can reconcile performance and efficiency for practical, real-time video denoising.
cs.SD
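[48] SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models cs.SD | cs.AI | cs.CL | cs.LG | eess.AS
Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam, Zaber Hakim, Chris Thomas
TL;DR: This paper systematically studies audio-only adversarial attacks on trimodal audio-video-language models. By analyzing six attack objectives that target different stages of multimodal processing, it shows across three SOTA models and multiple benchmarks that perturbing audio alone can cause severe multimodal failures, with attack success rates up to 96%. The attacks remain effective at low perceptual distortion, and optimization length matters more than data scale.
Details
Motivation: Multimodal foundation models perform strongly on reasoning and generation tasks, but their adversarial robustness is poorly understood. The paper explores a realistic and underexplored threat model, untargeted audio-only adversarial attacks on trimodal models, to expose an overlooked single-modality attack surface in multimodal systems.
Result: Across three SOTA trimodal models and multiple benchmarks, audio adversarial attacks succeed at rates up to 96%. Attacks still succeed at low perceptual distortion (LPIPS <= 0.08, SI-SNR >= 0), and extended optimization helps more than increased data scale. Transferability across models and encoders is limited, while speech recognition systems such as Whisper reach >97% attack success under severe distortion.
Insight: The novelty is a systematic set of six attack objectives that target different processing stages of trimodal models (audio encoder representations, cross-modal attention, hidden states, and output likelihoods), showing that audio-only perturbations suffice to break multimodal systems. Viewed objectively, the study underlines the need for defenses that enforce cross-modal consistency and for robustness evaluations that account for single-modality attack surfaces.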
Abstract: Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
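The SI-SNR threshold quoted above refers to the scale-invariant signal-to-noise ratio, whose standard definition projects the estimate onto the reference and compares the energy of that projection with the energy of the residual. The sketch below is a generic pure-Python version, not code from the paper; treating the clean audio as `reference` and the perturbed audio as `estimate` is an assumption about how the metric is applied here.

```python
import math

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]      # remove DC offset
    r = [x - mean_r for x in reference]
    dot_er = sum(a * b for a, b in zip(e, r))
    energy_r = sum(b * b for b in r)
    scale = dot_er / (energy_r + eps)
    target = [scale * b for b in r]          # projection of estimate onto reference
    noise = [a - t for a, t in zip(e, target)]
    energy_t = sum(t * t for t in target)
    energy_n = sum(v * v for v in noise)
    return 10.0 * math.log10((energy_t + eps) / (energy_n + eps))

clean = [1.0, -1.0, 1.0, -1.0]
perturbed = [1.5, -0.5, 0.5, -1.5]          # clean plus an orthogonal perturbation
print(si_snr(perturbed, clean))
```

An SI-SNR of 0 dB means the projected signal and the residual carry equal energy, so a bound of SI-SNR >= 0 keeps the perturbation no louder than the signal component it rides on.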
cs.RO
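[49] ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance cs.RO | cs.CV
Zhuohao Li, Yinghao Li, Jian-Jian Jiang, Lang Zhou, Tianyu Zhang
TL;DR: This paper proposes ReViP, a framework that uses a vision-proprioception rebalancing mechanism to address false completion caused by modality imbalance in vision-language-action (VLA) models. It uses an external VLM to extract task-centric visual cues, strengthens environmental awareness through feature-wise linear modulation, and is shown on multiple benchmarks to lower false-completion rates and raise task success rates.
Details
Motivation: Existing VLA models fuse proprioceptive signals directly with vision-language features, so the policy over-relies on internal state and neglects visual evidence, producing state-dominant bias and false completions. Resolving this modality imbalance is needed to improve robustness.
Result: On a false-completion benchmark built on LIBERO, ReViP significantly lowers the false-completion rate and raises the success rate relative to existing VLA baselines, with the gains confirmed on LIBERO, RoboTwin 2.0, and real-world evaluations.
Insight: The novelty is introducing task-aware environment priors: an external VLM acts as a task-stage observer that extracts real-time visual cues, and a vision-proprioception feature-wise linear modulation adaptively regulates the coupling between perception and dynamics, which strengthens visual grounding and reduces state-driven errors. Viewed objectively, explicitly modeling task-stage visual information is what mitigates the modality bias.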
Abstract: Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We attribute this to modality imbalance, where policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations, which drive a Vision-Proprioception Feature-wise Linear Modulation to enhance environmental awareness and reduce state-driven errors. Moreover, to evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
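Feature-wise linear modulation (FiLM), which the abstract names as the mechanism driven by the visual cues, scales and shifts each feature channel with parameters predicted from a conditioning vector. The sketch below shows that generic operation only. Which features ReViP modulates and how it produces the scale and shift are not specified in the abstract, so the `cue` vector, the linear projections, and all weights here are hypothetical.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(features, cue, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: out_c = gamma_c(cue) * features_c + beta_c(cue)."""
    gamma = linear(w_gamma, b_gamma, cue)   # per-channel scale from the conditioning cue
    beta = linear(w_beta, b_beta, cue)      # per-channel shift from the conditioning cue
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

features = [1.0, 2.0]          # e.g. a proprioceptive feature vector (assumed)
cue = [1.0, 0.0]               # e.g. a task-stage cue embedding (assumed)
out = film(
    features, cue,
    w_gamma=[[2.0, 0.0], [0.0, 3.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.5, 0.0], [0.0, 0.0]], b_beta=[0.0, -1.0],
)
print(out)
```

With zero projection weights, a scale bias of 1, and a shift bias of 0, the layer reduces to the identity, which is a common way to initialize such modulation so that training starts from the unmodulated features.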
cs.LG
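[50] Endless Terminals: Scaling RL Environments for Terminal Agents cs.LG | cs.CL
Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
TL;DR: This paper proposes Endless Terminals, a scalable pipeline that generates reinforcement learning environments for training terminal agents. The pipeline automatically produces diverse terminal tasks (such as file operations and data processing) without human annotation. Agents trained with plain PPO on the generated tasks improve substantially on both the held-out internal set and external benchmarks such as TerminalBench 2.0, which shows how much environment scale matters for simple RL methods.
Details
Motivation: Current benchmarks for terminal agents were designed for evaluation rather than training, which is a bottleneck for self-improving agents. Reinforcement learning needs a scalable pipeline for generating training environments, not just a static dataset.
Result: On the held-out internal dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. On the external TerminalBench 2.0 benchmark, the same models also improve and outperform alternatives that use more complex agent scaffolds.
Insight: The core contribution is a fully automatic, scalable environment generation pipeline that removes the training-data bottleneck by generating tasks procedurally. The results indicate that when environments are scaled up, a simple RL algorithm such as PPO with a minimal interaction loop (no retrieval or multi-agent coordination) is enough for substantial gains, which challenges the common view that better performance requires more complex agents.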
Abstract: Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out, human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
[51] Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs cs.LG | cs.AI | cs.CL | cs.CV
Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi
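TL;DR: This paper addresses object hallucination in multimodal large language models with SARE, a robust unlearning method. SARE casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape, so that hallucinated concepts are removed in a geometrically stable way and do not catastrophically resurge after lightweight relearning.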
Details
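Motivation: Multimodal LLMs suffer from object hallucination, and existing unlearning methods are structurally fragile: they achieve only superficial suppression, leaving the model in sharp minima where hallucinations catastrophically resurge after relearning.
Result: Extensive experiments show that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality, and that its hallucination suppression persists under relearning and parameter updates.
Insight: The novelty lies in formalizing unlearning as a targeted min-max optimization and introducing the Targeted-SAM mechanism for geometric stabilization, which simulates worst-case parameter perturbations to ensure robust removal. This offers a stability-oriented perspective for model editing and safety.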
Abstract: Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
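For readers unfamiliar with the min-max idea behind sharpness-aware training, the sketch below shows one step of generic sharpness-aware minimization (SAM) on a toy loss. It is not the paper's Targeted-SAM, whose targeting of hallucinated concepts is not specified in the abstract; the function name `sam_step`, the parameters `rho` and `lr`, and the list-of-floats weight representation are illustrative assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]


def sam_step(
    w: Vector,
    grad_fn: Callable[[Vector], Vector],
    rho: float = 0.05,
    lr: float = 0.1,
) -> Vector:
    """One generic SAM update: perturb toward higher loss, then descend."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction and no descent step.
        return list(w)
    # Inner maximization (first-order approximation): step of length rho
    # along the normalized gradient, i.e. toward the locally worst-case weights.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Outer minimization: apply the gradient taken at the perturbed weights
    # to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the toy loss L(w) = 0.5 * ||w||^2, used only for illustration."""
    return list(w)
```

Because the descent direction is evaluated at the perturbed point, the update favors regions where the loss stays low under small weight shifts, which is the flatness property the abstract associates with resistance to relearning.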
cs.AI [Back]
[52] Reasoning Promotes Robustness in Theory of Mind Tasks cs.AI | cs.CL
Ian B. de Haan, Peter van der Putten, Max van Duijn
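TL;DR: This paper studies how reasoning-oriented large language models trained via reinforcement learning with verifiable rewards (RLVR) behave on Theory of Mind (ToM) tasks, and finds that they are more robust to prompt variations and task perturbations.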
Details
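Motivation: To examine the behavior of reasoning-oriented LLMs on ToM tasks in order to clarify the nature and true performance of their underlying capabilities, and to inform the debate over whether LLMs genuinely possess Theory of Mind.
Result: Across novel adaptations of machine psychology experiments and established benchmarks, reasoning models show consistently increased robustness to prompt variations and task perturbations.
Insight: The novelty lies in evaluating RLVR-trained reasoning models on ToM tasks and in showing that the gains are more plausibly due to increased robustness in finding the correct solution than to fundamentally new forms of ToM reasoning. This offers a new perspective for evaluating social-cognitive behavior in LLMs.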
Abstract: Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.