cs.CL

[1] TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding

Kimihiro Hasegawa,Wiradee Imrattanatrai,Masaki Asada,Ken Fukuda,Teruko Mitamura

Main category: cs.CL

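TL;DR: TAMA is a tool-augmented multimodal agent that uses multimedia-returning tools to enable interleaved multimodal reasoning in a training-free setting, improving vision-language models on procedural activity understanding.

Motivation: Procedural activity assistants could support people in daily life and in professional settings, yet system development tailored to such assistants remains underexplored. TAMA is proposed to address this gap.

Contribution: 1. Proposes the TAMA framework, which supports interleaved multimodal reasoning; 2. Improves task performance with multimedia-returning tools, without any training; 3. Provides empirical support, through ablation studies, for agentic flexible tool selection and multimedia-returning tools.

Method: TAMA performs multimodal reasoning through multimedia-returning tools, requires no additional training, and supports flexible tool selection and interleaved reasoning.

Result: On the ProMQA-Assembly dataset, TAMA improves the performance of vision-language models, especially GPT-5 and MiMo-VL.

Insight: The framework advances the "thinking with images" paradigm for video and multimodal tasks and supports the development of procedural activity assistants.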

Abstract: Procedural activity assistants potentially support humans in a variety of settings, from our daily lives, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored for such an assistant is still underexplored. In this paper, we propose a novel framework, called TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset, ProMQA-Assembly, show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework, multimedia-returning tools and agentic flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking with images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.

[2] PrimeX: A Dataset of Worldview, Opinion, and Explanation

Rik Koncel-Kedziorski,Brihi Joshi,Tim Paek

Main category: cs.CL

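TL;DR: PrimeX is a dataset of worldview, opinion, and explanation data intended to help language models align better with individual users.

Motivation: As language models are adopted more widely, there is a growing need to represent an individual user's belief system to the model in order to improve alignment.

Contribution: Develops the PrimeX dataset, which contains public opinion survey data from 858 US residents together with their written explanations and worldview assessments.

Method: Combines public opinion survey data with written explanations and worldview assessments (the Primal World Belief survey), and analyzes their value for personalizing language models.

Result: Shows the value of belief explanations and worldview information for personalizing language models, opening avenues for further study in both NLP and psychological research.

Insight: PrimeX offers a new resource for studying how individual belief systems affect language model alignment and serves as a bridge for cross-disciplinary research.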

Abstract: As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual’s belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.

[3] Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Shuyue Stella Li,Avinandan Bose,Faeze Brahman,Simon Shaolei Du,Pang Wei Koh,Maryam Fazel,Yulia Tsvetkov

Main category: cs.CL

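TL;DR: The paper introduces the concept of personalized reasoning, points out the limitations of current large language models (LLMs) on just-in-time personalization, and presents the PREFDISCO evaluation methodology, which exposes shortcomings in existing models' interactive capabilities.

Motivation: Current LLM development treats task-solving and preference alignment as separate challenges, so models fail to serve users in just-in-time personalization scenarios (for example, cold-start conditions or privacy constraints). The paper aims to address this problem.

Contribution: 1. Proposes the concept of "personalized reasoning", emphasizing that LLMs must adapt dynamically to user preferences; 2. Develops the PREFDISCO evaluation methodology, which turns static benchmarks into interactive tasks; 3. Reveals the deficiencies of existing models in personalized reasoning.

Method: PREFDISCO uses psychologically grounded personas with sparse preferences to turn static benchmarks into interactive personalization tasks, and evaluates the personalized reasoning ability of LLMs on them.

Result: An evaluation of 21 frontier models across 10 tasks shows that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, while generic responses also fail to serve individual user needs. This suggests that personalized reasoning requires dedicated development.

Insight: Personalized reasoning is a measurable research direction. Current LLMs have fundamental limitations in interactive capability, and further work is needed to meet personalization needs in education, healthcare, and technical domains.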

Abstract: Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don’t know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly – a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs’ interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.

[4] TASER: Translation Assessment via Systematic Evaluation and Reasoning

Monishwaran Maheswaran,Marco Carini,Christian Federmann,Tony Diaz

Main category: cs.CL

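TL;DR: TASER is a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. With a systematic, step-by-step evaluation procedure, it achieves state-of-the-art results on the WMT24 Metrics Shared Task, with the highest system-level soft pairwise accuracy among existing metrics.

Motivation: Existing automated translation quality metrics lack transparency and interpretability. TASER aims to address this with the explicit reasoning capabilities of LRMs while also providing more accurate assessment.

Contribution: 1) Proposes a systematic, LRM-based method for translation quality assessment; 2) Achieves the highest system-level soft pairwise accuracy in both reference-based and reference-free settings; 3) Offers greater interpretability through explicit reasoning.

Method: TASER uses structured prompting templates to guide LRMs through step-by-step evaluation of translation quality. With LRMs, these templates work better than the open-ended prompts that proved optimal for traditional LLMs. The experiments use OpenAI's o3 model with varying reasoning efforts to examine the relationship between reasoning depth and evaluation quality.

Result: On the WMT24 Metrics Shared Task: 1) At the system level, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings; 2) At the segment level, it remains competitive, and its reference-free variant ranks first among all reference-free approaches.

Insight: 1) The explicit reasoning of LRMs improves both the accuracy and the interpretability of evaluation; 2) Structured prompting templates suit LRMs better than the open-ended approaches that were optimal for traditional LLMs; 3) Reasoning depth is related to evaluation quality.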

Abstract: We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.

[5] Judging with Confidence: Calibrating Autoraters to Preference Distributions

Zhuohang Li,Xiaowei Li,Chengyu Huang,Guowang Li,Katayoon Goshvadi,Bo Dai,Dale Schuurmans,Paul Zhou,Hamid Palangi,Yiwen Song,Palash Goyal,Murat Kantarcioglu,Bradley A. Malin,Yuan Xue

Main category: cs.CL

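TL;DR: The paper proposes a method for calibrating probabilistic autoraters so that they better model the preference distribution of a target population, improving their reliability and calibration.

Motivation: Aligning large language models (LLMs) with human values increasingly relies on other LLMs acting as autoraters. Training on discrete preference labels, however, handles subjective, ambiguous, or nuanced tasks poorly, which limits autorater reliability. A method that models the full preference distribution is therefore needed.

Contribution: 1) Proposes a general framework for calibrating probabilistic autoraters to any given preference distribution. 2) Presents two learning methods: direct supervised fine-tuning on dense probabilistic labels, and reinforcement learning on sparse binary labels.

Method: 1) Supervised fine-tuning, for dense, probabilistic label data. 2) Reinforcement learning, for sparse, binary label data.

Result: Autoraters fine-tuned with a distribution-matching objective produce verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, while preserving performance on objective tasks.

Insight: Modeling the full preference distribution improves autorater reliability, especially on subjective and ambiguous tasks. This suggests that future LLM alignment work should pay more attention to distribution modeling than to discrete labels.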

Abstract: The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
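To make the idea of a distribution-matching objective concrete, the following is a minimal sketch and not the authors' training code. It assumes a pairwise setting where the autorater verbalizes the probability that response A is preferred, and it uses the KL divergence between two Bernoulli distributions as one possible matching loss. The function name `bernoulli_kl` and the example numbers are hypothetical.

```python
import math


def bernoulli_kl(target_p: float, predicted_p: float, eps: float = 1e-12) -> float:
    """KL(target || predicted) between two Bernoulli distributions.

    target_p    -- fraction of the target population that prefers response A
    predicted_p -- probability of "A preferred" verbalized by the autorater
    """
    # Clamp the prediction away from 0 and 1 so the logarithm stays finite.
    q = min(max(predicted_p, eps), 1.0 - eps)
    kl = 0.0
    for t, m in ((target_p, q), (1.0 - target_p, 1.0 - q)):
        if t > 0.0:  # terms with zero target mass contribute nothing
            kl += t * math.log(t / m)
    return kl


# Hypothetical item: 70% of the target population prefers response A.
target = 0.7
overconfident_loss = bernoulli_kl(target, 0.99)  # near-certain judgment is penalized
calibrated_loss = bernoulli_kl(target, 0.70)     # matching the distribution gives zero loss
```

Under a discrete-label view, a prediction of 0.99 on this item would look nearly perfect whenever the sampled label happens to be "A"; under the distribution-matching view it incurs a clear penalty because 30% of the population disagrees.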

[6] Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction

Zhexiong Liu,Diane Litman

Main category: cs.CL

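TL;DR: The paper proposes IR-Tuning, a layer-wise parameter-efficient fine-tuning method for LLMs aimed at revision intention prediction, addressing LLMs' weakness on classification tasks and the scarcity of annotated data.

Motivation: Large language models (LLMs) perform very well on text generation but fall short on text classification tasks such as revision intention prediction, and annotated revision data is scarce.

Contribution: Proposes the IR-Tuning framework, which dynamically selects important LLM layers for fine-tuning, achieving parameter efficiency, fast convergence, and low GPU memory consumption.

Method: Applies layer-wise parameter-efficient fine-tuning (PEFT): important layers are selected dynamically from the gradient norm distribution and fine-tuned, while redundant layers are frozen.

Result: Experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions and remains effective on small revision corpora.

Insight: Dynamic layer selection combined with parameter-efficient fine-tuning can improve LLM classification ability, particularly when data is scarce.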

Abstract: Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
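The layer-selection step described above can be illustrated with a small sketch. The abstract only says that layers are chosen from their gradient norm distribution, so the concrete rule used here (keep layers whose gradient norm exceeds the mean across layers) is an assumption, as are the function names and the toy gradients.

```python
import math
from typing import Dict, List


def layer_grad_norms(grads: Dict[str, List[float]]) -> Dict[str, float]:
    """L2 norm of each layer's (flattened) gradient."""
    return {name: math.sqrt(sum(g * g for g in values)) for name, values in grads.items()}


def select_trainable_layers(norms: Dict[str, float]) -> List[str]:
    """Assumed rule: fine-tune layers whose gradient norm is above the mean; freeze the rest."""
    mean_norm = sum(norms.values()) / len(norms)
    return sorted(name for name, norm in norms.items() if norm > mean_norm)


# Toy flattened gradients for a hypothetical 4-layer model.
toy_grads = {
    "layer0": [0.1, 0.1],
    "layer1": [3.0, 4.0],
    "layer2": [0.0, 0.2],
    "layer3": [1.0, 2.0],
}
norms = layer_grad_norms(toy_grads)
trainable = select_trainable_layers(norms)
frozen = sorted(set(toy_grads) - set(trainable))
```

In a real training loop the norms would be recomputed periodically, which is what would make the selection dynamic; the sketch shows a single selection round only.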

[7] CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage

Bowen Wei,Yuan Shen Tay,Howard Liu,Jinhao Pan,Kun Luo,Ziwei Zhu,Chris Jordan

Main category: cs.CL

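TL;DR: CORTEX is a multi-agent LLM architecture for high-stakes alert triage that divides the work among collaborating agents to improve the transparency and accuracy of decisions.

Motivation: Security Operations Centers (SOCs) are overloaded with tens of thousands of alerts per day. Existing approaches either rely on a single model that struggles with noisy enterprise data, or offer too little transparency to be trusted.

Contribution: Introduces CORTEX, a multi-agent LLM architecture that splits triage into behavior analysis, evidence gathering, and reasoning synthesis to improve accuracy and auditability; releases a dataset of fine-grained SOC investigations.

Method: Three kinds of specialized agents are used: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes the findings into an auditable decision.

Result: Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.

Insight: Dividing work among specialized agents helps with noisy data and limited transparency, and suggests a design approach for applying LLMs to highly complex tasks.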

Abstract: Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context-poor, while recent LLM-based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end-to-end – an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.

[8] TokMem: Tokenized Procedural Memory for Large Language Models

Zijun Wu,Yongchang Hao,Lili Mou

Main category: cs.CL

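TL;DR: TokMem is a tokenized procedural memory that gives large language models an efficient way to specify tasks and recall knowledge, avoiding the inefficiency of conventional prompting.

Motivation: Large language models (LLMs) rely heavily on prompts, but prompts must be re-read at every step, scale poorly across tasks, and lack mechanisms for modular reuse. TokMem is proposed to address these problems.

Contribution: Introduces TokMem, a tokenized procedural memory that encodes recurring procedures as compact, trainable embeddings, enabling efficient task execution and behavior control.

Method: TokMem stores procedures as memory tokens. Each token encodes both an address to a procedure and a control signal that steers generation, with constant-size overhead. The backbone model is kept frozen, so new procedures can be added without interfering with existing ones.

Result: On 1,000 atomic-recall tasks and on function-calling tasks for compositional recall, TokMem consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and outperforms fine-tuning with far fewer parameters.

Insight: TokMem offers a scalable and modular alternative to prompt engineering and fine-tuning.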

Abstract: Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall, and on function-calling tasks for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.

[9] LongCodeZip: Compress Long Context for Code Language Models

Yuling Shi,Yichun Qian,Hongyu Zhang,Beijun Shen,Xiaodong Gu

Main category: cs.CL

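TL;DR: LongCodeZip is a dual-stage compression framework designed for code language models. It combines coarse-grained and fine-grained compression to shrink the context substantially without hurting task performance.

Motivation: Code generation must handle long contexts, where high API costs and generation latency are major bottlenecks. Existing context pruning methods overlook code-specific structure and dependencies, which leads to suboptimal performance.

Contribution: Proposes LongCodeZip, a plug-and-play compression framework for code LLMs that achieves high compression ratios while preserving task performance.

Method: A dual-stage strategy: (1) coarse-grained compression ranks function-level chunks by conditional perplexity with respect to the instruction and keeps only the most relevant functions; (2) fine-grained compression segments the retained functions into blocks and selects an optimal subset under an adaptive token budget.

Result: Across code completion, summarization, and question answering, LongCodeZip consistently outperforms baseline methods and reaches up to a 5.6x compression ratio without degrading task performance, making it suitable for large-scale code scenarios.

Insight: Code-specific compression outperforms general-purpose methods, and structure-aware techniques can make long-context processing more efficient.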

Abstract: Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
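The fine-grained stage, selecting a subset of blocks under a token budget to maximize relevance, has the shape of a 0/1 knapsack problem. The sketch below illustrates that formulation only; it is not the LongCodeZip implementation. The relevance scores are made-up stand-ins for perplexity-based scores, and the function name `select_blocks` is hypothetical.

```python
from typing import List, Tuple

# A block is (name, token_count, relevance). Relevance values here are invented;
# in the paper's setting they would come from a perplexity-based score.
Block = Tuple[str, int, float]


def select_blocks(blocks: List[Block], budget: int) -> List[str]:
    """0/1 knapsack: choose blocks with maximum total relevance within the token budget."""
    # best[b] holds (total relevance, chosen block indices) using at most b tokens.
    best = [(0.0, ())] * (budget + 1)
    for i, (_, tokens, relevance) in enumerate(blocks):
        # Iterate budgets downward so each block is used at most once.
        for b in range(budget, tokens - 1, -1):
            candidate = best[b - tokens][0] + relevance
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - tokens][1] + (i,))
    return [blocks[i][0] for i in best[budget][1]]


candidate_blocks = [
    ("parse_args", 40, 0.2),
    ("load_config", 60, 0.9),
    ("main", 50, 0.8),
    ("helper", 30, 0.3),
]
kept = select_blocks(candidate_blocks, budget=120)
```

With a 120-token budget, the sketch keeps `load_config` and `main` (110 tokens), which has higher total relevance than any other combination that fits.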

[10] Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews

Koki Ryu,Hitomi Yanaka

Main category: cs.CL

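TL;DR: The paper studies off-the-shelf large language models (LLMs) on rating prediction, finds that user-written reviews significantly improve prediction performance, and shows that prompting the model to first generate a hypothetical review improves it further.

Motivation: Personalizing LLM outputs to match user preferences is an active research area, but prior work has focused mainly on classification or ranking tasks and has not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning.

Contribution: Demonstrates the potential of off-the-shelf LLMs for rating prediction, shows the effectiveness of user reviews as in-context information, and finds that generating a hypothetical review first improves performance further.

Method: Runs experiments with eight models across three datasets to study how different in-context information (such as user reviews) affects LLM rating prediction, and compares the results with traditional matrix factorization.

Result: User-written reviews significantly improve LLM rating prediction, with results comparable to traditional methods such as matrix factorization. Reviews of concrete items are more effective than general preference descriptions that are not tied to a specific item.

Insight: User reviews matter for rating prediction, LLMs are a promising option for the cold-start problem, and LLMs show potential on regression tasks.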

Abstract: Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at https://github.com/ynklab/rating-prediction-with-reviews.

[11] Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains

Yawen Xue,Masaya Tsunokake,Yuta Koreeda,Ekant Muljibhai Amin,Takashi Sumiyoshi,Yasuhiro Sogawa

Main category: cs.CL

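TL;DR: The paper studies agent fine-tuning with distillation for large language models (LLMs) in a specialized technical microdomain (Hitachi's JP1 middleware), with the goal of improving reasoning and decision-making efficiency.

Motivation: Existing agentic LLMs mostly achieve multi-step reasoning through in-context learning, which leads to lengthy inputs and high computational cost, and the effectiveness of agent fine-tuning in technical microdomains is unclear. The paper explores agent fine-tuning for JP1 middleware.

Contribution: Proposes a distillation-based agent fine-tuning approach that uses JP1-specific datasets and LLM-generated reasoning trajectories to improve decision-making accuracy and search efficiency in the microdomain.

Method: JP1 datasets are derived from domain manuals and combined with reasoning trajectories generated by LLMs themselves for fine-tuning. At inference time, an agentic prompt with retrieval-augmented generation is used, and a context-answer extractor is introduced to improve information relevance.

Result: On JP1 certification exam questions, the method achieves a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning in complex microdomains.

Insight: Agent fine-tuning that combines domain-specific data with distilled trajectories can improve the reasoning ability and practical usefulness of LLMs in technical microdomains.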

Abstract: Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi-step reasoning tasks. Most approaches leverage these capabilities via in-context learning with few-shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine-tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain-specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine-tuning for domain adaptation within Hitachi’s JP1 middleware, a microdomain for specialized IT operations. We fine-tuned LLMs using JP1-specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval-augmented generation and introduced a context-answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.

[12] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

Pengzhou Cheng,Lingzhong Dong,Zeng Wu,Zongru Wu,Xiangru Tang,Chengwei Qin,Zhuosheng Zhang,Gongshen Liu

Main category: cs.CL

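TL;DR: The paper proposes Agent-ScanKit, a framework that uses three orthogonal probing paradigms (visual-guided, text-guided, and structure-guided) to quantify the memory and reasoning capabilities of multimodal agents, and finds that existing models rely mostly on mechanical memorization rather than systematic reasoning.

Motivation: Multimodal agents have become better at autonomous interaction in graphical user interfaces (GUIs), but their reliability remains limited on complex or out-of-domain tasks, which raises the question of whether existing agents reason spuriously.

Contribution: Proposes the Agent-ScanKit framework, which quantifies the memory and reasoning capabilities of multimodal agents without access to model internals, and reveals that existing models over-rely on mechanical memorization.

Method: Designs three orthogonal probing paradigms (visual-guided, text-guided, and structure-guided) and uses controlled perturbations to quantify the contributions of memorization and reasoning.

Result: On five publicly available GUI benchmarks involving 18 multimodal agents, mechanical memorization often outweighs systematic reasoning. Most models act mainly as retrievers of training-aligned knowledge and generalize poorly.

Insight: Multimodal agents need robust reasoning modeling for real-world scenarios; the findings offer guidance for building reliable agents.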

Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.

[13] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Xingjian Zhao,Zhe Xu,Luozhijie Jin,Yang Wang,Hanfu Chen,Yaozhou Jiang,Ke Chen,Ruixiao Li,Mingshu Chen,Ruiming Wang,Wenbo Zhang,Yiyang Zhang,Donghua Yu,Yang Gao,Xiaogui Yang,Yitian Gong,Yuanfan Xu,Qinyuan Cheng,Zhaoye Fei,Shimin Li,Yaqian Zhou,Xuanjing Huang,Xipeng Qiu

Main category: cs.CL

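TL;DR: MOSS-Speech is a true end-to-end speech-to-speech large language model that needs no text intermediate. Its modality-based layer-splitting architecture preserves the knowledge and reasoning ability of the pretrained text LLM.

Motivation: Spoken dialogue systems usually rely on cascaded pipelines (speech transcription, text processing, speech synthesis), which discard paralinguistic cues and limit expressivity. Recent end-to-end methods reduce latency and preserve these cues better, but they still depend on text intermediates, which creates a bottleneck.

Contribution: MOSS-Speech is a speech-to-speech large language model that directly understands and generates speech without text guidance. It combines a modality-based layer-splitting architecture with a frozen pre-training strategy, retaining the capabilities of the text LLM while adding native speech capabilities.

Method: Uses a modality-based layer-splitting architecture together with a frozen pre-training strategy to preserve the knowledge and reasoning of the text LLM while introducing native speech capabilities.

Result: Achieves state-of-the-art results in spoken question answering and speech-to-speech performance comparable to existing text-guided systems, while maintaining competitive text performance.

Insight: The work narrows the gap between text-guided and direct speech generation and offers a new paradigm for expressive and efficient end-to-end speech interaction.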

Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.

[14] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Yurun Chen,Xavier Hu,Yuhan Liu,Ziqi Wang,Zeyi Liao,Lin Chen,Feng Wei,Yuxi Qian,Bo Zheng,Keting Yin,Shengyu Zhang

Main category: cs.CL

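TL;DR: Graph2Eval is a knowledge graph-based framework that automatically generates multimodal tasks for evaluating agents on multi-step interaction and in dynamic environments.

Motivation: Static datasets and existing LLM-based synthetic data methods cannot adequately evaluate agents on dynamic tasks and multi-step interaction, especially in multimodal and web environments.

Contribution: Proposes the Graph2Eval framework, which automatically generates multimodal tasks from knowledge graphs, supports comprehensive evaluation of agents' reasoning, collaboration, and interaction capabilities, and comes with a benchmark dataset of 1,319 tasks.

Method: Builds knowledge graphs from multi-source external data, translates semantic relations into structured tasks using subgraph sampling, task templates, and meta-paths, and applies multi-stage filtering to ensure task quality.

Result: Experiments show that Graph2Eval generates tasks efficiently and that these tasks differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction.

Insight: Knowledge graphs are an effective basis for generating diverse tasks, and dynamic task generation gives a more realistic picture of what agents can actually do.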

Abstract: As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents’ reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.

[15] Copy-Paste to Mitigate Large Language Model Hallucinations

Yongchao Long,Xian Wu,Yingying Zhang,Xianbin Wen,Yuxi Zhou,Shenda Hong

Main category: cs.CL

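TL;DR: Proposes CopyPasteLLM, obtained through two-stage high-copying response preference training, which substantially reduces hallucinations in large language models (LLMs) and performs best on several datasets.

Motivation: Retrieval-Augmented Generation (RAG) supplies grounding context, but LLMs may still produce answers that are unfaithful to that context (hallucinations), which undermines reliability. The authors observe an inverse correlation between copying degree and hallucination and therefore propose high-copying training to improve reliability.

Contribution: 1. Proposes CopyPasteLLM, which reduces hallucinations through two-stage high-copying preference training; 2. Designs three prompting methods that increase copying degree; 3. Proposes the Context-Parameter Copying Capturing algorithm to explain how the model's reliance on knowledge sources is recalibrated.

Method: 1. Two-stage high-copying response preference training; 2. Three prompting methods to increase copying degree; 3. A fully automated pipeline that turns generated responses into high-copying preference data for training.

Result: CopyPasteLLM achieves the best performance on FaithEval, ConFiQA, and PubMedQA, with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, using only 365 training samples (1/50th of the baseline data).

Insight: CopyPasteLLM reduces hallucinations by recalibrating reliance on internal parametric knowledge rather than on external knowledge, which indicates that high-copying responses improve reliability.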

Abstract: While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting that higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose CopyPasteLLM, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples – 1/50th of baseline data. To elucidate CopyPasteLLM’s effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All codes are available at https://github.com/longyongchao/CopyPasteLLM
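As a rough illustration of what a "copying degree" might measure, the sketch below computes the fraction of a response's word bigrams that also appear in the context. This particular definition is an assumption made for illustration; the abstract does not specify the paper's metric, and the function name `copy_degree` and the example texts are hypothetical.

```python
def copy_degree(context: str, response: str, n: int = 2) -> float:
    """Fraction of the response's word n-grams that also occur in the context."""
    ctx = context.lower().split()
    res = response.lower().split()
    ctx_ngrams = {tuple(ctx[i:i + n]) for i in range(len(ctx) - n + 1)}
    res_ngrams = [tuple(res[i:i + n]) for i in range(len(res) - n + 1)]
    if not res_ngrams:
        return 0.0  # response too short to contain any n-gram
    return sum(gram in ctx_ngrams for gram in res_ngrams) / len(res_ngrams)


context = "the drug reduced blood pressure in adult patients"
copied = "the drug reduced blood pressure"
paraphrased = "this medication lowers hypertension"

high = copy_degree(context, copied)       # every bigram is lifted from the context
low = copy_degree(context, paraphrased)   # no bigram overlaps with the context
```

A score like this could be used to rank candidate responses when building preference pairs, with the higher-copying response treated as preferred; that pairing rule is likewise an assumption here.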

[16] JoyAgent-JDGenie: Technical Report on the GAIA

Jiarun Liu,Shiyue Xu,Shangkun Liu,Yang Li,Wen Liu,Min Liu,Xiaoqing Zhou,Hanmin Wang,Shilin Jia,Zhen Wang,Shaohua Tian,Hanhao Li,Junbo Zhang,Yongli Yu,Peng Cao,Haofen Wang

Main category: cs.CL

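TL;DR: JoyAgent-JDGenie proposes a generalist agent architecture that improves robustness and adaptability on complex tasks through multi-agent collaboration, a hierarchical memory system, and a refined tool suite.

Motivation: Existing LLM-based agent systems often focus on isolated improvements and lack a unifying, system-level design for robustness and adaptability. The authors address this by integrating multi-agent collaboration, hierarchical memory, and tools.

Contribution: 1) A collective multi-agent framework that combines planning and execution agents with critic model voting; 2) A hierarchical memory system spanning working, semantic, and procedural layers; 3) A refined tool suite for search, code execution, and multimodal parsing.

Method: Combines multi-agent collaboration, hierarchical memory management, and tool integration, with critic model voting to make decisions more robust.

Result: On a comprehensive benchmark, the framework consistently outperforms open-source baselines and approaches the performance of proprietary systems.

Insight: System-level integration is key to scalable, resilient, and adaptive AI assistants, and multi-agent and hierarchical designs carry over to diverse tasks.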

Abstract: Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi-agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open-source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system-level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.

[17] EuroSpeech: A Multilingual Speech Corpus

Samuel Pfisterer,Florian Grötschla,Luca A. Lanzendörfer,Florian Yan,Roger Wattenhofer

Main category: cs.CL

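TL;DR: EuroSpeech presents a scalable pipeline for building multilingual speech datasets from parliamentary recordings; fine-tuning an existing ASR model on the resulting data substantially reduces word error rates.

Motivation: Existing multilingual speech datasets cover many languages but contain too little data for most of them, so trained models perform poorly on the majority of supported languages. EuroSpeech aims to address this with large-scale, high-quality speech data.

Contribution: 1) A scalable pipeline for extracting aligned speech segments from parliamentary recordings; 2) The EuroSpeech dataset, covering recordings from 22 European parliaments, with 19 languages exceeding 1k hours of data.

Method: 1) Robust components for media retrieval; 2) A two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio.

Result: The pipeline extracts over 61k hours of high-quality aligned speech. Fine-tuning an existing ASR model on the dataset yields an average 41.8% reduction in word error rate over baselines.

Insight: Parliamentary recordings are a valuable resource for building high-quality multilingual speech datasets, particularly for languages that are under-represented in existing corpora.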

Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
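
The headline result above is a relative reduction in word error rate (WER). As a minimal sketch of how such a number is computed, the snippet below implements WER as word-level edit distance divided by reference length, plus a relative-reduction helper. The function names and the example sentences are illustrative assumptions; the paper's own evaluation code and text normalization are not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative sentences only, not taken from the dataset.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(relative_reduction(0.20, 0.10))
```

A relative reduction is measured against the baseline's own error rate, so the same percentage can correspond to very different absolute WER changes across languages.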

[18] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Gaotang Li,Ruizhong Qiu,Xiusi Chen,Heng Ji,Hanghang Tong

Main category: cs.CL

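TL;DR: The paper examines the limitations of the negative log likelihood (NLL) objective in supervised fine-tuning (SFT) and studies a family of probability-based objectives; choosing the objective according to where a model sits on the model-capability continuum improves performance.

Details Motivation: The standard NLL objective generalizes poorly in SFT, particularly when the model already encodes task-relevant priors and the supervision is long and noisy.

Contribution: Identifies the model-capability continuum as the critical factor in choosing an objective, and studies a family of probability-based objectives that outperform NLL near the model-strong end of that continuum.

Method: Studies a general family of probability-based objectives and validates it through experiments across 7 model backbones, 14 benchmarks, and 3 domains.

Result: Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$) outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails.

Insight: The effectiveness of an objective depends strongly on model capability, and the theoretical analysis provides a principled basis for adapting the objective to the model.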

Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
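
To make "downweight low-probability tokens" concrete, here is a small sketch comparing the per-token objectives named in the abstract. It relies only on calculus: the gradient of $-\log p$ with respect to $\log p$ has magnitude 1 for every token, whereas the gradient of $-p^k$ has magnitude $k p^k$, which vanishes as $p \to 0$. The function names are assumptions made for illustration; the thresholded variants and the authors' training code are not reproduced here.

```python
import math


def nll(p: float) -> float:
    """Negative log likelihood of a target token with probability p."""
    return -math.log(p)


def neg_p_power(p: float, k: int = 1) -> float:
    """The -p^k objective; k=1 gives -p and k=10 gives -p^10."""
    return -(p ** k)


def weight_on_logp_gradient(p: float, k: int = 0) -> float:
    """Magnitude of d(loss)/d(log p) for one token.

    k == 0 selects NLL, whose weight is 1 regardless of p.
    k >= 1 selects -p^k, whose weight is k * p^k.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 if k == 0 else k * p ** k


# Compare how strongly each objective pushes on unlikely vs. likely tokens.
for p in (0.05, 0.5, 0.95):
    print(p, weight_on_logp_gradient(p, 0),
          weight_on_logp_gradient(p, 1),
          weight_on_logp_gradient(p, 10))
```

Under NLL, a token the model considers very unlikely contributes as much gradient on its log-probability as a confident one; under $-p$ or $-p^{10}$ that contribution shrinks toward zero, which is the prior-leaning behavior the abstract describes.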

[19] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Kung-Hsiang Huang,Haoyi Qiu,Yutong Dai,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

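TL;DR: The paper proposes GUI-KV, a KV cache compression method for GUI agents built on vision-language models; it uses spatial saliency guidance and temporal redundancy scoring to decide what to keep in the cache, substantially improving efficiency.

Details Motivation: GUI agents are inefficient because they process long sequences of high-resolution screenshots, and existing cache compression methods do not exploit the spatial and temporal redundancy of GUIs.

Contribution: 1) GUI-KV, a KV cache compression method that requires no retraining; 2) spatial saliency guidance and temporal redundancy scoring, which make better use of the cache budget.

Method: 1) Spatial saliency guidance: augment attention scores with the L2 norm of hidden states to preserve important visual tokens; 2) temporal redundancy scoring: project previous frames' keys onto the current frame's key subspace and preferentially prune redundant history.

Result: In a 5-screenshot setting on AgentNetBench, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline; across benchmarks it closely matches full-cache accuracy at modest budgets.

Insight: In GUI agent workloads, attention sparsity is uniformly high across all transformer layers, so a simple uniform budget allocation outperforms more complex layer-varying schemes.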

Abstract: Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they also face the inefficiency challenge as they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames’ keys onto the current frame’s key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
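
The two scoring ideas in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not say how the L2 norm is combined with attention scores (an additive term with a hypothetical weight `alpha` is assumed here), and redundancy is assumed to be the fraction of a previous key's squared norm that lies in the span of the current frame's keys. Plain Python lists stand in for tensors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt: an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def temporal_redundancy(prev_keys, curr_keys):
    """Share of each previous key's squared norm captured by span(curr_keys).

    A score near 1 means the key adds little beyond the current frame and is
    a pruning candidate; zero-norm keys are given a score of 0.
    """
    basis = orthonormal_basis(curr_keys)
    scores = []
    for k in prev_keys:
        total = dot(k, k)
        captured = sum(dot(k, b) ** 2 for b in basis)
        scores.append(captured / total if total > 0 else 0.0)
    return scores


def spatial_saliency(attn_scores, hidden_states, alpha=1.0):
    """Attention score plus alpha times the hidden-state L2 norm (assumed form)."""
    return [a + alpha * math.sqrt(dot(h, h))
            for a, h in zip(attn_scores, hidden_states)]


curr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
prev = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 1.0]]
print(temporal_redundancy(prev, curr))
```

In this toy case the first previous key lies entirely in the current frame's key subspace, the second is orthogonal to it, and the third is split evenly, so a pruning rule based on these scores would drop the first key before the others.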

[20] ThinkBrake: Mitigating Overthinking in Tool Reasoning

Minjae Oh,Sangjun Song,Seungkyu Lee,Sungmin Jo,Yohan Jo

Main category: cs.CL

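TL;DR: The paper proposes ThinkBrake, a training-free decoding heuristic that mitigates overthinking by small reasoning models (SRMs) during tool use and reduces the number of reasoning tokens.

Details Motivation: Small reasoning models tend to overthink during tool use: they reach a correct tool-argument configuration, then keep reasoning and overwrite it with an incorrect final call. The result is wasted, redundant reasoning and lower accuracy.

Contribution: 1. Diagnoses overthinking via oracle rollouts; 2. proposes ThinkBrake, a training-free decoding heuristic that cuts redundant reasoning; 3. validates the method on the Berkeley Function Calling Leaderboard (BFCL).

Method: At sentence boundaries, ThinkBrake monitors the log-probability margin between the end-of-thinking token and the current top token, and triggers termination when that margin becomes small.

Result: Across BFCL's single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.

Insight: Overthinking in tool reasoning causes substantial redundant computation, and a simple decoding heuristic such as ThinkBrake can recover part of that headroom.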

Abstract: Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject an end-of-thinking token at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between the end-of-thinking token and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL’s single-turn, non-live, and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
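
The stopping rule described above can be sketched as a simple check run at sentence boundaries. This is a minimal illustration under stated assumptions, not the authors' code: the token string `<end_think>`, the threshold value, the dictionary of next-token log-probabilities, and the sentence-boundary test are all hypothetical stand-ins.

```python
END_THINK = "<end_think>"  # hypothetical name for the end-of-thinking token


def is_sentence_boundary(text: str) -> bool:
    """Crude boundary test (assumed): the text so far ends a sentence or line."""
    return text.rstrip(" ").endswith((".", "!", "?", "\n"))


def margin_to_end_think(next_token_logprobs: dict) -> float:
    """Log-probability gap between the top token and the end-of-thinking token.

    The margin is non-negative and equals 0 when the end-of-thinking token
    is itself the most likely next token.
    """
    top_logprob = max(next_token_logprobs.values())
    return top_logprob - next_token_logprobs[END_THINK]


def should_stop_thinking(text_so_far: str, next_token_logprobs: dict,
                         threshold: float = 0.25) -> bool:
    """Stop reasoning when, at a sentence boundary, the margin is small."""
    if not is_sentence_boundary(text_so_far):
        return False
    return margin_to_end_think(next_token_logprobs) <= threshold


# Toy distributions over a three-token vocabulary.
almost_done = {END_THINK: -1.2, "Wait": -1.0, "So": -2.5}
still_busy = {END_THINK: -5.0, "Wait": -0.1, "So": -3.0}
print(should_stop_thinking("The arguments are city=Paris.", almost_done))
print(should_stop_thinking("The arguments are city=Paris.", still_busy))
```

Because the check needs only the next-token distribution the decoder already computes, a rule of this shape adds no training and little inference cost, which is consistent with the training-free framing in the abstract.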

[21] Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation

Yubo Xie,Chenkai Wang,Zongyang Ma,Fahui Miao

Main category: cs.CL

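TL;DR: The paper introduces the CHIME dataset and evaluates how well large language models understand Chinese Internet memes, finding that models struggle with culturally and linguistically nuanced memes and with tracing meme origins.

Details Motivation: The study asks whether large language models truly understand viral online content ("memes"), specifically in the context of the Chinese Internet.

Contribution: 1. The CHIME dataset of phrase-based Chinese Internet memes with detailed annotations; 2. two evaluation tasks (explanation and multiple-choice fill-in-the-blank questions), with an analysis of LLM performance on both.

Method: 1. Build the CHIME dataset, annotating each meme's meaning, origin, example sentences, and type; 2. design two tasks (explanation and multiple-choice fill-in-the-blank questions) to evaluate model understanding.

Result: Models can explain the meanings of some memes, but performance declines significantly for culturally and linguistically nuanced meme types, and models consistently struggle to give accurate origins; on the multiple-choice task they remain noticeably below human level.

Insight: LLMs remain limited in meme understanding, especially where cultural and linguistic knowledge is required; the dataset can support further research on computational meme understanding.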

Abstract: Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online – commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models’ ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.

[22] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li,Yang Tang,Yifan Wang,Peiming Li,Xi Chen

Main category: cs.CL

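TL;DR: ReSeek proposes a self-correcting framework that adds a dense, instructive reward function and a JUDGE action, letting search agents identify and recover from errors during an episode and improving task success rate and path faithfulness.

Details Motivation: Existing RL-based search agents often rely on sparse or rule-based rewards, so an agent can commit to an erroneous path without being able to recover, which hurts task performance.

Contribution: Proposes the ReSeek framework, which includes a self-correction mechanism and a dense reward function, and introduces a new benchmark, FictionalHot.

Method: Introduces a JUDGE action for re-planning the search path during an episode, and designs a process reward that decomposes into a correctness reward and a utility reward.

Result: Experiments show that agents trained with ReSeek significantly outperform existing baselines.

Insight: Dense rewards combined with dynamic self-correction effectively improve the performance and robustness of search agents.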

Abstract: Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. ReSeek is intuitively reasonable and practically simple, and extensive experiments show that agents trained with it significantly outperform SOTA baselines in task success rate and path faithfulness.

[23] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

Li Li,Ziyi Wang,Yongliang Wu,Jianfei Cai,Xu Yang

Main category: cs.CL

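TL;DR: The paper proposes CoT Vectors, a low-cost way to improve the reasoning of large language models (LLMs) by encoding task-general, multi-step reasoning knowledge, as an alternative to costly approaches such as in-context learning and fine-tuning.

Details Motivation: Existing implementations of CoT prompting, such as in-context learning and fine-tuning, are costly and inefficient, so a cheaper and more efficient way to strengthen LLM reasoning is needed.

Contribution: 1. Proposes CoT Vectors, compact representations of task-general multi-step reasoning;
2. Reveals layer-wise instability in Extracted CoT Vectors, which shows up as a U-shaped performance curve;
3. Proposes Learnable CoT Vectors, optimized under a teacher-student framework to provide stable and robust reasoning guidance.

Method: 1. Extract CoT Vectors following the task vector paradigm;
2. Observe layer-wise instability experimentally;
3. Design Learnable CoT Vectors and optimize them under a teacher-student framework.

Result: Across diverse benchmarks and models, CoT Vectors outperform existing baselines and achieve performance comparable to parameter-efficient fine-tuning methods while requiring fewer trainable parameters.

Insight: The effectiveness of CoT Vectors varies with latent space structure, information density, acquisition mechanism, and pre-training differences, offering new insight into how multi-step reasoning is functionally organized in LLMs.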

Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.

[24] MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation

Jinlan Fu,Shenzhen Huangfu,Hao Fei,Yichong Huang,Xiaoyu Shen,Xipeng Qiu,See-Kiong Ng

Main category: cs.CL

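TL;DR: MCM-DPO is a multimodal direct preference optimization method for alt-text generation that addresses noisy user annotations and models' insensitivity to context.

Details Motivation: Alt-text generation suffers from noisy and inconsistent annotations, and large vision-language models are insufficiently sensitive to contextual information. Supervised fine-tuning depends on accurate target annotations and therefore struggles on user-generated data.

Contribution: Proposes MCM-DPO, which improves alt-text quality through multifaceted cross-modal preference optimization without requiring precise annotations; constructs two high-quality datasets, TAlt and PAlt, to support research.

Method: MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors, and learns directly from preference pairs which alt-text option is better.

Result: Experiments show that MCM-DPO consistently outperforms both DPO and SFT, setting a new state of the art in alt-text generation.

Insight: Preference optimization methods such as DPO suit settings with noisy annotations, and cross-modal, multifaceted preference learning can substantially improve generation performance.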

Abstract: The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs’ insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO
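
MCM-DPO builds on Direct Preference Optimization, so a sketch of the standard DPO loss for a single preference pair may help. This shows only the base DPO objective; how MCM-DPO combines its single, paired, and multi-preference facets is not specified in the abstract and is not reproduced here. The function names and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# The policy has moved toward the chosen alt-text relative to the reference.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

The loss depends only on which response in the pair is better, not on a gold target text, which is why preference-based training tolerates flawed user-written alt-text better than SFT does.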

[25] Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation

François Ledoyen,Gaël Dias,Jeremie Pantin,Alexis Lechervy,Fabrice Maurel,Youssef Chahir

Main category: cs.CL

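TL;DR: The paper investigates the potential of large language models (LLMs) to generate Easy-to-Read (ETR) text automatically. It uses multi-task learning (MTL) over text summarization, text simplification, and ETR generation, and explores two strategies: multi-task retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (MTL-LoRA). Experiments show the benefit of the multi-task approach.

Details Motivation: Simplifying complex texts is especially important for people with cognitive impairments, but producing Easy-to-Read text by hand is time-consuming and resource-intensive; the authors aim to automate the process with LLMs.

Contribution: 1) A multi-task learning approach that combines text summarization, simplification, and ETR generation; 2) two strategies (RAG and MTL-LoRA); 3) ETR-fr, a new high-quality dataset.

Method: A multi-task learning framework comprising a RAG-based in-context learning strategy and a parameter-efficient fine-tuning strategy (MTL-LoRA), evaluated with Mistral-7B and LLaMA-3-8B.

Result: Multi-task setups outperform single-task baselines across all configurations; the RAG strategy enables generalization in out-of-domain settings, while MTL-LoRA performs best in in-domain configurations.

Insight: Multi-task learning combines complementary information from related tasks and improves ETR generation; the RAG strategy helps generalization, while MTL-LoRA offers parameter-efficient fine-tuning.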

Abstract: Simplifying complex texts is essential for ensuring equitable access to information, especially for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative offers a framework for making content accessible to the neurodivergent population, but the manual creation of such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specificity of ETR constraints, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two different strategies: multi-task retrieval-augmented generation (RAG) for in-context learning, and MTL-LoRA for parameter-efficient fine-tuning. Our experiments with Mistral-7B and LLaMA-3-8B, based on ETR-fr, a new high-quality dataset, demonstrate the benefits of multi-task setups over single-task baselines across all configurations. Moreover, results show that the RAG-based strategy enables generalization in out-of-domain settings, while MTL-LoRA outperforms all learning strategies within in-domain configurations.

Harethah Abu Shairah,Somayah AlHarbi,Abdulaziz AlHussein,Sameer Alsabea,Omar Shaqaqi,Hebah AlShamlan,Omar Knio,George Turkiyyah

Main category: cs.CL

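TL;DR: ALARB is an Arabic legal argument reasoning benchmark containing over 13K commercial court cases from Saudi Arabia, used to evaluate the multistep reasoning of large language models in the Arabic legal domain.

Details Motivation: Existing Arabic benchmarks lack datasets that focus on multistep reasoning in open-ended contexts, particularly in the legal domain. ALARB fills this gap and aims to improve the legal reasoning of Arabic LLMs.

Contribution: 1. The ALARB dataset, with detailed court cases and the cited regulatory clauses; 2. a set of challenging tasks (such as verdict prediction and reasoning-chain completion); 3. evidence that instruction tuning on ALARB significantly improves model performance.

Method: Defines tasks on the ALARB dataset and benchmarks a representative selection of open and closed Arabic LLMs on them; uses instruction tuning to improve a 12B-parameter model.

Result: After instruction tuning, the 12B-parameter model improves significantly on verdict prediction and Arabic verdict generation, reaching a level comparable to GPT-4o.

Insight: ALARB shows the value of domain-specific datasets and tasks for improving LLM performance and points to new directions for AI in the Arabic legal domain.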

Abstract: We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.

[27] Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese

Jenny Kunz,Iben Nyholm Debess,Annika Simonsen

Main category: cs.CL

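TL;DR: The paper studies how to adapt small, efficient LLMs to Faroese, a low-resource language, using language transfer and parameter-efficient tuning such as LoRA. Transfer from related languages proves crucial, but the best source language and tuning method depend on the task.

Details Motivation: Faroese is a low-resource North Germanic language that lacks evaluation data and adapted models. The study explores how related languages (such as Icelandic and Danish) and different tuning methods (LoRA and full fine-tuning) can improve model performance.

Contribution: 1) Two new Faroese evaluation benchmarks; 2) evidence that language transfer is necessary, together with guidance on choosing the source language and tuning method; 3) an analysis of linguistic accuracy and text comprehension that includes expert human evaluation.

Method: 1) Start from English models and continue pre-training on related Scandinavian languages, either individually or combined via merging; 2) fine-tune on Faroese, comparing full fine-tuning with LoRA; 3) construct new benchmarks and add human evaluation by Faroese linguists.

Result: 1) Icelandic improves linguistic accuracy, whereas Danish boosts comprehension; 2) LoRA improves linguistic acceptability, while full fine-tuning yields stronger comprehension and better preserves model capabilities during downstream fine-tuning.

Insight: For low-resource languages, the choice of source language and tuning method should be weighed against the target task; language relatedness and the task objective are the key factors.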

Abstract: We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.

[28] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation

Yanming Sun,Runzhe Zhan,Chi Seng Cheang,Han Wu,Xuebo Liu,Yuyao Niu,Fengying Ye,Kaixin Lan,Lidia S. Chao,Derek F. Wong

Main category: cs.CL

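TL;DR: The paper studies how fragile retrieval-augmented LLM-based machine translation (REAL-MT) is under noisy retrieval, and proposes a noise synthesis framework and evaluation metrics. Low-resource language pairs are more affected by noise, and large reasoning models (LRMs) are even more easily misled by it.

Details Motivation: REAL-MT shows promise for knowledge-intensive tasks such as idiomatic translation, but its reliability under noisy retrieval contexts has not been well studied.

Contribution: Proposes a noise synthesis framework and new metrics, systematically evaluates the robustness of REAL-MT, and exposes its limitations under noise along with possible mitigation directions.

Method: Using the noise synthesis framework, the authors instantiate REAL-MT with Qwen-series models and evaluate standard LLMs and reasoning-enhanced LRMs on high-, medium-, and low-resource language pairs.

Result: Low-resource language pairs degrade more severely under noise; LRMs show no improvement in error correction and are more susceptible to noise, tending to rationalize incorrect contexts; attention shifts away from the source idiom to the noisy content, while confidence increases despite declining accuracy.

Insight: Current approaches are limited: mitigation strategies improve robustness at the cost of performance in clean contexts, which underscores the need for self-verifying integration mechanisms.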

Abstract: REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.

[29] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Adi Simhi,Jonathan Herzig,Martin Tutek,Itay Itzhak,Idan Szpektor,Yonatan Belinkov

Main category: cs.CL

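TL;DR: ManagerBench is a new benchmark for evaluating how large language models (LLMs) trade off safety against pragmatism in autonomous decision-making. Current frontier LLMs perform poorly: they either choose pragmatic but harmful options or become overly safe and ineffective.

Details Motivation: As LLMs evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Existing safety benchmarks focus mainly on harmful content generation and overlook harmful actions an agent may take to reach an operational goal.

Contribution: Introduces ManagerBench, a benchmark that evaluates LLM decision-making in realistic managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action and a safe action with worse operational performance; a control set measures a model's pragmatism.

Method: Designs human-validated managerial scenarios that force a choice between safety and pragmatism, plus a parallel control set, in which potential harm is directed only at inanimate objects, to measure a model's pragmatism and its tendency to be overly safe.

Result: Frontier LLMs handle the safety-pragmatism trade-off poorly: many consistently choose harmful but pragmatic actions, while others are overly safe and ineffective. Models' harm assessments align with human judgments; the problem lies in how they prioritize.

Insight: The failure stems from flawed prioritization rather than an inability to perceive harm. ManagerBench offers a challenging benchmark for a core component of agentic behavior.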

Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model’s pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models’ harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.

[30] Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Ziliang Wang,Kang An,Xuhui Zheng,Faqiang Qian,Weikun Zhang,Cijun Ouyang,Jialu Cai,Yuhang Wang,Yichao Wu

Main category: cs.CL

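TL;DR: The paper proposes Erasable Reinforcement Learning (ERL), a framework that identifies, erases, and regenerates faulty steps in a reasoning chain to make search-augmented large language models (LLMs) more reliable at multi-hop reasoning.

Details Motivation: Search-augmented LLMs are capable, but their reliability in multi-hop reasoning is limited by decomposition errors, missing retrieval, and reasoning errors. A single failure at any stage can derail the final answer.

Contribution: Proposes the ERL framework, which explicitly identifies and corrects faulty steps in the reasoning chain and substantially improves the robustness of multi-hop reasoning.

Method: ERL identifies faulty steps in the reasoning chain, erases them, and regenerates the reasoning in place, preventing errors from propagating.

Result: On HotpotQA, MuSiQue, 2Wiki, and Bamboogle, the 3B model gains +8.48% EM and +11.56% F1 and the 7B model gains +5.38% EM and +7.22% F1 over previous SOTA results.

Insight: ERL offers a robust approach to multi-step reasoning in LLMs and helps limit error propagation.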

Abstract: While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
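
The gains above are reported in exact match (EM) and F1. As a rough sketch of what these metrics measure, the snippet below follows the answer normalization commonly used for extractive QA (lowercasing, removing punctuation and English articles) and computes token-overlap F1. Whether the paper uses exactly this normalization is an assumption; the abstract does not say.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("Paris France", "Paris"))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why multi-hop QA papers usually report both.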

[31] HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

Loris Bergeron,Ioana Buhnila,Jérôme François,Radu State

Main category: cs.CL

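TL;DR: HalluGuard is a 4B-parameter small reasoning model for mitigating hallucinations in retrieval-augmented generation; it classifies claims against evidence and generates justifications, with performance comparable to larger models.

Details Motivation: Large language models (LLMs) excel at NLP tasks but remain prone to hallucination, which limits trust in real-world applications, so a small and efficient solution is needed.

Contribution: Proposes HalluGuard, a 4B-parameter small reasoning model for detecting and mitigating hallucinations; introduces a domain-agnostic synthetic dataset and a multi-stage curation process.

Method: Combines a synthetic dataset, multi-stage data curation, and preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller model.

Result: On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy, on par with larger specialized models; over the full benchmark it reaches 75.7%.

Insight: With synthetic data and preference-based fine-tuning, a small model can mitigate hallucinations effectively while staying efficient, which makes it a practical option for deployment.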

Abstract: Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
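
Balanced accuracy (BAcc), the metric quoted above, is the mean of the true positive rate and the true negative rate, so it is not inflated when one class dominates. A minimal sketch, assuming the two labels are the strings "grounded" and "hallucinated" (the label names and the toy data are illustrative, not taken from the benchmark):

```python
def balanced_accuracy(y_true, y_pred, positive="hallucinated"):
    """Mean of recall on the positive class and recall on the negative class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Toy labels: 2 hallucinated claims and 4 grounded claims.
truth = ["hallucinated", "hallucinated", "grounded", "grounded", "grounded", "grounded"]
preds = ["hallucinated", "grounded", "grounded", "grounded", "grounded", "hallucinated"]
print(balanced_accuracy(truth, preds))
```

In the toy data, plain accuracy would be 4/6, but balanced accuracy weighs the missed hallucination more heavily because the hallucinated class is smaller.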

[32] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Shunfeng Zheng,Yudi Zhang,Meng Fang,Zihan Zhang,Zhitan Wu,Mykola Pechenizkiy,Ling Chen

Main category: cs.CL

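TL;DR: The paper explores retrieval-augmented generation (RAG) for solving Olympiad-level physics problems, introduces the high-quality multimodal dataset PhoPile, and shows how RAG can improve the physics reasoning of foundation models.

Details Motivation: Inspired by how students prepare for competitions by reviewing past problems, the authors ask whether RAG can strengthen foundation models' reasoning on advanced physics problems.

Contribution: 1) PhoPile, a multimodal dataset designed specifically for Olympiad-level physics; 2) evidence that RAG can improve model performance; 3) an account of the challenges that motivate further research.

Method: Uses PhoPile to benchmark RAG-augmented foundation models, covering both LLMs and LMMs, with multiple retrievers.

Result: Integrating retrieval with physics corpora can improve model performance, though several challenges remain.

Insight: Multimodal data and retrieval mechanisms matter for complex reasoning tasks, and the findings point to directions for future research.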

Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but its capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.

[33] Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

Eileen Pan,Anna Seo Gyeong Choi,Maartje ter Hoeve,Skyler Seto,Allison Koenecke

Main category: cs.CL

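TL;DR: The paper studies the performance drop of large language models (LLMs) on question answering in non-"standard" English dialects, finds that a few specific grammar rules account for most of it, and calls for bias mitigation research targeting those rules.

Details Motivation: LLMs are widely used in natural language processing but perform worse on non-"standard" English dialects. The paper analyzes the specific causes of this gap, in particular the effect of individual grammar rules.

Contribution: 1) Quantifies the accuracy drop of LLMs on non-"standard" English dialects (up to 20%); 2) identifies three key grammar rules (existential "it", zero copula, and y'all) that explain the majority of the degradation; 3) calls for future work on bias mitigation focused on these high-impact grammatical structures.

Method: 1) Convert "standard" American English questions into non-"standard" dialectal variants; 2) test LLMs on multiple-choice question answering tasks; 3) analyze how individual grammar rules affect performance.

Result: Accuracy on non-"standard" dialect questions drops by up to 20%, and three specific grammar rules explain the majority of the degradation observed across multiple dialects.

Insight: The paper exposes the limits of LLMs with respect to linguistic diversity and argues that bias mitigation should target high-impact grammatical structures, giving future research a clear direction.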

Abstract: Large language models (LLMs) are ubiquitous in modern day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying “standard” American English language questions as non-“standard” dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-“standard” English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential “it”, zero copula, and y’all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

[34] Syntax-Guided Diffusion Language Models with User-Integrated Personalization

Ruqian Zhang,Yijiao Zhang,Juan Shen,Zhongyi Zhu,Annie Qu

Main category: cs.CL

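TL;DR: This paper proposes a syntax-guided diffusion language model that combines structural supervision with personalized conditioning to improve text quality, diversity, and controllability.

Details Motivation: Outputs of conventional large language models tend to be generic and show insufficient structural diversity, which limits personalized expression.

Contribution: The main contributions are: (1) a syntax-guided diffusion language model that uses structural supervision to generate richer text; (2) cascaded and noncascaded architectures that improve the alignment between structure and content; (3) a shared representation mechanism that supports fine-grained personalized generation and zero-shot inference.

Method: The method has two parts: (1) a cascaded framework that first generates syntactic guidance and then the conditional text; (2) a noncascaded architecture that aligns structure and content directly. A shared representation mechanism integrates information across users.

Result: Experiments show that the approach outperforms existing methods in fluency, diversity, and stylistic fidelity.

Insight: Injecting syntactic information into the generation process lets the model capture lexical and structural style characteristics more precisely; the shared representation mechanism improves personalization and generalization.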

Abstract: Large language models have made revolutionary progress in generating human-like text, yet their outputs often tend to be generic, exhibiting insufficient structural diversity, which limits personalized expression. Recent advances in diffusion models have opened new opportunities for improving language generation beyond the limitations of autoregressive paradigms. In this work, we propose a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability. We introduce a cascaded framework that generates syntactic guidance before conditional text generation, and further generalize it to a novel noncascaded architecture for better alignment between structure and content. By incorporating syntactic information in the generating process, the proposed model better captures the lexical and structural characteristics of stylistic sentence construction. To enable fine-grained personalization, we develop a shared representation mechanism that facilitates information integration across users, supporting both faithful stylistic generation and generalizable zero-shot inference. Extensive experiments on multiple tasks demonstrate the superiority of our approach in fluency, diversity, and stylistic fidelity. Further qualitative analyses highlight its interpretability and flexibility in learning personalized patterns.

[35] Research on the Integration of Embodied Intelligence and Reinforcement Learning in Textual Domains

Haonan Wang,Junfeng Sun,Mingjia Zhao,Wei Liu

Main category: cs.CL

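TL;DR: This paper proposes a text processing model that combines embodied intelligence with reinforcement learning, drawing on the perception and action strengths of the former and the decision-optimization capability of the latter, and reports strong results across a range of text tasks.

Details Motivation: To make text processing more intelligent by combining the perception and action capabilities of embodied intelligence with the decision-optimization capability of reinforcement learning.

Contribution: Proposes a new model that integrates embodied intelligence with reinforcement learning and validates its effectiveness across a wide range of text processing tasks.

Method: Through theoretical analysis and experimental exploration, the authors design a model that combines embodied perception and action with reinforcement-learning-based decision optimization.

Result: The model is reported to be effective on a variety of text processing tasks, indicating potential for application.

Insight: Combining embodied intelligence with reinforcement learning offers a new line of attack for more intelligent text processing.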

Abstract: This article addresses the integration of embodied intelligence and reinforcement learning in text processing. The aim is to make text handling more intelligent by combining the perception and action strengths of embodied intelligence with the decision-optimization capability of reinforcement learning. Through detailed theoretical explanation and experimental exploration, a novel integration model is introduced. The model is shown to be effective across a wide range of text processing tasks, supporting its potential for application.

[36] Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review

Sukairaj Hafiz Imam,Tadesse Destaw Belay,Kedir Yassin Husse,Ibrahim Said Ahmad,Idris Abdulmumin,Hadiza Ali Umar,Muhammad Yahuza Bello,Joyce Nakatumba-Nabende,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad

Main category: cs.CL

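TL;DR: This systematic literature review (SLR) surveys research on automatic speech recognition (ASR) for African low-resource languages, covering datasets, models and training methods, evaluation techniques, and challenges, and recommends future directions.

Details Motivation: Africa has more than 2,000 languages, yet ASR research and applications for them are severely lacking, which hinders digital inclusion. The review aims to fill this gap and advance ASR for African languages.

Contribution: 1. A comprehensive survey of ASR research for African low-resource languages; 2. statistics on 74 datasets across 111 languages; 3. an account of the shortcomings of existing work and directions for improvement.

Method: Following the PRISMA 2020 procedure, the review searches DBLP, the ACM Digital Library, and other sources for studies published between 2020 and 2025. Studies are included if they concern ASR datasets, models, or metrics for African languages. 71 papers are analyzed in the end.

Result: The review finds that: 1. fewer than 15% of studies provide reproducible materials; 2. dataset licensing is unclear; 3. self-supervised and transfer learning are promising but limited by scarce pre-training data; 4. evaluation relies mostly on a single metric (WER) and gives little consideration to tonal and morphologically rich languages.

Insight: 1. Community-driven initiatives and methodological advances point to a path for improvement; 2. sustainable progress requires stakeholder partnerships, ethically sound datasets, lightweight models, and benchmarking; 3. future work should address dialect coverage and resource availability.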

Abstract: ASR has achieved remarkable global progress, yet African low-resource languages remain severely underrepresented, producing barriers to digital inclusion across a continent with more than 2,000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, and challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicate, and low-quality studies (score <3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of studies provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most researchers use Word Error Rate (WER), with minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.
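
To make the metric discussion concrete, here is a minimal sketch of WER and CER as edit-distance ratios. It is not taken from the review; the function names are illustrative, and this version of CER counts spaces as characters, which is one of several conventions in use.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference unit
                            curr[j - 1] + 1,      # insert a hypothesis unit
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word Error Rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate: units are characters (spaces included here)."""
    return error_rate(list(reference), list(hypothesis))
```

On a long, agglutinated word, a single wrong character costs a full word error under WER but only one character under CER, which is one reason character-level scores are considered more informative for morphologically rich languages.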

[37] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

David Anugraha,Shou-Yi Hung,Zilu Tang,Annie En-Shiun Lee,Derry Tanti Wijaya,Genta Indra Winata

Main category: cs.CL

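TL;DR: The paper introduces mR3, a rubric-agnostic reward reasoning model covering 72 languages. With carefully chosen data and curriculum selection strategies it achieves efficient multilingual reward modeling and outperforms much larger models.

Details Motivation: Existing LLM-based evaluation methods perform poorly in non-English settings, and effective multilingual training strategies are lacking, so it is necessary to study how to build high-quality multilingual reward models.

Contribution: 1. Introduces mR3, which supports 72 languages; 2. systematically studies data and curriculum selection strategies; 3. outperforms much larger models (such as GPT-OSS-120B).

Method: Trains a rubric-agnostic reward model on multilingual datasets together with target-language reasoning data, and identifies effective data and curriculum selection strategies.

Result: mR3 reaches state-of-the-art performance on multilingual reward model benchmarks while being up to 9x smaller than the models it surpasses.

Insight: Integrating target-language reasoning data and selecting data effectively are both critical to the performance of multilingual reward models.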

Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.

[38] Pay-Per-Search Models are Abstention Models

Mustafa Omer Gul,Claire Cardie,Tanya Goyal

Main category: cs.CL

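TL;DR: The paper proposes MASH, a training framework that obtains abstention behavior from LLMs through selective external help-seeking (such as search tool use). MASH uses reinforcement learning with a pay-per-search reward and substantially improves both answer accuracy and selective abstention.

Details Motivation: Current LLMs cannot reliably recognize the boundaries of their parametric knowledge and often hallucinate answers to questions outside those boundaries. Humans, in contrast, recognize their limits and either abstain or seek external help. MASH aims to give LLMs a comparable abstention ability through help-seeking behavior.

Contribution: MASH's main contribution is a reinforcement learning framework that treats external help-seeking (search) as a proxy for abstention, achieving effective selective abstention and accurate answers without pre-determined knowledge boundaries.

Method: MASH trains the LLM with reinforcement learning under a pay-per-search reward, which penalizes external help while rewarding answer accuracy, so that the model seeks help selectively and abstention emerges from that behavior.

Result: Experiments on three knowledge-intensive QA datasets show that MASH improves answer accuracy by 7.6% on multi-hop datasets and distinguishes answerable from unanswerable questions, behaving similarly to specialized abstention methods.

Insight: Training an LLM to seek external help selectively yields abstention as a by-product, with no need to define knowledge boundaries in advance. This offers a new route to more reliable and practical LLMs.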

Abstract: LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while simultaneously rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention – it can distinguish between unanswerable/answerable questions and selectively generate responses for answerable questions – showcasing behavior analogous to specialized abstention approaches. We emphasize that contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH’s abstentions are a by-product of training for the auxiliary selective help-seeking task. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.
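
The pay-per-search idea can be illustrated with a toy reward function. This is a sketch under stated assumptions, not the paper's reward: the linear form, the `search_cost` value, and both function names are hypothetical, since the abstract only says that search is penalized while answer accuracy is rewarded.

```python
def pay_per_search_reward(is_correct: bool, num_searches: int,
                          search_cost: float = 0.1) -> float:
    """Hypothetical reward: +1 for a correct answer, minus a fee per search call.

    The linear penalty and the default cost are assumptions for illustration.
    """
    if num_searches < 0:
        raise ValueError("num_searches must be non-negative")
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - search_cost * num_searches


def should_abstain(num_searches_requested: int) -> bool:
    """Read a search request as an abstention signal.

    A policy trained to search only when its parametric knowledge falls short
    asks for help exactly on the questions it would otherwise get wrong.
    """
    return num_searches_requested > 0
```

Under such a reward, answering correctly without searching scores highest, so the policy is pushed to search only when it needs to; the decision to search then carries information about what the model does not know.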

[39] Backdoor Attacks Against Speech Language Models

Alexandrine Fortier,Thomas Thebaud,Jesús Villalba,Najim Dehak,Patrick Cardinal

Main category: cs.CL

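TL;DR: This paper presents the first systematic study of audio backdoor attacks against speech language models, shows high attack success rates across several speech encoders and tasks, and proposes a fine-tuning-based defense.

Details Motivation: As large language models (LLMs) and their multimodal extensions spread, the resulting models can inherit the vulnerabilities of their components; backdoor attacks in the audio domain in particular remain under-studied.

Contribution: 1. The first systematic study of the effectiveness of audio backdoor attacks against speech language models; 2. a component-wise analysis that identifies the most vulnerable stages of the pipeline; 3. a fine-tuning-based defense.

Method: The study covers four speech encoders and three datasets across four tasks (automatic speech recognition, speech emotion recognition, and gender and age prediction), measures attack success rates, and uses a component-wise analysis to locate the vulnerable stages.

Result: Attack success rates range from 90.76% to 99.41%, showing that speech language models are highly susceptible to backdoor attacks.

Insight: The vulnerability of speech language models is concentrated in particular pipeline stages, and fine-tuning can mitigate the threat posed by poisoned pretrained encoders.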

Abstract: Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack’s effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.

[40] Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

Zhengliang Shi,Ruotian Ma,Jen-tse Huang,Xinbei Ma,Xingyu Chen,Mengru Wang,Qu Yang,Yue Wang,Fanghua Ye,Ziyang Chen,Shanyi Wang,Cixing Li,Wenxuan Wang,Zhaopeng Tu,Xiaolong Li,Zhaochun Ren,Linus

Main category: cs.CL

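TL;DR: The paper introduces the Social Welfare Function (SWF) Benchmark for evaluating how LLMs allocate scarce societal resources. It finds that mainstream LLMs lean utilitarian and that their strategies are fragile under social-influence framing or output-length constraints.

Details Motivation: LLMs are increasingly used in high-stakes decisions, yet the principles and values that guide them when distributing societal resources have not been studied in depth. A dedicated benchmark is needed to evaluate and steer this behavior.

Contribution: Proposes the SWF Benchmark, a dynamic simulation of LLMs acting as resource allocators, and presents the first leaderboard for social welfare allocation.

Method: In a simulated environment the LLM distributes tasks to recipients, and the benchmark measures the trade-off between collective efficiency (Return on Investment) and distributive fairness (Gini coefficient). 20 state-of-the-art LLMs are evaluated.

Result: General conversational ability is a poor predictor of allocation skill; most LLMs lean utilitarian at the expense of fairness; allocation strategies are easily perturbed by output-length constraints and social framing.

Insight: Deploying current LLMs as societal decision-makers carries risk; targeted alignment and specialized benchmarks are needed to keep them in line with societal values.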

Abstract: Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model’s general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
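
The two quantities the benchmark trades off can be computed in a few lines. The sketch below uses the textbook definitions of the Gini coefficient and Return on Investment; the abstract does not give the benchmark's exact formulas, so treating them as identical to these is an assumption, and the function names are illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is perfect equality,
    (n - 1) / n is the maximum when one recipient holds everything."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    total = sum(xs)
    if total == 0:
        return 0.0  # nothing was allocated, so nobody is ahead of anybody
    n = len(xs)
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def roi(gain, cost):
    """Return on Investment in its standard form: net gain relative to cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost
```

An allocator that routes every task to the most productive recipient tends to raise ROI while pushing the Gini coefficient toward its maximum, which is the tension the benchmark is built around.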

[41] GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning

Oussama Gabouj,Kamel Charaf,Ivan Zakazov,Nicolas Baldwin,Robert West

Main category: cs.CL

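TL;DR: The paper proposes GRAD, a dynamic demonstration generation method that trains an LLM to produce concise, input-specific demonstrations. It makes few-shot reasoning more efficient, outperforms conventional static retrieval-augmented approaches, and performs well on mathematical reasoning and STEM questions.

Details Motivation: Conventional retrieval-augmented generation (RAG) relies on static databases, which can yield demonstrations that are irrelevant to the input. Better and more adaptable few-shot reasoning calls for generating demonstrations dynamically.

Contribution: Proposes GRAD, which generates input-specific demonstrations dynamically; shows its effectiveness under budget constraints; shows that demonstrations generated by smaller models can guide larger models, reducing training cost.

Method: An LLM is trained, on a math dataset only, to generate input-specific demonstrations, with limits on the number of tokens per demonstration and in the final output.

Result: On mathematical reasoning and STEM questions (physics, chemistry, computer science), GRAD outperforms the baselines and generalizes to out-of-distribution (OOD) domains.

Insight: Dynamically generated demonstrations beat static retrieval, and demonstrations from smaller models can effectively guide larger ones, which opens a path for few-shot learning in resource-constrained settings.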

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach where an LLM is trained to generate input-specific concise demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD’s robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model, presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.

[42] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Jiayi Zhang,Simon Yu,Derek Chong,Anthony Sicilia,Michael R. Tomz,Christopher D. Manning,Weiyan Shi

Main category: cs.CL

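TL;DR: This paper proposes Verbalized Sampling (VS), a training-free prompting strategy that asks the model to verbalize a probability distribution over a set of responses, countering the mode collapse that LLMs exhibit after post-training alignment.

Details Motivation: The paper argues that the root cause of reduced diversity (mode collapse) after post-training alignment is typicality bias in preference data: annotators systematically favor familiar text.

Contribution: 1. A theoretical analysis of how typicality bias drives mode collapse; 2. the VS method, which raises diversity through verbalized sampling; 3. comprehensive experiments validating VS across multiple tasks.

Method: VS is a prompting strategy that asks the model to generate a set of responses together with their probabilities (for example, ‘Generate 5 jokes about coffee and their corresponding probabilities’).

Result: VS markedly improves performance on creative writing, dialogue simulation, open-ended QA, and other tasks; in creative writing, diversity rises by 1.6-2.1x over direct prompting, with factual accuracy and safety preserved.

Insight: 1. Data-level bias is a central cause of mode collapse; 2. VS is a simple and effective inference-time remedy; 3. more capable models benefit more from VS.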

Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., “Generate 5 jokes about coffee and their corresponding probabilities”). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
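
A rough sketch of how a VS-style prompt could be built and its output consumed follows. The prompt wording beyond the abstract's example, the `<response> | <probability>` line format, and all function names are assumptions made for illustration; the abstract does not specify an output format or a parsing step.

```python
import random


def build_vs_prompt(topic: str, k: int = 5) -> str:
    """Compose a VS-style prompt; the output format line is a hypothetical choice."""
    return (
        f"Generate {k} {topic} and their corresponding probabilities.\n"
        "Write one candidate per line as: <response> | <probability>"
    )


def parse_vs_output(text: str):
    """Extract (response, probability) pairs, skipping lines that do not parse."""
    candidates = []
    for line in text.splitlines():
        response, sep, prob = line.rpartition("|")
        if not sep:
            continue  # no separator on this line
        try:
            p = float(prob.strip())
        except ValueError:
            continue  # probability field is not a number
        if p > 0 and response.strip():
            candidates.append((response.strip(), p))
    return candidates


def sample_response(candidates, rng: random.Random) -> str:
    """Draw one response in proportion to its verbalized probability."""
    responses, weights = zip(*candidates)
    # random.choices accepts unnormalized weights, so the verbalized
    # probabilities do not need to sum to 1.
    return rng.choices(responses, weights=weights, k=1)[0]
```

Sampling from the verbalized distribution, rather than always taking the first or most likely candidate, is what lets lower-probability responses surface in the final output.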

[43] Energy-Regularized Sequential Model Editing on Hyperspheres

Qingyuan Liu,Jia-Chen Gu,Yunzhi Yao,Hong Wang,Nanyun Peng

Main category: cs.CL

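TL;DR: The paper proposes SPHERE, a sequential model editing method regularized by Hyperspherical Energy (HE), which stabilizes neuron weight distributions to reduce the performance degradation caused by sequential edits.

Details Motivation: Large language models (LLMs) need continual updates to stay in step with real-world knowledge. Model editing is a lightweight alternative to retraining, but sequential editing destabilizes representations and induces catastrophic forgetting. The paper sets out to understand and mitigate this problem.

Contribution: 1) Shows a strong correlation between HE dynamics and editing performance; 2) proves that HE dynamics impose a lower bound on the degradation of pretrained knowledge; 3) proposes SPHERE, an HE-driven regularization strategy that improves editing performance.

Method: SPHERE uses HE to quantify the uniformity of neuron weights, identifies a sparse space complementary to the principal hyperspherical directions, and projects new knowledge onto that space to reduce perturbation.

Result: Experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE improves editing capability over the best baseline by 16.41% on average while better preserving general model performance.

Insight: Hyperspherical uniformity is key to model stability and knowledge retention, and HE stability is crucial for avoiding editing failures.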

Abstract: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
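
For intuition about what HE measures, the sketch below computes a common form of hyperspherical energy: the sum of inverse pairwise distances between unit-normalized neuron weight vectors. The exact variant used in the paper is not stated in the abstract, so the power `s`, the sum over unordered pairs, and the `eps` guard are assumptions.

```python
import math


def normalize(vec):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def hyperspherical_energy(weights, s=1.0, eps=1e-12):
    """Sum of dist**(-s) over unordered pairs of unit-normalized rows.

    Lower energy means the directions are spread more uniformly. Summing over
    ordered pairs instead would simply double the value. `eps` guards against
    identical directions, where the distance is zero.
    """
    units = [normalize(w) for w in weights]
    energy = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            dist = max(math.dist(units[i], units[j]), eps)
            energy += dist ** (-s)
    return energy
```

Tracking this value before and after each edit gives a simple monitor: a sharp jump means the edited weights have moved away from a uniform spread, which the abstract associates with editing failures.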

cs.CV

[44] Review of Hallucination Understanding in Large Language and Vision Models

Zhengyi Ho,Siyuan Liang,Dacheng Tao

Main category: cs.CV

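TL;DR: This survey reviews hallucination in large language and vision models, presents a unified multi-level framework for analyzing text and image hallucinations, and finds that they often follow predictable patterns in data distributions and inherited biases.

Details Motivation: Hallucinations in large language and vision models can propagate misinformation in deployment and cause financial and operational harm, yet the current understanding of them remains fragmented.

Contribution: Presents a unified multi-level framework and links hallucinations to specific mechanisms within a model's lifecycle, supporting a more complete understanding.

Method: Uses a task-modality interleaved approach to analyze how data distributions and biases give rise to hallucinations.

Result: Hallucinations often stem from predictable patterns in data distributions and from biases the model inherits.

Insight: A systematic understanding of the root causes of hallucination helps in building more robust generative AI solutions.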

Abstract: The widespread adoption of large language and vision models in real-world applications has made urgent the need to address hallucinations – instances where models produce incorrect or nonsensical outputs. These errors can propagate misinformation during deployment, leading to both financial and operational harm. Although much research has been devoted to mitigating hallucinations, our understanding of them is still incomplete and fragmented. Without a coherent understanding of hallucinations, proposed solutions risk mitigating surface symptoms rather than underlying causes, limiting their effectiveness and generalizability in deployment. To tackle this gap, we first present a unified, multi-level framework for characterizing both image and text hallucinations across diverse applications, aiming to reduce conceptual fragmentation. We then link these hallucinations to specific mechanisms within a model’s lifecycle, using a task-modality interleaved approach to promote a more integrated understanding. Our investigations reveal that hallucinations often stem from predictable patterns in data distributions and inherited biases. By deepening our understanding, this survey provides a foundation for developing more robust and effective solutions to hallucinations in real-world generative AI systems.

[45] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

Jianing Guo,Zhenhong Wu,Chang Tu,Yiyao Ma,Xiangqi Kong,Zhiqian Liu,Jiaming Ji,Shuning Zhang,Yuanpei Chen,Kai Chen,Xianglong Liu,Qi Dou,Yaodong Yang,Huijie Zhao,Weifeng Lv,Simin Li

Main category: cs.CV

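TL;DR: This paper studies the robustness of Vision-Language-Action (VLA) models under multi-modal perturbations and proposes RobustVLA, which substantially improves performance through offline robust optimization and input-consistency enforcement.

Details Motivation: Existing methods for VLA models target simple visual disturbances and overlook perturbations in actions, instructions, environments, and observations, which limits deployment in real-world settings.

Contribution: 1. Evaluates mainstream VLA models under 17 multi-modal perturbations; 2. proposes RobustVLA, which improves robustness through output-side robust optimization and input-side consistency enforcement; 3. achieves substantial gains under multi-modal perturbations, most notably on real-robot tasks.

Method: 1. Output robustness: offline robust optimization against worst-case action noise; 2. input robustness: enforcing consistent actions across input variations; 3. handling multiple perturbations by casting robustness as a multi-armed bandit problem and using a UCB algorithm to identify the most harmful noise.

Result: On LIBERO, RobustVLA delivers absolute gains over baselines of 12.6% (pi0 backbone) and 10.4% (OpenVLA backbone) across the 17 perturbations, runs inference 50.6x faster than existing visual-robust VLAs, and gains 10.4% under mixed perturbations. On real-world FR5 robot tasks it shows an absolute gain of 65.6%.

Insight: 1. Actions are the most fragile modality; 2. visual robustness in existing VLAs does not carry over to other modalities; 3. pi0, with its diffusion-based action head, shows superior robustness.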

Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow-matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations of four modalities.
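
The bandit formulation can be sketched with the classic UCB1 rule, treating each perturbation type as an arm and the observed harm (for example, a task failure indicator) as the reward. This is a generic illustration and not the authors' implementation; `harm_fn`, the exploration constant, and the choice to report the most-pulled arm are assumptions.

```python
import math


def ucb_select(counts, totals, t, c=2.0):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every perturbation once before comparing
    scores = [
        totals[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def find_most_harmful(harm_fn, num_arms, rounds):
    """Run UCB1 for `rounds` steps and return the most frequently chosen arm.

    `harm_fn(arm)` is a hypothetical callback returning the harm in [0, 1]
    observed when perturbation `arm` is applied to one rollout.
    """
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    for t in range(1, rounds + 1):
        arm = ucb_select(counts, totals, t)
        totals[arm] += harm_fn(arm)
        counts[arm] += 1
    return max(range(num_arms), key=counts.__getitem__)
```

The confidence bonus shrinks as an arm is sampled more often, so the evaluation budget concentrates on the perturbations that look most damaging while the rest are still probed occasionally.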

[46] Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

Junjie Li,Ziao Wang,Jianghong Ma,Xiaofeng Zhang

Main category: cs.CV

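TL;DR: The paper proposes CADC, a capability-oriented data curation framework that optimizes instruction-tuning data for vision-language models by analyzing the model's intrinsic capabilities instead of relying on task-specific heuristics.

Details Motivation: Instruction tuning of vision-language models (VLMs) is hard to control. Existing data curation methods are mostly black-box heuristics that ignore the model's latent capabilities, so shrinking the data budget often causes regressions.

Contribution: Proposes the CADC framework, which discovers a model's intrinsic capabilities in an unsupervised manner and curates data based on capability attribution, making instruction tuning efficient and controllable.

Method: 1. Discover intrinsic capabilities without supervision from gradient-based learning trajectories; 2. attribute training data to these capabilities via influence estimation; 3. build capability-aware curricula through balanced selection and staged sequencing.

Result: With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks, supporting the view that intrinsic capabilities are the basic building blocks of model learning.

Insight: A model's intrinsic capabilities are the key lever for controlling instruction tuning, so data curation should start from capabilities and not from task-oriented black-box heuristics.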

Abstract: Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction-tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.

[47] Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

Yuchen Song,Andong Chen,Wenxin Zhu,Kehai Chen,Xuefeng Bai,Muyun Yang,Tiejun Zhao

Main category: cs.CV

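TL;DR: The paper proposes C$^3$B, a new multicultural, multitask, and multilingual cultural awareness benchmark built on comics, with over 2000 images and over 18000 QA pairs, for evaluating the cultural awareness of multimodal large language models.

Details Motivation: Current cultural awareness benchmarks lack progressive difficulty in their task design and lack cross-lingual tasks. They also tend to use real-world images, each of which typically contains a single culture, making them relatively easy for multimodal large language models.

Contribution: Proposes the C$^3$B benchmark, which uses comics to present multicultural scenes and defines three tasks of increasing difficulty, filling a gap in the evaluation of cultural awareness in multimodal large language models.

Method: Builds a comic-based benchmark of over 2000 images and over 18000 QA pairs covering three task levels: basic visual recognition, cultural conflict understanding, and cultural content generation.

Result: Evaluation of 11 open-source multimodal large language models reveals a significant gap to human performance, showing that C$^3$B is a substantial challenge for current models.

Insight: With comic-based multicultural scenes and tiered tasks, C$^3$B evaluates cultural awareness effectively and points to directions for future research.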

Abstract: Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images. Each real-world image typically contains one culture, making these benchmarks relatively easy for MLLMs. Based on this, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask and multilingual cultural awareness benchmark. C$^3$B comprises over 2000 images and over 18000 QA pairs, constructed on three tasks of progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We conducted evaluations on 11 open-source MLLMs, revealing a significant performance gap between MLLMs and human performance. The gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.

[48] Beyond the Prompt: Gender Bias in Text-to-Image Models, with a Case Study on Hospital Professions

Franck Vandewiele,Remi Synave,Samuel Delepoulle,Remi Cozot

Main category: cs.CV

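TL;DR: This paper examines gender representation bias in six state-of-the-art text-to-image (TTI) models, using hospital professions as a case study. All models reproduce gender stereotypes, but they differ in how they behave and in how sensitive they are to prompt wording.

Details Motivation: As TTI models spread through professional, educational, and creative settings, the social biases embedded in their outputs become more consequential. The paper aims to expose systematic gender representation bias in these models and to suggest improvements.

Contribution: The main contributions are: 1) a systematic analysis of gender bias in six TTI models for hospital professions; 2) evidence that prompt wording plays a critical role in gender representation; 3) design recommendations for reducing bias.

Method: Using carefully designed prompts, the authors generate images for five hospital professions with each of the six models, compare the models, and analyze how different portrait qualifiers affect gender balance.

Result: All models show occupational gender stereotypes (nurses exclusively as women, surgeons predominantly as men), but sensitivity to prompt wording varies widely across models.

Insight: Gender bias in TTI models is both systematic and model-specific; prompt design and model defaults strongly shape the diversity of what is generated.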

Abstract: Text-to-image (TTI) models are increasingly used in professional, educational, and creative contexts, yet their outputs often embed and amplify social biases. This paper investigates gender representation in six state-of-the-art open-weight models: HunyuanImage 2.1, HiDream-I1-dev, Qwen-Image, FLUX.1-dev, Stable-Diffusion 3.5 Large, and Stable-Diffusion-XL. Using carefully designed prompts, we generated 100 images for each combination of five hospital-related professions (cardiologist, hospital director, nurse, paramedic, surgeon) and five portrait qualifiers (“”, corporate, neutral, aesthetic, beautiful). Our analysis reveals systematic occupational stereotypes: all models produced nurses exclusively as women and surgeons predominantly as men. However, differences emerge across models: Qwen-Image and SDXL enforce rigid male dominance, HiDream-I1-dev shows mixed outcomes, and FLUX.1-dev skews female in most roles. HunyuanImage 2.1 and Stable-Diffusion 3.5 Large also reproduce gender stereotypes but with varying degrees of sensitivity to prompt formulation. Portrait qualifiers further modulate gender balance, with terms like corporate reinforcing male depictions and beautiful favoring female ones. Sensitivity varies widely: Qwen-Image remains nearly unaffected, while FLUX.1-dev, SDXL, and SD3.5 show strong prompt dependence. These findings demonstrate that gender bias in TTI models is both systematic and model-specific. Beyond documenting disparities, we argue that prompt wording plays a critical role in shaping demographic outcomes. The results underscore the need for bias-aware design, balanced defaults, and user guidance to prevent the reinforcement of occupational stereotypes in generative AI.

[49] Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models

Xiaotian Zou

Main category: cs.CV

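TL;DR: This paper exposes a security vulnerability around prompt templates in multimodal large language model (MLLM) workflows and proposes RLStealer, a reinforcement learning framework that steals a prompt template from a small set of example images, with strong performance at low cost.

Details Motivation: With the growth of text-to-image applications and prompt trading markets, prompt template theft has become a largely unexamined security risk. The paper aims to expose and characterize this risk.

Contribution: Proposes the RLStealer framework, which applies reinforcement learning to prompt template stealing, markedly lowering attack cost while improving performance.

Method: Template stealing is cast as a sequential decision-making problem, with multiple similarity-based feedback signals serving as reward functions to explore the prompt space efficiently.

Result: On publicly available benchmarks, RLStealer achieves state-of-the-art performance, cuts total attack cost to under 13% of that of existing baselines, and generalizes across image styles.

Insight: The study highlights a security threat inherent in prompt trading and lays groundwork for protective standards in the emerging MLLM marketplace.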

Abstract: Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement-learning-based prompt inversion framework that recovers a prompt template from only a small set of example images. RLStealer treats template stealing as a sequential decision-making problem and employs multiple similarity-based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLMs marketplace.

[50] Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations

Sihao Ding,Santosh Vasa,Aditi Ramadwar

Main category: cs.CV

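TL;DR: The paper proposes EDCT, an automated verification procedure that tests whether the explanations generated by a vision-language model actually reflect the causal factors behind its predictions.

Details Motivation: Explanations generated by Vision-Language Models (VLMs) can sound plausible yet be unfaithful, which poses technical and governance risks, so a method is needed to verify their faithfulness.

Contribution: Introduces EDCT, which treats the model's own explanation as a falsifiable hypothesis, automatically generates counterfactual edits, and computes a consistency score to assess the explanation's faithfulness.

Method: EDCT has four steps: obtain the model's answer and explanation; parse the explanation into testable visual concepts; generate targeted counterfactual edits via generative inpainting; and compute a consistency score from LLM-assisted analysis of how the answer and explanation change.

Result: Across 120 OK-VQA examples and multiple VLMs, EDCT reveals substantial faithfulness gaps and produces regulator-aligned audit artifacts.

Insight: Explanations generated by current VLMs can have serious faithfulness problems. EDCT offers a practical automated verification tool that can improve model transparency and trustworthiness.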

Abstract: Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model’s own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model’s answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.

[51] HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Xianjie Liu,Yiman Hu,Yixiong Zou,Liang Wu,Jian Xu,Bo Zheng

Main category: cs.CV

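TL;DR: HiDe is a training-free Hierarchical Decoupling Framework that uses Token-wise Attention Decoupling and Layout-Preserving Decoupling to remove complex background interference in high-resolution MLLMs, setting a new SOTA.

Details Motivation: Difficulty recognizing small objects in high-resolution images is usually attributed to perceptual limits, but the authors find that the real problem is complex background interference. Existing “zoom in” strategies fall short, so a new method is needed to remove the interference and improve performance.

Contribution: 1) Shows that the real limit on high-resolution MLLM performance is background interference rather than object size; 2) proposes the HiDe framework, which extracts information efficiently through hierarchical decoupling; 3) sets a new SOTA on multiple benchmarks.

Method: 1) Token-wise Attention Decoupling (TAD) decouples the question tokens and identifies the key information tokens; 2) Layout-Preserving Decoupling (LPD) separates the target regions from the background while preserving spatial layout.

Result: HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K (for example, Qwen2.5-VL 7B and InternVL3 8B reach 92.1% and 91.6% on V*Bench) and uses 75% less memory than the previous training-free approach.

Insight: Background interference, not object size as commonly assumed, is the key factor limiting high-resolution MLLM performance. Decoupling helps the model capture the key information.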

Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects, and therefore adopt “zoom in” strategies for better detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this “zoom in” operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.

[52] FSDENet: A Frequency and Spatial Domains based Detail Enhancement Network for Remote Sensing Semantic Segmentation

Jiahao Fu,Yinfeng Yu,Liejun Wang

Main category: cs.CV

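TL;DR: FSDENet combines the frequency and spatial domains, using the FFT and a wavelet transform to improve semantic segmentation of remote sensing images, with particular gains at boundaries and in regions of grayscale variation.

Details Motivation: Addresses the semantic edge ambiguity that grayscale variations (such as shadows and low-contrast regions) cause in remote sensing image segmentation.

Contribution: Proposes FSDENet, a dual-domain detail enhancement network that combines multi-scale spatial features with global frequency-domain information and substantially improves segmentation accuracy in boundary regions.

Method: 1. Uses the FFT to extract global and frequency-domain information; 2. uses the Haar wavelet transform to decompose features into high- and low-frequency components; 3. integrates spatial granularity with frequency-domain edge sensitivity for dual-domain synergy.

Result: Achieves SOTA performance on four datasets: LoveDA, Vaihingen, Potsdam, and iSAID.

Insight: Combining global frequency-domain information with multi-scale spatial features makes remote sensing segmentation more robust in complex scenes.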

Abstract: To fully leverage spatial information for remote sensing image segmentation and address semantic edge ambiguities caused by grayscale variations (e.g., shadows and low-contrast regions), we propose the Frequency and Spatial Domains based Detail Enhancement Network (FSDENet). Our framework employs spatial processing methods to extract rich multi-scale spatial features and fine-grained semantic details. By effectively integrating global and frequency-domain information through the Fast Fourier Transform (FFT) in global mappings, the model’s capability to discern global representations under grayscale variations is significantly strengthened. Additionally, we utilize Haar wavelet transform to decompose features into high- and low-frequency components, leveraging their distinct sensitivity to edge information to refine boundary segmentation. The model achieves dual-domain synergy by integrating spatial granularity with frequency-domain edge sensitivity, substantially improving segmentation accuracy in boundary regions and grayscale transition zones. Comprehensive experimental results demonstrate that FSDENet achieves state-of-the-art (SOTA) performance on four widely adopted datasets: LoveDA, Vaihingen, Potsdam, and iSAID.
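
To make the Haar decomposition step concrete, the following is a minimal NumPy sketch of one level of a 2D orthonormal Haar transform applied to a single feature map. It is not taken from the FSDENet code; the function name `haar_dwt2` and the band names are assumptions chosen for illustration, and the network itself presumably applies the transform to learned multi-channel features.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D orthonormal Haar transform (illustrative sketch).

    x: array of shape (H, W) with even H and W.
    Returns four (H/2, W/2) bands: a low-frequency average and three
    high-frequency detail bands.
    """
    x = np.asarray(x, dtype=float)
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # smooth content
    row_diff = (a + b - c - d) / 2.0  # top row minus bottom row
    col_diff = (a - b + c - d) / 2.0  # left column minus right column
    diag = (a - b - c + d) / 2.0      # diagonal detail
    return low, row_diff, col_diff, diag
```

On an image containing a single vertical step edge, only `col_diff` is non-zero at the edge location, which illustrates why the high-frequency bands are useful cues for boundary refinement.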

[53] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Sheng Yang,Tong Zhan,Guancheng Chen,Yanfeng Lu,Jian Wang

Main category: cs.CV

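TL;DR: The paper proposes Max-V1, a new end-to-end autonomous driving framework that recasts trajectory planning as next waypoint prediction and achieves strong trajectory prediction with a single-pass vision-language model.

Details Motivation: Existing autonomous driving methods typically rely on multi-stage processing or complex model designs, which adds computational burden and limits generalization. The paper aims for a lean yet effective framework that predicts trajectories directly through language-style generation.

Contribution: 1. Recasts autonomous driving as a language generation problem; 2. proposes Max-V1, a single-pass end-to-end framework; 3. designs a supervision strategy derived from statistical modeling that provides a well-defined learning objective.

Method: Uses the generative capacity of a Vision-Language Model (VLM) to predict trajectories directly from front-view camera input, and learns complex driving policies by imitation learning from large-scale expert demonstrations.

Result: Achieves SOTA performance on the nuScenes dataset with an overall improvement of over 30% compared to prior baselines, and generalizes well to cross-domain datasets.

Insight: Simplifying trajectory prediction into a language generation task can improve efficiency and generalization, offering a new research direction for autonomous driving.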

Abstract: In this work, we reconceptualize autonomous driving as a generalized language and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the VLM (Vision-Language Model) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework well suited to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Due to these empirical strengths, this work introduces a model enabling fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.

[54] OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

Jiancong Xie,Wenjin Wang,Zhuomeng Zhang,Zihan Liu,Qi Liu,Ke Feng,Zixun Sun,Yuedong Yang

Main category: cs.CV

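TL;DR: OIG-Bench is a benchmark for evaluating how well Multimodal Large Language Models (MLLMs) understand One-Image Guides, built with a semi-automated annotation method in which multiple agents collaborate.

Details Motivation: Although MLLMs show strong multimodal understanding, their ability to understand One-Image Guides, a visual format that combines text, imagery, and symbols, has not been studied sufficiently. This motivates a dedicated evaluation tool.

Contribution: 1. Proposes OIG-Bench, covering One-Image Guide understanding tasks across multiple domains; 2. develops a multi-agent semi-automated annotation method that reduces manual annotation cost; 3. evaluates 29 MLLMs and exposes their weaknesses in semantic understanding and logical reasoning.

Method: 1. Generates image-text pairs with a multi-agent semi-automated annotation pipeline; 2. builds a test set of One-Image Guides that combine text, imagery, and symbols; 3. evaluates models across multiple domains on semantic understanding and logical reasoning tasks.

Result: Qwen2.5-VL-72B performs best, with an overall accuracy of 77%, but all models show notable weaknesses in semantic understanding and logical reasoning. The multi-agent annotation system outperforms all MLLMs in image captioning.

Insight: 1. Current MLLMs still struggle with complex visual-text relationships; 2. multi-agent annotation offers an efficient tool for future dataset construction; 3. OIG-Bench provides a useful reference for improving MLLM understanding.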

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.

[55] Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning

Chenhui Xu,Fuxun Yu,Michael J. Bianco,Jacob Kovarskiy,Raphael Tang,Qi Zhang,Zirui Xu,Will LeVine,Brandon Dubbs,Heming Liao,Cassandra Burgess,Suvam Bag,Jay Patravali,Rupanjali Kukal,Mikael Figueroa,Rishi Madhok,Nikolaos Karianakis,Jinjun Xiong

Main category: cs.CV

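TL;DR: Geo-R1 is a post-training framework focused on geospatial reasoning that strengthens the geographic reasoning of Vision-Language Models (VLMs) through two stages, thinking scaffolding and elevating.

Details Motivation: Existing vision-language models perform poorly on geospatial reasoning tasks, and human-annotated reasoning data is expensive. Geo-R1 aims to improve geographic reasoning at low cost using automatically generated chain-of-thought data and reinforcement learning.

Contribution: 1) Proposes Geo-R1, the first post-training framework dedicated to geospatial reasoning; 2) uses synthetic chain-of-thought data for supervised fine-tuning, avoiding costly human annotation; 3) introduces weakly supervised reinforcement learning on cross-view pairing, which supplies a verifiable reward signal.

Method: 1) Scaffolding stage: supervised fine-tuning on synthetic chain-of-thought data connects visual cues with geographic priors; 2) elevating stage: GRPO-based reinforcement learning on a cross-view pairing proxy task strengthens cross-modal feature fusion and reasoning.

Result: Geo-R1 achieves state-of-the-art performance on multiple geospatial reasoning benchmarks, and the model is publicly released.

Insight: Geo-R1 shows that synthetic data plus reinforcement learning can improve geographic reasoning without costly human annotation, which suggests an approach for similar tasks.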

Abstract: We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a “geospatial thinking paradigm” via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised finetuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
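
The abstract does not give the training details, but the core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step. The binary reward (1 if the model picks the correct cross-view match, 0 otherwise) is an assumption for illustration, not the reward defined by the Geo-R1 authors.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    rewards: sequence of scalar rewards for responses to the same prompt.
    Returns one advantage per response: (r - mean) / (std + eps).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of four responses to one cross-view pairing question:
# 1.0 means the response selected the correct match, 0.0 means it did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.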

[56] Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks

Hanjiang Hu,Bowei Li,Ziwei Wang,Tianhao Wei,Casidhe Hutchison,Eric Sample,Changliu Liu

Main category: cs.CV

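TL;DR: The paper proposes a robust pruning method that improves the certifiable semantic robustness of deep neural networks, based on an analysis of neuron stability and variance that yields a new metric and pruning strategy.

Details Motivation: Deep neural networks are widely used in vision and robotics, but verifying their robustness against semantic transformation perturbations is hampered by over-parameterization, which hurts both tightness and scalability.

Contribution: 1. Proposes the Unbiased and Smooth Neuron (USN) metric; 2. introduces a USN-based pruning method that removes low-USN neurons and retains high-USN ones; 3. designs a Wasserstein distance loss to improve the pruning process.

Method: Defines the USN metric from an analysis of neuron stability and variance, prunes low-USN neurons while keeping high-USN neurons, and uses a Wasserstein distance loss to further improve pruning.

Result: On robust keypoint detection under brightness and contrast perturbations, the method outperforms baselines in both certified robustness and efficiency.

Insight: Pruning not only reduces over-parameterization but can also improve semantic robustness by retaining robust neurons. The Wasserstein distance loss helps concentrate the pruned neurons across layers.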

Abstract: Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify its robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders the tightness and scalability due to the over-complicated neural networks. To this end, we first analyze stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.

[57] EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations

Jiayi Liu,Jiaming Zhou,Ke Ye,Kun-Yu Lin,Allan Wang,Junwei Liang

Main category: cs.CV

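TL;DR: The paper introduces EgoTraj-Bench, the first real-world benchmark that pairs noisy first-person observations with bird's-eye-view future trajectories, and proposes BiFlow, a dual-stream flow matching model with an EgoAnchor mechanism that markedly improves the robustness of trajectory prediction.

Details Motivation: Existing trajectory prediction methods are usually trained on idealized observation histories and ignore the perceptual noise inherent in first-person vision (such as occlusions, ID switches, and tracking drift), which leaves models insufficiently robust in deployment.

Contribution: 1. Proposes EgoTraj-Bench, the first real-world benchmark that pairs noisy first-person observations with clean future trajectories; 2. proposes BiFlow, which uses dual-stream flow matching and the EgoAnchor mechanism to denoise historical observations and predict future motion at the same time.

Method: BiFlow uses a dual-stream design over a shared latent representation to denoise historical observations and forecast future trajectories concurrently. The EgoAnchor mechanism conditions the prediction decoder on distilled historical features via feature modulation to better model agent intent.

Result: Experiments show that BiFlow reduces minADE and minFDE by 10-15% on average, achieving SOTA performance and higher robustness.

Insight: The paper stresses the importance of first-person perceptual noise in real-world settings and shows that jointly designing denoising and prediction can markedly improve the robustness of trajectory prediction models.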

Abstract: Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird’s-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
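
For readers unfamiliar with the metrics, minADE and minFDE score a set of K candidate futures by the best candidate: ADE averages the point-wise Euclidean error over the horizon, FDE takes the error at the final step, and each is minimized over the K candidates. The sketch below follows this common definition; the benchmark's exact evaluation protocol (number of candidates, horizon, units) is not stated in the abstract, so those details are assumptions.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """Compute minADE and minFDE for one agent.

    pred: (K, T, 2) array of K candidate future trajectories.
    gt:   (T, 2) array with the ground-truth future trajectory.
    Returns (minADE, minFDE), each minimized independently over K.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) point-wise errors
    ade = dist.mean(axis=1)  # average error per candidate
    fde = dist[:, -1]        # final-step error per candidate
    return float(ade.min()), float(fde.min())
```

Because the two minima are taken independently, the candidate that yields minADE need not be the one that yields minFDE.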

[58] David and Goliath in Medical Vision: Convolutional Networks vs Biomedical Vision Language Models

Ran Tong,Jiaqi Liu,Su Liu,Jiexi Xu,Lanruo Wang,Tong Wang

Main category: cs.CV

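TL;DR: The paper compares a lightweight supervised CNN with the zero-shot medical vision-language model BiomedCLIP on pneumonia and tuberculosis detection, and finds that simple decision threshold calibration lets BiomedCLIP surpass or approach the supervised CNN.

Details Motivation: Accurate automated interpretation of medical images is critical. The paper examines the performance gap between supervised CNNs and zero-shot VLMs in medical image analysis and aims to show how to unlock the VLM's potential.

Contribution: Shows the performance gain BiomedCLIP obtains from decision threshold calibration, demonstrating that a zero-shot VLM can match or even exceed a supervised CNN on medical tasks.

Method: Compares a supervised CNN and BiomedCLIP on two datasets (PneumoniaMNIST and Shenzhen TB) and optimizes the classification threshold on a validation set to improve the VLM.

Result: After calibration, BiomedCLIP reaches an F1-score of 0.8841 on pneumonia detection (above the CNN's 0.8803) and improves from 0.4812 to 0.7684 on tuberculosis detection (close to the CNN's 0.7834).

Insight: Zero-shot VLMs have considerable potential on medical tasks but need threshold calibration to reach it, which gives future work a simple and effective optimization step.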

Abstract: The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN’s 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline’s 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
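
Decision threshold calibration, as described above, amounts to sweeping candidate thresholds over validation scores and keeping the one with the highest F1. The following is a minimal sketch under stated assumptions: `scores` stands for the model's positive-class score per image (for a CLIP-style model, for example, the softmax probability of the disease prompt), and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 label arrays; returns 0.0 when undefined."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_threshold(scores, labels):
    """Pick the threshold on validation scores that maximizes F1.

    A sample is predicted positive when score >= threshold.
    Returns (best_threshold, best_f1).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):  # every distinct score is a candidate cut
        f1 = f1_score(labels, (scores >= t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```

The chosen threshold is then fixed and applied to the test set. Zero-shot scores are often poorly centered around 0.5, which is one plausible reason the default threshold underperforms.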

[59] PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

Zikang Liu,Junyi Li,Wayne Xin Zhao,Dawei Gao,Yaliang Li,Ji-rong Wen

Main category: cs.CV

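TL;DR: PAL-UI is a framework that lets vision-based GUI agents actively look back at past observations to overcome memory limits in long-horizon tasks, and it substantially improves mobile GUI navigation performance.

Details Motivation: GUI agents driven by Multimodal Large Language Models (MLLMs) face memory limits in long-horizon tasks. Existing methods either truncate the history or rely on simple textual summaries, which can lose visual details that later decisions depend on.

Contribution: Proposes the PAL-UI framework, which combines a dual-level summarization agent with a dedicated retrieval tool so that GUI agents can adaptively retrieve past observations when needed.

Method: PAL-UI uses a dual-level summarization agent (capturing observation-level cues and action-level outcomes) and a dedicated retrieval tool that lets the agent recall specific historical screenshots during planning.

Result: Experiments show that PAL-UI significantly outperforms baseline models and prior methods on mobile GUI navigation and generalizes across domains, improving web navigation without additional training.

Insight: Active memory retrieval can substantially improve the long-horizon planning ability of vision-based GUI agents.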

Abstract: Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train PAL-UI-3B and PAL-UI-7B models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.

[60] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Zhaoyang Li,Dongjun Qian,Kai Su,Qishuai Diao,Xiangyang Xia,Chang Liu,Wenfei Yang,Tianzhu Zhang,Zehuan Yuan

Main category: cs.CV

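TL;DR: BindWeave is a unified framework for subject-consistent video generation via cross-modal integration, using an MLLM-DiT architecture to handle the difficulty of parsing prompts in multi-subject scenes.

Details Motivation: Existing video generation models struggle to maintain subject consistency and complex spatial relationships in multi-subject scenes. BindWeave aims to close this gap.

Contribution: Proposes an MLLM-DiT framework in which a multimodal large language model performs cross-modal reasoning to improve the quality of subject-consistent video generation.

Method: Combines a pretrained Multimodal Large Language Model (MLLM) with a Diffusion Transformer (DiT) to perform cross-modal reasoning and disentangle roles, attributes, and interactions.

Result: On the OpenS2V benchmark, BindWeave outperforms existing open-source and commercial models in subject consistency, naturalness, and text relevance.

Insight: Cross-modal reasoning and disentanglement help a model understand the complex semantics of multi-subject video generation.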

Abstract: Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.

[61] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal,Heitor R. Medeiros,Marco Pedersoli,Eric Granger

Main category: cs.CV

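TL;DR: The paper proposes VLOD-TTA, a test-time adaptation framework that improves Vision-Language Object Detectors (VLODs) under domain shift. An IoU-weighted entropy objective and image-conditioned prompt selection yield clear gains for YOLO-World and Grounding DINO.

Details Motivation: Vision-language object detectors perform well at zero-shot recognition, but their performance degrades under domain shift. The paper addresses this with a test-time adaptation framework.

Contribution: 1. Proposes an IoU-weighted entropy objective that focuses adaptation on spatially coherent proposal clusters; 2. introduces image-conditioned prompt selection, which ranks prompts by image-level compatibility; 3. validates the method under a range of domain shifts.

Method: 1. Uses the IoU-weighted entropy objective to reduce confirmation bias from isolated boxes; 2. uses image-conditioned prompt selection to fuse the most informative prompts with the detector logits.

Result: Under diverse distribution shifts (stylized domains, driving scenes, low-light conditions, and common corruptions), VLOD-TTA consistently improves YOLO-World and Grounding DINO over the zero-shot and TTA baselines.

Insight: Test-time adaptation (TTA) can mitigate the performance drop of vision-language object detectors in unseen domains, and spatial coherence and prompt compatibility are the key levers.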

Abstract: Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts – including stylized domains, driving scenes, low-light conditions, and common corruptions – shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code : https://github.com/imatif17/VLOD-TTA
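
The abstract does not give the formula for the IoU-weighted entropy objective, so the following is only one plausible reading, stated as an assumption: each proposal's prediction entropy is weighted by its total IoU with the other proposals, so boxes inside dense, overlapping clusters dominate the loss and an isolated box contributes little or nothing. The function names are hypothetical, and NumPy is used for clarity; an actual adaptation loop would need a differentiable framework.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for boxes given as (N, 4) rows of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def iou_weighted_entropy(boxes, probs):
    """Entropy of class probabilities, weighted by overlap with other boxes.

    boxes: (N, 4) proposals; probs: (N, C) class probabilities per proposal.
    Assumed weighting: w_i is proportional to the sum of IoU(i, j) over j != i.
    """
    boxes = np.asarray(boxes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)             # ignore self-overlap
    weights = iou.sum(axis=1)
    weights = weights / max(weights.sum(), 1e-12)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return float((weights * entropy).sum())
```

Under this weighting, a box that overlaps nothing gets zero weight, so its (possibly wrong) prediction cannot drive the update, which matches the stated goal of reducing confirmation bias from isolated boxes.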

[62] MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles

Yuheng Ji,Huajie Tan,Cheng Chi,Yijie Xu,Yuting Zhao,Enshen Zhou,Huaihai Lyu,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Xiaolong Zheng

Main category: cs.CV

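TL;DR: MathSticks is a new benchmark for Visual Symbolic Compositional Reasoning (VSCR) that uses matchstick puzzles to test visual perception, symbolic manipulation, and arithmetic consistency. It includes text-guided and purely visual settings. An evaluation of 14 vision-language models shows that current models perform poorly on many tasks while humans do well.

Details Motivation: Current vision-language models are limited on complex compositional reasoning, especially on tasks that require handling visual and symbolic information together. MathSticks aims to fill this gap with a systematic testbed.

Contribution: 1. Releases the MathSticks benchmark, with 1.4M generated instances and a curated test set; 2. systematically evaluates existing models and identifies their limits on complex tasks; 3. shows that human performance far exceeds model performance, pointing to directions for future work.

Method: Generates matchstick equation puzzles in which an incorrect equation must be corrected under strict conservation rules, covering digit scale, move complexity, solution multiplicity, and operator variation. The benchmark includes text-guided and purely visual settings.

Result: Among the 14 models, closed-source models handle only simple cases and open-source models perform even worse in the purely visual setting, while humans exceed 90% accuracy.

Insight: MathSticks exposes the weakness of current models on compositional reasoning, particularly on tasks that combine vision and symbols. Future models need stronger cross-modal integration.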

Abstract: We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision–language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.

[63] Normal-Abnormal Guided Generalist Anomaly Detection

Yuexin Wang,Xiaolei Wang,Yizheng Gong,Jimin Xiao

Main category: cs.CV

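TL;DR: The paper proposes NAGL, a new framework that uses both normal and abnormal samples as references to improve Generalist Anomaly Detection (GAD).

Details Motivation: Existing GAD methods rely only on normal samples as references and ignore the valuable information in the anomalous samples that are often available in practice, which limits cross-domain anomaly detection.

Contribution: The first work to use a mixture of normal and abnormal samples as references in GAD. It designs the NAGL framework with two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL).

Method: RM extracts abnormal patterns from normal-abnormal reference residuals to build transferable anomaly representations. AFL adaptively learns anomaly features in query images through residual mapping, enabling instance-aware anomaly detection.

Result: The method significantly outperforms existing GAD approaches on multiple benchmarks.

Insight: Adding anomalous samples enriches the reference information and improves the accuracy and efficiency of cross-domain anomaly detection, opening a new direction for GAD.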

Abstract: Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.

[64] Affordance-Guided Diffusion Prior for 3D Hand Reconstruction

Naru Suzuki,Takehiko Ohkawa,Tatsuro Banno,Jihyun Lee,Ryosuke Furuta,Yoichi Sato

Main category: cs.CV

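TL;DR: The paper proposes an affordance-guided diffusion prior for 3D hand pose reconstruction under severe occlusion, using textual descriptions of hand-object interactions to produce more accurate poses.

Details Motivation: Under severe occlusion, conventional methods struggle to reconstruct 3D hand poses accurately. Humans resolve such ambiguity by using an object's function and shape (its affordance). Inspired by this, the paper proposes a generative prior that incorporates affordances.

Contribution: 1. Proposes an affordance-guided diffusion prior for refining 3D hand poses; 2. uses a large Vision-Language Model (VLM) to infer textual descriptions of hand-object interactions; 3. demonstrates effectiveness on HOGraspNet, a dataset with severe occlusions.

Method: 1. Generates affordance-aware textual descriptions with a VLM; 2. trains a diffusion generative model to learn the distribution of hand poses conditioned on those descriptions; 3. uses the diffusion model to refine occluded regions into more accurate and functionally coherent hand poses.

Result: On HOGraspNet, affordance-guided refinement significantly outperforms recent regression methods and diffusion-based refinement that lacks contextual reasoning.

Insight: Affordance context can substantially improve 3D hand pose quality under occlusion, which shows the potential of generative models for pose reconstruction.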

Abstract: How can we reconstruct 3D hand poses when large portions of the hand are heavily occluded by itself or by objects? Humans often resolve such ambiguities by leveraging contextual knowledge – such as affordances, where an object’s shape and function suggest how the object is typically grasped. Inspired by this observation, we propose a generative prior for hand pose refinement guided by affordance-aware textual descriptions of hand-object interactions (HOI). Our method employs a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions, which are inferred from a large vision-language model (VLM). This enables the refinement of occluded regions into more accurate and functionally coherent hand poses. Extensive experiments on HOGraspNet, a 3D hand-affordance dataset with severe occlusions, demonstrate that our affordance-guided refinement significantly improves hand pose estimation over both recent regression methods and diffusion-based refinement lacking contextual reasoning.

[65] Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Zichen Wen,Shaobo Wang,Yufa Zhou,Junyuan Zhang,Qintong Zhang,Yifeng Gao,Zhaorun Chen,Bin Wang,Weijia Li,Conghui He,Linfeng Zhang

Main category: cs.CV

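TL;DR: EPIC is a progressive consistency distillation framework that decomposes the feature-space perturbation caused by visual token compression and introduces token consistency and layer consistency distillation, improving the efficiency and training of multi-modal large models.

Details Motivation: Visual tokens consume substantial compute in multi-modal large models. Existing methods compress tokens to improve efficiency but overlook the feature-space perturbation and the added training difficulty that compression introduces.

Contribution: 1. Proposes EPIC, a progressive consistency distillation framework; 2. decomposes the feature-space perturbation introduced by token compression; 3. introduces token consistency distillation and layer consistency distillation.

Method: Token consistency distillation and layer consistency distillation reduce the training difficulty caused by compression, using guidance from a teacher model along a progressive learning trajectory.

Result: Experiments show that EPIC is effective, robust, and generalizes well.

Insight: Decomposing the perturbation and learning progressively can substantially lower training difficulty and help the model adapt to token compression.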

Abstract: Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model’s parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.

[66] Assessing Foundation Models for Mold Colony Detection with Limited Training Data

Henrik Pichler,Janis Keuper,Matthew Copping

Main category: cs.CV

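TL;DR: The paper studies how foundation models such as MaskDINO perform on mold colony detection with limited training data and finds that they can compete with traditional methods such as YoloV9 using only a small amount of annotated data.

Details Motivation: Mold colony detection in microbiology usually depends on large annotated datasets, which are time-consuming and labor-intensive to produce. The paper asks whether foundation models can match traditional methods with far less annotation.

Contribution: 1. Builds a mold colony dataset of 5000 annotated images; 2. compares three vision foundation models with traditional methods and shows that foundation models remain reliable in few-shot settings (for example, 25 images); 3. shows that MaskDINO fine-tuned on only 150 images approaches the results of an extensively trained YoloV9.

Method: 1. Builds a dataset with both traditional annotations and few-shot subsets; 2. benchmarks three foundation models (including MaskDINO) and a traditional method (YoloV9) on task-specific metrics; 3. evaluates few-shot (25 images) and low-shot (150 images) scenarios.

Result: MaskDINO fine-tuned on only 150 images comes close to an extensively trained YoloV9 and, with 25 images, is still reliable on approximately 70% of the samples.

Insight: Foundation models such as MaskDINO perform well in few-shot settings and can substantially reduce annotation needs, speeding up the development and iteration of automated microbiological systems. This supports their adoption in practice.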

Abstract: The process of quantifying mold colonies on Petri dish samples is of critical importance for the assessment of indoor air quality, as high colony counts can indicate potential health risks and deficiencies in ventilation systems. Conventionally, the automation of such a labor-intensive process, as well as other tasks in microbiology, relies on the manual annotation of large datasets and the subsequent extensive training of models like YoloV9. To demonstrate that exhaustive annotation is no longer a prerequisite when tackling a new vision task, we compile a representative dataset of 5000 Petri dish images annotated with bounding boxes, simulating both a traditional data collection approach and few-shot and low-shot scenarios with well-curated subsets with instance-level masks. We benchmark three vision foundation models against traditional baselines on task-specific metrics, reflecting realistic real-world requirements. Notably, MaskDINO attains near-parity with an extensively trained YoloV9 model while finetuned only on 150 images, retaining competitive performance with as few as 25 images, still being reliable on $\approx$ 70% of the samples. Our results show that data-efficient foundation models can match traditional approaches with only a fraction of the required data, enabling earlier development and faster iterative improvement of automated microbiological systems, with a higher upper-bound performance than traditional models would achieve.

[67] Arbitrary Generative Video Interpolation

Guozhen Zhang,Haiguang Wang,Chunyu Wang,Yuan Zhou,Qinglin Lu,Limin Wang

Main category: cs.CV

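TL;DR: The paper proposes ArbInterp, a flexible generative video frame interpolation framework that supports interpolation at any timestamp and of any length, addressing the limits of existing methods in adjusting frame rate and sequence duration.

Details Motivation: Existing generative video frame interpolation methods can only synthesize a fixed number of intermediate frames and cannot flexibly adjust the frame rate or total duration, which limits the range of practical applications.

Contribution: 1) Proposes the ArbInterp framework, which supports interpolation at any timestamp and of any length; 2) Designs the Timestamp-aware Rotary Position Embedding (TaRoPE) for fine-grained timestamp control; 3) Proposes an appearance-motion decoupled conditioning strategy that ensures seamless spatiotemporal transitions across segments.

Method: 1) Uses TaRoPE to modulate positions in temporal RoPE so that generated frames align with target timestamps; 2) Decomposes long-sequence generation into segment-wise frame synthesis; 3) Uses prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence.

Result: Experiments show that ArbInterp outperforms existing methods on multi-scale frame interpolation (2x to 32x), with higher fidelity and more seamless spatiotemporal continuity.

Insight: Decoupling appearance from motion and combining this with segment-wise generation enables more flexible, higher-quality video frame interpolation.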

Abstract: Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
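
The abstract describes TaRoPE only at a high level: positions in the temporal RoPE are modulated so that frames line up with normalized timestamps. The sketch below is not the authors' implementation. It shows standard rotary embeddings evaluated at fractional positions, under the assumption (ours, not the paper's) that a normalized timestamp t in [0, 1] maps linearly onto a grid of `num_positions` temporal slots. The helper names `timestamp_to_position`, `rope_angles`, and `apply_rope` are hypothetical.

```python
import math


def timestamp_to_position(t, num_positions):
    """Hypothetical mapping: normalized timestamp t in [0, 1] -> fractional slot index."""
    return t * (num_positions - 1)


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE frequencies, evaluated at a (possibly fractional) position."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the RoPE angles."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


# A frame requested at t = 1/32 on a 9-slot grid lands at fractional position 0.25.
query = apply_rope([1.0, 0.0, 0.5, -0.5], timestamp_to_position(1 / 32, 9))
```

Because the rotation angle is linear in the position, the attention score between two rotated vectors depends only on the difference of their positions, which is what makes non-integer timestamps meaningful in this formulation.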

[68] Color Models in Image Processing: A Review and Experimental Comparison

Muragul Muratbekova,Nuray Toganas,Ayan Igali,Maksat Shagyrov,Elnara Kadyrgali,Adilet Yerkin,Pakizar Shamoi

Main category: cs.CV

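TL;DR: This paper reviews the color models used in image processing and compares them experimentally. It finds that the HS* family of color models is the most aligned with human visual perception, and it identifies limitations of existing models and directions for future research.

Details Motivation: Color representation is essential in computer vision and human-computer interaction, and the choice of color model strongly affects application results. The paper aims to provide a comprehensive review and experimental evaluation of color models to help researchers understand and choose a suitable one.

Contribution: 1. Reviews traditional color models (such as RGB, CMYK, and YUV) and perceptually uniform spaces (such as CIELAB and CIELUV) in detail; 2. Evaluates color models experimentally in terms of device dependency, chromatic consistency, and computational complexity; 3. Finds that the HS* family is the most aligned with human perception, and summarizes the strengths, limitations, and open challenges of existing models.

Method: 1. Theoretical analysis: examines the theoretical foundations and computational properties of the color models; 2. Experimental evaluation: compares the models on device dependency, chromatic consistency, and computational complexity.

Result: The HS* family of color models matched human visual perception most closely in the experiments. The experiments also revealed gaps in existing models, such as device dependency and chromatic consistency issues.

Insight: HS* models are a preferred choice for color processing because of their consistency with human perception. Future work needs to address the device dependency and computational efficiency of color models.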

Abstract: Color representation is essential in computer vision and human-computer interaction. There are multiple color models available. The choice of a suitable color model is critical for various applications. This paper presents a review of color models and spaces, analyzing their theoretical foundations, computational properties, and practical applications. We explore traditional models such as RGB, CMYK, and YUV, perceptually uniform spaces like CIELAB and CIELUV, and fuzzy-based approaches as well. Additionally, we conduct a series of experiments to evaluate color models from various perspectives, like device dependency, chromatic consistency, and computational complexity. Our experimental results reveal gaps in existing color models and show that the HS* family is the most aligned with human perception. The review also identifies key strengths and limitations of different models and outlines open challenges and future directions. This study provides a reference for researchers in image processing, perceptual computing, digital media, and any other color-related field.
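
As a small illustration of why an HS* representation can be easier to reason about than RGB, the sketch below converts two shades of the same orange to HSV with Python's standard `colorsys` module. It is not taken from the paper's experiments; the two sample colors and the `describe` helper are our own choices for illustration.

```python
import colorsys


def describe(rgb):
    """Return (hue in degrees, saturation, value) for an RGB triple in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return round(h * 360), round(s, 2), round(v, 2)


bright_orange = (1.0, 0.5, 0.0)
dark_orange = (0.5, 0.25, 0.0)  # same color at half the intensity

# In RGB, both non-zero channels differ between the two shades.
# In HSV, hue and saturation are identical and only the value changes.
print(describe(bright_orange))
print(describe(dark_orange))
```

Darkening the color changes two RGB channels at once, while in HSV the change is confined to a single component, which is one informal reading of the claim that HS* models sit closer to how people describe color.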

[69] Multi-level Dynamic Style Transfer for NeRFs

Zesheng Li,Shuaibo Li,Wei Ma,Jianwei Guo,Hongbin Zha

Main category: cs.CV

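TL;DR: MDS-NeRF is a multi-level dynamic style transfer method that reengineers the NeRF pipeline for stylization and improves 3D style transfer through a dynamic style injection module and a multi-level feature adaptor.

Details Motivation: Existing NeRF style transfer methods typically integrate style statistics into the original NeRF pipeline, which leads to suboptimal content preservation and artistic stylization, so a more effective approach is needed.

Contribution: 1) Proposes MDS-NeRF, a multi-level dynamic style transfer method; 2) Designs a multi-level feature adaptor and a dynamic style injection module; 3) Extends the 3D style transfer method to support omni-view style transfer.

Method: 1) A multi-level feature adaptor generates a multi-level feature grid; 2) A dynamic style injection module adaptively integrates style features; 3) A multi-level cascade decoder produces the final stylized view.

Result: Experiments show that MDS-NeRF performs strongly on 3D style transfer, preserving multi-scale spatial structure while effectively transferring style characteristics.

Insight: Reengineering the NeRF pipeline and introducing dynamic style injection markedly improves the quality and flexibility of style transfer.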

Abstract: As the application of neural radiance fields (NeRFs) in various 3D vision tasks continues to expand, numerous NeRF-based style transfer techniques have been developed. However, existing methods typically integrate style statistics into the original NeRF pipeline, often leading to suboptimal results in both content preservation and artistic stylization. In this paper, we present multi-level dynamic style transfer for NeRFs (MDS-NeRF), a novel approach that reengineers the NeRF pipeline specifically for stylization and incorporates an innovative dynamic style injection module. Particularly, we propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field, effectively capturing the multi-scale spatial structure of the scene. In addition, we present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns. The stylized multi-level features are then transformed into the final stylized view through our proposed multi-level cascade decoder. Furthermore, we extend our 3D style transfer method to support omni-view style transfer using 3D style references. Extensive experiments demonstrate that MDS-NeRF achieves outstanding performance for 3D style transfer, preserving multi-scale spatial structures while effectively transferring stylistic characteristics.

[70] LVLMs as inspectors: an agentic framework for category-level structural defect annotation

Sheng Jiang,Yuanmin Ning,Bingxi Huang,Peiyin Chen,Zhaohui Chen

Main category: cs.CV

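TL;DR: The paper proposes ADPT, an autonomous defect annotation framework based on Large Vision-Language Models (LVLMs). Using semantic pattern matching and an iterative self-questioning refinement mechanism, it produces high-quality structural defect datasets without manual supervision.

Details Motivation: Manual annotation of structural defects is costly and inefficient, so an automated, efficient, low-cost method is needed.

Contribution: Proposes the ADPT framework, which combines LVLMs with a semantic pattern matching module to achieve high-quality defect annotation without human intervention.

Method: ADPT uses optimized domain-specific prompting, recursive verification, and an iterative self-questioning mechanism to turn raw visual data into annotated data.

Result: ADPT reaches up to 98% accuracy in distinguishing defective from non-defective images, 85%-98% annotation accuracy across four defect categories under class-balanced settings, and 80%-92% accuracy on class-imbalanced datasets.

Insight: ADPT offers a scalable, cost-effective way to build high-fidelity structural defect datasets, supporting downstream tasks such as transfer learning and domain adaptation.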

Abstract: Automated structural defect annotation is essential for ensuring infrastructure safety while minimizing the high costs and inefficiencies of manual labeling. A novel agentic annotation framework, Agent-based Defect Pattern Tagger (ADPT), is introduced that integrates Large Vision-Language Models (LVLMs) with a semantic pattern matching module and an iterative self-questioning refinement mechanism. By leveraging optimized domain-specific prompting and a recursive verification process, ADPT transforms raw visual data into high-quality, semantically labeled defect datasets without any manual supervision. Experimental results demonstrate that ADPT achieves up to 98% accuracy in distinguishing defective from non-defective images, and 85%-98% annotation accuracy across four defect categories under class-balanced settings, with 80%-92% accuracy on class-imbalanced datasets. The framework offers a scalable and cost-effective solution for high-fidelity dataset construction, providing strong support for downstream tasks such as transfer learning and domain adaptation in structural damage assessment.

[71] Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation

Yunbo Xu,Xuesong Zhang,Jia Li,Zhenzhen Hu,Richang Hong

Main category: cs.CV

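TL;DR: The paper proposes COFA, an online feature augmentation strategy that disentangles foreground and background features to improve vision-language navigation (VLN); experiments show it is effective and reaches state-of-the-art performance.

Details Motivation: In vision-language navigation, the foreground provides semantic cues and the background carries spatial connectivity information, but current methods have not fully explored using the two separately.

Contribution: Proposes the COFA strategy, which augments disentangled foreground and background features online to improve the generalization of navigation agents.

Method: Semantically enhanced landmark identification disentangles foreground and background features, and a consensus-driven online augmentation strategy dynamically adjusts feature preferences.

Result: On the REVERIE and R2R datasets, COFA boosts the generalization of the baseline model and attains state-of-the-art performance.

Insight: Disentangling and dynamically augmenting foreground and background features is key to improving VLN performance.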

Abstract: Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate the navigable generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.

[72] Robust Context-Aware Object Recognition

Klara Janouskova,Cristian Gavrus,Jiri Matas

Main category: cs.CV

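TL;DR: The paper proposes RCOR, a method that achieves robustness and context-awareness jointly. It treats localization as part of recognition, decouples object-centric and context-aware modelling, and combines them with a non-parametric fusion to improve model performance.

Details Motivation: Standard supervised learning tends to make models over-rely on background information (known as shortcut learning), which limits robustness in real-world deployment. Existing methods usually address this by suppressing the background, at the cost of context information.

Contribution: Proposes RCOR, described as the first approach that stays robust while still exploiting context, by combining localization with recognition to model complex scenes effectively.

Method: RCOR combines localization with recognition, decouples object-centric and context-aware modelling, and merges them through a non-parametric fusion. It applies to both supervised models and vision-language models (VLMs).

Result: Even without fine-tuning, RCOR improves model performance on datasets with in-domain and out-of-domain backgrounds, including complex scenes such as those in ImageNet-1k.

Insight: Localization can serve as a key aid to recognition; decoupled modelling with non-parametric fusion makes it possible to use object-centric and context information together, improving robustness and generalization.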

Abstract: In visual recognition, both the object of interest (referred to as foreground, FG, for simplicity) and its surrounding context (background, BG) play an important role. However, standard supervised learning often leads to unintended over-reliance on the BG, known as shortcut learning of spurious correlations, limiting model robustness in real-world deployment settings. In the literature, the problem is mainly addressed by suppressing the BG, sacrificing context information for improved generalization. We propose RCOR – Robust Context-Aware Object Recognition – the first approach that jointly achieves robustness and context-awareness without compromising either. RCOR treats localization as an integral part of recognition to decouple object-centric and context-aware modelling, followed by a robust, non-parametric fusion. It improves the performance of both supervised models and VLM on datasets with both in-domain and out-of-domain BG, even without fine-tuning. The results confirm that localization before recognition is now possible even in complex scenes as in ImageNet-1k.

[73] UCD: Unconditional Discriminator Promotes Nash Equilibrium in GANs

Mengfei Xia,Nan Xue,Jiapeng Zhu,Yujun Shen

Main category: cs.CV

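TL;DR: The paper proposes an unconditional discriminator (UCD). Removing the condition input from the discriminator makes it extract more comprehensive features, which promotes Nash equilibrium in GAN training and markedly improves generation quality.

Details Motivation: In practice, GAN training rarely converges properly and often falls into mode collapse, because feeding the condition into the discriminator (D) introduces redundant shortcuts that hinder meaningful knowledge extraction.

Contribution: Quantitatively analyzes the extent of Nash equilibrium in GAN training, and proposes the unconditional discriminator (UCD), which removes the condition input so that D extracts more comprehensive features, improving the stability and performance of GAN training.

Method: UCD uses a discriminator with no condition injection, avoiding the redundancy that conditioning introduces, so that it extracts more robust features and better supervises the generator (G). The method is compatible with vanilla GAN theory and can be used as a plug-in.

Result: On ImageNet-64, UCD achieves 1.47 FID, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models.

Insight: Removing the condition input from the discriminator can markedly improve GAN training, mitigating mode collapse and promoting Nash equilibrium, and points to a new direction for improving GANs.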

Abstract: Adversarial training turns out to be the key to one-step generation, especially for Generative Adversarial Network (GAN) and diffusion model distillation. Yet in practice, GAN training hardly converges properly and suffers from mode collapse. In this work, we quantitatively analyze the extent of Nash equilibrium in GAN training, and conclude that redundant shortcuts created by inputting the condition in $D$ disable meaningful knowledge extraction. We thereby propose to employ an unconditional discriminator (UCD), in which $D$ is enforced to extract more comprehensive and robust features with no condition injection. In this way, $D$ is able to leverage better knowledge to supervise $G$, which promotes Nash equilibrium in GAN literature. Theoretical guarantee on compatibility with vanilla GAN theory indicates that UCD can be implemented in a plug-in manner. Extensive experiments confirm the significant performance improvements with high efficiency. For instance, we achieved 1.47 FID on the ImageNet-64 dataset, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models. The code will be made publicly available.

[74] Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset

Yannick Hauri,Luca A. Lanzendörfer,Till Aczel

Main category: cs.CV

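TL;DR: The paper introduces the task of virtual fashion photo-shoot, which transforms standardized garment images into contextually grounded editorial imagery, and builds a large-scale dataset of garment-lookbook pairs.

Details Motivation: Traditional fashion image generation tasks (such as virtual try-on) are confined to simple scenes and cannot capture the dynamism and storytelling of editorial fashion imagery. The paper aims to fill this gap with a new task and dataset.

Contribution: 1. Introduces the virtual fashion photo-shoot task; 2. Builds the first large-scale dataset of garment-lookbook pairs; 3. Proposes an automated retrieval pipeline that aligns garments across domains.

Method: Designs an automated retrieval pipeline that combines visual-language reasoning with object-level localization to extract and match garment-lookbook pairs from different data sources.

Result: The dataset has three quality levels, high, medium, and low (10,000, 50,000, and 300,000 pairs respectively), providing ample material for model training.

Insight: The dataset supports traditional tasks and can also push fashion image generation toward more creative, narrative-driven results.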

Abstract: Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photo-shoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining visual-language reasoning with object-level localization. We construct a dataset with three garment-lookbook pair accuracy levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.

[75] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang,Zhaoxin Fan,Enhan Kang,Daiheng Gao,Yun Zhou,Yanxia Chang,Zheng Zhu,Yeying Jin,Wenjun Wu

Main category: cs.CV

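TL;DR: This paper proposes ReFlux, a concept attack method for the latest rectified flow-based text-to-image (T2I) framework, designed to assess the robustness of concept erasure strategies.

Details Motivation: Current T2I diffusion models raise safety concerns because they can generate harmful content. Existing concept erasure methods, and the attacks used to evaluate them, have limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. The paper addresses this gap.

Contribution: Proposes ReFlux, the first concept attack method designed specifically for rectified flow-based T2I frameworks, shows that existing erasure techniques rely on attention localization, and proposes an efficient attack strategy that targets it.

Method: A reverse-attention optimization strategy reactivates suppressed signals; it is combined with a velocity-guided dynamic and a consistency-preserving objective, which strengthen the robustness of the attack and the stability of the generated content.

Result: Experiments show that ReFlux is effective and efficient, providing a reliable benchmark for evaluating the robustness of concept erasure strategies.

Insight: The limitations of existing concept erasure techniques in rectified flow transformers stem from attention localization, and an attack that targets this property is markedly more effective.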

Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content. Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.

[76] OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding

Jieer Ouyang,Xiaoneng Xiang,Zheng Wang,Yangkai Ding

Main category: cs.CV

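TL;DR: OTTER is an open-set, multi-modal, multi-label tagging framework that combines the strengths of a predefined category set and user-driven open tags, using a multi-head attention architecture to produce dynamic, semantically consistent tags.

Details Motivation: Current multi-label tagging methods usually perform well on predefined labels but lack flexibility for open tags. OTTER aims to combine closed-set stability with open-vocabulary flexibility for multi-modal tagging.

Contribution: 1. Proposes the OTTER framework, which unifies predefined and open tags for multi-modal tagging; 2. Builds a large-scale, hierarchically organized multi-modal dataset; 3. Achieves semantically consistent label alignment through a multi-head attention architecture.

Method: OTTER uses a multi-head attention architecture to jointly align visual and textual representations with both fixed and open-set label embeddings. The dataset is built with a hybrid pipeline that combines automated vision-language labeling with human refinement.

Result: On two benchmark datasets, OTTER achieves overall F1 scores of 0.81 and 0.75, with near-perfect F1 on open-set labels (0.99 and 0.97), while remaining competitive on predefined labels.

Insight: OTTER shows how to balance closed-set stability with open-vocabulary flexibility in multi-modal tagging, offering a new approach to dynamic tag generation.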

Abstract: We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER’s effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.

[77] Beyond one-hot encoding? Journey into compact encoding for large multi-class segmentation

Aaron Kujawa,Thomas Booth,Tom Vercauteren

Main category: cs.CV

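TL;DR: The paper proposes binary encoding as a replacement for one-hot encoding to reduce the computational and memory requirements of large multi-class segmentation, but on a medical image segmentation task its performance still falls short of the state of the art.

Details Motivation: With one-hot encoding, computational and memory requirements grow quickly as the number of classes increases, so the paper explores more compact encodings to reduce resource consumption.

Contribution: 1. Proposes a family of binary encoding approaches (including ECOCs, class weighting, and others); 2. Reports negative results on a large-scale whole brain parcellation task to inform future research.

Method: Explores binary encoding, ECOCs, class weighting, hard/soft decoding, and related options, and applies them to whole brain parcellation with 108 classes on 3D MRI.

Result: Binary encoding (DSC 39.3-73.8) performs worse than one-hot encoding (DSC 82.4), although it reduces computational and memory requirements.

Insight: Binary encoding reduces resource requirements, but its performance on medical image segmentation still needs improvement; the negative results are reported to encourage future work.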

Abstract: This work presents novel methods to reduce computational and memory requirements for medical image segmentation with a large number of classes. We curiously observe challenges in maintaining state-of-the-art segmentation performance with all of the explored options. Standard learning-based methods typically employ one-hot encoding of class labels. The computational complexity and memory requirements thus increase linearly with the number of classes. We propose a family of binary encoding approaches instead of one-hot encoding to reduce the computational complexity and memory requirements to logarithmic in the number of classes. In addition to vanilla binary encoding, we investigate the effects of error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees. We apply the methods to the use case of whole brain parcellation with 108 classes based on 3D MRI images. While binary encodings have proven efficient in so-called extreme classification problems in computer vision, we faced challenges in reaching state-of-the-art segmentation quality with binary encodings. Compared to one-hot encoding (Dice Similarity Coefficient (DSC) = 82.4 (2.8)), we report reduced segmentation performance with the binary segmentation approaches, achieving DSCs in the range from 39.3 to 73.8. Informative negative results all too often go unpublished. We hope that this work inspires future research of compact encoding strategies for large multi-class segmentation tasks.
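
The linear-versus-logarithmic claim can be made concrete with a small sketch of vanilla binary encoding and hard decoding. This is an illustration under our own assumptions, not the paper's code: the function names are hypothetical, bits are ordered least-significant first, and the 0.5 threshold for hard decoding is a conventional choice rather than something the abstract specifies.

```python
def num_bits(num_classes):
    """Bits needed to give each of num_classes labels a distinct codeword."""
    return max(1, (num_classes - 1).bit_length())


def encode(label, bits):
    """Binary codeword for an integer label, least-significant bit first."""
    return [(label >> b) & 1 for b in range(bits)]


def hard_decode(probs, threshold=0.5):
    """Threshold each per-bit probability, then read the bits back as an integer."""
    return sum(1 << b for b, p in enumerate(probs) if p >= threshold)


# 108 classes need 108 one-hot output channels but only 7 binary channels.
bits = num_bits(108)
codeword = encode(5, bits)
```

One consequence of the compact code is that 7 bits describe 128 codewords, so 20 of them map to no valid class in the 108-class setting, and a single flipped bit silently changes the predicted class. Error-correcting output codes, which the abstract lists among the explored variants, trade extra bits for tolerance to such flips.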

[78] Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation

Jinchang Zhang,Zijun Li,Jiakai Lin,Guoyu Lu

Main category: cs.CV

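TL;DR: The paper proposes open-vocabulary object detection for event cameras via vision-language knowledge distillation, with a hybrid SNN-CNN framework that adaptively slices the event stream while preserving key temporal information.

Details Motivation: Event cameras offer high-speed response and low latency but lack texture and color information, which makes open-vocabulary detection challenging. Existing methods generalize poorly to novel objects, and vision-language models such as CLIP cannot be applied directly to event data.

Contribution: 1) Proposes an event-image knowledge distillation framework that uses CLIP's semantic understanding to enable open-vocabulary detection on event data; 2) Designs a hybrid SNN-CNN framework that adaptively segments the event stream to preserve key temporal information.

Method: 1) Image frames are fed to a teacher model, and spatial attention-based distillation guides the student model to learn CLIP's visual representations; 2) The SNN adaptively determines the event segmentation moments and extracts key temporal features, which the CNN then processes for object detection.

Result: The method addresses open-vocabulary detection on event data, avoids the information loss caused by fixed-group segmentation, and generalizes well to novel objects.

Insight: Knowledge distillation combined with adaptive event slicing enables open-vocabulary object detection on event streams that lack color information, while preserving key temporal dynamics.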

Abstract: Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP’s semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP’s rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP’s broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.

[79] ProtoMask: Segmentation-Guided Prototype Learning

Steffen Meinert,Philipp Schlinge,Nils Strodthoff,Martin Atzmueller

Main category: cs.CV

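TL;DR: ProtoMask is a segmentation-guided prototype learning method that uses segmentation masks to restrict saliency maps to semantic regions, improving the truthfulness of the mapping between prototypes and input space.

Details Motivation: Existing prototype-based methods usually rely on post-hoc saliency techniques to explain prototype semantics, but the reliability and quality of these techniques have been questioned. ProtoMask aims to reduce the uncertainty of such visualizations using segmentation foundation models.

Contribution: 1. Proposes the ProtoMask architecture, which crops the image with the bounding box of each segmentation mask to produce individual inputs. 2. Shows experimentally that the model achieves competitive performance on fine-grained classification datasets and has unique explainability features.

Method: A segmentation model generates masks and their bounding boxes; each cropped image becomes an individual input, restricting the saliency map to a semantic region and improving the truthfulness of the mapping between prototypes and input.

Result: On three fine-grained classification datasets, the model is competitive with other popular models while offering unique explainability features.

Insight: Introducing segmentation improves model explainability and can also strengthen prototype learning, offering a new direction for XAI.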

Abstract: Explainable AI (XAI) has gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview of explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. https://github.com/uos-sis/quanproto
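
The cropping step described in the abstract (one crop per segmentation mask, taken from the mask's bounding box) can be sketched in a few lines. This is a minimal illustration on nested lists, not code from the linked repository; the function names and the half-open `(top, bottom, left, right)` box convention are our assumptions.

```python
def bounding_box(mask):
    """Tight box (top, bottom, left, right) around non-zero pixels, or None if empty.

    bottom and right are exclusive, so the box can be used directly for slicing.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop(image, box):
    """Cut the boxed region out of a 2D image stored as a list of rows."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x5 image and one binary mask; each mask would yield one model input.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
patch = crop(image, bounding_box(mask))
```

In a real pipeline the masks would come from a segmentation foundation model and the crops would typically be resized to a common input size before being passed to the network.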

[80] Graph Integrated Multimodal Concept Bottleneck Model

Jiakai Lin,Jinchang Zhang,Guoyu Lu

Main category: cs.CV

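TL;DR: MoE-SGT is a multimodal concept bottleneck model that combines graph structure with a Mixture of Experts (MoE) module, improving performance and interpretability by explicitly modeling relationships among concepts and dynamically allocating reasoning tasks.

Details Motivation: Existing Concept Bottleneck Models (CBMs) are generally single-modal and ignore structured relationships among concepts, which limits them on complex reasoning tasks.

Contribution: Proposes the MoE-SGT framework, which uses a Graph Transformer and an MoE module to model structured concept relationships explicitly and to allocate reasoning tasks dynamically, improving accuracy and adaptability.

Method: 1. Builds answer-concept and answer-question graphs to model concept relationships explicitly; 2. Integrates a Graph Transformer to capture multi-level dependencies; 3. Replaces the feed-forward layers with an MoE module that dynamically allocates reasoning tasks.

Result: MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets.

Insight: Combining graph structure with a dynamic task allocation mechanism can markedly improve a model's complex reasoning ability and adaptability.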

Abstract: Demand for interpretability in deep learning is growing, especially in high-stakes domains. Concept Bottleneck Models (CBMs) address this by inserting human-understandable concepts into the prediction pipeline, but they are generally single-modal and ignore structured concept relationships. To overcome these limitations, we present MoE-SGT, a reasoning-driven framework that augments CBMs with a structure-injecting Graph Transformer and a Mixture of Experts (MoE) module. We construct answer-concept and answer-question graphs for multimodal inputs to explicitly model the structured relationships among concepts. Subsequently, we integrate a Graph Transformer to capture multi-level dependencies, addressing the limitations of traditional Concept Bottleneck Models in modeling concept interactions. However, this design still encounters bottlenecks in adapting to complex concept patterns. Therefore, we replace the feed-forward layers with a Mixture of Experts (MoE) module, giving the model greater capacity to learn diverse concept relationships while dynamically allocating reasoning tasks to different sub-experts, thereby significantly enhancing the model’s adaptability to complex concept reasoning. MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets by modeling structured relationships among concepts and utilizing a dynamic expert selection mechanism.
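
The abstract does not say how experts are selected, so the following is only a generic sketch of one common formulation, top-k softmax gating, and should not be read as the MoE-SGT routing rule. The function names, the toy experts, and the choice k = 2 are all assumptions; in a trained model the gate logits would come from a learned layer rather than being passed in by hand.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs.

    Returns (mixed output, indices of the selected experts).
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = [0.0] * len(x)
    for i in top:
        weight = probs[i] / norm
        out = [o + weight * y for o, y in zip(out, experts[i](x))]
    return out, top


# Three toy experts standing in for feed-forward sub-networks.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
output, chosen = moe_forward([1.0, 2.0], [2.0, 2.0, -5.0], experts, k=2)
```

Only the selected experts are evaluated for a given input, which is how an MoE layer adds capacity without a proportional increase in per-input computation.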

[81] Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

Sanghwan Kim,Rui Xiao,Stephan Alaniz,Yongqin Xian,Zeynep Akata

Main category: cs.CV

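TL;DR: The paper proposes a training-free framework that uses the intrinsic uncertainty of multimodal large language models (MLLMs) as a guidance signal to improve performance on complex visual tasks. By scoring candidate visual inputs by response uncertainty, the model can autonomously focus on the most salient data.

Details Motivation: Existing MLLMs struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. They usually need complicated task-specific fine-tuning, which limits generalizability and increases model complexity.

Contribution: Proposes a training-free framework that uses an MLLM's intrinsic uncertainty as a guidance signal, applied to visual search, long video understanding, and temporal grounding.

Method: A unified mechanism scores candidate visual inputs by response uncertainty, so that the model autonomously focuses on the most relevant visual information. The method needs no additional training and relies only on the model's inherent properties.

Result: Experiments show that on three complex visual tasks the method is competitive with specialized, fine-tuned methods, supporting the generality of using intrinsic uncertainty to improve multimodal performance.

Insight: Changes in a model's output entropy are an effective indicator of how relevant the visual information is, and can markedly improve MLLMs on fine-grained tasks without additional training.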

Abstract: Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. Existing works typically rely on complicated, task-specific fine-tuning, which limits their generalizability and increases model complexity. In this work, we propose an effective, training-free framework that uses an MLLM’s intrinsic uncertainty as a proactive guidance signal. Our core insight is that a model’s output entropy decreases when presented with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. We apply this simple principle to three complex visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned methods. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
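
The core idea, that lower output entropy signals a more relevant visual input, can be sketched as a simple selection rule. The code below is an illustration under assumptions, not the paper's scoring function: it uses plain Shannon entropy over a hypothetical answer-token distribution for each candidate crop, and the candidate names and probabilities are invented for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_most_relevant(candidates):
    """Return the name of the candidate whose response distribution is least uncertain.

    candidates maps a candidate name to the model's answer distribution
    when that candidate is supplied as visual input.
    """
    return min(candidates, key=lambda name: entropy(candidates[name]))


# Hypothetical distributions over four answer options for three image crops.
candidates = {
    "crop_a": [0.25, 0.25, 0.25, 0.25],  # model is guessing
    "crop_b": [0.97, 0.01, 0.01, 0.01],  # model is confident
    "crop_c": [0.60, 0.20, 0.10, 0.10],
}
best = pick_most_relevant(candidates)
```

The same rule applies unchanged whether the candidates are image crops, video segments, or time windows, which matches the abstract's description of one mechanism serving three tasks.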

[82] Deep learning motion correction of quantitative stress perfusion cardiovascular magnetic resonance

Noortje I. P. Schueler,Nathan C. K. Wong,Richard J. Crawley,Josien P. W. Pluim,Amedeo Chiribiri,Cian M. Scannell

Main category: cs.CV

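TL;DR: The paper proposes an unsupervised deep learning-based motion correction method for quantitative stress perfusion cardiovascular magnetic resonance (CMR), markedly improving the speed and robustness of motion correction.

Details Motivation: Traditional registration-based motion correction is slow and sensitive to acquisition variability, which limits its robustness and scalability in quantitative perfusion imaging.

Contribution: Proposes a three-step unsupervised deep learning pipeline that replaces iterative registration with one-shot estimation and uses robust principal component analysis to reduce contrast-related effects.

Method: Motion is corrected in three steps, combined with robust principal component analysis, and the perfusion series is aligned together with the auxiliary images (arterial input function and proton density-weighted series). Models were trained and validated on multivendor patient data.

Result: Compared with the previously published registration-based method, the deep learning approach significantly improved temporal smoothness (p<0.001), achieved comparable myocardial alignment that was better than before registration, reduced motion in the myocardial perfusion maps, and cut processing time 15-fold.

Insight: Trained on multivendor data, the method generalizes across sequences and may help quantitative perfusion imaging reach wider clinical use.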

Abstract: Background: Quantitative stress perfusion cardiovascular magnetic resonance (CMR) is a powerful tool for assessing myocardial ischemia. Motion correction is essential for accurate pixel-wise mapping but traditional registration-based methods are slow and sensitive to acquisition variability, limiting robustness and scalability. Methods: We developed an unsupervised deep learning-based motion correction pipeline that replaces iterative registration with efficient one-shot estimation. The method corrects motion in three steps and uses robust principal component analysis to reduce contrast-related effects. It aligns the perfusion series and auxiliary images (arterial input function and proton density-weighted series). Models were trained and validated on multivendor data from 201 patients, with 38 held out for testing. Performance was assessed via temporal alignment and quantitative perfusion values, compared to a previously published registration-based method. Results: The deep learning approach significantly improved temporal smoothness of time-intensity curves (p<0.001). Myocardial alignment (Dice = 0.92 (0.04) and 0.91 (0.05)) was comparable to the baseline and superior to before registration (Dice = 0.80 (0.09), p<0.001). Perfusion maps showed reduced motion, with lower standard deviation in the myocardium (0.52 (0.39) ml/min/g) compared to baseline (0.55 (0.44) ml/min/g). Processing time was reduced 15-fold. Conclusion: This deep learning pipeline enables fast, robust motion correction for stress perfusion CMR, improving accuracy across dynamic and auxiliary images. Trained on multivendor data, it generalizes across sequences and may facilitate broader clinical adoption of quantitative perfusion imaging.
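
The alignment results above are reported as Dice scores. For readers unfamiliar with the metric, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks. The function name and the flat-list mask representation are our own choices, and returning 1.0 for two empty masks is a convention assumed here rather than something the abstract specifies.

```python
def dice(mask_a, mask_b):
    """Dice coefficient of two equally sized binary masks given as flat lists."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks agree perfectly; treat that case as a score of 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Two 5-pixel masks that overlap on two of their three foreground pixels.
score = dice([1, 1, 1, 0, 0], [0, 1, 1, 1, 0])
```

In this setting the two masks would be myocardium segmentations from different time frames, so a higher Dice after correction indicates that the myocardium stays in the same place across the series.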

[83] DEAP DIVE: Dataset Investigation with Vision transformers for EEG evaluation

Annemarie Hoffsommer,Helen Schneider,Svetlana Pavlitska,J. Marius Zöllner

Main category: cs.CV

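TL;DR: The paper investigates how subsets of EEG channels from the DEAP dataset can be used for emotion prediction, converting EEG data into scaleograms with the Continuous Wavelet Transformation and classifying them with a vision transformer (ViT) at high accuracy.

Details Motivation: Traditional emotion prediction methods, such as self-assessment and facial expression analysis, can be subjective or ambiguous, whereas EEG signals provide a more direct and unbiased data source. Because a full EEG measurement is complex and resource-intensive, the authors aim for comparable results with low-cost EEG devices.

Contribution: Shows that 12 EEG channels suffice for high-accuracy emotion prediction (91.57%), compared with state-of-the-art results of 96.9% that use 32 channels.

Method: EEG data are converted into scaleograms with the Continuous Wavelet Transformation, and a vision transformer (ViT) model is trained on them for classification.

Result: The model reaches 91.57% accuracy in predicting 4 emotion quadrants (high/low combinations of arousal and valence), approaching the 96.9% state of the art obtained with 32 channels.

Insight: Reducing the number of EEG channels from 32 to 12 retains high prediction accuracy, which opens the door to applications with low-cost EEG devices.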

Abstract: Accurately predicting emotions from brain signals has the potential to achieve goals such as improving mental health, human-computer interaction, and affective computing. Emotion prediction through neural signals offers a promising alternative to traditional methods, such as self-assessment and facial expression analysis, which can be subjective or ambiguous. Measurements of brain activity via electroencephalogram (EEG) provide a more direct and unbiased data source. However, conducting a full EEG is a complex, resource-intensive process, leading to the rise of low-cost EEG devices with simplified measurement capabilities. This work examines how subsets of EEG channels from the DEAP dataset can be used for sufficiently accurate emotion prediction with low-cost EEG devices, rather than fully equipped EEG measurements. Using Continuous Wavelet Transformation to convert EEG data into scaleograms, we trained a vision transformer (ViT) model for emotion classification. The model achieved over 91.57% accuracy in predicting 4 quadrants (high/low per arousal and valence) with only 12 measuring points (also referred to as channels). Our work clearly shows that a significant reduction of input channels still yields high accuracy compared to state-of-the-art results of 96.9% with 32 channels. Training scripts to reproduce our results can be found here: https://gitlab.kit.edu/kit/aifb/ATKS/public/AutoSMiLeS/DEAP-DIVE.
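The four-quadrant target described above can be illustrated with a minimal sketch. It assumes ratings on a 1 to 9 scale and a midpoint threshold of 5 for splitting high from low; the abstract does not state the threshold, so `threshold`, `quadrant`, and the sample ratings are hypothetical and only illustrate the labeling idea.

```python
def quadrant(valence: float, arousal: float, threshold: float = 5.0) -> str:
    """Map a (valence, arousal) rating pair to one of four quadrant labels.

    The threshold is an assumed midpoint of a 1-9 rating scale,
    not a value taken from the paper.
    """
    v = "high" if valence > threshold else "low"
    a = "high" if arousal > threshold else "low"
    return f"{a}-arousal/{v}-valence"


# Hypothetical (valence, arousal) ratings, one per quadrant.
ratings = [(7.2, 8.1), (2.5, 6.9), (3.0, 2.0), (8.0, 1.5)]
labels = [quadrant(v, a) for v, a in ratings]
for pair, label in zip(ratings, labels):
    print(pair, "->", label)
```

A classifier trained on scaleograms would then predict one of these four labels per trial.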

[84] Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck

Hongeun Kim,Bryan Sangwoo Kim,Jong Chul Ye

Main category: cs.CV

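TL;DR: This paper proposes a framework for Extreme Blind Image Restoration (EBIR) that decomposes the ELQ-to-HQ restoration process and uses an Information Bottleneck objective to stabilize training, improving restoration under severe degradation.

Details Motivation: Existing Blind Image Restoration (BIR) methods falter under extreme degradation (severe, compounded degradations), because the massive domain gap leads to unnatural artifacts and loss of detail in the restored images.

Contribution: 1. A new hierarchical framework that decomposes ELQ-to-HQ restoration into two steps, ELQ-to-LQ and LQ-to-HQ; 2. A training objective derived from Information Bottleneck theory that stabilizes training; 3. Support for Look Forward Once (LFO) inference-time prompt refinement and plug-and-play strengthening of existing models without finetuning.

Method: 1. Learn a projector that maps ELQ images onto an intermediate LQ manifold; 2. Restore the LQ image to HQ with a frozen, off-the-shelf BIR model; 3. Design the loss from Information Bottleneck theory, balancing a low-quality reconstruction term against a high-quality prior-matching term.

Result: Extensive experiments under severe degradation regimes indicate that the method improves restoration quality and reduces artifacts and loss of detail.

Insight: Decomposing the restoration process and grounding the objective in Information Bottleneck theory mitigates the massive domain gap caused by extreme degradation, while allowing existing models to be strengthened without finetuning.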

Abstract: Blind Image Restoration (BIR) methods have achieved remarkable success but falter when faced with Extreme Blind Image Restoration (EBIR), where inputs suffer from severe, compounded degradations beyond their training scope. Directly learning a mapping from extremely low-quality (ELQ) to high-quality (HQ) images is challenging due to the massive domain gap, often leading to unnatural artifacts and loss of detail. To address this, we propose a novel framework that decomposes the intractable ELQ-to-HQ restoration process. We first learn a projector that maps an ELQ image onto an intermediate, less-degraded LQ manifold. This intermediate image is then restored to HQ using a frozen, off-the-shelf BIR model. Our approach is grounded in information theory; we provide a novel perspective of image restoration as an Information Bottleneck problem and derive a theoretically-driven objective to train our projector. This loss function effectively stabilizes training by balancing a low-quality reconstruction term with a high-quality prior-matching term. Our framework enables Look Forward Once (LFO) for inference-time prompt refinement, and supports plug-and-play strengthening of existing image restoration models without need for finetuning. Extensive experiments under severe degradation regimes provide a thorough analysis of the effectiveness of our work.

[85] Defect Segmentation in OCT scans of ceramic parts for non-destructive inspection using deep learning

Andrés Laveda-Martínez,Natalia P. García-de-la-Puente,Fernando García-Torres,Niels Møller Israelsen,Ole Bang,Dominik Brouczek,Niels Benson,Adrián Colomer,Valery Naranjo

Main category: cs.CV

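TL;DR: This paper presents a deep learning system based on the U-Net architecture for defect segmentation in OCT scans of ceramic parts, reaching a Dice score of 0.979 and demonstrating its practicality for non-destructive inspection.

Details Motivation: Ceramic manufacturing relies on non-destructive testing (NDT) to ensure part quality, and Optical Coherence Tomography (OCT) provides high-resolution internal imaging. Manual analysis of OCT images, however, is time-consuming and error-prone, which motivates an automated defect detection system.

Contribution: Develops a U-Net-based deep learning model, evaluates multiple experimental configurations to improve its performance, and applies post-processing techniques that enable quantitative and qualitative evaluation. The system performs well in both Dice score and inference time.

Method: A neural network based on the U-Net architecture is trained on OCT images with manually segmented annotations. Multiple experimental configurations are evaluated to improve performance, and post-processing techniques support the evaluation of the predictions.

Result: The system reaches a Dice score of 0.979, outperforming comparable studies. Inference takes 18.98 seconds per volume, which supports efficient automated quality control.

Insight: Deep learning-based analysis of OCT images enables efficient and accurate defect detection, offering a viable route to automating non-destructive inspection.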

Abstract: Non-destructive testing (NDT) is essential in ceramic manufacturing to ensure the quality of components without compromising their integrity. In this context, Optical Coherence Tomography (OCT) enables high-resolution internal imaging, revealing defects such as pores, delaminations, or inclusions. This paper presents an automatic defect detection system based on Deep Learning (DL), trained on OCT images with manually segmented annotations. A neural network based on the U-Net architecture is developed, evaluating multiple experimental configurations to enhance its performance. Post-processing techniques enable both quantitative and qualitative evaluation of the predictions. The system achieves a Dice score of 0.979, outperforming comparable studies. The inference time of 18.98 seconds per volume supports its viability for detecting inclusions, enabling more efficient, reliable, and automated quality control.
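For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), computed on flat binary masks. The function name and the toy masks are illustrative and are not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted defect pixels, 3 annotated, 2 overlapping.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 0, 1]
print(round(dice_score(pred, target), 3))
```

A score of 1.0 means the predicted and annotated defect regions coincide exactly; 0.0 means they do not overlap at all.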

[86] Multi-Objective Task-Aware Predictor for Image-Text Alignment

Eunki Kim,Na Min An,James Thorne,Hyunjung Shim

Main category: cs.CV

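TL;DR: The paper proposes MULTI-TAP, a multi-objective task-aware predictor for evaluating image-text alignment that produces both a single overall score and fine-grained multi-objective scores, addressing shortcomings of existing predictors in alignment with human judgments, long-sequence processing, inference efficiency, and multi-objective scoring.

Details Motivation: Existing image-text alignment evaluation methods do not comprehensively reflect human preferences across multiple aspects, and fall short in multi-objective scoring and efficient inference. A method is needed that supports multi-dimensional scoring while remaining efficient.

Contribution: 1. Proposes the MULTI-TAP framework, which supports both multi-objective and single-objective scoring; 2. Demonstrates its applicability and efficiency across different large vision-language models; 3. Releases a new dataset, EYE4ALL, with multi-dimensional human preference annotations.

Method: MULTI-TAP trains a lightweight ridge regression layer on the frozen hidden states of a pre-trained large vision-language model (LVLM) to produce fine-grained multi-objective scores.

Result: MULTI-TAP outperforms existing metrics and the multi-objective reward model VisionREWARD while being more efficient; at 7-8B parameters it performs on par with the GPT-4o-based predictor G-VEval.

Insight: The strong results of this lightweight approach on multi-objective tasks suggest that the hidden states of pre-trained models can support efficient fine-grained scoring. The new dataset is also a useful resource for studying multi-dimensional human preferences.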

Abstract: Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and existing evaluation predictors lacking at least one of these key properties: (1) Alignment with human judgments, (2) Long-sequence processing, (3) Inference efficiency, and (4) Applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP performs better than VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
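The idea of fitting a ridge regression layer on frozen hidden states can be sketched with the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy. This is a generic illustration and not the MULTI-TAP implementation: the tiny 2-dimensional "hidden states", the targets, and the names `ridge_fit` and `ridge_predict` are all hypothetical, and a real system would fit one weight vector per objective on much larger features.

```python
def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression, solved by Gauss-Jordan elimination.

    X: list of feature rows (frozen hidden states), y: list of target scores.
    Requires lam > 0 so that X^T X + lam * I is invertible.
    """
    d = len(X[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0) for j in range(d)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(d)
    ]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][d] for i in range(d)]


def ridge_predict(X, w):
    """Linear scores for each feature row."""
    return [sum(a * b for a, b in zip(row, w)) for row in X]


# Hypothetical frozen features and human scores for one objective.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [1.0, 2.0, 3.0]
w = ridge_fit(features, scores, lam=1.0)
print(w, ridge_predict(features, w))
```

Because only the small linear layer is fitted while the backbone stays frozen, adding a new objective amounts to solving one more small linear system.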

[87] ZQBA: Zero Query Black-box Adversarial Attack

Joana C. Costa,Tiago Roxo,Hugo Proença,Pedro R. M. Inácio

Main category: cs.CV

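TL;DR: ZQBA is a zero-query black-box adversarial attack that uses DNN feature maps to generate adversarial samples, without multiple queries or training a surrogate model.

Details Motivation: Existing black-box adversarial attacks need multiple queries or trained diffusion models, which limits their applicability in real-world settings. ZQBA addresses this by generating adversarial samples directly from DNN representations.

Contribution: Proposes ZQBA, a zero-query black-box adversarial attack that builds adversarial samples from DNN feature maps, transfers across models and datasets, and outperforms state-of-the-art black-box attacks that use a single query.

Method: Feature maps obtained from a DNN are added to clean images to produce adversarial samples that fool the target model.

Result: Experiments show that ZQBA outperforms existing single-query black-box attacks on the CIFAR and Tiny ImageNet datasets, while keeping perturbations imperceptible according to SSIM and visual inspection.

Insight: ZQBA shows that DNN representations can be used to generate adversarial samples efficiently, underscoring the vulnerability of DNNs in real-world settings.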

Abstract: Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.
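The core operation described above, adding a feature map to a clean image, can be sketched as follows. The abstract does not specify how the feature map is scaled or bounded, so the rescaling to [-1, 1], the budget `epsilon`, and the function name are assumptions made purely for illustration; the authors' repository should be consulted for the actual procedure.

```python
def add_feature_perturbation(image, feature_map, epsilon=0.03):
    """Return clip(image + epsilon * scaled_feature_map, 0, 1), element-wise.

    image and feature_map are flat lists of equal length; pixel values are
    assumed to lie in [0, 1]. The scaling scheme is hypothetical.
    """
    lo, hi = min(feature_map), max(feature_map)
    span = hi - lo
    if span == 0:
        # A constant feature map carries no spatial signal.
        scaled = [0.0] * len(feature_map)
    else:
        # Rescale the feature map to [-1, 1].
        scaled = [2.0 * (f - lo) / span - 1.0 for f in feature_map]
    return [min(1.0, max(0.0, x + epsilon * s)) for x, s in zip(image, scaled)]


# Toy 3-pixel "image" and a feature map from some source network.
clean = [0.2, 0.5, 0.95]
fmap = [0.0, 5.0, 10.0]
adv = add_feature_perturbation(clean, fmap, epsilon=0.1)
print(adv)
```

No query to the target model appears anywhere in this computation, which is the sense in which such an attack is "zero query".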

[88] Uncertainty-Aware Concept Bottleneck Models with Enhanced Interpretability

Haifei Zhang,Patrick Barry,Eduardo Brandao

Main category: cs.CV

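TL;DR: This paper proposes an uncertainty-aware Concept Bottleneck Model (CBM) classifier that learns binary class-level concept prototypes to enhance interpretability and robustness.

Details Motivation: Concept Bottleneck Models (CBMs) offer a semantically meaningful and interpretable classification pipeline, but they often sacrifice predictive performance compared to end-to-end convolutional neural networks, and the propagation of uncertainty from concept predictions to final labels remains underexplored.

Contribution: 1. Proposes an uncertainty-aware and interpretable classifier for the second stage of CBMs; 2. Uses binary class-level concept prototypes as classification rules, enhancing interpretability and robustness; 3. Enables conformal prediction for uncertain or outlier inputs.

Method: The method learns binary class-level concept prototypes and uses the distance between the predicted concept vector and each prototype as both a classification score and a measure of uncertainty; the prototypes also serve as interpretable classification rules.

Result: The method enhances interpretability while maintaining predictive performance, and conformal prediction improves robustness to uncertain inputs.

Insight: Combining an uncertainty measure with interpretable classification rules can make CBMs considerably more useful in practice.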

Abstract: In the context of image classification, Concept Bottleneck Models (CBMs) first embed images into a set of human-understandable concepts, followed by an intrinsically interpretable classifier that predicts labels based on these intermediate representations. While CBMs offer a semantically meaningful and interpretable classification pipeline, they often sacrifice predictive performance compared to end-to-end convolutional neural networks. Moreover, the propagation of uncertainty from concept predictions to final label decisions remains underexplored. In this paper, we propose a novel uncertainty-aware and interpretable classifier for the second stage of CBMs. Our method learns a set of binary class-level concept prototypes and uses the distances between predicted concept vectors and each class prototype as both a classification score and a measure of uncertainty. These prototypes also serve as interpretable classification rules, indicating which concepts should be present in an image to justify a specific class prediction. The proposed framework enhances both interpretability and robustness by enabling conformal prediction for uncertain or outlier inputs based on their deviation from the learned binary class-level concept prototypes.
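The prototype-distance idea can be sketched in a few lines. The abstract does not name the distance function, so the mean absolute difference used here is an assumption, as are the class names, concepts, and prototype values; the conformal prediction step is not reproduced.

```python
def classify(concept_probs, prototypes):
    """Return (nearest class, distance) for one predicted concept vector.

    prototypes maps each class label to a binary concept vector. The distance
    is the mean absolute difference, an assumed choice for illustration.
    A larger distance to the nearest prototype signals higher uncertainty.
    """
    distances = {
        label: sum(abs(p - b) for p, b in zip(concept_probs, proto)) / len(proto)
        for label, proto in prototypes.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical concepts: [can_fly, swims, has_beak, lives_in_cold].
prototypes = {
    "sparrow": [1, 0, 1, 0],
    "penguin": [0, 1, 1, 1],
}
label, dist = classify([0.9, 0.1, 0.8, 0.2], prototypes)
print(label, round(dist, 2))
```

Each prototype reads as a rule ("a sparrow can fly and has a beak"), and an input far from every prototype can be flagged rather than forced into a class.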

[89] MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts

Yifan Shen,Yangyang Shu,Hye-young Paik,Yulei Sui

Main category: cs.CV

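TL;DR: MetaLogic is an evaluation framework that detects semantic inconsistencies in text-to-image (T2I) models by generating image pairs from prompts that are logically equivalent but grammatically different.

Details Motivation: Current T2I models can produce semantically inconsistent images when the input prompt undergoes minor linguistic variations, exposing a lack of robustness in reasoning and generalisation. MetaLogic is proposed to address this.

Contribution: 1) An evaluation framework that does not rely on ground truth images; 2) Use of metamorphic testing to generate image pairs from semantically identical prompts and compare them directly for consistency; 3) A categorisation of alignment errors, with counterexamples that can be used for model debugging.

Method: MetaLogic builds on metamorphic testing: it generates prompt pairs that differ grammatically but are semantically identical, then compares the resulting image pairs to detect semantic inconsistency. This avoids reliance on ground truth images and probes the model's logic understanding directly.

Result: Even state-of-the-art T2I models such as Flux.dev and DALLE-3 show misalignment rates of 59% and 71%, respectively. MetaLogic is efficient and scalable, and uncovers logical inconsistencies that existing metrics overlook.

Insight: MetaLogic exposes the limits of T2I models' logic understanding and highlights the importance of evaluating semantic consistency. It offers a practical tool for model debugging and refinement and a new direction for evaluation research.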

Abstract: Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model’s logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even the SOTA text-to-image models like Flux.dev and DALLE-3 demonstrate a 59 percent and 71 percent misalignment rate, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.

[90] Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models

Ruyu Liu,Dongxu Zhuang,Jianhua Zhang,Arega Getaneh Abate,Per Sieverts Nielsen,Ben Wang,Xiufeng Liu

Main category: cs.CV

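TL;DR: The paper proposes SF-SPA, an automated framework for assessing the solar photovoltaic installation potential of building facades, using computer vision and artificial intelligence to address perspective correction, semantic segmentation, and PV layout optimization.

Details Motivation: The solar PV potential of urban building facades is largely untapped, and assessing it remains challenging because of complex geometries and semantic components.

Contribution: The SF-SPA framework combines geometric rectification, zero-shot semantic segmentation, LLM-guided spatial reasoning, and energy simulation to assess PV potential efficiently.

Method: A four-stage pipeline of geometric rectification, zero-shot semantic segmentation, LLM-guided spatial reasoning, and energy simulation, validated on 80 buildings in four countries.

Result: Mean area estimation error is 6.2% ± 2.8% relative to expert annotations, assessment takes approximately 100 seconds per building, and simulated energy yields support the method's reliability.

Insight: By combining AI techniques with LLMs, SF-SPA provides an efficient automated tool for urban energy planning and BIPV deployment.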

Abstract: Building facades represent a significant untapped resource for solar energy generation in dense urban environments, yet assessing their photovoltaic (PV) potential remains challenging due to complex geometries and semantic components. This study introduces SF-SPA (Semantic Facade Solar-PV Assessment), an automated framework that transforms street-view photographs into quantitative PV deployment assessments. The approach combines computer vision and artificial intelligence techniques to address three key challenges: perspective distortion correction, semantic understanding of facade elements, and spatial reasoning for PV layout optimization. Our four-stage pipeline processes images through geometric rectification, zero-shot semantic segmentation, Large Language Model (LLM) guided spatial reasoning, and energy simulation. Validation across 80 buildings in four countries demonstrates robust performance with mean area estimation errors of 6.2% ± 2.8% compared to expert annotations. The automated assessment requires approximately 100 seconds per building, a substantial gain in efficiency over manual methods. Simulated energy yield predictions confirm the method’s reliability and applicability for regional potential studies, urban energy planning, and building-integrated photovoltaic (BIPV) deployment. Code is available at: https://github.com/CodeAXu/Solar-PV-Installation

[91] From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Fan Yang,Zhiyang Chen,Yousong Zhu,Xin Li,Jinqiao Wang

Main category: cs.CV

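TL;DR: TrajVLM-Gen is a two-stage vision-language framework that combines trajectory prediction with video generation to produce videos with physically consistent motion.

Details Motivation: Existing video generation models often produce motion that violates real-world physics and lacks consistency.

Contribution: Proposes the TrajVLM-Gen framework, which uses a Vision Language Model to predict trajectories that then guide video generation, improving physical consistency.

Method: A two-stage approach: 1) a VLM predicts coarse-grained motion trajectories; 2) attention-based mechanisms use these trajectories for fine-grained motion refinement.

Result: Achieves FVD scores of 545 on UCF-101 and 539 on MSR-VTT, outperforming existing methods.

Insight: Coupling trajectory prediction with video generation can improve the physical consistency of generated videos.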

Abstract: Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.

[92] What You See is What You Ask: Evaluating Audio Descriptions

Divy Kala,Eshika Khandelwal,Makarand Tapaswi

Main category: cs.CV

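TL;DR: The paper proposes ADQA, a benchmark that evaluates how well audio descriptions (ADs) help Blind and Low Vision (BLV) users understand the story and appreciate visual details, and it exposes the subjectivity of AD writing and the shortcomings of current AD generation methods.

Details Motivation: Existing work on automatic AD generation mostly focuses on short clips and evaluates against a single reference AD, ignoring the subjectivity of AD writing. By analyzing two independent AD tracks for the same movies, the authors quantify this subjectivity and show the limits of short-clip evaluation.

Contribution: Proposes the ADQA benchmark, which evaluates ADs over longer, coherent video segments with two question types, visual appreciation (VA) and narrative understanding (NU), and reveals the gap between current AD generation methods and human-authored ADs.

Method: The authors align and analyze two independent AD tracks to quantify subjectivity, then design the ADQA benchmark to test whether ADs actually help BLV users understand the story and visual details.

Result: ADQA shows that current AD generation methods lag far behind human-authored ADs, underscoring the importance of evaluating on longer segments.

Insight: AD writing is highly subjective, so evaluation should be based on longer coherent segments rather than short clips; future AD generation research should better serve the actual needs of BLV users.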

Abstract: Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.

[93] PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset

Thomas Campagnolo,Ezio Malis,Philippe Martinet,Gaetan Bahl

Main category: cs.CV

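TL;DR: PhraseStereo is the first open-vocabulary stereo image segmentation dataset, extending phrase-region segmentation to stereo image pairs and exploiting geometric depth cues.

Details Motivation: Current research on phrase grounding is largely limited to single-view images and neglects the rich geometric cues available in stereo vision.

Contribution: Introduces the PhraseStereo dataset, which extends phrase grounding to stereo image pairs for the first time and opens new challenges and opportunities for multimodal learning.

Method: Builds on the PhraseCut dataset, using GenStereo to generate accurate right-view images and thereby extend phrase grounding into the stereo domain.

Result: PhraseStereo provides stereo image pairs with aligned segmentation masks and phrase annotations, laying a foundation for research at the intersection of language, vision, and 3D perception.

Insight: Depth information from stereo image pairs can give multimodal learning more precise and context-aware grounding.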

Abstract: Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. To address this, we introduce PhraseStereo, the first dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right-view images from existing single-view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.

[94] NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution

Xiangtao Kong,Rongyuan Wu,Shuaizheng Liu,Lingchen Sun,Lei Zhang

Main category: cs.CV

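TL;DR: NSARM is a robust real-world image super-resolution (Real-ISR) framework based on an autoregressive model; a two-stage training strategy (a transformation network followed by end-to-end fine-tuning) improves image quality and robustness to input quality while keeping inference fast.

Details Motivation: Existing Real-ISR methods either rely on slow diffusion sampling or trade quality and robustness for speed. Autoregressive models such as Infinity offer efficient, high-quality generation but have not yet been applied to super-resolution. The paper aims to use their strengths to address these problems.

Contribution: 1. Proposes the NSARM framework, applying a next-scale autoregressive model to the Real-ISR task; 2. Designs a two-stage training strategy that improves robustness to input degradations; 3. Surpasses existing methods in quality and efficiency while retaining strong generalization.

Method: 1. In the first stage, a transformation network is trained to map the low-quality image to preliminary scales; 2. In the second stage, the full model is fine-tuned end to end; 3. The model builds on the Infinity autoregressive model and its bitwise next-scale prediction strategy.

Result: NSARM outperforms existing Real-ISR methods in quantitative and qualitative evaluations, producing higher-quality images with fast inference. Robustness to input quality and generalization are markedly improved.

Insight: Autoregressive models show promise for Real-ISR, offering better efficiency and robustness than diffusion-based approaches. The two-stage training strategy is key to the model's adaptability and performance.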

Abstract: Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and limits robustness to inputs of varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: https://github.com/Xiangtaokong/NSARM

[95] Feature Identification for Hierarchical Contrastive Learning

Julius Ott,Nastassia Vysotskaya,Huawei Sun,Lorenzo Servadei,Robert Wille

Main category: cs.CV

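TL;DR: The paper proposes two hierarchical contrastive learning (HMLC) methods, one based on a Gaussian Mixture Model (G-HMLC) and one based on an attention mechanism (A-HMLC), which capture hierarchy-specific features and model inter-class relationships to improve hierarchical classification.

Details Motivation: Conventional classification approaches often neglect the inherent relationships between classes at different hierarchy levels and thus miss important supervisory signals. The paper designs two hierarchical contrastive learning methods to better capture this hierarchical structure.

Contribution: 1) Proposes G-HMLC, based on a Gaussian Mixture Model; 2) Designs A-HMLC, based on an attention mechanism; 3) Explicitly models inter-class relationships and imbalanced class distributions, enabling fine-grained clustering across all hierarchy levels.

Method: 1) G-HMLC models the feature distribution with a Gaussian Mixture Model; 2) A-HMLC learns hierarchy-specific features through an attention mechanism; 3) Both methods optimize feature representations with contrastive learning.

Result: On the CIFAR100 and ModelNet40 datasets, the methods achieve state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in accuracy.

Insight: Explicitly modeling inter-class relationships and imbalanced distributions through hierarchical contrastive learning has broad potential for complex hierarchical classification tasks.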

Abstract: Hierarchical classification is a crucial task in many applications, where objects are organized into multiple levels of categories. However, conventional classification approaches often neglect inherent inter-class relationships at different hierarchy levels, thus missing important supervisory signals. To address this, we propose two novel hierarchical contrastive learning (HMLC) methods. The first leverages a Gaussian Mixture Model (G-HMLC), and the second uses an attention mechanism to capture hierarchy-specific features (A-HMLC), imitating human processing. Our approach explicitly models inter-class relationships and imbalanced class distribution at higher hierarchy levels, enabling fine-grained clustering across all hierarchy levels. On the competitive CIFAR100 and ModelNet40 datasets, our method achieves state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in terms of accuracy. The effectiveness of our approach is backed by both quantitative and qualitative results, highlighting its potential for applications in computer vision and beyond.

[96] Can World Models Benefit VLMs for World Dynamics?

Kevin Zhang,Kuangzhi Ge,Xiaowei Chi,Renrui Zhang,Shaojun Shi,Zhen Dong,Sirui Han,Shanghang Zhang

Main category: cs.CV

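TL;DR: This paper examines whether generative world models can supplant the conventional vision encoder paradigm for general-purpose multimodal understanding, and proposes Dynamic Vision Aligner (DyVA), which significantly enhances spatial reasoning.

Details Motivation: Given the strong performance of generative world models trained on video data, it is natural to ask whether they can serve general multimodal tasks. The paper explores the potential of world model priors in Vision-Language Models.

Contribution: 1) A method for repurposing a video diffusion model as a generative encoder; 2) Dynamic Vision Aligner (DyVA), which improves spatial reasoning; 3) Evidence that world models perform strongly on vision-language tasks.

Method: A video diffusion model serves as a generative encoder: a single denoising step produces latents that are treated as visual embeddings for World-Language Models (WorldLMs); the best-performing variant is named DyVA.

Result: On a curated suite of visual reasoning tasks, DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.

Insight: The motion-consistency internalization inherited from video pre-training is the key factor behind the strong performance of world models on vision-language tasks.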

Abstract: Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we strive to investigate the capabilities when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder to perform a single denoising step and treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents useful for downstream understanding that show distinctions from conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to WorldLM’s inherited motion-consistency internalization from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.

[97] Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model

Hyun-kyu Ko,Youbin Kim,Jihyeon Park,Dongheok Park,Gyeongjin Kang,Wonjun Cho,Hyung Yi,Eunbyung Park

Main category: cs.CV

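TL;DR: The paper proposes Gather-Scatter Mamba (GSM), a hybrid architecture that combines a selective state space model (Mamba) with spatial context aggregation for efficient spatio-temporal modeling in video super-resolution.

Details Motivation: Recurrent architectures for video super-resolution suffer from vanishing gradients, lack of parallelism, and slow inference, while the quadratic complexity of Transformers limits them on long sequences. Mamba offers linear-time complexity but struggles to capture fine-grained spatial dependencies.

Contribution: 1. A hybrid architecture combining Mamba with self-attention; 2. The GSM mechanism, which reduces occlusion artifacts through feature alignment; 3. Efficient spatio-temporal modeling.

Method: 1. Mamba-based selective scanning for temporal propagation; 2. Shifted window self-attention to aggregate spatial context; 3. The GSM mechanism, which warps features toward a center anchor frame before Mamba propagation and scatters them back afterward.

Result: In video super-resolution, GSM reduces occlusion artifacts efficiently and improves spatio-temporal modeling.

Insight: 1. Combining Mamba with self-attention balances complexity and modeling capacity; 2. Feature alignment is essential for propagating spatio-temporal information.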

Abstract: State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.

[98] AI-CNet3D: An Anatomically-Informed Cross-Attention Network with Multi-Task Consistency Fine-tuning for 3D Glaucoma Classification

Roshan Kenia,Anfei Li,Rishabh Srivastava,Kaveri A. Thakoor

Main category: cs.CV

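TL;DR: The paper proposes AI-CNet3D, a deep learning model that combines cross-attention with a 3D CNN to extract key features from OCT volumes for glaucoma classification, showing strong performance and computational efficiency.

Details Motivation: The conventional practice of condensing 3D OCT volumes into 2D reports loses key structural details, which limits the accuracy of glaucoma diagnosis.

Contribution: 1. Proposes AI-CNet3D, a 3D CNN model with integrated cross-attention; 2. Introduces Channel Attention REpresentations (CAREs) to enhance interpretability and anatomical coherence; 3. Aligns CAREs with Grad-CAMs during multi-task consistency fine-tuning to further improve performance.

Method: 1. Integrate cross-attention into a 3D CNN to extract features from key regions (superior and inferior hemiretinas, optic nerve head, and macula); 2. Divide the volume and apply cross-attention to capture asymmetries; 3. Perform multi-task consistency fine-tuning with CAREs and Grad-CAMs.

Result: The model outperforms state-of-the-art attention and convolutional models on two large datasets while being far more parameter-efficient, with a one-hundred-fold reduction in parameter count compared to other attention mechanisms.

Insight: Cross-attention informed by anatomical knowledge can improve both the accuracy and the interpretability of glaucoma classification while remaining computationally efficient.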

Abstract: Glaucoma is a progressive eye disease that leads to optic nerve damage, causing irreversible vision loss if left untreated. Optical coherence tomography (OCT) has become a crucial tool for glaucoma diagnosis, offering high-resolution 3D scans of the retina and optic nerve. However, the conventional practice of condensing information from 3D OCT volumes into 2D reports often results in the loss of key structural details. To address this, we propose a novel hybrid deep learning model that integrates cross-attention mechanisms into a 3D convolutional neural network (CNN), enabling the extraction of critical features from the superior and inferior hemiretinas, as well as from the optic nerve head (ONH) and macula, within OCT volumes. We introduce Channel Attention REpresentations (CAREs) to visualize cross-attention outputs and leverage them for consistency-based multi-task fine-tuning, aligning them with Gradient-Weighted Class Activation Maps (Grad-CAMs) from the CNN’s final convolutional layer to enhance performance, interpretability, and anatomical coherence. We have named this model AI-CNet3D (AI-'See'-Net3D) to reflect its design as an Anatomically-Informed Cross-attention Network operating on 3D data. By dividing the volume along two axes and applying cross-attention, our model enhances glaucoma classification by capturing asymmetries between the hemiretinal regions while integrating information from the optic nerve head and macula. We validate our approach on two large datasets, showing that it outperforms state-of-the-art attention and convolutional models across all key metrics. Finally, our model is computationally efficient, reducing the parameter count by one-hundred-fold compared to other attention mechanisms while maintaining high diagnostic performance and comparable GFLOPS.

[99] Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification

Yucheng Lu,Hubert Dariusz Zając,Veronika Cheplygina,Amelia Jiménez-Sánchez

Main category: cs.CV

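TL;DR: Through a survey of how machine learning researchers make transfer learning decisions in medical image classification, this study shows that source dataset selection relies on intuition rather than systematic principles, and identifies task dependence, community practices, and dataset properties as influencing factors.

Details Motivation: Transfer learning is crucial in medical image classification, but the choice of source dataset usually rests on researchers' intuition rather than systematic principles, which can affect algorithm generalizability and, in turn, patient outcomes.

Contribution: Unlike prior work that benchmarks models and experimental setups, the study takes a human-computer interaction (HCI) perspective on how researchers choose source datasets, highlights the role of task dependence and community practices, and challenges the traditional "more similar is better" view.

Method: A task-based survey of machine learning practitioners, analyzing the basis for their source dataset choices, including dataset properties, computational embeddings, and perceived similarity.

Result: Source dataset choices are task-dependent, similarity ratings do not always align with expected performance, and participants often used ambiguous terminology.

Insight: Clearer definitions and HCI tools are needed to support systematic source dataset selection; the study offers practical insights for transfer learning.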

Abstract: Transfer learning is crucial for medical imaging, yet the selection of source datasets - which can impact the generalizability of algorithms, and thus patient outcomes - often relies on researchers’ intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding), or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional “more similar is better” view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.

[100] InfVSR: Breaking Length Limits of Generic Video Super-Resolution

Ziqing Zhang,Kai Liu,Zheng Chen,Xi Li,Yucong Chen,Bingnan Duan,Linghe Kong,Yulun Zhang

Main category: cs.CV

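TL;DR: InfVSR proposes an autoregressive one-step diffusion paradigm that addresses the inefficiency and poor scalability of long-video super-resolution (VSR), delivering both high quality and high speed.

Details Motivation: Real-world videos often span thousands of frames, yet existing VSR methods are inefficient and scale poorly on long sequences; these limits need to be broken.

Contribution: 1. Reformulates VSR as an autoregressive one-step diffusion paradigm; 2. Proposes a causal DiT and an efficient single-step diffusion distillation method; 3. Builds a new long-video evaluation benchmark and semantic-level metrics.

Method: 1. Adapts the pre-trained DiT into a causal structure, maintaining local and global coherence via a rolling KV-cache and joint visual guidance; 2. Distills the diffusion process into a single step using patch-wise pixel supervision and cross-chunk distribution matching.

Result: InfVSR achieves state-of-the-art quality on long-form VSR with up to a 58x speed-up over existing methods such as MGLD-VSR.

Insight: Long-video processing must balance efficiency and quality; combining autoregression with diffusion models is an effective way past existing limits.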

Abstract: Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which novelly reformulates VSR as an autoregressive-one-step-diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
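The abstract's "rolling KV-cache" can be pictured as a fixed-size window over past chunks, so that memory stays bounded no matter how long the video is. The following is a minimal sketch of that general idea only, not InfVSR's implementation: the class name `RollingKVCache`, the `window` parameter, and the string placeholders standing in for attention tensors are all hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Keep key/value entries for only the most recent `window` chunks (hypothetical helper)."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1")
        # A deque with maxlen drops its oldest entry automatically on append.
        self._chunks = deque(maxlen=window)

    def append(self, keys, values):
        """Store the keys and values produced for one newly processed chunk."""
        self._chunks.append((list(keys), list(values)))

    def context(self):
        """Return all cached keys and values, oldest chunk first."""
        keys, values = [], []
        for chunk_keys, chunk_values in self._chunks:
            keys.extend(chunk_keys)
            values.extend(chunk_values)
        return keys, values


cache = RollingKVCache(window=3)
for chunk_id in range(5):
    # Placeholder entries; a real model would store attention tensors here.
    cache.append([f"k{chunk_id}"], [f"v{chunk_id}"])

print(cache.context())
```

After five chunks with a window of three, only the entries for chunks 2 to 4 remain, which is what lets streaming inference run with constant memory.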

[101] JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Siheng Wan,Zhengtao Yao,Zhengdao Li,Junhao Dong,Yanshu Li,Yikai Li,Linshan Li,Haoyan Xu,Yijiang Li,Zhikang Dong,Huacan Wang,Jifeng Shen

Main category: cs.CV

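TL;DR: JEPA-T is a unified multimodal framework that encodes images and text into discrete visual and textual tokens processed by a joint-embedding predictive Transformer, and uses cross-attention to strengthen the fusion of textual and visual information.

Details Motivation: Text-to-image (T2I) generation increasingly relies on token-centric architectures trained with self-supervision, but effectively fusing text with visual tokens during generation remains a challenge.

Contribution: Proposes the JEPA-T framework, which fuses text and visual tokens through a joint-embedding predictive Transformer and cross-attention, and improves alignment by injecting raw text embeddings during training.

Method: Images and captions are encoded into discrete tokens and processed by a joint-embedding predictive Transformer, with cross-attention added for conditional denoising; raw text embeddings are injected during training to improve alignment.

Result: Experiments on ImageNet-1K show that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and outperforms non-fusion and late-fusion baselines.

Insight: Combining late architectural fusion with objective-level alignment balances conditioning strength against backbone generality in token-based T2I.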

Abstract: Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available: https://github.com/justin-herry/JEPA-T.git

[102] A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

Axel Barroso-Laguna,Tommaso Cavallari,Victor Adrian Prisacariu,Eric Brachmann

Main category: cs.CV

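TL;DR: FastForward builds a scene representation and localizes a query image in a single feed-forward pass, estimating camera pose efficiently from a collection of features anchored in 3D space and sharply reducing mapping time.

Details Motivation: Existing visual localization methods need a long time to build a scene representation; FastForward aims to reach competitive accuracy much faster, which makes such systems more practical.

Contribution: Proposes FastForward, which performs mapping and query-image relocalization in a single feed-forward pass, reaches state-of-the-art accuracy among approaches with minimal map preparation time when coupled with image retrieval, and shows robust generalization to unseen domains.

Method: FastForward represents multiple mapping images as a collection of features anchored in 3D space and uses these features to predict image-to-scene correspondences for the query image, from which the camera pose is estimated.

Result: With minimal map preparation time, FastForward achieves state-of-the-art accuracy compared with other fast-mapping approaches and generalizes to unseen domains, including large-scale outdoor environments.

Insight: Representing the features of multiple images as a collection anchored in 3D space is the key to efficient camera pose estimation, giving the method clear advantages in speed and generalization.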

Abstract: Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences towards the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.

[103] Visual Self-Refinement for Autoregressive Models

Jiamian Wang,Ziqi Zhou,Chaithanya Kumar Mummadi,Sohail Dianat,Majid Rabbani,Raghuveer Rao,Chen Qiu,Zhiqiang Tao

Main category: cs.CV

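TL;DR: The paper proposes a plug-and-play visual self-refinement module for autoregressive models that strengthens the modeling of spatial correspondence in generated visual sequences and thereby improves generation quality.

Details Motivation: Autoregressive models excel at sequential modeling, but the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, which leads to suboptimal generations.

Contribution: The main contribution is a plug-and-play refinement module that jointly refines all generated tokens as a post-pretraining step, improving the quality of vision-language modeling.

Method: The method uses global context and relationships across tokens to mitigate error accumulation in sequential generation, and operates within a shared sequential prediction framework.

Result: Experiments show that the method improves generation quality and helps the model produce more semantically consistent results.

Insight: Bringing global context and relationship modeling into autoregressive models matters, and offers a new way to reconcile visual signals with sequential modeling.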

Abstract: Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of the autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationships across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model’s ability to produce semantically consistent results.

[104] SoftCFG: Uncertainty-guided Stable Guidance for Visual autoregressive Model

Dongli Xu,Aleksei Tiulpin,Matthew B. Blaschko

Main category: cs.CV

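TL;DR: SoftCFG is an uncertainty-guided stable guidance method that improves the visual generation quality of autoregressive models by addressing the guidance diminishing and over-guidance problems of standard Classifier-Free Guidance (CFG).

Details Motivation: Autoregressive models perform well at image generation, but standard CFG suffers from a guidance signal that fades as decoding progresses or from overly strong conditions, both of which harm the visual coherence of generated images.

Contribution: Proposes SoftCFG, which distributes adaptive, uncertainty-guided perturbations so that every generated token contributes certainty-weighted guidance, and introduces Step Normalization to stabilize long-sequence generation.

Method: At inference time, SoftCFG distributes certainty-weighted perturbations across all tokens and applies Step Normalization to bound the cumulative perturbation; it requires no additional training and is compatible with existing autoregressive models.

Result: Experiments show that SoftCFG significantly improves image quality and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.

Insight: Certainty-weighted perturbation helps resolve conflicts between text guidance and visual context, while Step Normalization is key to stabilizing long-sequence generation.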

Abstract: Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.
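For context, standard classifier-free guidance extrapolates from the unconditional prediction toward the conditional one. The sketch below shows only that well-known baseline rule applied to next-token logits; it is not SoftCFG, whose perturbation weighting and Step Normalization are not specified in this digest. The function name `cfg_logits` and the toy logit values are illustrative assumptions.

```python
def cfg_logits(cond, uncond, scale):
    """Standard CFG on logits: uncond + scale * (cond - uncond), elementwise."""
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must have equal length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a three-token vocabulary (made-up values for illustration).
cond = [2.0, 0.5, -1.0]
uncond = [1.0, 0.5, 0.0]

guided = cfg_logits(cond, uncond, scale=3.0)
print(guided)
```

The rule makes the "guidance diminishing" issue easy to see: wherever the conditional and unconditional logits coincide, as for the middle token here, the guided value is unchanged however large the scale is.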

[105] TextCAM: Explaining Class Activation Map with Text

Qiming Zhao,Xingjian Li,Xiaoyu Cao,Xiaolong Wu,Min Xu

Main category: cs.CV

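TL;DR: TextCAM is a novel explanation framework that combines class activation maps (CAM) with natural language to give semantically richer explanations of deep vision model predictions.

Details Motivation: Deep neural networks (DNNs) have succeeded in many domains, but their black-box nature limits trust in high-stakes applications. CAM and its variants only highlight spatial regions and offer little semantic explanation; TextCAM is proposed to close this gap.

Contribution: TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs), producing textual descriptions that show where the model attends and explain which visual attributes matter.

Method: TextCAM derives channel-level semantic representations using CLIP embeddings and linear discriminant analysis, then aggregates them with CAM weights to produce textual descriptions. An extension groups feature channels into semantically coherent groups for finer-grained explanations.

Result: Experiments on ImageNet, CLEVR, and CUB show that TextCAM's explanations are faithful to model predictions, improve human understanding, detect spurious correlations, and preserve model fidelity.

Insight: Combining vision and language models can yield more interpretable explanations of deep neural networks, improving model transparency and trustworthiness.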

Abstract: Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants, which work by highlighting spatial regions that drive predictions. We find that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural language. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to group feature channels into semantically coherent groups, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.

[106] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Ashim Dahal,Ankit Ghimire,Saydul Akbar Murad,Nick Rahimi

Main category: cs.CV

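TL;DR: POVQA proposes a data-efficient video question answering method that aligns large vision-language models using temporal pooling and lightweight supervision, substantially improving answer accuracy and rationale quality.

Details Motivation: Recent long-video question answering methods use context windows of 1500+ frames, yet this covers only 50 seconds of footage, so the context is used inefficiently and compute cost is high.

Contribution: 1) Proposes the POVQA framework, which compresses each second of video into a single temporally pooled image; 2) Aligns LVLMs with lightweight supervision; 3) Provides the ReasonVQA dataset of 239 question-answer pairs with reasoning annotations.

Method: 1) Temporally pools video using motion blur and weighted averaging variants; 2) Trains QWEN-2.5-VL 7B with a combination of SFT and DPO; 3) Uses a two-turn target comprising reasoning and a final answer.

Result: On ReasonVQA, F1 rises from 0.212 to 0.543, with BLEU-4 and ROUGE-L also improving markedly; cross-evaluation over pooling schemes and zero-shot testing on another dataset indicate strong robustness.

Insight: Temporal pooling combined with lightweight optimization compresses video information efficiently and improves question answering; reasoning prompts further improve the quality of the model's rationales.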

Abstract: Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since DeepMind introduced Flamingo. Recent advancements in large context/long video question answering have allowed VQA tasks to have context windows of 1500+ frames. However, this only leads to 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target comprising reasoning and a final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness on summarization of temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.
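The idea of collapsing one second of frames into a single pooled image can be sketched as a weighted per-pixel average. The weight formulas below are assumptions chosen for illustration: the scheme names echo the abstract ("Weighted Average", "Ramp", "Exponential"), but the digest does not give POVQA's actual definitions, and the `decay` parameter and the flat-list frame format are hypothetical.

```python
import math


def pooling_weights(n, scheme="uniform", decay=0.5):
    """Return n normalized frame weights; the formulas are illustrative guesses."""
    if n < 1:
        raise ValueError("need at least one frame")
    if scheme == "uniform":
        raw = [1.0] * n
    elif scheme == "ramp":
        # Linearly increasing: later frames in the second count more.
        raw = [float(i + 1) for i in range(n)]
    elif scheme == "exponential":
        # Exponentially decaying influence going back from the last frame.
        raw = [math.exp(-decay * (n - 1 - i)) for i in range(n)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [w / total for w in raw]


def pool_frames(frames, weights):
    """Weighted per-pixel average of equally sized frames given as flat lists."""
    if len(frames) != len(weights):
        raise ValueError("one weight per frame is required")
    size = len(frames[0])
    if any(len(f) != size for f in frames):
        raise ValueError("all frames must have the same number of pixels")
    return [sum(w * f[p] for f, w in zip(frames, weights)) for p in range(size)]


# Three tiny two-pixel "frames" from one second of video (toy values).
frames = [[0.0, 0.0], [0.5, 1.0], [1.0, 1.0]]
pooled = pool_frames(frames, pooling_weights(len(frames), "ramp"))
print(pooled)
```

Because the weights are normalized to sum to one, pooling a static scene returns the scene unchanged, while motion shows up as a blend that favors whichever frames the scheme emphasizes.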

[107] ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Yuxiang Guo,Jiang Liu,Ze Wang,Hao Chen,Ximeng Sun,Yang Zhao,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum

Main category: cs.CV

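TL;DR: ImageDoctor is a unified multi-aspect evaluation framework for text-to-image (T2I) models that scores image quality along four complementary dimensions (plausibility, semantic alignment, aesthetics, and overall quality) and provides pixel-level flaw heatmaps.

Details Motivation: Existing approaches typically quantify the quality of a generated image with a single scalar, which cannot provide comprehensive, interpretable feedback on image quality.

Contribution: 1. Proposes the ImageDoctor framework, which evaluates image quality along four dimensions; 2. Provides pixel-level heatmaps that mark flawed regions; 3. Introduces a "look-think-predict" paradigm to improve detail sensitivity and reasoning; 4. Trains the model with a combination of supervised fine-tuning and reinforcement learning.

Method: Built on a vision-language model and using the "look-think-predict" paradigm (localize potential flaws, generate reasoning, produce quantitative scores), trained with supervised fine-tuning and reinforcement learning.

Result: Aligns strongly with human preference across multiple datasets; when used as a reward model, it improves generation quality by 10% over scalar-based reward models.

Insight: Multi-dimensional evaluation and pixel-level feedback give more comprehensive guidance for optimizing T2I models.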

Abstract: The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a “look-think-predict” paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality – achieving an improvement of 10% over scalar-based reward models.

[108] Code2Video: A Code-centric Paradigm for Educational Video Generation

Yanzhe Chen,Kevin Qinghong Lin,Mike Zheng Shou

Main category: cs.CV

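TL;DR: The paper proposes Code2Video, a framework that generates educational videos through executable Python code, combining planning, coding, and vision-language-model refinement, and outperforms direct code generation in educational scenarios.

Details Motivation: Generative models have advanced pixel-space video synthesis, but they struggle to meet educational videos' demands for disciplinary knowledge, precise visual structure, and coherent transitions; a more controllable rendering environment is needed.

Contribution: 1. Proposes the Code2Video framework, with Planner, Coder, and Critic modules that generate educational videos via code; 2. Builds the MMMC benchmark, which supports multi-dimensional evaluation; 3. Introduces the TeachQuiz metric, which quantifies how well a video transfers knowledge.

Method: 1. The Planner structures content and prepares visual assets; 2. The Coder turns instructions into Python code, using scope-guided auto-fix to improve efficiency; 3. The Critic uses vision-language models to refine layout and clarity.

Result: Code2Video improves on direct code generation by 40% and produces videos comparable to human-crafted tutorials.

Insight: A code-centric paradigm makes educational video generation more controllable and interpretable; combining multi-agent collaboration with vision-language models can markedly improve quality.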

Abstract: While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.

[109] Secure and reversible face anonymization with diffusion models

Pol Labarbarie,Vincent Itier,William Puech

Main category: cs.CV

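TL;DR: The paper proposes a secure, reversible face anonymization method based on a diffusion model. By combining a secret key with a facial mask, it produces high-quality anonymized images and allows the original face to be recovered with the correct key.

Details Motivation: Face images processed by computer vision algorithms can leak sensitive information, and existing anonymization methods struggle to offer high-quality generation, security, and reversibility at the same time.

Contribution: Proposes, to the authors' knowledge, the first secure, high-quality, reversible face anonymization method based on a diffusion model, combining a secret key with a facial mask to serve both privacy protection and later person authentication.

Method: The secret key is combined with the diffusion model's latent face representation, and a facial mask constrains generation to keep image quality high; reversibility comes from a deterministic forward and backward diffusion process.

Result: The anonymized faces are of high quality and less visually similar to the originals than those of previous work, and the original face can be recovered only by an authorized party holding the correct key.

Insight: Pairing the generative power of diffusion models with a secret-key security mechanism strikes a better balance between privacy protection and person authentication.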

Abstract: Face images processed by computer vision algorithms contain sensitive personal information that malicious actors can capture without consent. These privacy and security risks highlight the need for effective face anonymization methods. Current methods struggle to offer a good trade-off between a secure scheme with high-quality image generation and reversibility for later person authentication. Diffusion-based approaches produce high-quality anonymized images but lack the secret key mechanism to ensure that only authorized parties can reverse the process. In this paper, we introduce, to our knowledge, the first secure, high-quality reversible anonymization method based on a diffusion model. We propose to combine the secret key with the latent face representation of the diffusion model. To preserve identity-irrelevant features, generation is constrained by a facial mask, maintaining high-quality images. By using a deterministic forward and backward diffusion process, our approach enforces that the original face can be recovered with the correct secret key. We also show that the proposed method produces anonymized faces that are less visually similar to the original faces than those of previous work.

[110] KeySG: Hierarchical Keyframe-Based 3D Scene Graphs

Abdelrhman Werby,Dennis Rotondi,Fabio Scaparro,Kai O. Arras

Main category: cs.CV

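TL;DR: KeySG proposes a keyframe-based hierarchical 3D scene graph framework that augments nodes with multi-modal information and uses vision-language models (VLMs) to extract scene information, addressing the graph-size and semantic limitations of earlier methods.

Details Motivation: Existing 3D scene graph methods are limited in the semantic relationships they can express and in how well they scale, which hinders machine reasoning and navigation in complex human-centered environments.

Contribution: 1) Proposes a hierarchical 3D scene graph representation (KeySG) built on keyframes selected to optimize geometric and visual coverage; 2) Uses a VLM to extract scene information, avoiding explicit modeling of object relationships; 3) Addresses the scalability of large graphs with a hierarchical RAG pipeline.

Method: 1) Represents the 3D scene as a hierarchical graph (floors, rooms, objects, and so on); 2) Augments nodes with keyframes and multi-modal information; 3) Extracts information with a VLM and retrieves relevant context efficiently through the RAG pipeline.

Result: Across four benchmarks (including 3D object segmentation and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its semantic richness and efficiency.

Insight: The hierarchical structure and the use of keyframes increase the expressiveness of 3D scene graphs while easing computation and storage costs.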

Abstract: In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM’s context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across four distinct benchmarks – including 3D object segmentation and complex query retrieval – KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.

[111] Instant4D: 4D Gaussian Splatting in Minutes

Zhanpeng Luo,Haoxi Ran,Li Lu

Main category: cs.CV

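TL;DR: Instant4D is a monocular reconstruction system based on 4D Gaussian splatting that processes uncalibrated casual video within minutes, markedly improving the efficiency of dynamic scene reconstruction.

Details Motivation: Dynamic view synthesis has advanced, but reconstructing scenes from uncalibrated casual video remains challenging because of slow optimization and complex parameter estimation.

Contribution: Proposes an efficient 4D Gaussian representation that cuts training time to within two minutes and model size to under 10% of the original, while remaining competitive across several benchmarks.

Method: Recovers geometry with deep visual SLAM, applies grid pruning to optimize the scene representation, and introduces a streamlined 4D Gaussian representation to handle temporal dynamics efficiently.

Result: Reconstructs a single video within 10 minutes on the Dycheck dataset or for a typical 200-frame video, with a large reduction in training time and competitive performance.

Insight: The 4D Gaussian representation is efficient and practical for dynamic scene reconstruction and suits fast processing of uncalibrated video.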

Abstract: Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages a native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video within 10 minutes on the Dycheck dataset or for a typical 200-frame video. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at https://instant4d.github.io/.

[112] Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving

Yuxiang Feng,Keyang Zhang,Hassane Ouchouid,Ashwil Kaniamparambil,Ioannis Souflas,Panagiotis Angeloudis

Main category: cs.CV

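TL;DR: The paper proposes a game-theoretic fusion method, a context-aware Dawid-Skene model with Shapley-based credit, for fusing vision-language models on multi-label tasks in autonomous driving, with substantial performance gains.

Details Motivation: Vision-language models (VLMs) are increasingly used in autonomous driving, but hallucination undermines their reliability. The paper aims to address reliability in multi-label tasks through model fusion.

Contribution: 1. Proposes the Shapley-credited Context-Aware Dawid-Skene fusion method; 2. Combines context and Shapley values to make model fusion more reliable and adaptive; 3. Validates the method on real-world data.

Method: A game-theoretic fusion method that uses context conditioning and Shapley values to adjust reliability per model and per label, and combines log-likelihood ratios with a prior to produce calibrated posteriors.

Result: Relative to the best single model, Hamming distance falls by 23%, while Macro-F1 and Micro-F1 improve by 55% and 47% respectively.

Insight: The method improves multi-model fusion while preserving signals that only a single model gets right, which suits autonomous driving tasks in changing environments.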

Abstract: Large vision-language models (VLMs) are increasingly used in autonomous-vehicle (AV) stacks, but hallucination limits their reliability in safety-critical pipelines. We present Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic fusion method for multi-label understanding of ego-view dashcam video. It learns per-model, per-label, context-conditioned reliabilities from labelled history and, at inference, converts each model’s report into an agreement-guardrailed log-likelihood ratio that is combined with a contextual prior and a public reputation state updated via Shapley-based team credit. The result is calibrated, thresholdable posteriors that (i) amplify agreement among reliable models, (ii) preserve uniquely correct single-model signals, and (iii) adapt to drift. To specialise general VLMs, we curate 1,000 real-world dashcam clips with structured annotations (scene description, manoeuvre recommendation, rationale) via an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and YOLOv11 + BoT-SORT tracking, guided by a three-step chain-of-thought prompt; three heterogeneous VLMs are then fine-tuned with LoRA. We evaluate with Hamming distance, Micro- and Macro-F1, and average per-video latency. Empirically, the proposed method achieves a 23% reduction in Hamming distance, 55% improvement in Macro-F1, and 47% improvement in Micro-F1 compared with the best single model, supporting VLM fusion as a calibrated, interpretable, and robust decision-support component for AV pipelines.
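The three accuracy metrics named in the abstract have standard multi-label definitions, which the sketch below implements from scratch. It is an illustration of those textbook definitions, not the paper's evaluation code: whether the authors normalize Hamming distance by the number of labels is an assumption here, and the toy label matrices are made up.

```python
def hamming_distance(y_true, y_pred):
    """Fraction of label slots where prediction and truth disagree (normalized form assumed)."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def label_counts(y_true, y_pred):
    """Per-label [true positives, false positives, false negatives]."""
    counts = [[0, 0, 0] for _ in range(len(y_true[0]))]
    for tr, pr in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(tr, pr)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    return counts


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Average of per-label F1 scores, so every label counts equally."""
    per_label = [f1(*c) for c in label_counts(y_true, y_pred)]
    return sum(per_label) / len(per_label)


def micro_f1(y_true, y_pred):
    """F1 over counts pooled across labels, so frequent labels dominate."""
    counts = label_counts(y_true, y_pred)
    return f1(*(sum(c[i] for c in counts) for i in range(3)))


# Two clips, three binary labels each (toy data, not from the paper).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(hamming_distance(y_true, y_pred), macro_f1(y_true, y_pred), micro_f1(y_true, y_pred))
```

Macro-F1 weights rare and common labels equally while Micro-F1 pools all decisions, which is one reason a fusion method can improve the two by different amounts.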

[113] EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory

Jiahao Wang,Luoxin Ye,TaiMing Lu,Junfei Xiao,Jiahan Zhang,Yuxiang Guo,Xijun Liu,Rama Chellappa,Cheng Peng,Alan Yuille,Jieneng Chen

Main category: cs.CV

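TL;DR: EvoWorld proposes a world model that couples panoramic video generation with an evolving 3D memory, using explicit 3D geometric guidance to improve the spatial consistency and visual realism of generated video.

Details Motivation: Inspired by the human ability to mentally explore and replay 3D environments, EvoWorld aims to emulate that ability and achieve spatially consistent long-horizon world modeling.

Contribution: The main contribution is using an evolving 3D memory as explicit spatial guidance for panoramic video generation, which significantly improves visual realism and geometric consistency.

Method: 1) Generate future frames with a video generator; 2) Evolve the 3D reconstruction with a feedforward transformer; 3) Synthesize future frames conditioned on geometric reprojections from the explicit 3D memory.

Result: Experiments show that EvoWorld surpasses existing approaches in visual fidelity and spatial consistency, particularly in long-horizon exploration.

Insight: The explicit 3D memory is the key innovation: it supplies rich spatial cues for video generation and addresses long-horizon consistency.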

Abstract: Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene’s 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes futures by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-art methods that synthesize videos only, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.

[114] IMAGEdit: Let Any Subject Transform

Fei Shen,Weihao Xu,Rui Yan,Dong Zhang,Xiangbo Shu,Jinhui Tang

Main category: cs.CV

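TL;DR: IMAGEdit is a training-free framework for multi-subject video editing that relies on multimodal conditioning and precise mask sequences, with no fine-tuning or retraining.

Details Motivation: Existing video editing methods suffer from insufficient prompt-side multimodal conditioning and from mask boundary entanglement, which limits their applicability to multi-subject video editing.

Contribution: IMAGEdit introduces a prompt-guided multimodal alignment module and a prior-based mask retargeting module, which address the core challenges of multi-subject video editing and significantly improve performance.

Method: Large models are used to produce multimodal information and mask motion sequences, which are then fed into a pretrained mask-driven video generation model to synthesize the edited video.

Result: On the MSVBench benchmark, IMAGEdit outperforms existing methods, supporting its generalization and editing quality.

Insight: IMAGEdit's generality and compatibility let it be applied flexibly to a range of mask-driven video generation models, advancing video editing.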

Abstract: In this paper, we present IMAGEdit, a training-free framework for editing any number of video subjects that manipulates the appearances of multiple designated subjects while preserving non-target regions, without finetuning or retraining. We achieve this by providing robust multimodal conditioning and precise mask sequences through a prompt-guided multimodal alignment module and a prior-based mask retargeting module. We first leverage large models’ understanding and generation capabilities to produce multimodal information and mask motion sequences for multiple subjects across various types. Then, the obtained prior mask sequences are fed into a pretrained mask-driven video generation model to synthesize the edited video. With strong generalization capability, IMAGEdit remedies insufficient prompt-side multimodal conditioning and overcomes mask boundary entanglement in videos with any number of subjects, thereby significantly expanding the applicability of video editing. More importantly, IMAGEdit is compatible with any mask-driven video generation model, significantly improving overall performance. Extensive experiments on our newly constructed multi-subject benchmark MSVBench verify that IMAGEdit consistently surpasses state-of-the-art methods. Code, models, and datasets are publicly available at https://github.com/XWH-A/IMAGEdit.

cs.SD [Back]

[115] Unpacking Musical Symbolism in Online Communities: Content-Based and Network-Centric Approaches

Kajwan Ziaoddini

Main category: cs.SD

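TL;DR: The paper combines content-based music analysis with a lightweight network perspective on lyrics to study how musical symbolism circulates in online communities. It finds a decline in energy, a rise in danceability, and systematic links between mood and genre.

Details Motivation: To explore how musical symbolism is produced and circulated in online communities by combining audio content analysis with a network analysis of lyrics, relating musical features to community culture.

Contribution: 1) A reproducible MIR-plus-network workflow; 2) evidence that energy is strongly coupled with loudness and negatively associated with acousticness; 3) lyric analysis showing a pronoun-centric lexicon in mainstream narratives.

Method: 1) Analyze 275 chart-topping songs using audio descriptors and full lyric transcripts; 2) quantify temporal trends, lexical salience, and co-occurrence; 3) profile mood by genre.

Result: Energy declines (79 -> 58) while danceability rises (59 -> 73); mood varies by genre, with R&B showing the highest mean valence (96); pronouns such as "I/you/me/my" dominate the lyrics.

Insight: The mainstream favors relaxed yet rhythmically engaging productions, which the authors read as a sign of shifting community culture and commercial preference.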

Abstract: This paper examines how musical symbolism is produced and circulated in online communities by combining content-based music analysis with a lightweight network perspective on lyrics. Using a curated corpus of 275 chart-topping songs enriched with audio descriptors (energy, danceability, loudness, liveness, valence, acousticness, speechiness, popularity) and full lyric transcripts, we build a reproducible pipeline that (i) quantifies temporal trends in sonic attributes, (ii) models lexical salience and co-occurrence, and (iii) profiles mood by genre. We find a decade-long decline in energy (79 -> 58) alongside a rise in danceability (59 -> 73); valence peaks in 2013 (63) and dips in 2014-2016 (42) before partially recovering. Correlation analysis shows strong coupling of energy with loudness (r = 0.74) and negative associations for acousticness with both energy (r = -0.54) and loudness (r = -0.51); danceability is largely orthogonal to other features (|r| < 0.20). Lyric tokenization (>114k tokens) reveals a pronoun-centric lexicon “I/you/me/my” and a dense co-occurrence structure in which interpersonal address anchors mainstream narratives. Mood differs systematically by style: R&B exhibits the highest mean valence (96), followed by K-Pop/Pop (77) and Indie/Pop (70), whereas Latin/Reggaeton is lower (37) despite high danceability. Read through a subcultural identity lens, these patterns suggest the mainstreaming of previously peripheral codes and a commercial preference for relaxed yet rhythmically engaging productions that sustain collective participation without maximal intensity. Methodologically, we contribute an integrated MIR-plus-network workflow spanning summary statistics, correlation structure, lexical co-occurrence matrices, and genre-wise mood profiling that is robust to modality sparsity and suitable for socially aware recommendation or community-level diffusion studies.
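The correlation figures in the abstract (for example r = 0.74 between energy and loudness) are Pearson coefficients. As a minimal sketch of how such a value is computed, the function below implements the standard formula; the toy feature values are hypothetical and are not taken from the paper's corpus.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values for five songs (illustration only).
energy = [80, 75, 60, 55, 90]
loudness = [-4.0, -5.0, -7.5, -8.0, -3.0]
r = pearson_r(energy, loudness)
```

On a real corpus the same function would be applied to every pair of audio descriptors to build the correlation matrix the abstract refers to.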

[116] When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li,Tzu-Han Lin,Hung-yi Lee

Main category: cs.SD

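TL;DR: Large audio-language models (LALMs) are fragile in noisy real-world settings. Irrelevant audio such as silence or noise interferes with text reasoning tasks, reducing accuracy and increasing prediction volatility even when the audio carries no information.

Details Motivation: To examine how distracting audio affects LALMs on tasks where audio is unnecessary, and what that implies for robustness in noisy real-world deployments.

Contribution: Shows that irrelevant audio, including silence, harms LALM text reasoning; shows that the severity of interference grows with audio duration, amplitude, and decoding temperature; and evaluates mitigation strategies.

Method: Tests several kinds of irrelevant audio (silence, synthetic noise, environmental sounds) on three text-based benchmarks and analyzes how model size and audio parameters such as duration and amplitude affect the interference.

Result: Irrelevant audio reduces accuracy and increases prediction volatility, and silence destabilizes outputs as strongly as synthetic noise. Larger models are more resilient, but every evaluated system remains vulnerable. Among mitigations, prompting has limited effect, while self-consistency improves stability at a higher computational cost.

Insight: Cross-modal interference is a key robustness challenge for LALMs, which calls for efficient fusion strategies that preserve reasoning performance when inputs are irrelevant.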

Abstract: Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
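Self-consistency, the mitigation the abstract reports as effective but costly, is commonly implemented as majority voting over several sampled answers. The sketch below assumes a hypothetical `sample_answer` callable standing in for one stochastic model call; it is not the authors' code, and the cost grows linearly with `n_samples`.

```python
from collections import Counter


def self_consistency(sample_answer, n_samples=5):
    """Sample n_samples answers and return (majority answer, agreement rate).

    sample_answer is a hypothetical zero-argument callable that performs one
    stochastic model call and returns a hashable answer such as "A" or "B".
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples


# Deterministic stand-in for a model whose samples disagree.
canned = iter(["B", "A", "B", "C", "B"])
answer, agreement = self_consistency(lambda: next(canned), n_samples=5)
```

The agreement rate doubles as a rough volatility signal: low agreement on text-only questions with silent audio attached would be one way to observe the instability the paper describes.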

[117] Hearing the Order: Investigating Selection Bias in Large Audio-Language Models

Yu-Xiang Lin,Chen-An Li,Sheng-Lun Wei,Po-Chun Chen,Hsin-Hsi Chen,Hung-yi Lee

Main category: cs.SD

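TL;DR: The paper studies whether LALMs show selection bias tied to the order of answer options and finds that every evaluated model is affected. Shuffling the options can change performance by up to 24%, and permutation-based strategies mitigate the bias in most cases.

Details Motivation: LALMs are widely used on tasks with ordered options, yet it is unclear whether the option order influences their predictions. Such a bias would undermine reliability, so a systematic study is needed.

Contribution: 1. The first systematic analysis of selection bias in LALMs; 2. experiments showing that all evaluated models exhibit the bias, with substantial effects; 3. evidence that permutation-based strategies can mitigate it.

Method: Runs six LALMs on three widely used benchmarks and their spoken counterparts, shuffles the order of answer options to measure the performance change, and tests permutation-based mitigation.

Result: Shuffling the option order causes performance fluctuations of up to 24% and can even change model rankings. Permutation-based strategies reduce the bias in most cases.

Insight: Current LALM evaluation practice may be unreliable because of order bias. Permutation-based strategies are a possible remedy, but further research is needed.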

Abstract: Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
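One simple permutation-based strategy is to query the model under several orderings of the options and vote on the option content rather than its position. The sketch below illustrates that idea under stated assumptions: `predict_index` is a hypothetical model wrapper, and the paper may use different permutation schemes.

```python
from collections import Counter
from itertools import permutations


def permutation_vote(question, options, predict_index, max_perms=None):
    """Return the option text chosen most often across option orderings.

    predict_index(question, ordered_options) is a hypothetical callable that
    returns the index of the option the model selects for that ordering.
    """
    votes = Counter()
    for i, perm in enumerate(permutations(options)):
        if max_perms is not None and i >= max_perms:
            break
        choice = predict_index(question, list(perm))
        votes[perm[choice]] += 1  # vote for the content, not the position
    return votes.most_common(1)[0][0]


def biased_model(question, ordered_options):
    """Toy model: finds "Paris" only in the first two slots, else picks slot 0."""
    position = ordered_options.index("Paris")
    return position if position < 2 else 0


options = ["Rome", "Berlin", "Paris"]
single_order_pick = options[biased_model("Capital of France?", options)]
debiased_pick = permutation_vote("Capital of France?", options, biased_model)
```

With three options the toy model is wrong for the single ordering shown but right in four of the six permutations, so the vote recovers "Paris". Full enumeration costs n! model calls, which is why `max_perms` caps the number of orderings.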

cs.AI

[118] ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

Dongqi Zheng

Main category: cs.AI

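TL;DR: ARS is a training-free reasoning suppression method that improves LRLM efficiency by dynamically monitoring for redundant reasoning steps while maintaining or improving accuracy.

Details Motivation: Large reasoning language models (LRLMs) perform well on complex tasks but are computationally inefficient because of "overthinking". Existing methods struggle to balance reasoning quality against inference cost.

Contribution: Proposes Adaptive Reasoning Suppression (ARS), which suppresses redundant reasoning steps through adaptive certainty monitoring and progressive thresholds, improving efficiency without sacrificing accuracy.

Method: Uses a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds to adjust the amount of reasoning dynamically and avoid unnecessary computation.

Result: On mathematical reasoning benchmarks, ARS reduces tokens, latency, and energy by up to 53%, 46.1%, and 57.9%, respectively, while maintaining or improving accuracy.

Insight: Dynamically adjusting the number of reasoning steps improves efficiency and compares favorably with static suppression methods.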

Abstract: Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose Adaptive Reasoning Suppression (ARS), a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53%, 46.1%, and 57.9% in token, latency and energy reduction, while maintaining or improving accuracy.
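The abstract does not spell out how the checkpoints or thresholds are defined, so the following is only an illustrative sketch of a checkpoint-based stopping rule. It assumes a certainty score in [0, 1] is available at each checkpoint and that the stopping threshold relaxes linearly with the checkpoint index; the function names, the schedule, and all default values are hypothetical and not taken from the paper.

```python
def should_stop(certainty, checkpoint_idx, base_threshold=0.9, decay=0.05, floor=0.6):
    """Decide whether to suppress further reasoning at a checkpoint.

    Assumption: the threshold starts at base_threshold and drops by `decay`
    per checkpoint, never going below `floor`.
    """
    threshold = max(floor, base_threshold - decay * checkpoint_idx)
    return certainty >= threshold


def first_stop_checkpoint(certainties, **schedule):
    """Return the index of the first checkpoint that triggers a stop, or None."""
    for idx, certainty in enumerate(certainties):
        if should_stop(certainty, idx, **schedule):
            return idx
    return None


# Hypothetical certainty trace: the model becomes more certain as it reasons.
stop_at = first_stop_checkpoint([0.5, 0.7, 0.82, 0.95])
```

In this trace the rule fires at checkpoint 2, where the certainty of 0.82 clears the relaxed threshold of about 0.8, so the remaining reasoning would be skipped.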

[119] ACON: Optimizing Context Compression for Long-horizon LLM Agents

Minki Kang,Wei-Ning Chen,Dongge Han,Huseyin A. Inan,Lukas Wutschitz,Yanzhi Chen,Robert Sim,Saravan Rajmohan

Main category: cs.AI

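TL;DR: ACON is a context compression optimization framework for long-horizon LLM agents that lowers memory usage (peak tokens) while largely preserving task performance.

Details Motivation: As LLM agents are deployed in dynamic environments, the cost and inefficiency of ever-growing context become a central challenge. Prior compression work mostly targets single-step tasks or narrow applications and offers little support for long-horizon tasks.

Contribution: Proposes Agent Context Optimization (ACON), a unified framework that compresses both environment observations and interaction histories, using compression guideline optimization in natural language space plus distillation into smaller models.

Method: Given paired trajectories in which the full context succeeds but the compressed context fails, capable LLMs analyze the cause of failure and the compression guideline is updated accordingly. The optimized compressor is then distilled into smaller models to reduce overhead.

Result: On AppWorld, OfficeBench, and Multi-objective QA, ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and improves smaller LMs as long-horizon agents by up to 46%.

Insight: Context compression combined with distillation can balance efficiency and performance in long-horizon agent tasks.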

Abstract: Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.

[120] Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling

Federico Tiblias,Irina Bigoulaeva,Jingcheng Niu,Simone Balloccu,Iryna Gurevych

Main category: cs.AI

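TL;DR: The paper proposes SMDS, a model-agnostic method for automatically discovering feature manifolds in language models, and uses it to reveal the geometry of those manifolds and their functional role in reasoning.

Details Motivation: Prior work discovers specific geometries for specific features and therefore lacks generality. SMDS aims to discover feature manifolds automatically in order to understand how language models encode and use concepts.

Contribution: 1. The SMDS method for automatic feature manifold discovery; 2. evidence that manifold geometry consistently reflects the properties of the represented concepts; 3. evidence that these structures are stable across model families and sizes and reshape dynamically with context.

Method: SMDS (Supervised Multi-Dimensional Scaling) is a model-agnostic method that analyzes a language model's latent space to identify the geometric structure of features, such as circles, lines, and clusters.

Result: In a temporal reasoning case study, SMDS identifies varied geometric structures and finds that they actively support model reasoning and reshape in response to context changes.

Insight: Feature manifolds not only encode concept properties but also support reasoning, consistent with a model of entity-based reasoning in which LMs encode and transform structured representations.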

Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.

[121] VIRTUE: Visual-Interactive Text-Image Universal Embedder

Wei-Yao Wang,Kazuya Tateishi,Qiyu Wu,Shusuke Takahashi,Yuki Mitsufuji

Main category: cs.AI

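TL;DR: VIRTUE is a visual-interactive text-image universal embedder that combines a segmentation model with a vision-language model to embed user-specified regions precisely. It achieves state-of-the-art results on MMEB and on the new large-scale SCaR benchmark.

Details Motivation: Existing embedding models lack visual-interactive capabilities and cannot handle user-specified regions of interest (e.g., point, bounding box, mask), which limits localized grounding of user intent and entity-level understanding.

Contribution: 1. The VIRTUE model, which combines a segmentation model with a vision-language model so that users can specify regions through visual interaction; 2. the SCaR benchmark for evaluating visual-interactive retrieval.

Method: The segmentation model processes visual prompts (points, boxes, masks) that pinpoint regions within an image, and this region information is incorporated into the embedding so the embedder can handle complex and ambiguous scenarios more precisely.

Result: Improvements of 3.1%-8.5% across 36 universal MMEB tasks and 15.2%-20.3% across five visual-interactive SCaR tasks.

Insight: Region-level embeddings obtained through visual interaction can improve performance in complex scenes and complement global representations.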

Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.

[122] Batch-CAM: Introduction to better reasoning in convolutional deep learning models

Giacomo Ignesti,Davide Moroni,Massimo Martinelli

Main category: cs.AI

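TL;DR: The paper proposes Batch-CAM, a training paradigm that combines a batch implementation of Grad-CAM with a prototypical reconstruction loss to improve both the performance and the interpretability of deep learning models.

Details Motivation: In high-stakes fields such as healthcare, transparency and explainability are essential. Conventional approaches often struggle to balance accuracy and interpretability, which motivates a new training paradigm.

Contribution: Batch-CAM fuses a batch implementation of Grad-CAM with a prototypical reconstruction loss, guiding the model toward salient image features, which improves classification performance while reducing training and inference time.

Method: During training, Batch-CAM combines batched Grad-CAM with a prototypical reconstruction loss to steer the model's attention toward salient features.

Result: Experiments show simultaneous gains in accuracy and image reconstruction quality together with shorter training and inference times.

Insight: Batch-CAM offers a route to more transparent, explainable, and trustworthy AI systems, particularly in domains that require both accuracy and interpretability.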

Abstract: Understanding the inner workings of deep learning models is crucial for advancing artificial intelligence, particularly in high-stakes fields such as healthcare, where accurate explanations are as vital as precision. This paper introduces Batch-CAM, a novel training paradigm that fuses a batch implementation of the Grad-CAM algorithm with a prototypical reconstruction loss. This combination guides the model to focus on salient image features, thereby enhancing its performance across classification tasks. Our results demonstrate that Batch-CAM achieves a simultaneous improvement in accuracy and image reconstruction quality while reducing training and inference times. By ensuring models learn from evidence-relevant information, this approach makes a relevant contribution to building more transparent, explainable, and trustworthy AI systems.

eess.SP

[123] WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities

Ziyi Zeng,Zhenyang Cai,Yixi Cai,Xidong Wang,Junying Chen,Rongsheng Wang,Yipeng Liu,Siqi Cai,Benyou Wang,Zhiguo Zhang,Haizhou Li

Main category: eess.SP

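TL;DR: The paper presents WaveMind, a multimodal large language model for EEG that aligns EEG signals with textual and visual modalities for generalized interpretation of brain signals, and introduces WaveMind-Instruct-338k, a new dataset for instruction tuning.

Details Motivation: EEG signals simultaneously encode cognitive processes and intrinsic neural states, which creates a mismatch in EEG paired-data modality and hinders effective cross-modal representation learning. The authors aim to improve generalized interpretation by aligning EEG with the semantic space of other modalities.

Contribution: 1. Uncovers complementary relationships between the modalities paired with EEG; 2. proposes mapping EEG signals and their corresponding modalities into a unified semantic space; 3. introduces WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning.

Method: Based on an analysis of the relationships between EEG and its paired modalities, the model maps them into a unified semantic space and is instruction-tuned on WaveMind-Instruct-338k.

Result: The model achieves robust classification accuracy and supports flexible, open-ended conversation across four downstream tasks.

Insight: Aligning EEG with other modalities in a unified semantic space offers a useful direction for cross-modal learning and for general-purpose EEG models.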

Abstract: Electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals. However, the complex nature of brain activity introduces critical challenges: EEG signals simultaneously encode both cognitive processes and intrinsic neural states, creating a mismatch in EEG paired-data modality that hinders effective cross-modal representation learning. Through a pivot investigation, we uncover complementary relationships between these modalities. Leveraging this insight, we propose mapping EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. To fully enable conversational capabilities, we further introduce WaveMind-Instruct-338k, the first cross-task EEG dataset for instruction tuning. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations across four downstream tasks, thereby offering valuable insights for both neuroscience research and the development of general-purpose EEG models.

cs.IR

[124] Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs

Roksana Goworek,Olivia Macmillan-Scott,Eda B. Özyiğit

Main category: cs.IR

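TL;DR: This survey traces the evolution of cross-lingual information retrieval (CLIR) from early translation-based methods to current approaches based on embeddings and multilingual large language models (LLMs). It summarizes core components, evaluation practices, and resources, and outlines future directions.

Details Motivation: CLIR addresses retrieval of documents written in a language different from the query's. Traditional approaches rely on translation and treat retrieval and cross-lingual capability in isolation. Multilingual LLMs bring embedding-based and generative alternatives that call for a systematic review.

Contribution: 1. A systematic review of the development of CLIR; 2. a structured account of core components, evaluation practices, and resources; 3. identification of persistent challenges such as data imbalance and linguistic variation; 4. suggested directions for future research.

Method: A literature survey that covers translation-based, embedding-driven, and generative CLIR techniques and discusses the role of multilingual LLMs.

Result: The survey reports that cross-lingual embeddings and multilingual LLMs offer improved retrieval performance and enable answer generation.

Insight: 1. Aligning representations across languages remains a central challenge for multilingual LLMs; 2. future CLIR systems need to be robust, inclusive, and adaptable; 3. data imbalance must be addressed for retrieval to be equitable.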

Abstract: Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.

cs.MM

[125] Object-AVEdit: An Object-level Audio-Visual Editing Model

Youquan Fu,Ruiyang Si,Hongfa Wang,Dongzhan Zhou,Jiacheng Sun,Ping Luo,Di Hu,Hongyuan Zhang,Xuelong Li

Main category: cs.MM

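TL;DR: Object-AVEdit is an object-level audio-visual editing model based on the inversion-regeneration paradigm, addressing the weakness of existing models at cross-modal object-level operations.

Details Motivation: Current audio and video editing models struggle with object-level editing across modalities, in particular adding, replacing, and removing objects while preserving the structural information of the source instances.

Contribution: 1) An inversion-regeneration holistically-optimized editing algorithm; 2) an audio generation model with word-to-sounding-object alignment, which closes the object-controllability gap between audio and video generation models.

Method: Follows the inversion-regeneration paradigm, combining the word-to-sounding-object aligned audio generation model with the holistically optimized editing algorithm to perform object-level audio-visual editing.

Result: Experiments show strong results on object-level editing tasks, and the audio generation model also achieves advanced performance.

Insight: Cross-modal alignment together with holistic optimization enables finer-grained object-level audio-visual editing.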

Abstract: There is a high demand for audio-visual editing in video post-production and the filmmaking field. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object-controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm, ensuring both information retention during the inversion and better regeneration effect. Extensive experiments demonstrate that our editing model achieves advanced results in audio-video object-level editing tasks with fine audio-visual semantic alignment. In addition, our developed audio generation model also achieves advanced performance. More results on our project page: https://gewu-lab.github.io/Object_AVEdit-website/.

cs.LG

[126] Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

Houjun Liu,Shikhar Murty,Christopher D. Manning,Róbert Csordás

Main category: cs.LG

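TL;DR: The paper proposes Thoughtbubbles, an unsupervised method that lets transformers think in parallel in latent space, allocating extra inference-time compute to the tokens that need it.

Details Motivation: Existing approaches scale inference-time compute by training models to emit explicit chain-of-thought tokens, which cannot be applied during pretraining and is limited to serial generation. Thoughtbubbles addresses both limits by learning parallel computation in latent space.

Contribution: Thoughtbubbles is a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams, without explicit supervision.

Method: The model learns to fork (clone) or delete residual streams, so tokens that need more computation form a "bubble" of cloned residuals in the middle of the network. This behavior is learned during pretraining with only the language modeling loss.

Result: Across 150M to 772M parameter scales, Thoughtbubbles outperforms standard decoder LMs and non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA.

Insight: Because the method is implicit, adaptive computation can be learned from pretraining onward, which points toward unifying train-time and test-time behavior for reasoning models.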

Abstract: Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a “bubble” of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.

[127] The data-quality illusion: Rethinking Classifier-based quality filtering for LLM Pretraining

Thiziri Nait Saada,Louis Bethune,Michal Klein,David Grangier,Marco Cuturi,Pierre Ablin

Main category: cs.LG

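TL;DR: The paper analyzes Classifier-based Quality Filtering (CQF) in depth. CQF improves downstream task performance but does not necessarily improve language modeling on the high-quality dataset, and the authors question whether it captures a meaningful notion of data quality.

Details Motivation: Large-scale models are pretrained on web-crawled data of mixed quality, so data filtering is essential. CQF is a popular filtering method, and the authors examine what it actually does.

Contribution: 1. Shows that CQF improves downstream performance without necessarily improving language modeling on the high-quality dataset; 2. explains this paradox by showing that CQF implicitly filters the high-quality dataset as well; 3. compares CQF against synthetic data of increasing quality and finds starkly different trends.

Method: 1. Analyze the behavior of CQF, in particular its effect on the high-quality dataset; 2. compare models trained with CQF to models trained on synthetic data of increasing quality obtained via random token permutations; 3. test whether CQF captures a meaningful notion of data quality.

Result: CQF does not necessarily improve language modeling on the high-quality dataset, because it implicitly filters that dataset too.

Insight: CQF may not be a sound measure of data quality, and better measures may be needed. Diversity within high-quality data may matter more than simple score thresholds.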

Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier’s score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
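The filtering step of CQF, as described in the abstract, scores each pretraining document with a classifier and keeps the top-scoring ones. The sketch below shows only that selection step; `toy_score` is a hypothetical stand-in for a trained binary classifier, and the keyword vocabulary is invented for illustration.

```python
def cqf_filter(docs, score_fn, keep_fraction):
    """Keep the top keep_fraction of documents ranked by score_fn (higher is better)."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(docs) * keep_fraction))
    ranked = sorted(docs, key=score_fn, reverse=True)
    return ranked[:n_keep]


# Hypothetical stand-in for the classifier's quality score: the share of
# words that appear in a small "high-quality" vocabulary.
HIGH_QUALITY_WORDS = {"theorem", "lemma", "prove"}


def toy_score(doc):
    words = doc.split()
    return sum(w in HIGH_QUALITY_WORDS for w in words) / max(1, len(words))


corpus = [
    "the theorem follows from the lemma",
    "click here buy now",
    "we prove the theorem",
    "lol lol lol",
]
kept = cqf_filter(corpus, toy_score, keep_fraction=0.5)
```

The toy scorer makes the paper's concern easy to see: the filter keeps whatever resembles the reference set under the classifier, which is not necessarily the same thing as quality.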

[128] It Takes Two: Your GRPO Is Secretly DPO

Yihong Wu,Liheng Ma,Lei Ding,Muzhi Li,Xinyu Wang,Kejia Chen,Zhan Su,Zhanguang Zhang,Chenyang Huang,Yingxue Zhang,Mark Coates,Jian-Yun Nie

Main category: cs.LG

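TL;DR: The paper reveals a theoretical connection between GRPO and DPO and studies GRPO with only two rollouts per group (2-GRPO), which cuts computational overhead while matching the performance of 16-GRPO.

Details Motivation: GRPO is commonly believed to need a large group size for stable training, which is computationally expensive. Reframing GRPO as contrastive learning exposes its connection to DPO and motivates testing the minimal group size.

Contribution: 1. Establishes the theoretical connection between GRPO and DPO; 2. proposes 2-GRPO, which uses only two rollouts; 3. validates 2-GRPO both theoretically and empirically.

Method: Reframes GRPO as a form of contrastive learning, analyzes the minimal two-rollout case (2-GRPO) in light of DPO, and validates it with theoretical analysis and experiments.

Result: 2-GRPO performs on par with 16-GRPO while using only 1/8 of the rollouts and reducing training time by over 70%.

Insight: A contrastive-learning view of GRPO shows that its computational cost can be reduced substantially without sacrificing performance.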

Abstract: Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO’s empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
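To see why the two-rollout case looks contrastive, it helps to compute group-normalized advantages by hand. The sketch below assumes the common GRPO formulation, advantage = (reward - group mean) / group standard deviation, with a population standard deviation; implementations differ on that detail, and this is not the authors' code.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (population std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts, one correct and one wrong: advantages are about +1 and -1.
pair = group_advantages([1.0, 0.0])

# Two rollouts with equal rewards: both advantages are 0, so no gradient signal.
tie = group_advantages([1.0, 1.0])
```

With binary rewards, a group of two therefore either yields one up-weighted and one down-weighted response, which resembles a preferred and dispreferred pair, or contributes nothing at all.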

[129] GEM: A Gym for Agentic LLMs

Zichen Liu,Anya Sims,Keyu Duan,Changyu Chen,Simon Yu,Xiangxin Zhou,Haotian Xu,Shaopan Xiong,Bo Liu,Chenmien Tan,Chuen Yang Beh,Weixun Wang,Hao Zhu,Weiyan Shi,Diyi Yang,Michael Shieh,Yee Whye Teh,Wee Sun Lee,Min Lin

Main category: cs.LG

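TL;DR: The paper introduces GEM, an open-source environment simulator for experience-based LLM training. It provides a standardized environment-agent interface, a diverse suite of environments and tools, and benchmarks comparing RL algorithms.

Details Motivation: As LLM training moves from static datasets to experience-based learning, a standardized and efficient framework for environment-agent interaction is needed.

Contribution: Proposes GEM, a standardized environment simulator analogous to OpenAI-Gym, with high throughput, easy extensibility, and a diverse suite of environments and tools, intended to accelerate experience-based LLM research.

Method: GEM provides asynchronous vectorized execution and flexible wrappers, ships example scripts for five popular RL training frameworks, and supplies baselines using REINFORCE with Return Batch Normalization (ReBN), which is compatible with dense per-turn rewards and offers better credit assignment.

Result: Provides ReBN baselines across 24 environments and an apples-to-apples comparison of PPO, GRPO, and REINFORCE in single- and multi-turn settings.

Insight: GEM serves as a convenient evaluation toolkit as well as a training environment, supporting future agentic LLM research.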

Abstract: The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which – unlike GRPO – is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.
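The abstract names Return Batch Normalization (ReBN) but does not give its formula. A plausible reading, sketched below under that assumption, is to compute per-turn returns-to-go for every trajectory and then normalize all of them with the batch mean and standard deviation; the exact procedure in GEM may differ.

```python
import math


def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each turn onward for a single trajectory."""
    out, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]


def batch_normalized_returns(batch_rewards, gamma=1.0, eps=1e-8):
    """Normalize every per-turn return by the mean and std over the whole batch."""
    per_traj = [returns_to_go(rewards, gamma) for rewards in batch_rewards]
    flat = [g for traj in per_traj for g in traj]
    if not flat:
        raise ValueError("batch contains no turns")
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((g - mean) ** 2 for g in flat) / len(flat))
    return [[(g - mean) / (std + eps) for g in traj] for traj in per_traj]


# Two hypothetical multi-turn episodes with dense per-turn rewards.
normalized = batch_normalized_returns([[1.0, 0.0, 2.0], [0.0, 0.0]])
```

Because each turn keeps its own return, turns within one episode can receive different learning signals, which is the credit-assignment property the abstract contrasts with GRPO's single group-level score.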

[130] A Practitioner’s Guide to Multi-turn Agentic Reinforcement Learning

Ruiyi Wang,Prithviraj Ammanabrolu

Main category: cs.LG

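TL;DR: The paper studies what works and what does not when training LLM agents with multi-turn reinforcement learning, organizes the design space into three pillars (environment, reward, policy), and distills a training recipe.

Details Motivation: Existing frameworks and definitions for multi-turn agentic RL are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks.

Contribution: Breaks the design space into three interrelated pillars (environment, reward, policy) and empirically derives a training recipe.

Method: Experiments on TextWorld, ALFWorld, and SWE-Gym analyze how task complexity affects the environment signal, how reward sparsity affects training, and how biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods compare.

Result: Even simple environments provide signal on how well an agent generalizes to more complex tasks; dense turn-level rewards accelerate training, but performance and stability depend heavily on the RL algorithm; and the paper shows how to find the optimal SFT-to-RL training ratio under a fixed budget.

Insight: Co-designing the environment, the reward, and the policy is critical to agent performance, and the resulting recipe is intended for practical use.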

Abstract: We study what actually works and what doesn’t for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars – environment, reward, and policy – and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) And for the agent’s policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
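RLOO (REINFORCE Leave-One-Out), the unbiased method named in point (iii), uses as each sample's baseline the mean reward of the other samples drawn for the same prompt. The sketch below shows that baseline for scalar episode rewards; it is a generic illustration rather than the authors' implementation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Three hypothetical rollouts for one prompt, only the first succeeds.
advantages = rloo_advantages([1.0, 0.0, 0.0])
```

Since a sample's baseline never includes its own reward, the baseline is independent of that sample, which is the usual argument for why the resulting gradient estimate stays unbiased.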

[131] Prompt Curriculum Learning for Efficient LLM Post-Training

Zhaolin Gao,Joongwon Kim,Wen Sun,Thorsten Joachims,Sid Wang,Richard Yuanzhe Pang,Liang Tan

Main category: cs.LG

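TL;DR: The paper proposes Prompt Curriculum Learning (PCL), a lightweight RL algorithm that uses a learned value model to select intermediate-difficulty prompts for post-training language models, avoiding costly rollouts and speeding up training.

Details Motivation: RL post-training of LLMs is sensitive to batching and prompt selection strategies. PCL aims for a better tradeoff between performance and efficiency by selecting intermediate-difficulty prompts on the fly.

Contribution: 1. The PCL algorithm, which selects intermediate-difficulty prompts dynamically; 2. systematic experiments that determine the optimal training batch size and establish the importance of intermediate-difficulty prompts; 3. evidence of PCL's advantages in speed and performance.

Method: PCL uses a value model, updated concurrently with the current policy, to identify intermediate-difficulty prompts in an on-policy manner without costly rollouts, and it shifts toward progressively harder prompts during RL.

Result: Compared with rollout-based filtering, PCL identifies intermediate-difficulty prompts 12.1x and 16.9x faster when training on MATH and DeepScaleR, respectively, and either reaches the highest performance or needs significantly less time to reach comparable performance.

Insight: Intermediate-difficulty prompts are key to RL training, and selecting them dynamically improves training efficiency while keeping model performance high.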

Abstract: We introduce Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that selects intermediate-difficulty prompts using a learned value model to post-train language models. Since post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first conduct a series of systematic experiments where we (1) determine the optimal training batch size that balances generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. We build upon these results to design PCL, which identifies prompts of intermediate difficulty for the current policy in an on-policy manner by using a value model that is concurrently updated based on the current policy. By focusing on informative prompts that yield high effective ratios, PCL achieves either the highest performance or requires significantly less time to reach comparable performance to its counterparts. Compared to rollout-based filtering methods, PCL avoids costly rollouts and achieves $12.1\times$ and $16.9\times$ faster speed on identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. Our results present a new methodology that delivers improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
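The selection step can be pictured as ranking candidate prompts by how close their predicted success rate is to an intermediate target. The sketch below assumes a target of 0.5 and a hypothetical `value_fn` that returns the value model's predicted success probability; the abstract does not state the exact criterion, so both are assumptions.

```python
def select_intermediate(prompts, value_fn, k, target=0.5):
    """Pick the k prompts whose predicted success rate is closest to target."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return sorted(prompts, key=lambda p: abs(value_fn(p) - target))[:k]


# Hypothetical value-model predictions for four candidate prompts.
predicted_success = {"easy": 0.95, "mid_a": 0.55, "mid_b": 0.4, "hard": 0.05}
batch = select_intermediate(list(predicted_success), predicted_success.get, k=2)
```

Scoring a prompt this way needs one value-model forward pass instead of several full generations, which is where the speedup over rollout-based filtering reported in the abstract would come from.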

[132] Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

Yiran Shen,Yu Xia,Jonathan Chang,Prithviraj Ammanabrolu

Main category: cs.LG

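TL;DR: The paper proposes a unified framework that standardizes process reward model training and uses a multi-objective alignment method to align a model simultaneously across verifiable and non-verifiable reward domains, with fine-grained user control at inference time.

Details Motivation: Aligning large language models to human preferences is inherently multidimensional, yet existing methods usually collapse heterogeneous signals into a single objective, leading to inefficient training and little user control.

Contribution: Proposes a unified framework comprising standardized process reward model (PRM) training, multi-objective alignment (MAH-DPO with vectorized rewards), and inference-time user control.

Method: 1. Standardize PRM training; 2. Perform multi-objective alignment with MAH-DPO and a vectorized reward; 3. Provide fine-grained user control at inference time.

Result: On math reasoning, value alignment, and multi-turn dialogue tasks, the framework improves performance across multiple objectives simultaneously, reduces conflicts between objectives, and enables flexible user control.

Insight: Vectorized rewards combined with multi-objective alignment balance conflicts between objectives better than a single scalar reward, offering a new approach to model alignment.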

Abstract: Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single optimizable objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards (mathematical accuracy), non-verifiable subjective preferences (human values), and complex interactive scenarios (multi-turn AI tutoring dialogues). Such multi-objective reinforcement learning setups are often plagued by the individual objectives being at odds with each other, resulting in inefficient training and little user control during inference. We propose a unified framework that: (i) standardizes process reward model (PRM) training across both verifiable and non-verifiable settings to better supervise models' chain-of-thought reasoning; (ii) performs multi-objective alignment by training the LLM with our Multi-Action-Head DPO (MAH-DPO) and a vectorized reward where the dimensions of the vector correspond to the various objectives instead of a single scalar; and (iii) demonstrates how such a system provides fine-grained inference-time user control. Experiments across math reasoning, value alignment, and multi-turn dialogue show that our framework improves performance across multiple objectives simultaneously, while minimizing cross-objective trade-offs and enabling flexible inference-time user control. The code can be found at https://github.com/pearls-lab/multiobj-align.

[133] BroRL: Scaling Reinforcement Learning via Broadened Exploration

Jian Hu,Mingjie Liu,Ximing Lu,Fang Wu,Zaid Harchaoui,Shizhe Diao,Yejin Choi,Pavlo Molchanov,Jun Yang,Jan Kautz,Yi Dong

Main category: cs.LG

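TL;DR: BroRL is a complementary way to scale reinforcement learning: it increases the number of rollouts per example, overcoming the performance saturation that ProRL reaches as training steps grow and yielding continued gains.

Details Motivation: ProRL scales RL by adding training steps, but performance plateaus after a few thousand steps. BroRL broadens exploration by increasing the number of rollouts to break through this plateau.

Contribution: 1. Uses a mass balance equation to analyze the rate of change in probability mass for correct and incorrect tokens; 2. Proposes BroRL, which increases the number of rollouts to obtain continued performance gains; 3. Achieves state-of-the-art results for a 1.5B model.

Method: Motivated by the mass balance equation analysis, BroRL increases the number of rollouts per example to hundreds, so that the probability mass of correct tokens expands overall.

Result: BroRL keeps improving models that had saturated after 3K ProRL training steps, and the 1.5B model achieves state-of-the-art results across diverse benchmarks.

Insight: Broadening exploration (more rollouts per example) is an effective way to scale RL, delivering gains past the point where adding training steps alone saturates.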

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N (corresponding to ample exploration) guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.

[134] Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment

Suhyeon Lee,Jong Chul Ye

Main category: cs.LG

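TL;DR: PromptLoop is a plug-and-play RL framework that refines prompts step by step using latent feedback; it improves diffusion model alignment with better generalization and robustness.

Details Motivation: Existing RL-based fine-tuning of diffusion models struggles with generalization, composability, and robustness against reward hacking, while most existing prompt refinement methods are feed-forward and fail to fully exploit the sequential nature of RL.

Contribution: Proposes PromptLoop, an RL framework for step-wise prompt refinement based on latent feedback, which leaves the diffusion model weights unchanged and instead uses an MLLM to update prompts dynamically.

Method: A multimodal large language model (MLLM) is trained with RL as the policy and iteratively updates prompts based on the intermediate latent states of the diffusion model. The design is structurally analogous to Diffusion RL while retaining the flexibility of prompt-based alignment.

Result: Experiments show that PromptLoop optimizes rewards effectively, generalizes to unseen models, composes orthogonally with existing alignment methods, and mitigates over-optimization and reward hacking.

Insight: Refining prompts dynamically from latent feedback is an efficient alignment strategy that shows the benefits of a modular design for diffusion model alignment.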

Abstract: Despite the recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, here we introduce PromptLoop, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking.

[135] Rehearsal-free and Task-free Online Continual Learning With Contrastive Prompt

Aopeng Wang,Ke Deng,Yongli Ren,Jun Luo

Main category: cs.LG

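TL;DR: The paper proposes a method for rehearsal-free and task-free online continual learning (F2OCL) that integrates prompt learning with an NCM classifier to tackle catastrophic forgetting.

Details Motivation: Existing online continual learning (OCL) methods either store samples in a rehearsal buffer or rely on task boundaries. Storing samples may raise data security or privacy concerns, and task boundaries are hard to identify in a single pass over the data. The study therefore investigates OCL without stored samples and without task identities.

Contribution: The main contribution is a new method for F2OCL that combines prompt learning with an NCM classifier, avoiding both sample storage and dependence on task boundaries while addressing catastrophic forgetting.

Method: The core of the method is to combine prompt learning with an NCM (Nearest Class Mean) classifier, adjusting prompts dynamically to adapt to new data while retaining previously learned knowledge.

Result: Extensive experiments on two benchmarks demonstrate the effectiveness of the method in avoiding catastrophic forgetting.

Insight: The paper shows the potential of prompt learning for continual learning and offers a feasible route to OCL without stored samples or task identities.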

Abstract: The main challenge of continual learning is catastrophic forgetting. Because data are processed in a single pass, online continual learning (OCL) is one of the most difficult continual learning scenarios. To address catastrophic forgetting in OCL, some existing studies use a rehearsal buffer to store samples and replay them later in the learning process; other studies do not store samples but assume a sequence of learning tasks so that the task identities can be exploited. However, storing samples may raise data security or privacy concerns, and it is not always possible to identify the boundaries between learning tasks in one pass of data processing. This motivates us to investigate rehearsal-free and task-free OCL (F2OCL). By integrating prompt learning with an NCM classifier, this study effectively tackles catastrophic forgetting without storing samples and without using task boundaries or identities. Extensive experimental results on two benchmarks demonstrate the effectiveness of the proposed method.
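To make the NCM component concrete, here is a minimal sketch of a Nearest Class Mean classifier updated one sample at a time. It stores only a running mean and a count per class, never the samples themselves. The class name `OnlineNCM` and the toy 2-D features are assumptions for illustration; in the paper's setting the features would presumably come from a prompted backbone, which is not modeled here, and the contrastive prompt learning is not shown.

```python
import math


class OnlineNCM:
    """Nearest Class Mean classifier with single-pass running means."""

    def __init__(self):
        self.means = {}   # label -> running mean feature vector
        self.counts = {}  # label -> number of samples seen

    def update(self, feature, label):
        # Incremental mean: m_n = m_{n-1} + (x - m_{n-1}) / n
        n = self.counts.get(label, 0) + 1
        mean = self.means.get(label, [0.0] * len(feature))
        self.means[label] = [m + (x - m) / n for m, x in zip(mean, feature)]
        self.counts[label] = n

    def predict(self, feature):
        # Assign the class whose mean is nearest in Euclidean distance.
        return min(self.means, key=lambda c: math.dist(self.means[c], feature))


clf = OnlineNCM()
for feat, lab in [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]:
    clf.update(feat, lab)
print(clf.predict([1.0, 1.0]), clf.predict([9.0, 9.0]))
```

Because new classes simply add a new mean, the classifier needs neither task boundaries nor a replay buffer.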

cs.MA [Back]

[136] Stochastic Self-Organization in Multi-Agent Systems

Nurbek Tastan,Samuel Horvath,Karthik Nandakumar

Main category: cs.MA

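TL;DR: The paper proposes SelfOrg, a framework that adapts the communication structure of a multi-agent system on the fly for efficient collaboration. It uses an approximation of the Shapley value to assess agent contributions and builds a directed acyclic graph (DAG) to regulate information propagation, with no additional supervision or training. Experiments show significant gains over existing methods with weak LLM backends.

Details Motivation: Collaboration mechanisms in existing multi-agent systems (MAS) typically rely on fixed topologies or external LLM judges, which adds complexity. The goal is to improve agent collaboration through dynamic communication, particularly with weak LLM backends.

Contribution: Proposes SelfOrg, which dynamically adapts the communication structure among agents; assesses contributions with an approximate Shapley value and builds a DAG; requires no additional supervision or training; performs strongly with weak LLM backends.

Method: 1. Agents independently generate responses to the user query; 2. Peer contributions are assessed with an approximation of the Shapley value; 3. A DAG is constructed to regulate response propagation; 4. The graph is updated dynamically from the previous round's responses.

Result: SelfOrg is robust with both strong and weak LLM backends, with significant gains in the weak regime where prior methods collapse. The theoretical analysis shows that multiple agents increase the chance of a correct answer.

Insight: A dynamic communication structure improves collaboration in multi-agent systems; Shapley-value assessment and DAG construction are the key ingredients for efficient information propagation.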

Abstract: Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.

cs.RO [Back]

[137] VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators

Hengtao Li,Pengxiang Ding,Runze Suo,Yihao Wang,Zirui Ge,Dongyuan Zang,Kexian Yu,Mingyang Sun,Hongyin Zhang,Donglin Wang,Weihua Su

Main category: cs.RO

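TL;DR: VLA-RFT is a world-model-based reinforcement fine-tuning framework that uses a data-driven, controllable simulator to lower sample requirements and improve the generalization and robustness of vision-language-action models, surpassing supervised baselines with fewer than 400 fine-tuning steps.

Details Motivation: Vision-language-action (VLA) models rely on imitation learning, which leads to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues but requires costly real-world interaction or suffers from sim-to-real gaps.

Contribution: Proposes the VLA-RFT framework, which uses a data-driven world model as a controllable simulator and trajectory-level reward signals for efficient reinforcement fine-tuning, drastically lowering sample requirements and improving robustness.

Method: A world model trained on real interaction data predicts future visual observations conditioned on actions. Policy rollouts in this simulator receive dense, trajectory-level rewards derived from goal-achieving references, which drive the reinforcement fine-tuning.

Result: With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and remains stable under perturbed conditions.

Insight: World-model-based reinforcement fine-tuning is an efficient post-training paradigm that improves the generalization and robustness of VLA models.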

Abstract: Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.

[138] Hybrid Training for Vision-Language-Action Models

Pietro Mazzaglia,Cansu Sancaktar,Markus Peschl,Daniel Dijkman

Main category: cs.RO

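TL;DR: The paper proposes Hybrid Training (HyT), a framework in which a Vision-Language-Action model learns to conditionally predict diverse outputs, so that at inference time it can either generate a chain-of-thought (CoT) or predict actions directly, gaining performance without added inference time.

Details Motivation: In robotics, CoT improves performance but lengthens inference, which hurts real-time usability. HyT addresses this tension by letting the model learn from CoT during training while skipping it at inference.

Contribution: The core contribution of HyT is inference-time flexibility: the model can generate CoT or predict actions directly, keeping the performance gains without sacrificing inference speed.

Method: HyT trains the model to conditionally predict a diverse set of outputs (actions, thoughts, or instruction following), so that it learns CoT generation during training and can optionally skip it at inference.

Result: In simulated benchmarks and real-world experiments, HyT is reported to improve performance while avoiding the inference-time cost of CoT generation.

Insight: Generating long chains-of-thought at inference is not a strict prerequisite for the performance gains; HyT's flexible design balances performance and efficiency.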

Abstract: Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model’s generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent’s actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.

[139] HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo,Daewon Choi,Taeyoung Kim,Kyungmin Lee,Changyeon Kim,Youngyo Seo,Jinwoo Shin

Main category: cs.RO

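TL;DR: HAMLET is a framework that turns a vision-language-action (VLA) model into a history-aware policy, using compact per-timestep encodings and a lightweight memory module to improve performance on long-horizon tasks.

Details Motivation: Existing VLA models ignore historical context, although robotic tasks are often history-dependent. HAMLET addresses this gap to improve performance on tasks that depend on history.

Contribution: Proposes the HAMLET framework, which integrates historical information through moment tokens and a memory module and markedly improves VLA performance, especially on long-horizon tasks.

Method: 1. Introduce moment tokens that compactly encode perceptual information at each timestep; 2. Initialize them with time-contrastive learning; 3. Integrate historical features with a lightweight memory module.

Result: On history-dependent real-world tasks, the average success rate reaches 76.4%, surpassing the baseline by 47.2%. Performance also improves on RoboCasa Kitchen (64.1% to 66.4%) and LIBERO (95.6% to 97.7%).

Insight: Historical context is crucial for robotic tasks, and HAMLET shows how to incorporate it into VLA models efficiently.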

Abstract: Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.

q-bio.QM [Back]

[140] Behavioural Classification in C. elegans: a Spatio-Temporal Analysis of Locomotion

Nemanja Antonic,Monika Scholz,Aymeric Vellinger,Euphrasie Ramahefarivo,Elio Tuci

Main category: q-bio.QM

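TL;DR: The paper presents a method for extracting behavioural units from C. elegans locomotion without requiring a clear view of the entire worm body. The units are defined by an unsupervised automatic pipeline, which avoids the bias of predefined assumptions.

Details Motivation: Current methods require a clear view of the entire worm body, which is hard to obtain in high-density conditions. A method that extracts behavioural units without full-body information is therefore needed to study how social context influences individual behaviour.

Contribution: The main contributions are: 1) a method for extracting behavioural units that does not need a full-body view; 2) an unsupervised automatic pipeline that defines the units and avoids human bias; 3) evidence that spatio-temporal locomotory patterns from single-point tracking can support behavioural classification.

Method: The method comprises: 1) an unsupervised automatic pipeline that extracts behavioural units from worm movements; 2) validation with an agent-based model that simulates worm movement; 3) interpretation of the automatically extracted units by comparison with hand-designed ones.

Result: Spatio-temporal locomotory patterns emerge even from single-point tracking, and these patterns are a fundamental aspect of behavioural classification. Effectiveness is evaluated by how closely the movement of the simulated worm matches that of a natural worm.

Insight: Spatio-temporal locomotory patterns remain identifiable in high-density conditions, and the unsupervised automatic method avoids the bias of hand-designed behavioural units, giving a more objective tool for behavioural analysis.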

Abstract: The 1 mm roundworm C. elegans is a model organism used in many sub-areas of biology to investigate different types of biological processes. In order to complement the in-vivo analysis with computer-based investigations, several methods have been proposed to simulate the worm behaviour. These methods extract discrete behavioural units from the flow of the worm movements using different types of tracking techniques. Nevertheless, these techniques require a clear view of the entire worm body, which is not always achievable. For example, this happens in high density worm conditions, which are particularly informative to understand the influence of the social context on the single worm behaviour. In this paper, we illustrate and evaluate a method to extract behavioural units from recordings of C. elegans movements which do not necessarily require a clear view of the entire worm body. Moreover, the behavioural units are defined by an unsupervised automatic pipeline which frees the process from predefined assumptions that inevitably bias the behavioural analysis. The behavioural units resulting from the automatic method are interpreted by comparing them with hand-designed behavioural units. The effectiveness of the automatic method is evaluated by measuring the extent to which the movement of a simulated worm, with an agent-based model, matches the movement of a natural worm. Our results indicate that spatio-temporal locomotory patterns emerge even from single point worm tracking. Moreover, we show that such patterns represent a fundamental aspect of the behavioural classification process.

cs.GR [Back]

[141] Motion In-Betweening for Densely Interacting Characters

Xiaotang Zhang,Ziyi Chang,Qianhui Men,Hubert P. H. Shum

Main category: cs.GR

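TL;DR: The paper proposes a motion in-betweening method for densely interacting characters. It uses cross-space modeling and adversarial learning to synthesize long-horizon motion for interacting characters while preserving motion quality and interaction stability.

Details Motivation: Traditional motion in-betweening targets single characters; extending it to densely interacting characters raises challenges in spatial-temporal correspondence and natural transitions. The paper aims to synthesize long-horizon motion in which two characters interact naturally.

Contribution: 1) Proposes Cross-Space In-Betweening, which models each character's interactions across different representation spaces; 2) Identifies periodic interaction patterns through adversarial learning to sustain interaction quality; 3) Learns to refine the drifted latent space to prevent pose error accumulation and maintain motion quality.

Method: 1) Cross-Space In-Betweening to model character interactions; 2) Adversarial learning to identify periodic interaction patterns; 3) Latent-space refinement to prevent error accumulation in long-horizon synthesis.

Result: Experiments show that the method produces realistic, controllable, long-horizon interaction motions (such as boxing and dancing), supported by quantitative evaluations and user studies.

Insight: In-betweening for interacting characters must account for both spatial-temporal correspondence and motion quality; adversarial learning and latent-space refinement are effective means of addressing both.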

Abstract: Motion in-betweening is the problem of synthesizing movement between keyposes. Traditional research focused primarily on single characters. Extending these methods to densely interacting characters is highly challenging, as it demands precise spatial-temporal correspondence between the characters to maintain the interaction, while creating natural transitions towards predefined keyposes. In this research, we present a method for long-horizon interaction in-betweening that enables two characters to engage and respond to one another naturally. To effectively represent and synthesize interactions, we propose a novel solution called Cross-Space In-Betweening, which models the interactions of each character across different conditioning representation spaces. We further observe that the significantly increased constraints in interacting characters heavily limit the solution space, leading to degraded motion quality and diminished interaction over time. To enable long-horizon synthesis, we present two solutions to maintain long-term interaction and motion quality, thereby keeping synthesis in the stable region of the solution space. We first sustain interaction quality by identifying periodic interaction patterns through adversarial learning. We further maintain the motion quality by learning to refine the drifted latent space and prevent pose error accumulation. We demonstrate that our approach produces realistic, controllable, and long-horizon in-between motions of two characters with dynamic boxing and dancing actions across multiple keyposes, supported by extensive quantitative evaluations and user studies.

[142] ReSWD: ReSTIR’d, not shaken. Combining Reservoir Sampling and Sliced Wasserstein Distance for Variance Reduction

Mark Boss,Andreas Engelhardt,Simon Donné,Varun Jampani

Main category: cs.GR

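TL;DR: ReSWD combines Weighted Reservoir Sampling with the Sliced Wasserstein Distance, adaptively retaining informative projection directions to reduce variance and obtain stable gradients and faster convergence.

Details Motivation: The Wasserstein distance is too costly to compute for high-dimensional distributions. The Sliced Wasserstein Distance (SWD) scales better, but its Monte Carlo estimator has high variance, which causes noisy gradients and slow convergence.

Contribution: Proposes ReSWD, which uses Weighted Reservoir Sampling to adaptively retain informative projection directions during optimization, lowering the variance of SWD while remaining unbiased.

Method: Weighted Reservoir Sampling is integrated into SWD to select, on the fly, the projection directions that matter during optimization.

Result: On synthetic benchmarks and real-world tasks such as color correction and diffusion guidance, ReSWD outperforms standard SWD and other variance reduction baselines.

Insight: Combining reservoir sampling with SWD stabilizes the optimization and suits tasks that need efficient distribution matching.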

Abstract: Distribution matching is central to many vision and graphics tasks, where the widely used Wasserstein distance is too costly to compute for high dimensional distributions. The Sliced Wasserstein Distance (SWD) offers a scalable alternative, yet its Monte Carlo estimator suffers from high variance, resulting in noisy gradients and slow convergence. We introduce Reservoir SWD (ReSWD), which integrates Weighted Reservoir Sampling into SWD to adaptively retain informative projection directions in optimization steps, resulting in stable gradients while remaining unbiased. Experiments on synthetic benchmarks and real-world tasks such as color correction and diffusion guidance show that ReSWD consistently outperforms standard SWD and other variance reduction baselines. Project page: https://reservoirswd.github.io/
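For orientation, the baseline Monte Carlo SWD estimator that ReSWD builds on can be sketched in a few lines. This is plain SWD with freshly sampled random directions, not the reservoir-based method of the paper; the function name `sliced_wasserstein`, the equal-size assumption on the two point sets, and the toy data are illustrative choices.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Assumes `xs` and `ys` are equal-size lists of points of the same
    dimension. Each random direction gives a 1-D problem, where the optimal
    transport cost is the mean squared gap between sorted projections.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # A normalized Gaussian vector is uniform on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        px = sorted(sum(a * b for a, b in zip(p, d)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, d)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / num_projections


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in points]
same = sliced_wasserstein(points, points)
moved = sliced_wasserstein(points, shifted)
print(same, moved)
```

The estimate depends on which directions happen to be drawn, which is the source of the variance that the abstract attributes to the standard estimator.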

[143] Audio Driven Real-Time Facial Animation for Social Telepresence

Jiye Lee,Chenghui Li,Linh Tran,Shih-En Wei,Jason Saragih,Alexander Richard,Hanbyul Joo,Shaojie Bai

Main category: cs.GR

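TL;DR: The paper presents an audio-driven real-time facial animation system that uses a diffusion model and a novel architecture to produce high-quality 3D facial expression animation with low latency (<15 ms), aimed at social interaction in virtual reality.

Details Motivation: Existing facial animation techniques usually trade off real-time performance against quality, and struggle to meet the low-latency, high-realism requirements of social interaction in virtual reality.

Contribution: 1. Presents a real-time facial animation system driven by audio signals; 2. Uses a diffusion model with a novel architecture (an online transformer and a distillation pipeline) to cut latency; 3. Supports multimodal inputs such as emotion conditions and eye cameras on VR headsets.

Method: 1. An encoder transforms audio signals into latent facial expression sequences; 2. A diffusion model generates high-quality facial animation; 3. An online transformer removes the dependency on future inputs, and a distillation pipeline accelerates iterative denoising into a single step.

Result: Experiments show better facial animation accuracy than existing offline methods, with 100 to 1000 times faster inference, validated in scenarios such as multilingual speech.

Insight: The system shows the potential of diffusion models for real-time facial animation and addresses the challenges of live processing through architectural innovation, providing a new tool for social interaction in virtual reality.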

Abstract: We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.

eess.IV [Back]

[144] Enhancing Safety in Diabetic Retinopathy Detection: Uncertainty-Aware Deep Learning Models with Rejection Capabilities

Madhushan Ramalingam,Yaish Riaz,Priyanthi Rajamanoharan,Piyumi Dasanayaka

Main category: eess.IV

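TL;DR: The paper studies uncertainty-aware deep learning models for diabetic retinopathy detection, with a rejection mechanism that declines low-confidence predictions to make clinical diagnosis safer.

Details Motivation: Early diagnosis of diabetic retinopathy is critical, but existing deep learning models give no explicit indication of prediction confidence, which creates uncertainty in clinical decisions.

Contribution: The main contribution is an uncertainty-aware model with a rejection mechanism that declines to predict when confidence is low, improving reliability and safety.

Method: A variational Bayesian model is combined with a rejection mechanism and evaluated with key metrics: accuracy on accepted predictions, coverage, rejection ratio, and Expected Calibration Error (ECE).

Result: The results show a trade-off between accuracy and caution; uncertainty estimation and selective rejection improve the model's reliability in safety-critical diagnostic scenarios.

Insight: In safety-critical domains such as medical diagnosis, uncertainty awareness and a rejection mechanism are needed to balance coverage against reliability.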

Abstract: Diabetic retinopathy (DR) is a major cause of visual impairment, and effective treatment options depend heavily on timely and accurate diagnosis. Deep learning models have demonstrated great success identifying DR from retinal images. However, relying only on predictions made by models, without any indication of model confidence, creates uncertainty and poses significant risk in clinical settings. This paper investigates an alternative in uncertainty-aware deep learning models, including a rejection mechanism to reject low-confidence predictions, contextualized by deferred decision-making in clinical practice. The results show there is a trade-off between prediction coverage and coverage reliability. The Variational Bayesian model adopted a more conservative strategy when predicting DR, subsequently rejecting the uncertain predictions. The model is evaluated by means of important performance metrics such as Accuracy on accepted predictions, the proportion of accepted cases (coverage), the rejection-ratio, and Expected Calibration Error (ECE). The findings also demonstrate a clear trade-off between accuracy and caution, establishing that the use of uncertainty estimation and selective rejection improves the model’s reliability in safety-critical diagnostic use cases.
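The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes each prediction comes with a scalar confidence in [0, 1] and that rejection is a simple threshold on that confidence; how the paper's variational Bayesian model derives its confidence is not specified here, and the function names and toy data are hypothetical.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, rejection_ratio, accuracy_on_accepted)."""
    accepted = [i for i, c in enumerate(confidences) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    if accepted:
        correct = sum(predictions[i] == labels[i] for i in accepted)
        accuracy = correct / len(accepted)
    else:
        accuracy = float("nan")  # nothing accepted, accuracy undefined
    return coverage, 1.0 - coverage, accuracy


def expected_calibration_error(confidences, predictions, labels, num_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy|."""
    bins = [[] for _ in range(num_bins)]
    for c, p, y in zip(confidences, predictions, labels):
        idx = min(int(c * num_bins), num_bins - 1)
        bins[idx].append((c, p == y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece


conf = [0.95, 0.90, 0.55, 0.60, 0.85]
pred = [1, 0, 1, 0, 1]
true = [1, 0, 0, 0, 0]
coverage, rejection, accuracy = selective_metrics(conf, pred, true, threshold=0.8)
ece = expected_calibration_error(conf, pred, true)
print(coverage, rejection, accuracy, ece)
```

Raising the threshold lowers coverage and typically raises accuracy on the accepted cases, which is the trade-off the abstract describes.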

[145] Deep Learning Approaches with Explainable AI for Differentiating Alzheimer Disease and Mild Cognitive Impairment

Fahad Mostafa,Kannon Hossain,Hafiz Khan

Main category: eess.IV

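TL;DR: The paper proposes a hybrid deep learning ensemble framework, combined with explainable AI techniques, for distinguishing Alzheimer's disease from mild cognitive impairment, and reports strong classification performance on the ADNI dataset.

Details Motivation: Early and accurate diagnosis of Alzheimer's disease is critical for clinical intervention, but mild cognitive impairment, its prodromal stage, involves subtle structural changes that are hard to distinguish. A high-performing and interpretable method is needed to support diagnosis.

Contribution: The main contributions are: 1) a hybrid deep learning ensemble framework built on three pretrained CNNs (ResNet50, NASNet, and MobileNet); 2) further gains from stacked ensemble learning and weighted averaging; 3) explainable AI (Grad-CAM) that produces heatmaps and attribution maps to explain model decisions.

Method: The method: 1) uses gray and white matter slices as inputs; 2) fine-tunes the three pretrained CNNs; 3) combines them with stacked ensemble learning and weighted averaging; 4) applies Grad-CAM to generate explanatory heatmaps.

Result: On the ADNI dataset, the method reaches 99.21% accuracy for Alzheimer's disease vs. mild cognitive impairment and 91.0% for mild cognitive impairment vs. normal controls, outperforming conventional transfer learning and baseline ensemble methods.

Insight: The study shows the potential of deep learning for diagnosing neurodegenerative diseases; explainable AI makes the model more transparent and helps identify structural biomarkers that can support clinical decisions.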

Abstract: Early and accurate diagnosis of Alzheimer Disease is critical for effective clinical intervention, particularly in distinguishing it from Mild Cognitive Impairment, a prodromal stage marked by subtle structural changes. In this study, we propose a hybrid deep learning ensemble framework for Alzheimer Disease classification using structural magnetic resonance imaging. Gray and white matter slices are used as inputs to three pretrained convolutional neural networks (ResNet50, NASNet, and MobileNet), each fine-tuned through an end-to-end process. To further enhance performance, we incorporate a stacked ensemble learning strategy with a meta-learner and weighted averaging to optimally combine the base models. Evaluated on the Alzheimer Disease Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art accuracy of 99.21% for Alzheimer Disease vs. Mild Cognitive Impairment and 91.0% for Mild Cognitive Impairment vs. Normal Controls, outperforming conventional transfer learning and baseline ensemble methods. To improve interpretability in image-based diagnostics, we integrate Explainable AI techniques through Gradient-weighted Class Activation Mapping, which generates heatmaps and attribution maps that highlight critical regions in gray and white matter slices, revealing structural biomarkers that influence model decisions. These results highlight the framework's potential for robust and scalable clinical decision support in neurodegenerative disease diagnostics.
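The weighted-averaging step can be sketched as follows. This shows only the averaging of per-model class probabilities; the stacked meta-learner is not modeled, and the weights, probabilities, and function name `weighted_average_ensemble` are made-up values for illustration rather than anything reported in the paper.

```python
def weighted_average_ensemble(prob_lists, weights):
    """Combine per-model class probabilities with a normalized weighted mean."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[k] for w, probs in zip(weights, prob_lists)) / total
        for k in range(num_classes)
    ]


# Hypothetical softmax outputs of three base models for one scan
# (class 0 and class 1), with hypothetical ensemble weights.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
ensemble = weighted_average_ensemble(model_probs, weights=[0.5, 0.2, 0.3])
predicted_class = max(range(len(ensemble)), key=lambda k: ensemble[k])
print(ensemble, predicted_class)
```

In practice the weights would be chosen on a validation set, for example in proportion to each base model's validation accuracy.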

[146] DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

Yucheng Xing,Ling Huang,Jingying Ma,Ruping Hong,Jiangdong Qiu,Pei Liu,Kai He,Huazhu Fu,Mengling Feng

Main category: eess.IV

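TL;DR: DPsurv is a dual-prototype evidential fusion network for survival prediction from whole-slide images that is uncertainty-aware and interpretable; its effectiveness and reliability are validated on five public datasets.

Details Motivation: Existing whole-slide image survival analysis methods generally lack interpretability and overlook predictive uncertainty. DPsurv aims to address both issues.

Contribution: Proposes the DPsurv network, which outputs uncertainty-aware survival intervals and improves interpretability through prototype assignment maps and risk aggregation.

Method: A dual-prototype evidential fusion network outputs uncertainty-aware predictions and provides multi-level explanations through prototype maps and component-wise risk aggregation.

Result: Across five public datasets, DPsurv achieves the highest mean C-index and the lowest mean integrated Brier score, validating its effectiveness and reliability.

Insight: DPsurv's transparent design increases trust in the model and its clinical applicability, offering a new approach to cancer prognosis analysis.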

Abstract: Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
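As background for the reported metric, the concordance index can be computed as in the sketch below, which follows the standard Harrell definition. It assumes the model is summarized by one scalar risk score per patient (higher means shorter expected survival); DPsurv itself outputs survival intervals, so this reduction, the function name, and the toy data are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    time_i < time_j. It is concordant when the earlier failure has the
    higher risk score; ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]           # follow-up times
events = [1, 1, 0, 1]          # 1 = event observed, 0 = censored
risks = [0.9, 0.7, 0.5, 0.1]   # hypothetical predicted risk scores
c_good = concordance_index(times, events, risks)
c_bad = concordance_index(times, events, risks[::-1])
print(c_good, c_bad)
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 to a fully reversed ranking.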

[147] Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study

Kiran Nijjer,Ryan Bui,Derek Jiu,Adnan Ahmed,Peter Wang,Benjamin Liu,Kevin Zhu,Lilly Zhu

Main category: eess.IV

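TL;DR: The paper analyzes skin tone bias in SkinGPT-4, a large vision-language model, for dermatological diagnosis, proposes fine-tuning strategies to reduce the bias, and validates the approach through clinical evaluation.

Details Motivation: The training data of existing dermatology models such as SkinGPT-4 predominantly represents lighter skin tones, which lowers diagnostic accuracy for darker tones. This bias can harm equity in healthcare.

Contribution: 1. Evaluates performance bias in SkinGPT-4 across skin tones; 2. Proposes fine-tuning strategies to reduce skin tone bias; 3. Validates the approach through clinical evaluation.

Method: 1. Evaluate performance bias on the SCIN dataset; 2. Fine-tune the SkinGPT-4 backbone for custom skin disease classification tasks; 3. Assess fairness with metrics such as demographic parity and equalized odds.

Result: The fine-tuned models score better on fairness metrics: average demographic parity is 0.75, compared with 0.10 for SkinGPT-4, and the best model's parity scores for Fitzpatrick types I-VI range from 0.76 to 0.90.

Insight: Large language models show bias in medical applications; targeted fine-tuning combined with fairness evaluation can improve both inclusiveness and accuracy.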

Abstract: SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic-contact dermatitis, and psoriasis using the open-sourced SCIN dataset. We leveraged the SkinGPT-4 backbone to develop finetuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between lightest and darkest tones across evaluation metrics. Model hallucinations in artifacts and anatomy occurred at a rate of 17.8. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the efficacy of training accurate, fair models using existing backbones for custom skin disease classification.

[148] Variable Rate Image Compression via N-Gram Context based Swin-transformer

Priyanka Mudgal,Feng Liu

Main category: eess.IV

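TL;DR: This paper proposes an N-gram context-based Swin Transformer for learned image compression. It achieves variable-rate compression with a single model and improves high-resolution reconstruction quality by enlarging the receptive field.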

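Details Motivation: Because of its restricted receptive field, the existing Swin Transformer neglects larger regions when reconstructing high-resolution images, which degrades reconstruction quality. The paper proposes a modification that improves context awareness and variable-rate compression performance.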

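Contribution: 1. Proposes an N-gram context-based Swin Transformer for variable-rate image compression; 2. Improves high-resolution reconstruction quality by expanding the receptive field; 3. Achieves a -5.86% BD-Rate improvement over existing variable-rate learned image compression techniques.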

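Method: An N-gram context mechanism is incorporated into the Swin Transformer, expanding the receptive field across neighboring windows so that the model considers larger regions during pixel restoration, which improves high-resolution reconstruction quality.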

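Result: The method outperforms existing variable-rate learned image compression techniques, with a -5.86% improvement in BD-Rate, and it improves the quality of regions of interest (ROI) in images.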

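Insight: Combining an N-gram context mechanism with the Swin Transformer expands the model's receptive field and improves high-resolution reconstruction quality, which makes the approach particularly suited to object-focused applications such as manufacturing and industrial vision systems.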

Abstract: This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.

[149] A Fast and Precise Method for Searching Rectangular Tumor Regions in Brain MR Images

Hidenori Takeshima,Shuki Maruyama

Main category: eess.IV

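TL;DR: This paper proposes a fast and precise method for searching rectangular tumor regions in brain MR images. It combines a segmentation network with a fast search method driven by a user-controllable search metric, improving both speed and the quality of the selected regions.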

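Details Motivation: Fast and precise localization of tumor regions in brain MR images is important for diagnosis. The conventional method is slow and selects lower-quality regions, so a more efficient technique is needed.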

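Contribution: 1. Proposes a method that combines an EfficientNet-U-Net segmentation network with a fast search based on summed-area tables. 2. Designs a user-controllable search metric that gives priority to cubes over oblongs and assigns better values to higher tumor fractions.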

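Method: 1. Uses a U-Net whose encoder is replaced by EfficientNet as the segmentation network. 2. Uses summed-area tables to accelerate sums of voxels over rectangular regions, which makes an exhaustive search of the 3D offset (3D full search) practical. 3. Designs a new search metric that gives priority to cubes and assigns better values to higher tumor fractions, even when they exceed the target tumor fraction.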

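Result: The 3D full search takes 8 seconds, 100-500 times faster than the conventional computation (11-40 minutes). The proposed metric yields higher tumor fractions than the conventional metric and prefers cubes, whereas the conventional metric prefers oblongs.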

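Insight: Combining an efficient segmentation network with a fast search method is an effective way to localize regions in medical image analysis. The user-controllable metric adds flexibility, which suits practical clinical use.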

Abstract: Purpose: To develop a fast and precise method for searching rectangular regions in brain tumor images. Methods: The authors propose a new method for searching rectangular tumor regions in brain MR images. The proposed method consisted of a segmentation network and a fast search method with a user-controllable search metric. As the segmentation network, the U-Net whose encoder was replaced by the EfficientNet was used. In the fast search method, summed-area tables were used for accelerating sums of voxels in rectangular regions. Use of the summed-area tables enabled exhaustive search of the 3D offset (3D full search). The search metric was designed for giving priority to cubes over oblongs, and assigning better values for higher tumor fractions even if they exceeded target tumor fractions. The proposed computation and metric were compared with those used in a conventional method using the Brain Tumor Image Segmentation dataset. Results: When the 3D full search was used, the proposed computation (8 seconds) was 100-500 times faster than the conventional computation (11-40 minutes). When the user-controllable parts of the search metrics were changed variously, the tumor fractions of the proposed metric were higher than those of the conventional metric. In addition, the conventional metric preferred oblongs whereas the proposed metric preferred cubes. Conclusion: The proposed method is promising for implementing fast and precise search of rectangular tumor regions, which is useful for brain tumor diagnosis using MRI systems. The proposed computation reduced processing times of the 3D full search, and the proposed metric improved the quality of the assigned rectangular tumor regions.
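The summed-area-table idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary segmentation mask stored as nested lists, a fixed box size, and plain tumor fraction as the score, which stands in for the paper's user-controllable metric (whose formula is not given here). The function names `build_sat`, `box_sum`, and `best_offset` are hypothetical.

```python
def build_sat(vol):
    """Build a zero-padded 3D summed-area table for vol[z][y][x]."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    sat = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(D + 1)]
    for z in range(D):
        for y in range(H):
            for x in range(W):
                # 3D inclusion-exclusion over the already-filled neighbors.
                sat[z + 1][y + 1][x + 1] = (
                    vol[z][y][x]
                    + sat[z][y + 1][x + 1] + sat[z + 1][y][x + 1] + sat[z + 1][y + 1][x]
                    - sat[z][y][x + 1] - sat[z][y + 1][x] - sat[z + 1][y][x]
                    + sat[z][y][x]
                )
    return sat


def box_sum(sat, z0, y0, x0, z1, y1, x1):
    """Sum of voxels in the half-open box [z0,z1) x [y0,y1) x [x0,x1), in O(1)."""
    return (
        sat[z1][y1][x1]
        - sat[z0][y1][x1] - sat[z1][y0][x1] - sat[z1][y1][x0]
        + sat[z0][y0][x1] + sat[z0][y1][x0] + sat[z1][y0][x0]
        - sat[z0][y0][x0]
    )


def best_offset(mask, size):
    """Exhaustively search every 3D offset for the box with the highest tumor fraction."""
    D, H, W = len(mask), len(mask[0]), len(mask[0][0])
    d, h, w = size
    sat = build_sat(mask)
    best_frac, best_pos = -1.0, None
    for z in range(D - d + 1):
        for y in range(H - h + 1):
            for x in range(W - w + 1):
                frac = box_sum(sat, z, y, x, z + d, y + h, x + w) / (d * h * w)
                if frac > best_frac:
                    best_frac, best_pos = frac, (z, y, x)
    return best_frac, best_pos


# Toy mask: a 3x3x3 volume with a 2x2x2 block of tumor voxels at offset (1, 1, 1).
mask = [[[1 if min(z, y, x) >= 1 else 0 for x in range(3)] for y in range(3)] for z in range(3)]
result = best_offset(mask, (2, 2, 2))
```

Each box sum costs eight table lookups regardless of box size, so the cost of the full search scales with the number of candidate offsets rather than with the number of voxels inside each box.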

[150] U-DFA: A Unified DINOv2-Unet with Dual Fusion Attention for Multi-Dataset Medical Segmentation

Zulkaif Sajjad,Furqan Shaukat,Junaid Mir

Main category: eess.IV

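TL;DR: U-DFA is a unified DINOv2-Unet architecture that integrates a Local-Global Fusion Adapter (LGFA) to improve medical image segmentation, achieving state-of-the-art performance on multiple datasets.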

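Details Motivation: Existing approaches that combine CNNs with transformers do not fuse local and global features effectively, while vision-language models (VLMs) and foundation models adapted to medical imaging tasks suffer from a domain gap and high computational cost.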

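Contribution: Proposes U-DFA, a unified DINOv2-Unet architecture that integrates a novel LGFA module to fuse high-level semantic and spatial features while reducing the number of trainable parameters.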

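Method: Uses a DINOv2-Unet encoder-decoder structure in which LGFA modules inject spatial features from a CNN-based Spatial Pattern Adapter (SPA) module into frozen DINOv2 blocks at multiple stages, fusing features stage by stage.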

Result: Achieves state-of-the-art performance on the Synapse and ACDC datasets with only 33% of the trainable model parameters.

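Insight: The LGFA design addresses the fusion of local and global features while reducing the number of trainable parameters, offering a scalable approach to medical image segmentation across multiple imaging modalities.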

Abstract: Accurate medical image segmentation plays a crucial role in overall diagnosis and is one of the most essential tasks in the diagnostic pipeline. CNN-based models, despite their extensive use, suffer from a local receptive field and fail to capture the global context. A common approach that combines CNNs with transformers attempts to bridge this gap but fails to effectively fuse the local and global features. With the recent emergence of VLMs and foundation models, they have been adapted for downstream medical imaging tasks; however, they suffer from an inherent domain gap and high computational cost. To this end, we propose U-DFA, a unified DINOv2-Unet encoder-decoder architecture that integrates a novel Local-Global Fusion Adapter (LGFA) to enhance segmentation performance. LGFA modules inject spatial features from a CNN-based Spatial Pattern Adapter (SPA) module into frozen DINOv2 blocks at multiple stages, enabling effective fusion of high-level semantic and spatial features. Our method achieves state-of-the-art performance on the Synapse and ACDC datasets with only 33% of the trainable model parameters. These results demonstrate that U-DFA is a robust and scalable framework for medical image segmentation across multiple modalities.