Table of Contents

cs.CL

[1] Filtered Reasoning Score: Evaluating Reasoning Quality on a Model’s Most-Confident Traces cs.CL | cs.AI

Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi

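TL;DR: This paper proposes the Filtered Reasoning Score (FRS), a new metric that evaluates the quality of a large language model's (LLM's) reasoning on reasoning tasks rather than answer correctness alone. FRS scores the model's most confident reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality, which separates models that have similar accuracy but different reasoning ability.

Details

Motivation: Outcome-based evaluation (e.g., accuracy) is limited: a model can reach a correct answer through flawed reasoning, and memorization or over-optimization can mask differences in reasoning ability. An evaluation method that goes beyond outcomes and assesses the quality of the reasoning itself is therefore needed.

Result: Experiments show that models that are indistinguishable under standard accuracy differ significantly in reasoning quality when evaluated with FRS. In addition, models with a higher FRS on one benchmark tend to do better on other reasoning benchmarks in both accuracy and reasoning quality.

Insight: The contribution is a multi-dimensional reasoning quality score combined with a confidence-based filter (FRS) that scores only the most confident traces. This reduces the influence of low-confidence traces that are correct by coincidence, so the metric captures a model's transferable reasoning ability more reliably.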

Abstract: Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model’s transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
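
The aggregation step described in the abstract (score only the top-K% most confident traces instead of averaging over all of them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the representation of a trace as a `(confidence, reasoning_score)` pair, the default K, and the sample numbers are all assumptions made for illustration, and the abstract does not say how confidence or the per-trace score is computed.

```python
from math import ceil

def filtered_reasoning_score(traces, top_k_percent=20.0):
    """Mean reasoning score over the top-K% most confident traces.

    traces: list of (confidence, reasoning_score) pairs (hypothetical format).
    top_k_percent: share of traces to keep, in (0, 100].
    """
    if not traces:
        raise ValueError("at least one trace is required")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    # Rank traces from most to least confident.
    ranked = sorted(traces, key=lambda t: t[0], reverse=True)
    # Keep at least one trace even when K% of the sample rounds to zero.
    keep = max(1, ceil(len(ranked) * top_k_percent / 100))
    kept = ranked[:keep]
    return sum(score for _, score in kept) / len(kept)

# Invented sample: five traces with (confidence, reasoning_score).
traces = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.6), (0.4, 0.3), (0.95, 1.0)]
frs = filtered_reasoning_score(traces, top_k_percent=40)
naive_mean = sum(score for _, score in traces) / len(traces)
print(f"FRS (top 40%): {frs:.2f}, naive mean: {naive_mean:.2f}")
```

With this invented sample the filter keeps the two most confident traces, so the filtered score is higher than the naive mean, which is pulled down by low-confidence traces.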


[2] Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision cs.CL

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu

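TL;DR: This paper proposes Self-Distillation Zero (SD-Zero), a post-training method for verifiable settings that addresses two problems: reinforcement learning provides only sparse rewards, and distillation requires external supervision. A single model plays two roles, Generator and Reviser, and uses binary rewards to produce dense token-level self-supervision, which makes training more efficient without an external teacher or high-quality demonstrations. On math and code reasoning benchmarks, SD-Zero substantially improves on the base models and outperforms several baselines.

Details

Motivation: Current post-training methods for verifiable settings have two kinds of problems: reinforcement learning (RLVR) depends on sparse binary rewards, while distillation needs an external teacher or high-quality demonstrations, which are costly or hard to obtain, to supply dense supervision. SD-Zero aims to overcome both limits with a method that needs no external supervision and is more sample-efficient than reinforcement learning.

Result: With Qwen3-4B-Instruct and Olmo-3-7B-Instruct on math and code reasoning benchmarks, SD-Zero improves performance by at least 10% over the base models and, under the same question set and training sample budget, outperforms strong baselines including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT).

Insight: The core idea is a mechanism that converts sparse binary rewards into dense token-level self-supervision. The algorithm shows two novel properties: token-level self-localization (the reviser uses the reward to identify the key tokens in the generator's response that need revision) and iterative self-evolution (improvements in revision ability can be distilled back into generation performance through regular teacher synchronization). This suggests a route to efficient model self-improvement without external supervision.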

Abstract: Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser’s token distributions conditioned on the generator’s response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator’s response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.


[3] Benchmarking Deflection and Hallucination in Large Vision-Language Models cs.CL | cs.AI | cs.CV

Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias

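TL;DR: This paper proposes a dynamic data curation pipeline and uses it to build VLM-DeflectionBench, a benchmark of 2,775 samples for evaluating how large vision-language models behave when visual and textual evidence conflict or retrieved knowledge is incomplete, in particular whether they correctly produce a “cannot answer” response. Experiments on 20 state-of-the-art models find that models usually fail to deflect when the evidence is noisy or misleading.

Details

Motivation: Existing benchmarks overlook conflicts between visual and textual evidence and do not evaluate whether models produce deflections (e.g., “Sorry, I cannot answer”) when retrieved knowledge is incomplete. They also become obsolete quickly: as LVLM training sets grow, models can answer without retrieval, so the benchmarks no longer test reliance on retrieval.

Result: Experiments with 20 state-of-the-art LVLMs on VLM-DeflectionBench show that models usually fail to deflect correctly in the presence of noisy or misleading evidence. The benchmark is intended as a reusable and extensible resource for reliable knowledge-base visual question answering (KB-VQA) evaluation.

Insight: The contributions are: 1) a dynamic data curation pipeline that preserves benchmark difficulty by filtering for genuinely retrieval-dependent samples; 2) a fine-grained evaluation protocol that separates parametric memorization from retrieval robustness; 3) an emphasis on how models behave when they lack knowledge, not only on what they know, which matters for LVLM reliability.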

Abstract: Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer…) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.


[4] Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration cs.CL

Xin Liu, Lu Wang

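TL;DR: This paper proposes CURE, a framework that improves the factuality of long-form generation by teaching large language models to reason about uncertainty at the claim level. It introduces a Claim-Aware Reasoning Protocol that structures outputs into atomic claims paired with explicit confidence estimates, and a multi-stage training pipeline that aligns model confidence with claim correctness. CURE improves factual accuracy on four long-form factuality benchmarks.

Details

Motivation: Existing approaches reduce hallucination in long-form generation mainly through post-hoc revision or correctness-based reinforcement learning, but they do not teach the model to assess the reliability of what it generates, so models may still state incorrect claims confidently.

Result: On four long-form factuality benchmarks, CURE consistently improves factual accuracy over supervised and reinforcement learning baselines while maintaining factual recall. In particular, claim-level accuracy improves by up to 39.9% on biography generation, and AUROC on FactBench increases by 16.0%, indicating improved calibration.

Insight: The novelty is moving uncertainty reasoning from a single scalar confidence for the whole response down to the claim level, achieving fine-grained factuality calibration through structured outputs and confidence-alignment training. This offers a fine-grained uncertainty modeling approach for making long-form generation more trustworthy.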

Abstract: Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims’ correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
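
The selective prediction step mentioned in the abstract (abstaining from uncertain claims at inference time) can be pictured with a small sketch. It assumes the model output has already been parsed into atomic claims with confidence values; the `Claim` class, the `select_claims` function, the threshold of 0.5, and the placeholder claims are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str          # one atomic claim from the generated response
    confidence: float  # model-reported confidence in [0, 1]

def select_claims(claims, threshold=0.5):
    """Split claims into those kept and those the model abstains from.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    kept = [c for c in claims if c.confidence >= threshold]
    abstained = [c for c in claims if c.confidence < threshold]
    return kept, abstained

# Placeholder claims with invented confidence values.
claims = [
    Claim("claim A", 0.92),
    Claim("claim B", 0.35),
    Claim("claim C", 0.61),
]
kept, abstained = select_claims(claims, threshold=0.5)
print("kept:", [c.text for c in kept])
print("abstained:", [c.text for c in abstained])
```

Raising the threshold trades factual recall for accuracy, which is why the calibration of the confidence values matters for this step.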


[5] Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models cs.CL | cs.AI | cs.CY

Syed Rifat Raiyan

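TL;DR: This paper presents the first systematic study of the Identifiable Victim Effect (IVE) in large language models, finding that the effect is prevalent and strongly shaped by alignment training and reasoning prompts. Instruction-tuned models show extreme IVE, while reasoning-specialized models invert it; standard Chain-of-Thought prompting amplifies the IVE, and only utilitarian Chain-of-Thought eliminates it.

Details

Motivation: As LLMs take on important roles in humanitarian triage, automated grant evaluation, and content moderation, the paper asks whether these systems inherit the affective irrationalities of human moral reasoning, in particular the Identifiable Victim Effect.

Result: Across 51,955 API trials on 16 frontier models, the pooled IVE is significant (d=0.223), about twice the human single-victim baseline (d≈0.10); standard Chain-of-Thought prompting nearly triples the effect size (from d=0.15 to d=0.41).

Insight: The paper shows how alignment training and reasoning prompts modulate moral bias in LLMs: standard Chain-of-Thought may amplify irrationality rather than correct it, which is an important caution for deploying AI in humanitarian decision-making.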

Abstract: The Identifiable Victim Effect (IVE) – the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship – is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments – porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005) – we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen’s d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d≈0.10) reported by Lee and Feeley (2016) – and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting – contrary to its role as a deliberative corrective – nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.
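
The effect sizes in the abstract are reported as Cohen's d. As a reminder of what that statistic measures, here is a minimal sketch of the common pooled-standard-deviation form, d = (mean_a - mean_b) / s_pooled. Whether the paper uses exactly this variant is an assumption, and the two samples below are invented numbers, not data from the study. The last lines only re-check the ratios quoted in the abstract.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Invented allocations for an identifiable-victim and a statistical-group condition.
identifiable = [6, 7, 8, 9, 10]
statistical = [5, 6, 7, 8, 9]
d = cohens_d(identifiable, statistical)
print(f"d = {d:.3f}")

# Ratios implied by the figures quoted in the abstract.
cot_ratio = 0.41 / 0.15      # standard CoT vs. no CoT
human_ratio = 0.223 / 0.10   # pooled LLM effect vs. human single-victim baseline
print(f"CoT ratio = {cot_ratio:.2f}, human-baseline ratio = {human_ratio:.2f}")
```

The two ratios come out at roughly 2.7 and 2.2, consistent with the abstract's wording of "nearly triples" and "approximately twice".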


[6] AlphaEval: Evaluating Agents in Production cs.CL

Pengrui Lu, Bingyu Xu, Wenjun Zhang, Shengjia Hua, Xuanjian Gao

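TL;DR: AlphaEval is a production-oriented benchmark for AI agents, comprising 94 real tasks from seven companies across six occupational domains. It evaluates complete commercial agent products (e.g., Claude Code, Codex) rather than individual models, using a multi-paradigm evaluation framework (LLM-as-a-Judge, formal verification, and others). The paper also contributes a systematic framework for turning production requirements into executable evaluation tasks.

Details

Motivation: Existing benchmarks evaluate agents on retrospectively curated tasks with well-specified requirements and deterministic metrics. This differs fundamentally from production, where requirements are implicit, inputs are multi-modal and fragmented across sources, tasks need undeclared domain knowledge, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards change over time. Evaluation methods that reflect production realities are therefore needed.

Result: The paper presents the AlphaEval benchmark and its construction framework; the abstract does not report specific quantitative results (such as comparisons with SOTA) or benchmark scores.

Insight: The main contributions are: 1) a benchmark built from real production tasks that evaluates complete commercial agent products and captures performance variations that model-level evaluation cannot see; 2) a multi-paradigm, composable evaluation framework; 3) a systematic requirement-to-benchmark methodology that lets any organization quickly build production-grounded benchmarks for its own domain, standardizing the evaluation pipeline and making it reproducible.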

Abstract: The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics – conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products – Claude Code, Codex, etc. – as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework – a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.


[7] AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs cs.CL | cs.IR

Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng

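TL;DR: This paper proposes AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs from unstructured conversations, identify topics, build speaker personas, and simulate topic-guided conversations, while a QA module generates question-answer pairs grounded in short- and long-term conversational history. The authors also create the TopicGuidedChat dataset for fine-tuning and evaluating the short- and long-term memory of LLMs.

Details

Motivation: Datasets that encode both short- and long-term conversational history are lacking, which makes it hard to fine-tune and evaluate the short- and long-term memory of large language models. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation.

Result: Evaluations show that AgenticAI-DialogGen produces higher-quality conversations and that LLMs fine-tuned on the TopicGuidedChat dataset perform better on memory-grounded QA tasks.

Insight: The novelty is a fully automated, agent-based conversation generation framework that models both short-term memory (the current conversation) and long-term memory (persona knowledge in the form of knowledge graphs) and produces high-quality, topically coherent conversation data for model fine-tuning and evaluation.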

Abstract: Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded question-answer (QA) pairs drawn from short- and long-term conversational histories. We also generate a new dataset, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations show that AgenticAI-DialogGen yields higher conversational quality and that LLMs fine-tuned on the TGC dataset achieve improved performance on memory-grounded QA tasks.


[8] Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models cs.CL

Keshu Wu, Chenchen Kuai, Zihao Li, Jiwan Jiang, Shiyu Shen

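TL;DR: This paper proposes Order-Aware Knowledge Hypergraph RAG (OKH-RAG), a new method for augmenting large language models. It represents knowledge as higher-order interactions in a hypergraph with precedence structure and recasts retrieval as sequence inference over hyperedges, recovering coherent interaction trajectories that reflect the underlying reasoning process.

Details

Motivation: Existing graph- and hypergraph-based RAG methods treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This does not fit many real-world reasoning tasks, where outcomes depend not only on which interactions occur but also on the order in which they unfold.

Result: Evaluations on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios, show that OKH-RAG consistently outperforms permutation-invariant baselines, and ablations confirm that the gains come from modeling interaction order.

Insight: The novelty is treating order as a first-class structural property of knowledge: a learned transition model infers precedence directly from data without explicit temporal supervision. This exposes a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence but also organizing it into structured sequences.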

Abstract: Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.


[9] Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score cs.CL

Manh Nguyen, Sunil Gupta, Hung Le

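TL;DR: This paper proposes the Radial Consensus Score (RCS), a new method for selecting the most reliable answer from multiple candidate responses generated by a large language model. RCS models semantic consensus by computing a weighted Fréchet mean (the semantic center) of the answer embeddings and ranking candidates by their radial distance to that center. The method is training-free, supports multiple weighting schemes, and outperforms existing baselines on several benchmarks.

Details

Motivation: Existing methods such as self-consistency rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or underweight high-quality but infrequent responses, and do not fully use the geometric structure of answer representations. The paper aims to address these limitations with a more efficient and more robust answer selection method.

Result: Extensive experiments on seven benchmarks covering short-form QA and long-form reasoning tasks, with five open-weight models, show that RCS variants consistently outperform strong baselines, with gains growing as the sampling budget increases. RCS also works as an effective replacement for majority voting in multi-agent debate and is robust in black-box settings.

Insight: The novelty is using geometric consensus as a scalable and broadly applicable principle: semantic agreement is modeled through a weighted Fréchet mean and radial distances, going beyond simple majority voting to a more expressive and robust form of aggregation. Viewed objectively, the framework is flexible, supports multiple weighting schemes, needs no training, and applies in black-box settings, which makes it practical and general.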

Abstract: Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
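
The selection rule in the abstract (compute a weighted Fréchet mean of the answer embeddings, then rank candidates by distance to it) can be sketched under one simplifying assumption: with Euclidean distance, the weighted Fréchet mean reduces to the weighted arithmetic mean. The paper may use a different embedding space or metric, so this is an illustration rather than the authors' method. The function name and the toy two-dimensional embeddings are hypothetical.

```python
from math import dist

def radial_consensus_rank(embeddings, weights=None):
    """Rank candidates by Euclidean distance to the weighted mean embedding.

    embeddings: list of equal-length float vectors, one per candidate answer.
    weights: optional per-candidate weights (uniform if omitted); frequency-
             or probability-based weights would be passed in here.
    Returns (order, center), where order lists candidate indices from
    closest to farthest.
    """
    n = len(embeddings)
    if n == 0:
        raise ValueError("at least one candidate is required")
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    dim = len(embeddings[0])
    # Weighted arithmetic mean: the Fréchet mean under squared Euclidean distance.
    center = [
        sum(w * e[j] for w, e in zip(weights, embeddings)) / total
        for j in range(dim)
    ]
    distances = [dist(e, center) for e in embeddings]
    order = sorted(range(n), key=lambda i: distances[i])
    return order, center

# Toy embeddings: three candidates cluster together, one is an outlier.
embeddings = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
order, center = radial_consensus_rank(embeddings)
print("ranking:", order, "center:", center)
```

In this toy case the outlier pulls the center toward itself, yet it still ranks last because it remains the farthest point from the center. Supplying non-uniform weights changes the center and can change the ranking.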


[10] Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning cs.CL

Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang

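TL;DR: This paper proposes SpreadsheetAgent, a two-stage multi-agent framework for understanding large real-world spreadsheets. It follows a step-by-step reading and reasoning paradigm: it incrementally interprets localized regions through multiple modalities (code execution results, images, LaTeX tables), builds a structural sketch and row/column summaries, then performs task-driven reasoning, with a verification module to ensure reliability.

Details

Motivation: Existing LLM-based approaches usually treat tables as plain text, ignoring important layout cues and visual semantics, and real-world spreadsheets are often so large that they exceed the input length LLMs can process efficiently.

Result: Extensive experiments on two spreadsheet datasets demonstrate the method's effectiveness. With GPT-OSS-120B, SpreadsheetAgent reaches 38.16% accuracy on Spreadsheet Bench, 2.89 absolute percentage points above the ChatGPT Agent baseline (35.27%).

Insight: The novelty is a two-stage multi-agent framework that uses an incremental, multi-modal strategy for interpreting local regions, together with a verification module that reduces error propagation. Viewed objectively, the method splits complex table understanding into a structure extraction stage and a task reasoning stage and integrates external tools such as code execution, improving robustness and scalability on large spreadsheets with complex layouts.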

Abstract: Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.


Haoran Li, Yulin Chen, Huihao Jing, Wenbin Hu, Tsz Ho Li

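TL;DR: This paper proposes ContextLens, a semi-rule-based framework that uses large language models (LLMs) to ground the input context in the legal domain in order to assess compliance for data privacy and AI safety. Rather than assessing safety outcomes directly, the framework instructs the LLM to answer a set of crafted questions covering applicability, general principles, and detailed provisions, identifying known and unknown factors in realistic settings where the context is incomplete or ambiguous and assessing compliance against pre-defined priorities and rules.

Details

Motivation: Prior work usually assumes that the context is complete and clear, but real-world privacy and safety contexts tend to be ambiguous and incomplete, which limits how effective LLMs are at compliance assessment. The paper addresses this limitation by modeling imperfect privacy and safety contexts to improve the accuracy of legal compliance assessment.

Result: Extensive experiments on existing compliance benchmarks covering the General Data Protection Regulation (GDPR) and the EU AI Act show that ContextLens significantly improves LLMs' compliance assessment, surpasses existing baselines without any training, and can additionally identify ambiguous and missing factors.

Insight: The novelty is a semi-rule-based framework that combines LLMs with legal domain knowledge and handles imperfect context through structured question guidance instead of direct assessment. Viewed objectively, explicitly modeling known and unknown factors makes the method more robust and more interpretable in ambiguous real-world settings and offers a new approach to legal compliance assessment.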

Abstract: Individuals’ concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs’ compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.


[12] ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection cs.CL

Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou

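TL;DR: This paper proposes ToxiTrace, an explainable method for Chinese toxicity detection that augments BERT-style encoders with three components: CuSA, a fine-grained toxic span generator guided by a lightweight LLM; GCLoss, a gradient-constrained loss function; and ARCL, a sample-specific contrastive learning scheme. The goal is to improve classification accuracy while producing readable, coherent toxic evidence spans.

Details

Motivation: Existing methods for detecting toxic Chinese content focus mainly on sentence-level classification and often cannot provide readable, contiguous toxic evidence spans. A method that both classifies accurately and supplies interpretable evidence is therefore needed.

Result: Experiments show that ToxiTrace improves classification accuracy and the quality of toxic span extraction while keeping efficient encoder-based inference, and produces more coherent, more readable explanations.

Insight: The novelty is building explainability directly into the training objective: gradient alignment and contrastive learning make the model focus on the key toxic evidence when it classifies. This offers a new approach to toxicity detection models that are both accurate and interpretable.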

Abstract: Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.


[13] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness cs.CL

Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov

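TL;DR: This paper investigates whether large language models have privileged knowledge about answer correctness (analogous to human introspection), that is, whether a model's internal states contain information that external observation cannot access. By training classifiers on the model's own hidden states and on representations from external models, the authors find that self-representations give no advantage on standard evaluation, but on the subsets where models' predictions disagree, privileged knowledge appears in factual knowledge tasks and not in math reasoning.

Details

Motivation: The paper asks whether large language models have privileged knowledge resembling human introspection, that is, whether their internal states carry information about answer correctness that is not externally observable, which bears on how well models can assess themselves.

Result: On standard evaluation, self-representations perform comparably to external-model representations. On disagreement subsets, self-representations consistently outperform external representations in factual knowledge tasks, while math reasoning shows no advantage.

Insight: The novelty is isolating privileged knowledge through disagreement between models' predictions. This reveals domain specificity (factual vs. math) and a progression across layers (the factual advantage emerges from the early-to-mid layers onward), giving a new view of how models represent information internally.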

Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model’s own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.


[14] ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance cs.CL

Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt, Josef van Genabith, Alexander Löser

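TL;DR: This paper introduces ReasonXL, a large-scale parallel corpus of cross-domain reasoning traces in five European languages (English, German, French, Italian, Spanish), built to address the problem that large language models (LLMs) still reason mainly in English in non-English settings. A simple two-stage pipeline, supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), gets LLMs to reason entirely in the target language while matching or exceeding baseline performance with minimal loss of general knowledge.

Details

Motivation: Despite progress in multilingual capability, most large language models remain English-centric in training and, crucially, in their reasoning. Even on non-English problems they reason mainly in English, which creates a fundamental mismatch for non-English usage scenarios.

Result: After adaptation with ReasonXL, models reason in the target language with performance that matches or exceeds the baseline, lose little general knowledge, and broadly retain cross-lingual transfer.

Insight: The novelty is ReasonXL, the first large-scale parallel corpus of multilingual reasoning traces, together with a two-stage SFT + RLVR adaptation pipeline that efficiently switches an LLM's reasoning language to the target language. The analysis finds that early layers contain an activation bottleneck that determines language identity, while upper layers concentrate the adaptation-driven changes in weights and activations. RLVR achieves greater behavioral divergence than SFT with smaller parameter updates, suggesting more efficient representational rerouting.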

Abstract: Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.


[15] Agentic Insight Generation in VSM Simulations cs.CL

Micha Selak, Dirk Krechel, Adrian Ulges, Sven Spieckermann, Niklas Stoehr

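TL;DR: This paper proposes a decoupled, two-step agentic architecture for extracting actionable insights from complex value stream map simulations. By separating orchestration from data analysis and combining progressive data discovery with domain expert knowledge, the architecture selects data sources intelligently and performs multi-hop reasoning across data structures. Experiments show the framework is viable across several advanced large language models, with top-tier models reaching accuracies of up to 86% and high robustness.

Details

Motivation: Extracting actionable insights from complex value stream map simulations is challenging, time-consuming, and error-prone. Existing approaches are good at processing raw data but are structurally unable to pick up the subtle situational differences needed to tell similar data sources apart.

Result: The framework's viability is validated on multiple state-of-the-art large language models, with top-tier models reaching accuracies of up to 86% in evaluation and showing high robustness.

Insight: The novelty is the decoupled, two-step agentic architecture, which separates orchestration from data analysis and combines domain knowledge with progressive data discovery and multi-hop reasoning to address existing methods' weakness in capturing subtle situational differences.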

Abstract: Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework’s viability, with top-tier models achieving accuracies of up to 86% and showing high robustness across evaluation runs.


[16] Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation cs.CL | cs.AI | cs.CVPDF

Sihang Jia, Shuliang Liu, Songbo Yang, Yibo Yan, Xin Zou

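TL;DR: This paper proposes Decoding by Perturbation (DeP), a training-free framework for mitigating inference hallucinations in multimodal large language models. The method dynamically applies multi-level textual perturbations to probe latent language priors, uses attention variance to reinforce stable evidence regions and suppress suspicious noise in the feature space, and constructs an interpretable prior drift direction to counteract the probability biases introduced by textual co-occurrence.

Details

Motivation: Multimodal large language models frequently suffer from inference hallucinations, partly because language priors dominate visual evidence. Existing training-free mitigation methods either perturb the visual representation and drift away from the natural image distribution, or apply intrusive manipulations that damage the model’s inherent generative fluency.

Result: Extensive experiments confirm that DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

Insight: The paper’s novelty is to view multimodal hallucination as hypersensitivity of visual grounding to textual phrasing during decoding, and to mitigate prior-induced hallucinations through controlled textual interventions. Viewed objectively, the method avoids perturbing the visual input directly and instead adjusts the feature space through dynamic textual perturbation and attention-variance analysis, making it a novel and interpretable mitigation strategy.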

Abstract: Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model’s inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.


[17] KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning cs.CL | cs.AIPDF

Shuai Wang, Yinan Yu

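TL;DR: This paper proposes KG-Reasoner, an end-to-end knowledge graph reasoning framework based on reinforcement learning. It integrates multi-hop reasoning into a unified “thinking” phase of a large language model, so that the model internalizes graph traversal, explores reasoning paths dynamically, and backtracks when necessary.

Details

Motivation: Large language models struggle with knowledge-intensive reasoning, and existing knowledge-graph-based reasoning methods usually decompose the process into isolated steps, which limits reasoning flexibility and can lose information. The paper targets precise multi-hop reasoning over knowledge graphs for complex queries.

Result: Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks show that KG-Reasoner achieves performance competitive with or superior to state-of-the-art methods.

Insight: The main novelty is using reinforcement learning to internalize knowledge graph traversal within the LLM’s reasoning, giving end-to-end unified decision-making. This avoids the fragmented steps and information loss of traditional pipeline methods and improves the coherence and flexibility of reasoning.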

Abstract: Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified “thinking” phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG-Reasoner.
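To make "explore reasoning paths and backtrack when necessary" concrete, here is a minimal sketch of explicit multi-hop traversal with backtracking over a toy knowledge graph. It is not KG-Reasoner itself: the paper trains an LLM to internalize this behavior through RL, whereas the sketch hard-codes a depth-first search. The function name `find_path`, the `(head, relation) -> tails` dictionary layout, and the toy entities are all assumptions made for illustration.

```python
def find_path(kg, start, relations):
    """Follow a sequence of relations from `start`, backtracking on dead ends.

    kg: dict mapping (head_entity, relation) -> list of tail entities
        (a hypothetical toy layout, not the paper's data format).
    Returns the list of visited entities, or None if no path exists.
    """
    def dfs(entity, depth, path):
        if depth == len(relations):
            return path  # every hop has been satisfied
        for tail in kg.get((entity, relations[depth]), []):
            result = dfs(tail, depth + 1, path + [tail])
            if result is not None:
                return result
        # No candidate tail led to a full path: the caller backtracks
        # and tries its next candidate.
        return None

    return dfs(start, 0, [start])


# Toy graph: the first candidate ("X") is a dead end for the second hop.
toy_kg = {
    ("A", "born_in"): ["X", "Y"],
    ("Y", "capital_of"): ["Z"],
}
print(find_path(toy_kg, "A", ["born_in", "capital_of"]))
```

In the toy graph, the search first tries `X`, finds no `capital_of` edge, backtracks, and succeeds through `Y`.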


[18] Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting cs.CL | cs.AIPDF

Shuai Wang, Xixi Wang, Yinan Yu

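TL;DR: This paper proposes a graph-based soft prompting framework for multi-hop knowledge base question answering over incomplete knowledge graphs. It shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning: a graph neural network encodes structural subgraphs into soft prompts, letting the large language model reason over richer structural context and reducing sensitivity to missing edges. A two-stage paradigm lowers computational cost.

Details

Motivation: The work addresses the tendency of large language models to hallucinate in knowledge-intensive scenarios, and the fragility to knowledge graph incompleteness of traditional multi-hop KBQA methods, which rely on explicit edge traversal.

Result: Experiments on four multi-hop KBQA benchmarks show state-of-the-art performance on three of them.

Insight: The novelty lies in lifting reasoning from node-level path traversal to the subgraph level and using a GNN to encode structural information into soft prompts that strengthen the LLM’s structural awareness. A two-stage paradigm balances performance against computational cost, and the method depends less on graph completeness, improving robustness on incomplete KGs.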

Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.


[19] Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs cs.CL | cs.SDPDF

Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu, Sijun Zhang

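TL;DR: This paper proposes the Unified Audio Schema (UAS) to address the poor performance of current audio large language models on fine-grained acoustic perception. UAS organizes audio information into three explicit components (transcription, paralinguistics, and non-linguistic events) and supervises them in a unified JSON format, improving perception while preserving audio-text alignment.

Details

Motivation: Current audio large language models excel at complex reasoning but underperform on fine-grained acoustic perception, which the authors attribute to ASR-centric training that suppresses paralinguistic cues and acoustic events. The paper aims to close this gap with a structured supervision framework.

Result: Experiments on MMSU, MMAR, and MMAU show that UAS-Audio yields consistent improvements, raising fine-grained perception by 10.9% on MMSU over same-size state-of-the-art models while preserving robust reasoning.

Insight: The novelty is a unified supervision framework that decomposes audio information into three explicit components, achieving comprehensive acoustic coverage without sacrificing audio-text alignment and thereby improving both perception and reasoning.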

Abstract: Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components – Transcription, Paralinguistics, and Non-linguistic Events – within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
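As an illustration of what a three-component JSON supervision target could look like, here is a minimal sketch. The abstract names only the three components; the exact field names (`transcription`, `paralinguistics`, `non_linguistic_events`), the nested keys, and both helper functions are assumptions for illustration, not the released UAS format.

```python
import json


def make_uas_record(transcription, paralinguistics, non_linguistic_events):
    """Bundle the three components into one dictionary.

    Field names are hypothetical; see the official repository for the real schema.
    """
    return {
        "transcription": transcription,                  # what was said
        "paralinguistics": paralinguistics,              # how it was said
        "non_linguistic_events": non_linguistic_events,  # what else was heard
    }


def to_training_target(record):
    """Serialize a record to the JSON string a model would be trained to emit."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


record = make_uas_record(
    "hello there",
    {"emotion": "happy"},                        # illustrative attribute
    [{"event": "door_slam", "start_s": 1.2}],    # illustrative event entry
)
print(to_training_target(record))
```

Keeping the transcription as one field among three is what lets a single target carry linguistic content alongside the cues that ASR-only targets discard.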


[20] Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis cs.CLPDF

Kang He, Yuzhe Ding, Xinrong Wang, Fei Li, Chong Teng

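TL;DR: This paper proposes EBMC (Enhance-then-Balance Modality Collaboration), a framework that addresses modality imbalance and noise interference in multimodal sentiment analysis. It improves the representation quality of weaker modalities through semantic disentanglement and cross-modal enhancement, and introduces an energy-guided modality coordination mechanism and instance-aware modality trust distillation to achieve balanced collaboration and robust fusion across modalities.

Details

Motivation: Existing methods that exploit cross-modal complementarity often fail to make full use of weaker modalities. Dominant modalities overshadow non-verbal ones and induce modality competition, which degrades fusion performance and robustness when modalities are noisy or missing.

Result: Extensive experiments show that EBMC achieves state-of-the-art or competitive results on standard multimodal sentiment analysis benchmarks and maintains strong performance under missing-modality settings.

Insight: The novelty is the “enhance-then-balance” collaboration paradigm: weaker modalities are strengthened through semantic disentanglement, and an energy-guided equilibrium objective together with instance-level trust distillation dynamically and adaptively balances modality contributions, improving robustness under non-ideal conditions.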

Abstract: Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.


[21] Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs cs.CLPDF

Xudong Wang, Chaoning Zhang, Qigan Sun, Zhenzhen Huang, Chang Lu

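TL;DR: This paper proposes Tri-RAG, a retrieval framework based on structured triplets. It automatically converts external natural-language knowledge into standardized triplets consisting of Condition, Proof, and Conclusion, and uses the Condition as a semantic anchor for retrieval and matching. This improves retrieval accuracy and context efficiency in retrieval-augmented generation (RAG), and the approach is validated on multiple benchmark datasets.

Details

Motivation: Existing RAG methods typically retrieve and concatenate unstructured text fragments as context, which tends to introduce redundant or weakly relevant information. The result is excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, which degrade generation quality and increase token consumption.

Result: Experiments on multiple benchmark datasets show that Tri-RAG significantly improves retrieval quality and reasoning efficiency, with more stable generation behavior and more efficient resource use in complex reasoning scenarios.

Insight: The novelty is representing external knowledge as logical triplets (Condition, Proof, Conclusion) and retrieving on the Condition as an explicit semantic anchor, which avoids concatenating lengthy raw text and balances retrieval accuracy against context token efficiency. The method uses lightweight prompt-based adaptation and does not update model parameters.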

Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.
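The idea of matching a query against the Condition only, rather than against whole text chunks, can be sketched as follows. This is a toy illustration under stated assumptions: the `Triplet` class, the `retrieve` function, and the token-overlap (Jaccard) scorer are hypothetical stand-ins, and a real system would presumably use a learned retriever instead of word overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    condition: str   # semantic anchor used for matching
    proof: str       # supporting reasoning
    conclusion: str  # what follows when the condition holds


def _tokens(text):
    return set(text.lower().split())


def retrieve(query, triplets, top_k=1):
    """Rank triplets by token overlap between the query and the Condition only."""
    q = _tokens(query)

    def score(triplet):
        c = _tokens(triplet.condition)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(triplets, key=score, reverse=True)[:top_k]


# Toy knowledge base (illustrative content, not from the paper).
kb = [
    Triplet("water is heated to 100 C at sea level",
            "vapor pressure equals atmospheric pressure",
            "the water boils"),
    Triplet("iron is exposed to oxygen and moisture",
            "iron oxidizes in the presence of water",
            "rust forms"),
]
best = retrieve("water heated to boiling point", kb)[0]
print(best.conclusion)
```

Only the matched triplet's compact fields need to enter the prompt, which is where the context-token saving described in the abstract would come from.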


[22] Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs cs.CL | cs.AIPDF

Alkid Baci, Luke Friedrichs, Caglar Demir, N’Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo

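TL;DR: This paper proposes RALP, a method that recasts knowledge graph link prediction as a prompt learning problem. RALP learns string-based chain-of-thought (CoT) prompts that serve as scoring functions for triples, using Bayesian optimization (the MIPRO algorithm) to identify effective prompts from fewer than 30 training examples without gradient access. It can predict missing entities, relations, and even literals, and assigns confidence scores to its predictions.

Details

Motivation: Traditional knowledge graph embedding (KGE) models struggle with unseen entities, relations, and especially literals in dynamic, heterogeneous graphs, whereas pretrained large language models (LLMs) generalize better through prompting.

Result: On transductive, numerical, and OWL instance retrieval benchmarks, RALP improves the mean reciprocal rank (MRR) of state-of-the-art KGE models by over 5%. On OWL reasoning tasks with complex class expressions (e.g., ∃hasChild.Female, ≥5 hasChild.Female), it achieves over 88% Jaccard similarity.

Insight: The novelty is reframing link prediction as a prompt learning problem and learning chain-of-thought prompts efficiently with Bayesian optimization. This offers a flexible alternative to embedding-based methods and shows the potential of prompt-based LLM reasoning for complex knowledge graph tasks, including literal prediction.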

Abstract: Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through the MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5\; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP.
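For readers unfamiliar with the two metrics quoted above, the following sketch gives their standard definitions. It is generic metric code, not taken from the RALP repository; the function names and example values are illustrative.

```python
def mean_reciprocal_rank(ranks):
    """MRR over 1-based ranks of the correct answer for each query."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


def jaccard_similarity(predicted, expected):
    """|A ∩ B| / |A ∪ B| between retrieved and gold instance sets."""
    a, b = set(predicted), set(expected)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


# Correct entity ranked 1st, 2nd, and 4th across three link-prediction queries.
print(mean_reciprocal_rank([1, 2, 4]))

# Instance retrieval for a class expression: two of four distinct individuals agree.
print(jaccard_similarity({"anna", "bea", "cleo"}, {"bea", "cleo", "dora"}))
```

MRR rewards placing the correct entity near the top of a ranked list, while Jaccard similarity compares the retrieved set of individuals for a class expression against the gold set.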


[23] InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models cs.CLPDF

Shreya Gupta, Prottay Kumar Adhikary, Bhavyaa Dave, Salam Michael Singh, Aniket Deroy

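TL;DR: InsightFlow is an LLM-based approach that automatically generates causal graphs aligned with the 5P framework from patient-therapist dialogues, for clinical case formulation in mental health. Evaluated on 46 expert-annotated psychotherapy transcripts, the generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment, and experts rate them as moderately complete, consistent, and clinically useful.

Details

Motivation: Clinical case formulation often uses the 5P framework to organize patient symptoms and psychosocial factors into causal models, but constructing these graphs manually from therapy transcripts is time-consuming and varies across clinicians. An automated method is needed to improve efficiency and consistency.

Result: On 46 psychotherapy transcripts, graphs generated by InsightFlow show structural similarity (NetSimile) comparable to inter-annotator agreement and high semantic alignment (embedding similarity) with human graphs, and experts rate them as moderately complete, consistent, and clinically useful. Compared with human graphs, LLM-generated graphs tend to form more interconnected structures rather than chain-like patterns, but overall complexity and content coverage are similar.

Insight: The paper’s novelty is using an LLM to synthesize patient narratives into causal models automatically, showing the potential of LLMs to automate causal modeling in clinical workflows. Viewed objectively, the method produces clinically meaningful graphs within the natural variability of expert practice, but future work is needed to improve temporal reasoning and reduce redundancy.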

Abstract: Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time-consuming and varies across clinicians. We present InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria. The generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain-like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.


[24] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood cs.CLPDF

Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang

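TL;DR: This paper proposes TEPO, a token-level policy optimization framework that addresses the token-level sparse-reward problem large language models face in chain-of-thought reasoning. It links group-level rewards to individual tokens via sequence-level likelihood and introduces a token-level KL-divergence mask constraint to stabilize training.

Details

Motivation: Existing methods such as GRPO and related entropy regularization methods struggle with token-level sparse rewards in chain-of-thought reasoning and are prone to entropy collapse or model degradation, so a finer-grained token-level optimization strategy is needed.

Result: Experiments show that TEPO achieves state-of-the-art performance on mathematical reasoning benchmarks while markedly improving training stability, reducing convergence time by 50% compared with GRPO/DAPO.

Insight: The novelty is linking group-level rewards to token-level aggregation via sequence-level likelihood, and introducing a KL-divergence mask constraint targeting tokens with positive advantages and decreasing entropy, which uses sparse rewards more effectively and avoids abrupt policy changes.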

Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.


[25] Generating Effective CoT Traces for Mitigating Causal Hallucination cs.CLPDF

Yiheng Zhao, Jun Yan

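TL;DR: This paper addresses causal hallucination by small language models on event causality identification and proposes a method for generating effective chain-of-thought (CoT) traces. It first studies the criteria that effective CoT traces should satisfy and designs a pipeline that generates traces meeting them. It also introduces a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of the criteria, and validate the pipeline. Experiments show that fine-tuning on the generated CoT traces substantially reduces causal hallucination in small models and improves mean accuracy, and that the fine-tuned models generalize well across datasets and difficulty levels and are robust to misleading intervention prompts.

Details

Motivation: The work addresses the severe causal hallucination that small language models exhibit on event causality identification, and the lack of CoT trace datasets for this task.

Result: Experiments show that fine-tuning on CoT traces generated by the proposed pipeline substantially reduces causal hallucination in small LLMs and improves mean accuracy; the models show strong cross-dataset and cross-difficulty generalization and robustness to misleading intervention prompts.

Insight: The novelty is that the paper is the first to systematically define criteria for effective CoT traces for event causality identification and to build a generation pipeline, and that it proposes the Causal Hallucination Rate as a new metric for quantifying causal hallucination, giving a reproducible framework and evaluation baseline for mitigating the problem.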

Abstract: Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, CoT trace datasets for ECI are currently lacking. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.


[26] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning cs.CLPDF

Timon Ziegenbein, Maja Stahl, Henning Wachsmuth

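TL;DR: This paper proposes a reinforcement learning approach that teaches large language models (LLMs) human-like text editing to improve the appropriateness of arguments. The approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently, and a multi-component reward function jointly optimizes edit-level semantic similarity, fluency, and pattern conformity together with argument-level appropriateness.

Details

Motivation: The work starts from an observed mismatch between human and LLM editing strategies: LLMs often make multiple scattered edits and change meaning notably, whereas humans encapsulate dependent changes in self-contained, meaning-preserving edits. The paper aims to make LLM editing less unnatural and closer to how humans edit.

Result: In automatic and human evaluation, the approach outperforms competitive baselines and the state of the art in human-like editing, and multi-round editing achieves appropriateness close to that of full rewriting.

Insight: The novelty is using reinforcement learning (specifically group relative policy optimization) with a multi-component reward function to teach LLMs human-like editing. The resulting self-contained sentence-level edit suggestions improve the naturalness and appropriateness of edits while preserving semantic similarity and fluency.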

Abstract: Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one’s arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.


[27] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss cs.CL | cs.AIPDF

Ronald Skorobogat, Ameya Prabhu, Matthias Bethge

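TL;DR: This paper argues that current multilingual benchmarks mainly measure mathematical reasoning and factual recall rather than genuine multilingual proficiency, and proposes evaluating multilingual generation through round-trip translation. The authors introduce a benchmark called Lost in Translation (LiT), which needs no human reference translations and correlates strongly with user ratings.

Details

Motivation: Existing multilingual evaluation benchmarks for frontier models focus on mathematical reasoning and factual recall and do not accurately measure real-world multilingual ability, so models can score well on these benchmarks yet perform poorly on practical tasks.

Result: Round-trip translation evaluation correlates almost perfectly with user ratings on LMArena (ρ = 0.94), and requires neither human reference translations nor a multilingual judge more capable than the models under test.

Insight: The novelty is proposing round-trip translation as an alternative way to evaluate multilingual capability, exposing generation failures through the semantic gap between the original text and its translation into a target language and back, and building the LiT benchmark for more realistic evaluation of multilingual models.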

Abstract: Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similarly to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. On our benchmark, round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
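The round-trip procedure (translate to the target language, translate back, compare with the original) can be sketched as below. This is a minimal illustration, not the LiT pipeline: `round_trip_score` and `surface_similarity` are hypothetical names, the translators are passed in as plain callables, and character-level `difflib` similarity stands in for whatever semantic comparison the benchmark actually uses.

```python
from difflib import SequenceMatcher


def surface_similarity(a, b):
    """Crude stand-in for a semantic similarity measure, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def round_trip_score(text, to_target, to_source, similarity=surface_similarity):
    """Translate source -> target -> source and compare with the original.

    to_target / to_source: callables wrapping the model under test
    (hypothetical interface; any str -> str function works).
    """
    translated = to_target(text)
    recovered = to_source(translated)
    return similarity(text, recovered)


# Toy "translators": a lossless pair and one that drops a negation.
lossless = round_trip_score("the cat is not here",
                            to_target=lambda s: s[::-1],
                            to_source=lambda s: s[::-1])
lossy = round_trip_score("the cat is not here",
                         to_target=lambda s: s.replace("not ", "")[::-1],
                         to_source=lambda s: s[::-1])
print(lossless, lossy)
```

Because the comparison happens entirely in the source language, no reference translation in the target language is needed, which matches the property the abstract highlights.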


[28] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models cs.CL | eess.ASPDF

Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu

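TL;DR: This paper proposes MoshiRAG, an asynchronous knowledge retrieval approach for full-duplex speech language models. It combines a compact speech interface with a selective retrieval mechanism to improve factual accuracy while preserving real-time interactivity.

Details

Motivation: Full-duplex speech language models excel at real-time interactivity, but their factual accuracy still needs improvement, and simply scaling up the model makes inference prohibitively expensive. An efficient knowledge-augmentation approach is therefore needed.

Result: MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity of full-duplex systems, and shows strong performance on out-of-domain mathematical reasoning tasks.

Insight: The novelty is exploiting the natural temporal gap between response onset and the delivery of core information to perform retrieval asynchronously. The design is modular and plug-and-play, so different retrieval methods can be attached without retraining, balancing real-time responsiveness against knowledge needs.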

Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.


[29] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts cs.CL | cs.CVPDF

Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

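TL;DR: This paper introduces GlotOCR Bench, a comprehensive benchmark that evaluates how well OCR models generalize across more than 100 Unicode scripts. It contains clean and degraded image variants rendered from real multilingual text, using fonts from the Google Fonts repository, HarfBuzz for shaping, and FreeType for rasterization, and supports both left-to-right and right-to-left scripts. An evaluation of a range of open-weight and proprietary vision-language models finds that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Faced with unfamiliar scripts, models either produce random noise or hallucinate characters from similar scripts they already know.

Details

Motivation: Current OCR evaluation concentrates on a small number of high- and mid-resource scripts and lacks a comprehensive assessment of generalization across Unicode scripts, so a broader benchmark is needed to expose the limitations of OCR models.

Result: On GlotOCR Bench, most of the evaluated models perform well on fewer than ten scripts, even the strongest fail to generalize beyond thirty, and performance tracks script-level pretraining coverage.

Insight: The novelty is an OCR benchmark covering more than 100 Unicode scripts. It shows that current OCR models depend heavily on pretraining data coverage, generalize poorly, and tend to hallucinate or emit noise on unfamiliar scripts, underlining the importance of multilingual OCR evaluation.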

Abstract: Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.


[30] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models cs.CL | cs.CYPDF

Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru

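TL;DR: This paper presents PolicyBench, a large-scale, cross-system (US-China) benchmark for public policy comprehension comprising 21K cases, which evaluates the memorization, understanding, and application abilities of LLMs in the policy domain. Building on the benchmark, the authors propose PolicyMoE, a policy-focused Mixture-of-Experts model that performs better on application-oriented tasks than on memorization and conceptual understanding tasks.

Details

Motivation: LLMs are increasingly used in real-world decision-making, including public policy, yet their ability to comprehend and reason about policy-related content remains underexplored. Dedicated benchmarks and models are needed to fill this gap.

Result: On PolicyBench, the proposed PolicyMoE model performs better on application-oriented policy tasks than on memorization or conceptual understanding tasks, and achieves the highest accuracy on structured reasoning tasks.

Insight: The main novelty is building the first large-scale, cross-system benchmark for public policy comprehension (PolicyBench) and designing a domain-specialized MoE model aligned with cognitive levels (PolicyMoE), which opens a path for evaluating and improving LLM comprehension in complex, structured domains such as policy.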

Abstract: Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom’s taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.


[31] Toward Autonomous Long-Horizon Engineering for ML Research cs.CLPDF

Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng

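TL;DR: This paper introduces AiScientist, a system for autonomous long-horizon engineering for machine learning research. Its core design principle combines structured orchestration with durable state continuity: hierarchical orchestration and a file-based “File-as-Bus” workspace protocol let agents work continuously and coherently from durable artifacts such as code, analyses, and experimental evidence, addressing the difficulty of maintaining long-term consistency across task comprehension, environment setup, implementation, experimentation, and debugging.

Details

Motivation: The work addresses the challenge autonomous AI faces in long-horizon ML research engineering: agents must sustain coherent progress across multiple complex stages (such as task comprehension, coding, and experimentation) over hours or days, without the state loss and weak control that come from relying on conversational handoffs.

Result: On two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves an “Any Medal” rate of 81.82% on MLE-Bench Lite. Ablations show that the “File-as-Bus” protocol is a key driver of performance: removing it lowers the PaperBench score by 6.41 points and the MLE-Bench Lite score by 31.82 points.

Insight: The paper’s claimed novelty is the system design principle of combining structured orchestration with durable state continuity, realized through a hierarchical orchestration architecture and the “File-as-Bus” workspace protocol. Viewed objectively, its core insight is to recast long-horizon ML research engineering as a systems problem of coordinating specialized work over durable project state rather than a purely local reasoning problem, and to use file-system-based durable state management to keep agent behavior consistent and traceable over the long term, which marks a notable shift in system design.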

Abstract: Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces the PaperBench score by 6.41 points and the MLE-Bench Lite score by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
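A rough sketch of what a permission-scoped, file-based workspace might look like follows. Everything here is an assumption for illustration: the `Workspace` class, the per-agent write scopes keyed by top-level directory, and the `workspace_map` helper are hypothetical and are not taken from the AiScientist implementation.

```python
from pathlib import Path


class Workspace:
    """Shared directory where agents exchange durable artifacts as files.

    write_scopes maps an agent name to the top-level directories it may
    write to (a hypothetical permission model). Reads are unrestricted.
    """

    def __init__(self, root, write_scopes):
        self.root = Path(root)
        self.write_scopes = write_scopes

    def write(self, agent, rel_path, text):
        rel = Path(rel_path)
        if rel.is_absolute() or ".." in rel.parts or not rel.parts:
            raise PermissionError(f"invalid path: {rel_path}")
        if rel.parts[0] not in self.write_scopes.get(agent, set()):
            raise PermissionError(f"{agent} may not write to {rel.parts[0]}/")
        target = self.root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    def read(self, rel_path):
        # Agents re-ground on the file contents instead of chat history.
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self):
        """Compact listing an orchestrator could use for stage-level control."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*") if p.is_file()
        )
```

In this sketch the orchestrator would only need the short `workspace_map()` listing, while each specialized agent reads the full artifacts it needs, which is one way to read the abstract's "thin control over thick state".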


cs.CV [Back]

[32] MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs cs.CVPDF

Md Rakibul Haque, KM Arefeen Sultan, Tushar Kataria, Shireen Elhabian

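TL;DR: This paper proposes MedConcept, a framework that discovers latent medical concepts in pretrained medical vision-language models (VLMs) in an unsupervised manner and aligns them with clinically verifiable textual semantics to improve interpretability. It identifies sparse neuron-level concept activations and translates them into pseudo-report-style summaries, supporting physician-level inspection of the model’s internal reasoning.

Details

Motivation: Existing medical VLMs perform strongly, but their opaque latent representations limit clinical trust and the ability to explain predictions. Current gradient- or attention-based interpretability methods are usually limited to specific tasks such as classification and do not provide reusable, concept-level explanations derived from pretrained representations.

Result: The paper introduces a quantitative semantic verification protocol that uses an independent pretrained medical LLM as a frozen external evaluator. It defines three concept scores (Aligned, Unaligned, and Uncertain) to quantify semantic support, contradiction, or ambiguity between concepts and radiology reports, providing a quantitative baseline for assessing interpretability in medical VLMs.

Insight: The novelty is discovering medical concepts without supervision and giving them semantics by translating neuron activations into clinically understandable text summaries. The paper also proposes a new protocol for quantitative evaluation that uses an external LLM instead of human annotation, providing an objective yardstick for interpretability research.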

Abstract: While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. All code, prompts, and data will be released upon acceptance.
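One plausible way to turn per-concept judge verdicts into the three scores is to report the fraction of verdicts in each category, as sketched below. The abstract does not give the scoring formula, so the fraction-based aggregation, the `concept_scores` name, and the lowercase label strings are all assumptions for illustration.

```python
from collections import Counter

LABELS = ("aligned", "unaligned", "uncertain")


def concept_scores(judgments):
    """Fraction of judge verdicts per label (assumed aggregation).

    judgments: list of labels returned by the frozen external evaluator,
    one per (concept summary, radiology report) comparison.
    """
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


print(concept_scores(["aligned", "aligned", "unaligned", "uncertain"]))
```

Under this assumed aggregation the three scores sum to 1 for any non-empty set of verdicts, so a rise in one score necessarily comes at the expense of the others.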


[33] V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos cs.CV

Chengkun Yue, Chuanzhi Xu, Jiangpeng He

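TL;DR: This paper proposes V-Nutri, a framework that uses process information from egocentric cooking videos to improve dish-level nutrition estimation. It combines pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking-process keyframes, and it includes a VideoMamba-based event-detection model that selects keyframes at ingredient-addition moments. The study establishes the first video-based nutrition estimation benchmark on the HD-EPIC dataset and validates the complementary value of process cues.

Details

Motivation: Existing vision-based nutrition estimation relies mainly on a single image of the finished dish, but many nutritionally relevant components (such as oils and sauces) become visually ambiguous after cooking, which makes accurate estimation difficult. This paper investigates whether process information from egocentric cooking videos can improve dish-level nutrition estimation.

Result: Experiments on the HD-EPIC dataset show that, under controlled conditions, process cues provide complementary nutritional evidence and improve nutrition estimation. The results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event-detection quality.

Insight: The novelty lies in being the first to systematically bring cooking-process video information into nutrition estimation, proposing a staged framework and integrating a VideoMamba-based event-detection model to identify ingredient-addition moments. Viewed objectively, the method highlights the value of temporal process information for resolving visual ambiguity and establishes a new benchmark dataset for video-based nutrition analysis.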

Abstract: Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframe selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.


[34] Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery cs.CV

Chitra Banarjee, Patrick Kwon, Ania Lipat, Rui Xie, Chen Chen

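TL;DR: This paper presents a pipeline based on a 3D Human Mesh Recovery (HMR) model that extracts gait parameters from videos of older adults completing the Timed Up and Go (TUG) test, in order to assess fall risk and overall health. Video-derived gait parameters correlate significantly with IMU-based measurements, and parameters such as step-length variability and sit-to-stand duration are associated with self-reported fall risk and fear of falling.

Details

Motivation: In current clinical practice, gait assessment for older adults is largely limited to stopwatch-measured gait speed, and more comprehensive, accessible methods are lacking. This paper uses computer vision to automatically extract richer gait parameters from videos recorded in community settings, enabling more convenient and more ecologically valid fall-risk analysis.

Result: Video-derived step time correlates significantly with IMU-based insole measurements. Linear mixed-effects models confirm that shorter, more variable step lengths and longer sit-to-stand durations are associated with higher self-reported fall risk and fear of falling.

Insight: The novelty lies in applying a 3D human mesh recovery model to gait analysis of older adults in community settings, extracting clinically relevant gait parameters from ordinary video and providing a technical pipeline for accessible, valid health assessment in uncontrolled real-world scenarios. The method avoids wearable sensors, which makes assessment more convenient and more ecologically valid.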

Abstract: Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.


[35] INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents cs.CV | cs.AI | cs.CL | cs.LG

Somraj Gautam, Anathapindika Dravichi, Gaurav Harit

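TL;DR: This paper introduces INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (Table VQA) on real-world document images in Bahasa Indonesia. The dataset contains 1,593 document images in different visual styles (bordered, borderless, and colorful), each with one or more tables, along with 1,593 question-answer sets in four languages (Bahasa Indonesia, English, Hindi, and Arabic). The authors evaluate several leading open-source and closed-source vision-language models (VLMs) and find substantial performance gaps on structurally complex tables and in low-resource languages. Fine-tuning on the dataset and adding spatial priors (table region coordinates) improve model performance markedly.

Details

Motivation: Existing vision-language models understand tables in real-world document images poorly in low-resource languages such as Bahasa Indonesia and in cross-lingual settings. In particular, there is a lack of language- and domain-specific datasets for evaluating and improving this capability.

Result: Evaluating Qwen2.5-VL, Gemma-3, LLaMA-3.2, and GPT-4o on the INDOTABVQA benchmark reveals substantial performance gaps. Fine-tuning a 3B model and LoRA fine-tuning a 7B model on the dataset improve accuracy by 11.6% and 17.8%, respectively. Supplying table region coordinates as an additional spatial-prior input improves performance by a further 4-7%.

Insight: The main novelty is a multilingual, multi-style Table VQA benchmark for Bahasa Indonesia documents that supports both monolingual and cross-lingual evaluation. Viewed objectively, its core contribution is to underline the importance of language diversity and domain-specific datasets, and to show that targeted fine-tuning and spatial structure priors (such as table coordinates) can effectively improve performance on complex document understanding tasks, which is valuable for document AI research in underrepresented regions of the world.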

Abstract: We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful) with one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and LoRA fine-tuning a 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset can be accessed on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA


[36] Ultra-low-light computer vision using trained photon correlations cs.CV | physics.optics

Mandar M. Sohoni, Jérémie Laydevant, Mathieu Ouellet, Shi-Yuan Ma, Ryotatsu Yanagimoto

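TL;DR: This paper proposes correlation-aware training (CAT), a method for improving object recognition under ultra-low-light, noisy conditions. CAT optimizes a trainable correlated-photon illumination source and a Transformer backend end to end, so that the Transformer learns to exploit photon correlations. With only a small number of shots (≤100), it improves classification accuracy by up to 15 percentage points over conventional uncorrelated-illumination approaches.

Details

Motivation: Conventional methods based on correlated-photon illumination focus on image reconstruction, but in computer-vision tasks such as object recognition the ultimate goal is to make inferences about the scene, not to reconstruct an image. This paper explores how to integrate the advantages of correlated-photon illumination into a hybrid optical-electronic computer-vision pipeline to improve object recognition accuracy when the photon budget is limited.

Result: Under ultra-low-light, noisy imaging conditions, the method improves classification accuracy on object recognition by up to 15 percentage points over conventional computer vision based on uncorrelated illumination, and it also outperforms untrained correlated-photon illumination.

Insight: The novelty lies in combining a computer-vision task (object recognition) with training of the photon correlation pattern. By optimizing the illumination source and the digital backend end to end, the system exploits correlations specifically for the task at hand and surpasses existing reconstruction-focused methods in strictly photon-budget-constrained scenarios.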

Abstract: Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene – such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (<= 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task – object recognition – and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.


[37] TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment cs.CV

Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur

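TL;DR: This paper proposes TIPSv2, a new family of models that improves vision-language pretraining by strengthening patch-text alignment. The core contributions are: patch-level distillation that significantly boosts dense patch-text alignment; iBOT++, an upgrade to the iBOT masked image objective in which unmasked tokens also contribute directly to the loss; a modified exponential moving average setup; and a caption sampling strategy that draws on synthetic captions at multiple granularities.

Details

Motivation: Existing vision-language pretraining models still struggle with a fundamental capability: aligning dense image patch representations with text embeddings of the corresponding concepts. This paper addresses that issue and aims to improve the patch-text alignment of foundation models.

Result: Comprehensive experiments on 9 tasks and 20 datasets show that TIPSv2 models perform strongly, generally on par with or better than recent vision encoder models.

Insight: The main novelty is the finding that patch-level distillation yields a student whose patch-text alignment strongly surpasses the teacher's, which motivates pretraining recipe changes such as iBOT++. Viewed objectively, the work ties the masked image modeling idea from self-supervised learning more closely to the vision-language alignment objective and uses multi-granularity text to improve training efficiency and effectiveness, making it an instructive enhancement to pretraining frameworks.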

Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment – surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .


[38] Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs cs.CV

Kaizhen Tan

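TL;DR: This paper empirically studies how visual token pruning affects the calibration of multimodal large language models (MLLMs) and finds that pruning does not always trade reliability for efficiency. On POPE, the pure-coverage SCOPE pruning setting substantially lowers Expected Calibration Error (ECE) while maintaining accuracy. On ScienceQA-IMG, pruning also lowers ECE, with accuracy stable or slightly improved.

Details

Motivation: Existing work on visual token pruning mainly evaluates task accuracy and overlooks model calibration, that is, how well predicted confidence matches actual correctness. This paper examines how pruning affects MLLM calibration to assess whether the efficiency gains come at the cost of reliability.

Result: On POPE, SCOPE pruning with a pure-coverage setting achieves substantially lower ECE than the unpruned model while maintaining similar accuracy. Saliency-based pruning worsens calibration, and FastV pruning causes severe performance degradation in the experimental setting. On ScienceQA-IMG, pruning generally lowers ECE, with accuracy stable or slightly improved.

Insight: Visual token pruning should be evaluated on both accuracy and confidence quality, especially for multimodal systems that need reliable decisions. Coverage-oriented pruning (for example, lowering the saliency weight) may improve calibration more than saliency-based pruning, and the default gap power exponent in coverage-based selection is not always optimal.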

Abstract: Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
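The calibration metrics named in this abstract can be illustrated with a minimal sketch. It assumes the common equal-width-bin definition of ECE and a Brier score computed between the confidence of the predicted answer and its 0/1 correctness; the paper's exact binning and evaluation code are not given here, so the function names and the default of 10 bins are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (the binning scheme is an assumption)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    # Per bin: [sample count, sum of confidences, number of correct predictions].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(is_correct)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            # Weight each bin's |accuracy - mean confidence| gap by its share of samples.
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece


def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    if not confidences:
        return 0.0
    total = sum((c - int(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)
```

Comparing these two values for a pruned and an unpruned model on the same questions is one way to check whether pruning changed confidence quality independently of accuracy.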


[39] PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning cs.CV

Syed Fahim Ahmed, Gnanesh Rasineni, Florian Koehler, Abu Zahid Bin Aziz, Mei Wang

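TL;DR: This paper proposes Progressive-Context Multiple Instance Learning (PC-MIL), a new framework for whole-slide image classification in computational pathology. It treats the spatial extent of supervision as an independent design dimension and decouples feature resolution from supervision scale: with fixed 20x features, it varies MIL bag extent in millimeter units and anchors supervision at a clinically relevant 2 mm scale. PC-MIL progressively mixes slide-level and region-level supervision in controlled proportions, improving generalization in cross-context evaluation while preserving global performance.

Details

Motivation: Standard slide-level multiple instance learning optimizes only global labels, so models aggregate features without learning anatomically meaningful localization. The scale of supervision does not match the scale of clinical reasoning, and the model's inductive bias erases anatomical structure.

Result: In binary cancer detection experiments on 1,476 prostate whole-slide images from five public datasets, modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance.

Insight: The novelty lies in treating the spatial extent of supervision as a first-class design dimension, decoupling feature resolution from supervision scale, and shaping the model's inductive bias by progressively mixing supervision at different scales to support anatomically grounded generalization. This gives multiple instance learning a generalization axis, supervision scale, that is independent of feature resolution.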

Abstract: Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether “somewhere in the slide there is cancer.” As a result, the model’s inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.


[40] HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models cs.CV

Xinyun Liu

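TL;DR: This paper proposes HTDC (Hesitation-Triggered Differential Calibration), a training-free decoding framework for mitigating hallucination in large vision-language models (LVLMs). It detects fluctuations in token preference across intermediate layers (called layer-wise hesitation) to identify steps where visual grounding is unstable, and it triggers calibration only at those hesitation-prone steps, preserving standard full-branch inference while reducing unnecessary computation.

Details

Motivation: Large vision-language models perform well on multimodal tasks but still hallucinate because of unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods usually calibrate at every step, which adds unnecessary computation and can disrupt stable predictions, so a more efficient, targeted calibration mechanism is needed.

Result: Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.

Insight: The novelty lies in using layer-wise hesitation as a simple signal of unstable visual grounding and building a triggered calibration mechanism on top of it. Lightweight visual-nullification and semantic-nullification probes are activated only at unstable steps to suppress hallucination-prone candidates, which avoids unnecessary intervention and makes calibration more efficient.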

Abstract: Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.


[41] Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models cs.CV | cs.LG

Md Tanvirul Alam

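TL;DR: This paper studies semantic fixation in large vision-language models (VLMs): the tendency to fall back on default semantic priors even when the prompt specifies an equally valid alternative interpretation. The author introduces the VLM-Fix benchmark, which compares performance under standard and inverse rules across four abstract strategy games and reveals a fixation gap, with performance dropping markedly under inverse rules. Experiments show that prompt interventions and post-training both affect this fixation, and that activation steering on late-layer representations can partially correct the resulting errors.

Details

Motivation: Existing evaluations do not cleanly separate perception failures from rule-mapping failures in VLMs. The paper aims to study and quantify models' reliance on default semantic priors, that is, semantic fixation.

Result: Across 14 open and closed VLMs, VLM-Fix shows that accuracy under standard rules is consistently higher than under inverse rules, revealing a substantial semantic-fixation gap. Prompt interventions such as neutral aliases narrow the gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned, and joint-rule training improves generalization. External validation on VLMBias shows a similar pattern.

Insight: The novelty lies in the VLM-Fix benchmark, which isolates semantic fixation and exposes how stubbornly VLMs rely on semantic priors. Viewed objectively, prompt engineering and late-layer representation editing can partially mitigate the problem, offering a new angle on improving rule following and generalization in VLMs.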

Abstract: Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.


[42] ViLL-E: Video LLM Embeddings for Retrieval cs.CV

Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran

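TL;DR: This paper proposes ViLL-E (Video-LLM-Embed), a unified video large language model architecture with a novel embedding generation mechanism that lets the model adaptively "think" according to video complexity. It substantially improves embedding-based tasks such as video retrieval and temporal localization while preserving video question answering ability.

Details

Motivation: Existing video large language models excel at tasks with text outputs (such as video question answering) but underperform specialized embedding models on embedding-based retrieval tasks (such as text-to-video retrieval). This paper aims to close that gap with a unified model that both understands video content and produces high-quality embeddings.

Result: The model improves temporal localization by 7% on average (over other VideoLLMs) and video retrieval by up to 4% (over dual-encoder models), reaching performance comparable to state-of-the-art specialized embedding models while remaining competitive on video question answering. On zero-shot composed video retrieval and long-text retrieval, it outperforms the current best methods by 5% and 2%, respectively.

Insight: The core novelty is an embedding generation mechanism with adaptive stopping for video large language models, together with a three-stage training method that combines generative and contrastive learning. This offers a new approach to building unified multi-task video understanding models, makes generation and retrieval work together effectively, and unlocks strong zero-shot generalization.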

Abstract: Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to “think longer” for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).


[43] VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale cs.CV

Parth Parag Kulkarni, Rohit Gupta, Prakash Chandra Chhipa, Mubarak Shah

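TL;DR: This paper proposes VidTAG, a framework for video geolocalization that predicts precise GPS coordinates and trajectories from video frames. It uses a dual-encoder architecture that combines self-supervised and language-aligned features for frame-to-GPS retrieval, and it introduces the TempGeo module to align frame embeddings and the GeoRefiner module to refine GPS features, addressing temporal inconsistency in video predictions.

Details

Motivation: Existing classification-based methods are coarse (city level only), while image-retrieval methods are impractical at global scale because they require enormous image galleries. This paper aims at fine-grained video geolocalization and solves the problem with a gallery of GPS coordinates, which is easy to construct.

Result: Evaluations on the Mapillary (MSLS) and GAMa datasets show that the method produces temporally consistent trajectories, improves on GeoCLIP by 20% at the 1 km threshold, and beats the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k).

Insight: The novelty lies in a dual-encoder retrieval framework that combines self-supervised and language-aligned features, together with dedicated modules (TempGeo and GeoRefiner) for temporal alignment and feature refinement of video sequences, offering a new approach to fine-grained video geolocalization.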

Abstract: The task of video geolocalization aims to determine the precise GPS coordinates of a video’s origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model’s ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: https://parthpk.github.io/vidtag_webpage/
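Frame-to-GPS retrieval and the distance-threshold metric mentioned above can be sketched as follows. This is a toy illustration under stated assumptions, not VidTAG's implementation: the embeddings are hand-written vectors standing in for encoder outputs, the helper names (`retrieve_gps`, `accuracy_at_km`) are hypothetical, and distance is computed with the standard haversine formula on a spherical Earth of radius 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def retrieve_gps(frame_embedding, gallery):
    """Return the (lat, lon) whose embedding is most similar to the frame.

    gallery: list of ((lat, lon), embedding) pairs (hypothetical layout).
    """
    best = max(gallery, key=lambda item: cosine(frame_embedding, item[1]))
    return best[0]


def accuracy_at_km(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predictions within threshold_km of the true location."""
    hits = sum(haversine_km(p, g) <= threshold_km
               for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)
```

Because the gallery holds only coordinates and their embeddings, it can be built without collecting reference images, which is the practical advantage the abstract points to.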


[44] Nucleus-Image: Sparse MoE for Image Generation cs.CV

Chandan Akiti, Ajay Modukuri, Murali Nandan Nagarapu, Gunavardhan Akiti, Haozhe Liu

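TL;DR: Nucleus-Image is a text-to-image diffusion transformer built on a sparse mixture-of-experts (MoE) architecture. Through Expert-Choice Routing, a decoupled routing design, and efficiency-oriented architectural optimizations, it matches or exceeds the image quality of larger models while activating only about 2B parameters, establishing a new Pareto frontier in inference efficiency.

Details

Motivation: Existing high-quality text-to-image models are costly and inefficient at inference. The paper uses a sparse MoE architecture to trade off total model capacity (17B parameters) against the parameters activated per forward pass (about 2B), cutting inference cost substantially without sacrificing generation quality.

Result: On benchmarks including GenEval, DPG-Bench, and OneIG-Bench, the model matches or exceeds leading models, reaching state-of-the-art or comparable performance while significantly reducing computation per inference.

Insight: The main contributions are: 1) a sparse MoE diffusion transformer architecture with Expert-Choice Routing; 2) a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation to improve routing stability; 3) excluding text tokens from the transformer backbone entirely and using joint attention with text KV shared across timesteps to improve inference efficiency; 4) a progressive sparsification training strategy combined with multi-aspect-ratio bucketing. Together these offer an effective path to efficient, high-quality generation by scaling capacity through sparse activation.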

Abstract: We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
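The general idea behind expert-choice routing can be sketched in a few lines. In this scheme each expert selects its top-scoring tokens up to a fixed capacity, instead of each token selecting experts. The sketch below is a generic illustration, not Nucleus-Image's router: the function name, the softmax over experts, and the capacity formula `tokens * capacity_factor / experts` are assumptions, and the abstract's decoupled timestep-aware design is not reproduced.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def expert_choice_route(logits, capacity_factor=1.0):
    """Let each expert pick its highest-affinity tokens.

    logits: list of per-token lists, logits[t][e] = router score of token t
            for expert e (toy values in this sketch).
    Returns {expert index: [token indices]}, each list of length `capacity`.
    """
    n_tokens = len(logits)
    n_experts = len(logits[0])
    # Assumed capacity rule: an equal share of tokens, scaled by the factor.
    capacity = max(1, int(n_tokens * capacity_factor / n_experts))
    capacity = min(capacity, n_tokens)
    probs = [softmax(row) for row in logits]
    assignment = {}
    for e in range(n_experts):
        # Rank tokens by their affinity to expert e, highest first.
        ranked = sorted(range(n_tokens), key=lambda t: probs[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment
```

Every expert processes the same number of tokens, so load is balanced by construction; the trade-off is that an individual token may be picked by several experts or by none. Lowering `capacity_factor` during training is one way to read "progressive sparsification of the expert capacity factor", though the abstract does not spell out the schedule.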


[45] Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment cs.CV

Xinjie Zhang, Qiang Li, Xiaowen Ma, Axi Niu, Li Yan

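TL;DR: This paper proposes DS-IEQA, a unified framework for image editing quality assessment that addresses the limitations of existing methods by jointly learning evaluation criteria and score representations.

Details

Motivation: Existing MLLM-based image editing quality assessment relies on human heuristic prompting, which leads to rigid metric prompts and score modeling that ignores distance relationships. As a result, these methods align poorly with implicit human criteria and fail to capture the continuous structure of the score space.

Result: The method ranked 4th in Track 2 of the 2026 NTIRE X-AIGC Quality Assessment challenge without using any additional training data.

Insight: The novelty lies in Feedback-Driven Metric Prompt Optimization, which automatically refines metric definitions, and in a distance regression loss that decouples numerical tokens from language modeling to model score continuity explicitly. Together they jointly optimize evaluation criteria and score representations.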

Abstract: Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method’s superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
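The contrast between distance-agnostic and distance-aware score modeling can be made concrete with a small sketch. The abstract does not give the exact form of TDRL, so the loss below is only one plausible reading of "expected distance minimization": the expected absolute distance between the target score and the model's distribution over discrete score tokens. The function names and the 1 to 5 score scale are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the score-token logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(score_logits, target_index):
    """Standard token-level loss: only the probability of the target counts."""
    return -math.log(softmax(score_logits)[target_index])


def expected_distance_loss(score_logits, score_values, target):
    """Expected |score - target| under the predicted score distribution."""
    probs = softmax(score_logits)
    return sum(p * abs(s - target) for p, s in zip(probs, score_values))


# Two predictions that give the true score (4) the same probability, 0.6,
# but spread the remaining mass near it or far from it.
scores = [1, 2, 3, 4, 5]
near = [math.log(p) for p in (0.05, 0.05, 0.10, 0.60, 0.20)]
far = [math.log(p) for p in (0.20, 0.10, 0.05, 0.60, 0.05)]
```

Cross-entropy assigns `near` and `far` the same loss because both place 0.6 on the target token, whereas the expected-distance loss penalizes `far` more. That is the sense in which a plain language-modeling objective is distance-agnostic.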


[46] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation cs.CV | cs.AI

Wentai Zhang, Ronghui Xi, Shiyao Peng, Jiayu Huang, Haoran Luo

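TL;DR: This paper proposes PASA, a training-free framework that addresses the heavy self-attention cost of video diffusion transformers. By introducing a curvature-aware dynamic budgeting mechanism, hardware-aligned grouped approximations, and a stochastic selection bias, PASA substantially accelerates inference and eliminates temporal flickering while keeping video generation high-fidelity.

Details

Motivation: Video diffusion transformers generate high-fidelity video, but self-attention is extremely expensive. Existing sparse-attention acceleration methods often cause severe visual flickering because of static sparsity patterns and deterministic block routing, so an efficient and temporally smooth solution is needed.

Result: Extensive evaluations on leading video diffusion models show that PASA achieves substantial inference acceleration while consistently producing very fluid, structurally stable video sequences.

Insight: The contributions are: 1) a curvature-aware dynamic budgeting mechanism that elastically allocates the exact-computation budget according to the acceleration of the generation trajectory across timesteps, securing high-precision processing during critical semantic transitions; 2) hardware-aligned grouped approximations in place of global homogenizing estimates, capturing fine-grained local variation while maintaining peak compute throughput; 3) a stochastic selection bias in the attention routing mechanism that softens rigid selection boundaries and eliminates selection oscillation, removing the localized computational starvation that drives temporal flickering.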

Abstract: Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.


[47] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models cs.CV

Xinliang Wang, Yifeng Shi, Zhenyu Wu

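TL;DR: This paper proposes ArtifactWorld, a framework that addresses the geometric and photometric degradations of 3D Gaussian Splatting (3DGS) under sparse views through systematic data expansion and a homogeneous dual-model paradigm. It establishes a fine-grained taxonomy of 3DGS artifacts and builds a large-scale training set of 107.5K diverse paired video clips. Architecturally, it unifies restoration within a video diffusion backbone: an isomorphic predictor produces an artifact heatmap that localizes structural defects, and an Artifact-Aware Triplet Fusion mechanism performs precise spatio-temporal repair.

Details

Motivation: Current generative restoration approaches for 3DGS artifacts suffer from insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data. This leads to multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions.

Result: Extensive experiments show that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction.

Insight: The contributions are: 1) the first fine-grained phenomenological taxonomy of 3DGS artifacts and a large-scale paired video dataset, which address the data bottleneck; 2) a homogeneous dual-model paradigm that unifies artifact localization (via heatmaps) and restoration (via triplet fusion) within a video diffusion framework, enabling precise, intensity-guided spatio-temporal repair and stronger multi-view consistency.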

Abstract: 3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.


[48] ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception cs.CV | cs.AIPDF

Huanzhen Wang, Ziheng Zhou, Jiaqi Song, Li He, Yunshi Lan

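TL;DR: This paper proposes ARGen, a framework that generates dynamic facial expression data in two stages, Affective Semantic Injection and Adaptive Reinforcement Diffusion, to address data scarcity and long-tail distributions in in-the-wild dynamic expression recognition and to make emotion perception more robust.

Details

Motivation: In-the-wild dynamic facial expression recognition suffers from data scarcity and long-tail distributions, which prevent models from effectively learning the temporal dynamics of scarce emotions.

Result: Extensive experiments on generation and recognition tasks verify that ARGen substantially improves synthesis fidelity and recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

Insight: The contributions include establishing affective knowledge alignment through facial Action Units, injecting interpretable emotional priors via a retrieval-augmented prompt generation strategy, and combining text-conditioned image-to-video diffusion with reinforcement learning, where inter-frame conditional guidance and a multi-objective reward function jointly optimize generation quality and efficiency.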

Abstract: Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.


[49] Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement cs.CVPDF

Hang Xu, Chen Long, Bing Wang, Hao Chen, Zhen Dong

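TL;DR: This paper proposes SDAR-Net, a new adaptive underwater image enhancement framework that targets the under-processing or over-processing caused by the uniform mapping used in existing methods. It first decouples the input image's specific degradation style from the scene structure, then uses an adaptive routing mechanism to adjust the enhancement process dynamically according to the degradation style features, so that each image is restored according to its own needs.

Details

Motivation: Existing underwater image enhancement methods rely mainly on a uniform mapping fitted to the average dataset distribution, which over-processes mildly degraded images and under-restores severely degraded ones. The paper aims to remove this limitation of non-adaptive enhancement.

Result: On a real-world benchmark, SDAR-Net achieves new state-of-the-art performance with a PSNR of 25.72 dB, and it demonstrates its utility in downstream vision tasks.

Insight: The core idea is to decouple image features into dynamic degradation style embeddings and static scene structural representations, and to introduce an adaptive routing mechanism that predicts soft weights from the style features to guide enhancement. This shifts the paradigm from "uniform mapping" to "adaptive modulation".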

Abstract: Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing of mildly degraded images or insufficient recovery of severely degraded ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves a new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on a real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at https://github.com/WHU-USI3DV/SDAR-Net.
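
Note on the PSNR figure: the abstract reports quality as peak signal-to-noise ratio in dB. As a reminder of what that number measures, here is a minimal sketch of the standard PSNR definition. It assumes 8-bit images flattened to lists of pixel values; the function name `psnr` and the `max_value` parameter are illustrative and are not taken from the SDAR-Net codebase, whose exact evaluation protocol is not described in this digest.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities (an assumption
    made here to keep the sketch dependency-free).
    """
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: each additional 10 dB corresponds to a tenfold reduction in mean squared error against the reference image.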


[50] DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos cs.CVPDF

Yuan Huang, Sijie Zhao, Jing Cheng, Hao Xu, Shaohui Jiao

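TL;DR: This paper proposes DreamStereo, a real-time stereo inpainting method for HD videos that fills the occluded regions of warped videos while maintaining visual coherence and temporal consistency. It uses Gradient-Aware Parallax Warping (GAPW) to obtain continuous edges and smooth occlusion regions, a Parallax-Based Dual Projection (PBDP) strategy to generate geometrically consistent stereo inpainting pairs and occlusion masks, and Sparsity-Aware Stereo Inpainting (SASI) to cut redundant computation and reach real-time processing.

Details

Motivation: Stereo video inpainting faces two problems: the scarcity of high-quality datasets limits the learning of inpainting priors, and existing methods process the whole frame uniformly, which causes substantial redundant computation.

Result: SASI removes over 70% of redundant tokens during diffusion inference for a 10.7x speedup, processing HD (768 x 1280) video in real time at 25 FPS on a single A100 GPU with results comparable to the full-computation version.

Insight: The contributions include GAPW, which improves edge continuity; PBDP, which generates geometrically consistent data without requiring stereo videos; and SASI, which achieves efficient real-time processing by removing redundant tokens. Together they offer new ideas for both data generation and compute optimization in stereo inpainting.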

Abstract: Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.
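
To illustrate why sparsity helps when the regions to fill "occupy only a small fraction of each frame", the following sketch keeps only the patches (tokens) that overlap an occlusion mask. This is an assumption-laden illustration, not the SASI algorithm: the abstract does not say how SASI selects tokens, and the function name `active_patches`, the nested-list mask format, and the square patch size are all hypothetical.

```python
def active_patches(mask, patch):
    """Return (row, col) indices of patches that contain any masked pixel.

    mask  : 2D list of 0/1 values, 1 marking a pixel that needs inpainting
            (hypothetical input format).
    patch : side length of a square patch, i.e. one token.
    """
    rows, cols = len(mask), len(mask[0])
    keep = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            # A patch stays active if at least one of its pixels is masked.
            touched = any(
                mask[r][c]
                for r in range(top, min(top + patch, rows))
                for c in range(left, min(left + patch, cols))
            )
            if touched:
                keep.append((top // patch, left // patch))
    return keep


# Toy 4x4 frame with a single occluded pixel and 2x2 patches:
# only one of the four patches needs processing.
toy_mask = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
kept = active_patches(toy_mask, 2)
skipped_fraction = 1 - len(kept) / 4
```

In this toy case three of four tokens are skipped; a real system would also have to preserve enough context around each active token for the diffusion model to produce coherent content.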


[51] LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion cs.CVPDF

Clara Xue, Zizheng Yan, Zhenning Shi, Yuhang Yu, Jingyu Zhuang

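TL;DR: LiveMoments is a reference-guided image restoration framework designed for reselected key photos in Live Photos. Using the original high-quality key photo as a reference, it restores reselected frames degraded by the video pipeline through a two-branch neural network (a reference branch and a main branch) combined with a unified Motion Alignment module.

Details

Motivation: Live Photos let users pick an alternative frame from the video clip as the key photo to capture a better moment, but these frames usually show noticeable degradation because the video image signal processing pipeline delivers lower quality. Dedicated restoration techniques are therefore needed.

Result: Experiments on real and synthetic Live Photos datasets show that LiveMoments significantly outperforms existing solutions in perceptual quality and fidelity, especially in scenes with fast motion or complex structures.

Insight: The paper tailors a reference-guided restoration framework to the specific application of Live Photos and introduces a unified Motion Alignment module that performs spatial alignment at both the latent and image levels, making effective use of the structural and textural information in the original high-quality key photo to guide restoration of the low-quality frame.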

Abstract: Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at https://github.com/OpenVeraTeam/LiveMoments.


[52] Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors cs.CVPDF

Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait

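TL;DR: This paper proposes a new method that uses priors from a 3D foundation model to generate geometrically realistic and consistent orbital videos from a single image. A multi-scale 3D adapter injects shape priors into the video generation model to address the structural inconsistency of existing methods in long-range extrapolation such as rear-view synthesis.

Details

Motivation: Existing video generation methods rely mainly on pixel-wise attention to enforce consistency across frames. This is under-constrained for long-range extrapolation such as rear-view synthesis, where pixel correspondences are limited, so the results are structurally implausible or incoherent. Shape priors learned from large 3D asset corpora are therefore needed as an auxiliary constraint.

Result: Extensive experiments on multiple benchmarks show that the method outperforms state-of-the-art methods in visual quality, shape realism, and multi-view consistency, and that it generalizes robustly to complex camera trajectories and in-the-wild images.

Insight: The method uses the rich shape priors of a 3D foundational generative model (a denoised global latent vector and latent images projected from volumetric features) as multi-scale conditions. A multi-scale 3D adapter injects the feature tokens via cross-attention, which preserves the pretrained capabilities of the base video model, enables efficient and model-agnostic fine-tuning, and avoids explicit mesh extraction.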

Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.


[53] GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality cs.CV | cs.MMPDF

Zhiwei Zhang, Xingyuan Zeng, Xinkai Kong, Kunquan Zhang, Haoyuan Liang

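TL;DR: This paper presents GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. It integrates high-resolution optical imagery, structured text descriptions, and digital elevation model (DEM) data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. The authors also propose ETTerra, a multimodal baseline model for terraced parcel delineation.

Details

Motivation: Existing public benchmarks focus mainly on regular and relatively flat farmland scenes. Terraced parcels in mountainous regions have stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, which makes parcel extraction a harder problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. A unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings is still missing.

Result: Extensive experiments show that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

Insight: The paper builds the first global terraced parcel extraction benchmark that integrates multimodal data (image, text, DEM) and proposes ETTerra, a multimodal baseline network guided by elevation and text, demonstrating that multimodal fusion is effective and complementary for agricultural parcel extraction in complex terrain.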

Abstract: Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.


[54] Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge cs.CVPDF

Hayato Inoue, Shota Harada, Shumpei Takezaki, Ryoma Bise

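TL;DR: This paper proposes a cell instance segmentation framework based on a multi-task image-to-image Schrödinger Bridge. It treats instance segmentation as a distribution-based image generation problem, integrates boundary-aware supervision through a reverse distance map, and uses deterministic inference to produce stable predictions. On the PanNuke dataset it achieves competitive or superior performance without relying on SAM pre-training or additional post-processing; on the MoNuSeg dataset it shows robustness under limited training data.

Details

Motivation: Existing cell instance segmentation methods typically combine deterministic predictions with post-processing, which imposes only limited constraints on the global structure of instance masks. The paper aims to provide a more effective framework through distribution modeling.

Result: The method achieves competitive or superior performance on the PanNuke dataset (without SAM pre-training or additional post-processing) and shows robustness under limited training data on the MoNuSeg dataset.

Insight: The contributions include formulating instance segmentation as a Schrödinger Bridge-based image-to-image generation problem and integrating boundary-aware supervision (a reverse distance map) with deterministic inference. The method offers a distribution-modeling perspective on cell segmentation that may improve global structural consistency.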

Abstract: Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.


[55] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation cs.CVPDF

Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Wenjing Jia

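TL;DR: This paper proposes RSGMamba, a new multimodal semantic segmentation framework that revisits cross-modal fusion from the perspective of modality reliability. Its core is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism, combined with a lightweight Local Cross-Gated Modulation (LCGM) that refines spatial details. The method achieves state-of-the-art performance on RGB-D and RGB-T segmentation benchmarks.

Details

Motivation: Existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which leads to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. The paper aims to address unequal modality reliability in multimodal semantic segmentation.

Result: The method reaches 58.8% and 54.0% mIoU on NYUDepth V2 and SUN-RGBD, improvements of 0.4% and 0.7% over the prior best, and 61.1% and 88.9% mIoU on MFNet and PST900, an improvement of up to 1.6%. With only 48.6M parameters, it achieves state-of-the-art performance on multiple benchmarks.

Insight: The paper reframes the fusion strategy from a modality-reliability perspective and proposes a self-gated state space model (RSGMB) that models reliability explicitly, enabling reliability-aware feature selection and aggregation. The lightweight LCGM complements this global modeling by refining fine-grained spatial details. Combining explicit reliability modeling with dynamic gating offers a new approach to handling noisy or incomplete multimodal data.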

Abstract: Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
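
Note on the mIoU figures: the results above are reported as mean intersection-over-union. The sketch below shows the usual per-class IoU averaged over classes, under the assumption that predictions and labels are flat lists of integer class ids. The function name `mean_iou` is illustrative, and benchmark implementations typically accumulate intersections and unions over the whole dataset rather than per image, so this is a simplified version rather than the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target.

    pred, target : equal-length sequences of integer class ids, one per pixel.
    Classes that appear in neither sequence are skipped instead of being
    counted as zero.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally to the mean, gains on rare classes move mIoU as much as gains on frequent ones.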


[56] EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports cs.CV | cs.AI | cs.MMPDF

Jianzhe Ma, Zhonghao Cao, Shangkui Chen, Yichen Xu, Wenxuan Wang

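TL;DR: This paper introduces EgoEsportsQA, a video question-answering benchmark for evaluating the perception and reasoning abilities of Video-LLMs in high-velocity, information-dense virtual environments. It contains 1,745 high-quality QA pairs collected from professional matches in 3 first-person shooter games and uses a taxonomy with two dimensions: cognitive capability (perception and reasoning) and esports knowledge. Evaluations of existing state-of-the-art models show that their performance in complex virtual scenes still falls well short.

Details

Motivation: Existing Video-LLMs are good at understanding slow-paced, real-world egocentric videos, but their capabilities in high-velocity, rule-bound virtual environments such as esports remain under-explored, and a dedicated benchmark is lacking.

Result: On EgoEsportsQA, current state-of-the-art Video-LLMs do not reach satisfactory performance, with the best model achieving only 71.58% accuracy. The evaluation exposes clear weaknesses in deep tactical reasoning and fine-grained micro-operations.

Insight: The paper builds a perception and reasoning video QA benchmark for virtual esports scenarios and proposes a decoupled two-dimensional taxonomy. This provides a tool for diagnosing the intrinsic weaknesses of current Video-LLMs in high-velocity virtual environments and may guide architecture optimization and downstream application development.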

Abstract: While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.


[57] All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding cs.CV | cs.LGPDF

Tanzila Rahman, Renjie Liao, Leonid Sigal

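TL;DR: This paper proposes a unified synthetic data generation pipeline for multimodal video understanding that can automatically produce unlimited multimodal video data and supports multiple task formats, including object counting, question answering, and segmentation. A VQA-based fine-tuning strategy trains models to answer structured questions about visual content in order to strengthen reasoning. Experiments show that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts.

Details

Motivation: Multimodal large language models need large-scale, diverse annotated data for video understanding, but collecting and annotating real-world data is costly, slow, and limited in diversity.

Result: On three challenging tasks (video object counting, video-based visual question answering, and video object segmentation), the models generalize effectively to real-world datasets and often outperform traditionally trained counterparts, showing the potential of the synthetic data pipeline.

Insight: The paper proposes a unified multi-task synthetic data generation pipeline that supports multiple task formats, and it strengthens visual grounding and reasoning through a VQA-based fine-tuning strategy, offering a scalable alternative for multimodal video understanding.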

Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.


[58] Detecting Precise Hand Touch Moments in Egocentric Video cs.CVPDF

Huy Anh Nguyen, Feras Dayoub, Minh Hoai

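TL;DR: This paper proposes HiCE (Hand-informed Context Enhanced), a module for detecting the precise moment of hand-object contact in egocentric video. It fuses spatiotemporal features from hand regions and their surrounding context through cross-attention, and it introduces a grasp-aware loss with soft labels that emphasizes hand pose and movement dynamics to separate near-contact frames from actual contact frames. The authors also build TouchMoment, a new dataset of 4,021 videos and 8,456 annotated contact moments. Under a strict two-frame tolerance criterion, the method outperforms existing event-spotting baselines on TouchMoment by 16.91% average precision.

Details

Motivation: Detecting the precise moment of hand-object contact in egocentric video is a challenging task that matters for augmented reality, human-computer interaction, assistive technologies, and robot learning, because contact onset signals the initiation or completion of an action.

Result: On the proposed TouchMoment dataset, under a strict two-frame tolerance criterion, HiCE outperforms state-of-the-art event-spotting baselines by 16.91% average precision.

Insight: The contributions are: (1) the HiCE module, which uses cross-attention to fuse spatiotemporal context features from hand regions and their surroundings; (2) a grasp-aware loss and soft labels that emphasize the hand pose patterns and movement dynamics characteristic of touch events, to better separate near-contact from actual contact; (3) TouchMoment, a large-scale, finely annotated egocentric hand-contact dataset that provides a new benchmark for the field.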

Abstract: We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
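
The two-frame tolerance criterion can be made concrete with a small matching routine. The sketch below is one plausible reading of the criterion, not the paper's evaluation code: it assumes predictions are `(frame, score)` pairs, processes them in descending score order, and lets each ground-truth moment be matched at most once. The function name `match_touch_moments` and the greedy one-to-one matching rule are assumptions.

```python
def match_touch_moments(predictions, ground_truth, tolerance=2):
    """Label each predicted contact frame as a true or false positive.

    predictions  : list of (frame_index, confidence) pairs (assumed format).
    ground_truth : list of annotated contact frame indices.
    tolerance    : maximum allowed distance in frames (two in the abstract).
    Returns a list of (frame_index, confidence, is_true_positive) tuples,
    ordered by descending confidence.
    """
    unmatched = sorted(ground_truth)
    results = []
    for frame, score in sorted(predictions, key=lambda p: p[1], reverse=True):
        candidates = [g for g in unmatched if abs(g - frame) <= tolerance]
        if candidates:
            # Greedy choice: consume the closest still-unmatched annotation.
            nearest = min(candidates, key=lambda g: abs(g - frame))
            unmatched.remove(nearest)
            results.append((frame, score, True))
        else:
            results.append((frame, score, False))
    return results


# Toy example: two predictions land near frame 12, but only one can claim it.
labels = match_touch_moments([(10, 0.9), (11, 0.8), (30, 0.7)], [12, 31])
```

The resulting true/false-positive labels, ranked by confidence, are the usual input to a precision-recall curve from which average precision is computed.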


[59] Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization cs.CVPDF

Zanyi Wang, Fan Li, Dengyang Jiang, Liuzhuozheng Li, Yunhua Zhong

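TL;DR: This paper proposes ST-GD, a framework that tackles data scarcity in spatio-temporal video grounding. It keeps a pre-trained 2D visual-language model (e.g., Grounding DINO) frozen, injects lightweight adapters (about 10M trainable parameters) to add spatio-temporal awareness, and pairs them with a novel temporal decoder for boundary prediction, enabling efficient adaptation with limited data.

Details

Motivation: Fully trained approaches to spatio-temporal video grounding are data-hungry, while large-scale annotations (dense frame-level bounding boxes and complex temporal language alignments) are prohibitively expensive, so models overfit on limited datasets. Zero-shot foundation models, in turn, lack the task-specific temporal awareness needed for precise localization.

Result: ST-GD performs well in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks while maintaining robust generalization on the VidSTG dataset.

Insight: The paper adopts a parameter-efficient adapter strategy, injecting lightweight modules into a frozen pre-trained model to add spatio-temporal awareness so that small datasets do not destroy the pre-trained priors. It also designs a dedicated temporal decoder for boundary prediction, providing an effective paradigm for complex video understanding under strict small-data constraints.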

Abstract: Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.


[60] Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection cs.CVPDF

Haifeng Zhang, Qinghui He, Xiuli Bi, Bo Liu, Chi-Man Pun

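TL;DR: This paper proposes Multi-dimensional Adversarial Feature Learning (MAFL), a framework that addresses the weak generalization of existing AI-generated image detectors caused by data bias. An adversarial training mechanism suppresses generation-pattern and content biases and guides the model toward the generative features shared across different generative models, which separates real from generated images effectively and reduces the reliance on large-scale training data.

Details

Motivation: Existing generated image detection methods are susceptible to data bias: models overfit to specific generative patterns and content instead of learning the features common to different generative models, which limits cross-model generalization.

Result: In extensive experiments, the method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Trained with only 320 images, it still achieves over 80% detection accuracy on public datasets.

Insight: The paper introduces a multi-dimensional adversarial loss and an adversarial training mechanism in which an adversarial branch suppresses the learning of bias features, forcing the model to learn more general generative features. By reducing dependence on biased data through adversarial learning, the method offers an effective solution for generated image detection in low-data settings.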

Abstract: In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.


[61] OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion cs.CVPDF

Dongjian Yu, Weiqing Min, Qian Jiang, Xing Lin, Xin Jin

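TL;DR: This paper introduces the OmniFood8K dataset and the NutritionSynth-115K synthetic dataset to address two gaps in existing food nutrition estimation: insufficient coverage of Chinese dishes and a reliance on depth sensors. It also designs an end-to-end framework for nutrition prediction from a single RGB image, which predicts a depth map, hierarchically aligns and fuses RGB and depth features in the frequency domain, and uses a mask-based prediction head to select key regions dynamically, improving prediction accuracy.

Details

Motivation: Existing food nutrition datasets mainly cover Western cuisines and lack sufficient coverage of Chinese dishes, and many state-of-the-art methods rely on depth sensors, which limits their applicability in daily scenarios.

Result: Extensive experiments on multiple datasets show that the method outperforms existing approaches and achieves more accurate nutrition prediction.

Insight: The contributions include a large-scale multimodal dataset, synthetic data augmentation, frequency-domain alignment and fusion of multimodal features, and dynamic channel selection that emphasizes key ingredient regions. These strategies can improve the robustness and accuracy of single-image nutrition estimation.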

Abstract: Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models’ capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/


[62] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CVPDF

Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park

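TL;DR: This paper studies why visual token pruning in multimodal large language models fails on complex visual reasoning tasks. It identifies the shift of relevant visual information during decoding as the main cause and proposes DSTP, a training-free add-on framework that dynamically aligns visual tokens with reasoning requirements.

Details

Motivation: Existing visual token pruning methods perform reliably on simple visual understanding tasks but generalize poorly to complex visual reasoning tasks, a critical gap that has been underexplored.

Result: Experiments show that DSTP significantly mitigates the performance degradation of pruning methods on complex reasoning tasks, yields consistent gains across visual understanding benchmarks, and is effective across diverse state-of-the-art architectures with minimal computational overhead.

Insight: The paper identifies Relevant Visual Information Shift during decoding as the key failure mechanism and proposes a lightweight, training-free, plug-and-play framework that adjusts the pruning strategy dynamically, improving the generalization of existing methods.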

Abstract: Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.


[63] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models cs.CVPDF

Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg

TL;DR: 本文研究了针对视觉语言模型(VLMs)的排版提示注入攻击,通过将对抗性文本渲染为图像来绕过安全机制。研究评估了四种VLMs在不同字体大小和视觉变换下的攻击成功率,发现字体大小显著影响攻击效果,文本攻击在某些模型上比图像攻击更有效,且文本-图像嵌入距离与攻击成功率呈强负相关。

Details

Motivation: 随着VLMs作为自主代理的感知骨干,排版攻击通过图像形式注入对抗文本,威胁日益增长,但攻击面异质且模型脆弱性差异大,防御困难。

Result: 在SALAD-Bench的1000个提示上评估GPT-4o、Claude Sonnet 4.5、Mistral-Large-3和Qwen3-VL-4B-Instruct,发现小字体(6px)攻击成功率近零,中字体效果最佳;GPT-4o和Claude的文本攻击比图像攻击更有效(36% vs 8%,47% vs 22%);嵌入距离与攻击成功率强负相关(r=-0.71至-0.93);重度退化使嵌入距离增加10-12%,攻击成功率降低34-96%。

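Insight: The novelty lies in systematically quantifying how font size and visual conditions affect typographic attack success, and in revealing the strong correlation between text-image embedding alignment and attack success. Model-specific robustness patterns rule out one-size-fits-all defenses, and the results offer empirical guidance for selecting VLM backbones in adversarial environments.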

Abstract: We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6–28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
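
Finding (3) reports a Pearson correlation between text-image embedding distance and ASR. The following is a minimal sketch of that computation in plain Python. The helper name `pearson_r` and all data values are illustrative assumptions, not figures or code from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only: one point per hypothetical rendering condition.
embedding_distance = [0.30, 0.35, 0.42, 0.50, 0.61]
attack_success_rate = [0.45, 0.40, 0.31, 0.22, 0.05]

r = pearson_r(embedding_distance, attack_success_rate)
print(f"r = {r:.2f}")  # negative here: larger distance goes with lower ASR
```

In the paper's setting, each point would presumably correspond to one font size or transformation, with distances taken from an embedding model such as JinaCLIP.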


[64] Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection cs.CV

Hao Wang, Jiqing Zhang, Xin Yang, Baocai Yin, Lu Jiang

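TL;DR: This paper proposes a modality-agnostic prompt learning framework for multi-modal camouflaged object detection (COD). By generating modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), the framework enables parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improves COD performance.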

Details

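Motivation: Existing methods typically rely on modality-specific architectures or customized fusion strategies, which limits scalability and cross-modal generalization. The paper aims for a unified framework that adapts flexibly to different auxiliary modalities.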

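Result: Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of the framework and show that it significantly improves overall performance.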

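Insight: The novelty lies in modeling multi-modal learning as an interaction between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding, and introducing a lightweight Mask Refine Module that calibrates coarse predictions with fine-grained prompt cues to obtain more accurate camouflaged object boundaries.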

Abstract: Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.


[65] Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models cs.CV | cs.AI

Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao

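TL;DR: This paper proposes Chain-of-Models Pre-Training (CoM-PT), a method for accelerating the training of vision foundation models (VFMs) without loss of performance. Its core idea is to accelerate at the model-family level: it builds a model chain ordered by ascending model size, applies standard pre-training only to the smallest model, and trains the larger models efficiently through sequential inverse knowledge transfer in the parameter space and the feature space. Validated on 45 datasets, the method significantly reduces training cost while maintaining or improving model performance.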

Details

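Motivation: Existing methods usually optimize training for each model individually, so training cost rises sharply as a model family grows. CoM-PT rethinks training acceleration at the model-family level to address the computational cost of pre-training large-scale vision foundation models.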

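Result: Experiments on CC3M show that, with ViT-L as the largest model, CoM-PT reduces computational complexity by up to 72%. Within a fixed model size range, as the model family scales from 3 to 4 and 7 models, the acceleration ratio of CoM-PT rises from 4.13X to 5.68X and 7.09X. On 45 zero-shot and fine-tuning datasets, its performance is mostly superior to standard individual training.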

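Insight: The novelty lies in lifting training acceleration from single-model optimization to the model-family level, achieving performance-lossless acceleration through a model chain and sequential inverse knowledge transfer that reuses knowledge in both the parameter space and the feature space. Its efficient scaling produces a counterintuitive phenomenon: training more models yields higher efficiency. The method is agnostic to the specific pre-training paradigm and could extend to more computationally intensive scenarios such as large language model pre-training.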

Abstract: In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.


[66] Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning cs.CV

Jungwon Choi, Eunwoo Kim

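TL;DR: This paper proposes a dual-modality anchor-guided filtering framework to improve Test-Time Prompt Tuning (TPT). It introduces a text anchor and an adaptive image anchor, filters augmented views by semantic alignment and confidence, and uses the anchors as auxiliary predictive heads combined with the original output in a confidence-weighted ensemble, which provides a stable supervision signal for prompt updates. Experiments on 15 benchmark datasets show new state-of-the-art performance.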

Details

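Motivation: Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is limited by the difficulty of determining which views are beneficial. Standard entropy-based filtering relies on the model's internal confidence scores, which are often miscalibrated under distribution shift and assign high confidence to irrelevant crops or background regions while ignoring semantic content.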

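Result: Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art (SOTA) performance, highlighting the effectiveness of anchor-guided supervision as a foundation for robust prompt updates.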

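Insight: The novelty lies in dual-modality anchors (a text anchor and an adaptive image anchor) that ground view selection in semantic evidence, and in treating the anchors as auxiliary predictive heads combined with the original output in a confidence-weighted ensemble. This provides a more stable supervision signal and overcomes the calibration problems of conventional entropy filtering under distribution shift.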

Abstract: Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
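
To make the contrast between entropy-only filtering and anchor-guided filtering concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `anchor_filter`, the `min_align` threshold, and all feature and probability values are hypothetical, and the paper's actual combination of alignment and confidence may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def entropy_filter(view_probs, keep_ratio=0.5):
    """Baseline: keep the lowest-entropy (most confident) views."""
    k = max(1, int(len(view_probs) * keep_ratio))
    order = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return sorted(order[:k])

def anchor_filter(view_probs, view_feats, anchor, keep_ratio=0.5, min_align=0.5):
    """Assumed variant: discard views misaligned with the anchor, then rank by entropy."""
    aligned = [i for i, f in enumerate(view_feats) if cosine(f, anchor) >= min_align]
    if not aligned:  # fall back to all views if nothing passes the alignment gate
        aligned = list(range(len(view_probs)))
    k = max(1, int(len(view_probs) * keep_ratio))
    aligned.sort(key=lambda i: entropy(view_probs[i]))
    return sorted(aligned[:k])

# Toy data: view 1 is a confident but semantically unrelated background crop.
probs = [[0.90, 0.05, 0.05], [0.95, 0.03, 0.02], [0.60, 0.30, 0.10], [1/3, 1/3, 1/3]]
feats = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [0.8, 0.3]]
anchor = [1.0, 0.0]

print(entropy_filter(probs))                # keeps the background crop
print(anchor_filter(probs, feats, anchor))  # drops it in favour of an aligned view
```

In this toy case the entropy-only baseline keeps view 1 because it is the most confident, while the alignment gate removes it before confidence is considered.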


[67] A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs cs.CV

Mohammed Asad, Mohit Bajpai, Sudhir Singh, Rahul Katarya

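TL;DR: This paper proposes a hybrid architecture for benign-malignant classification of mammography ROIs. It combines EfficientNetV2-M for local feature extraction with Vision Mamba, a state space model, for efficient global context modeling, with the goal of classifying suspicious breast lesions more accurately.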

Details

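Motivation: The paper addresses the limitations of existing approaches to mammography lesion classification: CNNs are effective at extracting local visual patterns but poorly suited to modeling long-range dependencies, while ViTs handle this through self-attention at a quadratic computational cost that can be prohibitive.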

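Result: On binary classification (benign vs. malignant) of abnormality-centered ROIs from the CBIS-DDSM dataset, the method achieves strong lesion-level classification performance in an ROI-based setting.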

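Insight: The novelty lies in combining a strong CNN backbone with a linear-complexity sequence model (a state space model) to obtain both local feature extraction and efficient global context modeling, balancing computational efficiency against modeling capacity.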

Abstract: Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.


[68] IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation cs.CV | cs.AI

Haoyu Zheng, Tianwei Lin, Wei Wang, Zhuonan Wang, Wenqiao Zhang

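TL;DR: This paper proposes IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2 region expert injects lightweight tokens into a shared Qwen3.5-4B vision-language backbone, enabling industrial anomaly segmentation, region-grounded understanding, and mask-guided generation. It also builds Anomaly-56K, a comprehensive evaluation platform of 59,916 images, for unified evaluation.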

Details

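Motivation: Existing approaches cannot jointly support the three capabilities needed in industrial inspection (defect localization, natural-language explanation, and controlled defect editing) within a single framework and evaluation protocol.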

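Result: Controlled ablations on the newly built Anomaly-56K platform confirm that region grounding is decisive for understanding (removing it degrades location accuracy by >76 percentage points), that predicted-region performance is close to oracle-region performance, that region-grounded generation achieves the best image fidelity and perceptual quality, and that joint training improves understanding at negligible generation cost (-0.16 dB). The model also shows strong cross-category generalization on the MMAD benchmark, including unseen categories.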

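Insight: The novelty lies in a unified multi-task framework that couples a frozen region expert with a large vision-language model through lightweight token injection, together with a large-scale unified multi-task evaluation platform. The design decouples region-aware evidence from general vision-language capability and fuses the two efficiently, and the systematic multi-task evaluation methodology is also worth borrowing.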

Abstract: Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.


[69] SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker cs.CV | cs.AI

Junbin Su, Ziteng Xue, Shihui Zhang, Kun Chen, Weiming Hu

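TL;DR: This paper proposes SEATrack, a simple, efficient, and adaptive two-stream multimodal tracker that targets a problem in current parameter-efficient fine-tuning (PEFT) methods: performance gains often come at the expense of parameter efficiency. It approaches the problem from two complementary angles, proposing AMG-LoRA to align cross-modal matching responses and introducing a Hierarchical Mixture of Experts (HMoE) for efficient global relation modeling, and achieves a better balance of performance and efficiency on RGB-T, RGB-D, and RGB-E tracking.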

Details

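Motivation: Current PEFT methods for multimodal tracking show a trend in which performance gains are accompanied by inflated parameter budgets, eroding the efficiency promise of PEFT. The paper aims to resolve this performance-efficiency dilemma.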

Result: On RGB-T, RGB-D, and RGB-E tracking tasks, SEATrack makes notable progress over existing state-of-the-art (SOTA) methods in balancing performance with efficiency.

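Insight: The contributions are: 1) highlighting and addressing the alignment of cross-modal matching responses, an overlooked but pivotal factor, through AMG-LoRA, which combines LoRA with Adaptive Mutual Guidance to dynamically refine and align attention maps across modalities; 2) introducing a Hierarchical Mixture of Experts (HMoE) for efficient global relation modeling in place of conventional local fusion, balancing expressiveness and computational efficiency.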

Abstract: Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT’s efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.
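
AMG-LoRA builds on Low-Rank Adaptation. As a reminder of what the LoRA part contributes, the sketch below shows a generic LoRA forward pass in plain Python: the frozen weight W is augmented by a scaled low-rank product B·A. It covers only standard LoRA; the Adaptive Mutual Guidance component is not reproduced, and the matrices and the `lora_forward` name are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen, 2 x 3
A = [[0.1, 0.1, 0.1]]                     # rank r = 1
x = [1.0, 2.0, 3.0]

# With B initialised to zero, the adapted layer matches the frozen layer.
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, x))

# After training, B is non-zero and shifts the output along a rank-1 direction.
B_trained = [[1.0], [-1.0]]
print(lora_forward(W, A, B_trained, x, alpha=2.0))
```

Only A and B are trained, which adds r·(d_in + d_out) parameters per adapted layer instead of d_in·d_out.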


[70] From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception cs.CV

Jilong Zhu, Yang Feng

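TL;DR: This paper addresses the weakness of Multimodal Large Language Models (MLLMs) on fine-grained visual perception tasks, such as identifying tiny objects or subtle visual relationships, and proposes the Variational Information Flow (VIF) framework. Taking a probabilistic perspective, VIF uses a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. It works as a plug-and-play module that mitigates visual attenuation and helps the model keep its focus during deep-level decision making.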

Details

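Motivation: MLLMs lose performance on fine-grained perception tasks because of visual attenuation, in which sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during propagation.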

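Result: Extensive evaluation on benchmarks covering general VQA, fine-grained perception, and visual grounding shows that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.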

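Insight: The novelty lies in starting from the intrinsic mechanism of information flow and using probabilistic modeling (a CVAE) to explicitly capture and preserve task-relevant visual saliency, rather than relying only on input-level enhancement. The plug-and-play design integrates flexibly into existing architectures and offers a new way to address the information imbalance between the visual and language modalities.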

Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a “loss of focus” during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.


[71] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1) cs.CV | cs.AI

Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan

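TL;DR: This paper gives an overview of Track 1 (Professional Image Quality Assessment) of the 3rd Restore Any Image Model (RAIM) challenge at NTIRE 2026. The challenge explores the ability of multimodal large language models to mimic human expert cognition when evaluating high-quality image pairs, with two core tasks: comparative quality selection and generation of interpretative reasoning.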

Details

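Motivation: Conventional image quality assessment methods usually rely on a single scalar score, which struggles to distinguish subtle differences among high-quality images and offers no explanation, so it cannot guide vision tasks. The challenge aims to close this gap using the potential of MLLMs.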

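Result: The challenge attracted nearly 200 registrations and over 2,500 submissions, and the top-performing methods significantly advanced the state of the art in professional IQA.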

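Insight: The novelty lies in establishing a new benchmark that applies MLLMs to professional image quality assessment and requires models not only to compare quality but also to generate expert-level interpretative reasoning, going beyond the conventional scalar-score paradigm.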

Abstract: In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.


[72] MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models cs.CV | cs.AI

Ruoxiang Huang, Zhen Yuan

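TL;DR: MODIX is a training-free, multimodal information-driven positional index scaling framework that aims to improve the positional encoding mechanism in vision-language models. It models intra-modal information density via covariance-based entropy and assesses inter-modal interaction via cross-modal alignment, then dynamically adjusts positional strides to give informative modalities finer granularity while compressing redundant ones, without modifying model parameters or architecture.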

Details

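Motivation: Existing positional encoding mechanisms in vision-language models assign positional indices uniformly to all tokens, ignoring variations in information density within and across modalities. This leads to inefficient attention allocation, where redundant visual regions dominate and informative content is underrepresented.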

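Result: Experiments across diverse architectures and benchmarks show that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions.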

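Insight: The novelty lies in treating positional granularity as an implicit resource and dynamically adjusting positional indices to improve attention allocation. This suggests a new view in which positional encoding serves as an adaptive resource in Transformers for multimodal sequence modeling.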

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
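
The core operation, replacing uniform position indices with modality-dependent strides, can be sketched as follows. This is an illustrative assumption rather than MODIX itself: the stride values are set by hand, whereas the paper derives them from covariance-based entropy and cross-modal alignment scores that are not detailed in this abstract, and `scaled_positions` is a hypothetical helper.

```python
def scaled_positions(modalities, stride):
    """Assign each token a position by accumulating its modality's stride.

    modalities: per-token modality labels, in sequence order
    stride: mapping from modality label to positional step size
    A uniform stride of 1.0 reproduces the usual indices 0, 1, 2, ...
    """
    positions = []
    current = 0.0
    for modality in modalities:
        positions.append(current)
        current += stride[modality]
    return positions

tokens = ["text", "image", "image", "image", "image", "text"]

uniform = scaled_positions(tokens, {"text": 1.0, "image": 1.0})
# Hand-picked strides: the image segment is packed into half the positional span.
rescaled = scaled_positions(tokens, {"text": 1.0, "image": 0.5})

print(uniform)
print(rescaled)
```

Fractional positions of this kind can be consumed by rotary or other continuous positional encodings, which is one reason such rescaling needs no change to model parameters.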


[73] Cross-Attentive Multiview Fusion of Vision-Language Embeddings cs.CV

Tomas Berriel Martins, Martin R. Oswald, Javier Civera

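TL;DR: This paper proposes CAMFusion, a multiview transformer architecture for lifting vision-language models from 2D images to 3D scene understanding. It uses cross-attention to fuse vision-language descriptors from multiple viewpoints into a unified per-3D-instance embedding and uses multiview consistency as a self-supervision signal to improve performance.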

Details

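Motivation: Existing approaches usually apply vision-language models to 3D scenes by back-projecting and averaging 2D descriptors, or by heuristically selecting a single representative view, which often yields suboptimal 3D representations. The paper addresses this challenge to improve open-vocabulary 3D semantic segmentation.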

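Result: On 3D semantic and instance classification benchmarks, CAMFusion consistently outperforms naive averaging and single-view descriptor selection and achieves state-of-the-art (SOTA) results, including in zero-shot evaluations on out-of-domain datasets.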

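Insight: The main novelty is a cross-attentive multiview fusion mechanism that produces a unified 3D instance embedding, with multiview consistency used as an additional self-supervision signal during training. This offers a new approach to aggregating multi-view 2D information into better 3D representations.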

Abstract: Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
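
The difference between naive averaging and attention-based fusion of per-view descriptors can be illustrated with a single-query attention pooling step. This is a deliberately reduced sketch in plain Python, not the CAMFusion transformer: the query vector stands in for a learned parameter, and the descriptors are toy 2D values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_fusion(views):
    """Naive baseline: average the descriptors across views."""
    return [sum(column) / len(views) for column in zip(*views)]

def attention_fusion(query, views):
    """Scaled dot-product attention of one query over all view descriptors."""
    dim = len(query)
    scores = [sum(q * v for q, v in zip(query, view)) / math.sqrt(dim) for view in views]
    weights = softmax(scores)
    fused = [sum(w * view[i] for w, view in zip(weights, views)) for i in range(dim)]
    return fused, weights

# Two consistent views plus one outlier (for example, an occluded viewpoint).
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [4.0, 0.0]  # hypothetical stand-in for a learned query

averaged = mean_fusion(views)
fused, weights = attention_fusion(query, views)
print(averaged)
print(fused, weights)
```

Averaging gives the outlier view the same weight as the others, whereas the attention weights shrink its contribution to the fused instance embedding.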


[74] Evolution-Inspired Sample Competition for Deep Neural Network Optimization cs.CV

Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau

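TL;DR: This paper proposes Natural Selection (NS), an evolution-inspired optimization method for deep neural networks. It assembles multiple samples into a composite image, evaluates their competitive status, and dynamically adjusts per-sample loss weights, addressing problems caused by treating samples uniformly in conventional training: class imbalance, insufficient learning of hard samples, and erroneous reinforcement of noisy samples.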

Details

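Motivation: Conventional deep network training usually optimizes all samples under a uniform paradigm without explicitly modeling the heterogeneous competition among them, which can cause bias under class imbalance, insufficient learning of hard samples, and erroneous reinforcement of noisy samples.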

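Result: Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the method. NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating strong generality and practical potential.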

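Insight: The novelty lies in an evolution-inspired competition mechanism that dynamically reweights the loss according to each sample's competitive status within its group. This goes beyond conventional sample reweighting strategies based on predefined heuristics or static criteria and enables more adaptive and balanced model optimization.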

Abstract: Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.
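
The first step of NS, assembling several samples into a composite image and rescaling it to the original input size, can be sketched for the simplest case of four grayscale images in a 2x2 grid. The grid layout, the average-pooling rescale, and the function names are assumptions made for illustration; the abstract does not specify them, and the natural selection score itself is not reproduced here.

```python
def tile_2x2(images):
    """Place four equally sized images (lists of pixel rows) in a 2x2 grid."""
    top_left, top_right, bottom_left, bottom_right = images
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom

def downscale_2x(image):
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]

# Four constant 2x2 "images" so the result is easy to read.
samples = [[[value] * 2 for _ in range(2)] for value in (0, 1, 2, 3)]

composite = tile_2x2(samples)        # 4x4 grid holding all four samples
model_input = downscale_2x(composite)  # back to the original 2x2 input size
print(model_input)
```

Each sample ends up occupying one quadrant of the model input, so a single forward pass sees all four samples competing for the prediction.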


[75] Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models cs.CV

Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin

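TL;DR: This paper proposes Decoder-side Temporal Rebalancing (DTR), a training-free inference method for mitigating hallucinations in Video Large Language Models (Video-LLMs). It adaptively calibrates visual attention in decoder layers to reduce the model's over-reliance on the anchor frame, encouraging more balanced use of temporal evidence across the video when generating responses.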

Details

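Motivation: During generation, existing Video-LLMs tend to over-rely on a limited portion of temporal evidence (the anchor frame), leading to imbalanced aggregation of temporal evidence that is closely associated with hallucination. Existing mitigation methods mostly depend on training or additional modules and overlook this more fundamental decoder-side structural bias.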

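Result: Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across different Video-LLM families while preserving competitive video understanding performance and high inference efficiency.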

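Insight: The novelty lies in revealing, from the decoder side, that Video-LLMs exhibit an input-independent, model-specific structural bias in temporal attention (anchor-frame dominance), and in proposing a lightweight, training-free, inference-time correction (DTR) that selectively intervenes on attention in middle-to-late decoder layers to rebalance the allocation of temporal evidence. This offers a new and efficient perspective on hallucination mitigation.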

Abstract: Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
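
The anchor frame is defined as the frame with the highest aggregated frame-level attention mass. The sketch below computes it from a toy attention layout and then applies one simple rebalancing rule. The rule (interpolating each frame's mass toward the uniform share) and the `strength` parameter are hypothetical choices for illustration only; the abstract does not describe DTR's actual calibration.

```python
def frame_masses(attention):
    """Sum the attention assigned to the visual tokens of each frame."""
    return [sum(frame) for frame in attention]

def anchor_frame(attention):
    """Index of the frame with the highest aggregated attention mass."""
    masses = frame_masses(attention)
    return max(range(len(masses)), key=masses.__getitem__)

def rebalance(attention, strength=0.5):
    """Assumed rule: move each frame's mass toward the uniform share.

    Token weights inside a frame keep their relative proportions. Frames
    with zero mass are left at zero in this simplified version.
    """
    masses = frame_masses(attention)
    uniform_share = sum(masses) / len(masses)
    rebalanced = []
    for frame, mass in zip(attention, masses):
        target = (1 - strength) * mass + strength * uniform_share
        scale = target / mass if mass > 0 else 0.0
        rebalanced.append([weight * scale for weight in frame])
    return rebalanced

# attention[f][t]: attention from the current decoding step to token t of frame f
attention = [[0.05, 0.05], [0.40, 0.30], [0.10, 0.10]]

print(anchor_frame(attention))
print(frame_masses(rebalance(attention)))
```

In this example frame 1 holds 70% of the visual attention; after rebalancing it remains the largest but under-attended frames receive a larger share, and the total mass is unchanged.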


[76] Efficient Semantic Image Communication for Traffic Monitoring at the Edge cs.CV | cs.AI | cs.NI

Damir Assylbek, Nurmukhammed Aitymbetov, Marko Ristin, Dimitrios Zorbas

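TL;DR: This paper proposes two semantic image communication methods for traffic monitoring at the edge, MMSD and SAMR, which reduce transmission cost while preserving key visual information. MMSD converts the image into compact semantic representations (segmentation maps, edge maps, and textual descriptions) and reconstructs the scene with a diffusion model, achieving high compression together with data confidentiality. SAMR selectively masks non-critical regions according to semantic importance before JPEG encoding and restores the content through generative inpainting, providing better visual quality while keeping compression high.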

Details

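Motivation: Under strict communication constraints, transmitting high-resolution images in visual monitoring systems is impractical and often unnecessary, because such systems usually care about semantic information such as object presence, spatial relationships, and scene context rather than exact pixel fidelity.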

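Result: On a Raspberry Pi 5, edge-side processing takes about 15 s for MMSD and 9 s for SAMR, and average transmitted data is reduced by 99% and 99.1%, respectively. MMSD achieves a smaller payload than the recent SPIC baseline while preserving strong semantic consistency, and SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.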

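Insight: The novelty lies in an asymmetric sender-receiver architecture that keeps lightweight processing at the edge and offloads computationally intensive reconstruction to the server. MMSD achieves high compression and confidentiality through multi-modal semantic decomposition, while SAMR balances visual quality against compression through semantic-aware masking and reconstruction, offering a semantics-driven compression paradigm for edge visual communication.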

Abstract: Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15 s for MMSD and 9 s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.
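
The sender side of SAMR suppresses non-critical regions before encoding, and the reported reductions are a ratio of payload size to original size. Both ideas are sketched below in plain Python under stated assumptions: the class labels, the set of critical classes, the zero fill value, and the byte counts are all invented for illustration, and real JPEG encoding and inpainting are out of scope.

```python
def mask_non_critical(image, labels, critical_classes):
    """Zero out pixels whose semantic label is not in critical_classes."""
    return [
        [pixel if label in critical_classes else 0 for pixel, label in zip(row, label_row)]
        for row, label_row in zip(image, labels)
    ]

def reduction_percent(original_bytes, payload_bytes):
    """Transmitted-data reduction relative to the original image size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (1 - payload_bytes / original_bytes)

image = [[120, 200, 35], [118, 210, 40]]
labels = [["road", "car", "sky"], ["road", "car", "sky"]]  # hypothetical segmentation

masked = mask_non_critical(image, labels, critical_classes={"road", "car"})
print(masked)

# Hypothetical sizes: a 1,000,000-byte frame sent as a 10,000-byte payload.
print(f"{reduction_percent(1_000_000, 10_000):.1f}% reduction")
```

Large uniform regions compress well under JPEG, which is why masking before encoding lowers the payload; the receiver is then responsible for filling the masked areas.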


[77] GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning cs.CV | cs.CL

Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang

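TL;DR: This paper proposes GeoAlign, a framework that dynamically aggregates multi-layer geometric features to address the limitations of multimodal large language models on spatial reasoning tasks. It builds a hierarchical geometric feature bank and uses the MLLM's original visual tokens as content-aware queries for layer-wise sparse routing, adaptively fetching suitable geometric features for each image patch.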

Details

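Motivation: Existing methods ease the spatial reasoning difficulties of MLLMs by injecting geometric features from 3D foundation models, but they rely on static single-layer extraction. This induces a task misalignment bias: geometric features naturally evolve toward 3D pretraining objectives, which may conflict with the heterogeneous spatial demands of MLLMs, so any single layer is fundamentally insufficient.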

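Result: Extensive experiments on VSI-Bench, ScanQA, and SQA3D show that the proposed compact 4B model achieves state-of-the-art performance, even outperforming larger existing MLLMs.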

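Insight: The novelty lies in identifying the task misalignment bias and proposing a dynamic multi-layer geometric feature alignment framework that integrates geometric information adaptively through content-aware sparse routing, matching the actual spatial reasoning demands of MLLMs more precisely.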

Abstract: Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM’s original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
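
Layer-wise sparse routing, in which each patch's visual token acts as a query that selects a few layers from a geometric feature bank, can be sketched as a top-k softmax mixture. This is a minimal illustration rather than GeoAlign's implementation: dot-product scoring without learned projections, the `top_k` value, and the toy features are all assumptions.

```python
import math

def route_patch(query, layer_features, top_k=2):
    """Mix the top_k best-matching layers' features for one patch.

    query: the patch's visual token (acts as a content-aware query)
    layer_features: one feature vector per layer of the geometric model
    Returns the fused feature and a {layer_index: weight} mapping.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in layer_features]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    weights = {i: value / total for i, value in exps.items()}
    dim = len(layer_features[0])
    fused = [sum(weights[i] * layer_features[i][j] for i in chosen) for j in range(dim)]
    return fused, weights

query = [1.0, 0.0]
feature_bank = [[0.2, 0.9], [1.0, 0.0], [0.8, 0.1], [-0.5, 0.5]]  # 4 layers, toy values

fused, weights = route_patch(query, feature_bank)
print(weights)  # only two layers receive non-zero weight
print(fused)
```

Because routing is done per patch, different image regions can draw on different layers, which is the contrast with static single-layer extraction described above.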


[78] Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis cs.CV | cs.MM

Miao Liu, Fangda Wei, Jing Wang, Xinyuan Qian

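TL;DR: This paper introduces Listening Deepfake Detection (LDD), a new task that targets forged listening states in interactive scenarios, whereas conventional deepfake detection focuses on forged speaking states. The authors build ListenForge, the first dataset dedicated to LDD, and propose MANet, a motion-aware and audio-guided network that captures subtle motion inconsistencies in listener videos and uses the semantics of the speaker's audio to guide cross-modal fusion.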

Details

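Motivation: Existing deepfake detection research concentrates on forged speaking states, but in realistic interactions attackers alternate between falsifying speaking and listening states to be more deceptive. Detection of listening deepfakes remains largely unexplored and lacks both datasets and methods, while the relatively low quality of synthesized listening reactions offers an opening for detection.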

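Result: Extensive experiments show that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios, whereas MANet achieves significantly better performance on the ListenForge dataset.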

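Insight: The main novelty lies in extending deepfake detection from the traditional speaking-centric view to listening states in interaction, supported by the first dedicated dataset and a purpose-built network, MANet. The network handles the distinctive characteristics of listening forgeries through motion-aware, audio-guided cross-modal fusion, opening new directions for multimodal forgery analysis in interactive communication settings.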

Abstract: Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker’s appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of ‘listening deepfakes’ remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker’s audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.


[79] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning cs.CV | cs.AIPDF

Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang

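TL;DR: This paper proposes PromptEcho, a reward construction method that requires neither human annotation nor reward-model training, for improving the prompt-following ability of text-to-image models. Using a frozen vision-language model, it computes the token-level cross-entropy loss with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during pretraining as the reward signal.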

Details

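Motivation: It addresses the difficulty of obtaining high-quality reward signals in existing reinforcement learning approaches: CLIP Score is too coarse-grained, while VLM-based reward models require costly human-annotated preference data and additional fine-tuning.

Result: On two state-of-the-art T2I models, PromptEcho achieves substantial improvements on the proposed DenseAlignBench benchmark, raising net win rate by 26.8 and 16.2 percentage points respectively, with consistent gains on GenEval, DPG-Bench, and TIIFBench and no task-specific training. Ablation studies confirm that it comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size.

Insight: The novelty lies in extracting token-level alignment knowledge directly from a frozen pretrained VLM as a deterministic reward, with no extra training or annotation; it is computationally efficient and improves automatically as open-source VLMs get stronger. The paper also introduces DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following.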

Abstract: Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
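
Illustrative sketch (not the authors' code): the token-level cross-entropy reward described in the abstract could be computed roughly as follows, assuming the frozen VLM's logits at the prompt positions are already available as an array. The function name `prompt_echo_reward`, the mean reduction, and the sign convention (reward = negated loss) are assumptions made for illustration.

```python
import numpy as np

def prompt_echo_reward(logits: np.ndarray, prompt_ids: np.ndarray) -> float:
    """Negative mean token-level cross-entropy of the prompt tokens.

    logits:     (T, V) scores a frozen VLM assigns at each prompt position,
                conditioned on the generated image and a guiding query.
    prompt_ids: (T,) token ids of the original prompt, used as labels.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(prompt_ids)), prompt_ids]
    # Assumption: the reward is the negated mean loss, so lower loss = higher reward.
    return float(-token_nll.mean())
```

Because the VLM is frozen and no sampling is involved, the same image and prompt always give the same value, which matches the abstract's description of the reward as deterministic.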


[80] Hypergraph-State Collaborative Reasoning for Multi-Object Tracking cs.CVPDF

Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang, Xinchao Wang

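TL;DR: This paper proposes a multi-object tracking framework based on hypergraph-state collaborative reasoning, which makes motion estimation more robust by jointly inferring the motion states of multiple correlated objects. The framework uses a hypergraph module to capture spatial motion correlations and a state space model to enforce temporal smoothness, yielding stable trajectory association under occlusion and noise.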

Details

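Motivation: Existing motion estimation methods have two main limitations: instability caused by noisy or probabilistic predictions, and trajectories that fragment easily under occlusion. The paper aims to overcome these through collaborative reasoning across multiple objects, improving tracking continuity and robustness.

Result: Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering varied motion patterns and scene complexities, show that the method achieves state-of-the-art performance across a wide range of tracking scenarios.

Insight: The novelty is a collaborative reasoning framework that models spatial motion correlations dynamically with a hypergraph and unifies spatio-temporal reasoning with a state space model, jointly optimizing spatial consensus and temporal coherence so that trajectories stay continuous and stable under occlusion and noise.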

Abstract: Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.
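
Illustrative sketch (not HyperSSM itself): the idea that objects with similar motion "mutually constrain and refine each other" can be pictured with a plain, unweighted hypergraph averaging step. The function name `hypergraph_smooth` and the fixed binary incidence matrix are assumptions; the paper's hyperedges are dynamic and learned, and the SSM part is not shown.

```python
import numpy as np

def hypergraph_smooth(velocities: np.ndarray, incidence: np.ndarray) -> np.ndarray:
    """Average motion states within hyperedges, then back onto objects.

    velocities: (N, 2) per-object motion estimates.
    incidence:  (N, E) binary matrix; incidence[i, e] = 1 if object i is in hyperedge e.
    """
    edge_deg = incidence.sum(axis=0, keepdims=True)   # objects per hyperedge
    node_deg = incidence.sum(axis=1, keepdims=True)   # hyperedges per object
    # Node -> hyperedge: mean motion of each group, shape (E, 2).
    edge_mean = (incidence / np.maximum(edge_deg, 1)).T @ velocities
    # Hyperedge -> node: each object takes the mean of its groups, shape (N, 2).
    return (incidence / np.maximum(node_deg, 1)) @ edge_mean
```

With two objects sharing one hyperedge, a noisy velocity estimate for either is pulled toward the group mean, while an object in its own hyperedge is left unchanged.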


[81] AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition cs.CVPDF

Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang

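TL;DR: This paper proposes AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework for fine-grained multimodal emotion recognition. It comprises three jointly optimized specialized agents that collaborate to retrieve cross-modal samples, assess evidence, and generate predictions, addressing the problem that single-round retrieval-augmented generation is susceptible to modal ambiguity and struggles to capture complex cross-modal affective dependencies.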

Details

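Motivation: LLM-based multimodal emotion recognition relies on static parametric memory and tends to hallucinate when interpreting nuanced affective states. Single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities.

Result: Extensive experiments on the MER-UniBench benchmark show that AffectAgent achieves superior performance in complex scenarios.

Insight: The novelty is a framework in which three specialized agents (a query planner, an evidence filter, and an emotion generator) reason collaboratively and are optimized end-to-end using MAPPO with a shared affective reward. The paper also introduces MB-MoE, which dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, and RAAF, which enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings.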

Abstract: LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.


[82] Scaling In-Context Segmentation with Hierarchical Supervision cs.CVPDF

T. Camaret Ndir, Marco Reisert, Robin T. Schirrmeister

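TL;DR: This paper proposes PatchICL, a hierarchically supervised framework that addresses the computational cost of in-context learning (ICL) for medical image segmentation. Through selective image patching and multi-level supervision, it actively identifies and attends only to the most informative anatomical regions, substantially reducing compute while maintaining segmentation accuracy.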

Details

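Motivation: Standard ICL methods rely on dense, global cross-attention, whose computational cost rises sharply with image resolution, while existing localized attention mechanisms lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. The paper aims to develop a more efficient and scalable ICL framework that reduces the clinical annotation burden and improves model adaptability.

Result: At 512×512 resolution, compared with the strong baseline UniverSeg, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44%. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on OCT and dermoscopy, modalities dominated by localized pathology.

Insight: The novelty is combining selective image patching with hierarchical supervision, using explicit supervision to steer the model toward informative regions and thereby allocate compute more effectively. This offers a new approach to efficient in-context learning, and its active region selection mechanism could extend to other vision tasks that process high-resolution images.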

Abstract: In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at $512\times512$ resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation


[83] ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search cs.CV | cs.AI | cs.MAPDF

Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon

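TL;DR: ARGOS is the first benchmark and framework to reformulate multi-camera person search as an interactive reasoning problem, requiring an agent to plan, ask questions, and eliminate candidates under information asymmetry. The agent receives a vague witness statement and must decide, within a limited turn budget, what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks (semantic perception, spatial reasoning, and temporal reasoning). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.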

Details

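Motivation: It addresses the challenges of information asymmetry and ambiguity in conventional multi-camera person search by recasting the problem as one requiring interactive reasoning by an agent, which better matches real surveillance settings where searches start from incomplete witness information.

Result: On the benchmark of 2,691 tasks, experiments with four LLM backbones yield a best task-weighted score (TWS) of 0.383 on Track 2 (spatial reasoning) and 0.590 on Track 3 (temporal reasoning), showing the benchmark is far from solved. Ablations show that removing domain-specific tools, such as the spatio-temporal tools, drops accuracy by up to 49.6 percentage points.

Insight: The novelty is recasting multi-camera person search as an agent-based interactive reasoning task and introducing the Spatio-Temporal Topology Graph (STTG) to ground the reasoning. Viewed objectively, the benchmark framework and progressive track design (Who, Where, When) provide a new paradigm and test bed for evaluating agents' planning and reasoning in complex, information-asymmetric environments.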

Abstract: We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
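
Illustrative sketch (not the ARGOS implementation): a graph that encodes camera connectivity and transition times can be used to eliminate candidates whose sightings are not reachable in time. The camera names, time windows, and the helpers `is_feasible` and `eliminate` below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sighting:
    candidate: str
    camera: str
    time_s: float

# Hypothetical STTG: directed camera edges with (min, max) transition times in seconds.
STTG = {
    ("cam_A", "cam_B"): (20.0, 60.0),
    ("cam_B", "cam_C"): (30.0, 90.0),
}

def is_feasible(first: Sighting, second: Sighting, sttg=STTG) -> bool:
    """True if moving from `first` to `second` matches a graph edge and its time window."""
    window = sttg.get((first.camera, second.camera))
    if window is None:
        return False  # cameras are not connected in the graph
    lo, hi = window
    return lo <= second.time_s - first.time_s <= hi

def eliminate(anchor: Sighting, sightings: list[Sighting]) -> list[str]:
    """Keep candidates whose sighting is reachable from the anchor sighting."""
    return [s.candidate for s in sightings if is_feasible(anchor, s)]
```

A spatial tool would answer the connectivity question (is there an edge?) and a temporal tool the time-window question; this sketch folds both into one check for brevity.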


[84] CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models cs.CV | cs.AIPDF

Yunkai Dang, Yizhu Jiang, Yifan Jiang, Qi Fan, Yinghuan Shi

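TL;DR: This paper proposes CLASP, a framework that dynamically reduces visual token redundancy in multimodal large language models through class-adaptive layer fusion and dual-stage pruning, lowering computational overhead.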

Details

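Motivation: Existing methods typically use single-layer ViT features and static pruning strategies, which adapt poorly to diverse instructions and make model performance unstable.

Result: Across multiple benchmarks, pruning ratios, and MLLM architectures, CLASP outperforms existing methods and achieves more efficient visual token compression.

Insight: The novelty is class-adaptive multi-layer visual feature fusion together with dual-stage pruning (attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage), which supports prompt-conditioned feature fusion and budget allocation and yields robust visual token compression.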

Abstract: Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
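
Illustrative sketch (not the CLASP implementation): one way to picture dual-stage pruning is to spend part of the token budget on the highest-attention tokens (pivots) and the rest on tokens least similar to those already kept (completion). The function `dual_stage_prune`, the `pivot_ratio` knob, and the greedy cosine-similarity rule are assumptions; the paper's class-adaptive budget allocation is not modelled here.

```python
import numpy as np

def dual_stage_prune(tokens, attn, budget, pivot_ratio=0.5):
    """Return sorted indices of kept visual tokens.

    tokens: (N, D) visual token features; attn: (N,) saliency scores.
    pivot_ratio is a hypothetical knob splitting the budget between stages.
    """
    n_pivot = min(budget, max(1, int(round(budget * pivot_ratio))))
    # Stage 1: pivot tokens = highest attention scores.
    kept = list(np.argsort(-attn)[:n_pivot])
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    # Stage 2: completion tokens = greedily add the token least similar to the kept set.
    while len(kept) < min(budget, len(tokens)):
        max_sim = (unit @ unit[kept].T).max(axis=1)  # similarity to the closest kept token
        max_sim[kept] = np.inf                        # never re-select a kept token
        kept.append(int(np.argmin(max_sim)))
    return sorted(int(i) for i in kept)
```

The second stage is what keeps coverage: near-duplicates of a pivot score high similarity and are skipped, even if their attention is high.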


[85] Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling cs.CV | cs.AIPDF

Huanzhen Wang, Ziheng Zhou, Zeng Tao, Aoxing Li, Yingkai Zhao

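TL;DR: This paper proposes a cognition-inspired Dual-stream Semantic Enhancement model (DuSE) for dynamic facial expression recognition. By emulating the neuro-cognitive mechanisms through which the human brain processes emotion perception, the model aligns linguistic cues with fine-grained temporal features of facial dynamics and integrates sensory input with conceptual knowledge, improving the accuracy and interpretability of dynamic emotion modeling.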

Details

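Motivation: Existing vision-based dynamic emotion modeling methods often neglect theories of emotion perception and cognition, leaving a gap with how the human brain processes emotion. The paper aims to build a dynamic emotion modeling framework more consistent with neuroscience by emulating the brain's neuro-cognitive mechanisms.

Result: Extensive experiments on challenging in-the-wild benchmarks validate the cognition-centric approach, which achieves state-of-the-art performance and improves model interpretability.

Insight: The novelty is computationally incorporating cognitive theories (such as the cognitive priming effect and the Conceptual Act Theory) into dynamic emotion modeling. A dual-stream architecture (HTPC and LSEA) models the modulating role of linguistic cues and the knowledge integration process respectively, giving a more neurally plausible framework for dynamic facial expression recognition.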

Abstract: The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain’s strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.


[86] Efficient Adversarial Training via Criticality-Aware Fine-Tuning cs.CV | cs.AIPDF

Wenyun Li, Zheng Zhang, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan

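TL;DR: This paper proposes Criticality-Aware Adversarial Training (CAAT), an efficient adversarial training method that improves the adversarial robustness of Vision Transformer (ViT) models by fine-tuning only a small subset of critical parameters, substantially reducing computational cost.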

Details

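Motivation: As ViT models grow in parameter count, their adversarial robustness does not improve proportionally, and conventional adversarial training requires fine-tuning the entire model at prohibitive computational cost. The paper aims to reach robustness comparable to standard adversarial training at lower cost by fine-tuning only a subset of critical parameters.

Result: Experiments on three widely used adversarial learning datasets show that CAAT, tuning only about 6% of parameters, loses only 4.3% in adversarial robustness relative to plain adversarial training and outperforms state-of-the-art lightweight adversarial training methods.

Insight: The novelty is a criticality-aware mechanism that adaptively identifies the parameters contributing most to adversarial robustness and allocates resources to them, combined with parameter-efficient fine-tuning (PEFT), offering an efficient and practical route to adversarial training for large-scale models.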

Abstract: Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale. For example, compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.
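
Illustrative sketch (not the CAAT implementation): the selection rule in the abstract, adjusting only weight matrices whose number of critical parameters exceeds a threshold, might look like the following. The function `select_modules`, the global top-fraction cutoff, and the module names are assumptions; how criticality scores are actually computed is not stated in the abstract and is left outside the sketch.

```python
import numpy as np

def select_modules(scores: dict[str, np.ndarray], top_frac: float, min_count: int) -> list[str]:
    """Pick weight matrices holding more than `min_count` critical parameters.

    scores: per-parameter criticality for each named matrix (assumed given;
            e.g. |weight * gradient| under an adversarial loss would be one option).
    """
    flat = np.concatenate([s.ravel() for s in scores.values()])
    k = max(1, int(top_frac * flat.size))
    cutoff = np.partition(flat, -k)[-k]  # k-th largest score across all matrices
    counts = {name: int((s >= cutoff).sum()) for name, s in scores.items()}
    return [name for name, c in counts.items() if c > min_count]
```

Only the returned modules would then receive PEFT adapters; all other weights stay frozen during adversarial fine-tuning.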


[87] Generative Anonymization in Event Streams cs.CV | cs.LGPDF

Adam T. Müller, Mihai Kocsis, Nicolaj C. Stache

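TL;DR: This paper proposes a generative anonymization framework for event streams, aimed at the trade-off between data privacy and data utility that arises when neuromorphic vision sensors are deployed in public spaces. The method projects asynchronous event streams into an intermediate intensity representation, uses pretrained models to synthesize realistic, non-existent identities, and re-encodes the result back into the neuromorphic domain, protecting identity privacy while preserving the structural data integrity required for downstream vision tasks.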

Details

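Motivation: Neuromorphic vision sensors offer low latency and high dynamic range, but deploying them in public spaces raises serious data privacy concerns. Existing Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities, while current obfuscation methods such as masking or scrambling corrupt the spatio-temporal structure and severely degrade data utility for downstream perception tasks.

Result: Experiments show that the method effectively prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. The paper also introduces a novel synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

Insight: The paper's main novelty is what the authors describe as the first generative anonymization framework for event streams, which balances privacy protection and data utility by bridging the modality gap between asynchronous events and standard spatial generative models. Viewed objectively, the pipeline design (mapping event streams to an intermediate representation and using pretrained generative models for identity replacement) and the new benchmark built on a synchronized multimodal dataset are both worth borrowing.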

Abstract: Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.


[88] Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors cs.CVPDF

Feiyu Tan, Heran Yang, Qihong Duan, Kai Ye, Qi Xie

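TL;DR: This paper proposes an image-to-image translation framework embedded with rotation symmetry priors, introducing rotation group equivariant convolutions to build a rotation-equivariant I2I network that preserves rotation symmetry throughout processing. It also proposes transformation learnable equivariant convolutions (TL-Conv), which adaptively learn transformation groups and improve symmetry preservation across datasets.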

Details

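Motivation: It addresses the challenges posed by the lack of paired data and of unsupervised learning frameworks in image-to-image translation by introducing transformation symmetry priors to improve network performance, targeting in particular the rotation symmetry inherent in natural and scientific images.

Result: Extensive experiments across multiple I2I tasks validate the method's effectiveness and superior performance, indicating the potential of equivariant networks to improve generation quality and apply broadly.

Insight: The novelties include what the authors describe as the first application of rotation group equivariant convolutions to an I2I framework, and the proposed TL-Conv, which adaptively learns transformation groups. The paper also provides a theoretical analysis of TL-Conv's equivariance error, showing exact equivariance in continuous domains and a bound on the error in discrete cases.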

Abstract: Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and of unsupervised learning frameworks still hinders their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve a rotation-equivariant I2I framework, a novel contribution, to the best of our knowledge, along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study on image symmetry priors on real datasets and propose novel transformation learnable equivariant convolutions (TL-Conv) that adaptively learn transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and providing a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and their broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I
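
Illustrative sketch (not the paper's network or TL-Conv): rotation equivariance can be checked numerically on the smallest example, a lifting convolution over the four 90-degree rotations with circular padding. The helpers `circ_corr` and `c4_lifting_conv` are written for this illustration; on the discrete grid with 90-degree rotations the property holds exactly, which is the easy case compared with the learnable transformation groups the paper studies.

```python
import numpy as np

def circ_corr(x, k):
    """Circular cross-correlation of a square image x with an odd-sized kernel k."""
    c = k.shape[0] // 2
    out = np.zeros_like(x, dtype=float)
    for a in range(k.shape[0]):
        for b in range(k.shape[1]):
            # np.roll by (c - a, c - b) reads x at offset (a - c, b - c).
            out += k[a, b] * np.roll(x, shift=(c - a, c - b), axis=(0, 1))
    return out

def c4_lifting_conv(x, k):
    """Stack responses to the kernel rotated by 0, 90, 180 and 270 degrees."""
    return np.stack([circ_corr(x, np.rot90(k, r)) for r in range(4)])

rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=(6, 6)).astype(float)
k = rng.integers(-2, 3, size=(3, 3)).astype(float)

out = c4_lifting_conv(x, k)
out_rot = c4_lifting_conv(np.rot90(x), k)
# Equivariance: rotating the input rotates each map and cyclically shifts the rotation channel.
expected = np.stack([np.rot90(out[(r - 1) % 4]) for r in range(4)])
equivariant = np.allclose(out_rot, expected)
```

An ordinary convolution with an arbitrary kernel would fail this check, which is why the rotated copies of the kernel are stacked as an extra channel axis.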


[89] DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment cs.CV | cs.MMPDF

Xinyue Li, Shubo Xu, Zhichao Zhang, Zhaolin Cai, Yitong Chen

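TL;DR: This paper proposes DPC-VQA, a framework for video quality assessment that decouples perception from calibration. It uses a frozen multimodal large language model to provide a base quality estimate and perceptual prior, and a lightweight calibration branch to predict a residual correction that adapts to the target scenario, avoiding costly end-to-end retraining.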

Details

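Motivation: Existing MLLM-based VQA methods are expensive to adapt to new scenarios, requiring large-scale retraining and costly mean opinion score (MOS) annotations. The paper argues that a pretrained MLLM already provides a useful perceptual prior, and that the key challenge is calibrating it efficiently to the target MOS space.

Result: Extensive experiments on user-generated content and AI-generated content benchmarks show that DPC-VQA achieves performance competitive with representative baselines while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods, and that it remains effective with only 20% of MOS labels.

Insight: The novelty is decoupling quality perception from residual calibration: a frozen MLLM serves as the perceptual prior and a lightweight adapter performs efficient calibration. This lowers data and compute costs and suggests a new approach to efficient task-specific adaptation of MLLMs.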

Abstract: Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
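
Illustrative sketch (not the DPC-VQA implementation): the decoupling can be written as final score = frozen base estimate + predicted residual. Here a ridge-regularised linear model stands in for the lightweight calibration branch, which is an assumption; the function names `fit_residual_calibrator` and `predict_quality` and the feature inputs are hypothetical.

```python
import numpy as np

def fit_residual_calibrator(features, base_scores, mos, ridge=1e-3):
    """Fit a linear map predicting the residual (MOS - base score) from features.

    A ridge-regularised linear model stands in for the calibration branch;
    that choice is an assumption of this sketch.
    """
    X = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
    residual = mos - base_scores
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ residual)

def predict_quality(features, base_scores, weights):
    """Final score = frozen base estimate + predicted residual correction."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return base_scores + X @ weights
```

Only `weights` is trained, so the number of trainable parameters stays small and the base estimator is never touched.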


[90] Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks cs.CVPDF

Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang

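TL;DR: This paper proposes Multimodal Semantic Lighting Attacks (MSLA), a physically deployable adversarial attack framework, in what it presents as the first systematic study of physical-world attacks against vision-language models (VLMs). MSLA uses controllable adversarial lighting to disrupt multimodal semantic alignment in real scenes, degrading the zero-shot classification performance of models such as CLIP and inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP on image captioning and visual question answering. Experiments show the attack is effective, transferable, and physically realizable, revealing a serious vulnerability of VLMs in the physical world.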

Details

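Motivation: Existing adversarial attack research focuses almost exclusively on the digital setting, while VLMs are increasingly deployed in the real world and their physical security has not been systematically evaluated. Physically realizable adversarial perturbations may cause recognition failures and disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks, so studying physical attacks on VLMs is essential for assessing their real-world security risks.

Result: Extensive experiments in both digital and physical domains show that MSLA effectively degrades the zero-shot classification performance of mainstream CLIP variants and induces severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP on image captioning and visual question answering; the attack is transferable and physically realizable.

Insight: The novelty is what the authors present as the first physically deployable adversarial attack framework against VLMs, which uses controllable lighting to attack multimodal semantic alignment rather than only task-specific outputs. It reveals that VLMs are highly vulnerable to physical semantic attacks, offering a new perspective on, and underscoring the urgent need for, physical-world robustness evaluation of VLMs.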

Abstract: Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.


[91] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis cs.CV | cs.CLPDF

Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla

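TL;DR: This paper proposes SceneCritic, a symbolic evaluator for 3D indoor scene synthesis. It is grounded in SceneOnto, a structured spatial ontology that aggregates priors from multiple datasets and is used to verify the semantic, orientation, and geometric coherence of scene layouts. The study also builds an iterative refinement test bed comparing how three critic modalities (rule-based, LLM, and VLM) help models revise spatial structure.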

Details

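Motivation: LLMs and VLMs currently generate indoor scenes through intermediate structures such as layouts and scene graphs, but evaluation relies mainly on LLM or VLM judges scoring rendered views. Such judgments are sensitive to viewpoint, prompt phrasing, and hallucination, making evaluation unstable and making it hard to tell whether a scene is spatially plausible.

Result: Experiments show that (a) SceneCritic agrees substantially better with human judgments than VLM-based evaluators when assessing layout quality; (b) text-only LLMs can outperform VLMs on semantic layout quality; and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

Insight: The novelty is a symbolic evaluation framework grounded in a structured spatial ontology (SceneOnto) that provides fine-grained object-level and relationship-level assessments and identifies specific violations. By comparing critic modalities, the paper also reveals the potential of text-only LLMs for semantic evaluation and the advantage of VLMs in iterative refinement.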

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic’s constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneOnto traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
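
Illustrative sketch (not SceneCritic's code): the rule-based critic that uses collision constraints as feedback can be pictured as a pairwise overlap test on floor-plan footprints that names each violating pair. The `Footprint` type, the axis-aligned simplification (rotation ignored), and the example dimensions are assumptions.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Footprint:
    name: str
    x: float  # centre x in metres
    y: float  # centre y in metres
    w: float  # extent along x
    d: float  # extent along y

def overlaps(a: Footprint, b: Footprint) -> bool:
    """Axis-aligned footprints collide when they overlap on both axes."""
    return (abs(a.x - b.x) * 2 < a.w + b.w) and (abs(a.y - b.y) * 2 < a.d + b.d)

def collision_violations(layout: list[Footprint]) -> list[tuple[str, str]]:
    """List every colliding object pair so feedback can name specific violations."""
    return [(a.name, b.name) for a, b in combinations(layout, 2) if overlaps(a, b)]
```

Returning the offending pairs, rather than a single scalar score, mirrors the abstract's emphasis on object-level and relationship-level assessments.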


[92] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization cs.CV | cs.LGPDF

Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm

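TL;DR: This paper proposes VideoFlexTok, a flexible-length, coarse-to-fine visual tokenizer for video. It represents a video as a variable-length token sequence in which the first tokens capture abstract information (such as semantics and motion) and later tokens add detail. This representation allows the token count to be adapted to downstream needs and longer videos to be encoded within the same budget.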

Details

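Motivation: It addresses the limitations of existing video tokenizers, which represent a video as a fixed 3D grid of tokens: the downstream model must learn to predict all low-level details, which makes learning complex, and the representation cannot be adjusted to the video's inherent complexity or the downstream task's needs.

Result: Evaluated on class-to-video and text-to-video generation, VideoFlexTok trains more efficiently than 3D grid tokens, for example reaching comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). It also enables training a text-to-video model on 10-second, 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Insight: The core novelty is a coarse-to-fine, variable-length token sequence representation for video, made possible by a generative flow decoder. It allows representation capacity to be allocated according to video content and task needs, improves representation and training efficiency, and offers a practical route to long video generation.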

Abstract: Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details “pixel-by-pixel” irrespective of the video’s inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner – where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.


[93] Towards Long-horizon Agentic Multimodal Search cs.CV | cs.AIPDF

Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li

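TL;DR: This paper proposes LMM-Searcher, a novel long-horizon multimodal deep search framework that addresses the heterogeneous information and context explosion multimodal agents face in long-sequence tasks. Its core innovation is a file-based visual representation mechanism that offloads visual assets to an external file system and maps them to lightweight textual identifiers (UIDs), reducing context overhead. With a tailored fetch-image tool, a progressive on-demand visual loading strategy, and a data synthesis pipeline that generates queries requiring complex cross-modal multi-hop reasoning, the authors fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent.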

Details

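Motivation: It addresses a key challenge for existing multimodal deep search agents on long-horizon tasks: handling heterogeneous textual and visual information incurs high token costs and leads to context explosion or the loss of crucial visual signals.

Result: Extensive experiments on four benchmarks show that the method scales to 100-turn search horizons, achieves state-of-the-art (SOTA) performance among open-source models on challenging long-horizon benchmarks such as MM-BrowseComp and MMSearch-Plus, and generalizes well across different base models.

Insight: The main novelties are: 1) a file-based visual representation mechanism that uses UID mapping for lightweight management of, and long-term access to, visual information; 2) a progressive on-demand visual loading strategy that supports active perception; and 3) a data synthesis pipeline that generates queries requiring complex cross-modal multi-hop reasoning for high-quality instruction tuning. Viewed objectively, decoupling heavy visual data from the lightweight reasoning context is an architectural idea that others building efficient long-sequence multimodal agents can borrow.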

Abstract: Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.
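
Illustrative sketch (not the LMM-Searcher code): the file-based mechanism amounts to writing image bytes to disk, keeping only a short textual UID in the agent's context, and loading the pixels again when a fetch tool is called. The `VisualStore` class, the hash-based UID format, and the `fetch_image` method name are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path

class VisualStore:
    """Offload image bytes to disk and hand back a short textual UID."""

    def __init__(self, root=None):
        self.root = Path(root) if root else Path(tempfile.mkdtemp(prefix="visual_store_"))

    def put(self, image_bytes: bytes) -> str:
        # Content hash as UID (an assumption; any stable identifier would do).
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """Hypothetical counterpart of the fetch-image tool: load pixels on demand."""
        return (self.root / uid).read_bytes()

store = VisualStore()
uid = store.put(b"\x89PNG fake image payload")
context = [f"Search result 1: screenshot stored as {uid}"]  # only text enters the context
```

The context grows by a few tokens per image instead of the full visual encoding, and the image remains retrievable many turns later.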


[94] Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs cs.CV | cs.LG

Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi

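TL;DR: This paper proposes Perception Programs (P^2), a training-free, model-agnostic method for improving the visual reasoning of multimodal language models (MLLMs). The core idea is to rewrite the dense, pixel-level outputs of vision tools (e.g., depth, optical flow, correspondence) into compact, structured, language-native summaries that MLLMs can parse and reason over directly, which substantially improves performance on perception-centric tasks.

Details

Motivation: Existing methods usually feed raw tool outputs straight into the model, but these dense pixel-level representations are mismatched with the language-native reasoning strengths of LLMs, leading to weak perception and over-reliance on language priors. The authors argue that, for problems where vision tools can supply the necessary cues, the bottleneck is neither the number of tool calls nor the size of the MLLM but how tool outputs are represented.

Result: On six perception-centric tasks in the BLINK benchmark, P^2 yields large gains over both base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P^2 raises multi-view reasoning accuracy from 41.35% to 86.47% and relative depth accuracy from 52.42% to 81.45%, for an average gain of 22% and new SOTA results. On smaller MLLMs (e.g., InternVL3.5-4B and Qwen3VL-4B), P^2 also brings 15-40% absolute gains, surpassing prior agentic, supervised, and reinforcement-learning-based tool-use methods without any training or model modification.

Insight: The contribution is a representation that converts vision tool outputs into language-native summaries, which fits the way LLMs reason and thereby unlocks the visual tool reasoning potential of MLLMs. Viewed objectively, the core insight is that improving MLLM visual reasoning may depend less on strengthening the model or adding tool calls than on optimizing the interface between tool outputs and the language model, which is an efficient and general system-level solution.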

Abstract: Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P$^2$ consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P$^2$, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
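
To make "rewriting tool outputs into language-native summaries" concrete, here is a minimal sketch for a relative-depth question. It is not the paper's P$^2$ format: the function name `depth_to_cues`, the output layout, and the convention that smaller values mean closer to the camera are all assumptions chosen for illustration.

```python
def depth_to_cues(depth, points):
    """Rewrite a dense depth map into a short textual summary for labelled points.

    depth  : 2D list of floats; smaller = closer to the camera (assumed convention)
    points : dict mapping a point label to its (row, col) pixel position
    """
    readings = {name: depth[r][c] for name, (r, c) in points.items()}
    ordered = sorted(readings, key=readings.get)  # closest point first
    lines = [f"point {name}: depth={readings[name]:.2f}" for name in ordered]
    lines.append("closest_to_farthest: " + " < ".join(ordered))
    return "\n".join(lines)


# A toy 3x3 "depth map" standing in for the output of a depth estimation tool.
depth = [
    [0.9, 0.8, 0.7],
    [0.6, 0.5, 0.4],
    [0.3, 0.2, 0.1],
]
summary = depth_to_cues(depth, {"A": (0, 0), "B": (2, 2)})
print(summary)
```

The model then receives three short lines of text rather than a rendered depth image, so answering "which point is closer?" becomes a lookup over structured text.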


[95] Pi-HOC: Pairwise 3D Human-Object Contact Estimation cs.CV

Sravan Chittupalli, Ayush Jain, Dong Huang

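TL;DR: This paper proposes Pi-HOC, a framework that estimates dense 3D semantic contact between all human-object pairs from a single image. It detects instances, creates a dedicated token for each interacting pair, refines the tokens with an InteractionFormer, and uses a SAM-based decoder to predict contact on SMPL human meshes. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over existing methods while delivering 20x higher inference throughput. The predicted contacts also improve SAM-3D image-to-mesh reconstruction through test-time optimization and support contact prediction from language queries without additional training.

Details

Motivation: Real-world human-object interaction in images is a many-to-many problem, and resolving fine-grained concurrent physical contact is especially difficult. Existing methods are limited to single-human settings or require additional object geometry, while the current SOTA approach relies on a VLM but struggles with multi-human scenes and is inefficient at inference.

Result: On the MMHOI and DAMON datasets, Pi-HOC significantly outperforms existing SOTA methods in accuracy and localization while achieving 20x higher throughput.

Insight: The contributions are: 1) a single-pass, instance-aware framework that creates a dedicated token for each interacting pair and processes the pairs in parallel; 2) a combination of InteractionFormer and a SAM-based decoder for dense 3D contact prediction; 3) a demonstration that contact prediction generalizes to improving 3D reconstruction and to zero-shot language-query tasks.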

Abstract: Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.


[96] Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions cs.CV

Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin

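TL;DR: This paper proposes Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from monocular video. At its core, the method uses a compact Sum-of-Gaussians representation to recover accurate and temporally stable motion, initializes the object with a video-adapted SAM3D pipeline, and combines this with an off-the-shelf hand pose initialization to achieve efficient tracking.

Details

Motivation: Existing methods for reconstructing dynamic hand-object interactions from monocular video rely on optimizing heavy neural representations and are therefore slow. The goal is reconstruction that is both fast and robust.

Result: On public benchmarks, GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work, while improving object reconstruction quality by 13.4% and reducing the hand's per-joint position error by over 65%.

Insight: The contribution is to combine the Sum-of-Gaussians representation from the classical tracking literature with modern generative Gaussian-based initialization for compact, efficient motion tracking. The object and the hand are handled by complementary strategies: the object uses a lightweight Sum-of-Gaussians representation, and the hand is refined with simple 2D joint and depth alignment losses. This avoids per-frame optimization of a detailed 3D appearance model and yields a large speedup while preserving geometric fidelity and stable articulation.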

Abstract: We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand’s per-joint position error by over 65%.


[97] Task Alignment: A simple and effective proxy for model merging in computer vision cs.CV

Pau de Jorge, César Roberto de Souza, Björn Michele, Mert Bülent Sarıyıldız, Philippe Weinzaepfel

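TL;DR: This paper proposes the task alignment proxy, a simple and effective way to speed up hyperparameter selection for model merging in computer vision. It extends model merging beyond CLIP-based classification to more complex multi-task vision settings.

Details

Motivation: In multi-task vision models with heterogeneous decoders, decoder training is expensive, which makes conventional hyperparameter selection based on downstream performance impractical.

Result: The task alignment proxy speeds up hyperparameter selection by orders of magnitude while retaining performance, making model merging applicable to multi-task vision models beyond CLIP classification.

Insight: The contribution is to use task alignment as a proxy metric that replaces downstream performance evaluation at low cost, so that the merging process can be tuned efficiently. Viewed objectively, optimizing against a proxy removes a computational bottleneck of practical deployment and makes model merging more usable.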

Abstract: Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.


[98] Distorted or Fabricated? A Survey on Hallucination in Video LLMs cs.CV | cs.AI

Yiyang Huang, Yitian Zhang, Yizhou Wang, Mingyuan Zhang, Liang Shi

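TL;DR: This paper is a survey of hallucination in Video Large Language Models (Vid-LLMs). It analyzes the phenomenon systematically, divides it into two core types (dynamic distortion and content fabrication) with their subtypes, and reviews recent progress in evaluating and mitigating hallucination, covering benchmarks, metrics, and intervention strategies. It closes by examining root causes and proposing directions for future research.

Details

Motivation: Despite significant progress in video-language modeling, hallucination (output that appears plausible yet contradicts the content of the input video) remains a persistent challenge in Vid-LLMs. The survey aims to organize and analyze the problem systematically, to improve understanding of it and to lay the groundwork for robust and reliable video-language systems.

Result: As a survey, the paper proposes no specific model and reports no quantitative experiments. It systematically reviews progress in the field, including key benchmarks (such as Video-Hallucination Benchmark), evaluation metrics (such as CHAIR and HaluEval), and intervention strategies (such as strengthened visual grounding and instruction tuning).

Insight: The main contribution is a systematic taxonomy of hallucination (dynamic distortion and content fabrication) together with an analysis of root causes such as limited capacity for temporal representation and insufficient visual grounding. Viewed objectively, the survey consolidates scattered research into a clear framework for understanding hallucination in Vid-LLMs and points to promising directions, such as motion-aware visual encoders and the integration of counterfactual learning techniques.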

Abstract: Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .


[99] Boosting Visual Instruction Tuning with Self-Supervised Guidance cs.CV

Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris

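TL;DR: This paper proposes a simple, lightweight method that reformulates self-supervised learning tasks (such as rotation prediction, color matching, and cross-view correspondence) as image-instruction-response triplets and injects them, as a small number of visually grounded tasks, into visual instruction tuning. The aim is to strengthen the fine-grained reasoning of multimodal large language models on vision-centric tasks. The method needs no human annotation, architectural modification, or extra training stage: adding a small fraction (3-10%) of such visually grounded instructions to the training data consistently improves performance across multiple models, training regimes, and benchmarks.

Details

Motivation: Multimodal large language models perform well on vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Evidence suggests the limitation stems not from weak visual representations but from under-use of visual information during instruction tuning, where many tasks can be partially solved from language priors alone.

Result: Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of visually grounded self-supervised task instructions consistently improves performance on vision-centric evaluations.

Insight: The contribution is to recast classical self-supervised pretext tasks as natural language instructions and inject them into instruction tuning data as a visually grounded supervision signal, which forces the model to rely on visual evidence. The method is simple and lightweight, requires no annotation or extra training stage, and improves visual reasoning purely by adjusting the training data distribution.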

Abstract: Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
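
One of the pretext tasks named in the abstract, rotation prediction, can be turned into an image-instruction-response triplet as sketched below. This is an illustration, not the V-GIFT code: the instruction wording, the dict layout, and the `mix` helper (which sizes the self-supervised share of the mixture) are assumptions, and a 2D list stands in for an image.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D list clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def rotation_triplet(image, rng):
    """Build one visually grounded sample; the answer depends only on the pixels."""
    k = rng.randrange(4)
    return {
        "image": rotate90(image, k),
        "instruction": (
            "By how many degrees clockwise has this image been rotated? "
            "Answer with 0, 90, 180, or 270."
        ),
        "response": str(90 * k),
    }


def mix(instruction_data, ssl_data, fraction, rng):
    """Add SSL samples so they make up roughly `fraction` of the final mixture."""
    n_ssl = round(fraction * len(instruction_data) / (1 - fraction))
    return instruction_data + rng.sample(ssl_data, min(n_ssl, len(ssl_data)))


rng = random.Random(0)
toy_image = [[1, 2], [3, 4]]
ssl_pool = [rotation_triplet(toy_image, rng) for _ in range(20)]
instruction_data = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(95)]
mixed = mix(instruction_data, ssl_pool, fraction=0.05, rng=rng)
```

Because the label is generated by the transformation itself, no human annotation is needed, and a language prior alone cannot recover the rotation angle.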


[100] Agentic Discovery with Active Hypothesis Exploration for Visual Recognition cs.CV

Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei

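TL;DR: This paper proposes HypoExplore, a framework that formulates neural architecture discovery for visual recognition as hypothesis-driven scientific inquiry. Given a high-level research direction, the framework uses a large language model to ideate, implement, evaluate, and improve neural architectures through evolutionary branching. It maintains a Trajectory Tree that records architecture lineage and a Hypothesis Memory Bank that tracks confidence derived from experimental evidence, and it updates hypothesis confidence by having multiple feedback agents analyze each experiment's results.

Details

Motivation: The aim is to automate neural architecture search (NAS) and give it reasoning abilities that resemble scientific discovery, so that the design space is explored more systematically and efficiently and architectural design principles are understood, rather than only finding a high-performing model.

Result: On CIFAR-10, the best discovered lightweight architecture reaches 94.11% accuracy (evolved from a baseline of 18.91%) and generalizes to CIFAR-100 and Tiny-ImageNet. Independent runs on MedMNIST achieve state-of-the-art (SOTA) performance.

Insight: The contribution is to cast architecture discovery as a hypothesis-based, LLM-driven active exploration process, with a Trajectory Tree and a Hypothesis Memory Bank to record and reuse knowledge. Viewed objectively, simulating scientific reasoning through multi-agent feedback and confidence accumulation, and showing that the learned design principles transfer, is a useful step toward automated machine learning (AutoML) that is more interpretable and accumulates knowledge.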

Abstract: We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.


[101] See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback cs.CV

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

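TL;DR: This paper proposes "See, Point, Refine", a multi-turn iterative method for pixel-precise localization in graphical user interfaces (GUIs), particularly in dense coding environments such as IDEs. The method uses visual feedback for closed-loop self-correction and significantly improves click precision and task success rate.

Details

Motivation: Existing Computer Use Agents (CUAs) performing editing-level GUI grounding in dense coding interfaces usually rely on single-shot coordinate prediction, which lacks an error-correction mechanism and struggles to reach sub-pixel accuracy. The paper targets precise localization in such high-density interfaces.

Result: Evaluated with GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate.

Insight: The contribution is to turn GUI grounding from a single prediction into a closed loop of iterative refinement driven by visual feedback, which lets the agent self-correct displacement errors and adapt to dynamic UI changes. The results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents.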

Abstract: Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
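
The closed-loop idea in the abstract can be shown with a toy loop. Nothing here comes from the paper's codebase: `refine_click`, `noisy_observe`, the 70% perception factor, and the stopping tolerance are hypothetical, and the `observe` callback stands in for whatever the agent reads off a fresh screenshot after each attempt.

```python
def refine_click(target, first_guess, observe, max_turns=5, tolerance=1.0):
    """Iteratively move the cursor using the offset perceived after each attempt."""
    x, y = first_guess
    history = [(x, y)]
    for _ in range(max_turns):
        dx, dy = observe((x, y), target)  # feedback from the latest attempt
        if abs(dx) <= tolerance and abs(dy) <= tolerance:
            break  # perceived error is small enough to click
        x, y = x + dx, y + dy
        history.append((x, y))
    return (x, y), history


def noisy_observe(cursor, target):
    """Hypothetical feedback model: the agent perceives only 70% of the true offset."""
    return (0.7 * (target[0] - cursor[0]), 0.7 * (target[1] - cursor[1]))


target = (140.0, 80.0)
final, history = refine_click(target, first_guess=(100.0, 100.0), observe=noisy_observe)
print(final, len(history))
```

Even with imperfect perception at every turn, the residual error shrinks geometrically, which is the advantage of a correction loop over a single-shot coordinate prediction.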


[102] Representation geometry shapes task performance in vision-language modeling for CT enterography cs.CV | cs.AI

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham

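TL;DR: This paper presents the first study of vision-language transfer learning on abdominal CT enterography. Mean pooling of slice embeddings performs better on disease classification (59.2% three-class accuracy), whereas attention pooling performs better on cross-modal retrieval (0.235 text-to-image MRR). In addition, multi-window RGB encoding (mapping different Hounsfield Unit windows to the RGB channels) is more effective than strategies that increase spatial coverage through multiplanar sampling, and adding coronal and sagittal views actually reduces classification performance.

Details

Motivation: CT enterography is a primary imaging modality for assessing inflammatory bowel disease, yet the representational choices that best support its automated analysis are unknown. The paper explores how vision-language models perform on this modality.

Result: For disease classification, mean pooling reaches 59.2% three-class accuracy. For cross-modal retrieval, attention pooling reaches 0.235 text-to-image MRR. For report generation, retrieval-augmented generation (RAG) scores 7-14 percentage points above the chance baseline and improves ordinal MAE from 0.98 to 0.80-0.89.

Insight: The choice of aggregator (mean pooling vs. attention pooling) emphasizes different properties of the learned representation and affects task performance. In medical imaging, multi-window RGB encoding based on tissue contrast is more effective than increasing spatial coverage. Retrieval-augmented generation improves report generation. A three-teacher pseudolabel framework enables comparative evaluation without expert annotations.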

Abstract: Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
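
Two mechanisms from the abstract, multi-window RGB encoding and the two slice aggregators, can be sketched in plain Python. This is an illustration under assumptions: the three (center, width) windows are generic example values rather than the paper's settings, and the attention pooling shown is a basic dot-product softmax with a given query vector, which may differ from the learned aggregator used in the study.

```python
import math

# Illustrative (center, width) Hounsfield Unit windows, one per RGB channel.
# These particular values are assumptions, not the settings used in the paper.
WINDOWS = [(40, 400), (50, 150), (-600, 1500)]


def window(hu, center, width):
    """Clip a Hounsfield value to one window and rescale it to [0, 1]."""
    low = center - width / 2
    return min(max((hu - low) / width, 0.0), 1.0)


def encode_pixel(hu):
    """Map one HU value to an (R, G, B) triple, one window per channel."""
    return tuple(round(255 * window(hu, c, w)) for c, w in WINDOWS)


def mean_pool(slice_embeddings):
    """Average slice embeddings: every slice contributes equally."""
    n = len(slice_embeddings)
    return [sum(column) / n for column in zip(*slice_embeddings)]


def attention_pool(slice_embeddings, query):
    """Softmax-weighted average: slices that match the query dominate."""
    scores = [sum(q * x for q, x in zip(query, e)) for e in slice_embeddings]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(weights)
    return [
        sum(w * e[d] for w, e in zip(weights, slice_embeddings)) / total
        for d in range(len(query))
    ]


slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
volume_mean = mean_pool(slices)
volume_attn = attention_pool(slices, query=[10.0, 0.0])
```

With a zero query the two aggregators coincide; with a sharp query, attention pooling concentrates on a few slices, which is one way to read the abstract's remark that the aggregators emphasize different properties of the representation.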


[103] Visual Preference Optimization with Rubric Rewards cs.CV | cs.AI

Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu

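TL;DR: This paper proposes rDPO, a visual preference optimization framework based on instance-specific rubrics. For each image-instruction pair it creates a checklist-style rubric of essential and additional criteria and uses it to score policy responses, which improves the quality of preference data for multimodal tasks.

Details

Motivation: Existing preference optimization methods rely on off-policy perturbations or coarse outcome-based signals, which are poorly suited to fine-grained visual reasoning, so a finer-grained feedback mechanism is needed.

Result: On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering lowers it to 75.82. On a comprehensive benchmark, rDPO reaches 61.01, clearly ahead of the style-constrained baseline (52.36) and above the base model (59.48).

Insight: The contribution is to combine on-policy data construction with instance-specific, criterion-level feedback: rubrics supply fine-grained supervision that makes visual preference optimization more effective.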

Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
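
A checklist-style rubric with essential and additional criteria might be used to build preference pairs as sketched below. The scoring rule is an assumption made for illustration (failing any essential criterion scores 0; otherwise 1 plus the fraction of additional criteria met); the abstract does not state how rDPO aggregates criteria, and the criterion names and helper functions are hypothetical.

```python
def rubric_score(checks, rubric):
    """Score one response from per-criterion pass/fail judgments.

    Assumed rule (not from the paper): failing any essential criterion gives 0;
    otherwise the score is 1 plus the fraction of additional criteria met.
    """
    if not all(checks[c] for c in rubric["essential"]):
        return 0.0
    extra = rubric["additional"]
    bonus = sum(checks[c] for c in extra) / len(extra) if extra else 0.0
    return 1.0 + bonus


def build_preference_pair(responses, rubric):
    """Pick the best and worst sampled responses; skip ties (no preference signal)."""
    scored = sorted(responses, key=lambda r: rubric_score(r["checks"], rubric))
    worst, best = scored[0], scored[-1]
    if rubric_score(best["checks"], rubric) == rubric_score(worst["checks"], rubric):
        return None
    return {"chosen": best["text"], "rejected": worst["text"]}


# Hypothetical rubric for one image-instruction pair about reading a chart.
rubric = {
    "essential": ["reads_axis_label", "correct_final_value"],
    "additional": ["cites_legend", "states_units"],
}
responses = [
    {"text": "A", "checks": {"reads_axis_label": True, "correct_final_value": True,
                             "cites_legend": True, "states_units": False}},
    {"text": "B", "checks": {"reads_axis_label": False, "correct_final_value": True,
                             "cites_legend": True, "states_units": True}},
]
pair = build_preference_pair(responses, rubric)
```

In this toy case response B gets the right final value but misses an essential criterion, so a purely outcome-based filter could not separate it from A while the rubric can.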


[104] Generative Refinement Networks for Visual Synthesis cs.CV

Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan

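TL;DR: This paper proposes Generative Refinement Networks (GRN), a new visual synthesis paradigm that targets the computational inefficiency of diffusion models and the quality problems autoregressive models suffer from discretization and error accumulation. GRN removes the discretization bottleneck with a near-lossless Hierarchical Binary Quantization (HBQ) and adds a global refinement mechanism and an entropy-guided sampling strategy, enabling complexity-aware adaptive generation.

Details

Motivation: The aim is to combine the strengths of diffusion and autoregressive models while fixing their weaknesses: diffusion models allocate computation uniformly regardless of complexity, and autoregressive models are limited in quality by lossy discretization and error accumulation.

Result: On the ImageNet benchmark, GRN sets new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). On text-to-image and text-to-video generation, GRN also delivers superior performance at an equivalent scale.

Insight: The contributions are: 1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) that eases the discretization bottleneck of autoregressive models; 2) a global refinement mechanism on top of autoregressive generation that progressively perfects and corrects the output, like a human artist painting; 3) an entropy-guided sampling strategy that enables complexity-aware, adaptive-step generation without sacrificing visual quality.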

Abstract: While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ’s latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks – like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.


[105] Lyra 2.0: Explorable Generative 3D Worlds cs.CV

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao

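TL;DR: Lyra 2.0 is a framework for generating large-scale, persistent, explorable 3D worlds. By addressing spatial forgetting and temporal drifting in long-sequence video generation, it produces 3D-consistent long-trajectory videos and lifts them, via feed-forward reconstruction, into high-quality 3D scenes suitable for real-time rendering.

Details

Motivation: Current 3D scene creation based on video generation and reconstruction is limited when generating large, complex environments with long camera trajectories, large viewpoint changes, and location revisits, because video models degrade in this setting. The two main forms of degradation are spatial forgetting and temporal drifting.

Result: The proposed method generates substantially longer, 3D-consistent video trajectories, which are used to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.

Insight: The contributions are: 1) per-frame 3D geometry is used for information routing (retrieving relevant past frames and establishing dense correspondences) to address spatial forgetting, while appearance synthesis is left to the generative prior; 2) training with self-augmented histories, which expose the model to its own degraded outputs, teaches it to correct temporal drift rather than propagate it. Together these offer a new paradigm for generating large-scale, consistent 3D worlds.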

Abstract: Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model’s temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing – retrieving relevant past frames and establishing dense correspondences with the target viewpoints – while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.


cs.LG [Back]

[106] UCS: Estimating Unseen Coverage for Improved In-Context Learning cs.LG | cs.CL

Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang

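TL;DR: This paper proposes UCS (Unseen Coverage Selection), a training-free, subset-level, coverage-based method for selecting demonstrations to improve in-context learning (ICL). It induces discrete latent clusters from model-consistent embeddings and uses a Smoothed Good-Turing estimator on a candidate subset's empirical frequency spectrum to estimate the number of unrevealed clusters, so that the selected demonstration set covers the latent clusters better.

Details

Motivation: ICL performance depends heavily on which demonstrations are chosen, yet most existing selection methods rely on heuristic relevance or diversity criteria and give little insight into the coverage of a demonstration set. The paper proposes a coverage-based prior: a good demonstration set should expose the model to latent clusters that the currently selected subset has not revealed.

Result: In experiments on multiple intent-classification and reasoning benchmarks with frontier large language models, combining UCS with strong existing baselines (both query-dependent and query-independent selectors) through a simple regularized objective consistently improves ICL accuracy by 2-6% under the same selection budget.

Insight: The contribution is UCS, which the paper presents as the first training-free, coverage-based demonstration selection prior; it optimizes selection by estimating the number of unrevealed latent clusters and integrates seamlessly with existing selectors. Viewed objectively, the method offers new insight into task-level and model-level latent cluster distributions and is a general, efficient way to strengthen ICL.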

Abstract: In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters not yet revealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good–Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at https://github.com/Raina-Xin/UCS.
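
The frequency-spectrum idea behind step (2) can be illustrated with two textbook estimators. This sketch is not the paper's Smoothed Good-Turing estimator: it shows the basic (unsmoothed) Good-Turing missing mass and, as a swapped-in stand-in for counting unseen clusters, the bias-corrected Chao1 estimate. The function names and the toy cluster labels are assumptions.

```python
from collections import Counter


def frequency_spectrum(cluster_ids):
    """Return f, where f[r] is the number of clusters seen exactly r times."""
    counts = Counter(cluster_ids)
    return Counter(counts.values())


def good_turing_missing_mass(cluster_ids):
    """Basic Good-Turing estimate of the probability that the next
    demonstration falls in a cluster not yet seen: f[1] / N."""
    f = frequency_spectrum(cluster_ids)
    return f[1] / len(cluster_ids)


def chao1_unseen_clusters(cluster_ids):
    """Bias-corrected Chao1 estimate of how many clusters remain unseen:
    f[1] * (f[1] - 1) / (2 * (f[2] + 1))."""
    f = frequency_spectrum(cluster_ids)
    return f[1] * (f[1] - 1) / (2 * (f[2] + 1))


# Cluster assignments of the demonstrations in one candidate subset (toy data).
subset = ["a", "a", "b", "c", "c", "d"]
print(frequency_spectrum(subset))        # two singletons, two doubletons
print(good_turing_missing_mass(subset))  # 2 / 6
print(chao1_unseen_clusters(subset))     # 2 * 1 / (2 * 3)
```

Both estimators read only the counts of clusters seen once or twice, which is why a subset with many singleton clusters signals that much of the latent space is still uncovered.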


[107] MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization cs.LG | cs.AI | cs.CL

Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding

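TL;DR: MolMem is a memory-augmented agentic reinforcement learning framework for molecular optimization. Its dual-memory system, Static Exemplar Memory and Evolving Skill Memory, makes efficient use of samples under a limited evaluation budget and significantly improves performance on single-property and multi-property optimization tasks.

Details

Motivation: Molecular optimization in drug discovery suffers from poor sample efficiency because each evaluation is expensive. Existing methods either need many trial-and-error oracle calls or depend on external knowledge and struggle with challenging objectives, and they lack a long-term memory that can ground decisions and provide reusable insights.

Result: With only 500 oracle calls, MolMem achieves 90% success on single-property tasks (1.5x over the best baseline) and 52% success on multi-property tasks.

Insight: The contribution is a dual-memory system (Static Exemplar Memory for cold-start grounding, Evolving Skill Memory for distilling successful trajectories into reusable strategies) combined with policy training on dense step-wise rewards, which turns costly trial and error into long-term knowledge. It offers an agent architecture that other sample-limited optimization problems could borrow.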

Abstract: In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5x over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL-Lab-NU/MolMem.


[108] Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning cs.LG | cs.AI | cs.CL

NVIDIA: Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye

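TL;DR: This paper describes Nemotron 3 Super, a 120 billion parameter (12 billion active) hybrid Mamba-Attention Mixture-of-Experts model. The model is pre-trained in NVFP4, introduces LatentMoE, a new architecture that optimizes both FLOP efficiency and parameter efficiency, and includes MTP layers for inference acceleration. After pre-training on 25 trillion tokens, it is post-trained with supervised fine-tuning and reinforcement learning. It supports up to 1M context length and achieves comparable accuracy on common benchmarks, with up to 2.2x and 7.5x higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively.

Details

Motivation: The goal is an open, efficient large language model that supports agentic reasoning, combining the strengths of Mamba and Transformer and optimizing the Mixture-of-Experts architecture so that inference efficiency improves substantially while accuracy stays high.

Result: The model achieves accuracy comparable to existing models on common benchmarks, with up to 2.2x and 7.5x higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively.

Insight: The main contributions are: 1) pre-training in the NVFP4 format; 2) LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter; 3) MTP layers that accelerate inference through native speculative decoding. These designs are instructive for both model efficiency (throughput) and architecture (hybrid Mamba-Attention with an efficient MoE).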

Abstract: We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.


[109] Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task cs.LG | cs.CL

Alicia Curth, Rachel Lawrence, Sushrut Karmalkar, Niranjani Prasad

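TL;DR: This paper studies whether transformers use their depth adaptively according to task difficulty. Using a multi-hop relational reasoning task based on family stories, the authors monitor, for pretrained and finetuned models across difficulty levels, how predictions evolve across layers (the logit lens) and how information is integrated across tokens (causal patching). Pretrained models show only limited signs of adaptive depth use, while finetuned models show clearer and more consistent adaptive depth use, with a stronger effect in finetuning regimes that do not preserve general language modeling abilities.

Details

Motivation: The paper asks whether transformers can adapt their use of depth to task difficulty, as a way to understand how the models process information internally.

Result: On the family-story multi-hop relational reasoning task, pretrained models show only partial adaptive depth use (for example, some larger models need fewer layers on easier tasks), whereas finetuned models show clearer adaptive depth use, especially under finetuning regimes that do not preserve general language modeling abilities.

Insight: Using the logit lens and causal patching, the paper shows that finetuned transformers may allocate computational depth according to task complexity, which offers a new perspective for understanding model internals and for model design.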

Abstract: We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.


[110] Do VLMs Truly “Read” Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting cs.LG | cs.CL

Kaiqi Hu, Linda Xiao, Shiyue Xu, Ziyi Tang, Mingwen Liu

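TL;DR: This paper addresses the use of vision-language models (VLMs) for visual stock price forecasting. It argues that existing benchmarks cannot effectively evaluate VLMs' understanding of candlestick charts, and builds a multi-scale candlestick chart dataset with a standardized evaluation framework. Experiments show that most VLMs perform well only under persistent uptrends or downtrends, predict poorly in more common market scenarios, and exhibit significant prediction biases and limited sensitivity to the forecast horizon.

Details

Motivation: Prior work has not effectively evaluated whether VLMs truly understand candlestick patterns, and datasets and evaluation setups are mostly designed around single-period or tabular inputs, so they cannot systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics.

Result: On the constructed multi-scale dataset, most VLMs perform well only under persistent trend conditions, have weak predictive capability in common market scenarios, and show prediction biases and limited sensitivity to the forecast horizon. The evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time-series metrics and includes XGBoost as a feature-based temporal baseline.

Insight: The novelty is the construction of the first multi-scale candlestick chart dataset and a standardized evaluation framework for systematically assessing VLMs' ability to use multi-scale visual market signals. The work exposes inherent limitations of VLMs in precise temporal reasoning and gives future research a more rigorous benchmark.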

Abstract: Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock prices in candlestick charts. First, prior studies fail to isolate whether VLMs’ comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. Human analysts, however, rely strongly on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, so these setups make it difficult to systematically assess VLMs’ ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs’ ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
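The abstract does not define its information coefficient. A common convention, assumed here, is the rank IC: the Spearman correlation between predicted scores and realized returns over a cross-section of stocks on one date. The sketch below uses only the standard library; the function names and sample numbers are illustrative.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def rank_ic(predicted, realized):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(realized))


# Hypothetical scores and returns for three stocks on one date.
print(rank_ic([0.3, 0.1, 0.2], [5.0, -1.0, 2.0]))
```

Computing this value per date yields the IC time series; a mean near zero would indicate that the predicted ordering carries little information about realized returns.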


cs.PL

[111] M$^\star$: Every Task Deserves Its Own Memory Harness cs.PL | cs.AI | cs.CL | cs.LG

Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi

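TL;DR: This paper proposes M$^\star$, a method that automatically discovers task-optimized memory systems through executable program evolution. It models an agent memory system as a Python program comprising a data schema, storage logic, and workflow instructions, and jointly optimizes these components with a population-based reflective code evolution method.

Details

Motivation: Existing LLM agents usually adopt a fixed memory design tailored to a specific domain (such as semantic retrieval for conversations or skill reuse for coding), but a memory system optimized for one purpose transfers poorly to other tasks, so a memory design method that adapts automatically to different tasks is needed.

Result: Across four distinct benchmarks spanning conversation, embodied planning, and expert reasoning, M$^\star$ robustly outperforms existing fixed-memory baselines, indicating an advantage over general-purpose memory paradigms.

Insight: The novelty lies in formalizing the memory system as an evolvable Python program and achieving adaptive cross-task optimization through reflective code evolution. Viewed objectively, the method offers a flexible, task-specific approach to memory system design that moves beyond the limits of fixed architectures.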

Abstract: Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skill reuse for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ robustly improves performance over existing fixed-memory baselines across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.


cs.AI

[112] Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching cs.AI | cs.CL

Rongzhe Wei, Ge Shi, Min Cheng, Na Zhang, Pan Li

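TL;DR: This paper addresses the challenges large language models (LLMs) face when executing multi-step tasks within massive tool libraries, and makes two main contributions: the SLATE (Synthetic Large-scale API Toolkit for E-commerce) benchmark for evaluating tool-integrated agents, and Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that optimizes the exploration-exploitation trade-off and improves task success rates and computational efficiency.

Details

Motivation: To address two critical bottlenecks for LLM agents executing long-horizon plans in large tool libraries: the absence of rigorous, plan-level evaluation frameworks, and the computational demand of exploring the vast decision spaces that arise from large toolsets and long planning horizons.

Result: Extensive experiments on the proposed SLATE benchmark show that EGB significantly improves task success rates and computational efficiency, providing a solid foundation for developing reliable and scalable LLM agents in tool-rich environments.

Insight: The contributions are: (1) SLATE, a large-scale, context-aware synthetic benchmark that accommodates diverse yet functionally valid execution trajectories and therefore evaluates agents more comprehensively; and (2) the EGB search algorithm, which guides search by dynamically expanding decision branches where predictive entropy is high, improving the exploration-exploitation balance.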

Abstract: Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.
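The branching rule described in the abstract, expanding more branches where predictive entropy is high, might be sketched as follows. The threshold, the maximum width, and the two-level width rule are assumptions made for illustration; the abstract does not specify how EGB sets them.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a distribution over candidate tool calls."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expand(candidates, probs, threshold=0.5, max_width=3):
    """Keep one branch when the model is confident, up to max_width when it is not.

    threshold and max_width are hypothetical knobs, not values from the paper.
    """
    width = max_width if entropy(probs) > threshold else 1
    ranked = sorted(zip(candidates, probs), key=lambda cp: cp[1], reverse=True)
    return [c for c, _ in ranked[:width]]


tools = ["search", "cart", "pay"]
print(expand(tools, [0.96, 0.02, 0.02]))  # low entropy: a single branch
print(expand(tools, [0.40, 0.35, 0.25]))  # high entropy: several branches
```

The intended effect is that search effort concentrates on the decision points where the agent is uncertain, rather than being spread uniformly over the plan.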


[113] Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams cs.AI | cs.CL

Xiuxiu Tang, G. Alex Ambrose, Ying Cheng

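TL;DR: This study examines the reliability of AI-assisted scoring of handwritten constructed responses in undergraduate physics using GPT-4o. By systematically varying rubric granularity, prompting format, and temperature, it finds that reliable AI-assisted scoring depends primarily on clear, well-structured, skill-based rubrics, while prompting format and temperature have comparatively limited influence.

Details

Motivation: Handwritten student responses in STEM assessments vary in format and are complex to interpret; human scoring is time-consuming and prone to inter-rater inconsistency, particularly when partial credit is involved. Although LLMs have drawn attention for AI-assisted scoring, evidence remains limited on how rubric design and LLM configuration affect reliability across performance levels.

Result: Human-AI agreement on total scores was close to human inter-rater reliability; it was highest for high- and low-performing responses and declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger agreement for clearly defined conceptual skills than for extended procedural judgments. Relative to holistic scoring, a more fine-grained, checklist-based rubric improved scoring consistency.

Insight: The study highlights that designing clear, well-structured, skill-based rubrics is the key to reliable LLM-assisted scoring in STEM, while prompt engineering and model parameter tuning play a secondary role. This yields transferable design recommendations for AI in educational assessment.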

Abstract: Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students’ reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.


[114] HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models cs.AI | cs.CL

Jawad Hossain, Xiangyu Guo, Jiawei Zhou, Chong Liu

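TL;DR: The paper proposes the HintMR framework, which strengthens the mathematical reasoning of small language models (SLMs) through hint-assisted reasoning. The method decomposes a problem into sequential reasoning steps and uses a small model trained via distillation from a large model to generate context-aware hints. This model cooperates with a reasoning model in a two-model system that guides reasoning step by step and reduces error propagation.

Details

Motivation: To address the weak complex mathematical reasoning of SLMs, which stems from their limited capacity to maintain long chains of intermediate steps and to recover from early errors.

Result: Experiments across multiple mathematical benchmarks and models show that hint assistance consistently improves the reasoning accuracy of SLMs, with substantial gains over standard prompting, while preserving model efficiency.

Insight: The novelty is a lightweight two-model cooperative framework in which the hint-generating model acquires its guiding ability through distillation and cooperates in a structured way with the reasoning model, providing stepwise, localized guidance without revealing full solutions and thereby strengthening mathematical reasoning.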

Abstract: Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, through hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.
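The two-model loop described above, in which each hint is conditioned on the problem and the accumulated reasoning history, could look roughly like this. The callables stand in for the two SLMs, and the `ANSWER:` stop convention and step budget are assumptions of this sketch, not details from the paper.

```python
def solve_with_hints(problem, hint_model, reasoner, max_steps=8):
    """Alternate hint generation and one reasoning step until an answer appears."""
    history = []
    for _ in range(max_steps):
        hint = hint_model(problem, history)      # sees the problem and past steps only
        step = reasoner(problem, history, hint)  # produces the next step, guided by the hint
        history.append(step)
        if step.startswith("ANSWER:"):           # hypothetical stop convention
            break
    return history


# Toy stand-ins for the hint-generating SLM and the reasoning SLM.
def toy_hint_model(problem, history):
    return f"hint {len(history) + 1}"


def toy_reasoner(problem, history, hint):
    return "ANSWER: 4" if len(history) == 2 else f"step using {hint}"


trace = solve_with_hints("2 + 2", toy_hint_model, toy_reasoner)
print(trace)
```

Because the hint model only ever sees the history so far, each hint stays local to the next step rather than revealing a full solution.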


[115] How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm cs.AI | cs.CL | cs.GT | cs.MA

Taisei Hishiki, Takaya Arita, Reiji Suzuki

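TL;DR: This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on collective and cooperative dynamics in a multi-agent system. The rule-based agents of the Social Particle Swarm (SPS) model are replaced with LLM agents endowed with Big Five personality scores and varying memory lengths. Memory length turns out to be a critical parameter governing collective behavior: with Gemini-2.0-Flash, even a minimal memory markedly suppresses cooperation, and as memory grows the system moves from stable cooperative clusters, through cyclical formation and collapse of clusters, to a state of scattered defection. Gemma 3:4b shows the opposite trend, with longer memory promoting cooperation and forming dense cooperative clusters. Sentiment analysis indicates that the models interpret memory differently, which provides a micro-level cognitive account of emergent social behavior in Generative Agent-Based Modeling.

Details

Motivation: To determine how model-specific characteristics of LLM agents (such as internal alignment) shape the effect of memory on cooperation and collective behavior in multi-agent systems, in order to explain contradictory findings in prior work on the relationship between memory and cooperation.

Result: In the Prisoner's Dilemma-based SPS model, longer memory suppresses cooperation for Gemini-2.0-Flash agents (from stable cooperation to scattered defection), while for Gemma 3:4b agents longer memory promotes cooperation and forms dense clusters. Correlations between Big Five personality traits and agent behaviors agree in part with findings from human experiments, supporting the validity of the model.

Insight: The novelty lies in introducing LLM agents with personalities and variable memory into a classic multi-agent model, revealing that the effect of memory length on cooperation is model-dependent, and using sentiment analysis to give a micro-level cognitive explanation. This suggests that model-specific characteristics of LLMs, potentially including alignment, are fundamental in determining emergent social behavior.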

Abstract: This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner’s Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents’ reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.


[116] MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents cs.AI | cs.CL

Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

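TL;DR: This paper proposes MultiDocFusion, a multimodal chunking pipeline for enhanced RAG on long industrial documents. It detects regions with vision-based document parsing, extracts text via OCR, reconstructs a hierarchical tree with LLM-based document section hierarchical parsing, and builds hierarchical chunks through DFS-based grouping, addressing the information loss caused by conventional text chunking methods that ignore complex document structure.

Details

Motivation: Conventional text chunking in RAG often ignores the hierarchical structure of long, structurally complex industrial documents, causing information loss and lower answer quality. The paper aims to improve chunking by explicitly exploiting the hierarchical and multimodal information in documents.

Result: Extensive experiments on multiple industrial benchmarks show that, compared with baselines, MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3%, underscoring the critical role of document hierarchy in multimodal document QA.

Insight: The novelty is the integration of vision-based document parsing, OCR, LLM-based document section hierarchical parsing (DSHP-LLM), and DFS-based grouping into one pipeline that explicitly models and exploits the hierarchy of industrial documents when chunking, improving the fidelity and performance of RAG systems. It is a structure-aware multimodal chunking method.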

Abstract: RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
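Step (iv), building hierarchical chunks by depth-first traversal of the section tree, can be sketched as below. The dictionary layout (`title`, `text`, `children`) and the choice to prefix each chunk with its heading path are assumptions made for illustration; the abstract does not give the actual grouping rule.

```python
def dfs_chunks(node, path=()):
    """Yield one chunk per section in document order, tagged with its heading path."""
    here = path + (node["title"],)
    if node.get("text"):
        yield {"path": " > ".join(here), "text": node["text"]}
    for child in node.get("children", []):
        yield from dfs_chunks(child, here)


# Hypothetical section tree, as a hierarchy parser might reconstruct it.
doc = {"title": "Manual", "children": [
    {"title": "Safety", "text": "Wear gloves.", "children": [
        {"title": "Electrical", "text": "Disconnect power."}]},
    {"title": "Maintenance", "text": "Inspect monthly."}]}

chunks = list(dfs_chunks(doc))
for chunk in chunks:
    print(chunk["path"], "|", chunk["text"])
```

Carrying the heading path with each chunk is one way a retriever can see where a passage sits in the document, which flat fixed-size chunking discards.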


[117] RPRA: Predicting an LLM-Judge for Efficient but Performant Inference cs.AI | cs.CL | cs.LG | cs.MA

Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai

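TL;DR: This paper studies the RPRA (Reason-Predict-Reason-Answer/Act) and PA (Predict-Answer/Act) paradigms, in which a model predicts, before generating its response, how an LLM judge would score its output, in order to balance computational efficiency against output quality. Three approaches are evaluated: zero-shot prediction, an in-context report card, and supervised fine-tuning. Larger models predict well zero-shot, while smaller models can predict reliably after fine-tuning or when given a report card, which substantially improves their prediction accuracy.

Details

Motivation: To address the fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality when large language models are deployed on computationally limited devices such as phones or laptops, by having a model ask for help (defer to a larger model) when it believes it cannot solve a problem on its own, enabling more efficient inference.

Result: Across datasets, in-context report cards and supervised fine-tuning yield mean improvements of up to 55% and 52%, respectively, in the prediction accuracy of smaller models, showing that both approaches markedly improve smaller models' ability to predict LLM-judge scores.

Insight: The novelty is the mechanism of predicting an LLM judge's score, which lets a model assess its own performance limits and enables more efficient, adaptive inference. Viewed objectively, the approach suggests a way to build self-aware AI systems that balance computational efficiency and quality, particularly for deployment in resource-constrained settings.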

Abstract: Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is to follow the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict, prior to responding, how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.
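The deferral idea behind the PA paradigm, answering locally only when the predicted judge score is high enough, can be sketched as a small router. The score scale, the threshold of 7.0, and the callable interfaces are hypothetical; the abstract does not specify them.

```python
def route(query, predict_score, small_answer, large_answer, threshold=7.0):
    """Answer with the small model if it predicts a good judge score, else defer.

    predict_score: the small model's prediction of the judge score, made before answering.
    threshold: hypothetical cut-off on an assumed 0-10 judge scale.
    """
    predicted = predict_score(query)
    if predicted >= threshold:
        return "small", small_answer(query)
    return "large", large_answer(query)


# Toy stand-ins for the two models.
easy = route("What is 2 + 2?", lambda q: 9.0, lambda q: "4", lambda q: "4 (large)")
hard = route("Prove the lemma.", lambda q: 3.0, lambda q: "?", lambda q: "proof (large)")
print(easy, hard)
```

The large model is only invoked on the deferred queries, which is where the efficiency gain would come from if the score prediction is accurate.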


[118] RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair cs.AI | cs.CL

Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla

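TL;DR: This paper proposes RePAIR, an interactive machine unlearning framework that lets end users instruct a large language model (LLM), through natural language at inference time, to forget specific harmful knowledge, misinformation, or personal data. The framework comprises a watchdog model that detects unlearning intent, a surgeon model that generates repair procedures, and a patient model whose parameters are updated autonomously. At its core is STAMP, a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via pseudoinverse updates; a low-rank variant further improves computational efficiency.

Details

Motivation: LLMs absorb harmful knowledge, misinformation, and personal data during pretraining but have no mechanism for selective forgetting. Existing machine unlearning methods are provider-centric, requiring retraining pipelines and curated retain datasets, which excludes end users from controlling their own data. The paper therefore aims to propose a new user-driven, interactive unlearning paradigm.

Result: In extensive experiments on harmful knowledge suppression, misinformation correction, and personal data erasure, RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. The low-rank STAMP variant achieves up to ~3x speedup over training-based baselines on device.

Insight: The main contributions are the Interactive Machine Unlearning (IMU) paradigm and the RePAIR framework that realizes it. The core method, STAMP, is a training-free, single-sample unlearning technique that manipulates internal activations through closed-form pseudoinverse updates, enabling efficient, user-driven knowledge editing. This suggests a route to transparent, controllable on-device management of model knowledge, with potential extension to multimodal foundation models.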

Abstract: Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
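To make "closed-form pseudoinverse update" concrete, the sketch below shows the generic single-sample case: for a weight matrix $W$, one input activation $k$, and a desired output $t$, the minimum-norm correction is $\Delta W = (t - Wk)\,k^{+}$ with $k^{+} = k^{\top}/(k^{\top}k)$. This is a textbook construction offered as an illustration; it is not claimed to be STAMP's exact update, and the helper names are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pinv_update(W, k, target):
    """Return W + (target - W k) k^+, where k^+ = k^T / (k^T k) for a nonzero vector k."""
    norm_sq = sum(ki * ki for ki in k)
    residual = [t - y for t, y in zip(target, matvec(W, k))]
    return [[w + r * ki / norm_sq for w, ki in zip(row, k)]
            for row, r in zip(W, residual)]


W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 2.0]            # activation to redirect (toy values)
target = [0.0, 0.0]       # stand-in for a "refusal" output direction
W_new = pinv_update(W, k, target)
print(matvec(W_new, k))            # k is now mapped to the target
print(matvec(W_new, [2.0, -1.0]))  # an input orthogonal to k is unaffected
```

The rank-one form of the correction is why inputs orthogonal to `k` pass through unchanged, which is the property that makes such edits selective.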


[119] Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks cs.AI | cs.CV | cs.LG

Arun Sharma

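TL;DR: This paper proposes compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model generates. The Spatial Atlas system implements CGR as an Agent-to-Agent server that handles two challenging benchmarks: FieldWorkArena (a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments) and MLE-Bench (an end-to-end ML engineering suite of 75 Kaggle machine learning competitions). A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, and feeds the computed facts to large language models to avoid hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). The system also includes a self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry. Evaluation shows that CGR achieves competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computation.

Details

Motivation: To address hallucinated reasoning when spatial-aware research agents answer spatial questions in complex environments (such as factories and warehouses), as well as the challenges of code generation and optimization in end-to-end machine learning engineering tasks, using deterministic computation to keep spatial reasoning accurate and interpretable.

Result: Evaluated on the FieldWorkArena and MLE-Bench benchmarks, CGR achieves competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computation.

Insight: The contributions include the CGR paradigm, which resolves answerable sub-problems by deterministic computation to avoid hallucinated spatial reasoning by the language model; a structured spatial scene graph engine for entity and relation extraction and deterministic computation; entropy-guided action selection to optimize information gain; and a self-healing ML pipeline combining strategy-aware code generation with iterative refinement. Viewed objectively, combining deterministic computation with language model generation improves the reliability of spatial reasoning and the interpretability of the system, and suits multimodal environments and complex ML engineering tasks.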

Abstract: We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
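The "compute first, then generate" step can be illustrated with a small sketch that turns entity coordinates into distance and clearance facts before any prompt is built. The entity names, the 2D coordinates, and the clearance rule are hypothetical; the abstract does not describe the scene graph schema.

```python
import math


def computed_facts(entities, min_clearance):
    """Deterministically compute pairwise distances and flag clearance violations."""
    facts = []
    names = sorted(entities)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(entities[a], entities[b])
            status = "VIOLATION" if d < min_clearance else "ok"
            facts.append(f"{a}-{b}: {d:.2f} m ({status})")
    return facts


# Hypothetical scene: positions in metres, as extracted from a vision description.
scene = {"forklift": (0.0, 0.0), "worker": (3.0, 4.0)}
facts = computed_facts(scene, min_clearance=6.0)
print(facts)
```

The resulting strings would be passed to the language model as given facts, so the model only needs to phrase the answer rather than estimate distances itself.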


[120] ReflectCAP: Detailed Image Captioning with Reflective Memory cs.AI | cs.CV

Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung

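TL;DR: ReflectCAP is a detailed image captioning method guided by reflective memory. A multi-agent pipeline analyzes the hallucination and omission patterns of a large vision-language model (LVLM) and distills them into Structured Reflection Notes, which steer the model at inference time to avoid errors and attend to details. The method reaches the Pareto frontier between factuality and coverage, delivers substantial gains on the CapArena-Auto benchmark, and costs less compute than model scaling or other multi-agent methods.

Details

Motivation: Existing detailed captioning methods struggle to achieve factual accuracy and fine-grained coverage at the same time. ReflectCAP aims to ease this tension by reflecting on the model's recurring errors and omissions.

Result: Across 8 LVLMs spanning the GPT-4.1 family, the Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the factuality-coverage trade-off and produces captions that compare favorably against strong reference models on CapArena-Auto, whereas model scaling and existing multi-agent pipelines incur 21-36% greater overhead.

Insight: The novelty is the use of Structured Reflection Notes as reusable guidance: a multi-agent analysis of the model's systematic biases steers caption generation, yielding a better trade-off between quality and compute cost and making high-quality detailed captioning practical in real applications.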

Abstract: Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes – what to avoid and what to attend to – yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.


eess.IV

[121] Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation eess.IV | cs.CV

Nafis Fuad Shahid, Maroof Ahmed, Md Akib Haider, Saidur Rahman Sagor, Aashnan Rahman

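TL;DR: This paper proposes a Probabilistic Feature Imputation Network (P-FIN) and an uncertainty-aware aggregation strategy (Fed-UQ-Avg) for multimodal federated learning, addressing the heterogeneity caused by missing modalities in healthcare settings. The method outputs calibrated uncertainty estimates alongside imputed features and uses that uncertainty in both local classification and global aggregation to improve robustness and performance on chest X-ray classification.

Details

Motivation: To address modality heterogeneity in multimodal federated learning caused by resource constraints or workflow differences across clinical sites. Existing deterministic feature imputation methods provide no reliability measure, which is risky in safety-critical medical applications.

Result: In federated chest X-ray classification experiments on CheXpert, NIH Open-I, and PadChest, the method improves consistently over deterministic baselines, with an AUC gain of +5.36% in the most challenging configuration.

Insight: The novelty is bringing probabilistic modeling to feature imputation, producing imputed features with uncertainty estimates, together with a two-level use of that uncertainty (local sigmoid gating and global Fed-UQ-Avg aggregation). This offers a safer and more interpretable way to handle missing modalities in federated learning.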

Abstract: Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.
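The two levels at which uncertainty is used might be sketched as follows. Both formulas are assumptions chosen for illustration: the gate is taken to be `sigmoid(-log_var)` per feature dimension, and the aggregation weight is taken to be the inverse of each client's mean uncertainty. The abstract names the mechanisms but does not give their exact form.

```python
import math


def gate(features, log_var):
    """Scale each imputed feature by sigmoid(-log_var): higher variance, smaller gate."""
    return [f / (1.0 + math.exp(lv)) for f, lv in zip(features, log_var)]


def uncertainty_weighted_average(updates, uncertainties):
    """Average client updates with weights proportional to 1 / uncertainty (assumed rule)."""
    weights = [1.0 / u for u in uncertainties]
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * upd[i] for w, upd in zip(weights, updates)) / total
            for i in range(dim)]


# Toy values: one feature with log-variance 0, and two clients with scalar updates.
print(gate([2.0], [0.0]))
print(uncertainty_weighted_average([[1.0], [3.0]], [1.0, 3.0]))
```

With these toy numbers the less certain client (uncertainty 3.0) contributes a third of the weight of the more certain one, pulling the average toward the reliable update.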


cs.CR

[122] TimeMark: A Trustworthy Time Watermarking Framework for Exact Generation-Time Recovery from AIGC cs.CR | cs.CL

Shangkun Che, Silin Du, Ge Gao

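TL;DR: This paper proposes TimeMark, a trustworthy time watermarking framework that enables exact recovery of the generation time of AI-generated content (AIGC), targeting the need for judicial evidence in intellectual property disputes. The framework combines cryptographic techniques with a two-stage encoding mechanism and error-correcting codes to reach 100% identification accuracy while resisting user-side statistical attacks and provider-side forgery.

Details

Motivation: Existing watermarking methods based on statistical signals in token distributions suffer from probabilistic detection, low reliability, and vulnerability to forgery, and so cannot meet the reliability required of judicial evidence. The paper aims to design a trustworthy watermarking framework that recovers time information exactly, reliably, and unforgeably.

Result: Theoretical analysis and experiments indicate that the framework satisfies the reliability requirements for judicial evidence and can recover generation time with theoretically perfect accuracy, offering a practical solution for future AIGC-related intellectual property disputes.

Insight: The novelty is the concept of a "trustworthy watermark", which combines cryptographic techniques with watermark encoding: time-dependent keys and a random, non-stored payload eliminate detectable statistical patterns, and two-stage encoding with error-correcting codes ensures perfect recovery. This suggests a route to attack-resistant, verifiable provenance mechanisms for AIGC.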

Abstract: The widespread use of Large Language Models (LLMs) in text generation has raised increasing concerns about intellectual property disputes. Watermarking techniques, which embed meta information into AI-generated content (AIGC), have the potential to serve as judicial evidence. However, existing methods rely on statistical signals in token distributions, leading to inherently probabilistic detection and reduced reliability, especially in multi-bit encoding (e.g., timestamps). Moreover, such methods introduce detectable statistical patterns, making them vulnerable to forgery attacks and enabling model providers to fabricate arbitrary watermarks. To address these issues, we propose the concept of trustworthy watermark, which achieves reliable recovery with 100% identification accuracy while resisting both user-side statistical attacks and provider-side forgery. We focus on trustworthy time watermarking for use as judicial evidence. Our framework integrates cryptographic techniques and encodes time information into time-dependent secret keys under regulatory supervision, preventing arbitrary timestamp fabrication. The watermark payload is decoupled from time and generated as a random, non-stored bit sequence for each instance, eliminating statistical patterns. To ensure verifiability, we design a two-stage encoding mechanism, which, combined with error-correcting codes, enables reliable recovery of generation time with theoretically perfect accuracy. Both theoretical analysis and experiments demonstrate that our framework satisfies the reliability requirements for judicial evidence and offers a practical solution for future AIGC-related intellectual property disputes.


[123] CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training cs.CR | cs.CV

Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang

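TL;DR: This paper proposes the CoLA (Choice Leakage Attack) framework, which challenges the intuition in machine learning that training on a subset of data lowers privacy risk. It shows that subset training itself introduces a new privacy surface through the selection process, leaking private information about training members and about all samples that took part in selection.

Details

Motivation: To address the widely underestimated privacy risk of subset training (such as coreset selection and large-scale filtering), and to show that the data selection process itself introduces a new privacy surface rather than only reducing risk.

Result: Experiments on vision and language models show that existing threat models underestimate the privacy risks of subset training: the expanded privacy surfaces (TM-MIA and SP-MIA) leak both training membership and selection participation, extending risk from individual models to the broader ML ecosystem.

Insight: The novelty is extending privacy leakage analysis from model training to the entire data-model supply chain (through SP-MIA), and systematically defining two attack scenarios, one based on side-channel metadata (Subset-aware Side-channel Attacks) and one on model outputs (Black-box Attacks), which reveals the subset selection process itself as a key source of privacy leakage.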

Abstract: Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocessing step in modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy free: the very choices of which data are included or excluded can introduce a new privacy surface and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary’s knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.


cs.SE

[124] From Plan to Action: How Well Do Agents Follow the Plan? cs.SE | cs.AI | cs.CL

Shuyang Liu, Saman Dehghan, Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand

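TL;DR: This paper presents the first systematic analysis of how closely programming agents (such as SWE-agent) follow an instructed plan when solving software engineering tasks. Evaluating 16,991 trajectories from four large language models on SWE-bench Verified and SWE-bench Pro, it finds that without an explicit plan agents fall back on incomplete or overfit workflows internalized during training, and that providing the standard plan raises task success, although plan quality matters greatly.

Details

Motivation: Agents are commonly instructed to follow a task-specific plan (for example, navigation, reproduction, patch, and validation phases for resolving software issues), but it is unknown to what extent they actually follow such plans. Without that knowledge it is impossible to tell whether a solution was reached through correct strategic reasoning or through data contamination or benchmark overfitting, so plan compliance must be analyzed to understand agent decision-making.

Result: On SWE-bench Verified and SWE-bench Pro, providing the standard plan improves issue resolution, and periodic plan reminders reduce plan violations and improve task success. A subpar plan hurts performance more than no plan at all, and adding extra task-relevant phases early on can degrade performance, particularly when those phases do not align with the model's internal problem-solving strategy.

Insight: The paper exposes a research gap in agent plan compliance and argues for fine-tuning paradigms that teach models to follow instructed plans rather than encoding task-specific plans in them, which requires models to reason and act adaptively instead of memorizing workflows. Plan design and its fit with the model's internal strategy also have a marked effect on performance.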

Abstract: Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without an analysis of the extent to which agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model’s internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.


cs.SD [Back]

[125] Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification cs.SD | cs.CLPDF

Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed

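TL;DR: This paper proposes TRIAGE, an adaptive test-time compute scaling framework for zero-shot respiratory audio classification. A confidence-based router assigns audio samples to inference tiers of increasing computational cost (from simple label-cosine scoring to retrieval-augmented large language model reasoning), so samples of different difficulty are handled differently without task-specific training, lowering overall compute cost while improving classification performance.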

Details

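Motivation: Respiratory audio analysis suffers from scarce labeled data and costly expert annotation, and existing zero-shot methods apply uniform computation to every input without accounting for differences in sample difficulty.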

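Result: Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. The analysis shows that gains concentrate on uncertain samples (up to 19% relative improvement), while the effect on high-confidence predictions is negligible.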

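Insight: The novelty is a tiered, confidence-routed framework for adaptive test-time compute scaling that dynamically allocates computation to samples according to their difficulty, balancing computational efficiency against classification accuracy. Viewed objectively, letting easy samples exit early while hard samples receive deeper reasoning offers a useful pattern for zero-shot learning under resource constraints.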

Abstract: Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
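
The early-exit routing described in the abstract can be sketched as a cascade that runs tiers from cheapest to most expensive and stops at the first sufficiently confident prediction. This is a minimal sketch under stated assumptions: the `route` function, the use of the maximum class probability as the confidence score, the threshold values, and the toy tier functions are all hypothetical stand-ins, not TRIAGE's actual router or models.

```python
def route(sample, tiers, thresholds):
    """Run tiers cheapest-first and exit at the first confident prediction.

    tiers: list of callables mapping a sample to class probabilities.
    thresholds: per-tier confidence thresholds (assumed values).
    Returns (predicted_label_index, index_of_tier_that_answered).
    """
    last = len(tiers) - 1
    for index, (tier, threshold) in enumerate(zip(tiers, thresholds)):
        probs = tier(sample)
        confidence = max(probs)  # assumption: max probability as confidence
        # The final tier always answers, since nothing remains to escalate to.
        if confidence >= threshold or index == last:
            label = max(range(len(probs)), key=probs.__getitem__)
            return label, index


# Toy stand-ins for Tier-L, Tier-M, and Tier-H (purely illustrative).
def tier_l(sample):
    return [0.9, 0.1] if sample == "easy" else [0.55, 0.45]


def tier_m(sample):
    return [0.6, 0.4]


def tier_h(sample):
    return [0.2, 0.8]


TIERS = [tier_l, tier_m, tier_h]
THRESHOLDS = [0.8, 0.7, 0.0]

if __name__ == "__main__":
    print(route("easy", TIERS, THRESHOLDS))  # exits at the cheapest tier
    print(route("hard", TIERS, THRESHOLDS))  # escalates to the last tier
```

In this toy setup the thresholds control the trade-off between cost and accuracy: raising them sends more samples to the expensive tiers.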


[126] CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing cs.SD | cs.CV | cs.MMPDF

Gaoxiang Cong, Liang Li, Jiaxin Ye, Zhedong Zhang, Hongming Shan

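TL;DR: This paper proposes CoSyncDiT, a flow-matching-based movie dubbing framework inspired by the cognitive process of professional actors. Its Cognitive Synchronous Diffusion Transformer progressively guides the noise-to-speech generation trajectory through acoustic style adaptation, fine-grained visual calibration, and time-aware context alignment, synthesizing speech that is lip-synchronized with the target video while preserving the timbre identity of the reference audio.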

Details

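Motivation: Existing movie dubbing methods that align explicitly at the duration level struggle to achieve precise lip-sync and lack naturalness, while implicit alignment methods are prone to interference from the reference audio in in-the-wild scenarios, which degrades timbre and pronunciation.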

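Result: Extensive experiments on standard benchmarks and challenging in-the-wild dubbing benchmarks show that the method achieves state-of-the-art performance across multiple metrics.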

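Insight: The novelty lies in the cognitively inspired CoSync-DiT architecture and the Joint Semantic and Alignment Regularization mechanism, which simultaneously constrains frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states to ensure robust alignment.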

Abstract: Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by executing acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves state-of-the-art performance across multiple metrics.


cs.RO [Back]

[127] ReefMapGS: Enabling Large-Scale Underwater Reconstruction by Closing the Loop Between Multimodal SLAM and Gaussian Splatting cs.RO | cs.CVPDF

Daniel Yang, Jungseok Hong, John J. Leonard, Yogesh Girdhar

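TL;DR: This paper proposes ReefMapGS, an incremental underwater 3D reconstruction framework based on 3D Gaussian Splatting (3DGS). It obtains camera poses from multimodal SLAM (acoustic, inertial, pressure, and visual sensors), builds an initial model from a high-certainty region, and expands it progressively by interleaving local image tracking with 3DGS scene optimization. The refined poses are fed back into the pose graph for global trajectory optimization, yielding COLMAP-free large-scale underwater reconstruction and more accurate autonomous underwater vehicle (AUV) pose estimation.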

Details

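Motivation: 3D Gaussian Splatting (3DGS) depends heavily on accurate camera poses obtained from computationally intensive structure-from-motion (SfM), which makes it hard to apply in field robotics such as AUVs. The work aims to exploit the multimodal sensor data available in robotics, estimating poses and modeling their uncertainty with pose-graph-optimization-based SLAM, to enable large-scale underwater 3D reconstruction in field environments.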

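Result: The method achieves COLMAP-free 3D reconstruction at two geometrically complex underwater reef sites and, on survey trajectories of up to 700 m, yields more accurate global AUV pose estimates than baseline methods.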

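Insight: The novelty is closing the loop between multimodal SLAM and 3D Gaussian Splatting in an incremental reconstruction framework that alternates local tracking with global optimization and uses the pose uncertainty from SLAM to guide reconstruction. This enables large-scale, high-quality 3D scene reconstruction on resource-constrained robot platforms and suggests a new way to combine visual SLAM with neural rendering for field robots.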

Abstract: 3D Gaussian Splatting is a powerful visual representation, providing high-quality and efficient 3D scene reconstruction, but it is crucially dependent on accurate camera poses typically obtained from computationally intensive processes like structure-from-motion that are unsuitable for field robot applications. However, in these domains, multimodal sensor data from acoustic, inertial, pressure, and visual sensors are available and suitable for pose-graph optimization-based SLAM methods that can estimate the vehicle’s trajectory and thus our needed camera poses while providing uncertainty. We propose a 3DGS-based incremental reconstruction framework, ReefMapGS, that builds an initial model from a high certainty region and progressively expands to incorporate the whole scene. We reconstruct the scene incrementally by interleaving local tracking of new image observations with optimization of the underlying 3DGS scene. These refined poses are integrated back into the pose-graph to globally optimize the whole trajectory. We show COLMAP-free 3D reconstruction of two underwater reef sites with complex geometry as well as more accurate global pose estimation of our AUV over survey trajectories spanning up to 700 m.


[128] Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers cs.RO | cs.CVPDF

Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki

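TL;DR: This paper proposes WHOLE-MoMa, a two-stage offline reinforcement learning framework for whole-body mobile manipulation. It first generates diverse demonstrations by randomizing a lightweight whole-body controller, then improves on that behavior with offline reinforcement learning. Experiments in simulation and on a real TIAGo++ robot show that it significantly outperforms classical controllers and baselines on tasks such as opening doors and drawers, and transfers zero-shot without any teleoperation data.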

Details

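Motivation: In whole-body mobile manipulation, classical controllers require extensive hand tuning and generalize poorly, while learning-based methods depend on expensive teleoperation data or complex reward design.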

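Result: On three simulated tasks of increasing difficulty, WHOLE-MoMa significantly outperforms the whole-body controller, behavior cloning, and several offline reinforcement learning baselines. Policies transfer to the real robot without fine-tuning, reaching 80% success on bimanual drawer manipulation and 68% on simultaneous cupboard opening and object placement.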

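Insight: The novelty is treating a sub-optimal whole-body controller as a structural prior for collecting task-relevant data, and extending offline implicit Q-learning with Q-chunking to support action-chunked diffusion policies that can handle complex coordination tasks. This offers an effective way to bootstrap reinforcement learning from existing controllers.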

Abstract: Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot’s base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.
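
For readers unfamiliar with implicit Q-learning, the two ingredients the abstract builds on can be written down compactly: an asymmetric (expectile) regression loss for the value function, and exponentiated-advantage weights for policy extraction. The sketch below shows the standard scalar forms only; treating the Q-value as a score for a whole action chunk, the default `tau` and `beta` values, and the weight clipping constant are assumptions for illustration, not details confirmed by the paper.

```python
import math


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is Q(s, a) - V(s). With tau > 0.5, underestimating Q is penalized
    more than overestimating it, so V tracks an upper expectile of Q.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def advantage_weight(q_chunk, v_state, beta=3.0, max_weight=100.0):
    """Weight exp(beta * advantage) for advantage-weighted policy extraction.

    q_chunk: critic value of an action chunk (assumption: one scalar per
    chunk, standing in for chunk-level critic evaluation).
    The clip at max_weight is a common stabilisation trick, assumed here.
    """
    return min(math.exp(beta * (q_chunk - v_state)), max_weight)


if __name__ == "__main__":
    print(expectile_loss(1.0), expectile_loss(-1.0))
    print(advantage_weight(1.2, 1.0), advantage_weight(0.8, 1.0))
```

Demonstrations whose chunks score above the state value receive weights greater than one, so the extracted policy leans toward the better parts of the sub-optimal controller's data.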


[129] DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding cs.RO | cs.CVPDF

Yuhan Jin, Nayari Marie Lessa, Mariela De Lucas Alvarez, Melvin Laux, Lucas Amparo Barbosa

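TL;DR: The paper proposes DINO-Explorer, an active perception framework for autonomous underwater vehicles (AUVs). Operating in the latent space of a frozen DINOv3 foundation model, it uses a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution and produce a continuous novelty-aware signal. An efference-copy-inspired module uses globally pooled optical flow to discount visual changes caused by the vehicle's own motion, so the signal focuses on genuine environmental novelty. Evaluated on asynchronous event triage under varying telemetry constraints, the system effectively surfaces scientifically valuable events while saving substantial bandwidth.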

Details

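Motivation: Marine ecosystem degradation calls for continuous, scientifically selective underwater monitoring. Most AUVs, however, operate as passive data loggers that capture exhaustive video for offline review and frequently miss transient events of high scientific value. Moving to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes.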

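Result: On asynchronous event triage under varying telemetry constraints, at a fixed operating point DINO-Explorer retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise-signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the frontier of validated peak F1 versus telemetry bandwidth, cutting telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score.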

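Insight: The novelty is combining the semantic understanding of a foundation model (DINOv3) with active perception, using an action-conditioned recurrent predictor to generate a "semantic surprise" signal online. The core insight is an efference-copy-inspired module that uses optical flow to separate self-motion from environmental change, filtering out visual disturbances caused by AUV maneuvers and focusing detection on genuinely novel events. This provides an efficient and robust attention mechanism for resource-constrained, online, active underwater monitoring.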

Abstract: Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.
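
A predictive-coding surprise signal of the kind described above can be illustrated as the mismatch between a predicted and an observed frame embedding, with events triggered when that mismatch is large. The sketch below is a simplified illustration under stated assumptions: cosine distance as the mismatch measure, a fixed trigger threshold, and the helper names are all hypothetical, and the action-conditioned predictor and optical-flow compensation that would produce the predicted embeddings are not shown.

```python
import math


def cosine_distance(predicted, observed):
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    return 1.0 - dot / (norm_p * norm_o)


def surprise_trace(predicted_seq, observed_seq):
    """Per-frame surprise: mismatch between prediction and observation.

    predicted_seq is assumed to come from a predictor that already accounts
    for the vehicle's own motion, so residual mismatch reflects the scene.
    """
    return [cosine_distance(p, o) for p, o in zip(predicted_seq, observed_seq)]


def select_events(surprises, threshold=0.5):
    """Indices of frames whose surprise exceeds the (assumed) threshold."""
    return [i for i, s in enumerate(surprises) if s > threshold]


if __name__ == "__main__":
    predicted = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    observed = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    trace = surprise_trace(predicted, observed)
    print(trace, select_events(trace))
```

Only the frames selected by the threshold would be transmitted, which is how a signal like this can trade telemetry bandwidth against event recall.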