cs.CL

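[1] TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models cs.CL | cs.AI

Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang

TL;DR: This paper proposes TDA-RC, a task-driven alignment method based on topological optimization that aims to improve the reasoning ability of large language models. It maps reasoning paradigms such as CoT, ToT, and GoT into a unified topological space, quantifies their structural features with persistent homology, and designs a Topological Optimization Agent that diagnoses and repairs structural defects in CoT reasoning chains, approaching the performance of multi-round reasoning methods while retaining the efficiency of single-round generation.

Motivation: The CoT paradigm is efficient, but its reasoning chains often contain logical gaps. Multi-round methods such as ToT and GoT perform strongly and reveal effective reasoning structures, but their high cost limits practical use. The paper asks how to bring the intelligence of multi-round reasoning into single-round generation, balancing reasoning accuracy against efficiency.

Result: Experiments on multiple datasets show that, compared with multi-round methods such as ToT and GoT, the approach strikes a better balance between reasoning accuracy and efficiency, offering a practical route to “single-round generation with multi-round intelligence”.

Insight: The novelty lies in bringing a topological tool (persistent homology) to reasoning-chain analysis, building a unified topological space that quantifies the structural features of different reasoning paradigms, and designing a Topological Optimization Agent that combines diagnosis and repair. This injects the structural knowledge of multi-round reasoning into the lightweight CoT paradigm, a novel cross-disciplinary combination of methods.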

Abstract: Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom of Thought (AoT) achieve strong performance and reveal effective reasoning structures, their high cost limits practical use. To address this problem, this paper proposes a topology-based method for optimizing reasoning chains. The framework embeds essential topological patterns of effective reasoning into the lightweight CoT paradigm. Using persistent homology, we map CoT, ToT, and GoT into a unified topological space to quantify their structural features. On this basis, we design a unified optimization system: a Topological Optimization Agent diagnoses deviations in CoT chains from desirable topological characteristics and simultaneously generates targeted strategies to repair these structural deficiencies. Compared with multi-round reasoning methods like ToT and GoT, experiments on multiple datasets show that our approach offers a superior balance between reasoning accuracy and efficiency, showcasing a practical solution to “single-round generation with multi-round intelligence”.


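[2] The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse cs.CL | cs.AI

Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

TL;DR: This paper studies the “reversal curse” in autoregressive language models, i.e., the failure to retrieve a fact in reverse (for example, training on “A > B” but failing on “B < A”). It compares masked language modeling (MLM) with decoder-only masking-based training on four reversal benchmarks and adds a mechanistic analysis. It finds that gains in reversal accuracy require the source entity to be an explicit prediction target, and that success does not correspond to a single direction-agnostic representation of a fact; instead, the forward and reverse directions are stored as distinct entries.

Motivation: To investigate the reversal curse in factual retrieval by autoregressive language models, assess whether objectives with bidirectional supervision (such as bidirectional attention or masking-based reconstruction) can mitigate it, and understand the mechanism by which these objectives achieve reversal accuracy.

Result: On four reversal benchmarks, both MLM and decoder-only masking-based training improve reversal accuracy, but the mechanistic analysis indicates that the gain does not come from a single direction-agnostic representation; it is achieved by storing forward and reverse entries under different indexing geometries.

Insight: The key finding is that mitigating the reversal curse does not necessarily produce latent generalization (such as forming a unified concept); it depends on the training signal explicitly making the source entity a prediction target. This cautions that objective-level “fixes” may improve behavior without inducing the expected latent generalization. The mechanistic analysis suggests that models may store forward and reverse facts as independent entries, which offers a new perspective on representation learning in language models.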

Abstract: The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on “$A > B$” but failing on “$B < A$”). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level “fixes” can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.


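[3] Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space cs.CL | cs.AI

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

TL;DR: This paper proposes Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy for the preference instability that large language models (LLMs) show on multiple-choice questions (MCQs), where plausible distractors make them oscillate between correct and incorrect answers. The method reconstructs the question with only the plausible options, reducing cognitive load so the model can focus on the correct answer.

Motivation: During evaluation, LLMs are easily swayed by plausible distractors in MCQs, which diverts attention and destabilizes answer preferences, undermining the reliability of their reasoning.

Result: Extensive empirical evaluation shows that IoT substantially improves chain-of-thought performance on arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

Insight: The novelty is a progressive self-filtering strategy that purifies the decision space. Explicitly documenting the filtering process makes the model's decisions more transparent and interpretable, while stabilizing its internal reasoning in the presence of distractors.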

Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model’s internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model’s decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.
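The filtering idea in this abstract can be sketched as a short loop. This is only one plausible reading, not the authors' procedure: the function name `progressive_filter`, the `keep` parameter, the trace format, and the toy `toy_choose` scorer (a stand-in for an LLM call) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: given a question and the options shown, return one option.
Chooser = Callable[[str, List[str]], str]


def progressive_filter(question: str, options: List[str], choose: Chooser,
                       keep: int = 2) -> Tuple[str, List[Dict]]:
    """Shortlist `keep` plausible options, then decide among them only."""
    remaining = list(options)
    shortlist: List[str] = []
    trace: List[Dict] = []
    # Filtering rounds: record the pick, then hide it and ask again.
    while remaining and len(shortlist) < keep:
        pick = choose(question, remaining)
        trace.append({"stage": "filter", "shown": list(remaining), "pick": pick})
        shortlist.append(pick)
        remaining.remove(pick)
    # Final round: the reconstructed question contains only the shortlist.
    final = choose(question, shortlist)
    trace.append({"stage": "final", "shown": list(shortlist), "pick": final})
    return final, trace


# Toy stand-in for an LLM call (assumption): it prefers the option with the
# highest hand-written score, which keeps the example deterministic.
SCORES = {"12": 0.9, "14": 0.7, "21": 0.2, "7": 0.1}


def toy_choose(question: str, shown: List[str]) -> str:
    return max(shown, key=lambda option: SCORES[option])


answer, trace = progressive_filter("What is 3 * 4?", ["7", "12", "14", "21"], toy_choose)
```

The returned `trace` plays the role of the documented filtering process mentioned in the abstract: each entry records which options were shown and which one was picked.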


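[4] Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation cs.CL | cs.AI | cs.LG

Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, Shammur Absar Chowdhury

TL;DR: This paper proposes OmniScore, a family of deterministic learned metrics for multilingual generative text evaluation, as an alternative to automated LLM-based judging. OmniScore uses small models (<1B parameters) trained on large-scale synthetic supervision (about 564k instances covering 107 languages) to achieve low latency and consistency. It supports reference-based, source-grounded, and hybrid evaluation, and is validated on question answering, translation, and summarization in 6 languages.

Motivation: LLMs used as automated judges are costly, sensitive to prompt design, language, and aggregation strategy, and hard to reproduce. The paper aims for a more practical and scalable alternative.

Result: Evaluated on 8,617 manually annotated instances, OmniScore approximates LLM-judge behavior on question answering, translation, and summarization while keeping the low latency and consistency of traditional model-based scoring, providing empirical support for lightweight deterministic metrics.

Insight: The novelty is a family of deterministic learned metrics built on small models and trained on large-scale multilingual synthetic data, which approximates LLM-judge behavior while improving reliability and scalability. Viewed objectively, the approach lowers evaluation cost and improves reproducibility, offering a new technical path for multilingual generative text evaluation.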

Abstract: While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose OmniScore, a family of complementary, deterministic learned metrics developed using small (<1B parameter) models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision (~564k instances in 107 languages) and evaluated them using 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in 6 languages. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at https://huggingface.co/collections/QCRI/omniscore


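[5] Document Optimization for Black-Box Retrieval via Reinforcement Learning cs.CL | cs.IR

Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts

TL;DR: This paper proposes a reinforcement-learning approach to document optimization that improves the performance of black-box retrievers. A language model or vision-language model is fine-tuned with GRPO (Group Relative Policy Optimization) to transform documents into representations that better align with the expected query distribution under the target retriever, using only the retriever's ranking results as the reward signal. Experiments on code retrieval and visual document retrieval show that the approach lets smaller, more efficient retrievers outperform larger ones, and that combining it with fine-tuning of the retriever weights performs best in most settings.

Motivation: Traditional document expansion introduces noise and degrades performance when applied to modern retrievers, so a new approach is needed that optimizes document representations to improve retrieval quality without adding query-time computation.

Result: With OpenAI's text-embedding-3-small, document optimization raises nDCG@5 from 58.7 to 66.8 on code retrieval and from 53.3 to 57.6 on visual document retrieval, slightly surpassing the 6.5x more expensive text-embedding-3-large. Combined with retriever fine-tuning, Jina-ColBERT-V2 improves in nDCG@5 from 55.8 to 63.3 on visual document retrieval and from 48.6 to 61.8 on code retrieval.

Insight: The novelty is recasting document expansion as a document optimization problem, using reinforcement learning (GRPO) with retrieval-rank improvements as the reward. It needs only black-box access to the retriever and applies across retrieval architectures. Viewed objectively, the approach balances offline computation with performance gains and suggests a new direction for designing efficient retrieval systems.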

Abstract: Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever’s ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to the OpenAI text-embedding-3-small model improves nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5X more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.
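For readers unfamiliar with the metric reported above, the following sketch computes nDCG at cutoff 5 from a ranked list of relevance labels and turns a before/after difference into a scalar reward. The nDCG formula is the standard one; the `rank_reward` definition is an assumption, since the abstract says only that ranking improvements serve as rewards and does not give the exact form.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (rank 1 has discount 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """nDCG@k; `relevances` holds the labels of the full candidate list in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def rank_reward(ndcg_before: float, ndcg_after: float) -> float:
    """Hypothetical reward: improvement in nDCG after rewriting the document."""
    return ndcg_after - ndcg_before


# Made-up example: the single relevant document moves from rank 3 to rank 1.
before = ndcg_at_k([0, 0, 1, 0, 0])
after = ndcg_at_k([1, 0, 0, 0, 0])
reward = rank_reward(before, after)
```

Because the reward depends only on ranked relevance labels, it can be computed with black-box access to the retriever's output ranks, which matches the access assumption stated in the abstract.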


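[6] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World cs.CL

Hanbing Liu, Lang Cao, Yang Li

TL;DR: This paper addresses how large language models adapt under continuous knowledge drift, proposing a benchmark of real-world dynamic events built from time-stamped evidence for systematically evaluating model adaptation. It finds that existing methods (such as RAG and learning-based approaches) perform poorly in this setting, with limitations including catastrophic forgetting and temporal inconsistency. The authors therefore propose Chronos, a time-aware retrieval baseline that organizes retrieved evidence into an Event Evolution Graph to improve temporally consistent understanding without additional training.

Motivation: LLMs acquire most of their knowledge during pretraining, which fixes it at a point in time and makes it hard to adapt to continuously evolving real-world knowledge (changing facts, entities, and events), leading to outdated predictions and temporally inconsistent reasoning. Existing approaches (such as continual fine-tuning, knowledge editing, and RAG) lack systematic evaluation in settings that reflect chronological order and real knowledge evolution.

Result: On the constructed benchmark of real-world dynamic events, most existing methods (including vanilla RAG and several learning-based approaches) perform poorly, exposing key limitations such as catastrophic forgetting and temporal inconsistency. The proposed time-aware retrieval baseline, Chronos, improves temporally consistent understanding without training by organizing evidence into an Event Evolution Graph.

Insight: The contributions are a benchmark built on time-stamped evidence for systematically evaluating model adaptation under continuous knowledge drift, and Chronos, which uses an Event Evolution Graph to make retrieval more temporally consistent. Together they provide a basis for analyzing and improving LLM knowledge updating in realistic settings.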

Abstract: Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change over time, models may experience continuous knowledge drift, resulting not only in outdated predictions but also in temporally inconsistent reasoning. Although existing approaches, such as continual finetuning, knowledge editing, and retrieval-augmented generation (RAG), aim to update or supplement model knowledge, they are rarely evaluated in settings that reflect chronological, evolving, and real-world knowledge evolution. In this work, we introduce a new benchmark of real-world dynamic events, constructed from time-stamped evidence that captures how knowledge evolves over time, which enables systematic evaluation of model adaptation under continuous knowledge drift. The benchmark reveals that most existing methods, including vanilla RAG and several learning-based approaches, struggle under this setting, exposing critical limitations such as catastrophic forgetting and temporal inconsistency. To mitigate these limitations, we propose a time-aware retrieval baseline, Chronos, which progressively organizes retrieved evidence into an Event Evolution Graph to enable more temporally consistent understanding in LLMs without additional training. Overall, this work provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings.
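The abstract does not say how the Event Evolution Graph is built, so the following is only a rough sketch of the general idea of ordering time-stamped evidence. The record fields (`id`, `entity`, `date`, `fact`), the function name `build_event_graph`, and the rule of linking consecutive evidence about the same entity are all assumptions, and the evidence items are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Evidence = Dict[str, str]


def build_event_graph(evidence: List[Evidence]) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Link each piece of evidence to the next one about the same entity.

    Returns the "evolves into" edges and the most recent fact per entity.
    Dates are ISO strings (YYYY-MM-DD), so string order is chronological.
    """
    by_entity: Dict[str, List[Evidence]] = defaultdict(list)
    for item in sorted(evidence, key=lambda e: e["date"]):
        by_entity[item["entity"]].append(item)

    edges: List[Tuple[str, str]] = []
    for items in by_entity.values():
        edges.extend((a["id"], b["id"]) for a, b in zip(items, items[1:]))
    latest = {entity: items[-1]["fact"] for entity, items in by_entity.items()}
    return edges, latest


# Invented evidence, retrieved out of chronological order.
docs = [
    {"id": "e2", "entity": "bridge", "date": "2024-06-01", "fact": "closed for repairs"},
    {"id": "e1", "entity": "bridge", "date": "2024-01-15", "fact": "open"},
    {"id": "e3", "entity": "bridge", "date": "2024-09-20", "fact": "reopened"},
]
edges, latest = build_event_graph(docs)
```

Presenting evidence as an ordered chain rather than an unordered set is one way to make the most recent state explicit to the model without any additional training.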


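[7] $π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models cs.CL | cs.AI | cs.LG

Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare

TL;DR: This paper proposes $π^2$, a pipeline that builds high-quality reasoning data from initial structured data to improve the long-context reasoning ability of large language models. It extracts and expands tables from Wikipedia, generates multi-hop analytical reasoning questions, automatically verifies the answers through dual-path code execution, and finally back-translates structured reasoning traces as solutions. Models fine-tuned on this data achieve significant gains on several long-context reasoning benchmarks.

Motivation: LLMs reason poorly over long contexts, in particular because high-quality, structured long-range reasoning data is scarce.

Result: On four long-context reasoning benchmarks and the authors' own $π^2$-Bench, GPT-OSS-20B and Qwen3-4B-Instruct-2507 fine-tuned on $π^2$ data gain +4.3% and +2.7% in average absolute accuracy, respectively. With self-distillation (using its own reasoning traces), GPT-OSS-20B improves its average performance by +4.4%.

Insight: The novelty is a systematic pipeline that automatically builds a high-quality, multi-hop, long-context reasoning dataset from structured data (Wikipedia tables) and ensures answer correctness through dual-path code execution. The work demonstrates the value of structured data sources and automated verification for generating effective training data, and shows the potential of self-distillation for improving a model's own reasoning.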

Abstract: We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, $π^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions of QA pairs given realistic web-search context. Supervised fine-tuning with gpt-oss-20b and Qwen3-4B-Instruct-2507 on $π^2$ yields consistent improvements across four long-context reasoning benchmarks and our similar $π^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where gpt-oss-20b even improves its average performance by +4.4% with its own reasoning traces, demonstrating $π^2$’s usefulness. Our code, data, and models are open-source at https://github.com/vt-pi-squared/pi-squared.
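Dual-path verification can be illustrated with a toy table: two independently written programs compute the same answer, and the QA pair is kept only if they agree. This is a minimal sketch under assumptions; the table, the question, and the names `verify_dual_path`, `path_a`, `path_b`, and `buggy_path` are invented and not taken from the released code.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]
Table = List[Row]


def verify_dual_path(table: Table,
                     path_a: Callable[[Table], float],
                     path_b: Callable[[Table], float],
                     tol: float = 1e-9) -> Optional[float]:
    """Return the answer only if both execution paths agree, else None."""
    a, b = path_a(table), path_b(table)
    return a if abs(a - b) <= tol else None


# Invented question: "What share of its medals were gold for the team
# with the most total medals?"
table = [
    {"gold": 3, "total": 10},
    {"gold": 5, "total": 9},
    {"gold": 1, "total": 12},
]


def path_a(t: Table) -> float:
    top = max(t, key=lambda row: row["total"])
    return top["gold"] / top["total"]


def path_b(t: Table) -> float:
    ranked = sorted(t, key=lambda row: row["total"], reverse=True)
    return ranked[0]["gold"] / ranked[0]["total"]


def buggy_path(t: Table) -> float:
    # Ranks by gold medals instead of total medals, so it disagrees.
    top = max(t, key=lambda row: row["gold"])
    return top["gold"] / top["total"]


verified = verify_dual_path(table, path_a, path_b)
rejected = verify_dual_path(table, path_a, buggy_path)
```

Agreement between two paths does not prove correctness, but it filters out many single-program errors at low cost, which is the role the abstract assigns to this step.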


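[8] SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning cs.CL | cs.CE

Berny Kabalisa

TL;DR: SenseAI is a human-in-the-loop validated financial sentiment reasoning dataset that records not only model outputs but the full reasoning process, including reasoning chains, confidence scores, human correction signals, and real-world market outcomes. It is designed to support Reinforcement Learning from Human Feedback (RLHF) paradigms.

Motivation: Existing financial sentiment datasets do not fully record the model's reasoning process and so cannot effectively support RLHF alignment and model improvement. A structured human-in-the-loop dataset is needed to capture and correct systematic model errors.

Result: The dataset contains 1,439 labelled data points covering 40 US-listed equities and 13 financial data categories. Analysis reveals systematic patterns in model behavior, such as Latent Reasoning Drift, confidence miscalibration, and forward projection tendencies.

Insight: The novelty is a financial sentiment dataset that, for the first time, integrates the full reasoning process with human-in-the-loop feedback, together with the identification of predictable model error patterns, providing a structured data basis for targeted model improvement.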

Abstract: We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.


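[9] EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering cs.CL

Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, Chuxu Zhang

TL;DR: EvolveRouter is a trainable framework for multi-agent question answering that addresses the limitations of existing routing methods by jointly optimizing agent quality and collaboration structure. It couples graph-based routing with instruction refinement in a closed-loop co-evolution process and introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query, yielding more capable and more efficient multi-agent reasoning.

Motivation: Existing routing methods typically optimize over a fixed pool of agents without improving the agents themselves, and rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query.

Result: Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms state-of-the-art routing baselines in both F1 and exact match.

Insight: The novelty is coupling graph-based routing with targeted instruction refinement in a closed-loop co-evolution process, so that router diagnostics guide agent improvement while the refined agents provide cleaner supervision for routing. In addition, router-weighted answer agreement dynamically determines the effective collaboration size for each query, giving an adaptive collaboration structure.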

Abstract: Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.
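The adaptive inference strategy can be pictured as early stopping on a weighted vote. The sketch below is an assumption about how “router-weighted answer agreement” might work, not the paper's algorithm: agents are consulted in order of router weight, and consultation stops once one answer has accumulated at least a fraction `tau` of the total router weight. The names `adaptive_answer`, `ask`, and `tau` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def adaptive_answer(router_weights: Dict[str, float],
                    ask: Callable[[str], str],
                    tau: float = 0.5) -> Tuple[str, List[str]]:
    """Consult agents by descending router weight until one answer holds >= tau of the weight."""
    if not router_weights:
        raise ValueError("at least one agent is required")
    total = sum(router_weights.values())
    votes: Dict[str, float] = defaultdict(float)
    used: List[str] = []
    best = ""
    for agent in sorted(router_weights, key=lambda a: router_weights[a], reverse=True):
        votes[ask(agent)] += router_weights[agent] / total
        used.append(agent)
        best = max(votes, key=lambda ans: votes[ans])
        if votes[best] >= tau:
            break  # enough weighted agreement: no need to call more agents
    return best, used


# Invented router weights and canned agent answers.
weights = {"math": 0.4, "search": 0.35, "code": 0.25}
disagree = {"math": "42", "search": "41", "code": "42"}
agree = {"math": "42", "search": "42", "code": "42"}

answer_hard, used_hard = adaptive_answer(weights, lambda a: disagree[a])
answer_easy, used_easy = adaptive_answer(weights, lambda a: agree[a])
```

Under this rule a query on which the top agents agree uses fewer calls than one on which they disagree, which is the sense in which the collaboration size adapts per query.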


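[10] Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models cs.CL | cs.AI | cs.IR

Ziyi Chen, Mengxian Lyu, Cheng Peng, Yonghui Wu

TL;DR: This study systematically explores encoder- and decoder-based generative large language models for screening clinical narratives to facilitate clinical trial recruitment. It compares general-purpose and medical-adapted LLMs and uses three strategies (original long context, NER-based extractive summarization, and RAG) to alleviate the “Lost in the Middle” issue with long documents, evaluating on the 2018 N2C2 Track 1 benchmark dataset. MedGemma with the RAG strategy performs best (micro-F1 of 89.05%), and generative LLMs improve markedly on trial criteria that require long-term reasoning across documents.

Motivation: Patient screening for clinical trials is a labor-intensive bottleneck that leads to under-enrollment and trial failures. Breakthroughs in LLMs offer an opportunity to use artificial intelligence to improve screening.

Result: On the 2018 N2C2 Track 1 benchmark dataset, MedGemma with the RAG strategy achieves a micro-F1 of 89.05%, outperforming other models. Generative LLMs improve markedly on trial criteria that require long-term reasoning, but only incrementally on criteria that span a short context (e.g., lab tests).

Insight: The contributions include a systematic comparison of encoder- and decoder-based LLMs and three strategies for mitigating the “Lost in the Middle” issue with long documents (notably RAG's dynamic evidence retrieval). The study also stresses that real-world adoption must choose among rule-based queries, encoder-based LLMs, and generative LLMs according to the specific criteria, to maximize efficiency at reasonable computing cost.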

Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the “Lost in the Middle” issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.
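As a reminder of what a micro-averaged F1 score measures, the sketch below pools true positives, false positives, and false negatives across all eligibility criteria before computing F1. It treats “met” (label 1) as the positive class, which is an assumption and may differ from the official N2C2 scorer; the criterion names and labels are invented.

```python
from typing import Dict, List


def micro_f1(gold: Dict[str, List[int]], pred: Dict[str, List[int]]) -> float:
    """Micro-F1: pool counts over every criterion, then compute a single F1."""
    tp = fp = fn = 0
    for criterion, gold_labels in gold.items():
        for g, p in zip(gold_labels, pred[criterion]):
            tp += int(g == 1 and p == 1)
            fp += int(g == 0 and p == 1)
            fn += int(g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented labels for four patients on two hypothetical criteria (1 = met).
gold = {"criterion_a": [1, 0, 1, 1], "criterion_b": [0, 1, 0, 0]}
pred = {"criterion_a": [1, 0, 0, 1], "criterion_b": [1, 1, 0, 0]}
score = micro_f1(gold, pred)
```

Because counts are pooled, criteria with many positive cases weigh more heavily in the score than rare ones, which is worth keeping in mind when comparing long-context and short-context criteria.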


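[11] Beneath the Surface: Investigating LLMs’ Capabilities for Communicating with Subtext cs.CL

Kabir Ahuja, Yuxuan Li, Andrew Kyle Lampinen

TL;DR: This paper systematically studies whether large language models can understand and use subtext in communication, testing them with four new evaluation suites (including allegory writing and interpretation and multi-agent, multi-modal games). Frontier models generally show a bias towards overly literal communication and struggle with nuanced constraints. In some cases, however, models can use common ground to reduce literal clues, and in allegory understanding, paratextual and persona conditions significantly shift the interpretation of subtext.

Motivation: Human communication is creative and often relies on subtext that goes beyond literal meaning. The paper asks whether language models can use subtext in communicative settings, as a way to assess their creative reasoning in social contexts.

Result: In the Visual Allusions environment, even the best models generate literal clues 60% of the time. By using common ground, some models reduce overly literal clues by 30%-50%. Models struggle to infer the presence of common ground when it is not explicitly stated, and allegory understanding is significantly affected by paratextual and persona conditions.

Insight: The contributions include four evaluation suites for communicating with subtext, which quantify this subjective and complex phenomenon, and evidence that current LLMs are overly literal in social contexts and have limited ability to use common ground, pointing the way for future research on socially grounded creative communication.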

Abstract: Human communication is fundamentally creative, and often makes use of subtext – implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing and interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints – even the best performing models generate literal clues 60% of the time in one of our environments, Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle at inferring the presence of common ground when it is not explicitly stated. For allegory understanding, we find that paratextual and persona conditions significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research inspires future work towards socially grounded creative communication and reasoning.


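[12] Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification cs.CL

Jinhong Jeong, Junghun Park, Youngjae Yu

TL;DR: This paper proposes Re-RIGHT, a unified reinforcement-learning framework for multilingual text simplification that adapts text to a target proficiency level (e.g., CEFR or JLPT) without parallel-corpus supervision. It integrates three reward modules (vocabulary coverage, semantic preservation, and coherence) to train a compact 4B-parameter policy model that outperforms LLM baselines in English, Japanese, Korean, and Chinese.

Motivation: Existing LLM-based readability control methods rely on pre-labeled sentence corpora and mainly target English. Building personalized parallel corpora is costly, and prompting performs poorly at lower proficiency levels and for non-English languages. An adaptive multilingual simplification framework that needs no parallel-corpus supervision is therefore required.

Result: Across English, Japanese, Korean, and Chinese, Re-RIGHT achieves higher lexical coverage at the target proficiency levels while preserving the original meaning and fluency, outperforming strong LLM baselines such as GPT-5.2 and Gemini 2.5.

Insight: The contributions include a reinforcement-learning framework that requires no parallel-corpus supervision, optimization with multiple reward modules (vocabulary coverage, semantic preservation, coherence), and unified handling of several languages (English, Japanese, Korean, Chinese) and proficiency standards (CEFR, JLPT, TOPIK, HSK), which addresses the limitations of prompting at lower levels and for non-English languages.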

Abstract: Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.
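A vocabulary-coverage reward of the kind named in the abstract could be as simple as the fraction of tokens that fall inside the word list for the target level. The sketch below is an illustration under assumptions: whitespace tokenization (unsuitable for Japanese or Chinese without a segmenter), an invented word list, and an equal-weight combination of the three reward modules that the abstract does not specify.

```python
from typing import Iterable, Sequence, Set


def vocabulary_coverage(tokens: Sequence[str], allowed: Set[str]) -> float:
    """Fraction of tokens found in the target-level vocabulary."""
    if not tokens:
        return 0.0
    return sum(token.lower() in allowed for token in tokens) / len(tokens)


def combined_reward(coverage: float, semantic: float, coherence: float,
                    weights: Iterable[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical weighted sum of the three reward modules."""
    w_cov, w_sem, w_coh = weights
    return w_cov * coverage + w_sem * semantic + w_coh * coherence


# Invented beginner-level word list and two candidate sentences.
beginner_words = {"the", "cat", "sat", "on", "mat", "big"}
original = "The cat sat on the enormous mat".split()
simplified = "The cat sat on the big mat".split()

coverage_original = vocabulary_coverage(original, beginner_words)
coverage_simplified = vocabulary_coverage(simplified, beginner_words)
```

Coverage alone would reward deleting hard content outright, which is presumably why the framework pairs it with semantic preservation and coherence rewards.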


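[13] ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving cs.CL | cs.CV

Kaiser Hamid, Can Cui, Nade Liang

TL;DR: This paper proposes ICR-Drive, a framework for diagnosing the instruction counterfactual robustness of end-to-end language-driven autonomous driving models. It generates four types of instruction perturbation (paraphrase, ambiguity, noise, and misleading) and evaluates model performance in the CARLA simulator, finding that minor instruction changes cause substantial performance drops and revealing a reliability gap for existing models in safety-critical driving applications.

Motivation: Evaluations of vision-language-action models for autonomous driving usually assume precise, well-formed instructions, but in deployment instructions may vary in phrasing, be ambiguous, omit critical information, or contain misleading content. Instruction-level robustness is therefore under-measured, and a systematic diagnostic method is needed.

Result: Experiments on LMDrive and BEVDriver show that minor instruction changes (such as misleading instructions) cause substantial drops in standard CARLA Leaderboard metrics and trigger distinct failure modes, highlighting the models' insufficient robustness.

Insight: The novelty is a systematic diagnostic framework for instruction counterfactual robustness, which uses controlled families of instruction perturbations to isolate the effect of language changes on performance. Viewed objectively, it provides a quantifiable benchmark for assessing the reliability of embodied foundation models in safety-critical scenarios and underlines the importance of robust language understanding for autonomous driving.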

Abstract: Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
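Per-family degradation relative to the baseline instruction can be written as a one-line formula. The sketch below assumes a relative (percentage) drop in a single score such as the CARLA driving score; the paper may instead report absolute differences. The numbers are made up for illustration and are not results from the paper.

```python
from typing import Dict


def degradation(baseline: float, perturbed: float) -> float:
    """Relative drop in percent of a metric under a perturbed instruction."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (baseline - perturbed) / baseline


def per_family_report(baseline: float, scores: Dict[str, float]) -> Dict[str, float]:
    """Map each perturbation family to its relative degradation."""
    return {family: degradation(baseline, score) for family, score in scores.items()}


# Made-up scores on identical routes and seeds, one per perturbation family.
baseline_score = 50.0
family_scores = {"Paraphrase": 48.0, "Ambiguity": 45.0, "Noise": 40.0, "Misleading": 25.0}
report = per_family_report(baseline_score, family_scores)
```

Replaying the same routes with the same simulator configuration and seeds, as the abstract describes, is what allows such a difference to be attributed to the instruction text rather than to simulator randomness.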


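[14] Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling cs.CL | cs.AI | cs.CV

Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen

TL;DR: This paper proposes VL-MDR, a framework that dynamically decomposes evaluation into fine-grained, interpretable dimensions (such as hallucination and reasoning) to resolve a dilemma in vision-language reward modeling: generative approaches are interpretable but slow, while discriminative ones are efficient but opaque. A visual-aware gating mechanism dynamically selects and weights the relevant dimensions for each input, and the framework is validated on a constructed dataset of 321k preference pairs covering 21 dimensions. Experiments show that it outperforms existing open-source reward models on benchmarks such as VL-RewardBench, and that the preference pairs it constructs can be used for DPO alignment to reduce visual hallucinations and improve reliability.

Motivation: To resolve the tension between interpretability and efficiency in vision-language reward modeling: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque “black boxes”.

Result: On benchmarks such as VL-RewardBench, VL-MDR consistently outperforms existing open-source reward models, reaching state-of-the-art results. DPO alignment with VL-MDR-constructed preference pairs effectively mitigates visual hallucinations and improves model reliability.

Insight: The core innovation is dynamically decomposing a single scalar reward into fine-grained, interpretable dimensions and using a visual-aware gating mechanism for dynamic dimension selection and aggregation, which provides interpretability while remaining efficient. The large-scale, multi-dimensionally annotated dataset also provides a basis for fine-grained evaluation.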

Abstract: Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque “black boxes.” To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
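One common way to “identify relevant dimensions and adaptively weight them” is top-k selection followed by a softmax over the selected gate logits. The sketch below uses that technique as a stand-in; it is an assumption, not the VL-MDR architecture, and the gate logits, per-dimension scores, and the dimension name “Fluency” are invented.

```python
import math
from typing import Dict, Tuple


def gated_reward(dim_scores: Dict[str, float],
                 gate_logits: Dict[str, float],
                 top_k: int = 2) -> Tuple[float, Dict[str, float]]:
    """Keep the top-k dimensions by gate logit, softmax their logits, and aggregate scores."""
    selected = sorted(gate_logits, key=lambda d: gate_logits[d], reverse=True)[:top_k]
    peak = max(gate_logits[d] for d in selected)  # subtract the max for numerical stability
    exps = {d: math.exp(gate_logits[d] - peak) for d in selected}
    norm = sum(exps.values())
    weights = {d: exps[d] / norm for d in selected}
    reward = sum(weights[d] * dim_scores[d] for d in selected)
    return reward, weights


# Invented gate outputs for one image-text input: the gate considers
# hallucination and reasoning relevant and ignores fluency.
logits = {"Hallucination": 2.0, "Reasoning": 2.0, "Fluency": -1.0}
scores = {"Hallucination": 0.2, "Reasoning": 0.8, "Fluency": 0.9}
reward, weights = gated_reward(scores, logits)
```

The returned `weights` are what makes the scalar interpretable: they show which dimensions contributed to the reward for this particular input and by how much.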


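[15] Don’t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction cs.CL

Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu

TL;DR: This paper proposes VeriGUI, a robust GUI automation agent based on vision-language models. It introduces the TVAE (Thinking–Verification–Action–Expectation) framework to explicitly model action outcomes and recovery, addressing action failures caused by delays and interruptions in real-world environments.

Motivation: Autonomous GUI agents based on vision-language models usually assume deterministic environment responses. In real-world settings with network latency, rendering delays, and system interruptions, this leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Learning robust recovery strategies is also challenging because online interaction is costly and offline datasets lack real-time feedback.

Result: On a robustness benchmark built on AndroidControl, VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

Insight: The novelty is an explicit action-effect verification and self-correction framework (TVAE), together with a two-stage training pipeline that combines Robust SFT on synthetic failure trajectories with GRPO using asymmetric verification rewards, strengthening the agent's failure detection and recovery in noisy environments.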

Abstract: Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline datasets. We propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking–Verification–Action–Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.


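[16] Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs cs.CL

Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu

TL;DR: This paper addresses the shortcomings of Omni-LLMs in complex multimodal reasoning by posing the cross-modal coreference alignment problem, in which a model must localize a referent in a source modality and re-identify it in a target modality. The authors build the CrossOmni dataset to evaluate and improve this capability and propose two strategies to enhance model performance.

Motivation: Existing Omni-LLMs perform poorly in complex scenarios that require synergistic omni-modal reasoning. The root cause is a lack of fine-grained cross-modal alignment; in particular, identifying shared referents across modalities has been overlooked.

Result: Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference. Both the training-free In-Context Learning method and the training-based SFT+GRPO framework yield substantial performance gains and generalize effectively to collaborative reasoning tasks.

Insight: The novelty is formalizing the cross-modal coreference problem and building a dataset with human-designed reasoning rationales for evaluation. Viewed objectively, the study identifies cross-modal coreference alignment as a key missing piece for robust omni-modal reasoning and provides effective strategies for inducing coreference-aware thinking patterns.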

Abstract: Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.


[17] Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting cs.CLPDF

Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao

TL;DR: 本文提出了一种名为CoT2Edit的新范式,通过基于指令的思维链提示来教导大型语言模型编辑知识,以解决现有知识编辑方法泛化能力差和范围狭窄的问题。该方法利用语言模型代理生成高质量指令数据,结合监督微调和组相对策略优化进行训练,并在推理时集成检索增强生成以动态检索相关编辑事实,实现了在多种知识编辑场景下的强泛化能力。

Details

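Motivation: Existing knowledge editing methods have two main limitations. First, poor generalization: most methods rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. Second, narrow scope: current methods focus mainly on structured fact triples and overlook the diverse unstructured factual information (e.g., news, articles) prevalent in the real world.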

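Result: Experiments show that the method achieves strong generalization across six diverse knowledge editing scenarios with only a single round of training on three open-source language models.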

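Insight: The paper's novelty is a new knowledge editing paradigm that combines chain-of-thought reasoning, instruction tuning, and retrieval-augmented generation, aiming to improve the model's ability to reason over edited knowledge and to generalize to practical problems. Viewed objectively, its framework design, which extends knowledge editing from static injection to dynamic reasoning and retrieval, is a useful reference.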

Abstract: Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at https://github.com/FredJDean/CoT2Edit.


[18] Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects cs.CL

Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng

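TL;DR: This paper systematically analyzes the core bottleneck behind inefficient inference in Large Vision-Language Models (LVLMs), namely visual token dominance, and presents a structured taxonomy of existing efficiency techniques organized around three stages of the inference lifecycle: encoding, prefilling, and decoding. It shows how upstream decisions shape downstream bottlenecks and proposes four future research directions.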

Details

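Motivation: Large Vision-Language Models are strong at image and video reasoning, but their inference is hindered by a systemic efficiency barrier known as visual token dominance, which arises from a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints.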

Result: The paper is a survey and reports no specific quantitative results, but it outlines future frontiers supported by pilot empirical insights.

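Insight: The novelty lies in a structured analysis framework for the end-to-end inference pipeline that decouples efficiency optimization into three axes: shaping information density, managing long-context attention, and overcoming memory limits. It also proposes four forward-looking frontiers: hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design.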

Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.


[19] Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents cs.CL

Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing

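TL;DR: This paper proposes JailAgent, a red-teaming framework for LLM agents that implicitly manipulates the agent's reasoning trajectory and memory retrieval instead of directly modifying the user prompt. The method has three stages (Trigger Extraction, Reasoning Hijacking, and Constraint Tightening) and reports outstanding performance in cross-model and cross-scenario settings.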

Details

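Motivation: Existing red-teaming methods rely mainly on modifying user prompts, adapt poorly to new data, and may affect agent performance. The paper aims to develop a framework for security testing of agents that requires no prompt modification.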

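Result: JailAgent performs well in cross-model and cross-scenario settings, achieved through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function.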

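Insight: The novelty lies in avoiding user-prompt modification entirely: the framework implicitly manipulates the reasoning trajectory and memory retrieval for security testing, and its three-stage design (Trigger Extraction, Reasoning Hijacking, and Constraint Tightening) supports adaptation across environments.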

Abstract: With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent’s performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent’s reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.


[20] AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery cs.CL | cs.CE

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu

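TL;DR: AutoSOTA is an end-to-end automated research system that uses a multi-agent architecture to automatically reproduce SOTA models from papers at top AI conferences and then optimize them to discover better-performing new SOTA models. The system divides the workflow into three stages (resource preparation and goal setting, experiment evaluation, and reflection and ideation), takes about five hours per paper on average, and discovered 105 new SOTA models that surpass the original methods.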

Details

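Motivation: Reaching SOTA performance in AI research requires long cycles of reproduction, debugging, and iterative refinement. The work aims to accelerate the full pipeline of empirical model optimization and reduce researchers' repetitive experimental burden.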

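Result: Evaluated on papers collected from eight top-tier AI conferences and filtered for code availability and execution cost, AutoSOTA shows strong end-to-end performance in both automated replication and subsequent optimization. It discovered 105 new SOTA models that surpass the originally reported methods, averaging about five hours per paper.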

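Insight: A multi-agent architecture collaboratively handles grounding papers to code, initializing and repairing environments, tracking long-horizon experiments, generating and scheduling optimization ideas, and supervising validity. The system goes beyond routine hyperparameter tuning to identify architectural innovations, algorithmic redesigns, and workflow-level improvements, showing the potential of end-to-end research automation as a new form of research infrastructure that redirects human attention toward higher-level scientific creativity.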

Abstract: Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.


[21] EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents cs.CL

Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li

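TL;DR: This paper introduces EpiBench, a benchmark for evaluating multimodal agents on multi-turn research workflows. The benchmark simulates the multi-step workflow of scientific research, including proactive literature search, integration of evidence from figures and tables, and sustained use of evidence across papers.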

Details

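Motivation: Existing benchmarks do not systematically assess agents' abilities in proactive search, multi-evidence integration, and sustained evidence use over time. EpiBench aims to fill this gap and provide an evaluation platform for verifiable and reproducible research agents.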

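Result: Experiments show that even the leading model reaches only 29.23% accuracy on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows.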

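Insight: The novelty is a process-level evaluation framework that enables fine-grained testing and diagnosis of research agents, with an emphasis on objective questions requiring cross-paper comparison and multi-figure integration, advancing the evaluation of multimodal agents on complex research tasks.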

Abstract: Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows. EpiBench thus provides an evaluation platform for verifiable and reproducible research agents.


[22] Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs cs.CL

Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue

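TL;DR: This paper proposes a graph-based chain-of-thought pruning framework to reduce the redundant reflection that LLMs generate during reasoning. The method converts a linear chain of thought into a directed acyclic graph, removes inefficient reflection patterns through branch-level and depth-level pruning, and optimizes the model with a three-stage training pipeline of SFT, DPO, and GRPO, substantially reducing the number of reasoning tokens while preserving accuracy.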

Details

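Motivation: Extending chain of thought through reinforcement learning strengthens LLM reasoning, but sparse reward signals can induce undesirable patterns such as overthinking, i.e., generating large amounts of redundant intermediate reasoning. The paper argues that a major source of this redundancy is inefficient reflection, which shows up in two problematic patterns: indiscriminate reflection and repetitive reflection.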

Result: Experiments show that the method reduces the average number of reasoning tokens by 42% while maintaining or improving accuracy.

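Insight: The novelty lies in structuring a linear chain of thought as a directed acyclic graph to model dependencies explicitly, and in a dual pruning strategy (branch-level and depth-level) that targets different redundancy patterns. Viewed objectively, bringing graph structure into chain-of-thought optimization and combining it with multi-stage training (SFT, DPO, GRPO) to jointly optimize answer correctness and reasoning efficiency is a novel and systematic approach to redundant reflection.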

Abstract: Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.
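The dual pruning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` fields, the `contribution` score, the `min_contribution` threshold, and the rule that step ids increase in generation order are all assumptions made for illustration. Reflection steps after the answer step are dropped (depth-level), weakly contributing reflection steps are dropped (branch-level), and dependency edges are rewired to the nearest surviving ancestors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One reasoning step in the DAG (all fields are illustrative)."""
    sid: int                                   # assumed to increase in generation order
    text: str
    parents: list[int] = field(default_factory=list)  # explicit dependency edges
    is_reflection: bool = False
    contribution: float = 1.0                  # hypothetical usefulness score


def prune(steps: list[Step], answer_id: int, min_contribution: float = 0.3) -> list[Step]:
    """Apply depth-level then branch-level pruning and rewire edges."""
    by_id = {s.sid: s for s in steps}

    def keep(s: Step) -> bool:
        if not s.is_reflection:
            return True
        if s.sid > answer_id:                  # depth-level: late re-verification
            return False
        return s.contribution >= min_contribution  # branch-level: weak reflection

    kept_ids = {s.sid for s in steps if keep(s)}

    def surviving(pid: int) -> list[int]:
        """Map a (possibly removed) parent to its nearest kept ancestors."""
        if pid in kept_ids:
            return [pid]
        found: list[int] = []
        for grandparent in by_id[pid].parents:
            found.extend(surviving(grandparent))
        return found

    pruned = []
    for s in steps:
        if s.sid not in kept_ids:
            continue
        parents: list[int] = []
        for pid in s.parents:
            for q in surviving(pid):
                if q not in parents:           # de-duplicate, keep order
                    parents.append(q)
        pruned.append(Step(s.sid, s.text, parents, s.is_reflection, s.contribution))
    return pruned
```

In the paper the pruned traces are then used as training data for the SFT, DPO, and GRPO stages; how contributions are scored is not stated in the abstract.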


[23] See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs cs.CL

Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou

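TL;DR: This paper proposes LVSpec, a training-free loosely speculative decoding framework for Video Large Language Models (Video-LLMs) that targets the high inference latency of autoregressive generation. Building on the insight that visual-relevant anchors require strict verification while visual-irrelevant fillers can be verified loosely, the method uses lightweight visual-relevant token identification and a position-shift tolerant mechanism to substantially speed up inference while preserving model performance.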

Details

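Motivation: Video-LLMs excel at video understanding, but autoregressive generation incurs high inference latency. Existing speculative decoding methods are constrained by rigid exact-match rules that limit the achievable speedup, so a more flexible verification mechanism is needed.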

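Result: Experiments show that LVSpec preserves more than 99.8% of target-model performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Compared with state-of-the-art training-free speculative decoding methods for Video-LLMs, it improves mean accepted length by 136% and speedup ratio by 35%.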

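Insight: The paper's novelty is the first training-free loosely speculative decoding framework designed for Video-LLMs. Its core insight is to separate generated content into sparse visual-relevant anchors that require strict verification and abundant visual-irrelevant fillers that permit loose verification, together with a position-shift tolerant mechanism that salvages semantically equivalent but positionally mismatched tokens, maximizing speedup while maintaining high fidelity.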

Abstract: Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.
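A minimal sketch of the strict-versus-loose verification idea follows. The abstract does not specify the loose acceptance rule, so membership in the target model's top-k candidates is used here purely as an assumption; `verify_draft` and its arguments are hypothetical names, and the position-shift tolerant mechanism is omitted.

```python
def verify_draft(draft, target_top1, target_topk, is_visual):
    """Return how many leading draft tokens are accepted.

    draft:        draft-model tokens, one per position
    target_top1:  target model's greedy token at each position
    target_topk:  set of the target model's top-k tokens at each position
    is_visual:    True where the draft token is flagged as visual-relevant

    Visual-relevant anchors must match the target exactly (strict);
    other tokens only need to be among the target's top-k (loose, assumed rule).
    """
    accepted = 0
    for tok, top1, topk, visual in zip(draft, target_top1, target_topk, is_visual):
        ok = (tok == top1) if visual else (tok in topk)
        if not ok:
            break  # as in standard speculative decoding, stop at the first rejection
        accepted += 1
    return accepted
```

Flagging every position as visual-relevant recovers the usual exact-match rule, which makes the effect of loosening easy to compare on the same draft.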


[24] LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals cs.CL | cs.AI | cs.LG

Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang

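TL;DR: This paper models chain-of-thought generation in large language models as a structured trajectory through representation space, showing that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models; reasoning training mainly accelerates convergence toward termination-related subspaces rather than introducing new representational organization. In addition, the representation trajectories of correct and incorrect solutions diverge systematically at late reasoning stages, which enables mid-reasoning prediction of final-answer correctness (ROC-AUC up to 0.87). The paper also proposes a trajectory-based steering framework that intervenes at inference time to correct reasoning paths and control length.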

Details

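Motivation: To understand and explain the internal mechanisms of chain-of-thought reasoning in large language models from the perspective of representation geometry, and to explore how this geometric structure can be used to predict and control reasoning behavior.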

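Result: On mathematical reasoning tasks, the study reveals the geometric structure of representation trajectories and achieves mid-reasoning prediction of final-answer correctness (ROC-AUC up to 0.87). The proposed trajectory-based steering framework also shows potential for inference-time intervention.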

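Insight: The novelty lies in treating chain-of-thought reasoning as a geometric trajectory in representation space and using that geometry for interpretation, prediction, and control. Specifically, the paper shows that reasoning steps are functionally ordered and separable in representation space, finds that the trajectories of correct and incorrect solutions diverge at late reasoning stages, which makes mid-reasoning prediction possible, and proposes a general framework for inference-time intervention based on ideal trajectories.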

Abstract: This work characterizes large language models’ chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.
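For readers unfamiliar with the metric, ROC-AUC for a correctness predictor can be computed as the probability that a randomly chosen correct trace receives a higher score than a randomly chosen incorrect one. The sketch below shows only that metric; the probe that would produce `scores` from mid-reasoning representations is not described in the abstract and is assumed to exist.

```python
def roc_auc(scores, labels):
    """Pairwise ROC-AUC: P(score of a correct trace > score of an incorrect one).

    scores: predicted correctness scores, one per reasoning trace
    labels: 1 if the final answer was correct, 0 otherwise
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trace")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation, so the reported 0.87 means a correct trace outscores an incorrect one in roughly 87% of pairs.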


[25] Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL | cs.AI

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li

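TL;DR: This paper proposes Attention Editing, a general framework for converting the attention mechanism of already-trained large language models (LLMs) to new architectures, such as multi-head latent attention (MLA) and gated hybrid sliding-window attention (GateSWA), without re-pretraining from scratch. The method trains a replaceable target attention module with progressive distillation (layer-wise teacher-forced optimization followed by model-level distillation on next-token distributions), preserving model performance while substantially improving inference efficiency in long-context and long-generation settings by reducing KV cache memory and bandwidth overhead.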

Details

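Motivation: KV cache memory and bandwidth costs increasingly dominate LLM inference in long-context and long-generation settings. Efficient attention architectures such as MLA and SWA are hard to integrate directly into already-trained models, and prior methods impose overly fine-grained structural requirements on the source and target attention modules, which makes them infeasible for practical deployment.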

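Result: The framework is instantiated on Qwen3-8B and Qwen3-30B-A3B, converting them to the MLA and GateSWA architectures. The converted models maintain competitive performance while delivering substantial efficiency gains. The experiments run on a cluster of domestic Ascend 910B hardware, providing a practical training case study.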

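Insight: The novelty is a general attention conversion framework that needs no pre-training from scratch and robustly replaces attention modules through progressive distillation, combining layer-wise activation supervision with model-level distribution distillation. This offers a feasible and robust way to adapt trained LLMs to more efficient attention architectures and lowers deployment cost.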

Abstract: Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which cannot meet the feasibility requirements of practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) to new attention architectures without re-pretraining from scratch. Attention Editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on an Ascend 910B cluster, offering a practical training case study on domestic hardware.
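Model-level distillation on next-token distributions is commonly implemented as a divergence between the teacher's and the student's predicted distributions. The sketch below uses KL(teacher || student) for a single position; the direction of the divergence, the absence of a temperature, and the function names are assumptions, since the abstract does not give the exact loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution.

    teacher_logits: logits from the original model (frozen)
    student_logits: logits from the model with the replaced attention module
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

In practice the loss would be averaged over positions and batches and minimized with respect to the student's parameters; the loss is zero exactly when the two distributions agree.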


[26] MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models cs.CL

Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi

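TL;DR: This paper presents MedLayBench-V, the first large-scale multimodal benchmark for evaluating expert-lay semantic alignment in medical vision-language models. The dataset is built with a Structured Concept-Grounded Refinement pipeline that enforces strict semantic equivalence, with the goal of bridging the communication gap between clinical experts and patients.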

Details

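Motivation: Current medical vision-language models are trained mainly on professional literature and lack the ability to explain imaging findings in plain, patient-centered language, while existing resources lack a large-scale multimodal benchmark dedicated to lay-accessible medical image understanding.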

Result: The paper introduces the MedLayBench-V benchmark, but the abstract reports no specific quantitative results or comparisons with other models.

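Insight: The novelty is the Structured Concept-Grounded Refinement pipeline, which enforces semantic equivalence by integrating Unified Medical Language System Concept Unique Identifiers with micro-level entity constraints, avoiding the hallucination risk of naive simplification and providing a verified foundation for training and evaluating next-generation medical vision-language models.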

Abstract: Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.


[27] Identifying Influential N-grams in Confidence Calibration via Regression Analysis cs.CL

Shintaro Ozaki, Wataru Hashimoto, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

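TL;DR: Using regression analysis, this paper identifies specific n-gram expressions associated with confidence in the reasoning of large language models (LLMs). It finds that LLMs tend to be overconfident when reasoning is involved and pinpoints the linguistic information behind this behavior. Interestingly, some of the extracted expressions coincide with cue phrases that are intentionally inserted to improve reasoning performance. Through causality tests and verification, the paper shows that simply suppressing these overconfident expressions enables confidence calibration without degrading performance.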

Details

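Motivation: Although explicit reasoning improves LLM performance, responses are often overconfident even when they contain expressions of uncertainty. The study aims to identify which linguistic expressions are related to confidence in order to address LLM overconfidence.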

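Result: Across multiple models and QA benchmarks, the study shows that LLMs remain overconfident when reasoning is involved and attributes this behavior to specific linguistic information. Suppressing these overconfident expressions enables confidence calibration without degrading performance.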

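Insight: The novelty lies in applying regression analysis with the confidence of linguistic expressions in the LLM's reasoning as the dependent variable and analyzing the relationship between specific n-grams and confidence, thereby identifying the key expressions that influence confidence. This opens a new route to calibration by simply suppressing specific expressions, without sacrificing model performance.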

Abstract: While large language models (LLMs) improve performance through explicit reasoning, their responses are often overconfident, even though they include linguistic expressions demonstrating uncertainty. In this work, we identify which linguistic expressions are related to confidence by applying regression analysis. Specifically, we treat the confidence associated with linguistic expressions in the reasoning parts of LLM outputs as the dependent variable and analyze the relationship between a specific $n$-gram and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted in test-time scaling to improve reasoning performance. Through a causality test and verification that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions without drops in performance.


[28] PhageBench: Can LLMs Understand Raw Bacteriophage Genomes? cs.CL | q-bio.GN

Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng

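TL;DR: This paper presents PhageBench, the first benchmark for evaluating how well large language models (LLMs) directly understand raw bacteriophage genomes. Mirroring the workflow of bioinformatics experts, it contains 5,600 high-quality samples covering five core tasks across three stages: screening, quality control, and phenotype annotation. An evaluation of eight LLMs finds that general-purpose reasoning models significantly outperform random baselines on phage contig identification and host prediction but show clear limitations on complex reasoning tasks involving long-range dependencies and fine-grained functional localization.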

Details

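Motivation: Bacteriophages are critical for regulating microbial ecosystems and as antibiotic alternatives, yet the ability of general-purpose LLMs to interpret their raw genome sequences directly remains underexplored, so a dedicated benchmark is needed to evaluate and advance this area.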

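Result: The evaluation of eight LLMs on PhageBench shows that general-purpose reasoning models significantly outperform random baselines on phage contig identification and host prediction but perform poorly on complex reasoning tasks (such as those involving long-range dependencies and functional localization), exposing the limitations of current models.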

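Insight: The paper's novelty is PhageBench, the first benchmark focused on understanding raw phage genome sequences, which systematically evaluates the potential and limitations of LLMs in this area. Viewed objectively, its workflow simulation and task design provide a clear evaluation framework and direction for developing next-generation models with stronger reasoning over biological sequences.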

Abstract: Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.


[29] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation cs.CL

Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar

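TL;DR: This paper proposes the CARE framework and the FAITH-M benchmark for evaluating how well AI mental-health conversation systems adhere to therapeutic principles. By integrating intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning, the framework provides fine-grained evaluation of AI-generated therapist-like responses along six core therapeutic principles.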

Details

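Motivation: As large language models are used more widely in mental health, evaluation frameworks need to go beyond surface-level fluency and measure whether AI responses follow psychotherapeutic best practices; existing systems lack structured evaluation mechanisms.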

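Result: On the proposed FAITH-M benchmark, CARE achieves an F1 score of 63.34, well above the 38.56 of its backbone model Qwen3, a relative improvement of 64.26%. Expert assessment and tests on external datasets further demonstrate robustness under domain shift.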

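Insight: The novelty lies in a fine-grained ordinal rating scale based on six core therapeutic principles (such as non-judgmental acceptance and warmth) and a multi-stage evaluation framework that combines context, exemplar retrieval, and reasoning. Viewed objectively, the core contribution is structuring clinical principles and building them into the evaluation pipeline, which highlights the importance of structured reasoning and contextual modeling over backbone capacity alone.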

Abstract: The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapist utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F1 score of 63.34 versus 38.56 for the strong baseline Qwen3, which also serves as its backbone. This is a 64.26% relative improvement, indicating that gains arise from structured reasoning and contextual modeling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.
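The 64.26 figure is consistent with a relative (not absolute) improvement over the backbone's F1 score. A quick check, using only the two scores quoted in the abstract (the function name is illustrative):

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


care_f1, qwen3_f1 = 63.34, 38.56
print(round(care_f1 - qwen3_f1, 2))                        # absolute gain: 24.78 F1 points
print(round(relative_improvement(care_f1, qwen3_f1), 2))   # relative gain: 64.26 percent
```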


[30] AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning cs.CL

Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan

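TL;DR: This paper proposes Agentic Graph Learning (AGL), a new paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Concretely, the authors develop AgentGL, the first reinforcement-learning-driven framework for AGL, which equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage through search-constrained thinking to balance accuracy and efficiency, and uses a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning. Across multiple text-attributed graph benchmarks and LLM backbones, AgentGL substantially outperforms existing GraphLLMs and GraphRAG baselines on node classification and link prediction.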

Details

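Motivation: Existing LLM-based agentic frameworks treat external information as unstructured text and fail to exploit the topological dependencies inherent in real-world data. To bridge this gap, the paper combines the agentic capabilities of LLMs with graph-structured learning so that LLMs can autonomously navigate and reason in complex relational environments.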

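Result: Across multiple text-attributed graph benchmarks and LLM backbones, AgentGL achieves absolute improvements of up to 17.5% in node classification and up to 28.4% in link prediction, substantially outperforming strong GraphLLMs and GraphRAG baselines and setting a new state of the art.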

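Insight: The paper's novelty is the AGL paradigm and its concrete realization in the AgentGL framework. The core contributions are: 1) reframing graph learning as topology-aware navigation and reasoning by an LLM agent; 2) a graph-native toolset that lets the agent explore graphs at multiple scales; 3) a search-constrained thinking mechanism that balances exploration accuracy and efficiency; and 4) a graph-conditioned curriculum reinforcement learning strategy that stabilizes long-horizon policy learning without step-wise supervision. This offers a new approach for LLMs handling structured, relational knowledge.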

Abstract: Large Language Models (LLMs) increasingly rely on agentic capabilities (iterative retrieval, tool use, and decision-making) to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.


[31] LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring cs.CL

Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu

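TL;DR: LoRM is a self-supervised framework that treats multi-modal sensor signals from rotating machinery as a machine language: local signals are discretized into symbolic units, and their future evolution is predicted from multi-sensor context to enable real-time condition monitoring. The method recasts multi-modal sensor data as a token-based sequence-prediction problem and transfers knowledge by fine-tuning a general-purpose pre-trained language model, avoiding training a large model from scratch.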

Details

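Motivation: Conventional signal-processing methods depend on hand-crafted transforms and features. LoRM uses a self-supervised framework to recast rotating-machinery signal understanding as a sequence-prediction problem, enabling more flexible and generalizable condition monitoring.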

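Result: In tool condition monitoring (TCM) experiments, LoRM shows stable real-time tracking and strong cross-tool generalization, building a practical bridge between language modeling and industrial signal analysis.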

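Insight: The novelty lies in treating rotating-machinery signals as a machine language and achieving self-supervised learning through tokenization and sequence prediction. Viewed objectively, its use of a pre-trained language model for efficient knowledge transfer avoids large-scale training of a domain-specific model and is a novel cross-modal application.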

Abstract: We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at https://github.com/Q159753258/LormPHM.
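The quantise-then-track pipeline can be sketched in a few lines. The abstract does not specify the quantiser or the error measure, so uniform binning, absolute token distance, and a moving-average smoother are assumptions chosen for illustration, and both function names are hypothetical.

```python
def quantise(value, lo, hi, n_bins):
    """Map a continuous sensor value to a discrete token id by uniform binning.

    Values outside [lo, hi] are clipped; token ids run from 0 to n_bins - 1.
    """
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # value == hi falls into the last bin


def health_indicator(pred_tokens, true_tokens, window):
    """Moving average of absolute token-prediction error.

    A rising curve suggests that the signal is drifting away from what the
    model learned to predict, which the abstract interprets as degradation.
    """
    errors = [abs(p - t) for p, t in zip(pred_tokens, true_tokens)]
    smoothed = []
    for i in range(len(errors)):
        recent = errors[max(0, i - window + 1): i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```

Here `pred_tokens` would come from the fine-tuned language model and `true_tokens` from quantising the observed future segment of each sensing channel.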


[32] Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models cs.CL

Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu

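TL;DR: This paper studies the performance gap between parallel and sequential sampling strategies in Large Reasoning Models (LRMs) and finds that parallel sampling usually outperforms sequential sampling, even though the latter should in theory have more representation power. After testing three hypotheses experimentally, it concludes that the main cause of the gap is insufficient exploration in sequential sampling, which conditions on previous answers.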

Details

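Motivation: To explain why parallel sampling outperforms sequential sampling in practice in large reasoning models, although sequential sampling is theoretically more expressive, and to identify the underlying cause.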

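Result: Empirical evidence across model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and problem domains (math and coding) suggests that the aggregation operator and context length are not the main causes of the gap, whereas insufficient exploration plays a considerably larger role.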

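Insight: The paper's novelty lies in rigorously comparing parallel and sequential sampling and, through hypothesis testing, identifying insufficient exploration as a key factor behind the performance gap, which offers useful guidance for optimizing sampling strategies in large reasoning models.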

Abstract: Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, aligned with previous works, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we make three hypotheses on the reason behind this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that the aggregation and context length do not seem to be the main culprits behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause for the performance gap.
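The structural difference between the two strategies can be made concrete with a small sketch. `sample` is a hypothetical callable standing in for one model call; majority voting as the aggregator and returning the last answer in the sequential case are assumptions, since the abstract does not fix either choice.

```python
from collections import Counter


def parallel_sampling(sample, question, n):
    """Draw n independent answers and aggregate them by majority vote."""
    answers = [sample(question, []) for _ in range(n)]  # no shared history
    return Counter(answers).most_common(1)[0][0]


def sequential_sampling(sample, question, n):
    """Draw n answers in turn, each conditioned on all earlier answers."""
    history = []
    for _ in range(n):
        # The sampler sees the previous answers, which is the conditioning
        # that hypothesis (iii) links to reduced exploration.
        history.append(sample(question, list(history)))
    return history[-1]
```

With a sampler that tends to repeat what is already in `history`, the sequential variant stays anchored to its first answer, while the parallel variant still sees `n` independent draws.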


[33] Mechanistic Circuit-Based Knowledge Editing in Large Language Models cs.CL

Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen

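TL;DR: This paper proposes MCircKE, a mechanistic circuit-based knowledge editing method for updating the knowledge of large language models in dynamic environments. The method identifies the causal circuit associated with a specific reasoning task and updates parameters only within that circuit, improving the model's ability to use edited knowledge in multi-step reasoning chains.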

Details

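Motivation: Existing knowledge editing methods reliably update isolated facts but suffer from a “Reasoning Gap” in multi-step reasoning: the model can recall the edited fact but fails to use it effectively in a reasoning chain.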

Result: 在MQuAKE-3K基准测试上的广泛实验表明,该方法在多跳推理的知识编辑任务中有效。

Insight: 创新点在于采用’映射-适应’的精确编辑流程,通过定位因果电路并针对性更新参数,以弥合推理鸿沟,这为模型的可控知识更新提供了机制解释性视角。

Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a “Reasoning Gap”, where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (Mechanistic Circuit-based Knowledge Editing), a novel framework that enables a precise “map-and-adapt” editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.


[34] “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns? cs.CL | cs.AIPDF

Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li

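TL;DR: This paper studies whether large vision-language models (VLMs) can understand multimodal puns. The authors propose a multimodal pun generation pipeline and build MultiPun, a dataset covering diverse pun types alongside adversarial non-pun distractors. Evaluation shows that most models struggle to distinguish genuine puns from distractors. The paper also proposes prompt-level and model-level strategies that improve pun comprehension, raising F1 by 16.5% on average.

Details

Motivation: Puns are a rhetorical device that exploits polysemy and phonetic similarity to create humor; in multimodal puns, visual and textual elements work together to convey the literal and figurative meanings simultaneously. Although VLMs are widely used for multimodal understanding and generation, their ability to understand puns has not been studied systematically because rigorous benchmarks are lacking.

Result: On the proposed MultiPun dataset, most models perform poorly at distinguishing genuine puns from adversarial non-pun distractors. The proposed prompt-level and model-level strategies improve F1 by 16.5% on average.

Insight: The novelty lies in the first systematic construction of a multimodal pun understanding benchmark (MultiPun) and in effective enhancement strategies. Viewed objectively, the study offers useful insights for developing future VLMs that grasp the subtleties of human-like humor through cross-modal reasoning, particularly in handling linguistic ambiguity and cross-modal synergy.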

Abstract: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.


[35] FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures cs.CLPDF

Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang

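TL;DR: FinReporting is an agentic workflow for localized reporting of cross-jurisdiction financial disclosures. It builds a unified canonical ontology covering the Income Statement, Balance Sheet, and Cash Flow, decomposes reporting into auditable stages, and uses large language models as constrained verifiers rather than free-form generators, in order to handle differences across jurisdictions in accounting taxonomies, tagging infrastructures, and aggregation conventions.

Details

Motivation: Most existing financial reporting systems assume a single-market setting and do not address structural differences across jurisdictions (accounting taxonomies, tagging infrastructures, and aggregation conventions), which makes cross-jurisdiction reporting a semantic alignment and verification challenge.

Result: Evaluated on annual filings from the US, Japan, and China, the system improves the consistency and reliability of reports under heterogeneous reporting regimes.

Insight: The novelty lies in building a unified canonical ontology to align financial concepts across jurisdictions, decomposing reporting into auditable stages, and using LLMs as verifiers constrained by explicit decision rules and evidence grounding rather than as free-form generators, which improves reliability and interpretability.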

Abstract: Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo . The video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk


[36] BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection cs.CLPDF

Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava

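TL;DR: This paper proposes BiMind, a dual-head reasoning model for detecting incorrect information. By disentangling content-internal reasoning from knowledge-augmented reasoning and introducing an attention geometry adapter, a self-retrieval knowledge mechanism, and uncertainty-aware fusion strategies, the model addresses the difficulty existing methods have in balancing textual content verification with external knowledge modification under collapsed attention geometries.

Details

Motivation: Incorrect information seriously threatens the veracity and integrity of content, and existing detection methods struggle to balance textual content verification with external knowledge modification under collapsed attention geometries.

Result: Experiments on public datasets show that BiMind outperforms advanced detection methods and provides interpretable diagnostics on when and why knowledge matters.

Insight: The innovations are: 1) an attention geometry adapter that reshapes attention logits via token-conditioned offsets to mitigate attention collapse; 2) a self-retrieval knowledge mechanism that builds an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; 3) uncertainty-aware fusion strategies (entropy-gated fusion and a trainable agreement head) stabilized by a symmetric KL agreement regularizer. The paper also defines a new metric, VoX, to quantify the contribution of knowledge.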

Abstract: Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the contribution of knowledge, we define a novel metric, Value-of-eXperience (VoX), which measures instance-wise logit gains from knowledge-augmented reasoning. Experimental results on public datasets demonstrate that BiMind outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.
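
Two of the ingredients named in this abstract, a symmetric Kullback-Leibler term between the two heads and an entropy-based gate, can be illustrated with a small sketch. The symmetric KL below follows the standard definition; the gating rule (weighting each head by the exponential of its negative entropy) is an assumption made for illustration and is not claimed to be BiMind's actual fusion. All function names are hypothetical.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q), with a small eps to avoid division by zero."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0.0)


def symmetric_kl(p: List[float], q: List[float]) -> float:
    """Average of KL(p || q) and KL(q || p); zero when both heads agree."""
    return 0.5 * (kl(p, q) + kl(q, p))


def entropy_gated_fusion(p_content: List[float], p_knowledge: List[float]) -> List[float]:
    """Assumed gate: the more confident (lower-entropy) head gets more weight."""
    w_content = math.exp(-entropy(p_content))
    w_knowledge = math.exp(-entropy(p_knowledge))
    gate = w_knowledge / (w_content + w_knowledge)
    return [(1.0 - gate) * a + gate * b for a, b in zip(p_content, p_knowledge)]
```

Under this assumed gate, a uniform content head paired with a confident knowledge head yields a fused distribution that leans toward the knowledge head, while the symmetric KL term would penalize the disagreement between them during training.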


[37] A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models cs.CL | cs.AI | cs.IRPDF

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan

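TL;DR: This paper proposes a multi-stage validation framework for LLM-based clinical information extraction, addressing the lack of scalable and trustworthy validation methods for LLMs applied to real-world medical records. The framework enables rigorous assessment under weak supervision through prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation with an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis. Applied to extracting substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes, the framework filters out unreliable extractions, the judge LLM agrees substantially with expert assessment, and the extracted diagnoses predict subsequent specialty care engagement better than structured-data baselines.

Details

Motivation: Large language models show promise for extracting clinical information from unstructured health records, but real-world use is limited by the lack of scalable and trustworthy validation methods. Conventional evaluation relies heavily on annotation-intensive reference standards or incomplete structured data, which is hard to apply at population scale.

Result: For SUD diagnosis extraction, rule-based filtering and semantic grounding assessment removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments agreed substantially with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80).

Insight: The novelty is a systematic multi-stage validation framework under weak supervision that integrates complementary evaluation techniques (rule-based filtering, semantic grounding, judge-LLM confirmation, predictive validity), enabling scalable, trustworthy evaluation and uncertainty quantification of LLM clinical extractions without dense manual annotation. Viewed objectively, the framework offers a practical methodological path for reliably deploying LLMs on large-scale real-world clinical tasks.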

Abstract: Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM’s assessments showed substantial agreement with subject matter expert review (Gwet’s AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
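
For readers unfamiliar with the agreement statistic reported in this abstract, the following is a minimal sketch of Gwet's AC1 for two raters and binary labels. It uses the standard definition AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = 2π(1 - π), with π the mean proportion of positive ratings across both raters. The function name and the toy labels are illustrative; the paper's data and any multi-category variant it may use are not reproduced here.

```python
from typing import Sequence


def gwet_ac1_binary(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Gwet's AC1 for two raters assigning 0/1 labels to the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    p_a = sum(1 for x, y in zip(rater_a, rater_b) if x == y) / n
    # Mean prevalence of the positive label across both raters.
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement under Gwet's model for two categories.
    p_e = 2 * pi * (1 - pi)
    return (p_a - p_e) / (1 - p_e)
```

On the toy labels `[1,1,1,1,0,0,1,1,1,1]` and `[1,1,1,1,0,1,1,1,1,0]`, observed agreement is 0.8, π is 0.8, and AC1 is 0.48 / 0.68, about 0.71.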


[38] From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection cs.CLPDF

Hongxu Zhou

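TL;DR: This paper studies whether, in open-ended reasoning tasks for large language models (LLMs), enforcing structured reflection purely through Outlines-based constrained decoding can stop error propagation without additional training. It finds that simply imposing structural constraints does not improve self-correction and instead triggers a new failure mode called "structure snowballing."

Details

Motivation: To address self-correction failures caused by "hallucination snowballing" in open-ended LLM reasoning, and to explore structured reflection through constrained decoding alone, without externally trained critics or symbolic tools, so that agent autonomy is preserved.

Result: For an 8-billion-parameter model (Qwen3-8B), imposing structural constraints did not improve self-correction and instead led to "structure snowballing": while satisfying strict formatting rules, the model falls into formatting traps and fails to detect or resolve deeper semantic errors.

Insight: The novelty is exposing an "alignment tax" inherent to constrained decoding and highlighting the tension between structural granularity and internal model capacity in autonomous workflows. Viewed objectively, the study identifies a new failure mode in which format constraints during structured reflection impose cognitive load and degrade performance.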

Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to “hallucination snowballing,” a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed “structure snowballing.” We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an “alignment tax” inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: https://github.com/hongxuzhou/agentic_llm_structured_self_critique.


[39] Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives cs.CL | cs.AI | cs.MAPDF

Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang

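TL;DR: This paper studies how the reliability of large language model (LLM) agents acting as human delegates in multi-agent environments is undermined by the social dynamics of their network. Drawing on social psychology, the authors define four key phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) and run experiments that systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative style. The representative agent's accuracy declines consistently as social pressure increases (larger adversarial groups, more capable peers, longer arguments), and rhetorical strategies further sway its judgment.

Details

Motivation: To examine how the reliability of an LLM agent acting as the decision-making representative in a multi-agent environment is compromised by the social context of its network, and to expose vulnerabilities in group decision-making by AI agents that resemble human psychological biases.

Result: The representative agent's accuracy declines consistently as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all cause significant performance degradation. Rhetorical strategies emphasizing credibility or logic can further sway the agent's judgment.

Insight: The novelty is bringing social psychology concepts (such as social conformity and perceived expertise) systematically into the analysis of LLM multi-agent systems, showing that their decisions depend not only on individual reasoning but also on the social dynamics of the configuration, which offers a new perspective for understanding and improving multi-agent robustness.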

Abstract: Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena (social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion) and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent’s accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent’s judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.


cs.CV

[40] Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity cs.CV | cs.AI | cs.HC | cs.IR | cs.MMPDF

Abhishek Dharmaratnakar, Srivaths Ranganathan, Debanshu Das, Anushree Sinha

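TL;DR: This survey reviews the paradigm shift in automatic video trailer generation from heuristic-based extraction to deep generative synthesis, focusing on autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models such as Sora and Veo, and proposes a new taxonomy for AI-driven trailer generation in the era of foundation models.

Details

Motivation: To trace the evolution of automatic trailer generation from traditional methods based on low-level feature engineering and rule-based heuristics to creative synthesis with generative AI (large language models, multimodal large language models, and diffusion models), and to discuss the technical architectures, economic implications, and ethical challenges involved.

Result: As a survey, the paper reports no quantitative experiments. Through literature analysis it traces the architectural progression from Graph Convolutional Networks to Trailer Generation Transformers and evaluates the economic implications of automated content generation for user-generated content platforms.

Insight: The main contribution is a new taxonomy for AI-driven trailer generation in the foundation-model era, along with the forward-looking claim that future systems will move from extractive selection toward controllable generative editing and semantic reconstruction of trailers, underscoring the potential of generative AI for building coherent, emotionally resonant narratives.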

Abstract: The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI’s Sora and Google’s Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.


[41] RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models cs.CVPDF

Jianwei Zhang, Chaoning Zhang, Sihan Cao, Wang Liu, Pengcheng Zheng

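TL;DR: This paper proposes RCP (Representation Consistency Pruner), a framework addressing the high inference cost that large vision-language models (LVLMs) incur from their large number of visual tokens. RCP combines cumulative visual token pruning with a delayed repair mechanism, trains only lightweight plug-in modules, and physically discards tokens at inference, substantially reducing computation while minimizing the representation distribution shift caused by token removal.

Details

Motivation: LVLMs incur high inference costs because the language decoder processes a large number of visual tokens. Existing pruning methods irreversibly remove visual tokens, which shifts the hidden-state distribution away from the pre-trained full-token regime and causes significant performance degradation. This paper aims to mitigate that shift and enable efficient pruning.

Result: Extensive experiments on LVLM benchmarks show that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop. On several widely used benchmarks it outperforms prior methods that avoid fine-tuning the original model.

Insight: The novelty is a framework that combines cumulative pruning with delayed repair. It includes a cross-attention pruner that uses the LLM's intrinsic attention as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers, and a delayed repair adapter (DRA) that caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. A repair loss matches the first- and second-order statistics of the pruned representations to a full-token teacher, compensating for information loss and preserving representation consistency.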

Abstract: Large Vision-Language Models (LVLMs) suffer from prohibitive inference costs due to the massive number of visual tokens processed by the language decoder. Existing pruning methods often lead to significant performance degradation because the irreversible removal of visual tokens causes a distribution shift in the hidden states that deviates from the pre-trained full-token regime. To address this, we propose Representation Consistency Pruner, which we refer to as RCP, as a novel framework that integrates cumulative visual token pruning with a delayed repair mechanism. Specifically, we introduce a cross-attention pruner that leverages the intrinsic attention of the LLM as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers. To compensate for the resulting information loss, we design a delayed repair adapter denoted as DRA, which caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. We employ a repair loss to match the first and second-order statistics of the pruned representations with a full-token teacher. RCP is highly efficient because it trains only lightweight plug-in modules while allowing for physical token discarding at inference. Extensive experiments on LVLM benchmarks demonstrate that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.
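
The idea of a loss that matches first- and second-order statistics can be sketched in a few lines. The version below is an assumption-laden illustration, not RCP's repair loss: it treats "first-order" as the per-dimension mean and "second-order" as the per-dimension variance over tokens, and sums the squared differences. The paper may use a different formulation (for example, full covariances or per-layer weighting); the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are hidden dimensions


def column_mean_and_variance(h: Matrix) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance over the token axis."""
    n = len(h)
    dims = len(h[0])
    means = [sum(row[d] for row in h) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in h) / n for d in range(dims)]
    return means, variances


def statistics_matching_loss(pruned: Matrix, teacher: Matrix) -> float:
    """Squared distance between the mean and variance vectors of two sets of
    hidden states. The two sets may contain different numbers of tokens."""
    mu_p, var_p = column_mean_and_variance(pruned)
    mu_t, var_t = column_mean_and_variance(teacher)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_p, mu_t))
    var_term = sum((a - b) ** 2 for a, b in zip(var_p, var_t))
    return mean_term + var_term
```

Because only summary statistics are compared, the pruned student and the full-token teacher do not need to hold the same number of tokens, which is what makes this kind of objective compatible with physically discarding tokens.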


[42] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding cs.CVPDF

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen

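TL;DR: This paper introduces Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding models. It evaluates capabilities through a progressive tri-level hierarchy (visual information aggregation, temporal dynamics modeling, and complex multimodal reasoning) and uses a group-based non-linear evaluation strategy that penalizes guess-based answers and credits only answers supported by valid reasoning. The benchmark is built through a rigorous human annotation pipeline with substantial human-hours and multiple rounds of quality assurance. Experiments reveal a large gap between the current best model and human experts and uncover a hierarchical bottleneck in which errors propagate.

Details

Motivation: Leaderboard scores on existing video understanding benchmarks are inflated and badly out of step with models' real-world capabilities. Closing this widening gap requires a new, more rigorous benchmark that comprehensively evaluates robustness and the faithfulness of reasoning.

Result: Extensive experiments reveal a substantial gap between the current best model (Gemini-3-Pro) and human experts. They also expose a clear hierarchical bottleneck: errors at the lower levels (visual information aggregation and temporal modeling) propagate and limit high-level reasoning. In addition, thinking-based reasoning depends heavily on textual cues such as subtitles.

Insight: The novelty is a progressive tri-level evaluation hierarchy and a group-based non-linear evaluation strategy that emphasizes answer consistency and reasoning coherence rather than per-question accuracy. Viewed objectively, the strict quality control used to build the benchmark (substantial human-hours and multiple QA rounds) and the analysis of capability bottlenecks (error propagation, over-reliance on text) provide a useful diagnostic tool and direction for developing next-generation video MLLMs.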

Abstract: With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.


[43] ID-Sim: An Identity-Focused Similarity Metric cs.CV | cs.AIPDF

Julia Chae, Nicholas Kolkin, Jui-Hsien Wang, Richard Zhang, Sara Beery

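TL;DR: This paper proposes ID-Sim, a feed-forward identity-focused similarity metric designed to reflect human selective sensitivity to identity. The authors curate a high-quality training set of real-world images and generative synthetic data, and establish a new unified evaluation benchmark to check the metric's consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

Details

Motivation: Humans have remarkable selective sensitivity to identity, which vision models struggle to match, and the lack of identity-focused evaluation metrics slows progress on tasks such as personalized image generation.

Result: The metric is evaluated on the proposed unified benchmark, which assesses consistency with human annotations across identity-focused recognition, retrieval, and generative tasks; the abstract reports no specific quantitative results or comparisons with the state of the art.

Insight: The novelty is ID-Sim, a metric dedicated to identity similarity, trained on a combination of real data and controllable generative synthetic data to better model human perception of identity. Viewed objectively, it offers a new way to address the lack of targeted evaluation metrics for identity-related tasks.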

Abstract: Humans have remarkable selective sensitivity to identities – easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.


[44] SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CVPDF

Zhongyu Yang, Zuhao Yang, Shuo Zhan, Tan Yue, Wei Pang

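TL;DR: This paper proposes SVAgent, a storyline-guided cross-modal multi-agent collaboration framework for long video question answering. A storyline agent builds a narrative representation, a refinement suggestion agent analyzes historical failures to guide frame selection, cross-modal decision agents predict answers separately from the visual and textual modalities under storyline guidance, and a meta-agent evaluates and aligns the predictions to make reasoning more robust.

Details

Motivation: Most existing video question answering methods rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do, which limits contextual understanding of complex video dynamics.

Result: Experiments show that SVAgent achieves superior performance and interpretability in video understanding by emulating human-like storyline reasoning.

Insight: The novelty is a storyline-guided mechanism for progressive narrative construction combined with multi-agent collaboration (storyline, refinement suggestion, cross-modal decision, and meta-evaluation agents) that emulates human reasoning, improving the robustness and consistency of long video understanding.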

Abstract: Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.


[45] Watch Before You Answer: Learning from Visually Grounded Post-Training cs.CV | cs.AI | cs.CLPDF

Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia

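TL;DR: This paper finds that current video understanding benchmarks and post-training datasets contain many questions answerable from text cues alone, which undermines the video understanding ability of vision-language models (VLMs). The authors propose VidGround, which post-trains only on questions that genuinely require visual grounding, significantly improving performance.

Details

Motivation: Progress in VLM video understanding lags, in part because commonly used benchmarks and post-training datasets contain many questions that depend only on textual bias and therefore do little to teach visual grounding.

Result: With RL-based post-training algorithms, training only on VidGround data (69.1% of the original data) improves performance by up to 6.2 points over using the full dataset; this simple data curation also outperforms several more complex post-training techniques.

Insight: The key finding is that data quality, namely ensuring that questions truly require visual grounding, is a major bottleneck for VLM video understanding, and that a simple data curation scheme is effective. This underscores the importance of carefully curated post-training data and evaluation benchmarks for building more capable VLMs.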

Abstract: It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
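
The curation idea described in this abstract, keeping only questions that cannot be answered from text alone, can be sketched as a simple filter. Everything below is a hypothetical illustration: `text_only_answer` stands in for a model that sees the question and options but no video, and the rule "drop a question if the text-only model gets it right in any of `trials` attempts" is an assumed criterion, not the paper's documented procedure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAItem:
    question: str
    options: List[str]
    answer: str  # gold option label, e.g. "B"


def keep_visually_grounded(
    items: List[QAItem],
    text_only_answer: Callable[[str, List[str]], str],
    trials: int = 3,
) -> List[QAItem]:
    """Keep items that a text-only model never answers correctly."""
    kept: List[QAItem] = []
    for item in items:
        hits = sum(
            text_only_answer(item.question, item.options) == item.answer
            for _ in range(trials)
        )
        # A correct text-only answer suggests the question leaks its answer
        # through linguistic cues, so the item is dropped.
        if hits == 0:
            kept.append(item)
    return kept
```

A stricter or looser threshold on `hits` trades dataset size against the risk of keeping questions that are answerable by lucky guessing.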


[46] Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI cs.CV | cs.AIPDF

Mingjie Li, Edward Kim, Yue Zhao, Ehsan Adeli, Kilian M. Pohl

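TL;DR: This paper proposes NeuroQuant, a modality-aware and anatomically grounded 3D vector-quantized variational autoencoder (VQ-VAE) for reconstructing multimodal brain MRI. It learns a shared latent representation across modalities, uses a dual-stream encoder to separate modality-invariant anatomical structure from modality-dependent appearance, decodes with feature-wise linear modulation (FiLM), and is trained with a joint 2D/3D strategy.

Details

Motivation: Existing brain VAEs mainly target single-modality data (such as T1-weighted MRI) and overlook the complementary diagnostic value of other modalities (such as T2-weighted MRI), so a robust VAE that handles multimodal brain MRI effectively is needed.

Result: Extensive experiments on two multimodal brain MRI datasets show that NeuroQuant achieves better reconstruction fidelity than existing VAEs, providing a scalable foundation for downstream generative modeling and cross-modal brain image analysis.

Insight: The innovations are: 1) factorized multi-axis attention for learning a shared cross-modal latent representation that captures relationships between distant brain regions; 2) a dual-stream 3D encoder that explicitly separates anatomical structure from appearance; 3) a shared codebook that discretizes the anatomical encoding, fused with appearance features via FiLM; 4) a joint 2D/3D training strategy that accounts for the slice-based acquisition of 3D MRI.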

Abstract: Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (i.e., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities like T2-weighted MRIs. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Called NeuroQuant, it first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structures from modality-dependent appearance. Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during the decoding phase. This entire approach is trained using a joint 2D/3D strategy in order to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, enabling a scalable foundation for downstream generative modeling and cross-modal brain image analysis.


[47] MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing cs.CVPDF

Ziqian Liu, Stephan Alaniz

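TL;DR: This paper proposes MIRAGE, a training-free framework that addresses the over-editing and spatial misalignment that instruction-guided image editing models exhibit when handling multiple similar instances and composite instructions. It also introduces a benchmark for evaluating fine-grained consistency in multi-instance settings.

Details

Motivation: Existing instruction-guided image editing models (such as FLUX.2 and Qwen-Image-Edit) suffer from severe over-editing and spatial misalignment in complex scenes containing multiple similar instances that each require an individual edit, and no dedicated evaluation benchmark exists.

Result: Extensive evaluations on MIRA-Bench and RefEdit-Bench show that MIRAGE significantly outperforms existing methods at precise instance-level modification while preserving background consistency.

Insight: The novelty is the training-free MIRAGE framework, which uses a vision-language model to parse complex instructions into regional subsets and applies a multi-branch parallel denoising strategy that injects latent representations of target regions into the global representation space while preserving background integrity through a reference trajectory. The paper also contributes a benchmark dedicated to multi-instance editing, filling a gap in the field.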

Abstract: Instruction-guided image editing has seen remarkable progress with models like FLUX.2 and Qwen-Image-Edit, yet they still struggle with complex scenarios with multiple similar instances each requiring individual edits. We observe that state-of-the-art models suffer from severe over-editing and spatial misalignment when faced with multiple identical instances and composite instructions. To this end, we introduce a comprehensive benchmark specifically designed to evaluate fine-grained consistency in multi-instance and multi-instruction settings. To address the failures of existing methods observed in our benchmark, we propose Multi-Instance Regional Alignment via Guided Editing (MIRAGE), a training-free framework that enables precise, localized editing. By leveraging a vision-language model to parse complex instructions into regional subsets, MIRAGE employs a multi-branch parallel denoising strategy. This approach injects latent representations of target regions into the global representation space while maintaining background integrity through a reference trajectory. Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that our framework significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency. Our benchmark and code are available at https://github.com/ZiqianLiu666/MIRAGE.


[48] Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CVPDF

Muhammad Adil, Mehmood Ahmed, Muhammad Aqib, Vicente A. Gonzalez, Gaang Lee

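TL;DR: This study proposes a detection-guided small vision-language model (sVLM) framework for identifying construction site safety hazards. A YOLOv11n detector first localizes workers and construction machinery in the scene; the detected entities are then embedded into structured prompts that guide the sVLM toward spatially grounded hazard assessment. Evaluated on six sVLMs in zero-shot settings, the approach significantly improves hazard detection performance and explanation quality for every model with minimal inference overhead.

Details

Motivation: Large vision-language models (VLMs) are too computationally expensive for near real-time construction hazard detection, while small VLMs (sVLMs) suffer from low accuracy and hallucination in complex scenes. The goal is efficient and accurate context-aware hazard identification.

Result: On a dataset of construction site images with hazard annotations and explanatory rationales, the method consistently improved every sVLM. The best model, Gemma-3 4B, reached an F1-score of 50.6% (baseline 34.5%), and BERTScore F1 for explanation quality rose from 0.61 to 0.82. Despite the added object detection, inference overhead is only 2.5 ms per image.

Insight: The novelty is combining lightweight object detection (for entity localization) with the multimodal reasoning of small VLMs: detection results are injected through structured prompts as spatial grounding, steering the model toward more accurate contextual reasoning with fewer hallucinations. This offers an effective paradigm for efficient, accurate scene understanding in resource-constrained settings.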

Abstract: Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
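
The step of embedding detected entities into a structured prompt can be illustrated as follows. This is a minimal sketch under stated assumptions: the `Detection` record, the confidence threshold, the prompt wording, and the final question are all hypothetical and are not the prompt template used in the study. The detector itself is out of scope; the function only formats detections that some detector has already produced.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                       # e.g. "worker" or "excavator"
    box: Tuple[int, int, int, int]   # pixel box as (x1, y1, x2, y2)
    score: float                     # detector confidence in [0, 1]


def build_hazard_prompt(detections: List[Detection], min_score: float = 0.5) -> str:
    """Turn detector output into a text prompt that names each entity and
    its location, so the language model can reason about spatial relations."""
    kept = [d for d in detections if d.score >= min_score]
    lines = ["Detected entities (pixel boxes as x1, y1, x2, y2):"]
    for i, d in enumerate(kept, start=1):
        x1, y1, x2, y2 = d.box
        lines.append(f"{i}. {d.label} at ({x1}, {y1}, {x2}, {y2})")
    if not kept:
        lines.append("none")
    lines.append(
        "Question: Is any worker exposed to a safety hazard? "
        "Answer yes or no, then explain briefly."
    )
    return "\n".join(lines)
```

The resulting string would be passed to the sVLM together with the image, so the model receives explicit entity locations rather than having to infer them.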


[49] Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D cs.CV

Daniel DeTone, Tianwei Shen, Fan Zhang, Lingni Ma, Julian Straub

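TL;DR: This paper proposes Boxer, an algorithm that estimates static 3D bounding boxes from 2D open-vocabulary object detections, posed images, and optional depth given as a sparse point cloud or a dense depth map. At its core is BoxerNet, a transformer-based network that lifts 2D bounding-box proposals into 3D; multi-view fusion and geometric filtering then produce globally consistent, de-duplicated 3D bounding boxes in metric world space.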

Details

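Motivation: 3D object localization for open-world categories remains far from solved. The work leverages mature 2D detection algorithms to reduce the need for costly 3D annotations.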

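Result: BoxerNet outperforms state-of-the-art baselines on open-world 3DBB lifting: in egocentric settings without dense depth it reaches 0.532 mAP versus 0.010 for CuTR, and on CA-1M with dense depth available it reaches 0.412 mAP versus 0.250 for CuTR.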

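Insight: The contributions include using 2D open-vocabulary detectors (e.g., DETIC, OWLv2, SAM3) for 2D localization so that the model can focus on 3D lifting, and extending the CuTR formulation with an aleatoric uncertainty for robust regression, a median depth patch encoding that supports sparse depth inputs, and large-scale training on over 1.2 million unique 3DBBs.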

Abstract: Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).


[50] Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking cs.CV | cs.AI | cs.CL

Chan-Wei Hu, Zhengzhong Tu

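TL;DR: This paper proposes Region-R1, a query-side region cropping framework for re-ranking in multi-modal retrieval-augmented generation (MM-RAG). It formulates region selection as a decision-making problem and uses reinforcement learning to dynamically crop the question-relevant image region, reducing the effect of visual distractors (e.g., background clutter) on global-embedding similarity scores and thereby improving re-ranking.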

Details

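Motivation: Standard multi-modal re-rankers typically encode the full query image as a global embedding, which makes them susceptible to visual distractors that skew similarity scores. The paper addresses this with query-side adaptation to make re-ranking more robust.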

Result: On two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, raising conditional Recall@1 by up to 20% and achieving state-of-the-art (SOTA) performance.

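Insight: The core contribution is formalizing query-side region selection as a decision-making problem and introducing a novel region-aware group relative policy optimization (r-GRPO) to learn the cropping policy. Viewed objectively, it is a simple but effective query-side adaptation that strengthens multi-modal re-ranking by dynamically focusing on question-relevant regions rather than relying on complex candidate-side or cross-modal interaction mechanisms.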

Abstract: Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.


[51] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration cs.CV

Xueming Fu, Lixia Han

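TL;DR: This paper proposes SmokeGS-R, a practical pipeline for real-world multi-view smoke scene restoration whose key idea is to decouple geometry recovery from appearance correction. It trains a sharp, clean-only 3D Gaussian Splatting source model under physics-guided pseudo-clean supervision, then harmonizes its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing.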

Details

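Motivation: Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, which makes robust 3D reconstruction particularly difficult. The paper targets restoration and reconstruction of real-world multi-view smoke scenes.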

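Result: On the official testing leaderboard of the NTIRE 2026 3D Restoration and Reconstruction Challenge Track 2, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, re-evaluating the same frozen result on the seven released challenge scenes gave PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, exceeding the strongest official baseline average on the same scenes by +3.68 dB PSNR.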

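Insight: The claimed contribution is an effective recipe that pairs a geometry-first reconstruction strategy with stable post-render appearance harmonization. Viewed objectively, the points worth borrowing are the decoupling of geometry from appearance, the physics-guided generation of pseudo-clean supervision, and a harmonization pipeline that combines several image-processing techniques.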

Abstract: Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, making robust 3D reconstruction particularly difficult. We present SmokeGS-R, a practical pipeline developed for the NTIRE 2026 3D Restoration and Reconstruction Track 2 challenge. The key idea is to decouple geometry recovery from appearance correction: we generate physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering, train a sharp clean-only 3D Gaussian Splatting source model, and then harmonize its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing. On the official challenge testing leaderboard, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, we re-evaluated the same frozen result on the seven released challenge scenes without retraining and obtained PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, outperforming the strongest official baseline average on the same scenes by +3.68 dB PSNR. These results suggest that a geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. The code is available at https://github.com/windrise/3drr_Track2_SmokeGS-R.
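
Example (illustrative sketch, not the authors' code): Reinhard transfer in its usual form matches the per-channel mean and standard deviation of one image to those of a reference. The function below assumes both inputs have already been converted to LAB by some other routine; the name `reinhard_transfer` and the `eps` guard are choices made here, and how the paper builds its reference from the donor ensemble is not reproduced.

```python
import numpy as np


def reinhard_transfer(src_lab, ref_lab, eps=1e-6):
    """Match per-channel mean and std of src_lab to those of ref_lab.

    Both inputs are float arrays of shape (H, W, 3), assumed to be in LAB already.
    """
    src = src_lab.reshape(-1, 3).astype(np.float64)
    ref = ref_lab.reshape(-1, 3).astype(np.float64)
    src_mean, src_std = src.mean(axis=0), src.std(axis=0)
    ref_mean, ref_std = ref.mean(axis=0), ref.std(axis=0)
    # Center on the source statistics, rescale, then shift to the reference mean.
    out = (src - src_mean) * (ref_std / (src_std + eps)) + ref_mean
    return out.reshape(src_lab.shape)
```

After the transfer, the result would be converted back from LAB to RGB; the light Gaussian smoothing mentioned in the abstract would be a separate step.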


[52] VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success cs.CV | cs.RO

Chuhang Liu, Yayun He, Zuheng Kang, Xiaoyang Qu, Jianzong Wang

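TL;DR: This paper proposes VLA-InfoEntropy, a training-free method for accelerating inference in Vision-Language-Action (VLA) models. It uses image entropy to quantify the grayscale distribution of each visual token and attention entropy to capture how attention is distributed over task-related text; combined with timestep information, these yield a dynamic strategy that shifts the model's focus from global visual features to attention-guided local informative regions, reducing redundant computation.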

Details

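Motivation: Jointly processing high-dimensional visual features, complex language inputs, and continuous action sequences gives VLA models high computational overhead and low inference efficiency, hindering real-time deployment and reliability. The paper addresses inference acceleration for VLA models.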

Result: Extensive experiments show that the method reduces inference parameters, accelerates inference, and outperforms existing approaches.

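Insight: The novelty is a training-free acceleration method that combines image entropy, attention entropy, and timestep information to dynamically adjust where the model attends, reducing computational redundancy while preserving critical content and offering a new route to efficient VLA deployment.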

Abstract: Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action decision-making for cross-modal semantic alignment, exhibiting broad application potential. However, the joint processing of high-dimensional visual features, complex linguistic inputs, and continuous action sequences incurs significant computational overhead and low inference efficiency, thereby hindering real-time deployment and reliability. To address this issue, we use image entropy to quantify the grayscale distribution characteristics of each visual token and introduce attention entropy to capture the distribution of attention scores over task-related text. Visual entropy identifies texture-rich or structurally informative regions, while attention entropy pinpoints semantically relevant tokens. Combined with timestep information, these metrics enable a dynamic transition strategy that shifts the model’s focus from global visual features to attention-guided local informative regions. Thus, the resulting VLA-InfoEntropy method integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content. Extensive experiments show that our method reduces inference parameters, accelerates inference speed, and outperforms existing approaches.
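
Example (illustrative sketch, not the authors' code): one plausible reading of "image entropy of each visual token" is the Shannon entropy of the grayscale histogram inside each image patch. The patch size, bin count, and the name `patch_entropy` are assumptions made here; the paper's exact definition may differ.

```python
import numpy as np


def patch_entropy(gray, patch=16, bins=32):
    """Shannon entropy (bits) of the grayscale histogram of each patch.

    gray: 2-D uint8 array whose height and width are multiples of `patch`.
    Returns an array of shape (H // patch, W // patch).
    """
    h, w = gray.shape
    out = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            block = gray[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            p = hist[hist > 0] / block.size  # drop empty bins to avoid log(0)
            out[i, j] = -(p * np.log2(p)).sum()
    return out
```

Patches with higher entropy have more varied intensities, which fits the abstract's description of texture-rich or structurally informative regions; a pruning strategy could keep the highest-scoring tokens.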


[53] GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy cs.CV

Yang Yi, Xieyuanli Chen, Jinpu Zhang, Hui Shen, Dewen Hu

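TL;DR: This paper proposes GESS, a multi-cue guided local feature learning framework that uses geometric and semantic synergy to improve detection robustness and descriptor discriminability. It comprises a joint semantic-normal prediction head, a depth stability prediction head, a Semantic-Depth Aware Keypoint mechanism, and a Unified Triple-Cue Fusion module.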

Details

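Motivation: Existing local feature learning methods rely mainly on a single appearance cue, leading to unstable keypoints and insufficiently discriminative descriptors. The paper uses semantic and geometric cues in synergy to address these problems.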

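Result: Extensive experiments on four benchmarks validate the framework's effectiveness; the abstract does not report specific performance figures.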

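Insight: The novelty lies in deeply coupling semantic and normal cues through a shared 3D vector field, which resolves the optimization interference caused by heterogeneous inconsistencies, and in quantifying the reliability of local regions from a geometric-consistency perspective to guide keypoint selection. A semantic-scheduled gating mechanism also adaptively fuses multi-attribute features to improve descriptor discriminability.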

Abstract: Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: https://github.com/yiyscut/GESS.git.


[54] Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection cs.CV

Rixiang Ni, Boyang Li, Jun Chen, Yonghao Li, Feiyu Ren

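TL;DR: This paper rethinks infrared small target detection (IRSTD), reformulating it as a centroid regression task and introducing SPIRE, a Single-Point Supervision guided Infrared Probabilistic Response Encoding method. SPIRE uses an encoder-only architecture: Point-Response Prior Supervision converts single-point annotations into probabilistic response maps, and a High-Resolution Probabilistic Encoder enables end-to-end regression, aiming to localize targets rather than segment blurred boundaries.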

Details

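Motivation: The prevailing encoder-decoder segmentation paradigm with pixel-level supervision overlooks that infrared small targets occupy only a few pixels and have blurred boundaries, making target regions hard to separate from background noise. The authors argue that the first principle of IRSTD should be target localization rather than segmenting every target region together with indistinguishable background noise.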

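Result: Extensive experiments on several IRSTD benchmarks, including SIRST-UAVB and SIRST4, show that SPIRE achieves competitive target-level detection performance with a consistently low false alarm rate (Fa) and significantly reduced computational cost.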

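Insight: The novelty lies in reformulating IRSTD as centroid regression and designing Point-Response Prior Supervision (PRPS) and a High-Resolution Probabilistic Encoder (HRPE). The encoder-only architecture with probabilistic response encoding alleviates optimization instability under sparse target distributions while lowering computational complexity.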

Abstract: Infrared small target detection (IRSTD) aims to separate small targets from cluttered backgrounds. Extensive research is dedicated to the pixel-level supervision-guided “encoder-decoder” segmentation paradigm. Although these methods have achieved promising performance, they neglect the fact that small targets occupy only a few pixels and are usually accompanied by blurred boundaries caused by cluttered backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization instead of separating every target region accompanied by indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into a probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4, demonstrate that SPIRE achieves competitive target-level detection performance with a consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: https://github.com/NIRIXIANG/SPIRE-IRSTD.


[55] UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation cs.CV

Jintao Sun, Hu Zhang, Donglin Di, Gangyi Ding, Zhedong Zheng

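TL;DR: This paper introduces UAVReason, the first unified large-scale multi-modal benchmark for nadir-view UAV scenarios. It contains over 273K visual question answering pairs, including single frames with detailed captions, temporal sequences, and cross-modal generation samples, covers 22 reasoning types, and comes with a unified multi-task learning baseline.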

Details

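Motivation: Existing vision-language models degrade markedly on high-altitude, top-down UAV imagery because of domain shift (tiny, densely packed objects, repetitive textures, and ambiguous orientations), and no unified multi-modal evaluation benchmark exists.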

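Result: Experiments validate the unified multi-task baseline on metrics such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. They indicate limitations of general-domain vision-language models and show that the unified approach substantially improves UAV-native performance.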

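Insight: The novelty is the first unified large-scale nadir-view UAV multi-modal benchmark that integrates reasoning and generation tasks. Viewed objectively, building high-fidelity data on a simulation platform and designing a unified multi-task evaluation framework gives UAV multi-modal research a systematic tool.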

Abstract: Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.


[56] LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning cs.CV

Yizhou Fang, Jian Zhong, Li Lin, Xiaoying Tang

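TL;DR: This paper proposes LUMOS, a universal semi-supervised framework for retinal layer segmentation in optical coherence tomography (OCT). Using a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML), it addresses annotation scarcity and heterogeneous label granularities across datasets and achieves strong cross-domain and cross-granularity generalization.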

Details

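Motivation: OCT layer segmentation faces annotation scarcity and heterogeneous label granularities across datasets. Existing semi-supervised methods typically assume a fixed label granularity and fail to fully exploit cross-granularity supervision.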

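Result: Experiments on six OCT datasets show that LUMOS outperforms existing methods by a large margin and exhibits exceptional cross-domain and cross-granularity generalization.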

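Insight: The core contribution combines a dual-branch architecture with a hierarchical prompting strategy to suppress pseudo-label noise propagation, and adds region-level reliability weighting with progressive training to align cross-granularity consistency targets reliably. This offers a new approach to medical image segmentation with multiple granularities and scarce annotations.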

Abstract: Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighting and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.


[57] Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval cs.CV | cs.MM

Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma

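TL;DR: This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that addresses the tendency of existing composed image retrieval (CIR) to over-rely on semantic matching and fail to reliably retrieve a specified instance. The authors build OACIRR, the first large-scale, multi-domain benchmark on real-world images, and propose AdaFocal, whose Context-Aware Attention Modulator adaptively intensifies attention on the specified instance region so that compositional queries are handled while instance-level consistency is preserved.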

Details

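Motivation: Existing CIR methods inherently prioritize semantic matching and struggle to reliably retrieve a user-specified instance across contexts. In practice, concrete instance fidelity often matters more than broad semantics.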

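Result: Extensive experiments on the OACIRR benchmark (over 160K quadruples and four challenging candidate galleries) show that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, establishing a robust baseline for the task.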

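Insight: The core contribution is a new task (OACIR) that mandates strict instance-level consistency, together with the first large-scale benchmark for it (OACIRR). On the method side, AdaFocal's Context-Aware Attention Modulator adaptively balances focus between the anchored instance and the broader compositional context, flexibly ensuring instance preservation. This opens new directions for more flexible, instance-aware retrieval systems.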

Abstract: Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.


[58] Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection cs.CV

Hongsheng Li, Lingfeng Zhang, Zexian Yang, Liang Li, Rong Yin

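TL;DR: This paper proposes a robust LiDAR-radar 3D object detection method based on weather-conditioned branch routing. It reformulates multi-modal perception as a weather-conditioned branch routing problem, maintaining three parallel 3D feature streams (a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch); a lightweight router, guided by a condition token extracted from visual and semantic prompts, dynamically predicts sample-specific weights to softly aggregate these representations. A weather-supervised learning strategy with auxiliary classification and diversity regularization prevents branch collapse and enforces distinct, condition-dependent routing behaviors.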

Details

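Motivation: Existing LiDAR-4D radar fusion methods rely mainly on fixed or weakly adaptive pipelines that cannot adjust modality preferences dynamically as environmental conditions change, so robust 3D object detection in adverse weather remains challenging.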

Result: Extensive experiments on the K-Radar benchmark show that the method achieves state-of-the-art performance.

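Insight: The novelty lies in reformulating multi-modal perception as weather-conditioned branch routing, introducing a lightweight router guided by a condition token to predict weights dynamically, and preventing branch collapse through weather-supervised learning. The method gives explicit, highly interpretable insight into modality preferences, transparently revealing how adaptive routing shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios.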

Abstract: Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dynamically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code will be released.
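
Example (illustrative sketch, not the authors' code): soft aggregation of three branch features with router-predicted weights can be written as a softmax over three logits followed by a weighted sum. The single linear layer (`W`, `b`), the flattened feature shapes, and the name `route` are assumptions made here; the paper's router architecture is not specified in the abstract.

```python
import numpy as np


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def route(cond_token, W, b, lidar, radar, fused):
    """Softly aggregate three branch features with sample-specific weights.

    cond_token: (B, D); W: (D, 3); b: (3,); each branch feature: (B, C).
    Returns the aggregated features (B, C) and the routing weights (B, 3).
    """
    weights = softmax(cond_token @ W + b)                 # (B, 3), rows sum to 1
    branches = np.stack([lidar, radar, fused], axis=1)    # (B, 3, C)
    out = (weights[:, :, None] * branches).sum(axis=1)    # (B, C)
    return out, weights
```

Returning the weights alongside the output is what makes the routing inspectable per sample, in line with the interpretability claim in the abstract.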


[59] CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift cs.CV

Yizhou Fang, Pujin Cheng, Yixiang Liu, Xiaoying Tang, Longxi Zhou

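TL;DR: This paper proposes CRISP, a parameter-free, model-agnostic framework for robust medical image segmentation under domain shift. Building on a newly observed empirical law, "Rank Stability of Positive Regions", it simulates model behavior under domain shift via latent feature perturbation, constructs high-precision and high-recall priors, and uses an iterative training strategy that progressively "squeezes" them to the final segmentation.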

Details

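Motivation: Distribution shift in medical imaging is a central bottleneck for the clinical translation of AI models. Existing domain adaptation methods are limited to simulated shifts or pseudo-supervision and struggle with the open-ended, unpredictable, effectively infinite shifts of the real world.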

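Result: On multi-center cardiac MRI and CT-based lung vessel segmentation, CRISP significantly outperforms state-of-the-art methods, reducing HD95 by up to 0.14 pixels (7.0% improvement), 1.90 pixels (13.1% improvement), and 8.39 pixels (38.9% improvement) under multi-center, demographic, and modality shifts, respectively.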

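Insight: The novelty lies in identifying the "Rank Stability of Positive Regions" law and being the first to segment based on probability ranks rather than probability values. The framework builds high-precision and high-recall priors through feature perturbation and refines them iteratively, requires no target-domain information, and improves robustness to unseen domain shifts.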

Abstract: Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called "Rank Stability of Positive Regions", which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively "squeeze" to the final segmentation. Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP’s superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts, respectively.


[60] VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG cs.CV | cs.AI

Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Liu Jun

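TL;DR: VideoStir is a retrieval-augmented generation framework for long-video understanding. It builds a spatio-temporal graph to preserve the video's inherent structure and introduces an intent-relevance scorer that retrieves frames aligned with the query's reasoning intent, addressing the lack of structure and intent awareness in existing methods.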

Details

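Motivation: Existing methods flatten videos into independent segments, breaking their inherent spatio-temporal structure, and depend on explicit semantic matching, which can miss cues that are only implicitly relevant to the query's intent.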

Result: Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information.

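Insight: The novelty is shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning, realized through multi-hop retrieval over a spatio-temporal graph and an MLLM-backed intent-relevance scorer that learns frame-query intent alignment.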

Abstract: Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query’s intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query’s reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.


[61] Cross-Stage Attention Propagation for Efficient Semantic Segmentation cs.CV

Beoungwoo Kang

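TL;DR: This paper proposes Cross-Stage Attention Propagation (CSAP), a lightweight decoder framework for semantic segmentation. It computes attention at the deepest feature scale and propagates it to shallower stages, avoiding repeated computation and substantially reducing cost while preserving multi-scale contextual reasoning.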

Details

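Motivation: Multi-scale decoders in existing lightweight segmentation methods usually compute attention independently at each feature scale; because the resulting attention distributions are strongly correlated across scales, this introduces substantial redundant computation. The paper addresses this inefficiency.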

Result: CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% mIoU on Cityscapes with 21.5 GFLOPs, and 40.9% mIoU on COCO-Stuff 164K with 5.5 GFLOPs. On ADE20K it surpasses SegNeXt-Tiny by 1.8% mIoU while requiring 16.8% fewer floating-point operations.

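Insight: The core contribution is the cross-stage attention propagation mechanism, which uses attention computed on deep features to guide shallower features and avoids redundant query-key computation. Viewed objectively, it is an efficient attention-sharing strategy that decouples attention computation from feature scale, substantially lowering computational complexity and suggesting a new direction for lightweight decoder design.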

Abstract: Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder’s computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.


[62] Few-Shot Semantic Segmentation Meets SAM3 cs.CV

Yi-Jen Tsai, Yen-Yu Lin, Chien-Yao Wang

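TL;DR: This paper proposes a training-free few-shot semantic segmentation method built on Segment Anything Model 3 (SAM3). A simple spatial concatenation strategy places support and query images on a shared canvas so that a frozen SAM3 can segment without fine-tuning, achieving state-of-the-art performance on the PASCAL-5^i and COCO-20^i benchmarks.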

Details

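Motivation: Traditional few-shot semantic segmentation methods rely on extensive episodic training, which is computationally demanding and sensitive to distribution shifts. The paper explores the modern vision foundation model SAM3 as a training-free alternative.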

Result: On the PASCAL-5^i and COCO-20^i benchmarks, the method achieves state-of-the-art performance, outperforming many heavily engineered methods.

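Insight: The novelty lies in repurposing SAM3's Promptable Concept Segmentation capability and achieving cross-image reasoning through simple spatial concatenation. The study also finds that negative prompts can be counterproductive in few-shot settings, weakening target representations and causing prediction collapse, which exposes a limitation in how current foundation models handle conflicting prompt signals.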

Abstract: Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3
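
Example (illustrative sketch, not the authors' code): the shared-canvas idea can be pictured as placing the support and query images side by side and later cropping the query part of the predicted mask. The side-by-side layout, zero padding, and the helper names `make_canvas` and `crop_query_mask` are assumptions made here; the paper's actual layout and its SAM3 prompting are not reproduced.

```python
import numpy as np


def make_canvas(support, query):
    """Place support and query images side by side, zero-padding to a common height.

    support, query: arrays of shape (H, W, C) with the same C and dtype.
    Returns the canvas and the x-offset at which the query region starts.
    """
    h = max(support.shape[0], query.shape[0])

    def pad(img):
        out = np.zeros((h, img.shape[1], img.shape[2]), dtype=img.dtype)
        out[:img.shape[0]] = img
        return out

    canvas = np.concatenate([pad(support), pad(query)], axis=1)
    return canvas, support.shape[1]


def crop_query_mask(canvas_mask, query_offset, query_shape):
    """Cut the query region out of a mask predicted on the whole canvas."""
    qh, qw = query_shape[:2]
    return canvas_mask[:qh, query_offset:query_offset + qw]
```

A frozen segmentation model would run once on the canvas, prompted on the support side, and `crop_query_mask` would recover the prediction for the query image.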


[63] A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator cs.CV

Kidus Zewde, Yuchen Zhou, Dennis Ng, Neo Tiangratanakul, Tommy Duong

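TL;DR: This paper presents infrastructure for generating synthetic eye movement video: real human iris trajectories are extracted from reference videos and replayed on a 3D eye movement simulator via headless browser automation, producing automatically labeled synthetic video at scale. Applied to script-reading detection in video interviews, it yields a released dataset of 144 sessions totaling 12 hours of synthetic eye movement video.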

Details

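Motivation: Video-based behavioral data such as eye movements is scarce, expensive to annotate, and privacy-sensitive. The paper replaces real data collection with controlled synthetic generation that produces automatically labeled data at scale.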

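Result: Evaluation shows that the generated trajectories preserve the temporal dynamics of the source data (KS D < 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement.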

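Insight: The novelty is a complete synthetic data pipeline, from real trajectory extraction to replay on a 3D simulator, that generates labeled eye movement video at scale to support downstream behavioral classifier development. The analysis also identifies the simulator's sensitivity to the absence of coupled head movement, which informs future simulator design.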

Abstract: Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data – gestures, eye movements, social signals – remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D < 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement – a finding that informs future simulator design. The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.
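
Example (illustrative sketch, not the authors' evaluation code): the KS D value quoted in the abstract is presumably the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between two empirical CDFs. The function name `ks_statistic` is chosen here, and which trajectory metrics the authors compared is not stated in the abstract.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: the maximum gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap), so a small D between source and generated trajectories indicates that the replay preserves the distribution of the measured quantity.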


[64] Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis cs.CV | cs.AI

Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu

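TL;DR: This paper proposes a new paradigm for automatic diagnosis of canine pneumothorax that combines Vision-Language Model (VLM) guided Flow Matching for lesion localization with spectral anomaly detection based on Random Matrix Theory. It targets data scarcity and model trustworthiness, and releases a pixel-level annotated dataset.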

Details

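Motivation: Automatic diagnosis of canine pneumothorax is hampered by data scarcity and needs trustworthy models. The paper reframes the task as a synergistic process of signal localization and spectral detection.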

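Result: By pairing generative segmentation with first-principles statistical analysis, the method yields a highly accurate and interpretable diagnostic system; the source code is publicly available.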

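Insight: The novelty lies in decoupling diagnosis into two cooperating stages, localization and detection. VLM-guided iterative Flow Matching produces high-fidelity lesion segmentation that purifies the signal, and Random Matrix Theory identifies pathological signal by detecting statistically significant outlier eigenvalues, avoiding the black-box nature of traditional classifiers and improving interpretability.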

Abstract: Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).
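
Example (illustrative sketch, not the authors' code): a standard way to flag outlier eigenvalues with Random Matrix Theory is to compare the eigenvalues of a sample covariance matrix against the Marchenko-Pastur upper edge, sigma^2 * (1 + sqrt(p / n))^2, which bounds the bulk spectrum of pure noise asymptotically. The helper `mp_outliers`, the unit noise variance, and the use of this particular threshold are assumptions made here; the paper's exact test statistic is not given in the abstract.

```python
import numpy as np


def mp_outliers(X, sigma2=1.0):
    """Return covariance eigenvalues above the Marchenko-Pastur upper edge.

    X: (n, p) feature matrix with centered columns; noise variance assumed sigma2.
    Returns (outlier_eigenvalues, upper_edge).
    """
    n, p = X.shape
    cov = X.T @ X / n                       # (p, p) sample covariance
    eigvals = np.linalg.eigvalsh(cov)       # ascending, real (symmetric input)
    upper_edge = sigma2 * (1.0 + np.sqrt(p / n)) ** 2
    return eigvals[eigvals > upper_edge], upper_edge
```

Because the edge is an asymptotic bound, the largest noise eigenvalue can sit slightly above it at finite n, so a practical detector would add a margin before declaring a signal.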


[65] CLIP-Guided Data Augmentation for Night-Time Image Dehazing cs.CVPDF

Xining Ge, Weijun Yuan, Gengjia Chang, Xuyang Li, Shuhong Liu

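TL;DR: This paper proposes a unified framework for night-time image dehazing that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. A pre-trained CLIP visual encoder screens external candidate samples to build training data closer to the target domain, NAFNet is then trained in two stages, and several techniques are combined at inference time to stabilize the output.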

Details

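Motivation: Night-time dehazing faces more complex degradation than daytime dehazing, because haze scattering is coupled with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, target-domain samples are scarce, and naively adding external data weakens adaptation through distribution mismatch, which aggravates domain drift and training instability.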

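Result: The framework is presented as a solution to the NTIRE 2026 Night Time Image Dehazing Challenge. It improves performance through domain-aligned data construction, two-stage training (adapting to the target domain first, then expanding to broader degradation patterns), and inference-time enhancement (TLC, x8 self-ensemble, and weighted snapshot fusion), giving a practical and effective pipeline.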

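Insight: The main contribution is using CLIP's visual encoder for domain-aligned data screening, so that the training set sits closer to the target domain, and combining this with stage-wise training and inference-time ensembling instead of a complex network redesign. The result is a practical, data-centric solution for data-scarce night-time dehazing.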

Abstract: Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
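
The screening step described in the abstract (rank external candidates by similarity to the target domain) can be sketched on precomputed embeddings. The sketch below assumes the image embeddings have already been extracted by some encoder such as CLIP; the function name, the use of a target centroid, and the top-`keep` rule are assumptions for illustration, since the abstract does not state the exact selection criterion.

```python
import numpy as np

def screen_by_similarity(target_emb, candidate_emb, keep):
    """Rank candidate embeddings by cosine similarity to the target-domain centroid."""
    target = np.asarray(target_emb, dtype=float)
    cand = np.asarray(candidate_emb, dtype=float)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    centroid = target.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    scores = cand @ centroid                 # cosine similarity per candidate
    order = np.argsort(-scores)[:keep]       # indices of the most similar candidates
    return order, scores
```

A threshold on `scores` would work equally well as a selection rule; which of the two suits a given dataset is an empirical question.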


[66] Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality cs.CVPDF

Yanming Xiu, Zhengayuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova

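TL;DR: This paper introduces ContrAR, a benchmark for systematically evaluating the robustness of vision-language models against contradictory virtual content attacks in augmented reality. The benchmark contains 312 real-world AR videos and is used to test 11 commercial and open-source VLMs. Experiments show that existing models understand contradictory content to a reasonable degree, but leave room for improvement in detecting adversarial manipulation and in balancing accuracy against latency.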

Details

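Motivation: As augmented reality becomes more widespread, its security and reliability become key challenges. Malicious or inconsistent virtual content can mislead users, create semantic confusion, or deliver harmful information, so the robustness of VLMs under such attacks needs systematic evaluation.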

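Result: Eleven VLMs were tested on ContrAR. Current models show reasonable understanding of contradictory virtual content, but detection of adversarial manipulation still needs improvement, and balancing detection accuracy against latency remains challenging.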

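Insight: The contribution is the first systematic modeling of contradictory virtual content attacks in AR, together with ContrAR, a dedicated benchmark of real-world AR videos that provides a standardized testbed for evaluating VLM robustness in dynamic, adversarial AR environments.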

Abstract: Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user’s view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.


[67] Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images cs.CVPDF

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang

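TL;DR: This paper proposes STSF-Net, a framework for multimodal change detection between optical and SAR images. It jointly models modality-specific features and spatio-temporal common features to strengthen change representations, and introduces an adaptive feature fusion strategy driven by semantic priors from pre-trained foundation models to suppress pseudo-changes caused by differences in imaging mechanisms.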

Details

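Motivation: Existing multimodal change detection methods are limited in cross-modal interaction and in their use of modality-specific features. Fine-grained change information is therefore modeled insufficiently, which hinders precise detection of semantic changes in multimodal data.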

Result: On the Delta-SN6, BRIGHT, and Wuhan-Het datasets, the method exceeds the previous state of the art in mIoU by 3.21%, 1.08%, and 1.32%, respectively.

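Insight: The novelty lies in jointly modeling modality-specific and spatio-temporal common features to strengthen change representations, and in using semantic priors from pre-trained foundation models to guide adaptive fusion of multimodal features. Viewed objectively, the Delta-SN6 dataset is also valuable as the first openly accessible multiclass change detection benchmark built from very-high-resolution fully polarimetric SAR and optical images.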

Abstract: Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, existing MMCD approaches exhibit limitations in cross-modal interaction and in exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundation models, enabling semantic-guided adaptive fusion of multimodal information. In addition, we introduce the Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on the Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state of the art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.


[68] EchoAgent: Towards Reliable Echocardiography Interpretation with “Eyes”, “Hands” and “Minds” cs.CVPDF

Qin Wang, Zhiqing He, Yu Liu, Bowen Guo, Zeju Li

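TL;DR: This paper proposes EchoAgent, an agentic system designed for end-to-end interpretation of echocardiography (Echo). It integrates knowledge-base construction (minds), visual parsing and measurement (eyes and hands), and explainable reasoning to mimic the complete workflow of a cardiac sonographer, aiming at more reliable echocardiography analysis.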

Details

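Motivation: Current deep-learning methods and large language models for echocardiography analysis usually focus on a single skill, such as visual segmentation or knowledge reasoning, and lack the coordinated eyes-hands-minds ability that clinicians need. This limits their clinical reliability and usefulness. The paper aims to remove this limitation by building a system that coordinates observation, operation, and reasoning at the same time.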

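Result: EchoAgent was evaluated on the CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views and 14 cardiac anatomical regions. It achieved the best performance across a range of structure analysis tasks, with overall accuracy of up to 80.00%.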

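Insight: The core contribution is an agent framework that integrates knowledge-base construction, a hierarchical collaboration toolkit, and an orchestrated reasoning hub, reported as the first single system to simulate the full eyes-hands-minds coordination of echocardiography analysis. Viewed objectively, an agent architecture that fuses structured domain knowledge, automated visual operations, and explainable reasoning offers a new system design pattern for medical image analysis and may improve the reliability and clinical acceptance of AI-assisted diagnosis.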

Abstract: Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, and requires clinicians to synchronously orchestrate multiple capabilities, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, perform anatomical segmentation, and carry out quantitative measurement. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences. We evaluate EchoAgent on the CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate, and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.


[69] Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities cs.CVPDF

Rongfei Chen, Tingting Zhang, Xiaoyu Shen, Wei Zhang

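TL;DR: This paper proposes Prompt-based Missing Modality Adaptation (ProMMA), a framework for handling missing modalities in multimodal sentiment analysis. A Missing Modality Evaluator at the input stage dynamically assesses the importance of each missing modality to avoid low-quality data imputation. A modality-invariant prompt disentanglement module then captures local correlations within each modality, a dynamic prompt weighting module adaptively suppresses interference from missing modalities, and a multi-level prompt dynamic connection module strengthens global consistency.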

Details

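Motivation: Existing methods mainly improve robustness to missing modalities through prompt learning and pre-trained models, but two limitations remain. First, whether a missing modality needs to be generated at all is not rigorously evaluated. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored.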

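Result: Extensive experiments on three public benchmarks, CMU-MOSI, CMU-MOSEI, and CH-SIMS, show that the framework achieves state-of-the-art performance and stable results under different missing-modality settings.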

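Insight: The novelty is the "evaluation before generation" paradigm, which dynamically assesses the importance of a missing modality to avoid blind generation. In addition, disentangling modality-specific prompts, weighting them dynamically by mutual information, and integrating global prompt priors through residual connections systematically improve robustness to missing modalities and representation quality.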

Abstract: The missing-modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real-world scenarios. Existing approaches primarily improve robustness through prompt learning and pre-trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt-based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pre-trained models and pseudo-labels, thereby avoiding low-quality data imputation. Building on this, a Modality-invariant Prompt Disentanglement module decomposes shared prompts into modality-specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual-information-based weights from cross-attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi-level Prompt Dynamic Connection module integrates shared prompts with self-attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features. Extensive experiments on three public benchmarks, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, demonstrate that the proposed framework achieves state-of-the-art performance and stable results under diverse missing-modality settings. The implementation is available at https://github.com/rongfei-chen/ProMMA


[70] High-Resolution Single-Shot Polarimetric Imaging Made Easy cs.CVPDF

Shuangfan Zhou, Chu Zhou, Heng Guo, Youwei Lyu, Boxin Shi

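TL;DR: This paper proposes EasyPolar, a multi-view polarimetric imaging framework that overcomes the reduced spatial resolution and artifacts caused by the spatial multiplexing of conventional Division-of-Focal-Plane (DoFP) sensors. Building on the physical principle that three independent intensity measurements fully characterize linear polarization, it uses three synchronized RGB cameras that capture one unpolarized view and two polarized views with different orientations, and adds a confidence-guided polarization reconstruction network to handle possible misalignment in multi-view fusion.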

Details

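Motivation: Existing Division-of-Focal-Plane (DoFP) polarization sensors achieve single-shot capture, but their spatial multiplexing inherently reduces spatial resolution and introduces artifacts. The goal is to remove these limitations without sacrificing snapshot capability.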

Result: Experiments show that the method produces high-quality reconstructions and benefits a variety of downstream tasks.

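Insight: The novelty is a triple-camera hardware setup derived from a physical principle (three independent intensity measurements fully characterize linear polarization), combined with a confidence-guided physical guidance mechanism for multi-modal feature fusion that suppresses warping artifacts and imposes explicit geometric constraints on the solution space.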

Abstract: Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.
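
The physical insight in the abstract follows from the standard relation for an ideal linear polarizer at angle θ, I(θ) = ½(S0 + S1·cos 2θ + S2·sin 2θ): an unpolarized view gives S0, and two polarized views give two linear equations in S1 and S2. The sketch below solves that system for perfectly aligned views with equal exposure; the function name, the polarizer angles, and the assumption of ideal alignment are illustrative, and the paper's learned reconstruction network is not reproduced here.

```python
import numpy as np

def linear_stokes(i_unpol, i_pol1, i_pol2, theta1, theta2):
    """Recover S0, S1, S2, DoLP and AoLP from one unpolarized and two polarized intensities.

    Assumes ideal polarizers at angles theta1 and theta2 (radians), with
    sin(2 * (theta2 - theta1)) != 0 so that the 2x2 system is invertible.
    """
    s0 = np.asarray(i_unpol, dtype=float)
    # 2 * I(theta) - S0 = S1 * cos(2 theta) + S2 * sin(2 theta)
    rhs = np.stack([2.0 * np.asarray(i_pol1, dtype=float) - s0,
                    2.0 * np.asarray(i_pol2, dtype=float) - s0])
    system = np.array([[np.cos(2 * theta1), np.sin(2 * theta1)],
                       [np.cos(2 * theta2), np.sin(2 * theta2)]])
    s1, s2 = np.linalg.solve(system, rhs.reshape(2, -1)).reshape(rhs.shape)
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, 1e-12)  # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                            # angle of linear polarization
    return s0, s1, s2, dolp, aolp
```

The same call works per pixel on aligned image arrays of equal shape. In a real multi-camera rig the views are not aligned, which is the problem the confidence-guided network is meant to address.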


[71] WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval cs.CVPDF

Yizhuo Xu, Chaojian Yu, Yuanjie Shao, Tongliang Liu, Qinmu Peng

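TL;DR: This paper proposes WRF4CIR, a weight-regularized fine-tuning network for composed image retrieval. It applies adversarial perturbations to the model weights during fine-tuning to mitigate overfitting, which improves generalization when triplet data is limited.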

Details

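Motivation: Current composed image retrieval methods built on vision-language pre-trained models commonly suffer from severe overfitting, especially when triplet data is limited. The paper sets out to study this generalization gap systematically and to reduce it.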

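Result: Extensive experiments on several benchmark datasets show that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.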

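Insight: The core idea is to regularize fine-tuning with adversarial perturbations applied to the model weights. The perturbation points opposite to the gradient descent direction, which intuitively makes the training data harder to fit and thereby mitigates overfitting. It is a new weight regularization strategy aimed specifically at the fine-tuning process.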

Abstract: The Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained (VLP) models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
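
One common way to realize "perturb the weights against the descent direction" is to step the weights a small distance along the gradient, compute the gradient there, and use that gradient for the actual update. The toy sketch below does this for least-squares regression; the normalized one-step perturbation, `rho`, `lr`, and all function names are assumptions for illustration, and the abstract does not confirm that WRF4CIR uses this exact rule.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error (halved) of a linear model and its gradient."""
    residual = x @ w - y
    return 0.5 * np.mean(residual ** 2), x.T @ residual / len(y)

def perturbed_step(w, x, y, lr=0.1, rho=0.05):
    """One update whose gradient is taken at adversarially perturbed weights."""
    _, grad = loss_and_grad(w, x, y)
    # Move along +grad (the ascent direction, i.e. opposite to descent).
    delta = rho * grad / (np.linalg.norm(grad) + 1e-12)
    _, grad_adv = loss_and_grad(w + delta, x, y)
    return w - lr * grad_adv                 # descend using the perturbed gradient

# Toy problem: the training loss still falls, but each step sees a harder objective.
x = np.eye(2)
y = np.array([1.0, -1.0])
w = np.zeros(2)
initial_loss, _ = loss_and_grad(w, x, y)
for _ in range(200):
    w = perturbed_step(w, x, y)
final_loss, _ = loss_and_grad(w, x, y)
```

Larger `rho` makes the perturbed objective harder to fit, which is the lever that trades training fit for regularization in schemes of this kind.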


[72] Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher cs.CVPDF

Pengcheng Weng, Yanyu Qian, Yangxin Xu, Fei Wang

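TL;DR: This paper proposes PTA (Purify-then-Align), a framework for robust multimodal human sensing when modalities are missing. A meta-learning-driven weighting mechanism purifies noisy modalities, and a diffusion-based knowledge distillation paradigm aligns the purified knowledge from an information-rich "clean teacher" to each single-modality "student" encoder, strengthening single-modality models in missing-modality scenarios.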

Details

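Motivation: The paper targets robustness to missing modalities in multimodal human sensing. The two core obstacles are the "representation gap" between heterogeneous data and the "contamination effect" of low-quality modalities. The two are causally linked, because contamination impedes the reduction of representation disparities.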

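Result: Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, both of which exhibit a pronounced representation gap and contamination effect, show that PTA achieves state-of-the-art (SOTA) performance and significantly improves the robustness of single-modality models across missing-modality scenarios.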

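Insight: The novelty is the "purify first, then align" strategy for resolving the causal dependency. Meta-learning provides dynamic weighting to purify the knowledge source, and a diffusion-based knowledge distillation paradigm performs cross-modal knowledge alignment. The end goal is a set of strong single-modality encoders that carry cross-modal knowledge.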

Abstract: Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel “Purify-then-Align” framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this “Purify-then-Align” strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.


[73] BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration cs.CVPDF

Yujie Yao, Yuhaohang He, Junjie Huang, Zhou Liu, Jiangzhao Li

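TL;DR: BPC-Net is a boundary probability calibration framework for annotation-free skin lesion segmentation. Gaussian Probability Smoothing (GPS) performs localized probability-space calibration before thresholding, recovering under-confident lesion boundaries without over-expanding the foreground. The framework also includes a feature-decoupled decoder and an interaction-branch adaptation strategy to cope with noisy pseudo-label supervision and cross-domain transfer.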

Details

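Motivation: Annotation-free skin lesion segmentation faces three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and under-confident boundary probabilities. In particular, existing methods rarely pay explicit attention to how compressed boundary probabilities affect final mask quality.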

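Result: On ISIC-2017, ISIC-2018, and PH2, the framework achieves state-of-the-art performance among published unsupervised methods, with a macro-average Dice coefficient of 85.80% and Jaccard index of 76.97%, and it approaches supervised reference performance on PH2.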

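Insight: The novelty is the boundary probability calibration (BPC) framework, whose core is Gaussian Probability Smoothing (GPS) for localized probability-space calibration that restores boundary confidence. It also introduces a feature-decoupled decoder that handles context suppression, detail recovery, and boundary refinement separately, and an interaction-branch adaptation strategy that stabilizes cross-domain transfer. These designs could carry over to other unsupervised segmentation tasks.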

Abstract: Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.
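
As a rough picture of "smooth in probability space, then threshold", the sketch below applies a plain separable Gaussian filter to a foreground probability map before binarizing it. This is only the generic operation the name suggests: the kernel size, `sigma`, the threshold, and the function names are assumptions, and the localization rule that keeps GPS from expanding the foreground indiscriminately is not described in this digest.

```python
import numpy as np

def smooth_probabilities(prob, sigma=1.0):
    """Separable Gaussian smoothing of a 2D probability map with edge padding."""
    prob = np.asarray(prob, dtype=float)
    radius = max(1, int(round(3 * sigma)))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()                       # weights sum to 1, so values stay in [0, 1]
    padded = np.pad(prob, radius, mode="edge")
    conv = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(conv, 1, padded)  # smooth along columns of each row
    return np.apply_along_axis(conv, 0, rows)    # then along rows of each column

def calibrated_mask(prob, sigma=1.0, threshold=0.5):
    """Threshold the smoothed map instead of the raw one."""
    return smooth_probabilities(prob, sigma) >= threshold
```

Because the kernel is normalized, a boundary pixel with a slightly low probability is pulled up by confident neighbours inside the lesion, while isolated weak responses far from any confident region are pulled down.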


[74] ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference cs.CVPDF

Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji

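TL;DR: This paper proposes ID-Selection, a visual token selection method for accelerating inference in large vision-language models (LVLMs). It combines importance estimation with diversity-aware iterative selection, keeping informative tokens while reducing redundancy, and so achieves a better balance between performance and efficiency at high compression ratios.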

Details

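Motivation: Existing visual token pruning methods struggle to balance token importance and diversity. Importance-based methods tend to keep redundant tokens, while diversity-based methods may miss informative ones. The trade-off is especially acute at high compression ratios.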

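Result: Extensive experiments on 5 LVLM backbones and 16 mainstream benchmarks show that ID-Selection consistently delivers superior performance and efficiency, especially at extreme pruning ratios. For example, on LLaVA-1.5-7B it prunes 97.2% of visual tokens (keeping only 16), cuts inference FLOPs by more than 97%, and preserves 91.8% of the original performance, all without additional training.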

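Insight: The novelty is coupling importance scoring with diversity-aware iterative selection in a single selection process: the scores of tokens similar to those already chosen are progressively suppressed, so information retention and redundancy reduction are handled together. Viewed objectively, it is a simple and effective heuristic that achieves a near-SOTA trade-off in high-compression settings without any training.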

Abstract: Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
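
The selection loop described in the abstract (pick the highest-scoring token, then damp the scores of tokens similar to it, and repeat) can be sketched as below. The multiplicative suppression by `1 - cosine similarity`, the non-negative importance scores, and the function name are assumptions for illustration; the abstract does not specify the actual suppression rule.

```python
import numpy as np

def select_tokens(features, importance, keep):
    """Greedy importance-diversity selection over token features.

    features:   (n, d) token embeddings
    importance: (n,) non-negative importance scores
    keep:       number of tokens to retain
    """
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = np.asarray(importance, dtype=float).copy()
    available = np.ones(len(scores), dtype=bool)
    selected = []
    for _ in range(min(keep, len(scores))):
        idx = int(np.argmax(np.where(available, scores, -np.inf)))
        selected.append(idx)
        available[idx] = False
        # Suppress the scores of tokens similar to the one just chosen.
        similarity = np.clip(feats @ feats[idx], 0.0, 1.0)
        scores *= 1.0 - similarity
    return selected
```

With three tokens where the first two are duplicates, `select_tokens([[1, 0], [1, 0], [0, 1]], [0.9, 0.8, 0.5], keep=2)` skips the duplicate and keeps the third token, whereas a plain top-2 by importance would keep both duplicates.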


[75] Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization cs.CV | cs.AI | cs.LGPDF

Dustin Eisenhardt, Timothy Schaumlöffel, Alperen Kantarci, Gemma Roig

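TL;DR: Through systematic experiments, this paper studies the key design factors of style transfer for improving domain generalization, namely style pool diversity, texture complexity, and style source. Based on the findings it proposes StyleMixDG, a lightweight data augmentation method that needs no model modification, and validates it on several real-world datasets.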

Details

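Motivation: Deep learning models trained on synthetic data generalize poorly to real scenes because of the Sim2Real gap. The paper addresses this and also resolves three contradictions in the existing literature about key design choices for style-transfer augmentation.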

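Result: On the GTAV to BDD100k, Cityscapes, and Mapillary Vistas benchmark, StyleMixDG improves consistently over strong baselines, showing that the proposed design principles yield practical gains.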

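Insight: The contribution is a systematic empirical analysis establishing that a diverse style pool is more effective than repeated use of a few styles, that texture complexity has no significant effect once the pool is large enough, and that diverse artistic styles outperform domain-aligned ones. The lightweight, model-agnostic augmentation recipe StyleMixDG is derived from these findings.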

Abstract: Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas} benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.


[76] Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening cs.CV | cs.AIPDF

Chenyu Xue, Yiran Liu, Mian Zhou, Jionglong Su, Zhixiang Lu

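TL;DR: This paper proposes a semantic-topological graph reasoning framework for medical image segmentation driven by free-text clinical instructions. A text-to-vision intent distillation module extracts precise diagnostic guidance, and mask selection is modeled as a dynamic graph reasoning problem to handle the semantic ambiguity of clinical reports and overlapping anatomy in low-contrast scans. A selective asymmetric fine-tuning strategy updates less than 1% of the parameters, which guards against overfitting.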

Details

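Motivation: Existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and with complex anatomical overlaps in low-contrast scans, and full-parameter fine-tuning on limited medical datasets tends to cause severe overfitting.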

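Result: Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets shows that the framework sets a new state of the art. On LIDC-IDRI it reaches an 81.5% Dice Similarity Coefficient, more than 5% above leading LLM-based tools such as LISA. The selective asymmetric fine-tuning strategy shows excellent cross-fold stability.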

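Insight: The novelty lies in combining the reasoning ability of large language models with the zero-shot segmentation ability of vision foundation models, and in resolving anatomical ambiguity through dynamic graph modeling. The selective asymmetric fine-tuning strategy acts as a strong regularizer and offers a feasible path toward robust, context-aware clinical deployment.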

Abstract: Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.


[77] FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos cs.CVPDF

Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun

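TL;DR: FunRec reconstructs functional 3D digital twins of indoor scenes from egocentric RGB-D interaction videos. It recovers interactable 3D scenes directly from real-world human interaction sequences: it automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, producing simulation-compatible meshes.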

Details

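Motivation: Existing articulated reconstruction methods depend on controlled setups, multi-state captures, or CAD priors. FunRec aims to remove these restrictions by recovering interactable 3D scenes directly from in-the-wild interaction videos.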

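Result: On new real and simulated benchmarks, FunRec surpasses prior work by a large margin: part segmentation improves by up to +50 mIoU, articulation and pose errors drop by a factor of 5 to 10, and reconstruction accuracy is significantly higher.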

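Insight: The novelty is end-to-end reconstruction of functional 3D scenes directly from real-world interaction videos (discovery, parameter estimation, tracking, and reconstruction), with simulation-compatible meshes that can be exported as URDF/USD and that support hand-guided affordance mapping and robot-scene interaction.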

Abstract: We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.


[78] DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV | cs.CL | cs.MMPDF

Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao

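TL;DR: This paper introduces DetailVerifyBench, a benchmark for evaluating dense hallucination localization in long image captions. It contains 1,000 high-quality images across five domains, with an average caption length of more than 200 words and fine-grained token-level hallucination annotations, and is intended to address the lack of granularity and domain diversity in existing benchmarks.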

Details

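Motivation: With the progress of multimodal large language models (MLLMs), image captions have evolved from short sentences into long narratives. Pinpointing specific erroneous words or phrases in such long contexts is very challenging, and existing benchmarks lack the granularity and domain diversity needed to evaluate this ability.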

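Result: DetailVerifyBench is described as the most challenging benchmark to date for precise hallucination localization in long image captioning, with an average caption length above 200 words and dense token-level annotations of multiple hallucination types.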

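Insight: The contribution is a fine-grained, multi-domain, high-difficulty benchmark for hallucination localization in long image captions. It moves evaluation from response-level inconsistency detection toward precise token-level localization and provides a stricter testbed for assessing model reliability.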

Abstract: Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.


[79] A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting cs.CVPDF

Yongchuan Cui, Peng Liu

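TL;DR: This paper proposes LLaRS, presented as the first unified foundation model for multi-modal, multi-task low-level vision in remote sensing. It aligns heterogeneous bands with Sinkhorn-Knopp optimal transport, processes features through three complementary mixture-of-experts layers, and stabilizes training with dynamic weight adjustment. The model is trained on LLaRS1M, a million-scale dataset covering 11 restoration and enhancement tasks. Experiments show that it outperforms seven competing models on multiple tasks and transfers and adapts well.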

Details

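Motivation: Remote sensing imagery is affected by clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing methods usually train a separate model for each degradation type, so a unified and efficient solution is missing.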

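Result: Experiments show that LLaRS consistently outperforms seven competing models across multiple tasks, and parameter-efficient fine-tuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data.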

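Insight: The contributions are: 1) the first unified multi-task foundation model for remote sensing low-level vision; 2) Sinkhorn-Knopp optimal transport for semantic alignment of heterogeneous bands; 3) three complementary mixture-of-experts layers (convolutional, channel-mixing, and attention with low-rank adapters) that process features jointly; 4) LLaRS1M, a large-scale multi-task dataset that combines real observations, synthetic degradations, and language prompts; 5) step-level dynamic weight adjustment to stabilize joint training.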

Abstract: Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS
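
For readers unfamiliar with the Sinkhorn-Knopp step mentioned in the abstract, the sketch below shows the basic iteration for entropy-regularized optimal transport between two uniform distributions. It is a generic textbook version under stated assumptions: the cost matrix, `epsilon`, the uniform marginals, and the function name are illustrative, and how LLaRS builds costs between bands and slots is not described in this digest.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    cost = np.asarray(cost, dtype=float)
    n, m = cost.shape
    row_mass = np.full(n, 1.0 / n)           # e.g. one unit of mass spread over bands
    col_mass = np.full(m, 1.0 / m)           # e.g. spread over slots
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(iters):
        u = row_mass / (kernel @ v)          # rescale rows to match row_mass
        v = col_mass / (kernel.T @ u)        # rescale columns to match col_mass
    return u[:, None] * kernel * v[None, :]

plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

Each row of the returned plan says how one source item distributes its mass over the targets; smaller `epsilon` gives a sharper, closer-to-hard assignment at the cost of slower and less stable iterations.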


[80] SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection cs.CVPDF

Letian Bai, Chengyu Tao, Juan Du

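TL;DR: This paper proposes SGANet (Semantic and Geometric Alignment Network), a unified framework that addresses feature inconsistency in multimodal multi-view anomaly detection. A Selective Cross-view Feature Refinement Module (SCFRM) strengthens cross-view feature interaction, Semantic-Structural Patch Alignment (SSPA) aligns semantics across modalities while preserving structural consistency, and Multi-View Geometric Alignment (MVGA) aligns geometrically corresponding patches, so the model learns physically coherent representations across views and modalities.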

Details

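Motivation: In multi-view anomaly detection, existing unsupervised methods often suffer from feature inconsistency caused by viewpoint changes and modality differences, which hurts detection performance. The paper proposes a framework that combines semantic and geometric alignment to address this.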

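Result: Extensive experiments on the SiM3D and Eyecandies datasets show that SGANet achieves state-of-the-art (SOTA) performance in both anomaly detection and localization, supporting its effectiveness in realistic industrial scenarios.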

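Insight: The novelty is handling multimodal multi-view data in a unified way by jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, so that physically coherent feature representations are learned and anomaly detection improves. Viewed objectively, combining semantic alignment with geometric alignment offers a new direction for multi-view multimodal learning.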

Abstract: Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.


[81] Towards Athlete Fatigue Assessment from Association Football Videos cs.CV

Xavier Bou, Nathan Correger, Alexandre Cloots, Cédric Gavage, Silvio Giancola

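TL;DR: This paper studies whether monocular broadcast video can be used to assess fatigue in football players. It extracts player trajectories with state-of-the-art Game State Reconstruction (GSR) methods and proposes a new kinematics processing algorithm that yields temporally consistent speed and acceleration estimates, from which acceleration-speed (A-S) profiles are built as fatigue-related performance indicators.

Details

Motivation: Fatigue monitoring in football currently relies mainly on subjective self-reports, laboratory biomarkers, or intrusive sensors such as heart-rate monitors or GPS. The paper explores whether monocular broadcast video can provide spatio-temporal signals of sufficient quality to support fatigue analysis.

Result: The full pipeline is evaluated on the public SoccerNet-GSR benchmark, on both 30-second clips and a complete 45-minute half, to examine short-term reliability and longer-term temporal consistency. The results indicate that monocular GSR can recover kinematic patterns compatible with A-S profiling, while also revealing sensitivity to trajectory noise, calibration errors, and the temporal discontinuities inherent to broadcast footage.

Insight: The contributions are a novel kinematics processing algorithm that obtains temporally consistent speed and acceleration estimates from reconstructed trajectories, the first use of monocular broadcast video as a low-cost basis for fatigue analysis, and a clear statement of the methodological challenges for future research.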

Abstract: Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration–speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.
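
The step from reconstructed tracks to speed and acceleration can be illustrated with a minimal sketch. It is not the paper's kinematics algorithm, whose details the abstract does not give; it assumes pitch coordinates in metres sampled at a fixed frame rate, and uses a plain moving average plus finite differences. The function names and the `window` parameter are hypothetical.

```python
import math

def moving_average(values, window):
    """Centered moving average; the window shrinks near both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def finite_diff(values, dt):
    """First derivative: central differences inside, one-sided at the ends.

    Requires at least two samples.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * dt))
    return out

def speed_and_acceleration(xs, ys, fps, window=5):
    """Return per-frame speed (m/s) and acceleration (m/s^2) for one track.

    xs, ys: pitch coordinates in metres (assumed), one sample per frame.
    window: smoothing window in frames (hypothetical default).
    """
    dt = 1.0 / fps
    xs_s = moving_average(xs, window)  # suppress frame-to-frame jitter
    ys_s = moving_average(ys, window)
    vx = finite_diff(xs_s, dt)
    vy = finite_diff(ys_s, dt)
    speed = [math.hypot(a, b) for a, b in zip(vx, vy)]
    accel = finite_diff(speed, dt)     # rate of change of scalar speed
    return speed, accel
```

Each (speed, acceleration) pair then becomes one point of an A-S scatter. Values near the start and end of a track are biased by the shrinking smoothing window and are best discarded.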


[82] PanopticQuery: Unified Query-Time Reasoning for 4D Scenes cs.CV

Ruilin Tang, Yang Zhou, Zhong Ye, Wenxi Liu, Yan Huang

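TL;DR: This paper proposes PanopticQuery, a framework for unified query-time reasoning in dynamic 4D scenes. It builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and uses a multi-view semantic consensus mechanism that aggregates 2D semantic predictions across views and time frames, mapping natural language queries to a globally consistent 4D semantic understanding.

Details

Motivation: Existing 4D reconstruction methods based on neural representations are limited in contextual reasoning (for example interactions, temporal actions, and spatial relations) and struggle to turn noisy, view-dependent predictions into globally consistent 4D interpretations.

Result: On the newly proposed Panoptic-L4D benchmark, PanopticQuery sets a new state of the art on complex language queries covering attributes, actions, spatial relationships, and multi-object interactions.

Insight: The contributions are: 1) a multi-view semantic consensus mechanism that improves semantic consistency by aggregating 2D predictions and filtering out inconsistent ones; 2) lifting 2D semantics into structured 4D groundings via neural field optimization; 3) Panoptic-L4D, a new benchmark dedicated to evaluating language-based querying in dynamic scenes.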

Abstract: Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.


[83] Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis cs.CV | cs.AI

Peixi Peng, Housheng Xie, Yanling Wei, Guangcong Ruan, Xiaoyang Zou

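TL;DR: This paper proposes RATNet, a foundation model for gastrointestinal endoscopy diagnosis. Built on an analogical reasoning mechanism, it acquires and transfers knowledge from five heterogeneously annotated endoscopy datasets through a cyclic pre-training strategy, addressing the limited generalizability, adaptability, robustness, and scalability of existing AI models. RATNet outperforms existing foundation models across scenarios such as common disease diagnosis, few-shot learning for rare diseases, and zero-shot transfer to new medical centers. It supports fine-tuning, linear probing, and zero-shot transfer, and is open and cost-effective.

Details

Motivation: The burden of gastrointestinal disease is growing, and endoscopy is the primary tool for early diagnosis, yet routine image interpretation suffers from missed lesions and low efficiency. Existing AI-assisted diagnosis models lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations.

Result: RATNet outperforms existing foundation models such as GastroNet and GastroVision in six scenarios (diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning), achieving SOTA performance.

Insight: The novelty is an analogical reasoning mechanism that matches image-derived posterior knowledge against a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. The model integrates heterogeneous annotations automatically, without manual label unification, which lowers data acquisition costs and provides a practical foundation for intelligent diagnosis in resource-limited settings.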

Abstract: Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.


[84] Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective cs.CV

Jonas Muth, Zdravko Marinov, Simon Reiß

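TL;DR: This paper proposes Task-Contrastive Learning (TaCo), a framework for exploring the intrinsic relationships between medical vision tasks. By embedding 30 different tasks (such as segmentation, detection, denoising, inpainting, colorization, and geometric transformations) into a shared representation space, the study analyzes how these tasks relate, overlap, and differ across 39 datasets from different medical imaging modalities.

Details

Motivation: Medical computer vision has largely focused on improving performance on specific tasks, while the intrinsic relationships between tasks at the representational level (how they relate, overlap, or differ) remain largely unexplored. The paper aims to reveal the fundamental properties and interconnectedness of these tasks with a data-driven approach.

Result: Using TaCo, the study maps heterogeneous tasks into a joint embedding space and analyzes their properties, identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space.

Insight: The novelty is the first systematic exploration of the intrinsic structure among medical vision tasks, together with a general contrastive learning framework for quantifying task relationships. This provides a basis for understanding task similarity and interconnectedness, and may benefit multi-task learning, transfer learning, or meta-learning in medical imaging.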

Abstract: While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks. Specifically, we investigate 30 tasks, including semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray, ultrasound, and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.


[85] SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation cs.CV | cs.AI

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu

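TL;DR: SnapFlow is a plug-and-play self-distillation method that compresses the multi-step denoising of flow-matching Vision-Language-Action (VLA) models into a single forward pass, substantially reducing inference latency while matching or exceeding the performance of the original multi-step model.

Details

Motivation: Flow-matching VLA models (such as pi0, pi0.5, and SmolVLA) reach SOTA on robotic manipulation tasks, but their iterative denoising (typically 10 steps) introduces substantial latency, accounting for 80% of end-to-end inference time. Naively reducing the step count degrades performance because the velocity field is not calibrated for single-step jumps.

Result: On pi0.5 (3B parameters), SnapFlow achieves 98.75% average success on the LIBERO suites (40 tasks, 400 episodes), matching and slightly exceeding the 10-step teacher at 97.75%, with a 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms. On SmolVLA (500M parameters), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. On long-horizon tasks, SnapFlow reaches 93% success at n_act=5, where the baseline reaches only 90%.

Insight: The contributions are: mixing standard flow-matching samples with consistency samples, whose targets are two-step Euler shortcut velocities computed from the model's own marginal velocity predictions, to avoid trajectory drift; a zero-initialized target-time embedding that lets the network switch between local velocity estimation and global one-step generation within a single architecture; and no need for an external teacher or architecture changes, with training taking about 12 hours on a single GPU. The method is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.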

Abstract: Vision-Language-Action (VLA) models based on flow matching – such as pi0, pi0.5, and SmolVLA – achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model’s own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success – matching the 10-step teacher at 97.75% and slightly exceeding it – with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.
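
The abstract does not spell out how a two-step Euler shortcut velocity is computed. The sketch below shows one plausible reading, under stated assumptions: take two Euler half-steps with the model's own velocity predictions from time `t` to a target time `s`, then use the average velocity over the whole interval as the training target for a single jump. `velocity_fn`, the time convention, and the toy field in the usage note are hypothetical, and in actual training the target would be computed without gradient flow.

```python
def two_step_shortcut_velocity(velocity_fn, x, t, s):
    """Average velocity over [t, s] obtained from two Euler half-steps.

    velocity_fn(x, t) -> list of floats: stand-in for the model's velocity
    prediction (hypothetical signature). x is the current state as a list.
    """
    mid = 0.5 * (t + s)
    v1 = velocity_fn(x, t)
    # First Euler half-step: t -> mid
    x_mid = [xi + (mid - t) * vi for xi, vi in zip(x, v1)]
    v2 = velocity_fn(x_mid, mid)
    # Second Euler half-step: mid -> s
    x_end = [xi + (s - mid) * vi for xi, vi in zip(x_mid, v2)]
    # Velocity that would reach x_end from x in one jump of size (s - t)
    return [(xe - xi) / (s - t) for xe, xi in zip(x_end, x)]
```

For example, with the toy field `lambda x, t: [-xi for xi in x]`, state `[1.0]`, `t = 0.0`, and `s = 1.0`, the two half-steps give 0.5 and then 0.25, so the shortcut velocity is `[-0.75]`, whereas a single Euler step would use `[-1.0]`.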


[86] 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models cs.CV

Xinye Zheng, Fei Wang, Yiqi Nie, Kun Li, Junjie Chen

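TL;DR: This paper proposes a framework that combines visual priors with efficient 3D scene modeling to reconstruct 3D scenes from smoke-degraded multi-view images. It uses Nano-Banana-Pro to enhance the degraded images and develops Smoke-GS, a medium-aware 3D Gaussian Splatting framework, for scene reconstruction and restoration-oriented novel view synthesis.

Details

Motivation: 3D reconstruction in smoke scenes is difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency.

Result: In challenging smoke environments, the method generates consistent and visually clear novel views and improves the robustness of 3D Gaussian Splatting to smoke-induced degradation.

Insight: The novelty lies in combining visual priors from a multimodal large language model (such as Nano-Banana-Pro) with an explicit 3D Gaussian representation, and in adding a lightweight view-dependent medium branch that models the direction-dependent appearance changes caused by smoke. This improves robustness in adverse environments while preserving rendering efficiency.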

Abstract: Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction and develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.


[87] CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration cs.CV | cs.AI

Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng

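TL;DR: This paper proposes CRFT (Consistent-Recurrent Feature Flow Transformer), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, and the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. An iterative discrepancy-guided attention mechanism with a spatial geometric transform recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency.

Details

Motivation: Cross-modal image registration faces feature inconsistency and geometric deformation caused by modality differences (for example different sensors or imaging conditions). The paper aims for robust and accurate spatial alignment under these conditions.

Result: Extensive experiments on multiple cross-modal datasets show that CRFT consistently outperforms state-of-the-art (SOTA) registration methods in both accuracy and robustness.

Insight: The contributions are a unified coarse-to-fine feature flow learning framework, a modality-independent feature representation, and a recurrent refinement design that pairs iterative discrepancy-guided attention with a spatial geometric transform. Together they improve geometric adaptability and structural coherence and provide a generalizable paradigm for multimodal spatial correspondence, with applications in remote sensing, autonomous navigation, and medical imaging.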

Abstract: We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.


[88] Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs cs.CV

Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang

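TL;DR: This paper proposes GUIDE, a framework that addresses the limited physical spatial awareness of multimodal large language models (MLLMs). It unrolls geometric priors layer by layer, progressively injecting multi-granularity geometric features into the early layers of the MLLM, and adds a context-aware gating mechanism that dynamically selects spatial cues, improving performance on complex spatial reasoning and perception tasks.

Details

Motivation: Existing geometry-aware MLLMs typically rely on single deep-layer extraction and input-level fusion, which loses local geometric detail and causes semantic mismatches in the early layers, limiting physical spatial awareness on real-world visual streams.

Result: Extensive experiments on multiple complex spatial reasoning and perception tasks show that GUIDE significantly outperforms existing baselines, establishing a new paradigm for integrating 3D geometric priors into large models.

Insight: The novelty is a progressive geometric prior injection framework that fuses geometric features through multi-level sampling and layer-wise alignment, combined with context-aware gating. This guides the model to learn the 2D-to-3D transition, makes efficient use of spatial priors, and suppresses redundant geometric noise.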

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.


[89] In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting cs.CV

Wenhui Xiao, Ethan Goan, Rodrigo Santa Cruz, David Ahmedt-Aristizabal, Olivier Salvado

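TL;DR: This paper proposes a training framework that reliably integrates monocular depth priors into 3D Gaussian Splatting. It targets the scale ambiguity, multi-view inconsistency, and local geometric inaccuracies of foundation monocular depth estimation models, improving rendering quality for sparse data and textureless scenes.

Details

Motivation: Accurate depth priors in 3D Gaussian Splatting mitigate artifacts caused by sparse training data and textureless surfaces, but acquiring precise depth maps requires specialized acquisition systems. Foundation monocular depth estimation models are cheap, yet their inherent scale ambiguity, multi-view inconsistency, and local geometric errors can degrade rendering, so a method for reliably exploiting these noisy depth priors is needed.

Result: Extensive experiments on multiple datasets show consistent improvements in geometric accuracy across the GS variants and monocular depth backbones tested, yielding more accurate depth estimation and higher rendering quality.

Insight: The core contribution is a training framework that learns from weakly aligned depth variations and introduces selective regularization that isolates ill-posed geometry. This restricts the propagation of depth errors into well-reconstructed 3D structures and enables reliable use of noisy, scale-ambiguous monocular depth priors.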

Abstract: Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.
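
Scale ambiguity means a monocular depth map is only meaningful up to an unknown scale and offset. A common way to handle this, shown below as a minimal sketch, is a closed-form least-squares fit of one scale and one shift against some reference depth before comparing the two. This is a generic baseline for illustration only, not the paper's framework, which the abstract describes as learning from weakly aligned depth variations. The function name and the choice of reference depth (for example depth rendered from the current Gaussians) are assumptions.

```python
def fit_scale_shift(mono_depth, ref_depth):
    """Least-squares scale s and shift b such that s * mono + b ~ ref.

    mono_depth, ref_depth: equally long sequences of per-pixel depths.
    """
    n = len(mono_depth)
    if n != len(ref_depth) or n < 2:
        raise ValueError("need two equally long sequences with >= 2 samples")
    mean_d = sum(mono_depth) / n
    mean_z = sum(ref_depth) / n
    var_d = sum((d - mean_d) ** 2 for d in mono_depth)
    if var_d == 0.0:
        raise ValueError("monocular depth is constant; scale is undefined")
    cov = sum((d - mean_d) * (z - mean_z)
              for d, z in zip(mono_depth, ref_depth))
    scale = cov / var_d                # slope of the regression line
    shift = mean_z - scale * mean_d    # intercept
    return scale, shift
```

A single global fit like this cannot correct multi-view inconsistency or local geometric errors, which is the gap the selective regularization described in the abstract is aimed at.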


[90] MPM: Mutual Pair Merging for Efficient Vision Transformers cs.CV

Simon Ravé, Pejman Rasti, David Rousseau

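TL;DR: This paper proposes MPM (Mutual Pair Merging), a training-free token aggregation module that speeds up vision transformer inference for semantic segmentation by shortening the token sequence. It forms mutual nearest-neighbor pairs in cosine space, averages each pair, and uses gather-based reconstruction so that existing segmentation heads work unchanged. Substantial end-to-end latency reductions are verified on several hardware platforms.

Details

Motivation: Existing token reduction methods mostly target classification and are often evaluated with proxy metrics instead of end-to-end latency. In semantic segmentation, token reduction must also reconstruct dense, pixel-aligned features, and on modern accelerators the cost of computing merge maps can cancel the expected gains. The paper aims for a simple, training-free token merging method that explicitly accounts for overhead to deliver real wall-clock gains.

Result: On standard segmentation datasets such as ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on a Raspberry Pi 5 and increases throughput by up to 20% on an NVIDIA H100 with FlashAttention-2, while keeping the mIoU drop below 3%.

Insight: The novelty is a training-free token merging mechanism based on mutual nearest-neighbor pairing. The speed-accuracy trade-off is set by a discrete insertion schedule instead of a continuous compression knob (such as a keep-rate or threshold), and gather-based reconstruction keeps overhead minimal so that the end-to-end speedup holds on real hardware.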

Abstract: Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
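
The merge-and-reconstruct idea can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the paper's implementation: it uses a brute-force O(n^2) nearest-neighbor search over plain Python lists, assumes at least two tokens with non-zero norm, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def mutual_pair_merge(tokens):
    """Merge tokens that are each other's nearest neighbor in cosine space.

    Returns (merged, merge_map), where merge_map[i] is the index in
    `merged` that original token i was sent to.
    """
    n = len(tokens)
    # Nearest neighbor of every token, excluding the token itself.
    nearest = [max((j for j in range(n) if j != i),
                   key=lambda j: cosine(tokens[i], tokens[j]))
               for i in range(n)]
    merged, merge_map = [], [None] * n
    for i in range(n):
        if merge_map[i] is not None:       # already merged into a pair
            continue
        j = nearest[i]
        if nearest[j] == i:                # mutual pair: average the two
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:                              # no mutual partner: keep as-is
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map

def reconstruct(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return [merged[k] for k in merge_map]
```

Because `reconstruct` is a pure gather, both members of a pair receive the same averaged feature, and the decoder sees a sequence of the original length. No threshold or keep-rate appears anywhere, which matches the abstract's description of a knob-free module.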


[91] GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance cs.CV

Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu

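TL;DR: This paper proposes GaussianGrow, which generates 3D Gaussian representations from 3D point clouds and uses text guidance to control generation. A multi-view diffusion model synthesizes consistent appearances for supervision, and an iterative strategy completes hard-to-observe regions until a complete set of 3D Gaussians is produced.

Details

Motivation: 3D Gaussian Splatting excels in rendering efficiency and quality, but generating 3D Gaussians without proper geometric priors remains challenging. Existing methods rely on unreliable estimated geometry as a reference, which can lead to poor generations. The paper starts from easily accessible 3D point clouds and "grows" Gaussians from them, which naturally enforces geometric accuracy.

Result: The paper extensively evaluates text-guided Gaussian generation on synthetic and real-scanned point clouds, but the abstract does not report specific quantitative results (such as PSNR or SSIM) or comparisons with SOTA methods.

Insight: The contributions are: 1) a geometry-aware generation paradigm that "grows" Gaussians from point clouds to ensure geometric accuracy; 2) a text-guided Gaussian growing scheme supervised by appearances from a multi-view diffusion model; 3) iterative camera pose detection combined with inpainting by a pretrained 2D diffusion model to complete hard-to-observe regions.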

Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose that observes the largest un-grown regions in the point clouds and to fill these regions by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: https://weiqi-zhang.github.io/GaussianGrow


[92] Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP cs.CV

Yusung Ro, Jaehyun Choi, Junmo Kim

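TL;DR: This paper introduces information scope as a new dimension for interpreting sparse autoencoder (SAE) features in CLIP vision encoders. It characterizes how broadly a feature aggregates visual evidence, from local patch-specific cues to global image-level signals. The authors quantify the positional stability of features with a Contextual Dependency Score (CDS) and find that features of different information scopes influence CLIP's predictions and confidence in systematically different ways.

Details

Motivation: Existing SAE analyses focus mainly on the semantic meaning of individual features and overlook differences in how broadly features aggregate visual information. A complementary interpretability dimension is needed to understand CLIP representations more deeply.

Result: Experiments show that the local-scope features (positionally stable) and global-scope features (positionally variant) separated by the Contextual Dependency Score affect CLIP's predictions and confidence differently, providing a new diagnostic view of CLIP representations.

Insight: The novelty is introducing information scope as a new interpretability axis and proposing the Contextual Dependency Score to quantify the positional dependence of features, which enables a finer-grained analysis of how SAE features behave in vision tasks.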

Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP’s predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.


[93] FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips cs.CV

Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang

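TL;DR: This paper proposes FoleyDesigner, a framework that generates immersive stereo Foley audio with precise spatio-temporal alignment for film clips. Inspired by professional Foley workflows, it integrates film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing, and introduces FilmStereo, the first high-quality stereo audio dataset with spatial metadata.

Details

Motivation: Manually creating spatio-temporally aligned Foley audio for film is labor-intensive and inefficient. The paper aims to automate this work and improve both the efficiency and the quality of immersive audio creation.

Result: Extensive experiments show that the method outperforms existing baselines in spatio-temporal alignment and is seamlessly compatible with professional film production standards (such as 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775).

Insight: The novelty lies in combining latent diffusion models trained on spatio-temporal cues from video frames with LLM-driven hybrid mechanisms that emulate post-production practice, achieving precise spatio-temporal alignment. The paper also builds FilmStereo, the first richly annotated professional stereo audio dataset, filling a gap in high-quality data.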

Abstract: Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/.


[94] SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge cs.CV

Dongliang Zhu, Zhiyi Niu, Bo Zhao, Jiajian Huang, Shuo Ye

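TL;DR: This paper presents SVC 2026, comprising the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge, which aims to advance research on robust representation learning for subtle visual signals. The challenge includes two tasks, cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. A total of 22 teams submitted final results, and the baseline models have been released.

Details

Motivation: When handling subtle, weak signals in the real world, existing models still face challenges in robustness, representation ability, and generalization, and studies are often confined to specific tasks or modalities. The challenge aims to promote robust representation learning for subtle visual signals and to advance computer vision and multimodal learning.

Result: A total of 22 teams submitted final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform.

Insight: The novelty is organizing a comprehensive challenge focused on robust representations of subtle visual signals, combining for the first time cross-domain multimodal deception detection with domain-generalized remote physiological measurement (rPPG). It is intended to encourage more robust and generalizable models for real-world conditions such as weak signals and domain shift.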

Abstract: Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle visual Challenge, which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform (https://sites.google.com/view/svc-cvpr26).


[95] Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision cs.CV

Amadou S. Sangare, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison

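TL;DR: This paper proposes $x_0$-supervision, a new training objective for controllable text-to-image generation models. By analyzing the denoising dynamics, the authors find that direct supervision on the clean target image, or an equivalent re-weighting of the diffusion loss, significantly accelerates convergence while improving both the visual quality and the conditioning accuracy of generated images.

Details

Motivation: Text-to-image diffusion/flow models have made remarkable progress in visual fidelity and text alignment, but they remain limited when precise control over image layout is needed, because natural language cannot reliably express such information. Existing controllable generation methods usually train the augmented network with the same loss as the initial network, which can lead to very long convergence times.

Result: Experiments on multiple control settings show that the method accelerates convergence by up to 2x according to the newly proposed metric (mean Area Under the Convergence Curve, mAUCC), while also improving visual quality and conditioning accuracy.

Insight: The core contribution is revisiting the training objective of controllable diffusion models and proposing direct supervision on the clean image $x_0$, or an equivalent re-weighting of the diffusion loss. From the perspective of optimization dynamics, this gives a more efficient training recipe that both speeds up training and improves performance.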

Abstract: Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision
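
Why supervising $x_0$ can be read as a re-weighting of the diffusion loss follows from a short, standard derivation. The block below is a sketch under an assumed Gaussian noising process with schedule coefficients $\alpha_t$ and $\sigma_t$; these symbols, and the noise-prediction parameterization, are assumptions for illustration and are not taken from the paper, which may use a different parameterization or weighting.

```latex
% Assumed forward process (standard Gaussian noising):
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Clean-image estimate implied by a noise prediction \hat{\epsilon}:
\hat{x}_0 = \frac{x_t - \sigma_t \hat{\epsilon}}{\alpha_t}

% Substituting x_t gives the error relation:
\hat{x}_0 - x_0 = -\frac{\sigma_t}{\alpha_t} \left( \hat{\epsilon} - \epsilon \right)

% Hence the x_0 loss is the noise loss scaled by a time-dependent weight:
\lVert \hat{x}_0 - x_0 \rVert^2
  = \frac{\sigma_t^2}{\alpha_t^2} \, \lVert \hat{\epsilon} - \epsilon \rVert^2
```

Under these assumptions the weight $\sigma_t^2 / \alpha_t^2$ grows with the noise level, so an $x_0$ loss places more emphasis on high-noise timesteps than the unweighted noise-prediction loss does.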


[96] Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0 cs.CV | cs.CL

Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner

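TL;DR: This paper presents BADAS-2.0, the second generation of a collision anticipation system. Building on BADAS-1.0, it advances the state of the art along three axes: a long-tail benchmark dataset, knowledge distillation for real-time edge deployment, and real-time explainability.

Details

Motivation: The goal is a collision anticipation system that is more accurate, scalable, and able to handle rare safety-critical scenarios, with real-time deployment on edge devices and explainable predictions.

Result: On the constructed 10-group long-tail benchmark, built from 178,500 labeled videos (about 2M clips), BADAS-2.0 achieves consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. The compact distilled models (BADAS-2.0-Flash and BADAS-2.0-Flash-Lite) achieve a 7-12x speedup with near-parity accuracy, enabling real-time edge deployment.

Insight: The contributions are: 1) using the existing model as an active oracle to score large volumes of unlabeled driving data, building a long-tail benchmark for rare safety-critical scenarios; 2) efficient knowledge distillation enabled by domain-specific self-supervised pre-training, which compresses the model substantially while preserving accuracy for real-time edge inference; 3) real-time object-centric attention heatmaps that localize the evidence behind predictions, extended with a vision-language model that generates structured textual reasoning to improve explainability.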

Abstract: We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar’s Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.


[97] PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization cs.CVPDF

Shicai Wei, Chunbo Luo, Qiang Zhu, Yang Luo

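TL;DR: This paper proposes Performance-Dominant Modality Prioritization (PDMP), a strategy that rethinks balanced learning in multimodal training. It argues that imbalanced learning driven by the performance-dominant modality (the modality with the best unimodal performance) leads to better multimodal performance, and that under-optimization in existing models stems from insufficient learning of that modality. PDMP identifies the performance-dominant modality from the performance ranking of independently trained unimodal models, then applies asymmetric coefficients to modulate each modality's gradients so that the dominant modality drives optimization. The method does not depend on the multimodal model's structure or fusion method, so it applies broadly. Extensive experiments on multiple datasets support its effectiveness.

Details

Motivation: Multimodal learning often suffers from insufficient optimization, where the multimodal model underperforms even its unimodal counterparts. Existing methods attribute this to imbalanced learning between modalities and address it with gradient modulation. This paper argues that balanced learning is not optimal and that learning should instead be driven by the performance-dominant modality.

Result: Extensive experiments on multiple datasets validate PDMP and show that it improves multimodal performance. The abstract does not name the specific benchmarks or report quantitative results.

Insight: The paper challenges the conventional view that multimodal learning requires balanced optimization. Its performance-dominant modality prioritization uses a simple gradient modulation based on unimodal performance ranking, giving a model-agnostic optimization improvement that is practical to apply.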

Abstract: Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality, the one with superior unimodal performance, can contribute to better multimodal performance, and the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP first mines the performance-dominant modality via the performance ranking of the independently trained unimodal models. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP relies only on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.
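The two steps described in the abstract (rank the modalities by unimodal performance, then scale their gradients asymmetrically) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify how the coefficients are computed, so the fixed values `dominant_coef` and `other_coef`, the function names, and the toy accuracies are all hypothetical.

```python
def pdmp_coefficients(unimodal_acc, dominant_coef=1.0, other_coef=0.5):
    """Assign a larger gradient coefficient to the best-performing modality.

    unimodal_acc maps modality name -> accuracy of an independently trained
    unimodal model. The coefficient values are placeholders, not the paper's.
    """
    dominant = max(unimodal_acc, key=unimodal_acc.get)
    return {m: (dominant_coef if m == dominant else other_coef)
            for m in unimodal_acc}

def modulate(grads, coefs):
    """Scale each modality's gradient vector by its coefficient."""
    return {m: [coefs[m] * g for g in gs] for m, gs in grads.items()}

# Toy unimodal accuracies: audio is the performance-dominant modality here.
acc = {"audio": 0.62, "video": 0.48}
coefs = pdmp_coefficients(acc)

# Toy per-modality encoder gradients from one multimodal training step.
grads = {"audio": [0.2, -0.4], "video": [1.0, 0.5]}
scaled = modulate(grads, coefs)
print(coefs, scaled)
```

Because only a ranking of unimodal scores is needed, a scheme like this does not depend on how the modalities are fused, which is consistent with the model-agnostic claim in the abstract.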


[98] EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion cs.CV | cs.GRPDF

Da Li, Dominik Engel, Deng Luo, Ivan Viola

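TL;DR: This paper proposes EfficientMonoHair, a framework for fast strand-level hair geometry reconstruction from monocular video. It combines an implicit neural representation with fusion-patch-based multi-view optimization and introduces a parallel hair-growing strategy, substantially improving runtime efficiency while preserving high-fidelity reconstruction.

Details

Motivation: Existing strand-level hair reconstruction methods struggle to balance accuracy and efficiency: implicit neural representations fail to preserve fine-grained strand details, while explicit optimization methods are computationally expensive and scale poorly.

Result: On synthetic benchmarks, reconstruction quality is comparable to state-of-the-art methods while runtime efficiency improves by nearly an order of magnitude. Extensive experiments on representative real-world hairstyles also confirm robustness and high-fidelity reconstruction.

Insight: The fusion-patch-based multi-view optimization reduces the number of optimization iterations for point cloud direction, and the parallel hair-growing strategy relaxes voxel occupancy constraints so that large-scale strand tracing stays stable and robust even when the orientation field is inaccurate or noisy.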

Abstract: Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method robustly and accurately reconstructs high-fidelity strand geometries. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.


[99] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering cs.CV | cs.CL | cs.IRPDF

Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan

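TL;DR: This paper proposes WikiSeeker, a novel multi-modal retrieval-augmented generation framework for knowledge-based visual question answering. It introduces a multi-modal retriever and redefines the role of the vision-language model: instead of serving only as an answer generator, the VLM acts as two specialized agents, a Refiner and an Inspector, which significantly improves retrieval accuracy and answer quality.

Details

Motivation: Current knowledge-based visual question answering methods rely mainly on images as the retrieval key and under-use vision-language models, often overlooking or misplacing their role. The paper rethinks the role of VLMs in KB-VQA in order to integrate multi-modal information more effectively.

Result: Extensive experiments on the EVQA, InfoSeek, and M2KR benchmarks show that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality.

Insight: The novelty lies in recasting the VLM as two specialized agents: the Refiner rewrites the query, and the Inspector selectively routes retrieved context. This decouples retrieval from generation and makes fuller use of the VLM's capabilities and internal knowledge.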

Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM’s internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.


[100] Learn to Rank: Visual Attribution by Learning Importance Ranking cs.CV | cs.LGPDF

David Schinagl, Christian Fruhwirth-Reisinger, Alexander Prutsch, Samuel Schulter, Horst Possegger

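TL;DR: This paper proposes "Learn to Rank", a visual attribution method that generates explanation maps for model decisions by directly optimizing deletion and insertion metrics. It recasts the non-differentiable sorting and ranking steps as a permutation learning problem and applies a differentiable Gumbel-Sinkhorn relaxation, which allows end-to-end training and efficient generation of dense, pixel-level attribution maps.

Details

Motivation: Existing visual attribution methods face a three-way trade-off among efficiency, causal grounding, and explanation granularity. Propagation-based methods are efficient but biased and architecture-specific; perturbation-based methods are causally grounded but expensive, and for vision transformers they often yield coarse, patch-level explanations; learning-based methods are fast but usually optimize surrogate objectives or distill from heuristic teachers. This paper aims to overcome these limits by directly optimizing the deletion and insertion metrics used to evaluate attribution quality.

Result: Experiments show consistent quantitative improvements and sharper, boundary-aligned explanation maps, particularly for transformer-based vision models.

Insight: The key idea is to turn direct optimization of attribution evaluation metrics (deletion and insertion) into a differentiable permutation learning problem, trained end to end through a Gumbel-Sinkhorn relaxation. This removes the dependence on surrogate objectives or teacher models found in existing learning-based methods and efficiently produces fine, pixel-level explanations, improving explanation granularity and quality for transformer models in particular.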

Abstract: Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model’s prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
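To make the "differentiable relaxation using Gumbel-Sinkhorn" concrete, here is a minimal pure-Python sketch of the generic Gumbel-Sinkhorn operator: a score matrix is perturbed with Gumbel noise, divided by a temperature, and then alternately row- and column-normalized in log space so that it approaches a doubly stochastic (soft permutation) matrix. It is not the authors' implementation; the function names, temperature, iteration count, and toy scores are assumptions.

```python
import math
import random

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample, clamping the uniform to avoid log(0)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_sinkhorn(scores, tau=0.5, n_iters=50, noise=1.0, rng=random):
    """Return a soft permutation matrix for a square score matrix."""
    n = len(scores)
    log_p = [[(scores[i][j] + noise * sample_gumbel(rng)) / tau
              for j in range(n)] for i in range(n)]
    for _ in range(n_iters):
        # Row normalization in log space.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column normalization in log space.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]

random.seed(0)
# Toy scores: entry (i, j) is the affinity of item i for rank position j.
scores = [[3.0, 1.0, 0.5], [0.2, 2.5, 1.0], [1.0, 0.3, 2.0]]
soft_perm = gumbel_sinkhorn(scores)
for row in soft_perm:
    print([round(v, 3) for v in row])
```

Lower temperatures push the output toward a hard permutation, while higher ones keep it smooth. In a learning setup the same operations would be written in an autodiff framework so that gradients flow back to the scores.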


[101] Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models cs.CVPDF

Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou

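TL;DR: This paper identifies a new attack, the "inscriptive jailbreak", against modern text-to-image models that can render paragraph-length text. The attack exploits the model's text-rendering capability by embedding harmful text in visually benign scenes. The authors develop Etch, a black-box attack framework that decomposes the adversarial prompt into three orthogonal layers (semantic camouflage, visual-spatial anchoring, and typographic encoding) and refines them iteratively with a vision-language model, effectively bypassing multi-stage safety filters.

Details

Motivation: Modern text-to-image models can render legible paragraph-length text, which opens a new class of misuse. Traditional "depictive jailbreaks" target visual content and do not exploit text rendering, and existing techniques struggle to bypass safety filters while maintaining character-level fidelity. The paper exposes this safety blind spot and formalizes the inscriptive jailbreak.

Result: Extensive evaluation across 7 models on 2 benchmarks shows that Etch reaches an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines.

Insight: The key idea is to decompose the adversarial prompt into three functionally orthogonal layers (semantic, visual, typographic), which reduces a complex joint optimization to tractable sub-problems that are refined iteratively through a zero-order loop with vision-language model feedback. The results reveal a critical typography-awareness gap in current T2I safety alignment and underline the urgency of multimodal defense mechanisms.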

Abstract: Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.


[102] Automatic dental superimposition of 3D intraorals and 2D photographs for human identification cs.CV | cs.AIPDF

Antonio D. Villegas-Yeguas, Xavier Abreau-Freire, Guillermo R-García, Andrea Valsecchi, Teresa Pinho

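TL;DR: This paper proposes an automatic dental superimposition method for human identification. Using computer vision and optimization techniques, it registers a 3D intraoral scan against 2D ante-mortem photographs to support morphological comparison.

Details

Motivation: Morphological dental comparison is hampered by missing ante-mortem medical records, for example in migrant deaths at borders or in countries without universal healthcare. The paper uses social media photos in which teeth are visible to quantify morphological differences objectively, addressing the limits of existing methods in modeling perspective distortion and in objectivity.

Result: Across 20,164 cross comparisons from 142 samples, the two automatic approaches (paired landmarks and teeth-region segmentation) obtain mean ranking values of 1.6 and 1.5, respectively. This clearly outperforms the filtering capability of automatic dental chart comparison, and the method also provides interpretable superimposed images and quantitative scores.

Insight: The paper develops an automatic 3D-2D registration framework that optimizes camera parameters to reproduce the perspective of the ante-mortem photograph. This yields an objective quantification of dental morphological differences and gives forensic odontology an automated, interpretable identification tool.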

Abstract: Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges in applying this method is the lack of ante-mortem medical records, especially in scenarios such as migrant deaths at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both obtain very promising results over 20,164 cross comparisons from 142 samples, with mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform the filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective, and quantitative score of the morphological correspondence that is easy to interpret and analyze by visualizing superimposed images.


[103] Physics-Aware Video Instance Removal Benchmark cs.CVPDF

Zirui Li, Xinghao Chen, Lingyu Jiang, Dengzhe Hou, Fangzhou Lin

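TL;DR: This paper introduces PVIR, a physics-aware video instance removal benchmark with 95 high-quality videos split into Simple and Hard subsets. It evaluates how well methods preserve background integrity and physical consistency when removing target objects.

Details

Motivation: Existing video instance removal benchmarks mainly assess visual plausibility and often overlook the physical causalities triggered by object removal, such as lingering shadows. A new benchmark is needed to evaluate physical consistency specifically.

Result: Four representative methods are evaluated on PVIR. PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. All methods degrade on the Hard subset, which highlights the ongoing challenge of recovering complex physical side effects.

Insight: The paper introduces the first video instance removal benchmark that explicitly targets physical causality (for example specular reflections and illumination interactions), and it uses a decoupled human evaluation protocol to separate semantic, visual, and spatial failures, setting a new standard for evaluating physical consistency.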

Abstract: Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions to isolate semantic, visual, and spatial failures: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.


[104] AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis cs.CVPDF

Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao

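TL;DR: This paper introduces AICA-Bench, a benchmark for comprehensively evaluating vision-language models on affective image content analysis across three tasks: emotion understanding, emotion reasoning, and emotion-guided content generation. It finds that existing models suffer from weak intensity calibration and shallow descriptions, and proposes Grounded Affective Tree prompting, a training-free framework that mitigates both problems.

Details

Motivation: Vision-language models are strong at perception, but holistic affective image content analysis, which integrates perception, reasoning, and generation in a unified framework, remains underexplored. A comprehensive benchmark is needed to evaluate and advance this area.

Result: Evaluating 23 VLMs on AICA-Bench reveals their limitations. The proposed GAT prompting framework reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

Insight: The paper builds AICA-Bench, the first unified benchmark that integrates emotion understanding, reasoning, and generation, and proposes GAT prompting, which combines visual scaffolding with hierarchical reasoning to improve affective analysis without any training.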

Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.


[105] Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning cs.CV | cs.AIPDF

Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng

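TL;DR: This paper proposes SRCP, a framework that addresses the poor generalization of successor representation (SR) methods in high-dimensional visual environments for visual unsupervised reinforcement learning (URL). It introduces a saliency-guided dynamics task to decouple representation learning from successor training, and adds a fast-sampling consistency policy to improve the modeling and controllability of skill-conditioned policies.

Details

Motivation: SR methods have two key limitations in visual URL. First, SR objectives lead to representations that attend to dynamics-irrelevant regions, which harms the accuracy of successor measures and task generalization. Second, these flawed representations prevent SR policies from modeling multi-modal skill-conditioned action distributions and from ensuring skill controllability.

Result: Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark show that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

Insight: The contributions are a saliency-guided dynamics task that decouples representation learning to capture dynamics-relevant representations, and a fast-sampling consistency policy combined with URL-specific classifier-free guidance and tailored training objectives to improve policy modeling. Together they improve the scalability and generalization of SR in visual environments.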

Abstract: Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.


[106] Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction cs.CV | eess.IVPDF

Ahmet Rasim Emirdagi, Süleyman Aslan, Mısra Yavuz, Görkay Aydemir, Yunus Bilge Kurt

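TL;DR: This paper proposes a new paradigm for data-efficient CT metal artifact reduction that uses an image editing foundation model. Through parameter-efficient LoRA adaptation, it reframes artifact reduction as an in-context reasoning task and achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. On the AAPM CT-MAR benchmark it achieves state-of-the-art performance on perceptual and radiological-feature metrics.

Details

Motivation: Metal artifacts from high-attenuation implants severely degrade CT image quality and obscure critical anatomical structures, and standard deep learning methods need large amounts of paired training data to address them.

Result: On the AAPM CT-MAR benchmark, the method achieves state-of-the-art performance on perceptual and radiological-feature metrics.

Insight: The contributions include reframing artifact reduction as an in-context reasoning task, adapting a foundation model parameter-efficiently with LoRA, a multi-reference conditioning strategy that grounds the restoration with clean anatomical exemplars from unrelated subjects, and evidence that domain adaptation is essential for mitigating hallucination.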

Abstract: Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at https://github.com/ahmetemirdagi/CT-EditMAR.
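For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the generic LoRA weight update W' = W + (alpha / r) * B A on a tiny matrix, in pure Python. It illustrates only the standard technique, not the paper's model or training setup; the matrix sizes, `alpha`, and helper names are assumptions chosen for readability.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with the base weight W frozen.

    w: d_out x d_in base weight, a: r x d_in, b: d_out x r.
    """
    r = len(a)                      # adapter rank
    scale = alpha / r
    delta = matmul(b, a)            # low-rank update, d_out x d_in
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen base weight (d_out=2, d_in=3)
a = [[1.0, 1.0, 1.0]]                    # rank-1 down-projection
b_init = [[0.0], [0.0]]                  # B starts at zero: no change at init
b_trained = [[0.5], [-1.0]]              # hypothetical values after training

print(lora_weight(w, a, b_init, alpha=2.0))     # identical to w
print(lora_weight(w, a, b_trained, alpha=2.0))  # w plus the scaled rank-1 update
```

Only `a` and `b` would be trained, so the number of trainable parameters grows with the rank r rather than with the full weight size. This is the usual argument for why such adaptation can work with few paired examples.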


[107] Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition cs.CVPDF

Tianyi Liu, Yiming Li, Wenqian Wang, Jiaojiao Wang, Chen Cai

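TL;DR: This paper proposes a Mixture-of-Modality-Experts (MoME) framework combined with a Holistic Token Learning (HTL) strategy for fine-grained multimodal visual analytics, with driver action recognition as the target task. Through adaptive collaboration among modality experts and holistic token learning, the framework aims to improve expert specialization and reduce ambiguity in multimodal fusion.

Details

Motivation: Existing multimodal learning methods usually rely on fixed fusion modules or predefined cross-modal interactions, which adapt poorly to changing modality reliability and struggle to capture fine-grained action cues. The paper addresses the challenge of robust multimodal visual analytics when heterogeneous modalities provide complementary but input-dependent evidence.

Result: Experiments on a public benchmark show that the MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further confirm that HTL improves subtle multimodal understanding and offers better interpretability.

Insight: The paper proposes a knowledge-centric multimodal learning framework: MoME enables adaptive collaboration among modality experts, and HTL uses class tokens and spatio-temporal tokens to strengthen both intra-expert refinement and inter-expert knowledge transfer, improving the modeling of fine-grained, input-dependent multimodal cues.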

Abstract: Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.


[108] Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning cs.CV | cs.LGPDF

Ioannis Nasios

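TL;DR: This study proposes a modular, multi-model framework for landslide detection that fuses Sentinel-2 optical imagery with Sentinel-1 synthetic aperture radar data. It processes each modality with a separate encoder in a multi-encoder vision transformer and ensembles neural networks with gradient boosting models, achieving robust landslide detection in a non-classical change detection setting that requires no pre-event optical data.

Details

Motivation: Landslides are a major geohazard, and accurate, timely detection is needed to support disaster risk reduction. Existing methods may depend on pre-event data or a single data source. This study combines the complementary strengths of optical and radar data to build a robust, operational detection framework.

Result: The method achieves a state-of-the-art F1 score of 0.919 on landslide detection and top performance in a machine learning competition, with a strong balance between precision and recall.

Insight: The contributions are: 1) a multi-encoder vision transformer architecture that processes the optical and SAR modalities separately and exploits their complementarity; 2) ensemble learning that combines neural networks with gradient boosting models (LightGBM and XGBoost) to improve performance; 3) a modular design that supports optical-only, SAR-only, or combined inputs, making the framework scalable and transferable to broader natural hazard monitoring.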

Abstract: Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found at https://github.com/IoannisNasios/sentinel-landslide-cls.
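Two quantities named in the abstract have simple closed forms, sketched below: NDVI, the normalized difference of near-infrared and red reflectance (bands B8 and B4 on Sentinel-2), and the F1 score, the harmonic mean of precision and recall. The reflectance values and confusion counts are invented for illustration and are not taken from the paper; the small `eps` term is an assumption added to avoid division by zero.

```python
def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red + eps)

def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy reflectances: dense vegetation versus a bare, recently exposed slope.
print(round(ndvi(nir=0.50, red=0.10), 3))   # high NDVI
print(round(ndvi(nir=0.25, red=0.20), 3))   # low NDVI

# Toy patch-level confusion counts for a landslide classifier.
print(round(f1_score(tp=90, fp=10, fn=10), 3))
```

A landslide strips vegetation, so NDVI tends to drop over affected patches. Appending such an index as an extra input channel gives the model a direct vegetation cue alongside the raw bands.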


[109] HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation cs.CVPDF

Tao Hu, Varun Jampani

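TL;DR: HumANDiff is a new framework for human video generation that strengthens control over human motion through articulated noise diffusion. It has three core designs: articulated motion-consistent noise sampling, joint appearance-motion learning, and geometric motion consistency learning. It is realized by fine-tuning existing video diffusion models without changing the model architecture, and at inference it supports image-to-video generation within a single framework with intrinsic motion control.

Details

Motivation: Despite substantial recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motion faithfully. The paper proposes a framework that gives better control over motion consistency and physical realism.

Result: Extensive experiments show that the method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity human videos with diverse clothing styles.

Insight: The paper brings a 3D human body topology prior into noise sampling, replacing unstructured random Gaussian noise with noise sampled on an articulated surface manifold to obtain spatially and temporally consistent noise. Joint learning of appearance and physical motion, together with a geometric motion consistency loss defined in the articulated noise space, strengthens physical realism and motion control. The method is agnostic to diffusion model design and therefore broadly applicable.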

Abstract: Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/


[110] Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family cs.CV | cs.CLPDF

Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan

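TL;DR: This paper reveals a center bias in the CLIP family: the models focus disproportionately on the central region of an image and overlook important objects near the boundaries. Representation and attention analyses show that the bias stems from information loss during the aggregation of visual embeddings, particularly the pooling mechanism. The paper proposes training-free visual prompting and attention redistribution strategies to mitigate it.

Details

Motivation: Center bias limits fine-grained visual understanding in the CLIP family. It makes the models miss key objects near image boundaries, which hurts any higher-level task that depends on those objects.

Result: Embedding decomposition and attention map analysis demonstrate qualitatively that center bias exists. The proposed training-free strategies, such as visual prompting and attention redistribution, effectively redirect model attention to off-center regions and alleviate the bias.

Insight: The paper is the first to systematically identify and analyze center bias in CLIP and to show the key role of pooling in the information loss. The training-free interventions offer a practical, low-cost way to improve fine-grained perception in vision-language models.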

Abstract: Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts especially those associated with off-center objects vanish from the model’s embedding in the final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution by redirecting models’ attention to off-center regions.


[111] OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control cs.CVPDF

Yukun Wang, Ruihuang Li, Jiale Tao, Shiyuan Yang, Liyi Chen

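TL;DR: This paper proposes OmniCamera, a unified framework for multi-task video generation that explicitly disentangles and independently controls two key dimensions: the dynamic content of a scene and the camera motion. With a new hybrid dataset, OmniCAM, and a Dual-level Curriculum Co-Training strategy, it addresses modality conflict and data scarcity, enabling flexible control of complex camera movements while maintaining superior visual quality.

Details

Motivation: Existing video generation models usually entangle scene dynamics with camera motion, which limits independent control. The paper aims to explicitly disentangle and independently control the two dimensions for more flexible creative video generation.

Result: OmniCamera achieves state-of-the-art performance in controlling complex camera movements, offering flexible generation control while maintaining superior visual quality.

Insight: The contributions are: 1) OmniCAM, a hybrid dataset that combines real-world videos with synthetic data to provide diverse paired examples for robust multi-task learning; 2) a Dual-level Curriculum Co-Training strategy that uses progressive training at the condition level and the data level to mitigate modality interference and learn jointly from different data sources. Viewed objectively, combining a disentangled control framework with curriculum learning offers a useful reference for multimodal control in video generation.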

Abstract: Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: at the condition level, it progressively introduces control modalities in order of difficulty; at the data level, it trains for precise control on synthetic data before adapting to real data for photorealism. As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.


[112] Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST cs.CV

Michael Karnes, Alper Yilmaz

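TL;DR: This paper proposes Aristotelian Rapid Object Modeling (A-ROM), a framework built on the Platonic Representation Hypothesis, to address the black-box nature that keeps deep learning from clinical adoption in medical imaging. It uses the general metric space of pretrained Vision Transformers to model new medical concepts quickly without gradient-based fine-tuning, and replaces traditional decision layers with an interpretable concept dictionary and a k-nearest-neighbors classifier, achieving performance comparable to baseline methods on MedMNIST v2.

Details

Motivation: To address the “black-box” problem of backpropagation-based deep learning models in medical imaging and meet the strict transparency and interpretability requirements of clinical settings.

Result: Experiments on MedMNIST v2 show that A-ROM performs comparably to standard baselines while offering a simple, scalable “few-shot” solution.

Insight: The novelty lies in combining the Platonic Representation Hypothesis with the general metric space of pretrained ViTs for rapid modeling without gradient-based fine-tuning, and in improving transparency through an interpretable concept dictionary and a kNN classifier, which offers a new route to trustworthy deployment of medical AI.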

Abstract: While deep learning has achieved remarkable success in medical imaging, the “black-box” nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model’s logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, “few-shot” solution that meets the rigorous transparency demands of modern clinical environments.
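
The decision rule described above (a kNN classifier over a labeled concept dictionary in a fixed embedding space) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for frozen ViT embeddings, the labels `benign` and `malignant` are hypothetical, and cosine similarity is an assumed choice of metric.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, dictionary, k=3):
    """Majority vote over the k dictionary entries most similar to `query`.

    `dictionary` is a list of (embedding, concept_label) pairs; the returned
    neighbors double as human-readable evidence for the prediction.
    """
    ranked = sorted(dictionary, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Hypothetical concept dictionary: toy vectors in place of ViT features.
concept_dictionary = [
    ([1.0, 0.1, 0.0], "benign"),
    ([0.9, 0.2, 0.1], "benign"),
    ([0.1, 1.0, 0.2], "malignant"),
    ([0.0, 0.9, 0.3], "malignant"),
]

label, evidence = knn_predict([0.95, 0.15, 0.05], concept_dictionary, k=3)
print(label, [lbl for _, lbl in evidence])
```

Because no weights are updated, adding a new concept in this sketch amounts to appending labeled embeddings to the dictionary.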


[113] Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors cs.CV | cs.AI | cs.MM

Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou

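TL;DR: Graph-PiT is a part-based image synthesis framework built on graph priors that improves the structural coherence of generated images by explicitly modeling the spatial-semantic relationships among visual parts. It uses a Hierarchical Graph Neural Network (HGNN) for bidirectional message passing between coarse-grained part nodes and fine-grained IP+ token sub-nodes, and introduces a graph Laplacian smoothness loss and an edge-reconstruction loss to refine part embeddings.

Details

Motivation: Existing part-based frameworks treat user-provided parts as an unordered set and ignore their intrinsic spatial and semantic relationships, so the generated compositions lack structural integrity. The paper addresses this by introducing a graph prior that models structural dependencies among parts.

Result: Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw) and qualitative transfer to real web images show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablations confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints.

Insight: The novelty lies in modeling parts and their relationships as a graph and using an HGNN for hierarchical message passing to refine part embeddings, which improves the structural plausibility and interpretability of multi-part image synthesis. This provides a scalable, structure-aware mechanism for complex image generation.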

Abstract: Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at https://github.com/wolf-bailang/Graph-PiT.
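
To make the graph Laplacian smoothness term concrete, the sketch below computes the standard unweighted form, the sum of squared distances between embeddings of adjacent parts, and checks it against the equivalent quadratic form trace(XᵀLX) with L = D - A. The exact weighting and normalization used in Graph-PiT are not given in the abstract, so this is only the textbook version on hypothetical toy embeddings.

```python
def laplacian_smoothness(embeddings, edges):
    """Sum over undirected edges (i, j) of ||x_i - x_j||^2."""
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
    return total

def laplacian_quadratic_form(embeddings, edges):
    """Same quantity computed as trace(X^T L X) with L = D - A."""
    n = len(embeddings)
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    total = 0.0
    for d in range(len(embeddings[0])):
        col = [row[d] for row in embeddings]
        total += sum(col[i] * lap[i][j] * col[j] for i in range(n) for j in range(n))
    return total

# Three hypothetical part embeddings; parts 0-1 and 1-2 are adjacent.
parts = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
adjacency = [(0, 1), (1, 2)]

loss = laplacian_smoothness(parts, adjacency)
print(loss, laplacian_quadratic_form(parts, adjacency))
```

Minimizing this term pulls the embeddings of adjacent parts toward each other, which matches the stated goal that adjacent parts acquire compatible embeddings.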


[114] Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning cs.CV | cs.AI

Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin

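TL;DR: This paper proposes a dual self-consistency reinforcement learning framework for scientific graphics program synthesis, targeting the challenge of reverse-engineering scientific schematics into editable TikZ code. It builds the high-quality dataset SciTikZ-230K and the comprehensive benchmark SciTikZ-Bench, introduces a dual self-consistency reinforcement learning paradigm based on round-trip verification, and trains SciTikZer-8B, a model that reaches state-of-the-art visual fidelity and structural logic.

Details

Motivation: Multimodal large language models struggle to reverse-engineer scientific schematics into TikZ code, which demands rigorous spatial precision. The paper targets two main gaps: existing data lack strict executability and reliable visual alignment, and there is no benchmark that evaluates both structural and visual fidelity.

Result: On SciTikZ-Bench, which spans basic geometric constructs to intricate hierarchical schematics, the trained SciTikZer-8B achieves state-of-the-art performance and consistently outperforms proprietary or large-scale models such as Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct.

Insight: The core contributions are: 1) SciTikZ-230K, a large-scale, high-quality, cross-disciplinary, executability-centered dataset; 2) SciTikZ-Bench, a multifaceted benchmark for visual and structural fidelity; 3) a new dual self-consistency reinforcement learning paradigm that uses round-trip verification to penalize degenerate code and improve overall self-consistency, offering a new direction for visual-code optimization.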

Abstract: Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.


[115] Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes cs.CV

Athanasios Angelakis, Marta Gomez-Barrero

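TL;DR: This study extends the previously proposed ZACH-ViT (a compact, permutation-invariant Vision Transformer) to robustness evaluation in medical imaging, presenting the first systematic assessment of its robustness to common image corruptions and adversarial perturbations in a low-data regime. Experiments show that ZACH-ViT achieves the best mean rank on clean data and under common corruptions and remains competitive under adversarial attack, although its adversarial robustness still needs improvement.

Details

Motivation: ZACH-ViT was originally motivated by the observation that positional embeddings and a dedicated class token in conventional ViTs encode fixed spatial assumptions, which may be suboptimal for biomedical images whose spatial information is weak, locally distributed, or variable. The original study evaluated only clean-data performance; this study fills the gap by systematically evaluating robustness to image corruptions and adversarial attacks in low-data medical imaging tasks.

Result: Evaluation uses seven MedMNIST datasets in a low-data setting with 50 samples per class. ZACH-ViT achieves the best mean rank on clean data (1.57) and under common corruptions (1.57). Under adversarial stress testing, all models degrade substantially, but ZACH-ViT ranks first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best.

Insight: The claimed novelty is the first robustness-focused extension of ZACH-ViT, showing that the advantages of compact, permutation-invariant Transformer architectures are not limited to clean-data performance but can persist under realistic perturbation stress in low-data medical imaging. Viewed objectively, the study offers an important perspective: in domains such as medical imaging, aligning the architecture with the spatial structure of the data may matter more than absolute performance on general benchmarks, especially when data are limited and robustness is required. Adversarial robustness nevertheless remains an open challenge for all evaluated models.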

Abstract: The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
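
The mean-rank figures above can be read as per-dataset ranks averaged over the benchmark. If ranks are integers averaged over the seven datasets, a mean rank of 1.57 corresponds to a rank sum of 11 (11/7 ≈ 1.57) and 2.29 to a rank sum of 16 (16/7 ≈ 2.29); how the five seeds are aggregated is not stated in the abstract. A minimal sketch of the computation, with hypothetical model names and scores:

```python
def mean_ranks(scores_by_dataset):
    """Average per-dataset rank of each model (rank 1 = highest score).

    `scores_by_dataset` maps dataset name -> {model name: score}. Ties are
    broken by insertion order here, which is a simplification.
    """
    totals = {}
    for scores in scores_by_dataset.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    n_datasets = len(scores_by_dataset)
    return {model: total / n_datasets for model, total in totals.items()}

# Hypothetical accuracies on three datasets (not results from the paper).
toy_scores = {
    "dataset_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
    "dataset_2": {"model_a": 0.6, "model_b": 0.7, "model_c": 0.5},
    "dataset_3": {"model_a": 0.8, "model_b": 0.7, "model_c": 0.9},
}

ranks = mean_ranks(toy_scores)
print(ranks)
```

A lower mean rank indicates more consistent placement near the top across datasets, which is why it is often reported alongside raw accuracy in multi-dataset comparisons.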


[116] SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation cs.CV

Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou

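TL;DR: This paper proposes SEM-ROVER, a framework for large-scale driving scene generation based on semantic voxel-guided diffusion. It uses a Σ-Voxfield grid as a discrete 3D representation, generates content with a semantic-conditioned diffusion model over local voxel neighborhoods, scales to large scenes through progressive spatial outpainting, and finally produces multiview-consistent photorealistic images with a deferred rendering module, without per-scene optimization.

Details

Motivation: Existing methods for generating large-scale outdoor driving scenes either rely on image or video generative models distilled into 3D space, which harms geometric coherence and restricts rendering viewpoints, or are limited to small-scale 3D scenes or object-centric generation. Scalable, multiview-consistent large-scale scene generation is missing.

Result: Extensive experiments show that the method generates diverse large-scale urban outdoor scenes that can be rendered into photorealistic images under various sensor configurations and camera trajectories, while keeping computation cost moderate compared with existing approaches.

Insight: The novelty lies in combining a semantic-conditioned diffusion model with the Σ-Voxfield grid representation, capturing spatial structure through local voxel-neighborhood operations and 3D positional encodings, and scaling up via progressive spatial outpainting, which yields large-scale, multiview-consistent 3D scene generation without per-scene optimization.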

Abstract: Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $Σ$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.


[117] DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models cs.CV | cs.AI | cs.GR

Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu

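TL;DR: DiffHDR is a generative framework that uses video diffusion models to convert low dynamic range (LDR) videos into high dynamic range (HDR) videos. It formulates LDR-to-HDR conversion as a radiance inpainting task in latent space, uses a pretrained video diffusion model to generate realistic HDR detail in over- and underexposed regions, and supports controllable conversion guided by text prompts or reference images.

Details

Motivation: Existing LDR-to-HDR conversion techniques fail to restore realistic detail in over- and underexposed regions. The paper addresses this to support HDR displays and accurate re-exposure in post-production.

Result: The method significantly outperforms existing state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

Insight: The paper frames LDR-to-HDR conversion as a generative radiance inpainting task that exploits the spatio-temporal priors of a pretrained video diffusion model; operates in Log-Gamma color space to handle dynamic range better; and develops a pipeline that synthesizes high-quality HDR video training data from static HDRI maps, easing the scarcity of paired data.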

Abstract: Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.


[118] HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models cs.CV | cs.LG

Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach

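TL;DR: This paper proposes HaloProbe, a Bayesian framework for detecting and mitigating object hallucinations in vision-language models. The framework factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities, and uses these as an external scoring signal for non-invasive mitigation, reducing hallucinations while preserving model utility.

Details

Motivation: Large vision-language models produce object hallucinations in image descriptions, and existing detection methods based on attention weights are unreliable because of hidden confounders (such as token position and object repetition), which lead to Simpson's paradox.

Result: Experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving model utility.

Insight: The novelty lies in identifying the root cause of the unreliability of coarse-grained attention-based analysis (confounders that produce Simpson's paradox) and in proposing a Bayesian framework that separates internal and external evidence to estimate the true posterior, enabling a non-invasive mitigation strategy.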

Abstract: Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model’s attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson’s paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models’ internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
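
One standard way to combine evidence from a classifier trained on balanced classes with a separately estimated prior is Bayes' rule in odds form. The sketch below assumes the balanced classifier's output can be read as a posterior under a 50/50 prior, so that its odds equal the likelihood ratio. This is a generic prior-shift correction used for illustration; the abstract does not specify HaloProbe's actual factorization, and the function name is hypothetical.

```python
def posterior_from_balanced(p_balanced, prior):
    """Combine a balanced-classifier probability with a class prior.

    p_balanced: probability of "hallucinated" from a model trained on balanced
                classes, so p / (1 - p) acts as the likelihood ratio.
    prior:      prior probability of "hallucinated" given external features.
    Both arguments must lie strictly between 0 and 1.
    """
    likelihood_ratio = p_balanced / (1.0 - p_balanced)
    posterior_odds = likelihood_ratio * prior / (1.0 - prior)
    return posterior_odds / (1.0 + posterior_odds)

# Strong internal evidence (0.8) combined with a low external prior (0.1).
print(posterior_from_balanced(0.8, 0.1))
```

With an uninformative internal signal (`p_balanced = 0.5`) the result reduces to the prior, and with a 50/50 prior it reduces to the classifier output, which is the behavior expected of such a factorization.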


[119] Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV | cs.RO

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang

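TL;DR: This paper proposes Action Images, a unified world action model that formulates policy learning as multiview video generation. By translating 7-DoF robot actions into interpretable action images (pixel-grounded multiview action videos), it lets the video backbone act directly as a zero-shot policy without an additional policy head or action module.

Details

Motivation: Existing world action models (WAMs) usually rely on separate action modules or use action representations that are not pixel-grounded, which makes it hard to exploit the knowledge of pretrained video models and limits transfer across viewpoints and environments.

Result: On RLBench and in real-world evaluations, the model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models.

Insight: The novelty is the pixel-grounded action image representation, which encodes actions explicitly as multiview video so that a single model supports policy learning, video-action joint generation, action-conditioned video generation, and action labeling, improving interpretability and transferability.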

Abstract: World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.


eess.SY [Back]

[120] Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset eess.SY | cs.AI | cs.CL

Tinko Sebastian Bartels, Ruixiang Wu, Xinyu Lu, Yikai Lu, Fanzeng Xia

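TL;DR: This paper introduces the OpenCEM Simulator and Dataset, the first open-source digital twin platform designed to combine rich, unstructured contextual information (such as event schedules and system logs) with the quantitative dynamics of renewable energy systems, in order to advance research on intelligent, context-aware energy management.

Details

Motivation: Traditional energy management relies heavily on numerical time series and ignores the significant predictive power contained in human-generated context (such as event schedules and user intentions). OpenCEM aims to fill this gap.

Result: The paper demonstrates the platform's utility through practical examples (context-aware load forecasting and online optimal battery charging control strategies), but the abstract reports no quantitative comparison on specific benchmarks and makes no state-of-the-art claim.

Insight: The novelty is the first platform that pairs a language-rich dataset with a modular simulator and natively processes multi-modal context, providing a high-fidelity environment for developing new control algorithms and prediction models, including those based on large language models.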

Abstract: Addressing the critical need for intelligent, context-aware energy management in renewable systems, we introduce the OpenCEM Simulator and Dataset: the first open-source digital twin explicitly designed to integrate rich, unstructured contextual information with quantitative renewable energy dynamics. Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions). OpenCEM bridges this gap by offering a unique platform comprising both a meticulously aligned, language-rich dataset from a real-world PV-and-battery microgrid installation and a modular simulator capable of natively processing this multi-modal context. The OpenCEM Simulator provides a high-fidelity environment for developing and validating novel control algorithms and prediction models, particularly those leveraging Large Language Models. We detail its component-based architecture, hybrid data-driven and physics-based modelling capabilities, and demonstrate its utility through practical examples, including context-aware load forecasting and the implementation of online optimal battery charging control strategies. By making this platform publicly available, OpenCEM aims to accelerate research into the next generation of intelligent, sustainable, and truly context-aware energy systems.


cs.AI [Back]

[121] ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning cs.AI | cs.CL

Xuan Xiong, Huan Liu, Li Gu, Zhixiang Chi, Yue Qiu

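TL;DR: This paper proposes Entropy Trend Reward (ETR), a new method for optimizing chain-of-thought (CoT) reasoning in large language models. The authors find that reasoning efficiency is closely tied to the trajectory of uncertainty: CoTs with a dominant downward entropy trend are shorter. ETR is a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. It is integrated into Group Relative Policy Optimization (GRPO) and validated on multiple reasoning models and benchmarks.

Details

Motivation: Existing methods shorten CoTs with length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. The authors find that this assumption does not fully hold: reasoning efficiency is governed by the trajectory of uncertainty. The paper therefore targets excessively long, inefficient reasoning traces by focusing on the trend of entropy rather than its absolute level.

Result: Across several challenging benchmarks, ETR achieves a superior accuracy-efficiency tradeoff. Specifically, across four benchmarks it improves the accuracy of DeepSeek-R1-Distill-7B by 9.9% while reducing CoT length by 67%.

Insight: The claimed novelty is the link between reasoning efficiency and the trajectory of uncertainty (the entropy trend), and the trajectory-aware ETR objective built on it. Viewed objectively, the core contribution is abandoning the implicit assumption that low uncertainty throughout is optimal and instead modeling and optimizing the dynamics of entropy to steer generation toward more efficient reasoning paths, which offers a finer-grained perspective on optimizing LLM reasoning.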

Abstract: Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR
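
The idea of rewarding a downward entropy trend while tolerating small local increases can be illustrated with a toy score. The function below is a hypothetical stand-in, not the ETR objective from the paper (whose exact form is not given in the abstract): it measures the fraction of consecutive steps at which entropy does not rise by more than a `slack` margin.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_trend_score(entropies, slack=0.0):
    """Fraction of steps where entropy falls, or rises by at most `slack`.

    `slack` > 0 tolerates small upward moves, loosely mirroring the idea of
    allowing limited local exploration. Returns 0.0 for fewer than two steps.
    """
    if len(entropies) < 2:
        return 0.0
    ok = sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur <= prev + slack)
    return ok / (len(entropies) - 1)

# Hypothetical per-step entropies of a reasoning trace.
trace = [2.0, 1.5, 1.6, 1.0, 0.5]
strict = entropy_trend_score(trace)              # counts the 1.5 -> 1.6 rise as a violation
tolerant = entropy_trend_score(trace, slack=0.2)  # tolerates that small rise
print(strict, tolerant)
```

In a GRPO-style setup, a score of this kind would be one term in the per-sample reward alongside task correctness; how the two are weighted is a design choice the abstract does not describe.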


[122] PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection cs.AI | cs.CL

Siyuan Cheng, Bozhong Tian, YanChao Hao, Zheng Wei

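TL;DR: This paper proposes PRISM-MCTS, a new reasoning framework that addresses the inefficiency and computational redundancy of existing Monte Carlo Tree Search (MCTS) reasoning methods, which share no information across trajectories. Inspired by human parallel thinking and reflection, the framework integrates a Process Reward Model (PRM) with a dynamic shared memory that captures heuristics and fallacies, reinforcing successful strategies and pruning error-prone branches.

Details

Motivation: Existing MCTS-based reasoning methods usually treat each search trajectory as isolated, with no information sharing, which causes inefficiency and computational redundancy. The paper aims to build a more efficient reasoning framework by imitating human parallel thinking and metacognitive reflection.

Result: Empirical evaluation on several reasoning benchmarks confirms the effectiveness of PRISM-MCTS. In particular, on GPQA it halves the number of trajectories required while outperforming MCTS-RAG and Search-o1.

Insight: The core novelty is combining a Process Reward Model (PRM) with a dynamic shared memory so that “heuristics” and “fallacies” are shared and reused across search trajectories. This offers a way to scale inference by reasoning judiciously rather than exhaustively, and the data-efficient PRM training strategy is also instructive for few-shot settings.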

Abstract: The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations. To address these limitations, we propose PRISM-MCTS, a novel reasoning framework that draws inspiration from human parallel thinking and reflective processes. PRISM-MCTS integrates a Process Reward Model (PRM) with a dynamic shared memory, capturing both “Heuristics” and “Fallacies”. By reinforcing successful strategies and pruning error-prone branches, PRISM-MCTS effectively achieves refinement. Furthermore, we develop a data-efficient training strategy for the PRM, achieving high-fidelity evaluation under a few-shot regime. Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, it halves the trajectory requirements on GPQA while surpassing MCTS-RAG and Search-o1, demonstrating that it scales inference by reasoning judiciously rather than exhaustively. (Published 06 Apr 2026, ACL 2026 Findings; keywords: Efficient/Low-Resource Methods for NLP, Generation, Question Answering; license: CC BY 4.0.)


[123] Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning cs.AI | cs.CL

Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

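TL;DR: This paper proposes GMRL-BD, a new algorithm that, given only black-box access to a large language model (LLM) and specific query constraints, identifies the topics on which the LLM is likely to give untrustworthy (biased, ideologized, or incorrect) answers, that is, its untrustworthy boundary. The algorithm builds on a general knowledge graph derived from Wikipedia and uses multi-agent reinforcement learning to efficiently locate topic nodes where the LLM is likely to produce biased answers. Experiments show that the boundary can be detected with only a limited number of LLM queries. The authors also release a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, with labels for the topics on which each model is likely to be biased.

Details

Motivation: LLMs answer questions well across a wide range of topics but sometimes produce biased, ideologized, or incorrect responses. Without a clear understanding of the topics on which an LLM's answers can be trusted, its practical use is limited. A method is therefore needed to efficiently identify the untrustworthy topic boundary with only black-box access.

Result: Experiments show that GMRL-BD efficiently detects an LLM's untrustworthy boundary with a limited number of queries. The authors also release a new dataset covering several popular LLMs (such as Llama2 and Vicuna), labeled with the topics on which each LLM is likely to be biased.

Insight: The novelty is a black-box detection framework (GMRL-BD) that combines a knowledge graph with multi-agent reinforcement learning to systematically explore and identify an LLM's untrustworthy topic boundary without access to model internals. Viewed objectively, the method formalizes untrustworthiness detection as a search problem on a knowledge graph and uses multi-agent cooperation to improve exploration efficiency, providing a scalable automated tool for evaluating LLM reliability. The released dataset is also a useful benchmark resource for the community.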

Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (some nodes in the KG) where the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.


[124] LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo cs.AI | cs.CL | cs.GT | cs.LG | cs.MA

Ojas Jain, Dhruv Kumar

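TL;DR: This paper introduces LudoBench, a benchmark for evaluating the strategic reasoning of large language models in the board game Ludo. The benchmark contains 480 handcrafted board scenarios covering 12 behavioral decision categories and provides a full simulator supporting several agent types. All tested models agree with the game-theory baseline only 40-46% of the time, fall into two flawed behavioral archetypes, and are sensitive to prompt framing.

Details

Motivation: To evaluate LLM strategic decision-making in a board game (Ludo) that involves randomness, multi-agent interaction, and complex planning; existing benchmarks may not adequately capture strategic reasoning under this kind of uncertainty.

Result: Six models from four model families are evaluated on LudoBench. All models agree with the game-theory baseline, which is based on Expectiminimax search, only 40-46% of the time. Model behavior falls into two flawed archetypes, “finishers” and “builders”, and on identical board states a history-conditioned “grudge” framing causes measurable behavioral shifts.

Insight: The novelty is a lightweight, interpretable benchmark (LudoBench) that systematically evaluates LLM strategic reasoning through “spot scenarios”, each isolating a specific strategic choice. Viewed objectively, the key contributions are using a game-theory agent as a strategic ceiling, identifying behavioral archetypes, and exposing prompt sensitivity, which together provide a framework for understanding the fragility of LLM decision-making under uncertainty.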

Abstract: We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
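
For readers unfamiliar with Expectiminimax, the sketch below shows the depth-limited recursion on a toy tree: max and min nodes pick the best and worst child, and chance nodes take the probability-weighted average over outcomes such as dice rolls. The node format, probabilities, and payoffs are hypothetical and unrelated to the LudoBench simulator's actual state representation or evaluation function.

```python
def expectiminimax(node, depth):
    """Depth-limited Expectiminimax value of a toy game-tree node.

    A node is a dict with "kind" in {"leaf", "max", "min", "chance"}. Leaves
    carry "value"; chance nodes hold (probability, child) pairs. When the
    depth budget runs out, the node's own "value" (default 0.0) serves as a
    heuristic estimate.
    """
    kind = node["kind"]
    if kind == "leaf" or depth == 0:
        return node.get("value", 0.0)
    if kind == "max":
        return max(expectiminimax(child, depth - 1) for child in node["children"])
    if kind == "min":
        return min(expectiminimax(child, depth - 1) for child in node["children"])
    if kind == "chance":
        return sum(p * expectiminimax(child, depth - 1) for p, child in node["children"])
    raise ValueError(f"unknown node kind: {kind!r}")

def leaf(value):
    return {"kind": "leaf", "value": value}

# Two candidate moves, each followed by a random outcome (toy probabilities).
safe_move = {"kind": "chance", "children": [(0.5, leaf(10.0)), (0.5, leaf(0.0))]}
risky_move = {"kind": "chance", "children": [(0.25, leaf(28.0)), (0.75, leaf(0.0))]}
root = {"kind": "max", "children": [safe_move, risky_move]}

best_value = expectiminimax(root, depth=2)
print(best_value)
```

Here the risky move has the higher expected value (7.0 versus 5.0), so a purely greedy preference for the more likely payoff would disagree with the search result, which is the kind of gap an agreement rate against such a baseline measures.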


[125] Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration cs.AI | cs.CL

Yi Yuan, Xuhong Wang, Shanzhe Lei

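TL;DR: This paper proposes a new deep research agent that uses progressive confidence estimation and calibration to generate trustworthy research reports automatically. The system uses a deliberative search model that combines deep retrieval with multi-hop reasoning to ground each claim in verifiable evidence and assign it a confidence score, improving the transparency and trustworthiness of the generated content.

Details

Motivation: Existing agent-based report generation systems lack an effective assessment of content trustworthiness. In open-ended research, where ground-truth answers are unavailable, current subjective evaluation frameworks cannot measure the epistemic confidence of generated content, which makes calibration difficult and leaves users exposed to misleading or hallucinated information.

Result: Experiments and case studies show that the method substantially improves report interpretability and significantly increases user trust.

Insight: The novelty is integrating progressive confidence estimation and calibration into the report generation pipeline: a deliberative search model (deep retrieval plus multi-hop reasoning) supplies evidence and confidence scores for generated claims, enabling trustworthy report generation and greater transparency in open-ended settings without ground-truth answers.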

Abstract: As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains. While these agents promise to streamline information synthesis and knowledge exploration, existing evaluation frameworks, typically based on subjective dimensions, fail to capture a critical aspect of report quality: trustworthiness. In open-ended research scenarios where ground-truth answers are unavailable, current evaluation methods cannot effectively measure the epistemic confidence of generated content, making calibration difficult and leaving users susceptible to misleading or hallucinated information. To address this limitation, we propose a novel deep research agent that incorporates progressive confidence estimation and calibration within the report generation pipeline. Our system leverages a deliberative search model, featuring deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combined with a carefully designed workflow, this approach produces trustworthy reports with enhanced transparency. Experimental results and case studies demonstrate that our method substantially improves interpretability and significantly increases user trust.


[126] Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis cs.AI | cs.CL

Michael Cuccarese

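TL;DR: This paper proposes “epistemic blinding”, an inference-time protocol for auditing how much a model's parametric knowledge contaminates data-driven reasoning in LLM-assisted analysis. In tasks such as drug target prioritization, the method replaces entity identifiers with anonymous codes before prompting the LLM and compares the output with an unblinded control, quantifying how much of the output comes from the supplied data rather than from memorized priors.

Details

Motivation: In LLM-assisted analysis, outputs silently mix data-driven inference with priors about named entities memorized from training data, and the mix cannot be separated from a single output. The paper aims to restore auditability of the analytical process.

Result: In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. In S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds.

Insight: The novelty is a simple, deployable inference-time audit protocol that quantifies how much an LLM's output depends on the supplied data versus parametric knowledge. It does not claim to improve result quality, but it provides a key tool for assessing the faithfulness of the analytical process. The protocol is released as an open-source tool and as a Claude Code skill, lowering the barrier to adoption.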

Abstract: This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data-driven inference with memorized priors about named entities - and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model’s training memory. Epistemic blinding is a simple inference-time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model’s parametric knowledge. The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open-source tool and as a Claude Code skill that enables one-command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
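
The blinding step and the blinded-versus-unblinded comparison can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released tool: the function names, the `ENT_001` code format, and the top-k change metric are hypothetical, and the gene symbols are used only as example identifiers.

```python
import re

def blind(text, entities, prefix="ENT"):
    """Replace each entity name in `text` with an anonymous code.

    Longer names are matched first so that a name containing another name is
    not partially replaced. Returns the blinded text and the name -> code map.
    Plain substring matching is used, so word boundaries are not enforced.
    """
    if not entities:
        return text, {}
    ordered = sorted(set(entities), key=lambda name: (-len(name), name))
    mapping = {name: f"{prefix}_{i:03d}" for i, name in enumerate(ordered, start=1)}
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping

def unblind(ranking, mapping):
    """Map anonymous codes in a ranking back to the original names."""
    inverse = {code: name for name, code in mapping.items()}
    return [inverse.get(item, item) for item in ranking]

def topk_change(unblinded, blinded, k):
    """Fraction of the unblinded top-k that is absent from the blinded top-k."""
    kept = len(set(unblinded[:k]) & set(blinded[:k]))
    return 1.0 - kept / k

prompt = "EGFR: expression 0.9, essentiality 0.4. KRAS: expression 0.7, essentiality 0.8."
blinded_prompt, codes = blind(prompt, ["EGFR", "KRAS"])
print(blinded_prompt)
```

In use, the same prompt would be sent once blinded and once unblinded, the blinded ranking mapped back with `unblind`, and `topk_change` reported as the share of the top-k list attributable to entity identity rather than to the supplied data.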


[127] ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments cs.AI | cs.CL

Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma

TL;DR: 本文提出了ACE-Bench,一个用于评估智能体(Agent)的新型基准测试,旨在解决现有基准测试中环境交互开销高、任务视野和难度分布不平衡的问题。该基准基于一个统一的网格规划任务构建,通过可扩展的视野(H)和可控的难度(B)两个正交轴进行细粒度控制,并采用轻量级环境设计,所有工具调用均通过静态JSON文件解析,实现了快速、可复现的评估。

Details

Motivation: 解决现有智能体基准测试的两个关键局限性:高昂的环境交互开销(占总评估时间高达41%)以及不平衡的任务视野和难度分布,这使得总体得分不可靠。

Result: 实验验证了H和B能可靠地控制任务视野和难度,且ACE-Bench表现出很强的领域一致性和模型区分度。在6个领域上对13个不同规模和家族的模型进行了全面实验,揭示了显著的跨模型性能差异,并确认了ACE-Bench能为智能体推理提供可解释和可控的评估。

Insight: 创新点在于提出了一个基于统一网格规划任务的基准,通过两个正交维度(可扩展视野H和可控难度B)实现细粒度、可配置的评估,并设计了轻量级环境(通过静态JSON文件解析工具调用)以消除设置开销,实现快速、可复现的评估,尤其适合训练时验证。

Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that H and B provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.


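[128] Part-Level 3D Gaussian Vehicle Generation with Joint and Hinge Axis Estimation cs.AI | cs.CV | cs.RO

Shiyao Qian, Yuan Ren, Dongfeng Bai, Bingbing Liu

TL;DR: This paper proposes a framework that generates an animatable 3D Gaussian vehicle model from a single image or sparse multi-view input. A part-edge refinement module and a kinematic reasoning head address two weaknesses of static generative models, distortion at part boundaries and missing motion parameters, enabling part-level reconstruction and animation.

Details

Motivation: Current autonomous-driving simulators typically model vehicles as rigid assets and cannot capture part-level articulation such as wheel steering or door opening. CAD-based pipelines are limited by library coverage and fixed templates and struggle to reconstruct in-the-wild instances faithfully, so methods that generate animatable vehicle representations are needed.

Result: The part-edge refinement module enforces exclusive Gaussian ownership to reduce distortion during animation, and the kinematic reasoning head predicts the joint positions and hinge axes of movable parts. Together they provide the motion parameters needed for more realistic part-aware animation in simulation.

Insight: The novelty lies in combining static 3D generation with kinematic parameter prediction. Jointly handling part segmentation and kinematic reasoning addresses the part-boundary distortion and missing motion parameters of existing methods, and provides animatable part-level vehicle generation for autonomous-driving simulation.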

Abstract: Simulation is essential for autonomous driving, yet current frameworks often model vehicles as rigid assets and fail to capture part-level articulation. With perception algorithms increasingly leveraging dynamics such as wheel steering or door opening, realistic simulation requires animatable vehicle representations. Existing CAD-based pipelines are limited by library coverage and fixed templates, preventing faithful reconstruction of in-the-wild instances. We propose a generative framework that, from a single image or sparse multi-view input, synthesizes an animatable 3D Gaussian vehicle. Our method addresses two challenges: (i) large 3D asset generators are optimized for static quality but not articulation, leading to distortions at part boundaries when animated; and (ii) segmentation alone cannot provide the kinematic parameters required for motion. To overcome this, we introduce a part-edge refinement module that enforces exclusive Gaussian ownership and a kinematic reasoning head that predicts joint positions and hinge axes of movable parts. Together, these components enable faithful part-aware simulation, bridging the gap between static generation and animatable vehicle models.


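[129] Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models cs.AI | cs.CV

Keuntae Kim, Mingyu Kang, Yong Suk Choi

TL;DR: This paper targets two key problems that diffusion multimodal large language models (dMLLMs) exhibit under Chain-of-Thought reasoning: generating the final answer prematurely and relying too little on visual prompts at early timesteps. It proposes two methods, Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens at later positions during early timesteps to delay premature answers and encourage progressive reasoning across timesteps; VRG amplifies visual grounding signals to strengthen alignment with visual evidence. Experiments across several dMLLMs show up to 7.5% higher accuracy and more than 3x speedup compared with reasoning that uses four times as many diffusion steps.

Details

Motivation: When combined with Chain-of-Thought reasoning, dMLLMs generate the final answer too early and depend too little on visual input at early timesteps. This degrades reasoning performance and leaves visual information underused for grounded reasoning.

Result: Extensive experiments across several dMLLMs show up to 7.5% higher accuracy and more than 3x speedup compared with reasoning that uses four times as many diffusion steps, a better balance of performance and efficiency.

Insight: The novelty is twofold: PSP delays premature answer generation to promote progressive reasoning, and VRG strengthens alignment with visual grounding signals. Together they suggest a reusable approach to improving visual grounding in diffusion-based multimodal reasoning: combine timestep-aware and position-aware penalties with guidance.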

Abstract: Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model’s alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.
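The two mechanisms can be illustrated with a small sketch. The abstract does not give the exact formulas, so both functions below are assumptions: `psp_penalty` uses a hypothetical linear schedule that is largest for late positions at early steps, and `vrg_logits` applies the standard classifier-free guidance extrapolation, treating the image-free prediction as the unconditional branch.

```python
def psp_penalty(position, seq_len, step, total_steps, strength=1.0):
    """Hypothetical penalty: grows with token position, shrinks to zero by the final step."""
    rel_pos = position / max(seq_len - 1, 1)             # 0 at the first token, 1 at the last
    remaining = 1.0 - step / max(total_steps - 1, 1)     # 1 at the first step, 0 at the last
    return strength * rel_pos * remaining

def vrg_logits(logits_with_image, logits_without_image, scale=1.5):
    """CFG-style guidance: extrapolate away from the image-free prediction."""
    return [u + scale * (c - u)
            for c, u in zip(logits_with_image, logits_without_image)]
```

With `scale=1.0` the guided logits equal the image-conditioned logits; larger values amplify whatever the image contributes. The penalty would be subtracted from a token's unmasking score before the sampler picks which positions to commit at a given step.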


cs.LG

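[130] LLMs Should Express Uncertainty Explicitly cs.LG | cs.AI | cs.CL

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

TL;DR: This paper studies how to make large language models (LLMs) express uncertainty explicitly, as an interface for control. It compares two complementary interfaces: a global interface, in which the model outputs a calibrated confidence score for its final answer, and a local interface, in which the model emits an explicit marker when it enters a high-risk state during reasoning. The global interface improves calibration and reduces overconfident errors, while the local interface improves wrong-answer coverage and serves as an effective retrieval trigger. Together they suggest that uncertainty in LLMs should be trained as task-matched communication.

Details

Motivation: LLMs are increasingly deployed in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to be estimated after generation rather than a signal the model is trained to express. This paper studies uncertainty as an interface for control.

Result: Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger.

Insight: The novelty is treating uncertainty as a trainable, explicit control interface rather than an implicitly estimated quantity, with two complementary interfaces: global (final-answer confidence) and local (markers during reasoning). The core insight is that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed. This offers a new design approach for deploying LLMs reliably.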

Abstract: Large language models are increasingly used in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to estimate after generation rather than a signal the model is trained to express. We instead study uncertainty as an interface for control. We compare two complementary interfaces: a global interface, where the model verbalizes a calibrated confidence score for its final answer, and a local interface, where the model emits an explicit marker during reasoning when it enters a high-risk state. These interfaces provide different but complementary benefits. Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger. Our findings further show that the two interfaces work differently internally: verbal confidence mainly refines how existing uncertainty is decoded, whereas reasoning-time signaling induces a broader late-layer reorganization. Together, these results suggest that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed.
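To make "calibration" and "retrieval trigger" concrete, here is a minimal sketch of two standard pieces: expected calibration error (ECE) over verbalized confidences, and a threshold rule that calls retrieval when confidence is low. The equal-width binning and the `0.7` threshold are illustrative assumptions, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            ece += len(items) / total * abs(avg_conf - accuracy)
    return ece

def should_retrieve(confidence, threshold=0.7):
    """Global-interface controller: retrieve only when the stated confidence is low."""
    return confidence < threshold
```

A model that says 0.95 on two answers and gets one wrong has an ECE of about 0.45 on that pair; a well-calibrated model drives this toward zero, which is what makes the threshold rule meaningful.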


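[131] Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement cs.LG | cs.AI | cs.CL

Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao

TL;DR: This paper proposes Latent Semantic Enhancement Multi-Token Prediction (LSE-MTP), which addresses the structural hallucinations that standard Multi-Token Prediction (MTP) can induce when training LLMs. By anchoring predictions to ground-truth hidden-state trajectories, the method bridges the gap between discrete tokens and continuous state representations. Experiments on synthetic graphs and the real-world Manhattan Taxi Ride dataset show improved representation alignment, fewer hallucinations, and greater robustness.

Details

Motivation: The paper asks whether LLMs can form coherent internal world models. Standard MTP helps learn structured representations better than conventional Next-Token Prediction (NTP), but its discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints, a problem the authors call structural hallucination.

Result: Experiments on synthetic graphs and the real-world Manhattan Taxi Ride dataset show that LSE-MTP improves representation alignment, reduces structural hallucinations, and increases robustness to perturbations.

Insight: The paper analyzes the gradient inductive bias of MTP theoretically, showing that gradient coupling induces representational contractivity, which promotes convergence toward internal belief states. To fix MTP's structural hallucinations, it proposes LSE-MTP, which enhances latent semantics by anchoring predictions to ground-truth hidden-state trajectories, building a more reliable bridge between discrete tokens and continuous state representations.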

Abstract: Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.


eess.IV

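[132] CI-ICM: Channel Importance-driven Learned Image Coding for Machines eess.IV | cs.CV | cs.MM

Yun Zhang, Junle Liu, Huan Zhang, Zhaoqing Pan, Gangyi Jiang

TL;DR: This paper proposes Channel Importance-driven learned Image Coding for Machines (CI-ICM), which aims to maximize machine vision task performance under a given bitrate constraint. It quantifies channel importance, groups and scales feature channels non-uniformly, models context based on importance, and adapts channels to specific tasks, optimizing the feature representation for downstream tasks.

Details

Motivation: Traditional human vision-centric image compression performs poorly on machine vision-centric compression because visual properties and feature characteristics differ. The paper closes this gap with an image coding scheme optimized specifically for machine vision tasks.

Result: On the COCO2017 dataset, CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate each module, and a computational complexity analysis supports its practicality.

Insight: The novelty is quantifying channel importance and using it to guide the grouping, scaling, and bit allocation of feature channels, tailoring the compression strategy to machine vision tasks. The method establishes a feature-channel optimization framework for machine vision-centric compression, bridging image coding and machine perception.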

Abstract: Traditional human vision-centric image compression methods are suboptimal for machine vision-centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.


cs.DC

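[133] CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics cs.DC | cs.CV | cs.LG

Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin

TL;DR: CoStream is a video streaming analytics system that uses video codec metadata to jointly optimize video decoding, visual processing, and LLM prefilling, reducing multimodal inference cost. It needs no offline training, processes streams end to end with low resource use, and on dynamic real-time streams substantially raises throughput while cutting GPU compute.

Details

Motivation: Existing video streaming analytics systems that reduce multimodal inference cost either optimize only the vision transformer or only the LLM, leaving end-to-end opportunities untapped, or identify redundancy through offline profiling or costly online computation, which suits dynamic real-time streams poorly. CoStream addresses both problems by exploiting the spatiotemporal structure the codec has already extracted, enabling low-cost online optimization.

Result: CoStream achieves up to 3x higher throughput and up to 87% less GPU compute than state-of-the-art baselines while maintaining competitive accuracy (F1 drops by only 0-8%).

Insight: The novelty is treating codec metadata as a low-cost runtime signal that drives both patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling. Both are fully online with no offline training, and operating directly on the compressed bitstream also reduces transmission overhead.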

Abstract: Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
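As a rough illustration of codec-guided patch pruning, the sketch below keeps only patches whose codec-reported motion is large enough, and keeps everything on keyframes. The abstract does not specify which codec signals CoStream uses or how it thresholds them, so the per-patch `motion_magnitude` input, the keyframe rule, and the `0.5` threshold are all assumptions for the example.

```python
def select_patches(motion_magnitude, is_keyframe, threshold=0.5):
    """Return the indices of patches to pass to the vision encoder.

    motion_magnitude: one hypothetical per-patch value derived from codec metadata
    (for example, motion-vector length); static patches are assumed reusable.
    """
    if is_keyframe:
        # A keyframe carries no reference to earlier frames, so nothing can be reused.
        return list(range(len(motion_magnitude)))
    return [i for i, m in enumerate(motion_magnitude) if m >= threshold]
```

The pruned indices are the ones whose earlier encodings (and, downstream, cached key-value entries) would be reused rather than recomputed.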


cs.CE

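[134] INTERACT: An AI-Driven Extended Reality Framework for Accesible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition cs.CE | cs.AI | cs.CL | cs.CV | cs.ET

Nikolaos D. Tantaroudas, Andrew J. McCracken, Ilias Karachalios, Evangelos Papatheou

TL;DR: This paper presents INTERACT, an AI-driven Extended Reality (XR) platform that offers immersive, accessible communication for deaf, hard-of-hearing, and multilingual users by integrating real-time speech-to-text, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition. The platform is deployed on Meta Quest 3 headsets, and a two-phase pilot evaluation reports high user satisfaction and technical performance.

Details

Motivation: Video conferencing platforms offer limited support for deaf, hard-of-hearing, and multilingual users, while global demand for hearing rehabilitation is growing. XR technology can overcome the high cost, limited availability, and logistical barriers of conventional accessibility measures.

Result: The pilot evaluation reports 92% user satisfaction, transcription accuracy above 85%, 90% emotion-detection precision, a mean overall experience rating of 4.6 out of 5.0, and 90% of participants willing to take part in further testing. The results highlight potential for advancing accessibility in educational, cultural, and professional settings.

Insight: The novelty is integrating several AI components (Whisper, NLLB, RoBERTa, Google MediaPipe) with an XR framework (CORTEX2) into a real-time, immersive accessible-communication tool, in which 3D-avatar sign language rendering and emotion recognition make interaction more natural and inclusive.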

Abstract: Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].


cs.IR

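[135] Learning to Retrieve from Agent Trajectories cs.IR | cs.AI | cs.CL

Yuqi Zhou, Sunhao Dai, Changle Qu, Liang Pang, Jun Xu

TL;DR: This paper proposes a new training paradigm for retrieval models, learning to retrieve from agent trajectories (LRAT), to address the mismatch between retrieval models trained on human interaction data and the behavior of LLM-driven search agents. The LRAT framework mines behavioral signals from multi-step agent trajectories (browsing actions, unbrowsed rejections, and post-browse reasoning traces) to produce high-quality retrieval supervision, and incorporates relevance intensity through weighted optimization.

Details

Motivation: With the rise of LLM-based search agents, retrieval systems are increasingly consumed by agents rather than humans and are embedded in multi-turn reasoning and action loops. Retrieval models trained on human interaction data are fundamentally mismatched with how agents issue queries and consume results, so retrieval models should be trained directly on agent interaction data.

Result: Extensive experiments on in-domain and out-of-domain deep research benchmarks show that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across agent architectures and scales.

Insight: The paper establishes agent trajectories as a practical, scalable source of supervision and proposes a framework for mining supervision signals from them. This points to a new direction for training retrieval models in the era of agentic search: using agent-specific behavioral signals (browsing, rejection, reasoning) to replace or complement traditional human click signals.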

Abstract: Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.
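One way to read "incorporates relevance intensity through weighted optimization" is a contrastive loss whose per-example weight reflects how strongly the trajectory signals that a document was useful. The sketch below is an assumption about the general shape, not LRAT's actual objective: it uses a standard InfoNCE loss with one positive (for example, a browsed document) and several negatives (for example, unbrowsed rejections), scaled by a hypothetical `weight`.

```python
import math

def weighted_infonce(pos_score, neg_scores, weight=1.0, temperature=0.05):
    """InfoNCE loss for one query, scaled by a relevance-intensity weight."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return weight * (log_z - logits[0])
```

A document whose post-browse reasoning trace shows it was decisive could receive a larger `weight` than one that was merely opened, so it contributes more to the gradient.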


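[136] CURE: Circuit-Aware Unlearning for LLM-based Recommendation cs.IR | cs.AI | cs.CL | cs.LG

Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Yunzhi Yao

TL;DR: This paper proposes CURE, a circuit-aware unlearning framework for LLM-based recommender systems. It disentangles model components into functionally distinct subsets and updates them selectively, addressing the unstable optimization and degraded model utility that gradient conflicts cause in existing unlearning methods.

Details

Motivation: As privacy regulations tighten, incorporating user data into LLM-based recommendation introduces significant privacy risks. Existing unlearning methods typically formulate unlearning as a weighted combination of forgetting and retaining objectives, which causes gradient conflicts and unstable optimization and leaves the procedure opaque.

Result: Experiments on real-world datasets show that CURE achieves more effective unlearning than existing baselines.

Insight: The novelty is disentangling model components into functionally distinct groups (forget-specific, retain-specific, and task-shared modules) and, based on circuit analysis, applying a function-specific update rule to each group. This mitigates gradient conflicts and makes unlearning more transparent and effective.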

Abstract: Recent advances in large language models (LLMs) have opened new opportunities for recommender systems by enabling rich semantic understanding and reasoning about user interests and item attributes. However, as privacy regulations tighten, incorporating user data into LLM-based recommendation (LLMRec) introduces significant privacy risks, making unlearning algorithms increasingly crucial for practical deployment. Despite growing interest in LLMRec unlearning, most existing approaches formulate unlearning as a weighted combination of forgetting and retaining objectives while updating model parameters in a uniform manner. Such formulations inevitably induce gradient conflicts between the two objectives, leading to unstable optimization and resulting in either ineffective unlearning or severe degradation of model utility. Moreover, the unlearning procedure remains largely black-box, undermining its transparency and trustworthiness. To tackle these challenges, we propose CURE, a circuit-aware unlearning framework that disentangles model components into functionally distinct subsets and selectively updates them. Here, a circuit refers to a computational subgraph that is causally responsible for task-specific behaviors. Specifically, we extract the core circuits underlying item recommendation and analyze how individual modules within these circuits contribute to the forget and retain objectives. Based on this analysis, these modules are categorized into forget-specific, retain-specific, and task-shared groups, each subject to function-specific update rules to mitigate gradient conflicts during unlearning. Experiments on real-world datasets show that our approach achieves more effective unlearning than existing baselines.


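[137] Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges cs.IR | cs.AI | cs.CL | cs.CV | cs.LG

Rong Lu, Hao Liu, Song Hou

TL;DR: This paper compares embedding-based methods and generative models on classifying geoscience technical documents, using a multi-disciplinary benchmark dataset to evaluate the trade-offs between accuracy, stability, and computational cost. Generative Vision-Language Models (VLMs) such as Qwen2.5-VL with Chain-of-Thought prompting reach higher zero-shot accuracy (82%) than state-of-the-art multimodal embedding models such as QQMM (63%). Supervised fine-tuning can improve VLM performance but is sensitive to training data imbalance.

Details

Motivation: The paper compares the two mainstream approaches, embedding-based and generative, on LLM-driven document classification in the geoscience domain, evaluating their trade-offs in accuracy, stability, and computational cost to inform practical model selection.

Result: In the zero-shot setting, the generative VLM Qwen2.5-VL with Chain-of-Thought prompting reaches 82% accuracy on the benchmark dataset, well above the state-of-the-art multimodal embedding model QQMM (63%). Supervised fine-tuning can further improve VLM performance, but the results are sensitive to training data imbalance.

Insight: The paper systematically compares embedding-based and generative methods on domain-specific document classification and highlights the boost that Chain-of-Thought prompting gives to the zero-shot performance of generative VLMs. It shows the potential of generative methods in zero-shot settings and the importance of data balance in supervised fine-tuning, which informs model selection and data strategy in deployment.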

Abstract: This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-Language Models (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.


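[138] CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation cs.IR | cs.CL | cs.LG

Siddharth Jain, Venkat Narayan Vedam

TL;DR: This paper proposes CUE-R, a framework for evaluating the utility of individual evidence items in retrieval-augmented generation (RAG), going beyond conventional evaluation that looks only at the final answer. It perturbs evidence items (remove, replace, duplicate) and measures changes in correctness, faithfulness, and confidence error, revealing the specific role each item plays during inference.

Details

Motivation: Existing RAG evaluation focuses on final-answer quality, citation faithfulness, or answer-level attribution, but lacks intervention-based utility analysis of individual evidence items and so cannot directly measure their operational value during inference.

Result: Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 show that REMOVE and REPLACE substantially harm correctness and faithfulness and cause large trace shifts, whereas DUPLICATE is often redundant for the answer but not fully behaviorally neutral. A zero-retrieval control confirms that these effects stem from degrading meaningful retrieval, and a two-support ablation shows that evidence items can interact non-additively.

Insight: The novelty is CUE-R, a lightweight intervention framework that quantifies the operational utility of individual evidence items from observable retrieval-use traces, together with an evidence-role taxonomy. The method offers a practical complement to RAG evaluation and exposes important evidence effects that answer-only evaluation misses.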

Abstract: As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
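The three perturbation operators are simple list edits, and the utility of an item is the score change they cause. The sketch below shows that structure only; `answer_fn` and `score_fn` are hypothetical stand-ins for the RAG pipeline and a correctness metric, and the paper's faithfulness, confidence-error, and trace-divergence measurements are not reproduced here.

```python
def remove(evidence, i):
    """REMOVE: drop the i-th evidence item."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence, i, distractor):
    """REPLACE: swap the i-th evidence item for a distractor passage."""
    return evidence[:i] + [distractor] + evidence[i + 1:]

def duplicate(evidence, i):
    """DUPLICATE: insert a second copy of the i-th item right after it."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]

def utility_delta(answer_fn, score_fn, question, evidence, perturbed):
    """Score change caused by a perturbation; negative means the item was helping."""
    baseline = score_fn(answer_fn(question, evidence))
    return score_fn(answer_fn(question, perturbed)) - baseline
```

Applying `remove` to two supporting items at once, and comparing the result with the sum of the two single-removal deltas, is the kind of check that exposes the non-additive interaction the abstract describes.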


Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio

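TL;DR: This paper introduces JUÁ, a public benchmark for Brazilian legal information retrieval designed to support reproducible, comparable evaluation across heterogeneous legal text collections. It is not only a static dataset but also a continuous evaluation infrastructure, covering jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. The authors evaluate lexical, dense, and BM25-based reranking pipelines and show the benefit of domain adaptation on a specific subset.

Details

Motivation: Legal information retrieval in Portuguese lacks a benchmark for systematic evaluation: available datasets differ widely in document type, query style, and relevance definition, which makes reproducible, comparable evaluation difficult.

Result: Several retrieval methods are evaluated, including a domain-adapted Qwen embedding model. The benchmark is heterogeneous enough to distinguish retrieval paradigms and reveals substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially where lexical and institutional phrasing cues are strong.

Insight: The paper provides a comprehensive, continuous evaluation infrastructure for legal retrieval with unified protocols and metrics, shows that a domain-adapted embedding model is effective on a specific legal subtask, and confirms that traditional lexical methods such as BM25 remain competitive in certain legal retrieval scenarios.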

Abstract: Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.


cs.CR

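[140] BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents cs.CR | cs.CV

Bo Ma, Jinsong Wu, Weiqi Yan

TL;DR: This paper proposes BodhiPromptShield, a framework for suppressing the cross-stage propagation of prompt privacy risk in LLM/VLM agents. It detects sensitive information, routes it via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries, providing propagation-aware mediation. On the CPPB benchmark it lowers the privacy propagation rate across retrieval, memory, and tool stages.

Details

Motivation: Existing de-identification methods address document boundaries, but they do not handle the privacy risk that propagates when raw user content flows across stages in LLM/VLM agents: into retrieval queries, memory writes, tool calls, and logs.

Result: Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), the method reduces stage-wise propagation from 10.7% to 7.1%; PER reaches 9.3%, with AC of 0.94 and TSR of 0.92, outperforming generic de-identification.

Insight: The novelty is a policy-aware, propagation-aware mediation framework that treats the routing of sensitive information and the timing of delayed restoration as explicitly controlled security variables, rather than performing only static document-level de-identification.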

Abstract: In LLM/VLM agents, prompt privacy risk propagates beyond a single model call because raw user content can flow into retrieval queries, memory writes, tool calls, and logs. Existing de-identification pipelines address document boundaries but not this cross-stage propagation. We propose BodhiPromptShield, a policy-aware framework that detects sensitive spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries. Relative to enterprise redaction, this adds explicit propagation-aware mediation and treats restoration timing as a security variable. Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), stage-wise propagation is suppressed from 10.7% to 7.1% across retrieval, memory, and tool stages; PER reaches 9.3% with 0.94 AC and 0.92 TSR, outperforming generic de-identification. These are controlled systems results on CPPB rather than formal privacy guarantees or public-benchmark transfer claims. The project repository is available at https://github.com/mabo1215/BodhiPromptShield.git.


cs.CY

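[141] EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content cs.CY | cs.AI | cs.CL

Shuzhen Bi, Mingzi Zhang, Zhuoxuan Li, Xiaolong Wang, Keqian Li

TL;DR: This paper presents EduIllustrate, a benchmark for evaluating LLMs on generating interleaved text-diagram explanations for K-12 STEM problems. It comprises 230 problems across five subjects and three grade levels and introduces a sequential anchoring protocol to enforce visual consistency. Among ten LLMs evaluated, Gemini 3.0 Pro Preview performs best (87.8%), while Kimi-K2.5 is the most cost-efficient (80.8% at $0.12 per problem).

Details

Motivation: Evaluation of LLMs' educational capabilities is concentrated on question-answering and tutoring tasks. There is no evaluation of multimedia instructional content generation, in particular the ability to produce interleaved text-diagram explanations that combine geometrically accurate visuals with step-by-step reasoning.

Result: On EduIllustrate, Gemini 3.0 Pro Preview leads with 87.8%, and Kimi-K2.5 reaches 80.8% at $0.12 per problem, the best cost-efficiency. The sequential anchoring workflow improves Visual Consistency by 13% at 94% lower cost.

Insight: The contributions are a benchmark dedicated to interleaved text-diagram educational content generation, a sequential anchoring protocol that enforces cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory that covers both text and visual quality. LLM-as-judge reliability is high on the objective dimensions (ρ ≥ 0.83).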

Abstract: Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation: the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at $0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions ($\rho \geq 0.83$) while revealing limitations on subjective visual assessment.


cs.RO

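[142] StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing cs.RO | cs.AI | cs.CV

StarVLA Community

TL;DR: StarVLA is an open-source codebase for building Vision-Language-Action (VLA) models, aimed at the fragmentation of current VLA research across architectures, codebases, and evaluation protocols. It provides a modular backbone and action-head architecture that supports multiple VLM and world-model backbones and action-decoding paradigms, and it integrates reusable training strategies and a unified benchmark interface.

Details

Motivation: Current VLA methods are fragmented across incompatible architectures, codebases, and evaluation protocols, which hinders comparison and reproduction. The paper provides a unified, modular open-source framework that lowers the barrier to VLA research.

Result: The codebase integrates LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K and ships fully reproducible single-benchmark training recipes. Despite minimal data engineering, these recipes already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones.

Insight: The main novelty is a modular, pluggable abstraction in which the backbone and the action head can each be swapped independently. The framework also integrates unified training strategies and evaluation interfaces, giving VLA research a standardized, extensible "Lego-like" development platform that should ease comparative studies and rapid prototyping.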

Abstract: Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone–action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.
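
The shared abstraction described above, in which the backbone and the action head can be swapped independently, can be pictured with a small sketch. This is not StarVLA's actual API: every class and method name below (`Backbone`, `ActionHead`, `VLAPolicy`, `encode`, `decode`, and the toy implementations) is a hypothetical stand-in chosen for illustration, and the "features" are toy numbers rather than model outputs.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class Backbone(Protocol):
    """Hypothetical interface: observation + instruction -> feature vector."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Hypothetical interface: feature vector -> action vector."""

    def decode(self, features: Sequence[float]) -> List[float]: ...


class ToyVLMBackbone:
    def encode(self, image, instruction):
        # Toy features: mean pixel value and instruction length.
        return [sum(image) / len(image), float(len(instruction))]


class ToyWorldModelBackbone:
    def encode(self, image, instruction):
        # Toy features: max and min pixel value (the instruction is ignored).
        return [max(image), min(image)]


class ScaleHead:
    """Toy action head that scales every feature by a constant."""

    def __init__(self, scale: float):
        self.scale = scale

    def decode(self, features):
        return [self.scale * f for f in features]


@dataclass
class VLAPolicy:
    """Composes any backbone with any action head."""

    backbone: Backbone
    head: ActionHead

    def act(self, image, instruction):
        return self.head.decode(self.backbone.encode(image, instruction))


# Swapping the backbone leaves the head and the policy code untouched.
head = ScaleHead(2.0)
vlm_policy = VLAPolicy(ToyVLMBackbone(), head)
wm_policy = VLAPolicy(ToyWorldModelBackbone(), head)
```

Because `VLAPolicy` depends only on the two small interfaces, a new backbone or head can be added without changing existing components, which is the property the abstract attributes to the shared abstraction.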


[143] Final Report, Center for Computer-Integrated Surgical Systems and Technology, NSF ERC Cooperative Agreement EEC9731748, Volume 1 cs.RO | cs.CV

Russell H. Taylor, Gregory D. Hager, Ralph Etienne-Cummings, Eric Grimson, Ron Kikinis, Cameron Riviere

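TL;DR: This report summarizes how, in the ten years after its launch in 1998, the Engineering Research Center for Computer-Integrated Surgical Systems and Technology (CISST ERC), with NSF funding, helped drive the shift of medical robots from handling routine tasks to performing complex interventions. It also looks ahead to the broad impact on surgical accuracy, consistency, patient safety, and cost reduction.

Details

Motivation: The center's motivation is to bring data and technology together in clinical systems so as to fundamentally change how surgery and other medical procedures are carried out, answering the need for more accurate, consistent, safe, and economical interventions.

Result: The report gives no specific quantitative benchmark results. Qualitatively, it states that medical robotics has moved from the margins to the mainstream and can now perform highly sophisticated interventions, which indicates a marked increase in the field's technical maturity.

Insight: The novelty lies in using the Engineering Research Center model to systematically build the professional infrastructure that applies basic science and engineering to clinical practice. The core insight is that interdisciplinary integration (data, robotics, clinical practice) can drive fundamental change in the medical paradigm, with improvements ranging from surgical accuracy to the health of the overall healthcare system.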

Abstract: In the last ten years, medical robotics has moved from the margins to the mainstream. Since the Engineering Research Center for Computer-Integrated Surgical Systems and Technology was launched in 1998 with National Science Foundation funding, medical robots have been promoted from handling routine tasks to performing highly sophisticated interventions and related assignments. The CISST ERC has played a significant role in this transformation. And thanks to NSF support, the ERC has built the professional infrastructure that will continue our mission: bringing data and technology together in clinical systems that will dramatically change how surgery and other procedures are done. The enhancements we envision touch virtually every aspect of the delivery of care: more accurate procedures; more consistent, predictable results from one patient to the next; improved clinical outcomes; greater patient safety; reduced liability for healthcare providers; lower costs for everyone (patients, facilities, insurers, government); easier, faster recovery for patients; effective new ways to treat health problems; and healthier patients and a healthier system. The basic science and engineering the ERC is developing now will yield profound benefits for all concerned about health care, from government agencies to insurers, from clinicians to patients to the general public. All will experience the healing touch of medical robotics, thanks in no small part to the work of the CISST ERC and its successors.


[144] CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO | cs.CV

Li Kang, Yutao Fan, Rui Li, Heng Zhou, Yiran Qin

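TL;DR: This paper proposes CoEnv, a framework built on the concept of a compositional environment, a synergistic integration of real-world and simulation components, to address challenges in multi-agent embodied collaboration such as spatial coordination, temporal reasoning, and shared workspace awareness. The framework has three stages: real-to-sim scene reconstruction, VLM-driven action synthesis, and validated sim-to-real transfer for safe deployment.

Details

Motivation: To address the key challenges that multi-agent embodied systems face in complex collaborative manipulation, namely spatial coordination, temporal reasoning, and shared workspace awareness. The approach is inspired by human collaboration, in which cognitive planning is separate from physical execution.

Result: Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv's effectiveness in achieving high task success rates and execution efficiency, which the authors present as a new paradigm for multi-agent embodied AI.

Insight: The paper introduces the compositional environment concept, combining simulation for safe strategy exploration with reliable real-world deployment. It uses VLM-driven action synthesis that supports both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and it ensures safe deployment through validated sim-to-real transfer with collision detection.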

Abstract: Multi-agent embodied systems hold promise for complex collaborative manipulation, yet face critical challenges in spatial coordination, temporal reasoning, and shared workspace awareness. Inspired by human collaboration where cognitive planning occurs separately from physical execution, we introduce the concept of compositional environment – a synergistic integration of real-world and simulation components that enables multiple robotic agents to perceive intentions and operate within a unified decision-making space. Building on this concept, we present CoEnv, a framework that leverages simulation for safe strategy exploration while ensuring reliable real-world deployment. CoEnv operates through three stages: real-to-sim scene reconstruction that digitizes physical workspaces, VLM-driven action synthesis supporting both real-time planning with high-level interfaces and iterative planning with code-based trajectory generation, and validated sim-to-real transfer with collision detection for safe deployment. Extensive experiments on challenging multi-arm manipulation benchmarks demonstrate CoEnv’s effectiveness in achieving high task success rates and execution efficiency, establishing a new paradigm for multi-agent embodied AI.


[145] Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation cs.RO | cs.CV

Jiahua Ma, Yiran Qin, Xin Wen, Yixiong Li, Yuyu Sun

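TL;DR: This paper proposes the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that improves the robustness of visuomotor policies in robotic manipulation under out-of-distribution execution errors or dynamically re-planned trajectories. The framework adapts to unexpected situations by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner, using coupled diffusion heads to preserve standard task execution patterns and a trajectory-steering strategy to integrate the sparse references seamlessly.

Details

Motivation: To address a fundamental problem of visuomotor policy learning for robotic manipulation: how to improve robustness to out-of-distribution execution errors or dynamically re-planned trajectories when the model is trained only on the original expert demonstrations.

Result: Without any additional data or fine-tuning scheme, ReV achieves higher success rates across challenging simulated and real-world tasks.

Insight: The novelty is a closed-loop, referring-aware visuomotor policy framework that uses coupled (global and local) diffusion heads and a trajectory-steering strategy to integrate sparse referring points in real time and replan trajectories dynamically. Training only requires applying targeted perturbations to expert demonstrations, with no elaborate annotations.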

Abstract: This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness to out-of-distribution execution errors or dynamically re-routed trajectories, where the model relies solely on the original expert demonstrations for training. We introduce the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner. Specifically, ReV leverages coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring via a trajectory-steering strategy. Upon receiving a specific referring point, the global diffusion head first generates a sequence of globally consistent yet temporally sparse action anchors, while identifying the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed-loop process repeats at every execution step, enabling real-time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine-tuning scheme, ReV achieves higher success rates across challenging simulated and real-world tasks.
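
The two-stage idea above (sparse anchors first, then interpolation between adjacent anchors) can be illustrated with a deliberately simplified sketch. It does not implement ReV: both diffusion heads are replaced by plain linear interpolation, the trajectory is one-dimensional, and the temporal position of the referring point is assumed to be given rather than predicted. All function names are hypothetical.

```python
def insert_referring_point(anchors, ref_time, ref_value):
    """Return a time-sorted anchor list that also passes through the referring point.

    anchors is a list of (time, value) pairs. An existing anchor at the
    same time is replaced, so times stay unique.
    """
    kept = [(t, v) for t, v in anchors if t != ref_time]
    kept.append((ref_time, ref_value))
    return sorted(kept)


def interpolate(anchors, t):
    """Stand-in for the local head: linear interpolation between adjacent anchors."""
    if t <= anchors[0][0]:
        return anchors[0][1]
    if t >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return v0 + w * (v1 - v0)


# Sparse anchors as a stand-in for the global head's output (toy values).
anchors = [(0, 0.0), (10, 1.0)]
# A referring point at time step 5 steers the trajectory through value 2.0.
steered = insert_referring_point(anchors, 5, 2.0)
dense = [interpolate(steered, t) for t in range(0, 11, 5)]
```

In a closed-loop setting this steering step would be repeated at every execution step with the current observation. The sketch only shows how a single sparse point changes the dense trajectory between its neighbouring anchors.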


[146] Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming cs.RO | cs.CV

Baoshun Tong, Haoran He, Ling Pan, Yang Liu, Liang Lin

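TL;DR: This paper proposes DAERT, a diversity-aware embodied red teaming framework that generates diverse adversarial instructions to comprehensively expose the fragility of vision-language-action models under subtle linguistic variations, thereby improving safety evaluation for robotic manipulation tasks.

Details

Motivation: Existing RL-based automated red teaming methods suffer from severe mode collapse, so they discover only a small number of repetitive failure patterns and cannot fully reveal the safety risks in the linguistic robustness of VLA models, which hinders safety assurance in real-world deployment.

Result: Experiments on two state-of-the-art VLA models (π₀ and OpenVLA) show that DAERT consistently generates a wider range of more effective adversarial instructions, reducing the average task success rate from 93.33% to 5.85% and significantly outperforming standard methods.

Insight: The core contribution is a uniform policy that balances attack effectiveness with instruction diversity, overcoming the mode collapse of traditional RL-based red teaming and providing a scalable way to systematically stress-test VLA agents and expose their safety blind spots.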

Abstract: Vision-Language-Action (VLA) models have achieved remarkable success in robotic manipulation. However, their robustness to linguistic nuances remains a critical, under-explored safety concern, posing a significant safety risk to real-world deployment. Red teaming, or identifying environmental scenarios that elicit catastrophic behaviors, is an important step in ensuring the safe deployment of embodied AI agents. Reinforcement learning (RL) has emerged as a promising approach in automated red teaming that aims to uncover these vulnerabilities. However, standard RL-based adversaries often suffer from severe mode collapse due to their reward-maximizing nature, which tends to converge to a narrow set of trivial or repetitive failure patterns, failing to reveal the comprehensive landscape of meaningful risks. To bridge this gap, we propose a novel Diversity-Aware Embodied Red Teaming (DAERT) framework to expose the vulnerabilities of VLAs against linguistic variations. Our design is based on evaluating a uniform policy, which is able to generate a diverse set of challenging instructions while ensuring its attack effectiveness, measured by execution failures in a physical simulator. We conduct extensive experiments across different robotic benchmarks against two state-of-the-art VLAs, $π_0$ and OpenVLA. Our method consistently discovers a wider range of more effective adversarial instructions that reduce the average task success rate from 93.33% to 5.85%, demonstrating a scalable approach to stress-testing VLA agents and exposing critical safety blind spots before real-world deployment.
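
To make the tension between attack effectiveness and diversity concrete, the sketch below scores a batch of adversarial instructions on both axes. It is an illustration only: the abstract does not specify how DAERT measures diversity, so mean pairwise Jaccard distance over word sets is an assumed stand-in, and the function names and example instructions are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lower-cased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def mean_pairwise_diversity(instructions):
    """Average Jaccard distance over all instruction pairs (0.0 for fewer than two)."""
    pairs = list(combinations(instructions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def attack_success_rate(task_succeeded):
    """Fraction of rollouts in which the policy under test failed its task."""
    failures = sum(1 for ok in task_succeeded if not ok)
    return failures / len(task_succeeded)


# A collapsed adversary repeats one instruction: diversity is zero
# even when every attack succeeds.
collapsed = ["pick up the cup"] * 3
collapsed_diversity = mean_pairwise_diversity(collapsed)

# A varied adversary (hypothetical instructions) scores higher on diversity.
varied = ["pick cup", "grab mug"]
varied_diversity = mean_pairwise_diversity(varied)
```

Reporting both numbers side by side makes mode collapse visible: an adversary can reach a high attack success rate while its diversity score stays near zero.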