cs.CL

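[1] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models cs.CL | cs.AI

Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun

TL;DR: This paper proposes CTRL-RAG, a reinforcement learning framework for improving context-sensitive reasoning and faithfulness in retrieval-augmented generation (RAG) models. Its core is a Contrastive Likelihood Reward (CLR), which optimizes the log-likelihood gap between responses generated with and without supporting evidence, encouraging the model to extract relevant evidence and raising its confidence when grounded in a specific context.

Details

Motivation: Existing RL methods for RAG rely on external rewards that often fail to evaluate document faithfulness and may misjudge similar answers in open-domain settings. No RAG-based self-reward mechanism exists, and pure self-judgment lacks objective feedback, which can cause hallucination accumulation and model collapse.

Result: Experiments show that the method, used alone or combined with external correctness rewards, achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks.

Insight: The main contribution is an "internal-external" hybrid reward framework centered on CLR. CLR optimizes the log-likelihood gap from within the model as a self-reward signal, strengthening reliance on evidence and context faithfulness, and thereby mitigating both the limitations of conventional external rewards and the hallucination risk of pure self-reward.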

Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel “internal-external” hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
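
The log-likelihood gap described in the abstract can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of one fixed response have already been scored twice by the model, once with the supporting documents in the prompt and once without. The function names and the optional length normalization are assumptions made for illustration, not the paper's implementation.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into a sequence log-likelihood."""
    return float(sum(token_logprobs))


def contrastive_likelihood_gap(
    with_evidence: Sequence[float],
    without_evidence: Sequence[float],
    normalize: bool = True,
) -> float:
    """Log-likelihood of a response given (prompt + evidence) minus the
    log-likelihood of the same response given the prompt alone.

    Both inputs are per-token log-probs of the SAME response tokens.
    Length normalization is an assumption added here for illustration.
    """
    if len(with_evidence) != len(without_evidence):
        raise ValueError("both scorings must cover the same response tokens")
    gap = sequence_logprob(with_evidence) - sequence_logprob(without_evidence)
    if normalize:
        gap /= max(len(with_evidence), 1)
    return gap


# Hypothetical numbers: the response is far more likely once evidence is shown,
# so the gap is positive.
reward = contrastive_likelihood_gap([-0.1, -0.2], [-1.0, -1.1])
```

A positive gap means the evidence made the response more likely; a gap near zero suggests the response does not depend on the retrieved documents.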


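[2] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning cs.CL | cs.CV

Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan

TL;DR: This paper proposes Dual Tuning, a framework for assessing whether reasoning training yields positive gains on a target task for a given base model and dataset. By jointly fine-tuning on chain-of-thought and direct-answer data and introducing the "Thinking Boundary", it quantifies how suitable reasoning is for different multimodal tasks and challenges the "reasoning-for-all" paradigm.

Details

Motivation: Reasoning-enhanced LLMs excel at complex tasks such as mathematics and coding, but their effectiveness in general multimodal scenarios remains uncertain. Lacking a criterion for when reasoning truly helps, developers typically release parallel "Instruct" and "Thinking" models, a resource-intensive workaround.

Result: The study establishes the "Thinking Boundary" across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains, to evaluate the suitability of reasoning training. It further explores how reinforcement training and thinking patterns affect reasoning suitability, and tests whether the "Thinking Boundary" can guide data refinement.

Insight: The main contribution is a systematic, quantitative framework for evaluating gains from reasoning training, together with the "Thinking Boundary" concept. The study offers practical guidance for identifying appropriate data and training strategies, and motivates resource-efficient, adaptive auto-think systems.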

Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel “Instruct” and “Thinking” models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the “Thinking Boundary” to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the “Thinking Boundary” can guide data refinement. Our findings challenge the “reasoning-for-all” paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.


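[3] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction cs.CL

Rabab Alkhalifa

TL;DR: This paper proposes a reliability-aware weak supervision framework for Arabic social media sentiment prediction. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) estimates instance-level reliability, and QUBO-based optimization selects a reliable, balanced data subset to improve the accuracy of sentiment prediction.

Details

Motivation: Framing detection in Arabic social media is difficult due to interpretive ambiguity, dependence on cultural context, and limited reliable supervision. Existing LLM-based weak supervision methods are brittle when annotations are few and socially dependent.

Result: Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.

Insight: The contribution is a shift of focus from label aggregation to data curation: epistemic signals from the multi-agent LLMs (such as disagreement and reasoning quality) yield instance-level reliability estimates, and QUBO optimization selects a subset that is frame-balanced with reduced redundancy.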

Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline, two framers, a critic, and a discriminator, treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
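
To make the QUBO-based subset selection concrete, here is a toy sketch. The objective below (reward reliability, penalize pairwise similarity, and add a quadratic penalty that pulls each frame toward a target count) is an assumed formulation chosen for illustration; the abstract does not give the paper's actual coefficients or solver. The brute-force solver only works for a handful of items.

```python
from itertools import product


def build_qubo(reliability, similarity, frames, per_frame, lam=1.0, mu=2.0):
    """Upper-triangular QUBO matrix Q for minimizing x^T Q x over x in {0,1}^n.

    Assumed objective (illustrative only):
      -sum_i r_i x_i                      reward reliable instances
      + lam * sum_{i<j} s_ij x_i x_j      penalize redundant pairs
      + mu * sum_f (count_f - per_frame)^2  pull each frame toward a target size
    """
    n = len(reliability)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] -= reliability[i]
        for j in range(i + 1, n):
            Q[i][j] += lam * similarity[i][j]
    # Expand (sum_i x_i - k)^2 with x_i^2 = x_i; the constant k^2 is dropped.
    for f in set(frames):
        idx = [i for i in range(n) if frames[i] == f]
        for a, i in enumerate(idx):
            Q[i][i] += mu * (1 - 2 * per_frame)
            for j in idx[a + 1:]:
                Q[i][j] += 2 * mu
    return Q


def energy(Q, x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q):
    """Exhaustive search; only feasible for very small n."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: energy(Q, x))


# Hypothetical toy data: items 0 and 1 are near-duplicates within frame "A".
reliability = [0.9, 0.8, 0.2, 0.7]
frames = ["A", "A", "A", "B"]
similarity = [[0.0, 0.95, 0.0, 0.0], [0.95, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
selection = solve_brute_force(build_qubo(reliability, similarity, frames, per_frame=1))
```

With these toy numbers the minimizer keeps the most reliable item from frame "A" and the single item from frame "B". A real instance would be handed to a QUBO solver or annealer rather than enumerated.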


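[4] Context-Dependent Affordance Computation in Vision-Language Models cs.CL | cs.AI | cs.LG

Murad Farzulla

TL;DR: Through a large-scale computational study, this work systematically characterizes context-dependent affordance computation in vision-language models (VLMs). Using Qwen-VL 30B and LLaVA-1.5-13B, it tests 3,213 scene-context pairs from COCO-2017 with context priming across 7 agentic personas. The models' lexical scene descriptions show substantial "affordance drift": more than 90% of the lexical description is context-dependent, and 58.5% at the semantic level. Stochastic baseline experiments and Tucker decomposition confirm that the drift reflects genuine context effects rather than generation noise, and reveal stable latent factors (a "Culinary Manifold" and an "Access Axis").

Details

Motivation: To investigate whether and how the computation of affordances (the actions or functions an object or scene offers an agent) in VLMs is influenced by context, such as the agent's persona, and thereby shed light on how dynamic the models' internal representations are.

Result: In the large-scale COCO-2017 experiments, mean Jaccard similarity between lexical descriptions under different context conditions is only 0.095 (p < 0.0001), indicating that more than 90% of the lexical description is context-dependent. Mean sentence-level cosine similarity is 0.415, indicating that 58.5% of semantic content is context-dependent. Stochastic baseline experiments confirm that this is a genuine context effect, and Tucker decomposition reveals stable orthogonal latent factors.

Insight: The core contribution is quantifying and confirming, for the first time through a large-scale systematic study, strong and non-random context-dependent affordance computation in VLMs. This challenges the assumption of static world modeling and suggests a new direction for robotics and related fields: "JIT Ontology", in which the model projects an ontology according to the dynamic query. The contribution claimed in the abstract is the gap between lexical (90%) and semantic (58.5%) context dependence, which shows that surface vocabulary is more sensitive to context shifts than underlying meaning.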

Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a “Culinary Manifold” isolated to chef contexts and an “Access Axis” spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner – with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts – and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
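
The headline percentages follow directly from the similarity scores: the context-dependent share is one minus the mean similarity. The sketch below shows the Jaccard computation on two made-up descriptions and the arithmetic behind the 90% and 58.5% figures. Whitespace tokenization and the example sentences are assumptions; the abstract does not say how the lexical sets were built.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two descriptions.

    Lowercased whitespace tokenization is an assumption for illustration.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def context_dependent_share(mean_similarity: float) -> float:
    """Share of the description that changes with context: 1 - similarity."""
    return 1.0 - mean_similarity


# Hypothetical descriptions of one kitchen scene under two personas.
chef_view = "knife cutting board stove"
child_view = "stove step stool"
overlap = jaccard(chef_view, child_view)

lexical_share = context_dependent_share(0.095)   # reported mean Jaccard
semantic_share = context_dependent_share(0.415)  # reported mean cosine
```

Here `lexical_share` is about 0.905 (the ">90%" in the abstract) and `semantic_share` is about 0.585 (the "58.5%").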


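[5] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis? cs.CL | cs.AI | cs.MA

Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar

TL;DR: This paper studies multi-agent LLM systems for clinical diagnosis, focusing on how diversity of agent provenance (whether agents come from different vendors) affects diagnostic performance. Comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation frameworks, it finds that mixed-vendor configurations perform best on RareBench and DiagnosisArena, achieving state-of-the-art recall and accuracy.

Details

Motivation: Most existing multi-agent LLM systems for clinical diagnosis rely on single-vendor agent teams, which risks correlated failure modes in which shared biases are reinforced rather than corrected. The paper asks whether vendor diversity improves diagnostic robustness and accuracy.

Result: On RareBench and DiagnosisArena, the mixed-vendor multi-agent configuration (doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet) consistently outperforms single-vendor configurations, achieving state-of-the-art (SOTA) recall and accuracy.

Insight: The claimed contribution is establishing vendor diversity as a key design principle for robust clinical diagnostic systems. The core insight is that mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss, which suggests a way to mitigate model bias and improve collaboration.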

Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.


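[6] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science cs.CL | cs.AI

Zonglin Yang, Runze Mao, Tianhao Wu, Han Li, QingGuo Zhou

TL;DR: This study presents the first end-to-end framework for developing large language models (LLMs) for combustion science. It comprises a 3.5-billion-token multimodal knowledge base, a largely automated evaluation benchmark of 436 questions (CombustionQA), and a three-stage knowledge-injection pathway that moves from retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval to continued pretraining. Standard RAG hits a performance ceiling (60% accuracy), and building a domain foundation model requires structured knowledge graphs combined with continued pretraining.

Details

Motivation: To advance foundation LLMs for combustion science, address the difficulty of acquiring and evaluating domain expertise, and overcome the performance bottleneck of existing methods such as standard RAG.

Result: On CombustionQA, standard RAG (Stage 1) reaches 60% accuracy, far above zero-shot performance (23%) but well below the theoretical upper bound (87%). The study attributes this limit to context contamination, so the later stages (knowledge graphs and continued pretraining) are needed to build a true domain foundation model.

Insight: The contribution is a unified end-to-end framework that combines a knowledge base, an evaluation benchmark, and a progressive knowledge-injection pathway. Its core insight is identifying the limits of standard RAG in a specialized domain (a performance ceiling and context contamination) and systematically arguing for combining structured knowledge (knowledge graphs) with parameter updates (continued pretraining), which gives a clear roadmap for domain-specific LLM development.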

Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage’s performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).


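[7] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models cs.CL | cs.AI | cs.LG

Wai Tuck Wong, Jun Sun, Arunesh Sinha

TL;DR: This paper studies a new failure mode in multimodal large language models (MLLMs): performance degradation caused by induced numerical instability. By optimizing a loss term that maximizes numerical instability at inference time, it constructs images that significantly degrade model output, and validates the effect on several SOTA vision-language models and standard datasets.

Details

Motivation: As MLLMs become widely used, studying their failure points is critical. The paper aims to expose a failure mode that degrades performance indirectly and differs from adversarial perturbations.

Result: On SOTA models including LLaVa-v1.5-7B, Idefics3-8B, and SmolVLM-2B-Instruct, evaluated on standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO), performance degrades significantly even when the input image changes only slightly.

Insight: The contribution is attacking MLLMs by inducing numerical instability rather than through conventional adversarial attacks, which exposes the models' internal numerical sensitivity as a vulnerability and offers a new angle for robustness evaluation.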

Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision-language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.


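[8] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning cs.CL | cs.AI

Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang

TL;DR: This paper proposes GOLF, a reinforcement learning framework that uses group-level natural language feedback to guide more targeted exploration, addressing the inefficient exploration of RL algorithms that rely only on scalar rewards.

Details

Motivation: Current RL algorithms rely solely on scalar rewards and underuse the rich semantic information in natural language feedback, which leads to inefficient exploration.

Result: On both verifiable and non-verifiable benchmarks, GOLF achieves superior performance and exploration efficiency, with a 2.2x improvement in sample efficiency over RL methods trained only on scalar rewards.

Insight: The contribution is aggregating two complementary sources of group-level language feedback, external critiques and intra-group attempts, to produce high-quality actionable refinements. These are adaptively injected into training as off-policy scaffolds to give targeted guidance in sparse-reward regions. Generation and refinement are jointly optimized in a unified RL loop, creating a virtuous cycle.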

Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.


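[9] Optimizing Language Models for Crosslingual Knowledge Consistency cs.CL | cs.AI

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández

TL;DR: This paper proposes Direct Consistency Optimization (DCO), a method for addressing inconsistent knowledge in large language models across languages. Inspired by DPO, it requires no explicit reward model and uses reinforcement learning to optimize the policy toward crosslingually consistent answers.

Details

Motivation: LLMs often exhibit inconsistent knowledge in multilingual settings, giving contradictory answers to the same question asked in different languages, which seriously undermines their reliability. The paper aims to mitigate this.

Result: Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs, outperforms existing methods when training on samples from multiple languages, and complements DPO when gold labels are available. Additional experiments demonstrate effectiveness in bilingual settings, significant out-of-domain generalization, and controllable alignment via direction hyperparameters.

Insight: The core contribution is DCO, which needs no explicit reward model and is derived directly from the LLM itself, reinforcing crosslingual knowledge consistency in a structured way. The method is efficient and robust, and offers a new approach to aligning multilingual LLMs.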

Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.


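[10] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings cs.CL | cs.AI

Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang

TL;DR: This paper proposes TSEmbed, a universal multimodal embedding framework that addresses the task conflict multimodal large language models (MLLMs) face when used as universal embedding models. It combines Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives, and introduces Expert-Aware Negative Sampling (EANS), which uses expert routing distributions as an intrinsic proxy for semantic similarity to dynamically select informative hard negatives and sharpen the model's discriminative power. A two-stage learning paradigm ensures training stability. TSEmbed achieves state-of-the-art performance on the Massive Multimodal Embedding Benchmark (MMEB) and on real-world industrial datasets.

Details

Motivation: Despite the strong reasoning capabilities of MLLMs, task conflict severely impedes their adaptation into universal embedding models. The paper aims to remove this limitation and enable task-level scaling.

Result: TSEmbed achieves state-of-the-art (SOTA) performance on MMEB and on real-world industrial production datasets.

Insight: The main contributions are: 1) combining MoE with LoRA to explicitly disentangle task conflict; 2) EANS, which uses expert routing to dynamically select hard negatives and sharpen embedding boundaries; 3) a two-stage learning paradigm that ensures training stability and expert specialization. Together these lay a foundation for task-level scaling of universal multimodal embeddings.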

Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model’s discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
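
The idea of using expert routing distributions as a proxy for semantic similarity can be sketched as follows. Cosine similarity over routing vectors and top-k selection are assumptions made for illustration; the abstract does not specify the similarity measure or how EANS weights negatives, and all names here are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two routing distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_hard_negatives(
    query_routing: Sequence[float],
    candidate_routings: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k candidates whose expert-routing pattern is most
    similar to the query's. Under the proxy assumption, these are the hardest
    negatives because they activate the same experts as the query."""
    ranked = sorted(
        range(len(candidate_routings)),
        key=lambda i: cosine(query_routing, candidate_routings[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical routing distributions over three experts.
query = [0.7, 0.2, 0.1]
candidates = [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
hard = select_hard_negatives(query, candidates, k=2)
```

In this toy case candidate 1 shares the query's dominant expert and is ranked first, followed by candidate 2.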


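[11] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models cs.CL | cs.AI

Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish

TL;DR: This paper proposes a training-free, low-cost intervention for increasing the diversity of text generated by diffusion language models. During batch sampling, each sample is pushed away from previous samples in feature space, which actively penalizes redundancy and makes the candidates more distinct. Evaluated on HumanEval and GSM8K with LLaDA-8B-Instruct, the method significantly improves diversity and Pass@k performance.

Details

Motivation: In complex reasoning tasks such as code generation and mathematical problem solving, diverse outputs are essential for exploring the solution space effectively. Conventional sampling, including in diffusion language models, often wastes compute on repeated failure modes, which leaves candidates insufficiently diverse and hurts Pass@k performance.

Result: On HumanEval (code generation) and GSM8K (mathematical reasoning) with LLaDA-8B-Instruct, the method significantly improves generation diversity and Pass@k across temperature settings, indicating broader coverage of the solution space.

Insight: The contribution is a training-free sampling intervention with negligible computational overhead: intermediate samples in a batch are modified sequentially, with a repulsive force applied in feature space to reduce redundancy. This gives diffusion language models a plug-and-play way to increase diversity without retraining or beam search, applicable to any task that benefits from diverse search.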

Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
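
For readers unfamiliar with the metric, Pass@$k$ is commonly computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them pass. The sketch below implements that standard estimator; whether this paper uses exactly this form is an assumption, since the abstract does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem.

    n: total samples drawn, c: samples that pass, k: budget being evaluated.
    Equals the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 4 samples, 1 correct, budget of 2.
score = pass_at_k(n=4, c=1, k=2)
```

The estimator makes clear why diversity matters: redundant failures leave `c` unchanged, while a batch that covers more of the solution space raises the chance that at least one sample passes.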


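[12] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection cs.CL

Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

TL;DR: This paper presents an agentic LLM pipeline for SemEval-2026 Task 10, which jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Its decoupled design handles semantic reasoning and structural localization separately, and it introduces Dynamic Discriminative Chain-of-Thought and an "Anti-Echo Chamber" architecture to address semantic ambiguity and model bias.

Details

Motivation: To resolve the conflation of semantic reasoning with structural localization in traditional classifiers, and to overcome the "Reporter Trap", in which models detecting conspiracy endorsement falsely penalize objective reporting.

Result: The system achieves 0.24 Macro F1 on subtask S1 (+100% over baseline) and 0.79 Macro F1 on S2 (+49%), and the S1 system ranks 3rd on the development leaderboard.

Insight: The main contributions are the decoupled task design, Dynamic Discriminative Chain-of-Thought (DD-CoT) for marker extraction, and the "Anti-Echo Chamber" architecture for detection (an adversarial Parallel Council adjudicated by a Calibrated Judge). Together these offer a general paradigm for interpretable, psycholinguistically grounded NLP.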

Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an “Anti-Echo Chamber” architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the “Reporter Trap,” where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.


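[13] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition cs.CL

Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang

TL;DR: This paper proposes federated heterogeneous language model optimization for hybrid automatic speech recognition (ASR). Federated learning produces multiple heterogeneous language models (non-neural n-gram models and neural network models) that are hard to merge effectively. The paper introduces a match-and-merge paradigm with two algorithms, the genetic-operation-based GMMA and the reinforcement-learning-based RMMA, to optimize language model fusion and improve ASR accuracy and generalization.

Details

Motivation: In federated hybrid ASR systems, acoustic models can be merged with established methods, but the language models used for rescoring are heterogeneous (non-neural n-gram and neural network models), which makes merging them difficult. The paper targets the optimized merging of such heterogeneous language models.

Result: Experiments on seven OpenSLR datasets show that the Reinforced Match-and-Merge Algorithm (RMMA) achieves the lowest average Character Error Rate and generalizes better than baselines. RMMA also converges up to seven times faster than the genetic GMMA.

Insight: The core contribution is formalizing heterogeneous language model merging as an optimization task and proposing a general match-and-merge paradigm. Using reinforcement learning to guide matching and merging preserves performance while substantially improving optimization efficiency, which points toward scalable, privacy-preserving ASR systems.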

Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm’s potential for scalable, privacy-preserving ASR systems.
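
The evaluation metric reported above, Character Error Rate, is conventionally the character-level edit distance between the hypothesis and the reference, divided by the reference length. The sketch below shows that standard definition; any text normalization applied in the paper's evaluation is not stated in the abstract and is not modeled here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) computed
    with a rolling one-row dynamic-programming table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from the reference
                cur[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and recognizer output.
cer = character_error_rate("abcd", "abed")
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.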


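[14] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation cs.CL

Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo

TL;DR: This paper proposes HiFlow, a framework for long-form text generation under complex constraints. It models generation as a two-level optimization over a planning layer and a generation layer, using constraint-aware plan screening and closed-loop feedback to jointly optimize global structure and local semantics and produce high-quality, constraint-satisfying long text.

Details

Motivation: LLMs perform well on short text but still struggle with long-form generation under complex constraints such as global structural consistency, local semantic coherence, and constraint feasibility. Existing methods mostly rely on static planning or offline supervision and cannot effectively coordinate global and local objectives during generation.

Result: Experiments on multiple backbone models confirm HiFlow's effectiveness over baseline methods.

Insight: The contribution is formalizing constrained long-form generation as a two-level optimization problem and introducing a hierarchical feedback-driven mechanism: closed-loop interaction between the planning and generation layers enables dynamic coordination and progressive refinement, improving overall text quality and constraint satisfaction.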

Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow’s effectiveness over baseline methods.


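[15] ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI cs.CL | cs.AI | cs.LG

Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad

TL;DR: This paper introduces ARC-TGI, an open-source framework for generating diverse ARC-AGI visual reasoning tasks. Tasks are sampled by compact Python programs (task-family generators), and each generated task is paired with natural-language input and transformation reasoning chains and with partially evaluated Python code. The framework supports task-level constraints so that training examples collectively expose the variations needed to infer the latent rule, and all generators are refined and verified by humans.

Details

Motivation: On static collections of hand-authored ARC-AGI puzzles, overfitting, dataset leakage, and memorization make progress hard to measure. The goal is scalable dataset sampling and controlled benchmarking.

Result: The release includes 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.

Insight: The contribution is the task-family generator concept: task-level constraints ensure that training examples collectively expose the rule's variations, a solver-facing representation includes reasoning chains and code, and human validation keeps tasks natural and consistent.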

Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
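
A toy task-family generator helps show what a task-level constraint adds over independent per-example sampling. The sketch below is not ARC-TGI's API: the latent rule (mirror each row), the grid sizes, and every function name are hypothetical. The constraint rejects any draw in which all training inputs are left-right symmetric, because such examples would be equally consistent with the identity rule.

```python
import random
from typing import Dict, List

Grid = List[List[int]]


def transform(grid: Grid) -> Grid:
    """Latent rule of this toy family: mirror each row left-to-right."""
    return [row[::-1] for row in grid]


def sample_grid(rng: random.Random, size: int = 3, colors: int = 4) -> Grid:
    """Sample a size x size grid of color indices uniformly at random."""
    return [[rng.randrange(colors) for _ in range(size)] for _ in range(size)]


def sample_task(seed: int, n_train: int = 3) -> Dict[str, object]:
    """Sample one task: n_train training pairs plus one test pair.

    Task-level constraint: at least one training input must change under the
    rule, otherwise the examples could not distinguish mirroring from identity.
    """
    rng = random.Random(seed)
    while True:
        inputs = [sample_grid(rng) for _ in range(n_train + 1)]
        if any(transform(g) != g for g in inputs[:n_train]):
            break  # constraint satisfied; otherwise resample the whole task
    pairs = [(g, transform(g)) for g in inputs]
    return {"train": pairs[:n_train], "test": pairs[n_train]}


task = sample_task(seed=0)
```

Seeding the generator makes each task reproducible, so a benchmark can be resampled at will while keeping the latent rule fixed.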


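[16] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting cs.CL | cs.AI

Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan

TL;DR: This paper proposes a hierarchical Large autoBidding Model (LBM) that combines the reasoning ability of large language models (LLMs) with a dedicated action-generation module, aiming to address the counterintuitive behavior and weak generalization that black-box methods cause in online advertising auto-bidding.

Details

Motivation: As online ad auctions scale up, manual bidding becomes impractical. Existing auto-bidding methods (offline reinforcement learning or generative methods) can behave counterintuitively because of black-box training and limited mode coverage in datasets, which makes it hard to understand task status and to generalize in dynamic ad environments. LLMs offer prior knowledge and reasoning ability, but applying them directly leads to hallucinations and suboptimal decisions, because bidding requires precise actions and LLMs lack specialized auto-bidding knowledge.

Result: Experiments show that a generative backbone based on LBM is superior in training efficiency and generalization.

Insight: The contributions are: 1) a hierarchical architecture (LBM-Think for reasoning, LBM-Act for action generation) that combines LLM reasoning with a dedicated action module; 2) a dual embedding mechanism that fuses language and numerical modalities for language-guided training of LBM-Act; 3) GQPO, an offline reinforcement fine-tuning technique that mitigates the high-level Think model's hallucinations and improves decision-making without the simulation or real-world rollout required by previous multi-turn LLM-based methods.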

Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LBM-Think’s hallucinations and enhancing decision-making performance without simulation or real-world rollout like previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in training efficiency and generalization ability.


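[17] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers cs.CL | cs.LG

Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

TL;DR: This paper studies analogical reasoning in transformers from a theoretical perspective. Treating similarity premises and attribution premises separately, it proves that joint training yields aligned representations that support analogical reasoning, that sequential training succeeds only when similarity structure is learned first, and that two-hop reasoning requires explicit identity bridges in the training data.

Details

Motivation: Because existing evaluations conflate multiple reasoning types, the paper isolates analogical reasoning (inferring shared properties between entities from known similarities) and studies how it emerges in transformers.

Result: Experiments on architectures up to 1.5B parameters validate the theory and show that representational geometry shapes inductive reasoning ability.

Insight: The contribution is a theoretical proof that transformers achieve property transfer by encoding entities with similar properties into similar representations, along with the finding that a training curriculum (similarity structure first) and explicit identity-bridge data are necessary for more complex reasoning.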

Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.


[18] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning cs.CL | cs.AIPDF

Avni Mittal, Rauno Arike

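TL;DR: This paper introduces C2-Faith, a benchmark for evaluating how well large language models (LLMs) acting as judges assess causal and coverage faithfulness in chain-of-thought (CoT) reasoning. Built from PRM800K, the benchmark creates test cases through controlled perturbations (replacing a step with an acausal variant, or deleting steps to reduce coverage) and evaluates judges on three tasks: causal error detection, causal step localization, and coverage scoring. Model rankings depend strongly on task framing, with no single model dominating all settings; every model shows a substantial gap between detecting an error and localizing it; and coverage judgments for incomplete reasoning are systematically inflated.

Details

Motivation: LLMs are increasingly used as judges of CoT reasoning, yet it is unclear whether they can reliably assess process faithfulness rather than just answer plausibility. A benchmark is therefore needed to systematically evaluate LLM judgments along two complementary dimensions of faithfulness: causality and coverage.

Result: Three frontier LLM judges were evaluated on C2-Faith. Model rankings vary widely across task framings (binary causal detection, causal step localization, coverage scoring), and no single model dominates all tasks; every model shows a substantial gap between detecting an error and pinpointing the erroneous step; and coverage scores are systematically inflated for incomplete reasoning.

Insight: The paper contributes C2-Faith, a benchmark dedicated to process faithfulness (causality and coverage) in CoT reasoning, and uses controlled perturbations to build a test set with known error positions. Viewed objectively, the study exposes the limitations of LLMs as process judges (task dependence, the gap between detection and localization, scoring bias), offers guidance for choosing LLM judges in practical evaluation, and underscores that process-level evaluation needs finer-grained benchmarks.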

Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
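The two controlled perturbations described in the abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical: the function names, the example steps, the hand-written acausal step, and the choice to keep the first and last steps during deletion are assumptions made for illustration, not the benchmark's actual construction procedure.

```python
import random

def replace_step(steps, position, acausal_step):
    """Swap one step for an acausal variant; the position is the known error label."""
    perturbed = list(steps)
    perturbed[position] = acausal_step
    return perturbed, position

def delete_steps(steps, rate, rng):
    """Delete round(rate * len(steps)) steps.

    Assumption for this sketch: the first and last steps are always kept,
    so only intermediate inferences are removed.
    """
    inner = list(range(1, len(steps) - 1))
    k = min(len(inner), round(rate * len(steps)))
    dropped = set(rng.sample(inner, k))
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, sorted(dropped)

# Hypothetical five-step solution trace.
steps = ["x + 2 = 5", "x = 5 - 2", "x = 3", "check: 3 + 2 = 5", "answer: 3"]

# Causal perturbation: step 2 no longer follows from the prior context.
perturbed, error_position = replace_step(steps, 2, "x = 7")

# Coverage perturbation at a 40% deletion rate, seeded for reproducibility.
shortened, dropped = delete_steps(steps, 0.4, random.Random(0))
```

Because the error position and the dropped indices are returned alongside the perturbed trace, a judge's detection, localization, and coverage outputs can each be scored against a known label.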


Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

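TL;DR: This paper proposes a systematic, operational guideline for annotating and visualizing the structure of legal argumentation in Chinese judicial decisions. Grounded in theories of legal reasoning and argumentation, the guideline aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. It distinguishes four proposition types at the proposition level, defines five argumentative relations at the relational level, and specifies formal representation rules, visualization conventions, a standardized annotation workflow, and consistency control mechanisms.

Details

Motivation: The goal is to represent the structure of legal argumentation in judicial decisions in a systematic and operational way, revealing its logical organization and providing a reliable data foundation and methodological support for legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

Result: The abstract reports no quantitative experimental results or benchmark evaluations.

Insight: The contribution is a hierarchical annotation framework (proposition types and relation types) that combines legal theory with computational requirements, accompanied by formal representation rules, a visualization scheme, and a standardized workflow with consistency control mechanisms that ensure reproducible and reliable data. Together these provide a methodological tool for large-scale analysis of judicial reasoning.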

Abstract: This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.


[20] Diffusion LLMs can think EoS-by-EoS cs.CLPDF

Sarah Breckner, Sebastian Schuster

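TL;DR: This paper studies how diffusion large language models use end-of-sequence (EoS) tokens as a hidden scratchpad in reasoning tasks. Experiments with LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on Addition, Entity Tracking, and Sudoku show that adding EoS tokens improves reasoning, and causal interventions confirm that EoS tokens carry information about the problem being solved.

Details

Motivation: Diffusion LLMs perform well on complex reasoning tasks, particularly when the generation length is set far higher than the answer requires and the model pads its answer with EoS tokens. The paper asks whether this means diffusion models use the representations of EoS tokens as a hidden scratchpad for reasoning.

Result: Experiments on Addition, Entity Tracking, and Sudoku show that adding EoS tokens improves the reasoning capabilities of diffusion LLMs. Under a causal intervention that patches the hidden states of EoS tokens with those from a counterfactual generation, the output frequently changes to the counterfactual, indicating that EoS tokens carry key information.

Insight: The contribution is evidence that diffusion LLMs may use the representations of EoS tokens as internal computation space (a hidden scratchpad) to strengthen complex reasoning, a mechanism that differs from autoregressive models. Viewed objectively, this offers a new perspective on the inner workings of diffusion models and hints that specially designed tokens could be used to improve reasoning.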

Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs’ reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.


[21] Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic cs.CL | cs.SCPDF

Sara Candussio, Gabriele Sarti, Gaia Saveri, Luca Bortolussi

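TL;DR: This paper proposes a knowledge distillation framework that embeds the semantic geometry of formal logic specifications, specifically Signal Temporal Logic (STL), into continuous neural representations. Using a teacher-student architecture, it distills a symbolic robustness kernel into a Transformer encoder to produce neural embeddings that are efficient, invertible, and faithful to semantic similarity.

Details

Motivation: Existing approaches rely either on symbolic kernels, which are computationally expensive, anchor-dependent, and non-invertible, or on syntax-based neural embeddings that fail to capture underlying structure. The paper aims to bridge this gap with neuro-symbolic reasoning that preserves behavioural semantics while remaining efficient and scalable.

Result: Experiments on STL show that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible, while substantially reducing runtime computational cost.

Insight: The method is supervised with a kernel-weighted geometric alignment objective, so that errors are penalized in proportion to their semantic discrepancy. Distillation gives an efficient neural emulation of the symbolic kernel, supporting scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation.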

Abstract: We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels – which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible – or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel’s logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.


[22] Oral to Web: Digitizing ‘Zero Resource’ Languages of Bangladesh cs.CL | cs.HCPDF

Mohammad Mamun Or Rashid

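TL;DR: This paper presents the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset for Bangladesh, which aims to digitize the country’s approximately 40 minority languages, 14 of which are endangered. The corpus contains 85792 structured textual entries (each with a Bengali stimulus text, an English translation, and an IPA transcription) and approximately 107 hours of transcribed audio, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families. The data were collected through 90 days of systematic fieldwork involving 77 speakers and 43 validators, and are publicly available on the Multilingual Cloud platform.

Details

Motivation: Bangladesh has approximately 40 minority languages, many of them predominantly oral, computationally ‘zero resource’, and endangered, and it has lacked a systematic, cross-family digital corpus for them. The paper aims to fill this gap and to support endangered language documentation and low-resource natural language processing research.

Result: The authors built a parallel multimodal dataset with 85792 textual entries and 107 hours of audio covering 42 language varieties. The data are publicly available through the Multilingual Cloud platform, providing the first national-scale resource for language documentation and low-resource NLP.

Insight: The contribution is the first cross-family, multimodal dataset of ‘zero resource’ languages of Bangladesh. It uses systematic fieldwork and an elicitation template with three levels of linguistic granularity (lexical items, grammatical constructions, directed speech), combined with IPA transcription and independent adjudication, offering a scalable method and a publicly accessible platform for endangered language digitization and low-resource NLP.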

Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh’s ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally “zero resource” varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.


[23] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning cs.CLPDF

Mohammad Mahdi Moradi, Sudhir Mudur

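TL;DR: DiSCTT is an efficient framework for test-time adaptation of large language models on reasoning tasks. It estimates epistemic uncertainty from agreement among reasoning trajectories and allocates optimization strategies dynamically: high-consensus inputs undergo supervised fine-tuning with majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized with reinforcement learning under a consensus-regularized objective.

Details

Motivation: Existing test-time adaptation methods apply a uniform optimization objective to all inputs, which makes adaptation inefficient or unstable on heterogeneous reasoning problems. DiSCTT addresses this by taking instance difficulty and uncertainty into account.

Result: Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy and lower variance with less computation and training time.

Insight: The core contribution is a dynamic, difficulty-aware self-curriculum framework based on epistemic uncertainty (estimated from agreement among reasoning trajectories). It brings instance-level difficulty into the allocation of test-time optimization strategies, which yields more stable, efficient, and effective adaptation.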

Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
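The consensus-based routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the function name `route_by_consensus`, the use of the majority-answer share as the agreement score, and the 0.6 threshold are all assumptions introduced here.

```python
from collections import Counter

def route_by_consensus(answers, threshold=0.6):
    """Route one input based on agreement among sampled final answers.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    if agreement >= threshold:
        # High consensus: the majority answer serves as a pseudo-label for SFT.
        return {"route": "sft", "pseudo_label": top_answer, "agreement": agreement}
    # Low consensus: hand the input to the RL branch without a pseudo-label.
    return {"route": "rl", "pseudo_label": None, "agreement": agreement}

# Five sampled trajectories per input (made-up final answers).
easy = route_by_consensus(["42", "42", "42", "41", "42"])
hard = route_by_consensus(["7", "9", "12", "7", "3"])
```

In this toy run the first input reaches 80% agreement and is routed to supervised fine-tuning, while the second reaches only 40% and is routed to reinforcement learning.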


[24] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs cs.CLPDF

Deshan Sumanathilaka, Nicholas Micallef, Julian Hough

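TL;DR: This paper proposes an exploration-analysis-disambiguation reasoning framework to improve low-parameter large language models on word sense disambiguation. By fine-tuning small open-source models such as Gemma and Qwen on the FEWS dataset with chain-of-thought reasoning and neighbour-word analysis, the study brings these low-parameter models to performance comparable to GPT-4-Turbo in zero-shot settings while substantially reducing computational and energy demands.

Details

Motivation: Rare or domain-specific senses are often misinterpreted in word sense disambiguation. The paper explores whether reasoning-driven fine-tuning can bring low-parameter LLMs to performance comparable to high-parameter models, whose computational and energy demands limit scalability.

Result: On FEWS, the Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models, with robust generalization to unseen senses. Evaluation on the unseen “Fool Me If You Can” dataset also confirms strong cross-domain adaptability without task-specific fine-tuning.

Insight: The claimed contribution is a reasoning-centric fine-tuning framework that combines chain-of-thought reasoning with neighbour-word analysis so that low-parameter models can perform word sense disambiguation efficiently. Viewed objectively, the core finding is that carefully designed, reasoning-focused fine-tuning lets small models retain high performance while greatly reducing resource consumption, which makes them a practical option for NLP in resource-constrained settings.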

Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen “Fool Me If You Can” dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.


[25] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought cs.CL | cs.AI | cs.LGPDF

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow

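TL;DR: This paper provides evidence of ‘performative reasoning’ in the chain-of-thought of large language models: the model keeps generating reasoning tokens after it has already become confident in its final answer, without revealing that internal belief. Using activation probing, early forced answering, and a chain-of-thought monitor on DeepSeek-R1 671B and GPT-OSS 120B, the study finds that on easy MMLU questions the final answer can be decoded far earlier than the reasoning text reveals it, whereas difficult GPQA-Diamond questions involve genuine reasoning.

Details

Motivation: The paper asks whether large language models genuinely reason step by step in their chain-of-thought or merely ‘perform’ a reasoning process, that is, whether the internal belief is already settled while the model still produces a lengthy reasoning chain.

Result: Probe-guided early exit reduces reasoning tokens by up to 80% on MMLU and 30% on GPQA-Diamond while maintaining similar accuracy. The study also finds clear performative reasoning on easy tasks, while on hard tasks genuine inflection points (such as backtracking and ‘aha’ moments) appear.

Insight: The contribution is the notion of ‘reasoning theater’ and the use of activation probing to separate a model’s actual beliefs from performative reasoning. The method detects performative reasoning efficiently and, through adaptive computation such as early exit, substantially reduces compute, offering a new tool for understanding internal reasoning mechanisms.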

Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: the model’s final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, ‘aha’ moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned “reasoning theater.” Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
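As a rough illustration of probe-guided early exit, the sketch below stops generation once a probe's confidence in the final answer has stayed above a threshold for a few consecutive reasoning steps. The exit rule, the threshold of 0.9, the patience of 2, and the per-step confidence values are assumptions chosen for this example; the abstract does not specify the exit criterion the authors use.

```python
def early_exit_step(probe_confidences, threshold=0.9, patience=2):
    """Index of the first step where confidence has been >= threshold
    for `patience` consecutive steps, or None if that never happens."""
    streak = 0
    for i, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i
    return None

def fraction_tokens_saved(step_token_counts, exit_step):
    """Share of reasoning tokens skipped by exiting after `exit_step`."""
    if exit_step is None:
        return 0.0
    used = sum(step_token_counts[: exit_step + 1])
    return 1.0 - used / sum(step_token_counts)

# Made-up probe confidences for a five-step chain of thought.
confidences = [0.40, 0.95, 0.97, 0.99, 0.99]
tokens_per_step = [50, 50, 50, 50, 50]

exit_at = early_exit_step(confidences)                    # exits at step index 2
saved = fraction_tokens_saved(tokens_per_step, exit_at)   # 40% of tokens skipped

# A trace whose confidence never stays high does not trigger an exit.
no_exit = early_exit_step([0.5, 0.6, 0.95, 0.5])
```

The patience parameter is one simple way to avoid exiting on a single noisy confidence spike; other stopping rules would fit the same interface.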


cs.CV [Back]

[26] Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology cs.CV | cs.AI | cs.LGPDF

Ekansh Arora

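TL;DR: This study examines how foundation models in computational pathology, such as CPath-CLIP, behave under cross-cancer and cross-species transfer. It finds that standard vision-language alignment generalizes poorly across species, with performance below state-of-the-art benchmarks. To address this, the paper proposes Semantic Anchoring, which uses language to provide a stable coordinate system for visual features, effectively circumventing embedding collapse and improving same-cancer and cross-cancer classification.

Details

Motivation: The behavior of foundation models in computational pathology under cross-cancer and cross-species transfer has not been established, and standard vision-language alignment may be insufficient for effective cross-species generalization.

Result: Few-shot fine-tuning raised same-cancer performance from 64.9% to 72.6% AUC and cross-cancer performance from 56.84% to 66.31% AUC. Cross-species performance nevertheless remained below the state-of-the-art benchmark H-optimus-0 (84.97% AUC). With Semantic Anchoring, same-cancer and cross-cancer classification gained a further 8.52% and 5.67%, respectively.

Insight: The paper identifies semantic collapse in cross-species transfer, where the model fails because of species-dominated alignment rather than missing visual information. It proposes Semantic Anchoring, which uses language as a control mechanism that gives visual features a stable coordinate system and enables semantic re-interpretation without retraining. It also shows that the benefit comes from the text-alignment mechanism itself rather than from text encoder complexity, which points to a new direction for improving the generalization of vision-language models in pathology.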

Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP’s failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.


[27] Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living cs.CVPDF

Kooshan Hashemifard, Pau Climent-Pérez, Francisco Florez-Revuelta

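TL;DR: This paper proposes a multi-modal deep learning system for Ambient Assisted Living (AAL) that recognizes the daily activities of older adults. The system combines video information processed by a 3D CNN, 3D human pose data analyzed by a Graph Convolutional Network, and object detection context that is fused with the video features through a cross-attention mechanism.

Details

Motivation: Robust recognition of daily activities is essential for effective AAL, but it faces challenges including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. The paper addresses these with multi-modal fusion in order to better monitor and support the independent living of older adults.

Result: The method was evaluated on the Toyota SmartHome dataset of real-world indoor activities, where the system achieves competitive classification accuracy across a range of daily activities.

Insight: The contribution is the deep fusion of video, 3D human pose, and object context through multiple models (a 3D CNN and a Graph Convolutional Network) and a cross-attention mechanism, which strengthens recognition of complex daily activities and provides a key building block for advanced AAL monitoring solutions.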

Abstract: Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.


[28] Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data cs.CV | cs.LGPDF

Ancymol Thomas, Jaya Sreevalsan-Nair

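TL;DR: This study systematically analyzes fusion and grouping strategies in deep learning models for Local Climate Zone (LCZ) classification of multimodal remote sensing data. It builds four fusion models (baseline hybrid fusion, self- and cross-attention mechanisms, fusion with multi-scale Gaussian filtered images, and weighted decision-level fusion) and combines them with band grouping and label merging. On the So2Sat LCZ42 dataset, baseline hybrid fusion with the grouping strategies proves most effective, reaching an overall accuracy of 76.6% and markedly improving predictions for minority classes.

Details

Motivation: LCZ classification is important for studying urban structure and climate change, but multimodal remote sensing data are complex and existing work lacks a comprehensive analysis of the fusion mechanisms inside deep learning classifiers. A systematic evaluation of how different fusion and grouping strategies affect classification accuracy is therefore needed.

Result: Experiments on So2Sat LCZ42 (pairs of Synthetic Aperture Radar and multispectral images) show that the baseline hybrid fusion model with band grouping and label merging performs best, with an overall accuracy of 76.6%. It outperforms the other fusion methods and improves prediction accuracy for underrepresented classes.

Insight: The contribution is a systematic comparison of pixel-, feature-, and decision-level fusion mechanisms, together with grouping strategies based on inherent data characteristics (band grouping and label merging). The analysis suggests that combining hybrid fusion with data grouping handles the complexity of multimodal data well and offers an architectural pattern that similar remote sensing classification tasks can reuse.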

Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion
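Of the four fusion models, weighted decision-level fusion (FM4) is the simplest to illustrate: each modality's classifier produces class probabilities, and the final prediction comes from their weighted combination. The sketch below shows that general technique only. The modality weights, the three-class probabilities, and the function name are made up for illustration, and the abstract does not say how FM4 sets its weights.

```python
def weighted_decision_fusion(probs_by_modality, weights):
    """Combine per-modality class probabilities with normalized weights."""
    total_weight = sum(weights.values())
    n_classes = len(next(iter(probs_by_modality.values())))
    fused = [0.0] * n_classes
    for name, probs in probs_by_modality.items():
        w = weights[name] / total_weight
        for c, p in enumerate(probs):
            fused[c] += w * p
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted

# Hypothetical outputs of a SAR branch and an MSI branch over three classes.
fused, predicted = weighted_decision_fusion(
    {"sar": [0.6, 0.3, 0.1], "msi": [0.2, 0.7, 0.1]},
    {"sar": 0.25, "msi": 0.75},
)
```

With these made-up numbers the MSI branch dominates and class index 1 is selected, even though the SAR branch alone would have picked class index 0.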


[29] Mask-aware inference with State-Space Models cs.CVPDF

Ignasi Mas, Ramon Morros, Javier-Ruiz Hidalgo, Ivan Huerta

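TL;DR: This paper proposes Partial Vision Mamba (PVM), a new architectural component that ports the mask-aware inference mechanism of Partial Convolutions to the Mamba backbone so that it can handle arbitrarily shaped missing or invalid data in vision tasks.

Details

Motivation: State Space Models such as Mamba lack an inherent mechanism for handling arbitrarily shaped invalid data at inference time. The paper addresses this so that they can be applied to real-world tasks such as depth completion and image inpainting.

Result: The method's efficacy and generalizability are shown on depth completion, image inpainting, and classification with invalid data.

Insight: The paper transfers the established idea of mask-aware normalization from CNNs (Partial Convolutions) to linear-complexity SSM architectures and defines a series of rules for designing architectures with PVM, giving SSMs a general way to handle incomplete data.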

Abstract: Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
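The mask-aware re-normalization that Partial Convolutions apply can be shown on a one-dimensional signal. The sketch below illustrates the CNN-side idea only (it is not PVM and not the authors' code): each window's response is computed from valid entries and rescaled by the ratio of window size to valid count, and a window with no valid entries yields zero and stays masked. The bias term is omitted, and the signal, mask, and kernel are made-up values.

```python
def partial_conv1d(signal, mask, kernel):
    """1D partial convolution without bias.

    For each window: out = sum(w * x * m) * (window_size / valid_count),
    or 0 with an output mask of 0 when the window has no valid entries.
    """
    k = len(kernel)
    out, out_mask = [], []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        window_mask = mask[start:start + k]
        valid = sum(window_mask)
        if valid == 0:
            out.append(0.0)
            out_mask.append(0)
            continue
        acc = sum(w * x * m for w, x, m in zip(kernel, window, window_mask))
        out.append(acc * k / valid)   # re-normalize over valid entries only
        out_mask.append(1)            # window saw at least one valid entry
    return out, out_mask

# True signal is constant 2.0; entries with mask 0 are missing (stored as 0.0).
signal = [2.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0]
mask = [1, 1, 0, 1, 0, 0, 0]
kernel = [1 / 3, 1 / 3, 1 / 3]   # averaging kernel

out, out_mask = partial_conv1d(signal, mask, kernel)
```

With an averaging kernel, every window that contains at least one valid entry recovers the true value of 2.0, whereas a plain convolution would be dragged toward zero by the missing entries.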


[30] PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing cs.CVPDF

Rohan Mahadev, Joyce Yuan, Patrick Poirson, David Xue, Hao-Yu Wu

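TL;DR: This paper presents PinPoint, a comprehensive benchmark for evaluating Composed Image Retrieval (CIR). It contains 7,635 queries and 329K relevance judgments across 23 query categories, and provides multiple correct answers, explicit hard negatives, query paraphrases for robustness testing, support for multi-image composed queries, and demographic metadata for fairness evaluation. An analysis of more than 20 methods reveals significant weaknesses of current CIR methods in avoiding false positives and in handling paraphrased and multi-image queries. The authors also propose a training-free reranking method based on an existing multimodal large language model (MLLM) to improve existing systems.

Details

Motivation: Current CIR benchmarks usually provide a single ground-truth answer and lack the annotations needed to evaluate false positive avoidance, robustness, and multi-image reasoning. The paper addresses these limitations with a more comprehensive benchmark that is closer to real-world use, in order to advance CIR.

Result: More than 20 methods from 4 major paradigms were evaluated on PinPoint. The best method reaches 28.5% mAP@10 but still retrieves irrelevant hard negatives 9% of the time; the best model’s performance varies by as much as 25.1% across paraphrases; and all methods perform 40% to 70% worse on multi-image queries. The proposed training-free, MLLM-based reranking method can be applied to any existing system to narrow these gaps.

Insight: The contribution is a comprehensive CIR benchmark that combines several challenges (multiple answers, explicit negatives, paraphrased queries, multi-image composition) and therefore reflects real applications better than traditional benchmarks. Viewed objectively, the work systematically exposes core weaknesses of current CIR techniques (poor robustness and weak multimodal compositional reasoning), and its reranking scheme based on an off-the-shelf MLLM is a lightweight, pluggable way to improve performance with clear practical value.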

Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 major paradigms, we uncover three significant drawbacks. First, the best methods, while achieving an mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. Second, the best models exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Third, multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
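To make the two headline metrics concrete, the following sketch computes average precision at k for a query with several correct answers, and the share of top-k results that are annotated hard negatives. The normalization of AP@k by min(number of positives, k) is one common convention and is an assumption here, as is the reading of the hard-negative figure as a top-k share; the benchmark's released code may define both differently. Item identifiers are made up.

```python
def average_precision_at_k(ranked_ids, positives, k=10):
    """AP@k with multiple positives, normalized by min(len(positives), k)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in positives:
            hits += 1
            score += hits / rank   # precision at this rank
    denom = min(len(positives), k)
    return score / denom if denom else 0.0

def hard_negative_rate_at_k(ranked_ids, hard_negatives, k=10):
    """Fraction of the top-k results that are annotated hard negatives."""
    top = ranked_ids[:k]
    return sum(item in hard_negatives for item in top) / len(top) if top else 0.0

# One hypothetical query: three correct answers, two annotated hard negatives.
ranked = ["a", "x", "b", "n1", "c"]
positives = {"a", "b", "c"}
hard_negatives = {"n1", "n2"}

ap = average_precision_at_k(ranked, positives, k=5)
hn_rate = hard_negative_rate_at_k(ranked, hard_negatives, k=5)
```

Averaging `average_precision_at_k` over all queries gives mAP@k. Reporting the hard-negative rate alongside it separates a system that merely misses correct answers from one that actively retrieves known-wrong ones.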


[31] SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D cs.CVPDF

Zirui Wang, Ruiping Liu, Yufan Chen, Junwei Zheng, Weijia Fan

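TL;DR: This paper proposes a training-free Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model) that uses multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) to generate semantic scene graphs without explicit 3D reconstruction. A weighted patch-level similarity selection mechanism improves retrieval robustness, and a ColPali-style cross-modal framework retrieves semantically aligned scene graphs to strengthen relational reasoning.

Details

Motivation: Existing 3D scene graph generation methods typically combine scene reconstruction with graph neural networks, which requires multi-modal data and relies on heuristic graph construction that constrains the prediction of relationship triplets. The paper aims to overcome these limitations with a framework that needs neither training nor explicit 3D reconstruction.

Result: Experiments show that SGR3 Model achieves competitive performance among training-free baselines and is on par with GNN-based expert models. An ablation study confirms that retrieved external information is explicitly integrated into the token generation process.

Insight: The contribution is the combination of MLLMs with RAG for 3D scene graph generation, which avoids explicit reconstruction. The weighted patch-level similarity selection mechanism improves the robustness of cross-modal retrieval, and the framework shows that external knowledge obtained through retrieval can be used explicitly, rather than implicitly, in a reasoning task.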

Abstract: 3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.


[32] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks cs.CV | cs.AIPDF

Chenjun Li

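TL;DR: This paper studies the attention behavior of vision-language models (VLMs) in multi-image reasoning and finds diffuse ‘pulses’ and a systematic positional bias during chain-of-thought (CoT) generation. To address this, the author proposes PulseFocus, a training-free, inference-time method that uses interleaved plan/focus blocks and soft attention gating to make the model explicitly plan which image to examine and focus on it, improving performance on multi-image benchmarks.

Details

Motivation: Multi-image reasoning is a significant challenge for VLMs. The author observes that during CoT generation the text-to-image attention shows diffuse ‘pulse’ patterns that fail to concentrate on task-relevant images, along with a systematic positional bias, which motivates a method for sharpening attention focus.

Result: Performance improves by 3.7% on BLINK and by 1.07% on MuirBench, indicating that the method is effective for multi-image reasoning.

Insight: The paper reveals diffuse attention and positional bias in VLMs during multi-image reasoning and proposes an inference-time method that requires no additional training. By structuring CoT reasoning into plan/focus blocks with soft attention gating, it guides the model to focus on a specific image at decode time, offering a new approach to improving cross-image attention in VLMs.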

Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse “pulses”: sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).


[33] Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild cs.CV | cs.AIPDF

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

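TL;DR: This paper systematically evaluates the zero-shot performance of multimodal large language models (MLLMs) on real-world Video Anomaly Detection (VAD). On the ShanghaiTech and CHAD benchmarks the models show a pronounced conservative bias, with high precision but a collapse in recall that limits practical use. Class-specific instructions significantly improve the F1-score, but recall remains the critical bottleneck.

Details

Motivation: MLLMs have shown strong capabilities in video understanding, but their reliability for zero-shot anomaly detection in real surveillance settings is largely unexplored. The paper reformulates VAD as a language-guided reasoning task to assess how MLLMs actually perform in this domain.

Result: On ShanghaiTech and CHAD, zero-shot MLLMs show high precision but very low recall (for example, an initial peak F1 of only 0.09 on ShanghaiTech). Prompting with class-specific instructions raises the peak F1 on ShanghaiTech to 0.64, yet the recall problem persists, indicating a significant performance gap to existing state-of-the-art methods.

Insight: The paper reformulates VAD as a binary classification language-reasoning task under weak temporal supervision and systematically analyzes how prompt specificity and temporal window length affect performance. Viewed objectively, it exposes a conservative decision bias of MLLMs in open-world surveillance and lays groundwork for future research on recall-oriented prompt engineering and model calibration.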

Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s–3s) influence performance, focusing on the precision–recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the ‘normal’ class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
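The precision-recall trade-off behind the reported F1 collapse follows directly from the metric definitions. The sketch below uses made-up confusion counts, not numbers from the paper, to show how a model that rarely flags anomalies can have perfect precision and still a very low F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical conservative detector: flags 5 of 100 anomalous clips, all correct.
p_cons, r_cons, f1_cons = precision_recall_f1(tp=5, fp=0, fn=95)

# Hypothetical less conservative detector: more detections, some false alarms.
p_open, r_open, f1_open = precision_recall_f1(tp=60, fp=30, fn=40)
```

Because F1 is the harmonic mean of precision and recall, it is dominated by the smaller of the two, so shifting the decision boundary toward the anomalous class can raise F1 substantially even when precision drops.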


[34] Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary cs.CV | cs.AI | cs.LGPDF

Alexandru Florea, Shansong Wang, Mingzhe Hu, Qiang Li, Zach Eidex

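TL;DR: This paper presents the first systematic evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) on multimodal clinical reasoning tasks, with GPT-4o as the comparison. GPT-5 makes substantial gains in text-based reasoning (e.g., MedXpertQA). On multimodal visual question answering (VQA) it outperforms GPT-4o, particularly on mammography tasks that require fine-grained lesion characterization, but its performance remains moderate in neuroradiology and trails domain-specific models in mammography.

Details

Motivation: As AI shifts from task-specific models to general-purpose foundation models, their capacity for the integrated reasoning required in clinical medicine (for example, synthesizing ambiguous patient narratives, laboratory data, and multimodal imaging) needs to be evaluated.

Result: GPT-5 improves by more than 25 percentage points (absolute) on MedXpertQA. It reaches state-of-the-art or competitive performance on most multimodal VQA benchmarks and outperforms GPT-4o by 10-40% on mammography tasks. However, macro-average accuracy in neuroradiology is 44%, and mammography accuracy is 52-64%, below the more than 80% achieved by domain-specific systems.

Insight: GPT-5 makes substantial progress in integrated multimodal clinical reasoning. It uses its enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, mirroring a clinician's cognitive process. In highly specialized, perception-critical tasks, however, generalist models cannot yet replace purpose-built systems.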

Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5’s 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician’s cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.


[35] DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction cs.CV | cs.AIPDF

Shiyu Zhang, Zhicong Wu, Huangxuan Zhao, Zhentao Liu, Lei Chen

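TL;DR: This paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view digital subtraction angiography (DSA) reconstruction. It integrates super-resolution priors through a Multi-Fidelity Texture Learning Module and refines 4D Gaussian kernels with a Radiative Sub-Pixel Densification strategy, reconstructing high-resolution, detail-rich 4D vessel models from sparse dynamic inputs.

Details

Motivation: Existing methods based on Gaussian splatting and dynamic neural representations are limited by the resolution of the input projections and cannot recover fine-grained vascular details or intricate branching structures, which restricts their use in precision diagnosis and treatment. The paper addresses super-resolution in dynamic sparse-view DSA reconstruction.

Result: Extensive experiments on two clinical DSA datasets show that DSA-SRGS significantly outperforms state-of-the-art (SOTA) methods in both quantitative metrics and qualitative visual fidelity.

Insight: The contributions are: 1) a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned, DSA-specific super-resolution model; 2) a Confidence-Aware Strategy that adaptively weights supervision between the low-resolution projections and the generated high-resolution pseudo-labels; 3) a Radiative Sub-Pixel Densification strategy that uses gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels.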

Abstract: Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in Gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels. Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.


[36] Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper cs.CV | cs.AIPDF

Kiranmayee Janardhan, Vinay Martin DSa Prabhu, T. Christy Bobby

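TL;DR: This review compares traditional methods and deep learning for brain glioma image segmentation and classification, and finds that convolutional neural network architectures outperform traditional techniques on these tasks.

Details

Motivation: Accurate segmentation and classification of brain gliomas is essential for treatment planning, prognosis prediction, and disease monitoring, but error-free and reproducible segmentation is challenging, so the effectiveness of existing techniques needs to be assessed.

Result: The review reports that, in post-processing of magnetic resonance images, convolutional neural network architectures outperform traditional techniques on both segmentation and classification.

Insight: The paper highlights the strong performance of deep learning, CNNs in particular, in medical image analysis, supporting clinical adoption of automatic or semi-automatic techniques. Semi-automatic methods are preferred because of the need for accurate evaluations.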

Abstract: Segmentation is crucial for brain gliomas as it delineates the glioma's extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma's size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error-free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques applied after magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.


[37] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models cs.CVPDF

Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu

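TL;DR: This paper proposes MASQuant, a new post-training quantization (PTQ) framework for multimodal large language models (MLLMs). It targets two problems that existing methods such as SmoothQuant run into on MLLMs: smoothing misalignment and cross-modal computational invariance. By introducing Modality-Aware Smoothing (MAS) and Cross-Modal Compensation (CMC), the framework achieves stable quantization of dual-modal and tri-modal MLLMs, with performance competitive with state-of-the-art PTQ algorithms.

Details

Motivation: Existing PTQ methods such as SmoothQuant work well on large language models (LLMs), but applying them directly to MLLMs is difficult. The main issues are Smoothing Misalignment and Cross-Modal Computational Invariance, which degrade quantization performance. The paper aims to resolve both.

Result: Experiments show that MASQuant delivers stable quantization performance on both dual-modal and tri-modal MLLMs and is competitive with state-of-the-art (SOTA) PTQ algorithms.

Insight: The contributions are Modality-Aware Smoothing (MAS), which learns separate smoothing factors for each modality to avoid smoothing misalignment, and Cross-Modal Compensation (CMC), which uses SVD whitening to transform multimodal activation differences into low-rank forms, enabling unified quantization across modalities. This offers a new route to lightweight deployment of multimodal models.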

Abstract: Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
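The following is a minimal sketch of why modality matters for smoothing. It uses the closed-form SmoothQuant rule s_j = max|X_j|^α / max|W_j|^(1-α) as a stand-in for MASQuant's learned factors, on hypothetical toy activations. Function names, the toy data, and α = 0.5 are all assumptions for illustration; the SVD-whitening compensation step is not modeled.

```python
def smoothing_factors(acts, weights, alpha=0.5):
    """SmoothQuant-style per-channel factors: max|X_j|**alpha / max|W_j|**(1 - alpha).

    acts is a list of token rows (tokens x channels); weights is channels x outputs.
    """
    factors = []
    for j, w_row in enumerate(weights):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(w) for w in w_row)
        factors.append(act_max ** alpha / w_max ** (1 - alpha))
    return factors


def linear(acts, weights):
    """Plain matrix product acts @ weights."""
    n_out = len(weights[0])
    return [[sum(row[j] * weights[j][k] for j in range(len(weights)))
             for k in range(n_out)] for row in acts]


def scale_acts(acts, factors):
    """Divide each activation channel by its smoothing factor."""
    return [[x / s for x, s in zip(row, factors)] for row in acts]


def scale_weights(weights, factors):
    """Multiply each weight row (one per input channel) by its smoothing factor."""
    return [[w * s for w in w_row] for w_row, s in zip(weights, factors)]


# Hypothetical toy data: vision activations carry a large outlier in channel 0.
weights = [[0.5, -1.0], [2.0, 0.25]]
text_acts = [[1.0, 2.0], [-0.5, 4.0]]
vision_acts = [[16.0, 0.5], [-9.0, 1.0]]

s_text = smoothing_factors(text_acts, weights)
s_vision = smoothing_factors(vision_acts, weights)

reference = linear(vision_acts, weights)
# Matched factors preserve the layer output: (X / s) @ (s * W) == X @ W.
matched = linear(scale_acts(vision_acts, s_vision), scale_weights(weights, s_vision))
# Weights folded with the text factors no longer cancel the vision factors.
mismatched = linear(scale_acts(vision_acts, s_vision), scale_weights(weights, s_text))
```

In this toy setting the two modalities yield different factors, and a single shared weight matrix can absorb only one of them exactly. That is one way to picture the tension between modality-specific smoothing and computational invariance that the abstract describes.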


[38] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation cs.CV | cs.AI | cs.LGPDF

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui

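TL;DR: This paper proposes Diffusion Contrastive Reconstruction (DCR) to address the limited representational capacity of the CLIP visual encoder. Rather than naively combining diffusion and contrastive learning, it injects contrastive signals into the diffusion-based reconstruction process to unify the optimization objective, jointly improving the discriminative ability and the detail perception of the representation.

Details

Motivation: Existing methods that use diffusion models to enhance CLIP visual representations may harm discriminative ability and therefore do not fully resolve the representation bottleneck. The paper integrates contrastive signals into diffusion-based reconstruction to obtain more balanced and comprehensive visual representations.

Result: Extensive experiments across multiple benchmarks and multimodal large language models validate the method and show that it improves downstream task performance.

Insight: The core idea is to derive the contrastive signals from each reconstructed image instead of from the original input, which unifies the learning objective. According to the theoretical analysis, this jointly optimizes discriminative ability and detail perception and avoids the gradient conflict of the naive combination.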

Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP’s representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.


[39] Revisiting Shape from Polarization in the Era of Vision Foundation Models cs.CVPDF

Chenhao Li, Taishi Ono, Takeshi Uemori, Yusuke Moriuchi

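TL;DR: This paper revisits shape from polarization (SfP) and argues that the weak performance of earlier SfP methods stems mainly from domain gaps rather than from the polarization modality itself. With a high-quality synthetic dataset, DINOv3 priors, and sensor-aware data augmentation, a lightweight model trained on a small dataset outperforms RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation.

Details

Motivation: SfP methods have fallen behind RGB-only VFMs, which raises the question of whether polarization cues are still needed given their limited training data and reliance on specialized hardware. The paper argues that the performance gap comes mainly from domain gaps between synthetic data and the real world.

Result: With only 40K training scenes, the method significantly outperforms both state-of-the-art SfP methods and RGB-only VFMs. Experiments show that polarization cues allow a 33x reduction in training data or an 8x reduction in model parameters while still outperforming RGB-only counterparts.

Insight: The paper identifies and systematically addresses two key domain gaps in SfP: 1) it builds a high-quality synthetic dataset from 3D-scanned real-world objects to improve geometric and texture realism; 2) it introduces polarization sensor-aware data augmentation to model real-world noise. Pretrained DINOv3 priors further improve generalization. Together these show that the physics-aware polarization modality has a distinct advantage over purely data-driven, large-scale RGB models in data efficiency and model size.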

Abstract: We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
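As background on what a "polarization cue" can look like numerically, here is a minimal sketch of the standard computation of linear Stokes parameters, degree of linear polarization (DoLP), and angle of linear polarization (AoLP) from intensities measured behind polarizers at 0°, 45°, 90°, and 135°. The abstract does not state which input representation the paper uses, so treat this as general background, not the authors' pipeline. The helper `malus` simulates ideal, noise-free readings and is an assumption for the demo.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2


def dolp_aolp(i0, i45, i90, i135):
    """Degree (0..1) and angle (radians) of linear polarization."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


def malus(polarizer_deg, light_deg):
    """Malus's law for unit-intensity, fully linearly polarized light."""
    return math.cos(math.radians(polarizer_deg - light_deg)) ** 2


# Simulated readings for light polarized at 30 degrees (ideal sensor, no noise).
readings = [malus(angle, 30.0) for angle in (0, 45, 90, 135)]
dolp, aolp = dolp_aolp(*readings)
print(round(dolp, 6), round(math.degrees(aolp), 6))
```

Real sensor readings add noise to each of the four intensities, which is the kind of effect the abstract's sensor-aware augmentation is meant to cover.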


[40] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction cs.CVPDF

Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu

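TL;DR: This paper proposes the Semantic-Augmented Dynamic Contrastive Attack (SADCA), which aims to improve the transferability of adversarial examples against vision-language pre-training models. It progressively disrupts image-text alignment through dynamic cross-modal interaction and a contrastive learning mechanism, and adds a semantic augmentation module to increase the diversity of adversarial examples.

Details

Motivation: Existing adversarial attacks on vision-language models rely mainly on static cross-modal interactions and only disrupt positive pairs, which yields limited cross-modal disruption and poor transferability. The paper aims to improve the transfer of attacks across models and tasks.

Result: Extensive experiments on multiple datasets and models show that SADCA significantly improves adversarial transferability and consistently outperforms state-of-the-art methods.

Insight: The contributions are a dynamic, progressive mechanism for generating cross-modal perturbations, combined with contrastive learning over adversarial, positive, and negative samples to reinforce semantic inconsistency. In addition, input transformation techniques from traditional attacks are adapted to the vision-language setting through a semantic augmentation module, which improves the diversity and generalization of adversarial examples.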

Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It does so by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples, which reinforces the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.


[41] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models cs.CVPDF

Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu

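TL;DR: This paper proposes a Multi-Paradigm Collaborative Attack (MPCAttack) framework to improve the transferability of adversarial examples against multimodal large language models. It aggregates semantic representations from visual images and language texts and performs joint adversarial optimization with a multi-paradigm collaborative optimization strategy, which alleviates the representation bias caused by training within a single paradigm.

Details

Motivation: Existing adversarial attacks on multimodal large language models usually rely on surrogate models trained within a single learning paradigm and optimize independently in each feature space. This limits the richness of the feature representations and the diversity of adversarial perturbations, and therefore limits transferability. The paper aims to address this problem.

Result: Extensive experiments on multiple benchmarks show that MPCAttack consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source multimodal large language models.

Insight: The main contribution is a multi-paradigm collaborative attack framework that aggregates multimodal features and uses an adaptive optimization strategy based on contrastive matching to balance the importance of representations from different paradigms. This guides the global perturbation optimization, alleviates representation bias, and improves the transferability of adversarial examples.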

Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. Existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations from both visual images and language texts to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.


[42] GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction cs.CV | cs.GRPDF

Tianyu Xiong, Rui Li, Linjie Li, Jiaqi Yang

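TL;DR: This paper presents GloSplat, a framework for joint pose-appearance optimization during 3D Gaussian Splatting training. It treats explicit SfM feature tracks as first-class entities and optimizes them with a reprojection loss alongside photometric supervision. Two variants, one COLMAP-free and one aimed at maximum quality, deliver faster and more accurate 3D reconstruction.

Details

Motivation: Feature extraction, matching, structure from motion, and novel view synthesis have traditionally been treated as separate optimization problems. By jointly optimizing pose and appearance, the paper aims to overcome the limitations of pose refinement that relies only on photometric gradients, such as early-stage pose drift and a lack of fine-grained refinement.

Result: Experiments show that GloSplat-F achieves state-of-the-art results among COLMAP-free methods, while GloSplat-A surpasses all COLMAP-based baselines.

Insight: The main contribution is to keep explicit SfM feature tracks as optimizable parameters separate from the Gaussian primitives and to optimize them jointly through a reprojection loss combined with photometric supervision. This provides persistent geometric anchors that prevent early-stage pose drift and enable fine-grained refinement, a capability that photometric-only methods lack.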

Abstract: Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
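To illustrate what a reprojection loss measures, here is a minimal sketch with a pinhole camera: each tracked 3D point is projected into a camera and compared with its observed pixel. The camera tuple layout, the function names, and the weight in `joint_loss` are assumptions for illustration. The abstract does not give the paper's exact loss or weighting.

```python
def project(point, rotation, translation, focal, center):
    """Pinhole projection: x_cam = R @ X + t, followed by the perspective divide."""
    x_cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
             for i in range(3)]
    u = focal * x_cam[0] / x_cam[2] + center[0]
    v = focal * x_cam[1] / x_cam[2] + center[1]
    return u, v


def reprojection_loss(observations, cameras, points):
    """Mean squared pixel error over (camera index, point index, observed pixel)."""
    total = 0.0
    for cam_idx, pt_idx, (u_obs, v_obs) in observations:
        u, v = project(points[pt_idx], *cameras[cam_idx])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(observations)


def joint_loss(photometric, reprojection, weight=0.1):
    """Weighted sum of both terms; the weight is a hypothetical hyperparameter."""
    return photometric + weight * reprojection


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# One camera at the origin: (rotation, translation, focal length, principal point).
cameras = [(identity, [0.0, 0.0, 0.0], 500.0, (320.0, 240.0))]
points = [[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]]
# The second observation is off by one pixel in u.
observations = [(0, 0, (320.0, 240.0)), (0, 1, (421.0, 240.0))]
loss = reprojection_loss(observations, cameras, points)
print(loss)
```

Because the residual depends on both the camera pose and the 3D track point, gradients of this term can constrain the pose even where photometric gradients are weak. This matches the "geometric anchor" role the abstract describes.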


[43] Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video cs.CVPDF

Jerrin Bright, Justin Mende, John Zelek

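TL;DR: This paper proposes an injury-risk screening method for baseball pitching based on monocular broadcast video. It recovers 18 clinically relevant biomechanical metrics from video, enabling scalable injury prediction. The method builds on an improved DreamPose3D framework with a drift-controlled global lifting module and a kinematics refinement pipeline to extract stable, physically plausible kinematics from broadcast footage.

Details

Motivation: Injury prediction for professional pitchers depends on expensive, venue-bound multi-camera motion capture systems. The paper aims to use widely available broadcast video as a scalable source of biomechanical signals.

Result: On data from 13 professional pitchers (156 pitches), 16 of the 18 metrics have a mean absolute error below 1 degree. In injury prediction on 7,348 pitchers, the automated screening model achieves an AUC of 0.811 for Tommy John surgery and 0.825 for significant arm injuries.

Insight: The contributions are: 1) a drift-controlled global lifting module, based on velocity parameterization and sliding-window inference, that recovers the pelvis trajectory; 2) a kinematics refinement pipeline that combines bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to cope with motion blur and extreme poses. Together they offer a practical route to large-scale biomechanical analysis from monocular video.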

Abstract: Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE $< 1^{\circ}$). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
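As a rough illustration of how a pose-derived metric and its agreement with a reference could be computed, the sketch below measures a joint angle from three 3D keypoints and then the mean absolute error (MAE) against reference angles. The keypoints and angle values are hypothetical. The paper's 18 metrics and evaluation protocol are not specified in the abstract.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired measurements."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical shoulder, elbow, and wrist keypoints forming a right angle.
elbow_flexion = joint_angle_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))

# Hypothetical video-derived angles versus motion-capture reference angles.
mae = mean_absolute_error([90.0, 120.5], [90.4, 120.0])
print(round(elbow_flexion, 3), round(mae, 3))
```

An MAE below one degree on such paired measurements is what the abstract refers to as sub-degree agreement.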


[44] SURE: Semi-dense Uncertainty-REfined Feature Matching cs.CVPDF

Sicheng Li, Zaiwang Gu, Jie Zhang, Qing Guo, Xudong Jiang

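TL;DR: This paper proposes SURE, a semi-dense uncertainty-refined feature matching framework. It predicts correspondences together with their confidence by jointly modeling aleatoric and epistemic uncertainty, addressing the unreliable matches that overconfident errors cause in conventional methods under large viewpoint changes or in textureless regions.

Details

Motivation: Under large viewpoint changes or in textureless regions, existing image matching methods can assign high similarity scores even to incorrect correspondences. Conventional models rely only on feature similarity and lack an explicit mechanism for assessing match reliability, which leads to overconfident errors.

Result: On multiple standard benchmarks, the method outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency.

Insight: The contributions are a novel evidential head for trustworthy coordinate regression and a lightweight spatial fusion module that improves local feature precision at minimal overhead. Joint uncertainty modeling improves match reliability.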

Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at https://github.com/LSC-ALAN/SURE.
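The abstract does not give the parameterization of the evidential head. As a minimal sketch, assume the Normal-Inverse-Gamma form used in deep evidential regression, where a head outputs (gamma, nu, alpha, beta) per regressed coordinate. Under that assumption the two uncertainty types have closed forms, shown below. The function name and example values are hypothetical.

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Prediction and uncertainties from Normal-Inverse-Gamma parameters.

    Assumes nu > 0, alpha > 1, beta > 0 (needed for the expectations to exist).
    """
    if not (nu > 0 and alpha > 1 and beta > 0):
        raise ValueError("require nu > 0, alpha > 1, beta > 0")
    prediction = gamma                     # E[mu]
    aleatoric = beta / (alpha - 1)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1))  # Var[mu]: uncertainty of the model
    return prediction, aleatoric, epistemic


# Hypothetical head outputs for one matched coordinate (in pixels).
pred, alea, epis = evidential_uncertainty(gamma=12.5, nu=4.0, alpha=3.0, beta=2.0)
print(pred, alea, epis)
```

A matcher could then down-weight or discard correspondences whose combined uncertainty is large. How SURE actually uses its confidence estimates is not stated in the abstract.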


[45] DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization cs.CV | cs.AI | cs.MMPDF

Xiaodong Zhu, Suting Wang, Yuanming Zheng, Junqi Yang, Yangxu Liao

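TL;DR: DeformTrace is a deformable state space model for Temporal Forgery Localization (TFL). By introducing deformable dynamics and a relay token mechanism, it addresses the limitations of existing methods with ambiguous boundaries, sparse forgeries, and long-range modeling.

Details

Motivation: In temporal forgery localization, existing state space models (SSMs) struggle to identify manipulated segments in video and audio precisely because of ambiguous boundaries, sparse forgeries, and limited long-range modeling capacity.

Result: In extensive experiments, DeformTrace achieves state-of-the-art (SOTA) performance on the relevant benchmarks with fewer parameters, faster inference, and stronger robustness.

Insight: The main contributions are: 1) a Deformable Self-SSM (DS-SSM) that introduces dynamic receptive fields to improve temporal localization precision; 2) a Relay Token Mechanism that strengthens long-range reasoning and mitigates decay; 3) a Deformable Cross-SSM (DC-SSM) that partitions the state space to increase sensitivity to sparse forgeries; 4) a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs.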

Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.


[46] Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation cs.CVPDF

Hong Liu, Dong Wei, Qian Dai, Xian Wu, Yefeng Zheng

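TL;DR: This paper proposes FedMEPD, a federated learning framework that addresses intermodal heterogeneity and personalization in multimodal medical image segmentation. It uses federated modality-specific encoders to handle clients that lack some modalities, and a partially personalized fusion decoder to meet each client's individual needs. Validated on the BraTS datasets, it outperforms existing multimodal and personalized federated learning methods.

Details

Motivation: Existing federated learning methods mainly handle intramodal heterogeneity and struggle when clients in multimodal medical imaging hold only a subset of the modalities (intermodal heterogeneity). Clients also expect personalized models adapted to the characteristics of their local data.

Result: On the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks, FedMEPD outperforms a range of recent multimodal and personalized federated learning methods, and its novel design choices are shown to be effective.

Insight: The encoders are federated per modality to handle missing modalities, while the discrepancy between global and local parameter updates dynamically determines which decoder filters are personalized. Fused full-modal representations on the server serve as anchors, and clients with missing modalities calibrate their representations toward these anchors through cross-attention, compensating for the lost information.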

Abstract: Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants’ data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs – using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.
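The calibration step mentioned above can be pictured with a minimal sketch of scaled dot-product cross-attention, where client features act as queries and the server-provided anchors act as both keys and values. Learned query, key, and value projections are omitted, and the vectors are hypothetical. This is a simplification, not the paper's exact module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, anchors):
    """Scaled dot-product attention with anchors as both keys and values."""
    dim = len(anchors[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in anchors]
        weights = softmax(scores)
        outputs.append([sum(w * k[d] for w, k in zip(weights, anchors))
                        for d in range(dim)])
    return outputs


# Hypothetical 2-D example: one client feature and two full-modal anchors.
client_features = [[1.0, 0.0]]
anchors = [[1.0, 0.0], [0.0, 1.0]]
calibrated = cross_attention(client_features, anchors)
print(calibrated)
```

Each output is a convex combination of the anchors, weighted toward the anchor most similar to the client feature. In that sense the incomplete representation is pulled toward the full-modal anchors.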


[47] Locality-Attending Vision Transformer cs.CVPDF

Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri, Christian Desrosiers, Ismail Ben Ayed

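TL;DR: This paper proposes the Locality-Attending Vision Transformer (LocAtViT), a simple and effective add-on that improves vision transformers on segmentation while preserving their image-level classification ability. It modulates self-attention with a learnable Gaussian kernel that biases attention toward neighboring patches, and further refines the patch representations to learn better embeddings at patch positions.

Details

Motivation: Vision transformers achieve strong classification results through global self-attention, but the same mechanism can blur the fine-grained spatial details that tasks such as segmentation depend on. The goal is to improve segmentation performance without changing the standard image-level classification training.

Result: Experiments show substantial segmentation gains on three benchmarks (for example, over 6% and 4% on ADE20K for ViT Tiny and ViT Base, respectively), without changing the training regime or sacrificing classification performance.

Insight: The claimed contribution is to modulate self-attention with a learnable Gaussian kernel that steers the model toward local neighborhoods, combined with refined patch-position embeddings, so that local spatial detail is captured better while global information is still integrated. Viewed objectively, it is a lightweight, plug-and-play modification that balances global dependencies against local detail.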

Abstract: Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers’ image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model’s ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
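One way to picture the locality bias is the minimal sketch below, which adds a log-Gaussian distance term to the attention logits of patches laid out on a grid before the softmax. This is equivalent to multiplying the attention weights by a Gaussian kernel and renormalizing. The exact form used by LocAtViT is not given in the abstract, so the additive bias, the fixed scalar `sigma` (learnable in the paper), and the absence of a class token are assumptions.

```python
import math


def gaussian_locality_attention(logits, grid_width, sigma):
    """Softmax attention over patches with a Gaussian bias toward nearby patches.

    logits is an n x n list of raw attention scores for n patches in row-major order.
    """
    n = len(logits)
    attention = []
    for i in range(n):
        yi, xi = divmod(i, grid_width)
        biased = []
        for j in range(n):
            yj, xj = divmod(j, grid_width)
            dist_sq = (yi - yj) ** 2 + (xi - xj) ** 2
            # log of exp(-d^2 / (2 sigma^2)), added to the raw score
            biased.append(logits[i][j] - dist_sq / (2.0 * sigma ** 2))
        peak = max(biased)
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention


# 3 x 3 grid with uniform raw scores, so only the locality bias shapes attention.
uniform = [[0.0] * 9 for _ in range(9)]
attn = gaussian_locality_attention(uniform, grid_width=3, sigma=1.0)
print([round(w, 3) for w in attn[0]])
```

A large `sigma` flattens the bias and recovers near-global attention, while a small one concentrates attention on immediate neighbors. That is the trade-off a learnable kernel width would tune.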


[48] FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation cs.CVPDF

Ganggui Ding, Hao Chen, Xiaogang Xu

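TL;DR: This paper proposes FC-VFI, a method for faithful and consistent video frame interpolation aimed at generating high-frame-rate slow-motion video. A temporal modeling strategy on the latent sequences inherits fidelity cues from the start and end frames, semantic matching lines provide structure-aware motion guidance, and a temporal difference loss reduces temporal inconsistency. The method supports 4x and 8x interpolation, raising 2560×1440 video from 30 FPS to 120 or 240 FPS.

Details

Motivation: Large pre-trained video diffusion models rely on intrinsic generative priors when interpolating frames, which makes high-fidelity frames hard to generate and limits detail preservation. Existing methods often depend on motion control for temporal consistency, but dense optical flow is error-prone and sparse points lack structural context.

Result: Extensive experiments show that FC-VFI achieves high performance and structural integrity across diverse scenarios.

Insight: The contributions are: temporal modeling on the latent sequences to inherit fidelity cues from the start and end frames; semantic matching lines for structure-aware motion guidance, which improves motion consistency; and a temporal difference loss that mitigates temporal inconsistency. Together they improve the visual fidelity and motion consistency of the interpolated frames.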

Abstract: Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting $4\times$ and $8\times$ interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at $2560\times 1440$ resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
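The frame-rate arithmetic is simple to spell out: a k-times interpolation inserts k - 1 frames between each input pair and multiplies the frame rate by k. The sketch below also includes one plausible form of a temporal difference loss, which compares frame-to-frame changes of a predicted sequence with those of a reference. The abstract does not define the paper's loss, so that function is an assumption.

```python
def output_fps(input_fps, factor):
    """Frame rate after factor-times interpolation."""
    return input_fps * factor


def intermediate_timestamps(factor):
    """Normalized times in (0, 1) of the frames inserted between two input frames."""
    return [k / factor for k in range(1, factor)]


def temporal_difference_loss(predicted, target):
    """Mean absolute error between consecutive-frame differences of two sequences.

    Frames are flat lists of pixel values. Illustrative form only.
    """
    total, count = 0.0, 0
    for t in range(1, len(predicted)):
        for p_prev, p_cur, g_prev, g_cur in zip(
                predicted[t - 1], predicted[t], target[t - 1], target[t]):
            total += abs((p_cur - p_prev) - (g_cur - g_prev))
            count += 1
    return total / count


print(output_fps(30, 4), output_fps(30, 8))  # 120 240
print(intermediate_timestamps(4))            # [0.25, 0.5, 0.75]
```

Penalizing differences of differences targets flicker and motion jitter directly, because a sequence can match every reference frame on average yet still change too abruptly between neighbors.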


[49] AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM cs.CVPDF

Li’an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang

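TL;DR: This paper proposes AdaIAT, a method for alleviating hallucination in large vision-language models (LVLMs). It adaptively increases the attention weights on generated text, which reduces hallucination while avoiding repetitive descriptions, and its effectiveness is validated on several LVLMs.

Details

Motivation: Hallucination is a major obstacle to the development and application of current LVLMs. Directly increasing attention to image tokens reduces hallucination but often causes repetitive descriptions. The paper looks for a method that alleviates hallucination while preserving linguistic coherence.

Result: Experiments on LVLMs such as LLaVA-1.5 show that AdaIAT significantly lowers the hallucination rate (on LLaVA-1.5, the hallucination metrics C_S and C_I drop by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving a good trade-off.

Insight: The core finding is that real object tokens assign higher attention to the generated text than hallucinated ones. Based on this, the paper proposes an attention intervention on generated text (IAT) and its adaptive version (AdaIAT). AdaIAT uses a layer-wise threshold to control when to intervene and applies fine-grained amplification tailored to each attention head, alleviating hallucination without impairing the model's inherent prediction capability.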

Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
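As a rough picture of what increasing attention to generated text could mean for a single attention head, the sketch below scales the weights on generated-text positions and renormalizes. The renormalization, the `gain` parameter, and the token layout are assumptions. The layer-wise threshold and per-head amplification magnitudes described in the abstract are not modeled.

```python
def amplify_generated_attention(weights, generated_positions, gain):
    """Scale attention on generated-text positions by (1 + gain), then renormalize."""
    generated = set(generated_positions)
    scaled = [w * (1.0 + gain) if i in generated else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical head: positions 0-1 are image tokens, 2 is the instruction,
# and 3-4 are previously generated text tokens.
head_weights = [0.4, 0.3, 0.1, 0.1, 0.1]
adjusted = amplify_generated_attention(head_weights, generated_positions=[3, 4], gain=1.0)
print([round(w, 3) for w in adjusted])
```

An adaptive variant would choose `gain` separately per head and apply it only when some layer-level condition is met. That is the part of the method this sketch leaves out.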


[50] Adaptive Prototype-based Interpretable Grading of Prostate Cancer cs.CV

Riddhasree Bhattacharyya, Pallabi Dutta, Sushmita Mitra

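TL;DR: This paper proposes a novel prototype-based, weakly supervised framework for interpretable grading of prostate cancer from histopathology images. The framework is pre-trained to learn prototypical features associated with each grade, fine-tuned with a prototype-aware loss function, and equipped with an attention-based dynamic pruning mechanism to handle inter-sample heterogeneity. Extensive validation on the PANDA and SICAP benchmark datasets indicates that it can serve as a reliable assistive tool in pathologists' routine diagnostic workflows.

Details

Motivation: Prostate cancer is a common malignancy in men, and the rising demand for biopsies places a heavy burden on pathologists. Existing deep learning systems perform well but offer limited interpretability, which hinders their adoption in high-stakes applications such as medicine. Existing interpretability techniques give coarse explanations without revealing why the highlighted regions matter.

Result: Extensive validation on the PANDA and SICAP benchmark datasets confirms that the framework can serve as a reliable assistive tool in pathologists' routine diagnostic workflows.

Insight: The contributions are: (1) a prototype-driven, weakly supervised framework whose explicit reasoning mirrors how a pathologist compares suspicious regions with clinically validated examples, which increases trustworthiness; (2) a prototype-aware loss function for weakly supervised fine-tuning; and (3) an attention-based dynamic pruning mechanism that selectively emphasizes relevant prototypes to handle sample heterogeneity and optimize performance.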

Abstract: Prostate cancer is one of the most frequently diagnosed malignancies in men, and the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for an interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.


[51] Location-Aware Pretraining for Medical Difference Visual Question Answering cs.CV | cs.AI

Denis Musinguzi, Caren Han, Prasenjit Mitra

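TL;DR: This paper proposes a location-aware pretraining framework for medical difference visual question answering (VQA). By introducing tasks such as automatic referring expressions, grounded captioning, and conditional automatic referring expressions, it lets the vision encoder learn fine-grained, spatially grounded visual representations that better capture subtle differences between medical images. The enhanced encoder is then combined with a language model for medical difference VQA.

Details

Motivation: Conventional single-image models struggle to distinguish disease progression from acquisition differences in medical difference VQA, because standard vision encoders pretrained on contrastive or classification objectives often fail to capture the necessary subtle visual variations.

Result: The method achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.

Insight: The novelty lies in bringing location-aware tasks (AREF, GCAP, CAREF) into pretraining so that the vision encoder learns fine-grained, spatially grounded representations that traditional methods overlook, improving medical image difference analysis. Viewed objectively, designing pretraining tasks for a specific domain (medical imaging) is an effective domain adaptation strategy.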

Abstract: Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.


[52] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters cs.CV | cs.CL

Jiaxin Fan, Wenpo Song

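TL;DR: This paper presents VisionPangu, a compact multimodal model with only 1.7 billion parameters that aims to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. The model combines an InternVL vision encoder with the OpenPangu-Embedded language backbone, adopts a LLaVA-inspired instruction-tuning pipeline, and is trained on dense human-authored descriptions from the DOCCI dataset.

Details

Motivation: Existing Large Multimodal Models (LMMs) typically rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. A more compact model that still produces fine-grained descriptions is therefore needed.

Result: Experiments show that compact multimodal models can achieve competitive performance while producing more structured and detailed captions.

Insight: The novelty lies in aligning the vision and language modalities efficiently with a lightweight MLP projector and supervising with high-quality, dense human annotations (DOCCI). This improves semantic coherence and descriptive richness without aggressive model scaling, demonstrating the potential of small models on fine-grained multimodal tasks.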

Abstract: Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.


[53] Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression cs.CV | cs.GR

Toby Chong, Ryota Nakajima

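TL;DR: This paper proposes a new camera model for monocular 3D Morphable Model regression. It extends orthographic projection with a shrinkage parameter, capturing the perspective distortion common in close-up facial images while keeping the stability of the original projection.

Details

Motivation: Existing regression-based 3D Morphable Model fitting methods usually rely on orthographic projection for stable performance. Because orthographic projection ignores perspective distortion, these methods cannot handle close-up facial footage such as that captured with head-mounted cameras.

Result: Quantitative and qualitative comparisons on a custom dataset recorded with head-mounted cameras demonstrate the effectiveness of the proposed modification.

Insight: The main novelty is extending orthographic projection with a shrinkage parameter that simulates a perspective effect while preserving stability, which makes close-up images tractable. The paper also presents techniques for fine-tuning existing models.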

Abstract: We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
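The abstract does not state the formula for the shrinkage parameter, so the sketch below only illustrates the general idea. It contrasts scaled orthographic projection with one hypothetical pseudo-perspective form, `scale / (1 + shrink * z)`, chosen here purely as an assumption because it reduces to the orthographic case when `shrink = 0`. The function and parameter names are illustrative, not taken from the paper.

```python
def orthographic(point, scale=1.0):
    """Scaled orthographic projection: depth is ignored."""
    x, y, _z = point
    return (scale * x, scale * y)


def pseudo_perspective(point, scale=1.0, shrink=0.0):
    """Hypothetical shrinkage form: points farther away (larger z) shrink.

    shrink = 0 reduces exactly to the orthographic projection above.
    This formula is an illustrative assumption, not the paper's definition.
    """
    x, y, z = point
    factor = scale / (1.0 + shrink * z)
    return (factor * x, factor * y)


near = (1.0, 1.0, 0.0)  # e.g. a point at the reference depth
far = (1.0, 1.0, 2.0)   # e.g. a point farther from the camera
```

With a single extra scalar, the projection stays free of the focal-length and distance ambiguity that the abstract attributes to full perspective models, which is the stability argument the authors make.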


[54] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding cs.CV | cs.AI

Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

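TL;DR: This paper presents 3D-RFT, the first framework to extend Reinforcement Learning with Verifiable Rewards (RLVR) to video-based 3D scene understanding. It first activates a 3D-aware multimodal large language model through supervised fine-tuning, then applies reinforcement fine-tuning with Group Relative Policy Optimization using strictly verifiable reward functions (such as 3D IoU and F1-Score). Optimizing the model directly toward evaluation metrics addresses the mismatch between training objectives and task performance in conventional supervised fine-tuning.

Details

Motivation: Existing video-based 3D scene understanding methods rely mainly on supervised fine-tuning, where the token-level cross-entropy loss serves as an optimization proxy and leaves a gap between the training objective and final task performance. The paper brings the RLVR paradigm, already proven effective for LLM reasoning, into 3D scene understanding to optimize task metrics directly.

Result: On benchmarks for several video-based 3D scene understanding tasks (3D video detection, 3D visual grounding, and spatial reasoning), 3D-RFT-4B achieves state-of-the-art performance and significantly outperforms larger models such as VG LLM-8B.

Insight: The main novelty is the first application of RLVR to 3D multimodal scene understanding, with verifiable reward functions derived directly from task evaluation metrics (such as 3D IoU) to guide reinforcement fine-tuning. This aligns the training objective with the evaluation metric and offers a new optimization paradigm for training 3D scene understanding models.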

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
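To make the idea of a metric-derived reward concrete, the sketch below computes a 3D IoU reward and turns a group of rewards into GRPO-style advantages. It is a minimal illustration under assumptions: boxes are axis-aligned (the benchmarks may use oriented boxes), the reward is the raw IoU, and advantages are standardized with the population standard deviation. The function names are hypothetical and do not come from the paper's code.

```python
from statistics import mean, pstdev


def iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(box_a) + volume(box_b) - inter)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


target = (0, 0, 0, 2, 2, 2)
predictions = [(0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)]
rewards = [iou_3d(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is the evaluation metric itself, a sampled response is pushed up or down according to how its box compares with the others in the same group, rather than according to token-level likelihood.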


[55] Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding cs.CV

Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian

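TL;DR: This paper proposes VideoHV-Agent, a hypothesis-verification multi-agent framework for long video understanding. It recasts video question answering as a structured process with four agents (a Thinker, a Judge, a Verifier, and an Answer agent). By thinking before retrieving, it aims to reduce semantic drift and correlation-driven errors and to make reasoning more logical and interpretable.

Details

Motivation: Long video understanding is hard because of visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. The paper argues that long-video reasoning should start from deliberate task formulation rather than reactive retrieval.

Result: The method achieves state-of-the-art accuracy on three long-video understanding benchmarks while providing enhanced interpretability, improved logical soundness, and lower computational cost.

Insight: The core novelty is turning the thinking-before-finding principle into a structured hypothesis-verification pipeline. Rewriting answer candidates as testable hypotheses and deriving verifiable discriminative clues enables more precise and logically rigorous localization and verification of fine-grained video content, improving the robustness and efficiency of reasoning.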

Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.


[56] A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction cs.CV

Jie Zhu, Hanghang Ma, Jia Wang, Yayong Guan, Yanbing Zeng

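TL;DR: This paper introduces Wallaroo, a simple autoregressive baseline built on next-token prediction that unifies multimodal understanding, image generation, and image editing. The model supports multi-resolution image input and output as well as both Chinese and English, and it shapes its capabilities through decoupled visual encoding pathways and a four-stage training strategy.

Details

Motivation: Existing models handle multimodal understanding, generation, and editing separately and lack a unified framework. The paper explores the potential of autoregressive models for unifying these tasks.

Result: Across several benchmarks, Wallaroo achieves competitive performance and in some cases exceeds other unified models, suggesting strong potential for unifying multimodal understanding and generation.

Insight: The novelty lies in unifying multimodal tasks with a simple next-token-prediction autoregressive framework and integrating the different capabilities through decoupled visual encoding and staged training. Viewed objectively, the simple architecture is an instructive baseline for building general-purpose multimodal models.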

Abstract: In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model’s capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.


[57] TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events cs.CV

Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen

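TL;DR: This paper presents TAPFormer, a transformer-based framework that uses a transient asynchronous fusion mechanism to fuse RGB frames and event streams asynchronously and with temporal consistency, enabling robust, high-frequency arbitrary point tracking. A cross-modal locally weighted fusion module adaptively adjusts spatial attention, and a new real-world frame-event TAP dataset is constructed for validation.

Details

Motivation: Existing point tracking methods that combine RGB frames and event streams suffer from temporal misalignment and degrade severely when one modality fails. The paper targets more robust, high-precision long-term motion reasoning.

Result: On the newly constructed real-world frame-event TAP dataset, the method achieves a 28.2% improvement in average pixel error within threshold. On standard point tracking benchmarks, it consistently achieves the best performance.

Insight: The novelty lies in the transient asynchronous fusion mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, and in the cross-modal locally weighted fusion module, which adjusts spatial attention according to modality reliability. Together they yield stable and discriminative features even under blur or low light.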

Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io


[58] MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration cs.CV

Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Yu Feng

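TL;DR: MultiGO++ is a new framework for monocular 3D clothed human reconstruction that uses geometry-texture collaboration to generate a complete and realistic textured 3D human model from a single image. It builds large-scale textured scan data with a multi-source texture synthesis strategy, reduces the modality gap with a region-aware shape extraction module and a Fourier geometry encoder, and refines high-fidelity meshes with a dual reconstruction U-Net that exploits geometry-texture collaborative features.

Details

Motivation: Existing methods are limited by scarce texture training data, inaccurate external geometric priors, and biased single-modality supervision, which lead to suboptimal reconstruction. The paper aims to address these problems.

Result: Extensive experiments on two benchmarks and many in-the-wild cases show that the method outperforms existing state-of-the-art approaches.

Insight: The contributions are a multi-source texture synthesis strategy for building a large-scale texture dataset, region-aware shape extraction with a Fourier geometry encoder to narrow the modality gap, and a dual reconstruction U-Net for effective geometry-texture collaboration. Together they improve texture quality and geometric accuracy in challenging scenarios.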

Abstract: Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts features from each body region and lets them interact to obtain geometry information, together with a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.


[59] Physics-consistent deep learning for blind aberration recovery in mobile optics cs.CV

Kartik Jhawar, Tamo Sancho Miguel Tandoc, Khoo Jun Xuan, Wang Lipo

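TL;DR: This paper proposes Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image to address lens-specific aberrations in mobile photography. A multi-task strategy combining Zernike coefficient regression, differentiable physics constraints, and auxiliary spatial map prediction yields physics-consistent aberration modeling and clearly outperforms both a coefficient-only baseline and existing deep learning methods.

Details

Motivation: Mobile photography is often limited by complex, lens-specific optical aberrations. Existing end-to-end deblurring networks lack explicit optical modeling and can hallucinate details, while classical blind deconvolution is highly unstable. The paper bridges this gap by using deep learning to blindly recover physical parameters, which then enable stable non-blind deconvolution.

Result: An ablation study on a ResNet-18 backbone shows that the full multi-task framework (z+p+m) improves on the coefficient-only baseline by 35%. Comparative analysis shows that the method outperforms two established deep learning methods with significantly lower regression error. On the IDMxS Mobile Camera Lens Database, the recovered physical parameters enable stable non-blind deconvolution and substantially improve the restoration of diffraction-limited details from severely aberrated mobile captures.

Insight: According to the authors, this is the first work to integrate supervision across three distinct optical domains at once: direct Zernike coefficient regression, differentiable physics constraints covering both wavefront and point spread function derivations, and auxiliary multi-task spatial map prediction. This physics-consistent multi-task strategy combines the representational power of deep learning with the physical interpretability of classical optical models, offering a new approach to optical aberration correction.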

Abstract: Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these “black-box” models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
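The link between regressed Zernike coefficients (z) and the wavefront used by the physics constraints (p) can be sketched as a weighted sum of Zernike polynomials over the unit pupil. The example below is a minimal illustration, assuming Noll-normalized polynomials and only three low-order terms; the dictionary keys, function names, and coefficient values are hypothetical, and the point spread function step (a Fourier transform of the pupil function) is omitted.

```python
import math

# A few low-order Zernike polynomials (Noll-normalized), keyed by name.
ZERNIKE = {
    "defocus": lambda rho, theta: math.sqrt(3) * (2 * rho**2 - 1),
    "astig_0": lambda rho, theta: math.sqrt(6) * rho**2 * math.cos(2 * theta),
    "coma_x": lambda rho, theta: math.sqrt(8) * (3 * rho**3 - 2 * rho) * math.cos(theta),
}


def wavefront(coeffs, rho, theta):
    """Wavefront error W(rho, theta) = sum_j c_j * Z_j(rho, theta) on the unit pupil."""
    return sum(c * ZERNIKE[name](rho, theta) for name, c in coeffs.items())


def coefficient_rmse(predicted, target):
    """Root-mean-square error between two coefficient dictionaries."""
    sq = [(predicted[name] - target[name]) ** 2 for name in target]
    return math.sqrt(sum(sq) / len(sq))


# Illustrative coefficient values (in waves); not taken from the paper.
target = {"defocus": 0.5, "astig_0": -0.2, "coma_x": 0.1}
predicted = {"defocus": 0.4, "astig_0": -0.2, "coma_x": 0.1}
center_error = wavefront(target, rho=0.0, theta=0.0)
rmse = coefficient_rmse(predicted, target)
```

Because `wavefront` is a smooth function of the coefficients, a loss on the wavefront (or on a PSF derived from it) can backpropagate to the regression head, which is the sense in which such constraints are differentiable.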


[60] How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices cs.CV

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang

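TL;DR: This paper presents a large-scale systematic study of the capabilities, limitations, and evaluation practices of Generative Image Restoration (GIR) models. It introduces a multi-dimensional evaluation pipeline covering detail, sharpness, semantic correctness, and overall quality, reveals key performance differences between model architectures, and argues that the field is shifting from a shortage of detail toward a new paradigm centered on detail quality and semantic control.

Details

Motivation: The study assesses how far the real capabilities of generative image restoration models have advanced relative to earlier methods, in order to clarify their current practical level and limitations.

Result: The study covers diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models and reveals significant performance disparities. Using the benchmark, the authors train a new image quality assessment (IQA) model that aligns better with human perceptual judgments.

Insight: The study identifies a paradigm shift in perception-oriented low-level vision: the central challenge has moved from detail scarcity (under-generation) to detail quality and semantic control (preventing over-generation). The multi-dimensional evaluation pipeline and the new IQA model provide a useful reference for future model development and evaluation.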

Abstract: Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.


[61] Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model cs.CV

Yulong Shi, Shijie Li, Ziyi Li, Lin Qi

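TL;DR: This paper proposes Tell2Adapt, a unified Source Free Unsupervised Domain Adaptation (SFUDA) framework for medical image segmentation that draws on the general knowledge of a Vision Foundation Model (VFM). It generates high-quality pseudo-labels through Context-Aware Prompts Regularization (CAPR) and improves clinical reliability through Visual Plausibility Refinement (VPR), achieving unified multi-target adaptation across multiple modalities and targets.

Details

Motivation: Existing SFUDA methods are usually designed for specific, low-gap domain shifts and do not generalize to unified, multi-modal, multi-target real-world settings, which is a major barrier to practical deployment.

Result: In one of the most extensive SFUDA evaluations to date, the method is validated across 10 domain adaptation directions and 22 anatomical targets (including brain, cardiac, polyp, and abdominal targets). It consistently outperforms existing approaches and reaches state-of-the-art results for a unified SFUDA framework in medical image segmentation.

Insight: The core novelty is building a unified SFUDA framework on the general knowledge of a VFM. CAPR ensures high-fidelity prompt translation so that pseudo-labels are reliable, and VPR uses anatomical knowledge to re-ground predictions in low-level image features and remove noise. This offers a scalable approach to multi-modal, multi-target domain adaptation.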

Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modal, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM’s anatomical knowledge to re-ground the adapted model’s predictions in the target image’s low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at https://github.com/derekshiii/Tell2Adapt.


[62] A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset cs.CV | cs.AI | eess.IV

Francisco Vacalebri-Lloret, Lucas Banchero, Jose J. Lopez, Jose M. Mossi

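TL;DR: This study presents an advanced system for detecting the blue warning lights of emergency vehicles, developed on ABLDataset, which contains images of European emergency vehicles under various climatic and geographic conditions. The system uses four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle, and a calibration process enables azimuthal localization of detections. The study compares several deep neural network detectors (the YOLO family, RetinaNet, Faster R-CNN, and RT-DETR), selects RT-DETR as the base model, and enhances it with a color attention block. The resulting model reaches 94.7% accuracy and 94.1% recall on the test set, with field-test detections at up to 70 meters. The system also estimates the approach angle of the emergency vehicle relative to the center of the car through geometric transformations. It is designed for integration into a multimodal system combining visual and acoustic data to improve Advanced Driver Assistance Systems (ADAS) and road safety.

Details

Motivation: Emergency vehicle detection is challenging in complex environments. The study uses a multi-camera system and a dedicated dataset to improve detection performance and strengthen the safety functions of ADAS.

Result: The system reaches 94.7% accuracy and 94.1% recall on the ABLDataset test set, with field-test detections at up to 70 meters. Comparison with mainstream detectors (the YOLO family, RetinaNet, Faster R-CNN, and others) supports the advantage of the enhanced RT-DETR.

Insight: The contributions are: (1) ABLDataset, built specifically for European emergency vehicles; (2) a 360-degree multi-fisheye camera system for omnidirectional detection; (3) a color attention block added to RT-DETR to improve blue-light detection; and (4) azimuth estimation via geometric transformations, which lays the groundwork for multimodal ADAS integration.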

Abstract: This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
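The azimuthal localization step can be sketched as a mapping from a detection's pixel column to an angle in the vehicle frame. The abstract does not describe the calibration, so the example below rests on loud assumptions: an ideal equidistant fisheye model (angle proportional to pixel offset), a known mounting yaw per camera, and a hypothetical front/right/rear/left rig. None of the names or values come from the paper.

```python
def pixel_to_azimuth_deg(u, image_width, camera_yaw_deg, hfov_deg=180.0):
    """Map a detection's horizontal pixel coordinate to a vehicle-frame azimuth.

    Assumes an ideal equidistant fisheye model (angle proportional to pixel
    offset) and a camera whose optical axis points at camera_yaw_deg.
    """
    cx = image_width / 2.0
    degrees_per_pixel = hfov_deg / image_width
    offset_deg = (u - cx) * degrees_per_pixel
    azimuth = camera_yaw_deg + offset_deg
    return (azimuth + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)


# Hypothetical rig: four cameras facing front, right, rear, and left.
CAMERA_YAW_DEG = {"front": 0.0, "right": 90.0, "rear": 180.0, "left": -90.0}

# A detection at the image center of the right-facing camera.
azimuth = pixel_to_azimuth_deg(960, 1920, CAMERA_YAW_DEG["right"])
```

A real system would replace the linear pixel-to-angle mapping with the calibrated lens model, but the final step of adding the mounting yaw and wrapping the angle stays the same.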


[63] MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration cs.CV

Nian Liu, Jin Gao, Shubo Lin, Yutong Kou, Sikui Zhang

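TL;DR: This paper proposes MI-DETR, a bio-inspired dual-pathway detector for infrared small target detection. A retina-inspired cellular automaton (RCA) explicitly generates a motion map from the frame sequence, appearance and motion features are combined across two pathways, and an RT-DETR decoder produces detections. The method performs strongly on several benchmarks.

Details

Motivation: Infrared small targets are tiny, low in contrast, and easily obscured by complex, dynamic backgrounds. The paper also aims to avoid the additional motion supervision or explicit alignment modules that conventional multi-frame methods require.

Result: The method performs strongly on three commonly used ISTD benchmarks: 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, reaching state-of-the-art performance.

Insight: The novelty lies in the bio-inspired dual-pathway design (appearance and motion) with its intermediate interconnection mechanism (the PMI Block), and in the RCA module, which lets a single set of bounding boxes supervise both pathways without extra motion labels or alignment operations. This is a simple and effective solution for motion modeling.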

Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.


[64] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark cs.CV

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao

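TL;DR: This paper introduces the UniM benchmark, the first unified any-to-any interleaved multimodal dataset. It covers 7 modalities (text, image, audio, video, document, code, and 3D) and contains 31K high-quality instances for evaluating how well multimodal large language models understand and generate interleaved multimodal content.

Details

Motivation: Real-world multimodal applications need systems that understand arbitrarily combined and interleaved multimodal inputs and generate outputs in any interleaved multimedia form. This defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, and it poses new challenges and opportunities for advancing multimodal large language models.

Result: The UniM Evaluation Suite assesses models along three dimensions: semantic correctness and generation quality, response structure integrity, and interleaved coherence. The paper also proposes UniMA, an agentic baseline model with traceable reasoning. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence.

Insight: The novelty lies in building the first unified any-to-any interleaved multimodal benchmark, with broad modality and domain coverage, together with a multi-dimensional evaluation framework. Viewed objectively, the work gives multimodal large language models a more comprehensive evaluation standard, emphasizes the challenges of interleaved generation and reasoning, and should help push models toward more flexible and coherent multimodal interaction.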

Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.


[65] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer cs.CV

Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang

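TL;DR: MoRe is a feed-forward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular video. Built on a static reconstruction backbone, it uses an attention-forcing strategy to separate dynamic motion from static structure and a grouped causal attention to capture temporal dependencies, producing temporally coherent geometry.

Details

Motivation: Dynamic 4D scene reconstruction is difficult because moving objects corrupt camera pose estimation. Existing optimization methods usually need additional supervision and are computationally expensive, which makes them impractical for real-time use.

Result: Extensive experiments on multiple benchmarks show that MoRe achieves high-quality dynamic reconstruction with exceptional efficiency.

Insight: The contributions are an attention-forcing strategy that disentangles dynamic and static components, grouped causal attention that handles temporal dependencies and varying token lengths, and fine-tuning on large-scale, diverse datasets for robustness.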

Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
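The abstract does not define grouped causal attention precisely. One plausible reading, sketched below as an assumption, is a block-causal mask in which tokens are grouped by frame: every token can attend to all tokens of its own frame and of earlier frames, and the number of tokens may differ from frame to frame. The function name and the example token counts are hypothetical.

```python
def grouped_causal_mask(tokens_per_frame):
    """Build a block-causal mask: mask[i][j] is True if token i may attend to token j.

    Tokens in the same frame see each other; tokens also see all earlier frames.
    tokens_per_frame may vary from frame to frame.
    """
    frame_of = []
    for frame_idx, count in enumerate(tokens_per_frame):
        frame_of.extend([frame_idx] * count)
    n = len(frame_of)
    return [[frame_of[j] <= frame_of[i] for j in range(n)] for i in range(n)]


mask = grouped_causal_mask([2, 1, 3])  # three frames with 2, 1, and 3 tokens
```

Unlike a token-level causal mask, this keeps full bidirectional attention inside a frame, which suits image tokens that have no natural left-to-right order.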


[66] Orthogonal Spatial-temporal Distributional Transfer for 4D Generation cs.CV

Wei Liu, Shengqiong Wu, Bobo Li, Haoyu Zhao, Hao Fei

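TL;DR: This paper proposes a new framework, Orthogonal Spatial-temporal Distributional Transfer (Orster), to address insufficient spatial-temporal feature learning in high-quality 4D content generation caused by the lack of large-scale 4D datasets. The framework enhances 4D synthesis by transferring spatial priors from existing 3D diffusion models and temporal priors from video diffusion models. It also introduces a spatial-temporal-disentangled 4D diffusion model (STD-4D Diffusion) and a spatial-temporal-aware HexPlane (ST-HexPlane) that integrates the transferred features to improve 4D deformation and 4D Gaussian feature modeling.

Details

Motivation: Current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, so models cannot adequately learn the spatial-temporal features needed for high-quality 4D generation. The paper overcomes this data bottleneck by transferring rich priors from existing 3D and video models.

Result: Experiments show that the method significantly outperforms existing approaches, with better spatial-temporal consistency and higher-quality 4D synthesis. The abstract does not specify the benchmarks on which it reaches state-of-the-art results or matches particular models.

Insight: The contributions are: (1) the Orster mechanism, which carefully models and injects spatiotemporal feature distributions for effective feature transfer; (2) the STD-4D Diffusion model, which synthesizes 4D-aware videos from disentangled spatial and temporal latents; and (3) the ST-HexPlane, which integrates the transferred spatiotemporal features during 4D construction to improve deformation and feature modeling. Viewed objectively, cross-modal transfer learning is an effective way to ease 4D data scarcity and is worth borrowing.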

Abstract: In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.


[67] GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement cs.CV | cs.AI

Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang

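TL;DR: This paper proposes GEM-TFL, a two-phase classification-regression framework for weakly supervised temporal forgery localization. It targets four problems in existing methods: mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the lack of explicit modeling of inter-proposal relationships. The method strengthens weak supervision by reformulating binary labels into multi-dimensional latent attributes through EM-based optimization, introduces a training-free temporal consistency refinement to smooth temporal dynamics, and designs a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation.

Details

Motivation: To address the training-inference objective mismatch, the limited supervision carried by binary labels, gradient blockage, and the missing model of inter-proposal relationships in weakly supervised temporal forgery localization, with the aim of reaching localization accuracy close to fully supervised methods at lower annotation cost.

Result: Extensive experiments on benchmark datasets show that GEM-TFL achieves more accurate and robust temporal forgery localization and substantially narrows the performance gap with fully supervised methods.

Insight: The contributions are: 1) reformulating binary weak supervision into multi-dimensional latent attributes through EM optimization, which strengthens the supervision signal; 2) a training-free temporal consistency refinement module that improves the temporal smoothness of predictions; 3) a graph-based proposal refinement module that explicitly models temporal-semantic relationships among proposals for global consistency. Viewed objectively, the two-phase design that combines the EM algorithm with graph neural networks for a weakly supervised temporal task is a useful reference.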

Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.


[68] UniPAR: A Unified Framework for Pedestrian Attribute Recognition cs.CV | cs.AI

Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan

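TL;DR: This paper proposes UniPAR, a unified Transformer-based framework for pedestrian attribute recognition (PAR). Through a unified data scheduling strategy and a dynamic classification head, a single model can process diverse datasets from different modalities (such as RGB images, video sequences, and event streams) at the same time, and a phased fusion encoder aligns visual features with textual attribute queries. Experiments show that UniPAR performs comparably to specialized SOTA methods on several benchmark datasets, and that multi-dataset joint training significantly improves cross-domain generalization and robustness in extreme environments such as low light and motion blur.

Details

Motivation: Existing PAR research is often constrained by the "one-model-per-dataset" paradigm and struggles with the significant differences across domains in modality, attribute definitions, and environmental scenarios, so a unified framework is needed to overcome these challenges.

Result: On widely used benchmarks including MSP60K, DukeMTMC, and EventPAR, UniPAR achieves performance comparable to specialized SOTA methods. Multi-dataset joint training further strengthens cross-domain generalization and recognition robustness in extreme environments such as low light and motion blur.

Insight: The contributions include the unified data scheduling strategy, the dynamic classification head, and the phased fusion encoder, which explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Viewed objectively, handling heterogeneous multimodal data with a single model improves the generality and adaptability of PAR and offers a more flexible solution for downstream applications.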

Abstract: Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR


[69] Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models cs.CV | cs.RO

Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

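TL;DR: This paper proposes a complexity-aware adaptive inference framework for Vision-Language-Action (VLA) models that dynamically selects an execution path (act immediately, reason, or abstain) based on the complexity of the perceived state, balancing performance against computational efficiency.

Details

Motivation: Current VLA research mainly improves generalization by adding reasoning mechanisms, but this invariably increases computational complexity and inference latency. These mechanisms also lack adaptive judgment of task complexity and uncertainty estimation, which can waste resources or lead to failure on out-of-distribution tasks.

Result: Experiments on the LIBERO and LIBERO-PRO benchmarks and on a real robot show that the vision-only configuration reaches an 80% F1 score using only 5% of the training data, demonstrating its reliability as an efficient task-complexity detector.

Insight: Inspired by human cognition, the approach turns the VLA's vision-language backbone into an active detection tool: latent embeddings are projected into an ensemble of parametric and non-parametric estimators, enabling complexity-based dynamic routing (Act/Think/Abstain). A key finding is that, because of the semantic invariance of language, visual embeddings alone are better at inferring task complexity.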

Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA’s vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
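The Act/Think/Abstain routing described in the abstract can be pictured as thresholding a complexity estimate. The following is a minimal sketch, assuming the estimator ensemble has already been reduced to a single score in [0, 1]; the function name `route` and both threshold values are hypothetical and do not come from the paper.

```python
from enum import Enum


class Route(Enum):
    ACT = "act"          # execute the known task immediately
    THINK = "think"      # invoke reasoning for an ambiguous scene
    ABSTAIN = "abstain"  # halt before acting on an anomalous scene


def route(complexity: float, think_at: float = 0.4, abstain_at: float = 0.8) -> Route:
    """Map a complexity score in [0, 1] to an execution path.

    Both thresholds are hypothetical values chosen for illustration only.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity >= abstain_at:
        return Route.ABSTAIN
    if complexity >= think_at:
        return Route.THINK
    return Route.ACT


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, route(score).value)
```

In a real system the thresholds would presumably be calibrated on held-out data, since they set the trade-off between wasted reasoning and missed anomalies.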


[70] SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction cs.CV | cs.AI | cs.GR

Ningjing Fan, Yiqun Wang

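TL;DR: The paper proposes SSR-GS, a framework that separates specular reflection in 3D Gaussian splatting (3DGS) to reconstruct glossy surfaces. It models direct specular reflection with a Mip-Cubemap, captures indirect specular reflection with an IndiASG module, and introduces Visual Geometry Priors (VGP) to downweight the photometric loss in reflection-dominated regions.

Details

Motivation: To address the difficulty 3DGS has in accurately reconstructing glossy surfaces under complex illumination, particularly in scenes with strong specular reflections and multi-surface interreflections.

Result: Experiments on synthetic and real-world datasets show that SSR-GS achieves state-of-the-art (SOTA) performance in glossy surface reconstruction.

Insight: The contributions are: 1) a prefiltered Mip-Cubemap that efficiently models direct specular reflection; 2) the IndiASG module, which captures indirect specular reflection; 3) the VGP design, which combines a reflection-aware visual prior (the reflection score, RS) with geometry priors (progressively decayed depth supervision and transformed normal constraints).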

Abstract: In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.


[71] Generic Camera Calibration using Blurry Images cs.CV | eess.IV

Zezhun Shi

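TL;DR: This paper proposes a method for generic camera calibration using blurry images. By combining geometric constraints with a local parametric illumination model, it simultaneously estimates feature locations and spatially varying point spread functions, and it resolves the translational ambiguity that conventional image deblurring tasks need not consider. Experimental results show that the method is effective.

Details

Motivation: Generic camera calibration is more accurate than parametric calibration, but with printed calibration boards it requires many more images, which makes motion blur hard to avoid. The paper aims to solve this practical problem.

Result: Experiments validate the method's effectiveness, but the abstract reports no specific benchmarks or comparisons with SOTA.

Insight: The novelty lies in a first attempt at generic camera calibration from blurry images: features and blur kernels are estimated jointly through geometric constraints and a local parametric model, and the translational ambiguity specific to this setting is handled.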

Abstract: Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.


[72] Mario: Multimodal Graph Reasoning with Large Language Models cs.CV

Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan

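TL;DR: This paper proposes Mario, a framework that addresses the challenges large language models face in multimodal graph reasoning. Through a graph-conditioned vision-language model design and a modality-adaptive graph instruction tuning mechanism, it enables effective reasoning over multimodal graphs whose nodes carry textual and visual attributes.

Details

Motivation: Existing methods usually encode image-text pairs in isolation and ignore the relational structure that real-world multimodal data naturally form, so LLM reasoning methods are needed that handle multimodal graphs while preserving graph topology.

Result: Across multiple multimodal graph benchmarks, Mario consistently outperforms state-of-the-art graph models on node classification and link prediction in both supervised and zero-shot scenarios.

Insight: The contributions include jointly refining textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology, and using a learnable router to select the most informative modality configuration for each node and its neighborhood, which addresses weak cross-modal consistency and heterogeneous modality preference.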

Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.


[73] Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule cs.CV | cs.AI

Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu

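TL;DR: Logi-PAR is the first logic-infused patient activity recognition framework. It combines contextual fact fusion with neural-guided differentiable rules, automatically learns logic rules from visual cues, and provides auditable explanations and counterfactual interventions.

Details

Motivation: Existing patient activity recognition models focus mainly on identifying the activity itself and learn implicit patterns through attention mechanisms, but they lack explicit logical reasoning about the cause of risk (that is, why a set of visual cues implies a risk), which limits improvements in clinical safety.

Result: On the clinical benchmarks VAST and OmniFall, Logi-PAR achieves state-of-the-art performance, significantly outperforming vision-language models and Transformer baselines.

Insight: The novelty lies in applying learnable logic rules to symbolic mappings and optimizing the rules end-to-end, so that implicit patterns are explicitly labelled during training. This yields interpretable rule traces and counterfactual analysis (for example, quantifying that risk would decrease by 65% if assistance were present).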

Abstract: Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicitly emerging patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git


[74] Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation cs.CV

Yingxue Su, Yiheng Zhong, Keying Zhu, Zimu Zhang, Zhuoru Zhang

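TL;DR: This paper proposes Semantic Class Distribution Learning (SCDL), a plug-and-play framework that mitigates supervision bias and representation bias in semi-supervised medical image segmentation. It learns structured class-conditional feature distributions by integrating Class Distribution Bidirectional Alignment (CDBA), which aligns embeddings with learnable class proxies, and Semantic Anchor Constraints (SAC), which guide the proxies using labeled data. Experiments on the Synapse and AMOS datasets show that SCDL significantly improves overall and class-level segmentation performance, with particularly strong gains on minority classes, reaching state-of-the-art results.

Details

Motivation: Medical image segmentation is critical for computer-aided diagnosis, but dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. This imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging.

Result: Experiments on the Synapse and AMOS datasets show that SCDL significantly improves overall and class-level segmentation performance, with particularly strong gains on minority classes, reaching state-of-the-art (SOTA) results.

Insight: The novelty is a plug-and-play SCDL framework that learns structured class-conditional feature distributions through CDBA and SAC, mitigating supervision and representation biases in semi-supervised learning. Viewed objectively, the method combines class distribution learning with feature alignment and uses learnable class proxies with semantic guidance from labeled data, which handles class imbalance in medical images and improves segmentation of minority classes.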

Abstract: Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.


[75] SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery cs.CV | cs.AI

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

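TL;DR: SPyCer is a semi-supervised physics-guided network that uses satellite imagery to continuously estimate near-surface air temperature (NSAT). It treats NSAT prediction as a pixel-wise vision problem, supervises it with ground sensor observations and physical constraints (such as the surface energy balance and advection-diffusion-reaction partial differential equations), and captures the physical influence of neighboring pixels through multi-head attention guided by land cover characteristics and modulated by Gaussian distance weighting.

Details

Motivation: Ground sensors for NSAT are sparse and unevenly distributed and cannot provide continuous spatial measurements. The paper fills this gap by using satellite imagery as a proxy.

Result: Experiments on real-world datasets show that SPyCer outperforms existing baselines in accuracy, generalization, and consistency with underlying physical processes, producing spatially coherent and physically consistent NSAT estimates.

Insight: The novelty lies in combining semi-supervised learning with physics guidance: physics-based regularization and a contextual attention mechanism (combining land cover and distance weighting) strengthen interpretability and physical consistency, offering a new approach to fusing remote sensing data with physical models.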

Abstract: Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting. Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
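The idea of attention "modulated with Gaussian distance weighting" can be illustrated with a short sketch. This is a simplified single-head version in plain Python, not the paper's implementation: the function name, the `sigma` bandwidth, and the choice to multiply softmax weights by a Gaussian and then renormalize are all assumptions made for illustration.

```python
import math


def gaussian_weighted_attention(scores, coords, center, sigma=1.0):
    """Softmax over attention logits, modulated by Gaussian distance to the sensor pixel.

    scores: raw attention logits, one per neighboring pixel
    coords: (row, col) position of each pixel in the patch
    center: (row, col) position of the sensor pixel
    sigma:  hypothetical Gaussian bandwidth, in pixels
    """
    # Numerically stable softmax over the raw scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    soft = [e / total for e in exps]

    # Gaussian weight decays with squared distance from the sensor pixel.
    gauss = [
        math.exp(-((r - center[0]) ** 2 + (c - center[1]) ** 2) / (2 * sigma ** 2))
        for r, c in coords
    ]

    # Combine and renormalize so the weights still sum to one.
    weighted = [a * g for a, g in zip(soft, gauss)]
    norm = sum(weighted)
    return [w / norm for w in weighted]


if __name__ == "__main__":
    w = gaussian_weighted_attention(
        scores=[0.0, 0.0, 0.0],
        coords=[(1, 1), (1, 2), (0, 0)],
        center=(1, 1),
    )
    print(w)  # with equal logits, nearer pixels receive larger weights
```

With equal logits the output depends only on distance, which makes the effect of the Gaussian term easy to inspect in isolation.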


[76] Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems cs.CV | cs.RO

Serkan Ergun, Tobias Mitterer, Hubert Zangl

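TL;DR: This paper presents a digital-twin-driven robotic textile sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification and foreign object recognition. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them with state-of-the-art vision-language models (VLMs). The study benchmarks nine VLMs from five model families on a dataset of 223 inspection scenarios, evaluating classification accuracy, hallucination behavior, and computational performance under hardware constraints.

Details

Motivation: To meet the automation needs of sustainable textile recycling, which requires handling deformable garments and detecting foreign objects in cluttered environments, by developing a robust and scalable autonomous sorting solution.

Result: On a dataset covering shirts, socks, trousers, underwear, foreign objects, and empty scenes, the Qwen model family achieves the highest overall accuracy (up to 87.9%) with strong foreign object detection, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment.

Insight: The novelty lies in combining semantic VLM reasoning with conventional grasp detection and digital twin technology. The digital twin enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment, which improves manipulation reliability and demonstrates the feasibility of scalable autonomous textile sorting in realistic industrial settings.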

Abstract: The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.


[77] Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum cs.CV

Shan Ning, Longtian Qiu, Xuming He

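TL;DR: This paper proposes Wiki-R1, a reinforcement learning framework based on data generation and curriculum sampling that incentivizes effective reasoning in multimodal large language models (MLLMs) for knowledge-based visual question answering (KB-VQA). Through controllable curriculum data generation and a curriculum sampling strategy, it builds training distributions matched to the model's evolving capability, bridging the gap between the pretrained model and the KB-VQA target distribution. It sets new state-of-the-art (SOTA) results on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek.

Details

Motivation: In KB-VQA, noisy retrieval and the structured, encyclopedic nature of the knowledge base create a distributional gap from pretrained MLLMs, which makes effective reasoning and domain adaptation difficult in the post-training stage.

Result: The method sets new SOTA results on two KB-VQA benchmarks, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek.

Insight: The novelty is a reinforcement learning framework that combines data generation with curriculum sampling. Specifically: 1) controllable curriculum data generation, which manipulates the retriever to produce samples at specified difficulty levels; 2) a curriculum sampling strategy, which selects informative samples likely to yield non-zero advantages during RL updates; 3) sample difficulty estimated from observed rewards and propagated to unobserved samples to guide learning. This offers a systematic curriculum learning approach to bridging the gap between a pretrained model and a task-specific distribution.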

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
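To see why some samples yield zero advantage, consider a simplified setting that the abstract does not confirm: binary rewards and a group-relative advantage computed over a fixed number of rollouts per sample. Under those assumptions the advantage vanishes whenever every rollout receives the same reward, so a sample with estimated success rate p and group size G is informative with probability 1 - p^G - (1 - p)^G. The sketch below ranks samples by that quantity; the function names and the dictionary of success rates are hypothetical.

```python
def p_nonzero_advantage(success_rate: float, group_size: int) -> float:
    """Probability that a group of rollouts contains both successes and failures.

    Assumes binary rewards and independent rollouts (an illustrative
    simplification, not a detail taken from the paper).
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be positive")
    p = success_rate
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


def pick_informative(success_rates: dict[str, float], group_size: int, k: int) -> list[str]:
    """Return the k sample ids most likely to give a non-zero advantage."""
    ranked = sorted(
        success_rates,
        key=lambda sid: p_nonzero_advantage(success_rates[sid], group_size),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    rates = {"easy": 0.99, "medium": 0.5, "hard": 0.01}
    print(pick_informative(rates, group_size=4, k=1))
```

Samples that are almost always solved or almost never solved score near zero under this measure, which matches the intuition that a curriculum should favor samples near the model's current ability.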


[78] Layer by layer, module by module: Choose both for optimal OOD probing of ViT cs.CV | cs.LG | stat.ML

Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko

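TL;DR: This paper systematically studies the representational power of intermediate layers in pretrained vision transformers (ViTs). It finds that distribution shift between downstream tasks and pretraining data is the main cause of degraded performance in deeper layers, and a fine-grained module-level analysis shows that the best probing location depends on the degree of shift: feedforward network activations under significant shift, and the normalized output of multi-head self-attention under weak shift.

Details

Motivation: Intermediate layers of foundation models are often more discriminative than the final layer. The study analyzes the behavior of intermediate layers in pretrained ViTs in depth, to identify the root cause of the performance degradation and to find the best location for linear probing.

Result: Extensive linear probing experiments across multiple image classification benchmarks show that probing feedforward network activations performs best under significant distribution shift, while probing the normalized output of the multi-head self-attention module is optimal under weak shift.

Insight: The study carries out a fine-grained analysis at the module level, shows that the standard practice of probing Transformer block outputs is suboptimal, and offers practical guidance for choosing the probing module according to the degree of distribution shift.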

Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.


[79] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers cs.CV

Guandong Li

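TL;DR: This paper proposes SpectralCache, a frequency-aware, error-bounded caching framework for accelerating inference in Diffusion Transformers (DiTs). It identifies and exploits the non-uniformity of the denoising process across time, depth, and feature dimensions, and designs three components (timestep-aware dynamic scheduling, cumulative error budgets, and frequency-decomposed caching) that significantly speed up inference while preserving generation quality.

Details

Motivation: Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they treat the denoising process as uniform across time, depth, and feature dimensions, which limits the achievable speedup. The paper addresses this limitation by analyzing and exploiting the non-uniformity of DiT denoising to design a more efficient caching strategy.

Result: On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727. It is 16% faster than TeaCache (2.12x speedup, LPIPS 0.215, SSIM 0.734) while maintaining comparable generation quality (LPIPS difference below 1%).

Insight: The core contribution is identifying three orthogonal axes of non-uniformity in DiT denoising (time, depth, feature) and building on them a unified, training-free, plug-and-play caching framework. Viewed objectively, combining frequency analysis (Frequency-Decomposed Caching) with error budget management (Cumulative Error Budgets) and dynamic scheduling (Timestep-Aware Dynamic Scheduling) is a novel and effective system-level optimization that may carry over to other iterative generative models.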

Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal – sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth – consecutive caching decisions lead to cascading approximation errors; and (3) feature – different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
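The headline comparison follows directly from the quoted figures. The short check below recomputes it, assuming that "16% in speed" and "LPIPS difference < 1%" are both meant as relative differences against TeaCache; that reading is an interpretation, since the abstract does not state how the percentages were computed.

```python
def relative_change(value: float, reference: float) -> float:
    """Signed change of value relative to reference (0.10 means +10%)."""
    return (value - reference) / reference


# Figures quoted in the abstract (FLUX.1-schnell, 512x512).
spectral = {"speedup": 2.46, "lpips": 0.217, "ssim": 0.727}
teacache = {"speedup": 2.12, "lpips": 0.215, "ssim": 0.734}

speed_gain = relative_change(spectral["speedup"], teacache["speedup"])
lpips_gap = abs(relative_change(spectral["lpips"], teacache["lpips"]))
ssim_gap = abs(relative_change(spectral["ssim"], teacache["ssim"]))

print(f"speed gain over TeaCache: {speed_gain:.1%}")  # about 16%
print(f"relative LPIPS gap:       {lpips_gap:.2%}")   # below 1%
print(f"relative SSIM gap:        {ssim_gap:.2%}")    # below 1%
```

Note that both quality metrics are slightly worse for SpectralCache (higher LPIPS, lower SSIM), so the claim is a speed gain at near-equal quality rather than an improvement on every axis.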


[80] Dark3R: Learning Structure from Motion in the Dark cs.CV

Andrew Y Guo, Anagh Malik, SaiKiran Tedla, Yutong Dai, Yiqian Qin

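TL;DR: Dark3R is a framework for structure from motion (SfM) under extreme low light (signal-to-noise ratio below -4 dB) that operates directly on raw images. Its core idea is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation. The framework needs no 3D supervision, is trained only on noisy-clean raw image pairs, and is evaluated on a new dataset of about 42,000 multi-view raw images.

Details

Motivation: Conventional feature-based and learning-based methods break down in dark conditions with extremely low signal-to-noise ratios (SNR < -4 dB). The paper aims to recover 3D structure and estimate camera poses in such conditions.

Result: The method achieves state-of-the-art (SOTA) structure from motion in the low-SNR regime. Using its predicted poses and a coarse-to-fine radiance field optimization procedure, it also achieves state-of-the-art novel view synthesis in the dark.

Insight: Teacher-student distillation transfers the knowledge of large-scale 3D foundation models to the extreme low-light domain and avoids costly 3D supervision. The training recipe, which needs only noisy-clean image pairs, is simple and effective. The proposed exposure-bracketed dataset provides a benchmark for low-light 3D vision research.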

Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB – a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy–clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes about 42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
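A Poisson-Gaussian noise model of the kind the abstract mentions can be sketched in a few lines. The version below is a generic textbook form, not the paper's pipeline: photon shot noise is Poisson in the expected electron count, read noise is additive Gaussian, and the `gain` and `read_sigma` values are hypothetical. The SNR helper uses the amplitude convention 20*log10(mean/std), which is an assumption, since the abstract does not define its SNR.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's algorithm; adequate for the small photon counts of low-light pixels."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def add_poisson_gaussian_noise(clean, gain=1.0, read_sigma=2.0, seed=0):
    """Return a noisy copy of `clean`, a list of expected electron counts per pixel.

    gain and read_sigma are hypothetical sensor parameters.
    """
    rng = random.Random(seed)
    return [gain * sample_poisson(x, rng) + rng.gauss(0.0, read_sigma) for x in clean]


def snr_db(signal: float, gain=1.0, read_sigma=2.0) -> float:
    """SNR in dB for one pixel, using 20*log10(mean / std) (assumed convention)."""
    noise_std = math.sqrt(gain ** 2 * signal + read_sigma ** 2)
    return 20.0 * math.log10(gain * signal / noise_std)


if __name__ == "__main__":
    clean = [1.0, 2.0, 5.0, 100.0]
    print(add_poisson_gaussian_noise(clean))
    print([round(snr_db(x), 1) for x in clean])
```

With these illustrative parameters, a pixel that collects about one electron on average falls below -4 dB, which gives a feel for how little signal the regime described in the abstract contains.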


[81] ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking cs.CV

Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao

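TL;DR: This paper proposes a new task, Omnidirectional Referring Multi-Object Tracking (ORMOT), to address the target loss and fragmented tracking that the limited field of view of conventional cameras causes in Referring Multi-Object Tracking (RMOT). The authors construct the ORSet dataset, with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, and develop ORTrack, a framework driven by a Large Vision-Language Model (LVLM). Experiments on ORSet validate the framework, and the data and code will be open-sourced.

Details

Motivation: Existing RMOT methods are developed mainly on conventional-camera datasets, whose limited field of view often lets targets move out of frame, interrupting tracking and losing contextual information. The paper extends RMOT to omnidirectional imagery to overcome this limitation and to improve the model's understanding of long-horizon language descriptions.

Result: Extensive experiments on the self-built ORSet dataset demonstrate the effectiveness of the proposed ORTrack framework, but the abstract reports no specific quantitative results (such as accuracy metrics) or comparisons with existing SOTA methods.

Insight: The contributions are: 1) the new ORMOT task, which extends RMOT to omnidirectional vision to remove the field-of-view limitation; 2) ORSet, the first omnidirectional referring multi-object tracking dataset, with rich visual, temporal, and language information; 3) ORTrack, a dedicated LVLM-based framework. Viewed objectively, by combining omnidirectional vision with language descriptions the work opens a new research direction and dataset resource for multi-object tracking.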

Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model’s ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.


[82] Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations cs.CV

Hajar Dekdegue, Moncef Garouani, Josiane Mothe, Jordan Bernigaud

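TL;DR: This paper proposes Fusion-CAM, a visual explanation framework that addresses the limitations of existing Class Activation Map (CAM) methods in explaining the decisions of deep convolutional neural networks. Through a dedicated fusion mechanism, it unifies gradient-based CAMs (such as Grad-CAM) and region-based CAMs (such as Score-CAM) to produce more robust and more discriminative visual explanations.

Details

Motivation: Existing CAM methods fall short in complementary ways. Gradient-based methods (such as Grad-CAM) provide highly discriminative, fine-grained detail but often yield noisy, incomplete maps that highlight only the most salient regions. Region-based methods (such as Score-CAM) capture broader object coverage but suffer from over-smoothing and reduced sensitivity to subtle features. The paper aims to bridge this explanatory gap.

Result: Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.

Insight: The novelty is a dedicated fusion mechanism comprising denoising, contribution-weighted fusion, and adaptive similarity-based pixel-level fusion. The mechanism evaluates the agreement between the two paradigms and dynamically adjusts the fusion strength, reinforcing consistent activations while softly blending conflicting regions, which yields richer, context-aware, and input-adaptive visual explanations.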

Abstract: Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
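The abstract does not give the fusion formula, so the following is only an illustrative sketch of what an agreement-driven pixel-level fusion could look like, not the authors' method. It assumes both maps are already normalized to [0, 1], measures per-pixel agreement as one minus the absolute difference, keeps the stronger activation where the maps agree, and falls back to their average where they conflict. The function name and the specific blending rule are hypothetical.

```python
def fuse_maps(grad_map, region_map):
    """Blend two activation maps (lists of rows, values in [0, 1]) pixel by pixel.

    Illustrative rule, not taken from the paper:
      agreement = 1 - |g - r|
      fused     = agreement * max(g, r) + (1 - agreement) * mean(g, r)
    """
    fused = []
    for g_row, r_row in zip(grad_map, region_map):
        row = []
        for g, r in zip(g_row, r_row):
            agreement = 1.0 - abs(g - r)
            row.append(agreement * max(g, r) + (1.0 - agreement) * 0.5 * (g + r))
        fused.append(row)
    return fused


if __name__ == "__main__":
    grad = [[1.0, 0.0]]
    region = [[1.0, 1.0]]
    print(fuse_maps(grad, region))  # agreeing pixel kept, conflicting pixel averaged
```

The point of the sketch is the structure (a per-pixel agreement term that controls the blend), which is the part the abstract describes; the exact similarity measure and weights used by Fusion-CAM would need to be taken from the paper itself.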


[83] Video-based Locomotion Analysis for Fish Health Monitoring cs.CV

Timon Palm, Clemens Seibold, Anna Hilsmann, Peter Eisert

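TL;DR: This paper presents a video-based locomotion analysis system for monitoring fish health. The system uses a tracking-by-detection multi-object tracking framework built around a YOLOv11 detector, and improves detection accuracy by studying different architecture configurations and incorporating multi-frame information.

Details

Motivation: Monitoring fish health is essential for early disease detection, animal welfare, and sustainable aquaculture, and the physiological and pathological condition of fish can be inferred from their locomotion activity.

Result: The system was evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, showing that it can reliably measure swimming direction and speed for health monitoring.

Insight: The novelty lies in integrating the recent YOLOv11 object detector into a multi-object tracking framework and exploring multi-frame fusion strategies to improve fish detection accuracy in video. The study also builds a dedicated fish locomotion analysis dataset and plans to release it publicly.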

Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.
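
Once a tracker has produced a per-fish trajectory, swimming speed and direction follow from simple geometry. The sketch below assumes a track is a list of (x, y) box centers in pixels, one per consecutive frame; this track format and the function name `locomotion_from_track` are illustrative assumptions, not the paper's interface.

```python
import math


def locomotion_from_track(track, fps):
    """Return (speed, heading) for each pair of consecutive frames.

    track: list of (x, y) centers in pixels, one per frame (assumed format).
    speed is in pixels per second; heading is in degrees in image
    coordinates (0 = +x axis, positive toward +y, which points down).
    """
    results = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) * fps
        heading = math.degrees(math.atan2(dy, dx))
        results.append((speed, heading))
    return results


# A fish that moves 5 pixels in one frame, then stays still, at 10 FPS.
steps = locomotion_from_track([(0, 0), (3, 4), (3, 4)], fps=10)
```

Converting pixel speeds to physical units would additionally need a camera calibration, which the abstract does not describe.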


[84] SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning cs.CV | cs.AI

Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim

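TL;DR: This paper proposes SAIL for weakly-supervised dense video captioning. The method builds semantically-aware masks through cross-modal alignment and introduces an LLM-based augmentation strategy that generates synthetic captions as additional alignment signals, enabling event localization and captioning in videos using caption annotations only.

Details

Motivation: Existing weakly-supervised dense video captioning methods focus only on generating non-overlapping masks and ignore the semantic relationship between masks and their corresponding events, which yields simplistic mask distributions that fail to capture semantically relevant regions. In addition, relying solely on ground-truth captions leads to sub-optimal performance because of the inherent sparsity of the datasets.

Result: Experiments on ActivityNet Captions and YouCook2 show that the method achieves state-of-the-art performance on both captioning and localization metrics.

Insight: The novelty lies in a similarity-aware training objective that guides masks to emphasize video regions highly similar to the event captions, and in using LLM-generated synthetic captions, incorporated through an inter-mask mechanism, as auxiliary guidance. Together these yield more accurate mask generation and temporal localization under sparse annotation.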

Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.


[85] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries cs.CV

Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata

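TL;DR: This paper proposes NaiLIA, a multimodal retrieval method for nail design images based on dense intent descriptions and palette queries. It introduces a confidence-based relaxed loss to align unlabeled images with descriptions, and is validated on a newly built benchmark of 10,625 images with dense intent descriptions provided by over 200 annotators.

Details

Motivation: Existing vision-language foundation models struggle to handle the dense, multi-layered intent descriptions that users give for nail designs (covering painted elements, embellishments, visual characteristics, themes, and overall impressions), as well as palette queries that specify subtle, continuous color nuances via a color picker. A retrieval method that comprehensively aligns with these multimodal inputs is therefore needed.

Result: On the newly built benchmark of 10,625 images annotated with dense intent descriptions, NaiLIA outperforms standard methods in retrieval performance.

Insight: The novelty lies in a multimodal alignment retrieval framework designed specifically for dense intent descriptions and palette queries, together with a confidence-based relaxed loss for aligning unlabeled images, which offers a new approach to fine-grained, multi-attribute image retrieval.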

Abstract: We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.


[86] RealWonder: Real-Time Physical Action-Conditioned Video Generation cs.CV | cs.AI | cs.GR

Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu

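TL;DR: RealWonder is a real-time action-conditioned video generation system that produces videos simulating physical actions (such as forces and robotic manipulation) from a single image. Its core idea is to use physics simulation as an intermediate bridge that translates continuous actions into optical flow and RGB representations a video model can process, enabling real-time simulation of physical interactions with rigid objects, deformable bodies, fluids, and granular materials.

Details

Motivation: Existing video generation models lack a structural understanding of how actions affect 3D scenes and cannot simulate the physical consequences of 3D actions such as forces and robotic manipulation.

Result: The system runs at 13.2 FPS at 480x832 resolution, supporting interactive exploration of how forces, robot actions, and camera controls affect a range of physical materials.

Insight: The novelty lies in using physics simulation as the intermediate representation for encoding actions. Combining single-image 3D reconstruction, physics simulation, and a distilled video generator that needs only 4 diffusion steps yields real-time physically interactive video generation, opening new applications in immersive experiences, AR/VR, and robot learning.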

Abstract: Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/


[87] Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes cs.CV

Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

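TL;DR: This paper proposes the Longest Stable Prefix (LSP) scheduler to address the KV cache fragmentation, poor memory locality, and costly repeated repairs caused by 'scattered acceptance' decoding in diffusion language model (DLM) inference. In each denoising step, LSP dynamically identifies and atomically commits a contiguous, left-aligned prefix of stable predictions, converting fragmented KV cache updates into efficient contiguous appends and substantially accelerating inference.

Details

Motivation: Diffusion language models (DLMs) support highly parallel text generation in theory, but their practical inference speed is often limited by suboptimal decoding schedulers. The standard 'scattered acceptance' approach commits high-confidence tokens at non-contiguous positions in the sequence, which fragments the KV cache, destroys memory locality, and forces the model into costly repeated repairs across unstable token boundaries.

Result: Extensive evaluations on LLaDA-8B and Dream-7B show that LSP accelerates inference by up to 3.4x on rigorous benchmarks covering mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality.

Insight: The core novelty is a training-free, model-agnostic inference paradigm based on monolithic prefix absorption that restructures the topology of token commitment. Technical highlights are a systems-level optimization that converts fragmented KV cache updates into efficient contiguous appends, and an algorithmic optimization that preserves bidirectional lookahead over a geometrically shrinking active suffix, sharply reducing token flip rates and denoiser calls. Together these bridge the gap between the theoretical parallelism of DLMs and practical hardware efficiency.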

Abstract: Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on ‘scattered acceptance’: committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
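
The prefix-selection step lends itself to a short sketch. The abstract does not say how stability is scored, so the version below makes three labeled assumptions: a token counts as stable when its confidence meets a fixed threshold, the delimiter set is a small hand-picked one, and a run with no delimiter is committed whole. The function name `longest_stable_prefix` and the constant `DELIMITERS` are hypothetical.

```python
# Hypothetical delimiter set; the paper's actual delimiters are not listed.
DELIMITERS = {".", ",", ";", "\n"}


def longest_stable_prefix(tokens, confidences, threshold=0.9):
    """Return how many leading tokens to commit in this denoising step.

    Assumption: a token is 'stable' when its confidence >= threshold.
    """
    # 1) Length of the contiguous, left-aligned run of stable tokens.
    run = 0
    for conf in confidences:
        if conf < threshold:
            break
        run += 1
    # 2) Snap the boundary back to the last delimiter inside that run.
    for end in range(run, 0, -1):
        if tokens[end - 1] in DELIMITERS:
            return end
    # 3) No delimiter found: commit the whole run (an assumption).
    return run


tokens = ["The", "cat", "sat", ".", "It", "ran", "fast"]
confidences = [0.99, 0.95, 0.97, 0.93, 0.96, 0.50, 0.99]
commit_len = longest_stable_prefix(tokens, confidences)  # commits through "."
```

Note that the high-confidence final token is not committed: because only a left-aligned block is accepted, the committed region always grows as a contiguous append, which is the property the abstract credits for avoiding KV cache fragmentation.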


[88] EdgeDAM: Real-time Object Tracking for Mobile Devices cs.CV

Syed Muhammad Raza, Syed Murtaza Hussain Abidi, Khawar Islam, Muhammad Ibrahim, Ajmal Saeed Mian

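TL;DR: This paper proposes EdgeDAM, a lightweight detection-guided tracking framework for mobile devices that targets real-time single-object tracking under occlusion, distractors, and fast motion. It uses a dual-buffer distractor-aware memory and a confidence-driven switching mechanism to improve tracking robustness while sustaining a high frame rate.

Details

Motivation: Existing segmentation-based distractor-aware memory mechanisms are computationally expensive and hard to deploy in real time on resource-constrained edge devices, while lightweight trackers tend to drift when visually similar distractors appear. The paper aims to design an edge-device tracking framework that is both real-time and robust.

Result: Extensive experiments on five benchmarks, including DiDi, report 88.2% accuracy on the DiDi dataset and real-time performance of 25 FPS on an iPhone 15, demonstrating improved robustness under occlusion and fast motion.

Insight: The main contributions are: 1) a dual-buffer distractor-aware memory that combines a Recent-Aware Memory, which keeps target hypotheses temporally consistent, with a Distractor-Resolving Memory, which explicitly penalizes distractors; 2) confidence-driven switching with held-box stabilization, which adaptively activates detection and memory-guided re-identification during occlusion and suppresses distractor contamination by temporarily freezing and expanding the estimated box. Together they form a lightweight redesign of distractor-aware memory for bounding-box tracking.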

Abstract: Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.


[89] HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token cs.CV

Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

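TL;DR: This paper proposes HALP, a method that predicts hallucination risk before a vision-language model generates any text token by probing the model's internal representations in a single forward pass. Validated across a range of tasks and eight modern VLMs, it shows that strong hallucination detection is possible without decoding.

Details

Motivation: Existing hallucination detection methods typically run after text generation, which makes intervention costly and untimely. The paper investigates whether hallucination risk can be predicted before generation.

Result: Across eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, the probes reach up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B, showing that strong hallucination detection is achievable without decoding.

Insight: The novelty lies in predicting hallucination risk before generation by probing internal representations: visual features, vision-token representations, and query-token representations. The analysis shows that hallucination risk is detectable before generation and that the most informative layer and modality vary across architectures. Lightweight probes could enable early abstention, selective routing, and adaptive decoding to improve safety and efficiency.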

Abstract: Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model’s internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.


[90] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline cs.CV

Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou

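TL;DR: This paper introduces MM-Lifelong, a dataset for multimodal lifelong understanding that contains 181.1 hours of video organized at Day, Week, and Month scales to reflect natural, unscripted daily life. The study finds that existing methods suffer from a working memory bottleneck and global localization collapse, and proposes the Recursive Multimodal Agent (ReMA), which iteratively updates a recursive belief state through dynamic memory management and significantly outperforms existing methods.

Details

Motivation: Existing video understanding datasets typically consist of densely concatenated clips that differ from natural, unscripted daily life, so a multimodal lifelong understanding dataset that captures varying temporal densities and is closer to real life is needed.

Result: Extensive evaluations on MM-Lifelong show that the proposed ReMA significantly outperforms existing methods, addressing the working memory bottleneck of end-to-end MLLMs and the global localization collapse of representative agentic baselines.

Insight: The novelty lies in a multimodal lifelong understanding dataset structured by time scale (Day, Week, Month) and in ReMA, which iteratively updates a belief state through dynamic memory management to handle the memory and localization challenges of long, sparse timelines. The dataset splits are designed to isolate temporal and domain biases, providing a rigorous foundation for research on supervised learning and out-of-distribution generalization.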

Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.


[91] Accelerating Text-to-Video Generation with Calibrated Sparse Attention cs.CV

Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

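TL;DR: This paper proposes CalibAtt, a training-free method that accelerates text-to-video generation with calibrated sparse attention. It builds on the observation that many token-to-token connections in attention have negligible scores and that their patterns repeat. An offline calibration pass identifies stable block-level sparsity and repetition patterns, and unimportant connections are skipped at inference time, giving end-to-end speedup while maintaining generation quality.

Details

Motivation: Existing diffusion-based methods for high-quality video generation run slowly because of spatiotemporal attention. The paper aims to accelerate inference by cutting unnecessary attention computation.

Result: Experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at multiple resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.

Insight: The novelty lies in exploiting the stable sparsity and repetition patterns of attention scores for hardware-efficient sparsification, which accelerates generation without retraining. By calibrating offline and compiling the patterns into optimized operations, the method balances computational efficiency against generation quality.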

Abstract: Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
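
The offline calibration idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not CalibAtt itself: it treats one layer, head, and timestep, takes precomputed block-level attention scores for several calibration inputs, and marks a block as skippable only if its score is below a hypothetical threshold `eps` on every input. The function name `calibrate_block_mask` is also hypothetical, and the repetition patterns and compiled kernels mentioned in the abstract are not modeled.

```python
def calibrate_block_mask(block_scores_per_input, eps=0.01):
    """Build a keep-mask over attention blocks from calibration data.

    block_scores_per_input: list of 2D lists, one per calibration input,
    where entry [i][j] is the aggregate attention score between query
    block i and key block j (assumed to be precomputed).
    Returns a 2D list of booleans: True means the block must be computed,
    False means it was negligible on every calibration input.
    """
    n_rows = len(block_scores_per_input[0])
    n_cols = len(block_scores_per_input[0][0])
    mask = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            keep = any(scores[i][j] >= eps for scores in block_scores_per_input)
            row.append(keep)
        mask.append(row)
    return mask


# Two calibration inputs over a 2x2 grid of blocks.
input_a = [[0.5, 0.001], [0.002, 0.4]]
input_b = [[0.6, 0.003], [0.05, 0.3]]
mask = calibrate_block_mask([input_a, input_b])
skipped = sum(not keep for row in mask for keep in row) / 4
```

Requiring a block to be negligible on all calibration inputs is a conservative choice: the lower-left block is kept because it matters for one input, which reflects the abstract's emphasis on patterns that are stable across inputs.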


[92] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning cs.CV

Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu

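TL;DR: FaceCam is a system that generates video under customizable camera trajectories from monocular portrait video input. It proposes a face-tailored, scale-aware camera representation to avoid geometric distortions and visual artifacts, and trains a video generation model on multi-view studio captures and in-the-wild monocular videos, achieving superior camera controllability, visual quality, and identity and motion preservation.

Details

Motivation: Existing camera control methods based on large video generation models produce geometric distortions and visual artifacts on portrait videos because of scale-ambiguous camera representations or 3D reconstruction errors.

Result: Experiments on the Ava-256 dataset and diverse in-the-wild videos show that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.

Insight: The contributions include a face-tailored scale-aware camera representation that provides deterministic conditioning without relying on 3D priors, and two strategies for generating camera-control data, synthetic camera motion and multi-shot stitching, which exploit stationary training cameras while generalizing to dynamic, continuous trajectories.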

Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.


[93] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups cs.CV | cs.GR

Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko

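TL;DR: This paper proposes a transformer-based inpainting method for real-time 3D streaming in sparse multi-camera setups, aimed at immersive AR/VR applications. Acting as an image-based post-processing step independent of the underlying representation, it completes missing textures in rendered images using a multi-view aware transformer architecture with spatio-temporal embeddings, keeping frames consistent while preserving detail. A resolution-independent design and an adaptive patch selection strategy balance real-time performance against quality.

Details

Motivation: In real-time 3D streaming with sparse multi-camera setups, the limited number of views causes missing textures and incomplete surfaces. Existing heuristic approaches tend to produce inconsistencies or visual artifacts, so an efficient, general inpainting method is needed.

Result: Compared with state-of-the-art inpainting techniques under the same real-time constraints, the method achieves the best trade-off between quality and speed and outperforms competitors on both image and video metrics.

Insight: The contributions include designing inpainting as a general post-processing module independent of the 3D representation, a multi-view aware transformer architecture with spatio-temporal embeddings for consistency, and a resolution-independent design with adaptive patch selection that balances real-time performance and output quality.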

Abstract: High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method that is independent of the underlying representation and runs as an image-based post-processing step after novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.


cs.RO

[94] OpenFrontier: General Navigation with Visual-Language Grounded Frontiers cs.RO | cs.CV

Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum

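TL;DR: The paper proposes OpenFrontier, a training-free navigation framework that recasts navigation as a sparse subgoal identification and reaching problem. By selecting navigation frontiers as semantic anchors and integrating diverse vision-language prior models, it achieves efficient zero-shot navigation without dense 3D reconstruction, policy training, or model fine-tuning.

Details

Motivation: In open-world navigation, conventional approaches rely on dense 3D reconstruction and hand-crafted goal metrics, which limits generalization, while existing vision-language navigation models typically require interactive training, large-scale data collection, or task-specific fine-tuning.

Result: The framework shows strong zero-shot performance across multiple navigation benchmarks and is deployed effectively in the real world on a mobile robot.

Insight: The core idea is to use navigation frontiers as semantic anchors that give high-level semantic priors visually anchored targets, enabling efficient goal-conditioned navigation. The training-free, lightweight system design, with no dense mapping or model fine-tuning, is key to its practicality.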

Abstract: Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision–language navigation (VLN) and vision–language–action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision–language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.


cs.AI

[95] Adaptive Memory Admission Control for LLM Agents cs.AI | cs.CL | cs.MA

Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao

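TL;DR: This paper proposes Adaptive Memory Admission Control (A-MAC), a framework that addresses the lack of control over long-term memory in LLM-based agents. It treats memory admission as a structured decision problem, decomposing memory value into five interpretable factors (future utility, factual confidence, semantic novelty, temporal recency, and content type prior) and combining lightweight rule-based feature extraction with LLM-assisted utility assessment to learn domain-adaptive admission policies.

Details

Motivation: Long-term memory management in current LLM-based agents is flawed: agents either accumulate large volumes of conversational content containing hallucinated or obsolete information, or depend on opaque, fully LLM-driven memory policies that are costly and hard to audit. Memory admission thus remains a poorly specified and weakly controlled component of agent architectures.

Result: Experiments on the LoCoMo benchmark show that A-MAC achieves a better precision-recall tradeoff, raising F1 to 0.583 while reducing latency by 31% compared with state-of-the-art LLM-native memory systems. Ablations identify the content type prior as the most influential factor for reliable memory admission.

Insight: The novelty lies in casting memory admission explicitly as a structured decision problem and in a decomposition into five interpretable factors that combines rule-based features with LLM-assisted assessment for transparent and efficient control. This offers a key design principle for scalable, reliable memory systems in LLM agents.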

Abstract: LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule-based feature extraction with a single LLM-assisted utility assessment, and learns domain-adaptive admission policies through cross-validated optimization. This design enables transparent and efficient control over long-term memory. Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents.
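
A factor-based admission decision of the kind described above can be sketched as a weighted score compared against a threshold. The five factor names come from the abstract; everything else is an assumption for illustration: the linear combination, the weights, the threshold, and the names `MemoryCandidate`, `admission_score`, and `admit`. The paper learns its policy through cross-validated optimization, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MemoryCandidate:
    """One candidate memory, with each factor scored in [0, 1]."""
    future_utility: float
    factual_confidence: float
    semantic_novelty: float
    temporal_recency: float
    content_type_prior: float


# Hypothetical weights chosen for illustration; they sum to 1.
WEIGHTS = {
    "future_utility": 0.25,
    "factual_confidence": 0.25,
    "semantic_novelty": 0.125,
    "temporal_recency": 0.125,
    "content_type_prior": 0.25,
}


def admission_score(candidate, weights=WEIGHTS):
    """Weighted sum of the five factors (an assumed linear policy)."""
    return sum(w * getattr(candidate, name) for name, w in weights.items())


def admit(candidate, threshold=0.5, weights=WEIGHTS):
    """Admit the candidate into long-term memory if its score is high enough."""
    return admission_score(candidate, weights) >= threshold


candidate = MemoryCandidate(0.8, 0.9, 0.4, 0.2, 1.0)
decision = admit(candidate)
```

Because every factor and weight is explicit, each decision can be traced back to the terms that drove it, which is the auditability the abstract contrasts with fully LLM-driven memory policies.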


[96] Using Vision + Language Models to Predict Item Difficulty cs.AI | cs.CL | cs.CV

Samin Khan

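TL;DR: This study explores how well large language models (LLMs) can predict the difficulty of data visualization literacy test items. It predicts the proportion of correct responses among U.S. adults from features of the item text (question and answer options), the visualization image, or both, and finds that the multimodal approach combining visual and text features has the lowest prediction error.

Details

Motivation: The study asks how LLMs can be used to assess the difficulty of data visualization test items automatically, in support of psychometric analysis and automated item development.

Result: The multimodal model combining visual and text features achieved the lowest mean absolute error (MAE of 0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. On the held-out test set used for external evaluation, the best multimodal model achieved a mean squared error of 0.10805.

Insight: The novelty lies in applying LLMs to psychometrics, specifically predicting item difficulty through multimodal (vision plus language) feature fusion, which demonstrates the potential of LLMs for automated educational assessment.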

Abstract: This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.
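
The two error metrics reported above are easy to confuse, so a small sketch may help. Item difficulty here is the proportion of correct responses, a value between 0 and 1. The difficulty values below are made up for illustration and are not from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted difficulty."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared gap; penalizes large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical proportions of correct responses for three test items.
observed = [0.5, 0.75, 0.25]
predicted = [0.25, 0.75, 0.5]

mae = mean_absolute_error(observed, predicted)
mse = mean_squared_error(observed, predicted)
```

Since the abstract reports MAE for the model comparison but MSE for the held-out set, the two figures are on different scales and should not be compared directly.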


[97] Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models cs.AI | cs.CL | cs.LG

Jihoon Jeong

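TL;DR: This paper proposes Model Medicine, a research framework that treats AI models by analogy with biological organisms and aims to systematically understand, diagnose, treat, and prevent "disorders" in AI models. It establishes a discipline taxonomy of 15 subdisciplines, proposes the Four Shell Model, a behavioral genetics framework for explaining model behavior, develops an open-source diagnostic tool called Neural MRI, and builds a five-layer diagnostic framework and clinical practice tools.

Details

Motivation: There is a gap between current AI interpretability research (likened to anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. The paper aims to bridge this gap by establishing a systematic, medicine-like clinical science framework for AI models.

Result: The Four Shell Model is grounded in empirical data from 720 agents and 24,923 decisions in the Agora-12 program. The Neural MRI diagnostic tool was validated on four clinical cases demonstrating imaging, comparison, localization, and predictive capability.

Insight: The core novelty is the systematic introduction of the medical paradigm into AI model analysis: a complete Model Medicine discipline system, a behavioral genetics explanatory framework, a diagnostic tool that maps neuroimaging modalities to AI interpretability techniques, and a systematic approach connecting diagnosis to treatment. The proposed Layered Core Hypothesis also offers a biologically inspired view of model parameter architecture.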

Abstract: Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models – like biological organisms – have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions – Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core–Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis – a biologically-inspired three-layer parameter architecture – and a therapeutic framework connecting diagnosis to treatment.


[98] Interactive Benchmarks cs.AI | cs.CL | cs.LG

Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang

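TL;DR: The paper proposes Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through an interactive process under budget constraints, addressing the unreliability of standard benchmarks caused by saturation, subjectivity, and poor generalization. The framework is instantiated in two settings, Interactive Proofs and Interactive Games, to evaluate a model's ability to acquire information actively. The results show that interactive benchmarks assess model intelligence more robustly and faithfully, and reveal substantial room for improvement in interactive scenarios.

Details

Motivation: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization, so measuring a model's intelligence more accurately requires evaluating its ability to acquire information actively.

Result: The framework is instantiated in the Interactive Proofs and Interactive Games settings. The results show that interactive benchmarks provide a robust and faithful assessment of model intelligence and that models still have substantial room to improve in interactive scenarios.

Insight: The novelty lies in a unified interactive evaluation paradigm that assesses reasoning through active interaction (such as logical deduction and strategic games) under budget constraints, pointing toward evaluation that goes beyond static benchmarks and is closer to real-world intelligence.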

Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model’s ability to acquire information actively is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model’s reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench


[99] Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction cs.AI | cs.CL

Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou

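TL;DR: The paper proposes RLSTA (Reinforcement Learning with Single-Turn Anchors), a training approach that addresses "contextual inertia" in multi-turn LLM interactions: the tendency of a model to cling to its earlier reasoning path and ignore new information or corrections. The approach uses the model's strong single-turn performance as stable internal anchors that provide reward signals, guiding the model to integrate the latest information and self-calibrate its reasoning across turns. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods at stabilizing multi-turn interaction and generalizes strongly across domains.

Details

Motivation: LLMs reason well when given full information in a single turn, but in multi-turn interactions where information is revealed incrementally or needs updating, they often fail to integrate new constraints, and performance drops sharply relative to the single-turn baseline. The paper identifies the root cause as "contextual inertia": the model rigidly follows its previous reasoning trace even when the user explicitly provides corrections or new data in later turns.

Result: Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods at stabilizing multi-turn interaction. It generalizes strongly across domains (for example, from math to code) and remains effective even without external verifiers, highlighting its potential for general-domain applications.

Insight: The core idea is to use the model's strong single-turn capability as a stable internal anchor for generating reinforcement learning reward signals, guiding the model to break contextual inertia and self-calibrate in multi-turn interaction. Viewed objectively, this is a novel training paradigm that improves multi-turn robustness using the model's own capabilities rather than external tools, with good generality and scalability.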

Abstract: While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model’s superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.


[100] On Multi-Step Theorem Prediction via Non-Parametric Structural Priors cs.AI | cs.CVPDF

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang

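TL;DR: This paper tackles multi-step theorem prediction with a training-free method based on non-parametric structural priors. It builds Theorem Precedence Graphs that encode temporal dependencies from historical solution traces and combines them with retrieval-augmented graph construction and a stepwise symbolic executor, so that an LLM can act as a structured planner without gradient-based optimization. On the FormalGeo7k benchmark the method reaches 89.29% accuracy, substantially outperforming in-context learning baselines and matching state-of-the-art supervised models.

Details

Motivation: Existing neural-symbolic approaches rely mainly on supervised parametric models, which generalize poorly to evolving theorem libraries. The paper explores training-free theorem prediction based on in-context learning and targets the bottleneck it calls Structural Drift: the performance of vanilla in-context learning degrades sharply as reasoning depth increases.

Result: On the FormalGeo7k benchmark, the proposed method achieves 89.29% accuracy, substantially outperforming in-context learning baselines and matching state-of-the-art supervised models.

Insight: The novelty lies in Theorem Precedence Graphs, which encode temporal dependencies as explicit topological constraints and effectively prune the reasoning search space. Combined with retrieval-augmented graph construction and a stepwise symbolic executor, they let an LLM work as a structured planner without training. This offers a new direction for scaling LLM-based symbolic reasoning.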

Abstract: Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM’s inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
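
To make the pruning idea concrete, here is a minimal Python sketch of a precedence graph built from solution traces. It is an illustration under assumptions, not the paper's construction: the function names, the theorem names, and the rule "keep a candidate only if it directly followed the last applied theorem in some historical trace" are all hypothetical, and the retrieval and symbolic-execution components are omitted.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Add a directed edge a -> b whenever theorem b directly follows a in a trace.

    Assumption: "precedence" is approximated by adjacency in historical traces.
    """
    graph = defaultdict(set)
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            graph[prev].add(nxt)
    return graph


def allowed_next(graph, last_theorem, candidates):
    """Prune candidates to those that followed last_theorem in some trace."""
    successors = graph.get(last_theorem, set())
    return [c for c in candidates if c in successors]


# Hypothetical theorem names, for illustration only.
traces = [
    ["parallel_lines", "alternate_angles", "triangle_angle_sum"],
    ["parallel_lines", "corresponding_angles"],
]
graph = build_precedence_graph(traces)
pruned = allowed_next(
    graph,
    "parallel_lines",
    ["triangle_angle_sum", "alternate_angles", "corresponding_angles"],
)
```

In this toy setup an LLM planner would be offered only the entries in `pruned` as next steps, which is one simple way a topological constraint can shrink the search space.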


[101] Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry cs.AI | cs.CLPDF

Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha

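TL;DR: This paper introduces the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry, and builds a multimodal interaction dataset annotated across speech, gesture, and action modalities. It evaluates two paradigms for modeling common ground (CG): prompted large language models (LLMs) and an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL). DPIP proves challenging for modern LLMs' ability to track task progression and belief state.

Details

Motivation: Current AI systems struggle to establish common ground (shared beliefs and mutually recognized facts) in multimodal, multiparty collaboration where participants hold different information.

Result: On the annotated DPIP data, the task challenges modern LLMs' ability to track both task progression and belief state. An axiomatic pipeline based on Dynamic Epistemic Logic (DEL) is also proposed to perform the same task incrementally.

Insight: The novelty lies in the DPIP task and its accompanying multimodally annotated dataset for systematically studying common ground construction under epistemic asymmetry. Viewed objectively, comparing formal logical reasoning (DEL) with data-driven LLM methods offers a new perspective on the reasoning limits of AI systems in multimodal collaboration.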

Abstract: Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs’ abilities to track both task progression and belief state.


[102] WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces cs.AI | cs.CVPDF

Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling

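TL;DR: This paper introduces WebChain, a large-scale, open-source, human-annotated dataset of real-world web interaction trajectories intended to accelerate reproducible research on web agents. The dataset contains 31,725 trajectories and 318k steps, and its core feature is a "Triple Alignment" of visual, structural, and action data that provides rich multimodal supervision. Building on the dataset, the authors propose a "Dual Mid-Training" recipe that decouples spatial grounding from planning and achieves state-of-the-art performance on the proposed WebChainBench and other public GUI benchmarks.

Details

Motivation: Existing synthetic methods struggle to cover complex, high-value web interaction tasks, and there is a lack of large-scale, high-quality, multimodally aligned real-world interaction data for training and evaluating scalable web agents.

Result: The proposed Dual Mid-Training recipe achieves state-of-the-art performance on the authors' WebChainBench and on other public GUI benchmarks (e.g., MiniWoB++).

Insight: The core novelty is a large-scale real-world web interaction dataset with multimodal alignment (visual, structural, action), together with a training recipe that decouples spatial grounding from planning. This provides the data and insights needed to build and rigorously evaluate the next generation of scalable web agents.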

Abstract: We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.


cs.IR [Back]

[103] FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents cs.IR | cs.AI | cs.CLPDF

Eric Y. Kim, Jie Huang

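TL;DR: The paper presents FinRetrieval, a benchmark for evaluating how well AI agents retrieve specific numeric values from structured databases. It contains 500 financial retrieval questions and collects agent responses, with complete tool call traces, from 14 configurations across three frontier providers (Anthropic, OpenAI, and Google). The evaluation finds that tool availability dominates performance: Claude Opus reaches 90.8% accuracy with structured data APIs but only 19.8% with web search alone. The study also analyzes the benefit of reasoning mode and the causes of geographic performance gaps, and releases the dataset to support research on financial AI systems.

Details

Motivation: No benchmark currently evaluates the ability of AI agents to retrieve specific numeric values from structured databases, which hinders the development of AI systems that assist financial research.

Result: On FinRetrieval, Claude Opus reaches 90.8% accuracy with structured data APIs and drops to 19.8% with web search alone, a 71-percentage-point gap that far exceeds that of other providers. The benefit of reasoning mode varies inversely with base capability (+9.0 percentage points for OpenAI vs. +2.8 for Claude). US data shows a 5.6-percentage-point advantage that stems from fiscal year naming conventions.

Insight: The novelty lies in building the first benchmark focused on numeric retrieval of financial data and in systematically evaluating how strongly tool availability affects agent performance. Viewed objectively, the study shows that tool-use strategy, rather than pure reasoning ability, is the main source of performance differences among current agents, and that domain-specific knowledge (such as fiscal year conventions) significantly affects results, which gives a clear direction for improving financial AI systems.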

Abstract: AI agents increasingly assist with financial research, yet no benchmark evaluates their ability to retrieve specific numeric values from structured databases. We introduce FinRetrieval, a benchmark of 500 financial retrieval questions with ground truth answers, agent responses from 14 configurations across three frontier providers (Anthropic, OpenAI, Google), and complete tool call execution traces. Our evaluation reveals that tool availability dominates performance: Claude Opus achieves 90.8% accuracy with structured data APIs but only 19.8% with web search alone, a 71 percentage point gap that exceeds other providers by 3-4x. We find that reasoning mode benefits vary inversely with base capability (+9.0pp for OpenAI vs +2.8pp for Claude), explained by differences in base-mode tool utilization rather than reasoning ability. Geographic performance gaps (5.6pp US advantage) stem from fiscal year naming conventions, not model limitations. We release the dataset, evaluation code, and tool traces to enable research on financial AI systems.


[104] Core-based Hierarchies for Efficient GraphRAG cs.IR | cs.CLPDF

Jakir Hossain, Ahmet Erdem Sarıyüce

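TL;DR: This paper proposes a GraphRAG framework based on k-core decomposition to address the non-reproducible community partitions that existing Leiden-based GraphRAG methods produce on sparse knowledge graphs. It builds a deterministic, density-aware hierarchy and combines it with lightweight heuristics and a token-budget-aware sampling strategy to improve answer comprehensiveness and diversity while reducing LLM cost.

Details

Motivation: Existing vector-based RAG methods often fail on global sensemaking tasks that require reasoning across many documents, and the Leiden clustering that current GraphRAG methods rely on admits a large number of near-optimal partitions on sparse knowledge graphs, which makes community detection results non-reproducible.

Result: On real-world datasets including financial earnings transcripts, news articles, and podcasts, with three LLMs generating answers and five independent LLM judges performing head-to-head evaluation, the method consistently improves answer comprehensiveness and diversity across datasets and models while reducing token usage.

Insight: The novelty lies in replacing non-deterministic Leiden clustering with deterministic k-core decomposition to build hierarchical communities, and in proposing lightweight heuristics that use the k-core hierarchy to construct size-bounded, connectivity-preserving communities, together with a token-budget-aware sampling strategy. The result is an efficient and reproducible framework for global sensemaking.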

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be recursively summarized. Current GraphRAG approaches rely on Leiden clustering for community detection, but we prove that on sparse knowledge graphs, where average degree is constant and most nodes have low degree, modularity optimization admits exponentially many near-optimal partitions, making Leiden-based communities inherently non-reproducible. To address this, we propose replacing Leiden with k-core decomposition, which yields a deterministic, density-aware hierarchy in linear time. We introduce a set of lightweight heuristics that leverage the k-core hierarchy to construct size-bounded, connectivity-preserving communities for retrieval and summarization, along with a token-budget-aware sampling strategy that reduces LLM costs. We evaluate our methods on real-world datasets including financial earnings transcripts, news articles, and podcasts, using three LLMs for answer generation and five independent LLM judges for head-to-head evaluation. Across datasets and models, our approach consistently improves answer comprehensiveness and diversity while reducing token usage, demonstrating that k-core-based GraphRAG is an effective and efficient framework for global sensemaking.
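
For readers unfamiliar with k-core decomposition, the following Python sketch shows the standard peeling procedure: repeatedly remove a node of minimum remaining degree and record the largest minimum degree seen so far as its core number. It is a simplified bucket-based version for illustration, not the paper's implementation and not the fully optimized linear-time algorithm; the function names and the toy graph are assumptions. The core numbers it returns do not depend on the order in which ties are broken, which is the determinism the abstract refers to.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core number} for an undirected simple graph given as edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    max_deg = max(degree.values(), default=0)
    buckets = [set() for _ in range(max_deg + 1)]
    for node, d in degree.items():
        buckets[d].add(node)

    core = {}
    k = 0
    for _ in range(len(degree)):
        d = 0
        while not buckets[d]:  # lowest non-empty degree bucket
            d += 1
        k = max(k, d)  # core numbers never decrease during peeling
        node = buckets[d].pop()
        core[node] = k
        for nbr in adj[node]:
            if nbr not in core:  # move each remaining neighbour down one bucket
                buckets[degree[nbr]].discard(nbr)
                degree[nbr] -= 1
                buckets[degree[nbr]].add(nbr)
    return core


def k_core_levels(core):
    """Nested hierarchy: level k holds every node whose core number is at least k."""
    top = max(core.values(), default=0)
    return {k: sorted(n for n, c in core.items() if c >= k) for k in range(1, top + 1)}


# Toy graph: a triangle (a, b, c) with a pendant node d attached to a.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
core = core_numbers(edges)
levels = k_core_levels(core)
```

Here the triangle forms the 2-core and the pendant node belongs only to the 1-core, so the levels are nested by density. How the paper turns such levels into size-bounded communities is not shown here.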


cs.NI [Back]

[105] Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey cs.NI | cs.CL | cs.PFPDF

Yasmin Moslem, John D. Kelleher

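TL;DR: This paper systematically surveys dynamic routing and cascading methods for LLM inference. It analyzes a range of routing paradigms (query difficulty, human preferences, clustering, uncertainty quantification, and others), proposes a three-dimensional conceptual framework for characterizing routing systems, and argues that a well-designed routing system can outperform the single strongest model by strategically exploiting the specialized capabilities of different models.

Details

Motivation: As LLMs with different capabilities, costs, and domains proliferate, static model deployment cannot adapt to the complexity and domain of each query, which leads to suboptimal performance and higher cost. Intelligent dynamic model selection that adapts to query characteristics is therefore needed.

Result: This is a survey paper and reports no specific quantitative results. Its systematic analysis indicates that well-designed routing systems, under operational constraints, can outperform the single strongest model and balance performance against efficiency.

Insight: The novelty lies in a three-dimensional conceptual framework (when decisions are made, what information is used, and how decisions are computed) that characterizes routing systems in a unified way, and in the observation that practical systems are usually compositional, integrating multiple paradigms under operational constraints. Viewed objectively, clearly separating multi-LLM routing from MoE architectures and systematically organizing the diverse paradigms and trade-offs of cross-model routing is valuable guidance for building efficient inference systems.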

Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
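
As a rough illustration of the cascading idea in the survey above, the sketch below tries models in order of increasing cost and stops at the first answer whose confidence clears a threshold. Everything in it is hypothetical: the stub models, the word-count confidence heuristic, and the 0.8 threshold are placeholders chosen for the example, and real systems would use one of the signals the survey covers (for instance uncertainty estimates or a learned router).

```python
from typing import Callable, List, Tuple

# A model maps a query to (answer, confidence in [0, 1]); this interface is an assumption.
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Tuple[str, Model]], threshold: float = 0.8):
    """Query models cheapest-first; return the first confident answer.

    If no model is confident, the answer from the last (most capable) model is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return name, answer


def small_model(query: str) -> Tuple[str, float]:
    # Stub: pretend the small model is only confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return "small:" + query, confidence


def large_model(query: str) -> Tuple[str, float]:
    # Stub: pretend the large model is always confident.
    return "large:" + query, 0.95


models = [("small", small_model), ("large", large_model)]
easy = cascade("What is 2+2?", models)
hard = cascade("Prove that the sum of two even integers is always even", models)
```

With these stubs the short query is answered by the small model and the long one escalates to the large model, which is the cost-saving behaviour a cascade aims for.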


cs.LG [Back]

[106] Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series cs.LG | cs.AI | cs.CLPDF

Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang

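TL;DR: This paper proposes Aura, a framework for integrating multi-dimensional exogenous factors into aviation time series forecasting. Using a tripartite encoding mechanism, it encodes heterogeneous external information (such as maintenance records and weather data) according to its interaction mode and integrates it into time series models to improve forecasting accuracy.

Details

Motivation: Time series forecasting in real industrial settings must integrate multi-dimensional, heterogeneous exogenous factors (such as maintenance, weather, and operational data), and traditional unimodal models struggle to capture the heterogeneous interactions among these factors.

Result: Experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, show that Aura achieves state-of-the-art (SOTA) performance across all baselines and exhibits superior adaptability.

Insight: The novelty lies in explicitly organizing and encoding heterogeneous information according to the mode in which each exogenous factor interacts with the target time series (e.g., direct influence or periodic influence), and in using a tripartite encoding mechanism to integrate non-sequential context seamlessly, offering a general-purpose enhancement for aviation safety and reliability.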

Abstract: Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three distinct types of exogenous factors that influence temporal dynamics through distinct interaction modes. Based on this empirical insight, we propose Aura, a universal framework that explicitly organizes and encodes heterogeneous external information according to its interaction mode with the target time series. Specifically, Aura utilizes a tailored tripartite encoding mechanism to embed heterogeneous features into well-established time series models, ensuring seamless integration of non-sequential context. Extensive experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, demonstrate that Aura consistently achieves state-of-the-art performance across all baselines and exhibits superior adaptability. Our findings highlight Aura’s potential as a general-purpose enhancement for aviation safety and reliability.


[107] Knowledge Divergence and the Value of Debate for Scalable Oversight cs.LG | cs.CLPDF

Robin Young

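TL;DR: This paper uses a geometric framework of knowledge divergence to analyze how AI debate relates to reinforcement learning from AI feedback (RLAIF) in scalable oversight. It proves that the value of debate depends on the principal angles between the models' knowledge representation subspaces, and it identifies three types of knowledge divergence and their effect on the effectiveness of debate.

Details

Motivation: There is no formal connection between debate and RLAIF as scalable oversight methods, and no quantification of the conditions under which debate offers an advantage. The paper analyzes, from a geometric perspective, how knowledge divergence between models affects the effectiveness of debate.

Result: Theoretical analysis shows that when models have identical knowledge, debate reduces to an RLAIF-like single-agent method. When knowledge diverges, the debate advantage scales with a phase transition from a quadratic to a linear regime, and under compositional divergence there is a sharp threshold beyond which debate becomes ineffective.

Insight: The novelty lies in the first formal connection between debate and RLAIF and in using the geometry of principal angles to quantify how knowledge divergence affects the value of debate. This provides a theoretical basis for judging when adversarial oversight protocols are applicable and links to the problem of eliciting latent knowledge across models with complementary information.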

Abstract: AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate’s value through the geometry of knowledge divergence between debating models. Using principal angles between models’ representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to an RLAIF-like setting in which a single-agent method recovers the same optimum. When models possess divergent knowledge, the debate advantage scales with a phase transition from a quadratic regime (debate offers negligible benefit) to a linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide existence results showing that debate can achieve outcomes inaccessible to either model alone, alongside a negative result showing that sufficiently strong adversarial incentives cause coordination failure in the compositional regime, with a sharp threshold separating effective from ineffective debate. We offer the first formal connection between debate and RLAIF, a geometric foundation for understanding when adversarial oversight protocols are justified, and a connection to the problem of eliciting latent knowledge across models with complementary information.


[108] FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation cs.LG | cs.AI | cs.CVPDF

Min Tan, Junchao Ma, Yinfu Feng, Jiajun Ding, Wenwen Pan

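TL;DR: FedAFD is a multimodal federated learning framework that uses adversarial fusion and distillation to address the challenges of modality heterogeneity in client data and model heterogeneity, improving performance and efficiency on both the client and the server side.

Details

Motivation: Existing multimodal federated learning methods overlook personalized client performance and struggle with modality/task discrepancies and model heterogeneity. FedAFD aims to address these problems.

Result: Extensive experiments under both IID and non-IID settings show that FedAFD achieves superior performance and efficiency for both the client and the server.

Insight: The novel elements are a bi-level adversarial alignment strategy and a granularity-aware fusion module on the client side, and a similarity-guided ensemble distillation mechanism on the server side. These ideas can inform work on modality alignment and knowledge fusion in federated learning.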

Abstract: Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.


[109] WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation cs.LG | cs.AI | cs.CL | cs.SDPDF

Luca Della Libera, Cem Subakan, Mirco Ravanelli

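TL;DR: WavSLM is a single-stream speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. Without text supervision or text pretraining, it jointly models semantic and acoustic information in a single token stream, achieving competitive performance on consistency benchmarks and speech generation with fewer parameters and less training data, while supporting streaming inference.

Details

Motivation: Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains difficult because semantic and acoustic information are entangled. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective for text.

Result: WavSLM achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data, and it supports streaming inference.

Insight: The novelty lies in quantizing and distilling self-supervised speech representations (WavLM) into a single codebook, which enables single-stream autoregressive speech modeling without text supervision, simplifies the architecture, and retains joint modeling of semantic and acoustic information.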

Abstract: Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters and less training data, and it supports streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.


eess.AS [Back]

[110] An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production eess.AS | cs.AI | cs.CLPDF

Jihwan Lee, Parsa Razmara, Kevin Huang, Sean Foley, Aditya Kommineni

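TL;DR: This paper presents a method for simultaneously acquiring real-time MRI video, EEG, and surface EMG data to capture neural, muscular, and articulatory activity during speech production, and develops a suppression pipeline for the artifacts that arise in multimodal acquisition.

Details

Motivation: Speech production is a complex process involving neural planning, motor control, muscle activation, and articulatory kinematics, and the acoustic signal does not directly reveal its neurophysiological basis. Simultaneous multimodal data is therefore needed to study the process comprehensively.

Result: The paper reports the first simultaneous acquisition of real-time MRI, EEG, and surface EMG data. It does not mention specific quantitative results or benchmarks; its main contribution is establishing the technical framework.

Insight: The novelty lies in the first simultaneous multimodal acquisition of brain signals, muscle activity, and articulatory movements during speech production, together with a dedicated suppression pipeline for MRI-induced electromagnetic interference and myogenic artifacts. This provides a new tool for speech neuroscience and brain-computer interface research.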

Abstract: Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.


cs.MM [Back]

[111] SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning cs.MM | cs.CL | cs.SDPDF

Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak

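TL;DR: This paper proposes SarcasmMiner, a reinforcement learning based post-training framework that aims to make foundation models more robust at multimodal sarcasm detection. It reformulates sarcasm detection as a structured reasoning problem and adopts a dual-track distillation strategy (high-quality teacher trajectories initialize the student model, while the full trajectory set trains a generative reward model) to resist hallucination in multimodal reasoning.

Details

Motivation: Multimodal sarcasm detection requires resolving pragmatic incongruity through cross-modal reasoning over textual, acoustic, and visual cues. Existing methods suffer from hallucination, so a framework that supports robust sarcasm reasoning is needed.

Result: On the MUStARD++ benchmark, SarcasmMiner raises F1 to 70.22%, compared with 59.83% for zero-shot and 68.23% for supervised fine-tuning.

Insight: The core novelty is reformulating sarcasm detection as a structured reasoning task, designing a dual-track distillation strategy (combining trajectory-based initialization with generative reward model training), and applying group relative policy optimization (GRPO) with decoupled rewards for accuracy and reasoning quality. This offers a reusable approach for strengthening the reasoning ability and factuality of multimodal foundation models.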

Abstract: Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement learning based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner increases F1 from 59.83% (zero-shot) and 68.23% (supervised fine-tuning) to 70.22%. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
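
The group-relative step of GRPO mentioned above can be sketched in a few lines of Python: each sampled response's reward is normalized against the mean and standard deviation of its own group. This is a generic illustration, not SarcasmMiner's training code. In particular, the way the accuracy and reasoning-quality rewards are combined (`combined_reward` and its weights) is an assumption made for the example, and the use of the population standard deviation is one common choice among several.

```python
from statistics import mean, pstdev


def combined_reward(accuracy, reasoning_quality, w_acc=1.0, w_reason=0.5):
    """Hypothetical weighted sum of two decoupled reward signals (weights are assumptions)."""
    return w_acc * accuracy + w_reason * reasoning_quality


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt: (accuracy reward, reasoning-quality reward).
samples = [(1.0, 0.8), (0.0, 0.6), (1.0, 0.2), (0.0, 0.0)]
rewards = [combined_reward(acc, quality) for acc, quality in samples]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form the baseline.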


eess.IV [Back]

[112] ICHOR: A Robust Representation Learning Approach for ASL CBF Maps with Self-Supervised Masked Autoencoders eess.IV | cs.CV | physics.med-phPDF

Xavier Beltran-Urbano, Yiran Li, Xinglin Zeng, Katie R. Jobson, Manuel Taso

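TL;DR: This paper proposes ICHOR, a robust representation learning approach for ASL CBF maps based on self-supervised masked autoencoders. Using a Vision Transformer backbone, it is pretrained on a large multi-site, multi-protocol dataset of ASL CBF scans to learn transferable representations, and it serves as a general-purpose encoder for downstream tasks such as image quality prediction and diagnostic classification.

Details

Motivation: Deep learning models for ASL perfusion MRI generalize poorly because of variable image quality, heterogeneity across sites, vendors, and protocols, and the scarcity of labeled data.

Result: On three downstream diagnostic classification tasks and one ASL CBF map quality prediction regression task, ICHOR outperforms existing neuroimaging self-supervised pre-training methods adapted to ASL.

Insight: The novelty lies in combining 3D masked image modeling with a Vision Transformer specifically for ASL CBF maps, and in pretraining on a large, heterogeneous, multi-source dataset to learn robust, transferable representations that mitigate domain shift and the labeling bottleneck.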

Abstract: Arterial spin labeling (ASL) perfusion MRI allows direct quantification of regional cerebral blood flow (CBF) without exogenous contrast, enabling noninvasive measurements that can be repeated without constraints imposed by contrast injection. ASL is increasingly acquired in research studies and clinical MRI protocols. Building on successes in structural imaging, recent efforts have implemented deep learning-based methods to improve image quality, enable automated quality control, and derive robust quantitative and predictive biomarkers with ASL-derived CBF. However, progress has been limited by variable image quality, substantial inter-site, vendor, and protocol differences, and limited availability of labeled datasets needed to train models that generalize across cohorts. To address these challenges, we introduce ICHOR, a self-supervised pre-training approach for ASL CBF maps that learns transferable representations using 3D masked autoencoders. ICHOR is pretrained via masked image modeling using a Vision Transformer backbone and can be used as a general-purpose encoder for downstream ASL tasks. For pre-training, we curated one of the largest ASL datasets to date, comprising 11,405 ASL CBF scans from 14 studies spanning multiple sites and acquisition protocols. We evaluated the pre-trained ICHOR encoder on three downstream diagnostic classification tasks and one ASL CBF map quality prediction regression task. Across all evaluations, ICHOR outperformed existing neuroimaging self-supervised pre-training methods adapted to ASL. Pre-trained weights and code will be made publicly available.