Table of Contents

cs.CL

[1] Children’s Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs cs.CL | cs.AI

Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan

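TL;DR: This paper presents KidGym, a comprehensive 2D grid-based benchmark for evaluating multimodal large language models (MLLMs) on five core capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. The benchmark comprises 12 unique tasks that mirror the stages of children's cognitive development, aiming to measure MLLMs' adaptability and developmental potential more accurately and robustly.

Details

Motivation: Inspired by the Wechsler Intelligence Scale for Children, the work decomposes general human intelligence into interpretable, testable abilities in order to assess whether MLLMs possess child-like general intelligence beyond that of language-only models, and to expose the limitations of current models.

Result: Evaluating state-of-the-art MLLMs yielded notable insights into model capabilities and revealed several limitations of current models.

Insight: The novelty lies in borrowing the framework of children's intelligence tests from psychology to build a systematic, customizable, and extensible 2D grid benchmark that aligns model evaluation with the stages of human cognitive development, offering a new perspective for measuring MLLMs' general intelligence and developmental potential.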

Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs’ adaptability and developmental potential, mirroring the stages of children’s cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.


[2] Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models cs.CL | cs.LG

Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu

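TL;DR: This paper proposes the Fast-Slow Thinking Reward Model (F/S-RM), a hybrid architecture that efficiently integrates scalar reward models (SRMs) and generative reward models (GRMs). Inspired by Dual Process Theory, it combines two reward paradigms in a single model: fast-thinking scalar scoring from the first token, and slow-thinking generative judgment based on chain-of-thought (CoT) reasoning. A dual-confidence activation mechanism dynamically decides when to invoke the more computationally expensive slow-thinking process.

Details

Motivation: Existing reward models face a trade-off when aligning large language models via reinforcement learning from human feedback (RLHF): GRMs achieve high accuracy through CoT reasoning but at high computational cost, whereas SRMs are efficient but limited in performance and adaptability across scenarios. The paper aims to design a hybrid reward model architecture that balances performance and efficiency.

Result: F/S-RM markedly improves efficiency while maintaining performance: it achieves a 1.2% relative performance improvement over state-of-the-art (SOTA) models while reducing token consumption by 20.8%.

Insight: The core novelty is bringing Dual Process Theory from cognitive science (fast/slow thinking) into reward model design, unifying the scalar and generative reward paradigms in a single model, and adding an adaptive, dual-confidence activation mechanism that schedules compute dynamically. This offers a reusable architectural idea for building efficient, high-performing alignment components.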

Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
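
The routing idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the two confidence signals are computed, so `conf_a`, `conf_b`, the thresholds `tau_a` and `tau_b`, and the rule "use the slow path unless both confidences clear their thresholds" are all assumptions made for illustration, not the authors' mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FastResult:
    score: float   # scalar reward read from the first-token prediction
    conf_a: float  # hypothetical confidence signal 1, in [0, 1]
    conf_b: float  # hypothetical confidence signal 2, in [0, 1]


def score_response(
    fast: FastResult,
    slow_judge: Callable[[], float],
    tau_a: float = 0.8,
    tau_b: float = 0.8,
) -> Tuple[float, bool]:
    """Return (reward, used_slow_path).

    Assumed rule: keep the cheap scalar score only when both confidence
    signals clear their thresholds; otherwise pay for the CoT judgment.
    """
    if fast.conf_a >= tau_a and fast.conf_b >= tau_b:
        return fast.score, False
    return slow_judge(), True


# The slow judge is a stand-in for a CoT-based generative judgment.
confident = score_response(FastResult(0.9, 0.95, 0.90), lambda: 0.1)
uncertain = score_response(FastResult(0.9, 0.95, 0.50), lambda: 0.1)
```

Passing the slow judge as a callable means the expensive CoT generation only runs when the gate opens, which is where any token savings would come from.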


[3] Multi-Agent Debate with Memory Masking cs.CL | cs.LG

Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan

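TL;DR: This paper proposes MAD-M^2 (multi-agent debate with memory masking), a new framework that addresses the problem of erroneous memories in multi-agent debate (MAD). By letting agents mask erroneous memories from the previous round at the start of each debate round, the method makes MAD more robust and outperforms the original MAD on mathematical and logical reasoning benchmarks.

Details

Motivation: Although MAD improves the reasoning abilities of large language models, agents are easily influenced by erroneous memories produced in earlier debate rounds, and these memories threaten MAD's performance. The paper aims to address the MAD framework's vulnerability to erroneous memories.

Result: Extensive experiments on mainstream mathematical and logical reasoning benchmarks show that MAD-M^2 effectively identifies erroneous memories and outperforms the original MAD in reasoning performance.

Insight: The core contribution is a memory masking mechanism that lets agents actively filter out erroneous memories from the previous round at the start of each debate round, cleaning up the contextual information. This offers a simple yet effective way to make multi-agent collaborative reasoning more robust: improve iterative reasoning by managing the quality of historical memories.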

Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, multi-agent debate (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, multi-agent debate with memory masking (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.
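
As a rough illustration of masking memories before a debate round, the sketch below filters the previous round's messages and only then hands them to the agents. The names `mask_memories`, `debate_round`, and the `is_erroneous` predicate are hypothetical; in the paper the LLM agents themselves decide what to mask, whereas here a plain function stands in for that judgment.

```python
from typing import Callable, List

Agent = Callable[[List[str]], str]


def mask_memories(memories: List[str], is_erroneous: Callable[[str], bool]) -> List[str]:
    """Keep only the memories that the (stand-in) judge does not flag."""
    return [m for m in memories if not is_erroneous(m)]


def debate_round(agents: List[Agent], memories: List[str],
                 is_erroneous: Callable[[str], bool]) -> List[str]:
    """Run one round: mask first, then let every agent answer on the cleaned context."""
    context = mask_memories(memories, is_erroneous)
    return [agent(context) for agent in agents]


# Toy usage: the judge flags a message containing a known-wrong claim,
# and each toy agent simply reports how many memories it received.
previous = ["2 + 2 = 4", "2 + 2 = 5", "so the total is 4"]
flag_wrong = lambda m: "= 5" in m
toy_agents = [lambda ctx: f"saw {len(ctx)} memories"] * 2
answers = debate_round(toy_agents, previous, flag_wrong)
```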


[4] Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference cs.CL

Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

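TL;DR: This paper examines energy consumption in large language model inference. In request-heavy scenarios, small language models combined with reasoning strategies such as chain-of-thought (CoT) can approach the performance of large models while lowering computational requirements, but these strategies add energy costs of their own, creating an energy-accuracy trade-off. The paper analyzes this trade-off on the MMLU benchmark, proposes energy efficiency metrics (such as Energy-per-Token) as complements to traditional accuracy benchmarks, and suggests regulating energy use dynamically by controlling the depth of CoT token generation, in support of sustainable AI deployment.

Details

Motivation: To address the excessive energy consumption caused by over-computation and complex reasoning strategies in LLM inference, balancing model performance against energy efficiency to enable sustainable AI deployment.

Result: The paper analyzes the energy-accuracy trade-off between small and large models under test-time compute strategies on MMLU and proposes energy efficiency metrics as a complement to evaluation, but it does not report specific quantitative performance gains.

Insight: Contributions include energy efficiency metrics such as Energy-per-Token for quantifying AI's physical impact, and an energy-aware routing mechanism that regulates energy use dynamically by controlling the depth of CoT token generation, offering a new perspective for green AI.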

Abstract: Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose energy efficiency metrics, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy for sustainable AI deployment.
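
A small worked example may make the Energy-per-Token metric concrete. The sketch assumes energy is obtained by integrating evenly spaced power readings (trapezoidal rule) and that the denominator is whatever token count the caller passes in; the abstract does not fix either choice, and the function names are hypothetical.

```python
from typing import Sequence


def energy_joules(power_watts: Sequence[float], dt_seconds: float) -> float:
    """Integrate evenly spaced power samples with the trapezoidal rule."""
    return sum((a + b) / 2 * dt_seconds for a, b in zip(power_watts, power_watts[1:]))


def energy_per_token(energy_j: float, n_tokens: int) -> float:
    """Joules spent per token; which tokens to count is the caller's choice."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return energy_j / n_tokens


# Toy numbers (not measurements): a steady 100 W draw sampled once per second
# for two seconds, during which 50 tokens were produced.
total = energy_joules([100.0, 100.0, 100.0], dt_seconds=1.0)
per_token = energy_per_token(total, 50)
```

With these toy numbers the run uses 200 J in total, or 4 J per token. A longer CoT trace raises the token count and the total energy together, so the per-token figure is most informative when read alongside accuracy.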


[5] Coding Agents are Effective Long-Context Processors cs.CL | cs.AI

Weili Cao, Xunjian Yin, Bhuwan Dhingra, Shuyan Zhou

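TL;DR: This paper studies whether long-context processing can be externalized to coding agents, which organize and manipulate large volumes of text using file systems and native tools (such as code and terminal commands) in place of conventional implicit, attention-based processing.

Details

Motivation: Current large language models (LLMs) degrade significantly when processing long contexts through uninterpretable attention mechanisms. The paper explores whether long-context processing can be externalized from implicit attention into explicit, executable interactions.

Result: Across multiple benchmarks (covering long-context reasoning, retrieval-augmented generation, and large-scale open-domain question answering over corpora of up to three trillion tokens), off-the-shelf frontier coding agents outperform published state-of-the-art (SOTA) methods by 17.3% on average.

Insight: The novelty lies in exploiting coding agents' native tool proficiency (executable code and terminal commands rather than passive semantic queries) and file system familiarity (navigating massive text corpora as directory structures), providing an effective alternative to semantic search or context window scaling for long-context processing in LLMs.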

Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, the access is via the latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.


[6] A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement cs.CL

Yuran Li, Di Wu, Benoit Boulet

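TL;DR: This paper proposes a training-free regeneration paradigm in which an offline-curated contrastive Reflection Memory (RM) provides corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection and improving the accuracy of LLM outputs at low computational cost.

Details

Motivation: Existing verification-guided self-improvement methods trade off inference efficiency against accuracy: iterative verification-rectification is computationally expensive and prone to getting trapped in faulty reasoning, while best-of-N selection requires extensive sampling and does not address internal model flaws.

Result: On nine benchmarks spanning algorithmic, reasoning, symbolic, and domain-specific tasks, the method outperforms prior methods on both small and large LLMs while keeping computational cost low.

Insight: The novelty lies in using an offline-curated contrastive reflection memory as a corrective signal, combined with a single from-scratch regeneration to break loops of faulty reasoning, yielding an efficient, training-free self-improvement mechanism.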

Abstract: Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experiment results show that our method outperforms prior methods while maintaining low computational cost.


[7] A Modular LLM Framework for Explainable Price Outlier Detection cs.CL | cs.CE

Shadi Sartipi, John Wu, Sina Ghotbi, Nikhita Vedula, Shervin Malmasi

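TL;DR: This paper proposes a modular large language model (LLM) framework for explainable product price outlier detection. The framework treats outlier price flagging as a reasoning task grounded in related product detection and comparison, processes target product prices in three stages (relevance classification, relative utility assessment, and reasoning-based decision), and produces explainable judgments.

Details

Motivation: To address product price outlier detection in retail and e-commerce, where classical methods offer only simple thresholds and ignore the rich semantic relationships among product attributes. A detection method that understands semantics and provides explanations is needed.

Result: On a test dataset, the framework attains over 75% agreement with human auditors and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the method's sensitivity to key hyperparameters and its flexibility across settings with different accuracy requirements and levels of auditor agreement.

Insight: The novelty lies in framing price outlier detection as a modular, LLM-based reasoning task, achieving explainability through staged processing, and demonstrating advantages in semantic understanding and flexibility.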

Abstract: Detecting product price outliers is important for retail and e-commerce stores as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the sensitivity of the method to key hyper-parameters and demonstrate its flexibility across cases with different accuracy requirements and levels of auditor agreement.


[8] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention cs.CL

Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le

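TL;DR: This paper proposes Diversity-Aware Retention (DAR), an efficient multi-agent debate framework that aims to improve the reasoning quality of large language models through selective message propagation. In each debate round, the method broadcasts only the agent responses that disagree most with each other and with the majority vote, reducing noise and redundancy.

Details

Motivation: Broadcasting all messages introduces noise and redundancy, which degrades debate quality and wastes computational resources, while existing filtering methods based on uncertainty estimation are unreliable because of miscalibrated confidence scores and sensitivity to threshold selection.

Result: Experiments on diverse reasoning and question answering benchmarks show that selective message propagation consistently improves debate performance, with larger gains as the number of agents grows and noise accumulation is most severe.

Insight: The novelty is an index-based retention mechanism that keeps the retained dissenting messages unmodified, preserving their authenticity. The core insight is that in multi-agent reasoning systems, what agents hear is as important as what they say.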

Abstract: Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
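
The selection step can be sketched as a small brute-force search. This is an illustration under stated assumptions, not the DAR algorithm itself: disagreement is reduced to "final answers differ", the subset size `k` is fixed by the caller, and the function name `select_indices` is hypothetical. What it does mirror from the abstract is that the selection returns indices, so the original messages are forwarded untouched.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def select_indices(answers: Sequence[str], k: int) -> Tuple[int, ...]:
    """Pick k response indices that disagree most with each other and with the majority."""
    majority = Counter(answers).most_common(1)[0][0]

    def disagreement(idx: Tuple[int, ...]) -> int:
        pairwise = sum(answers[i] != answers[j] for i, j in combinations(idx, 2))
        vs_majority = sum(answers[i] != majority for i in idx)
        return pairwise + vs_majority

    # Exhaustive search is fine for the handful of agents in a debate.
    return max(combinations(range(len(answers)), k), key=disagreement)


# Toy usage: five agents, three of which agree on "A".
final_answers = ["A", "A", "A", "B", "C"]
messages: List[str] = [f"agent {i} argues for {a}" for i, a in enumerate(final_answers)]
kept = select_indices(final_answers, k=2)
broadcast = [messages[i] for i in kept]  # index-based: messages are not rewritten
```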


[9] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs cs.CL | cs.AI

Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang

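TL;DR: The paper presents PAVE (Premise-Aware Validation and Editing), an inference-time validation layer for retrieval-augmented large language models. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well the draft is supported by the extracted premises, and revises low-support outputs before finalization. With a fixed retriever and backbone, it outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.

Details

Motivation: To address the problem that retrieval-augmented language models, after retrieving relevant evidence, generate answers without explicitly checking whether the retrieved context supports the conclusion, with the goal of strengthening evidence-grounded consistency.

Result: In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark.

Insight: The novelty lies in strengthening the evidence-grounded consistency of retrieval-augmented LLM systems through explicit premise extraction and support-gated revision. Answer commitment becomes auditable at the level of explicit premises, support scores, and revision decisions, giving an interpretable inference-time validation mechanism.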

Abstract: Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
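
The draft, score, and revise loop described above can be sketched in a few lines. Everything concrete here is an assumption for illustration: the token-overlap `support_score` is a toy stand-in for whatever scorer the paper uses, the `threshold` value is arbitrary, and `revise` is a placeholder for an LLM call.

```python
from typing import Callable, List, Tuple


def support_score(answer: str, premises: List[str]) -> float:
    """Toy score: fraction of answer tokens that also occur in some premise."""
    tokens = answer.lower().split()
    pool = set(" ".join(premises).lower().split())
    return sum(t in pool for t in tokens) / len(tokens) if tokens else 0.0


def validate_and_edit(
    premises: List[str],
    draft: str,
    revise: Callable[[str, List[str]], str],
    threshold: float = 0.5,
) -> Tuple[str, float, bool]:
    """Return (final_answer, support, was_revised); revise only low-support drafts."""
    score = support_score(draft, premises)
    if score >= threshold:
        return draft, score, False
    revised = revise(draft, premises)
    return revised, support_score(revised, premises), True


# Toy usage with a hard-coded reviser standing in for the model.
facts = ["paris is the capital of france"]
supported = validate_and_edit(facts, "paris", lambda d, p: "paris")
unsupported = validate_and_edit(facts, "lyon", lambda d, p: "paris")
```

Returning the score and the revision flag alongside the answer is what keeps the decision auditable: each output carries the evidence for why it was or was not edited.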


[10] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks cs.CL | cs.AI

Fan Huang

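TL;DR: This paper proposes Network-of-Thought (NoT), a framework that models LLM reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. The framework targets complex reasoning tasks that require merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources, going beyond the topological limits of linear Chain-of-Thought (CoT) and tree-structured Tree-of-Thought (ToT).

Details

Motivation: Existing prompting paradigms such as CoT and ToT confine LLM reasoning to simple linear or tree topologies, whereas complex reasoning tasks often need a more flexible graph structure that supports merging and backtracking. The paper explores when and how more flexible reasoning topologies improve performance on complex reasoning.

Result: Experiments on four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct) show that NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA); with 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%); and on logical reasoning (ProofWriter), self-generated controller heuristics outperform fixed and random strategies. Evaluation methodology (for example, string matching versus LLM-as-Judge) also significantly affects method rankings.

Insight: The core novelty is modeling reasoning explicitly as a directed graph (NoT) and introducing a heuristic-based controller that dynamically guides graph construction and search, which gives a more general framework for complex reasoning that requires non-sequential, multi-path integration. Viewed objectively, the systematic comparison of reasoning topologies (chain, tree, graph) on the computation-accuracy trade-off, and the finding that evaluation methodology strongly affects conclusions, have significant methodological value.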

Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14–18 percentage point gap on HotpotQA).
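
To show what a directed graph with typed nodes and edges adds over a chain or a tree, here is a minimal data-structure sketch. The node types ("evidence", "conclusion"), the edge type ("supports"), and the class name `ThoughtGraph` are hypothetical labels chosen for illustration; the abstract does not list the actual type vocabulary or the controller policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ThoughtGraph:
    nodes: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # id -> (type, text)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, dst, type)

    def add_node(self, node_id: str, node_type: str, text: str) -> None:
        self.nodes[node_id] = (node_type, text)

    def add_edge(self, src: str, dst: str, edge_type: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, edge_type))

    def parents(self, node_id: str) -> List[str]:
        """Nodes with an edge into node_id, in insertion order."""
        return [src for src, dst, _ in self.edges if dst == node_id]


# A merge: one conclusion fed by two independent pieces of evidence.
# In a tree each node has a single parent, so this shape cannot be expressed.
g = ThoughtGraph()
g.add_node("e1", "evidence", "Document A names the director.")
g.add_node("e2", "evidence", "Document B gives the director's birthplace.")
g.add_node("c", "conclusion", "The film's director was born in that city.")
g.add_edge("e1", "c", "supports")
g.add_edge("e2", "c", "supports")
```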


[11] MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages cs.CL

Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede

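TL;DR: South Africa's eleven official languages, nine of which are low-resource, lack a publicly available decoder-only language model. To fill this gap, the paper introduces MzansiText (a curated multilingual pretraining corpus) and MzansiLM (a 125M-parameter decoder-only language model). It evaluates the model on natural language understanding and generation under three adaptation regimes, finding strong results with task-specific finetuning but persistent difficulty with few-shot reasoning.

Details

Motivation: To address the lack of publicly available decoder-only language models for low-resource languages, in particular South Africa's official languages, and to explore how well small models generalize through strategies such as instruction finetuning.

Result: With task-specific finetuning, MzansiLM reaches 20.65 BLEU on isiXhosa data-to-text generation, competitive with encoder-decoder baselines over ten times larger. On multilingual topic classification, it achieves 78.5% macro-F1 on isiXhosa news classification. On few-shot reasoning, however, performance is near chance even for much larger decoder-only models.

Insight: The contribution is a reproducible pretraining corpus (MzansiText) and baseline model (MzansiLM) for South Africa's low-resource languages, together with a systematic evaluation of adaptation strategies (monolingual and multilingual task-specific finetuning, multi-task instruction finetuning) at small scale, giving clear guidance for low-resource language modeling.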

Abstract: Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.


[12] Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement cs.CL

Jiang Liu, Ge Qiu, Hao Fei, Dongdong Xie, Jinbo Li

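TL;DR: This paper proposes Code-MIE, a code-style multimodal information extraction framework that formalizes multimodal information extraction as a unified code understanding and generation process. It extracts entity attributes (such as gender and affiliation) from the text to improve entity understanding, and converts images into scene graphs and visual features to bring in rich visual information. The input is constructed as a Python function, and the output is formalized as Python dictionaries containing all extraction results (such as entities and relations).

Details

Motivation: Existing LLM-based multimodal information extraction methods usually use natural language templates for input and output, which is a poor match for information extraction tasks that mainly involve structured information such as entities and relations. The few methods that adopt structured, code-style templates have only explored text-only information extraction, and they are complex in design, requiring a separate template for each task.

Result: Experiments on the M$^3$D (English and Chinese), Twitter-15, Twitter-17, and MNRE datasets show that the method achieves state-of-the-art performance compared with six baseline models, reaching 61.03% and 60.49% on the English and Chinese M$^3$D datasets and 76.04%, 88.07%, and 73.94% on the other three datasets.

Insight: The novelty lies in formalizing multimodal information extraction uniformly as code-style input and output, using entity attribute knowledge and scene graphs to improve the model's understanding of context and visual information, and simplifying multi-task design through a unified template of Python functions and dictionaries that better matches structured information extraction.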

Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, there is still room for improvement in existing methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which do not match the characteristics of information extraction tasks that mostly involve structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they explored these only on text-only IE rather than multimodal IE. Moreover, their methods are more complex in design, requiring separate templates to be designed for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model to understand the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs and raw text make up the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results such as entities, relations, etc. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
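
The code-style template in design (3) can be pictured with a short sketch. The function name `extract_information`, the parameter layout, and the dictionary keys are hypothetical; the abstract only states that the input is a Python function over entity attributes, scene graphs, and raw text, and that the output is a Python dictionary. Visual features are not textual, so they are left out here.

```python
import ast
from typing import Any, Dict, List, Tuple


def build_prompt(text: str,
                 entity_attributes: Dict[str, Dict[str, str]],
                 scene_graph: List[Tuple[str, str, str]]) -> str:
    """Render the inputs as an unfinished Python function for the model to complete."""
    return (
        "def extract_information(text, entity_attributes, scene_graph):\n"
        '    """Extract entities and relations; return them as a Python dict."""\n'
        f"    text = {text!r}\n"
        f"    entity_attributes = {entity_attributes!r}\n"
        f"    scene_graph = {scene_graph!r}\n"
        "    result = "
    )


def parse_output(model_output: str) -> Dict[str, Any]:
    """Parse the completion as a Python literal without executing any code."""
    parsed = ast.literal_eval(model_output.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    return parsed


prompt = build_prompt(
    "Obama speaks at the summit.",
    {"Obama": {"gender": "male", "affiliation": "government"}},
    [("man", "standing at", "podium")],
)
# A made-up completion, standing in for what a model might return.
result = parse_output('{"entities": [("Obama", "person")], "relations": []}')
```

Parsing with `ast.literal_eval` rather than `eval` keeps the structured-output benefit while refusing anything that is not a plain literal.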


[13] RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution cs.CL | cs.LG

Kaiyuan Li, Jing-Cheng Pang, Yang Yu

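TL;DR: This paper studies the limits of reinforcement learning from verifiable rewards (RLVR) for improving the reasoning abilities of large language models (LLMs), finding that RLVR is effective on verifiable tasks but much less so on general question answering (GQA). The authors propose a Cross-Generation evaluation framework to verify this finding and introduce a simple training method, START, which separates the thinking process from answer generation to avoid reward shortcuts, improving both thinking quality and final answers on several GQA benchmarks.

Details

Motivation: To test whether RLVR training automatically improves LLM performance on GQA. It is commonly assumed that the reasoning gains from RLVR transfer to GQA, but this assumption has not been thoroughly validated.

Result: The evaluation shows that the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, indicating that training on verifiable tasks alone is insufficient. The proposed START method improves both thinking quality and final answers across several GQA benchmarks and RL algorithms.

Insight: Contributions include a Cross-Generation evaluation framework for quantifying the quality of the thinking process, the observation that GQA tasks admit reward shortcuts, and the START method, which avoids those shortcuts by separating thinking training from answer generation and thereby improves GQA performance. Viewed objectively, the method offers a new approach to reasoning training for LLMs and underlines the importance of task-specific training.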

Abstract: Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.


[14] BenchBench: Benchmarking Automated Benchmark Generation cs.CL

Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

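TL;DR: This paper introduces BenchBench, a three-stage pipeline and dataset for evaluating automated benchmark generation. It extracts structured domain cards from seed benchmarks, uses multiple designer LLMs to generate quota-controlled test suites, and validates items with a multi-model answerer panel, yielding designer-answerer matrices with item-level quality flags and psychometric diagnostics.

Details

Motivation: To address the problems that static test sets saturate quickly, are vulnerable to contamination, and are costly to refresh, and that LLM judges introduce bias and prompt sensitivity when evaluating open-ended items. The paper argues that evaluation should extend from how well models answer benchmarks to how well models design them.

Result: Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), the pipeline generates 16.7K items, retains about 15K core items after filtering, and produces about 152K graded model-item responses. Benchmark-design ability is only moderately correlated with answering ability (Spearman rho ~0.37), and invalidity is negatively associated with discrimination (Pearson r ~ -0.62).

Insight: The novelty lies in shifting the focus of evaluation from models answering benchmarks to models designing them, and in proposing a scalable framework for evaluating automated benchmark generation that can audit format/modality/language fidelity and suite-dependent self/family interactions, opening a new direction for dynamic benchmark generation and evaluation.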

Abstract: Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer–answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain 15K core items post-filtering, and produce ~152K graded model–item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho ~0.37), invalidity is negatively associated with discrimination (Pearson r ~ -0.62), and the resulting designer–answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.


[15] Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach cs.CL | cs.AI

Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu

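TL;DR: This paper proposes SART (Shortcut-Aware Reasoning Training), a gradient-aware training framework that detects and mitigates the tendency of large language models to rely on shortcuts such as surface pattern matching and answer memorization in reasoning tasks. The method identifies and handles shortcut-promoting samples via ShortcutScore and gradient surgery, improving the model's logical reasoning.

Details

Motivation: Large language models show strong reasoning capabilities but often rely on shortcuts rather than genuine logical inference, which limits generalization. The paper aims to address models' over-reliance on surface patterns and memorized answers in reasoning tasks.

Result: On controlled reasoning benchmarks, SART improves accuracy by 16.5% and robustness by 40.2% over the strongest baseline, significantly improving generalization under distribution shift.

Insight: The novelty is a gradient-aware framework that identifies shortcut signals through gradient misalignment and answer-token concentration and adjusts training dynamics with gradient surgery. This gives a data-centric training method for mitigating shortcut reasoning; its combination of gradient analysis and sample reweighting is a reusable idea for making model reasoning more robust.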

Abstract: Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short-cut-aware-data-centric-reasoning.
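
The two gradient-side ingredients named in the abstract, misalignment with a validation objective and gradient surgery, can be illustrated on plain vectors. This sketch substitutes a PCGrad-style projection for the surgery step and uses cosine similarity as the alignment measure; both are assumptions, since the abstract does not define ShortcutScore or the exact surgery rule, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; a negative value means the gradients conflict."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def project_out_conflict(g_sample: Sequence[float], g_val: Sequence[float]) -> List[float]:
    """If the sample gradient opposes the validation gradient, remove that component."""
    d = dot(g_sample, g_val)
    if d >= 0:
        return list(g_sample)           # aligned or orthogonal: leave unchanged
    scale = d / dot(g_val, g_val)       # g_val is nonzero whenever d < 0
    return [a - scale * b for a, b in zip(g_sample, g_val)]


# Toy gradients: the sample pushes against the validation direction on axis 2.
g_sample = [1.0, -1.0]
g_val = [0.0, 1.0]
alignment_before = cosine(g_sample, g_val)
g_fixed = project_out_conflict(g_sample, g_val)
alignment_after = cosine(g_fixed, g_val)
```

After the projection the sample's update no longer works against the validation objective, while the part of the gradient orthogonal to it is kept.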


[16] DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles cs.CL | cs.LG

Bo Jiang

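TL;DR: This paper introduces DiscoUQ, a framework for quantifying the uncertainty of outputs from multi-agent large language model (LLM) systems. It produces well-calibrated confidence estimates by analyzing the structure of inter-agent disagreement, including linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion).

Details

Motivation: Existing methods quantify the uncertainty of multi-agent LLM systems with simple voting statistics, discarding the rich semantic information in agents' reasoning. A method that uses the structure of disagreement is needed to obtain more accurate, better-calibrated confidence estimates.

Result: Evaluated on four benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation, and the largest gains come in the ambiguous "weak disagreement" tier where simple vote counting fails.

Insight: The novelty lies in systematically extracting and using structured information about inter-agent disagreement (linguistic and embedding-geometry features) for uncertainty quantification, rather than relying on voting statistics alone. This gives confidence estimation in multi-agent systems a richer and more reliable semantic basis, with particular benefit in ambiguous cases.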

Abstract: Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents’ reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement – both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) – to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous “weak disagreement” tier where simple vote counting fails.
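
Since the calibration comparison above is reported as ECE, a short reference computation may help readers interpret the numbers. This is the standard equal-width-bin form of expected calibration error, written as a sketch; the bin count of 10 is an assumption, as the abstract does not state the binning used.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy data: fully confident on two answers but right on only one of them.
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])
# Toy data: 50% confident and right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which 0.036 is better than 0.098.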


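[17] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO cs.CL | cs.AI | cs.LG

Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng

TL;DR: This paper proposes PA-GRPO (Permutation-Aware Group Relative Policy Optimization), a method for mitigating the selection bias that large language models (LLMs) show in multiple-choice and pairwise evaluation tasks because of non-semantic factors such as option positions and label symbols. It builds a permutation group for each instance and optimizes the model with two complementary mechanisms, cross-permutation advantage and a consistency-aware reward, to enforce permutation-consistent semantic reasoning.

Details

Motivation: Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. A training method is needed that mitigates selection bias while maintaining high performance.

Result: Experiments show that PA-GRPO outperforms strong baselines on seven benchmarks, substantially reducing selection bias while maintaining high overall performance.

Insight: The novelty lies in bringing permutation awareness into policy optimization: cross-permutation advantage computation and a consistency reward lead the model to learn permutation-invariant semantic representations, offering a new direction for robustness training of LLMs.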

Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
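
To make the two mechanisms more concrete, the sketch below computes an advantage for each permutation relative to the mean reward over the permutation group, as the abstract describes. The abstract does not give the form of the consistency-aware reward, so `majority_agreement` is only a hypothetical stand-in (the share of permutations whose decision matches the most common one). Any variance normalization the authors may apply is likewise not specified and is left out. All names here are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def cross_permutation_advantages(rewards):
    """Advantage of each permutation relative to the group's mean reward.

    rewards: one scalar reward per permutation of the same instance.
    """
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]


def majority_agreement(decisions):
    """Hypothetical consistency signal, not the paper's reward definition.

    decisions: the option *content* chosen under each permutation, already
    mapped back from position labels such as "A" or "B".
    Returns the fraction of permutations that agree with the majority choice.
    """
    top_count = Counter(decisions).most_common(1)[0][1]
    return top_count / len(decisions)


# One question presented under four option orderings.
advantages = cross_permutation_advantages([1.0, 0.0, 1.0, 0.0])
agreement = majority_agreement(["Paris", "Paris", "Rome", "Paris"])
```

Because the baseline is shared across permutations of one instance, a model that is right only when the correct option sits in a favored position receives negative advantages for the other orderings.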


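[18] Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models cs.CL

Abdul-Salem Beibitkhan

TL;DR: Evaluating eight large language models in English, Kazakh, and Mongolian, this paper finds a performance gap of 13.8-16.7 percentage points between the low-resource languages and English, with models keeping surface-level fluency while accuracy drops significantly. A cross-lingual transfer strategy (reasoning in English first, then translating back) yields selective gains for bilingual architectures (+2.2 to +4.3 percentage points) but does not help English-dominant models.

Details

Motivation: To study how large language models perform on low-resource languages and to expose how current models systematically underserve low-resource language communities.

Result: The study evaluates 2,000 responses to 50 hand-crafted questions spanning factual, reasoning, technical, and cultural categories. Accuracy in the low-resource language conditions is 13.8-16.7 percentage points lower than in English, and the cross-lingual transfer strategy brings only limited gains, confined to bilingual architectures.

Insight: The effectiveness of cross-lingual transfer depends on model architecture (bilingual vs. English-dominant) and is not a universal fix. Models remain fluent in low-resource languages while their content accuracy falls well short, which underscores the need for targeted optimization strategies.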

Abstract: We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer, in which models are prompted to reason in English before translating back, yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.


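[19] Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding cs.CL | cs.HC

Taara Kumar, Kokil Jaidka

TL;DR: This paper systematically studies the role of electronic nonverbal cues (eNVCs) in text-based computer-mediated communication. Across three complementary studies, it builds a unified taxonomy of eNVCs, develops an automated detection toolkit, and shows empirically that eNVCs significantly improve emotion-decoding accuracy and lower perceived ambiguity, while also revealing the strategies users employ to interpret digital prosody.

Details

Motivation: As text-based computer-mediated communication becomes ubiquitous, how nonverbal expression is reconstructed in environments without embodied cues has become a pressing question. The paper sets out to examine systematically the role of eNVCs in public microblog communication.

Result: Study 2 provides causal evidence from a controlled experiment that eNVCs significantly improve emotion-decoding accuracy and lower perceived ambiguity, though the effect weakens or disappears under boundary conditions such as sarcasm.

Insight: The contributions include a unified taxonomy of eNVCs grounded in nonverbal communication theory, a scalable automated detection toolkit, and an account of how users interpret digital prosody (for example, inferring meaning from the absence of cues and defaulting to negative readings in ambiguous contexts), providing theoretical and methodological support for affective computing and user modeling.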

Abstract: As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.


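[20] Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol cs.CL

Smitha Muthya Sudheendra, Jaideep Srivastava

TL;DR: This paper proposes the ReasonAlign annotation protocol, a reasoning-based annotation framework that shows annotators LLM-generated explanations, but not predicted labels, in order to study how reasoning affects human annotation behavior. In a two-pass protocol, annotators first label independently and then revise their decisions after viewing the model’s reasoning. Evaluated on sentiment classification and opinion detection, exposure to reasoning raises inter-annotator agreement while the share of revised labels stays low, suggesting that reasoning mainly helps resolve ambiguous cases rather than triggering widespread changes.

Details

Motivation: To address the substantial variability across annotators in subjective NLP annotation tasks and to examine how LLM-generated reasoning explanations influence human annotation behavior, rather than to evaluate annotation accuracy in full.

Result: On sentiment classification and opinion detection, inter-annotator agreement increases after exposure to reasoning while revision remains limited, as measured by the proposed Annotator Effort Proxy (AEP). The results suggest that reasoning mainly helps resolve ambiguous cases rather than causing widespread label revision.

Insight: The novelty lies in the ReasonAlign framework and its two-pass revision protocol, together with the AEP metric for quantifying revision behavior. Viewed objectively, the study shows how reasoning explanations improve annotation consistency and offers a practical mechanism for supporting human-AI co-annotation workflows.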

Abstract: Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
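
The abstract defines the Annotator Effort Proxy (AEP) as the proportion of labels revised after exposure to reasoning. A minimal sketch of that definition follows. The function name, the per-annotator list layout, and the label strings are assumptions for illustration; the paper may aggregate across annotators or tasks differently.

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Share of labels that changed between the two annotation passes.

    first_pass:  labels assigned independently, before seeing any reasoning.
    second_pass: labels for the same items after viewing model reasoning.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must label the same items")
    if not first_pass:
        raise ValueError("at least one labeled item is required")

    revised = sum(
        1 for before, after in zip(first_pass, second_pass) if before != after
    )
    return revised / len(first_pass)


# One annotator, four items; only the third label is revised.
aep = annotator_effort_proxy(
    ["pos", "neg", "neu", "pos"],
    ["pos", "neg", "pos", "pos"],
)
```

A low AEP alongside higher agreement is the pattern the abstract reports: few labels move, and the ones that do tend to converge.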


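[21] Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects cs.CL | cs.CV

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy

TL;DR: This paper introduces BanglaVerse, a benchmark for evaluating multilingual vision-language models on Bengali culture. It covers several historically linked languages and regional dialects, contains about 32.3K data points, and is intended to assess models’ multimodal understanding in cultural context more realistically.

Details

Motivation: Bengali culture is underrepresented in multimodal evaluation. Evaluating only standard Bangla overestimates model capability, so dialects and language variants must be considered to measure cultural understanding more accurately.

Result: Experiments show that performance drops under dialectal variation, especially for caption generation. Historically linked languages such as Hindi and Urdu retain some cultural meaning but are weaker for structured reasoning. The main bottleneck is missing cultural knowledge rather than visual grounding.

Insight: The novelty lies in building the first multilingual, multi-dialect evaluation benchmark for Bengali culture, showing how strongly language variation affects models’ cultural understanding and stressing the importance of integrating cultural knowledge across languages.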

Abstract: Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.


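[22] Graph Fusion Across Languages using Large Language Models cs.CL | cs.IR

Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan

TL;DR: This paper proposes a framework for cross-lingual knowledge graph fusion using large language models. By linearizing graph triplets into natural-language sequences, it lets the LLM act as a universal semantic bridge for aligning relations and entities, addressing semantic heterogeneity across multilingual knowledge graphs.

Details

Motivation: To address the cross-lingual alignment challenges in multilingual knowledge graph fusion that stem from semantic heterogeneity and the complexity of graph structure.

Result: Experiments on the DBP15K dataset show that the method can sequentially agglomerate multiple heterogeneous graphs, providing a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.

Insight: The novelty lies in using the in-context reasoning and multilingual semantic priors of LLMs: structural linearization maps graph triplets directly into natural-language sequences, which enables semantic alignment and fusion of cross-lingual graphs.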

Abstract: Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
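
The structural linearization step can be pictured with a small sketch that turns each triplet into a "head relation tail" text line, ready to be placed in an LLM prompt. The `Triple` type, the single-space separator, and the one-triplet-per-line layout are assumptions for illustration; the abstract only gives the pattern [head] [relation] [tail] and does not specify the prompt format.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    """One knowledge-graph edge (hypothetical container for this example)."""
    head: str
    relation: str
    tail: str


def linearize(triple: Triple) -> str:
    """Render one triplet as a plain natural-language-like sequence."""
    return f"{triple.head} {triple.relation} {triple.tail}"


def linearize_graph(triples: Iterable[Triple]) -> str:
    """Render a whole graph, one linearized triplet per line."""
    return "\n".join(linearize(triple) for triple in triples)


# Two tiny graphs in different languages that describe the same fact.
english_graph = [Triple("Paris", "capital of", "France")]
french_graph = [Triple("Paris", "capitale de", "France")]
prompt_fragment = linearize_graph(english_graph + french_graph)
```

Once both graphs are text, deciding that "capital of" and "capitale de" express the same relation becomes a language task the LLM can attempt in context.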


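[23] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection cs.CL | cs.AI

Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen

TL;DR: Addressing the increasingly complex challenge of multimodal hate speech detection on social media, this paper proposes a fine-grained approach that goes beyond simple binary classification. It first builds the H-VLI benchmark, which focuses on implicit intent shifts arising from cross-modal interaction (for example, constructing hate from benign cues or neutralizing toxicity through semantic inversion). To decode these complex cues, the authors further propose the ARCADE framework, which simulates a courtroom debate and forces the model to scrutinize semantic evidence before reaching a verdict. Experiments show that ARCADE significantly outperforms existing SOTA methods on H-VLI, particularly on challenging implicit cases, while remaining competitive on existing benchmarks.

Details

Motivation: As content formats evolve, hate speech is moving from plain text to complex multimodal expressions, making implicit attacks harder to spot. Existing systems often fail on multimodal content because they struggle with emergent meaning that arises from modality interaction and goes beyond simple aggregation of the individual modalities.

Result: On the proposed H-VLI benchmark, ARCADE significantly outperforms state-of-the-art baselines, especially on challenging implicit cases, while maintaining competitive performance on established benchmarks.

Insight: The core contributions are: 1) refining the problem from binary classification to a characterization of semantic intent shifts across modalities; 2) building the H-VLI benchmark, which focuses on complex modality interplay rather than overt slurs; 3) proposing the ARCADE framework, inspired by judicial debate, in which active debate between agents forces the model into deep semantic reasoning, a novel design for the model’s reasoning mechanism.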

Abstract: Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H-VLI


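[24] Enhancing reasoning accuracy in large language models during inference time cs.CL | cs.AI

Vinay Sharma, Manish Jain

TL;DR: This paper systematically evaluates three inference-time techniques that improve the reasoning accuracy of large language models without additional training: self-consistency via stochastic decoding, dual-model reasoning agreement, and self-reflection. Under chain-of-thought prompting, self-consistency with nucleus sampling and a controlled temperature yields a substantial 9% to 15% accuracy gain and suits low-risk settings; the dual-model approach cross-checks reasoning steps and is better suited to moderate-risk domains; self-reflection brings limited improvement.

Details

Motivation: Although large language models have strong linguistic abilities, they remain unreliable on multi-step reasoning tasks, especially when deployed without additional training or fine-tuning. The paper studies inference-time techniques for improving LLM reasoning accuracy.

Result: Experiments on an LLM show that, compared with greedy single-pass decoding, self-consistency with nucleus sampling and a controlled temperature achieves a 9% to 15% absolute improvement in reasoning accuracy, reaching state-of-the-art level; the dual-model approach provides additional reliability; self-reflection brings only marginal improvement.

Insight: The novelty lies in a controlled comparative evaluation of three inference-time strategies that clarifies where each applies: self-consistency delivers substantial gains at low overhead; the dual-model approach improves reliability through agreement checks; self-reflection has limited effect for smaller non-reasoning models. Viewed objectively, the paper systematically quantifies the benefit-cost trade-offs of different inference-time optimization techniques and offers practical guidance for deployment.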

Abstract: Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of the three strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding. This makes it well suited to low-risk domains, where it offers meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model’s reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
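
Strategy (i) reduces to a majority vote over the final answers extracted from several sampled reasoning traces. The sketch below shows only that aggregation step. The sampling itself (temperature, nucleus threshold, model call) is omitted because the abstract gives no settings, and `majority_vote` and `agreement_ratio` are names invented for this example.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent final answer among the sampled traces.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    if not final_answers:
        raise ValueError("at least one sampled answer is required")
    return Counter(final_answers).most_common(1)[0][0]


def agreement_ratio(final_answers):
    """Fraction of samples that back the winning answer (a rough confidence cue)."""
    winner = majority_vote(final_answers)
    return final_answers.count(winner) / len(final_answers)


# Final answers extracted from five independently sampled reasoning traces.
samples = ["42", "41", "42", "40", "42"]
chosen = majority_vote(samples)
support = agreement_ratio(samples)
```

The same comparison idea underlies strategy (ii), except that agreement is checked between two different models instead of between samples from one.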


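[25] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles cs.CL

Adi Gabay, Gabriel Stanovsky, Liat Peterfreund

TL;DR: This paper examines the reasoning abilities of large language models on classic epistemic puzzles and challenges the earlier dichotomy that sorts model behavior into either epistemic reasoning or rote memorization. The authors argue that memorization should be viewed as a special form of reduction and introduce a “reduction ladder” framework, which progressively modifies puzzle instances to make reduction harder while keeping the underlying logic unchanged. They find that some large models solve part of the problems through reduction, but all models perform poorly when genuine epistemic reasoning is required.

Details

Motivation: To evaluate LLM reasoning at a finer grain, moving beyond the simple “memorization vs. reasoning” dichotomy by distinguishing reductive from epistemic reasoning, and so to better understand how models behave on classic logic puzzles that require inferring what multiple agents know.

Result: Experiments show that some large models can solve part of the modified puzzle instances through reduction. As the “reduction ladder” gets harder (that is, as instances become less similar to the original training data), every model’s performance drops, and all of them fall short on tasks that require purely epistemic reasoning.

Insight: The main contribution is the “reduction ladder” evaluation framework, which reconceptualizes memorization as a special case of reduction and thereby offers a more continuous, finer-grained dimension for measuring and diagnosing LLM reasoning. It exposes a fundamental limitation of current models in generalizing to epistemic tasks that require genuinely understanding others’ knowledge states.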

Abstract: Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents’ knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.


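[26] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning cs.CL | cs.AI

Shuai Wang, Yinan Yu

TL;DR: The paper proposes KG-Hopper, a reinforcement learning (RL) framework that strengthens the ability of compact open-source large language models (LLMs) to perform multi-hop reasoning over knowledge graphs (KGs). It folds the whole KG traversal and decision process into a unified “thinking” stage, enabling handling of global dependencies and dynamic path exploration within a single inference round and avoiding the error accumulation and inflexibility of traditional multi-step pipelines.

Details

Motivation: Existing LLMs perform poorly on knowledge-intensive reasoning tasks such as knowledge base question answering (KBQA), especially when multi-hop reasoning is needed. Traditional methods rely on predefined multi-step pipelines, which isolate reasoning steps, limit flexibility, and let errors cascade.

Result: On eight KG reasoning benchmarks, KG-Hopper, built on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and is competitive with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient.

Insight: The main contribution is an RL framework that compresses multi-step KG reasoning into a single “thinking” round of the LLM, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. This offers a new paradigm for efficient, flexible KG reasoning on compact open-source models.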

Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified “thinking” stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.


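[27] Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis cs.CL

Tae-Eun Song

TL;DR: This paper proposes Cross-Context Verification (CCV), a black-box method that, combined with the Hierarchical Cross-Context Architecture (HCCA), detects contamination in LLM coding benchmarks. By solving the same problem several times in independent sessions and measuring solution diversity, it distinguishes whether a model is genuinely reasoning or recalling a leaked answer.

Details

Motivation: LLM coding benchmarks face a credibility crisis, with widespread solution leakage and test-quality issues, while existing detection methods (such as paraphrase consistency, n-gram overlap, and perplexity analysis) cannot directly observe whether a model reasons or recalls. Simply repeating verification degrades accuracy, so a structural approach is needed.

Result: On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p≈0.012, r=1.0). Key findings: contamination is binary (models either recall perfectly or not at all); reasoning absence is a perfect discriminator; 33% of prior contamination labels are false positives; and HCCA uncovers contamination-flaw composite cases that single-analyst approaches miss.

Insight: The novelty is that CCV detects contamination directly through solution diversity across independent sessions, while HCCA, a multi-agent analysis framework that restricts information flow, prevents confirmation bias. The study indicates that information restriction, not structural complexity, is the key mechanism, which offers a new structural approach to benchmark verification.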

Abstract: LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA’s independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.


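[28] DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation cs.CL | cs.LG

Siqi Guo, Ming Lin, Tianbao Yang

TL;DR: This paper proposes DRTriton, a scalable learning framework for training large language models to convert PyTorch code into highly optimized Triton kernels. It has three key components: CSP-DAG, a data synthesis algorithm that guarantees full coverage of, and unbiased uniform sampling over, the operator space; curriculum reinforcement learning that optimizes conversion success rate and inference speed simultaneously; and a test-time search algorithm that further improves the inference speed of the generated kernels.

Details

Motivation: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. State-of-the-art large language models such as GPT-5.2 and Claude-Sonnet-4.5 still struggle with this specific task.

Result: On the KernelBench Level 2 benchmark, DRTriton-7B achieves a speedup on 92% of cases, compared with only 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5, setting a new SOTA.

Insight: The novelty lies in a scalable framework trained entirely on synthetic data. The combination of the CSP-DAG algorithm, curriculum reinforcement learning, and test-time search lets the model generalize effectively to generating complex real-world CUDA kernels, addressing both data scarcity and task difficulty.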

Abstract: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled rewards that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.


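[29] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild cs.CL | cs.LG | eess.AS

Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang

TL;DR: This paper introduces TaigiSpeech, a real-world speech intent dataset for Taiwanese Taigi (also known as Taiwanese Hokkien/Southern Min), a low-resource and primarily spoken language. The dataset contains 3k utterances from 21 older adults and is designed for practical intent detection scenarios such as healthcare and home assistants. To address the scarcity of labeled data, the paper explores two data mining strategies: keyword-match data mining with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that uses multimodal cues with minimal textual supervision. The design aims to enable scalable dataset construction for low-resource and unwritten spoken languages.

Details

Motivation: Many languages are underrepresented in speech technology because of limited resources. The paper builds a real-world speech intent dataset for Taiwanese Taigi, a low-resource and primarily spoken language, to support practical applications and further research.

Result: The paper presents the initial TaigiSpeech dataset, with 3k utterances from 21 older adults, and proposes two scalable data mining strategies. The abstract reports no specific quantitative results or benchmark performance.

Insight: The novelty lies in building a real-world intent dataset for a low-resource, unwritten spoken language and in proposing two scalable data mining methods, one based on LLM pseudo-labeling via an intermediate language and one on audio-visual multimodal cues, that minimize textual supervision to cope with annotation scarcity.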

Abstract: Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found at https://kwchang.org/taigispeech.


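[30] Generalizable Self-Evolving Memory for Automatic Prompt Optimization cs.CL

Guanbao Liang, Yuanchen Bei, Sheng Zhou, Yuheng Qin, Huan Zhou

TL;DR: This paper proposes MemAPO, a framework that recasts prompt optimization as generalizable, self-evolving experience accumulation. It uses a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates and organizes incorrect generations into structured error patterns capturing recurrent failure modes. Given a new prompt, MemAPO retrieves relevant strategies and error patterns to compose prompts that promote effective reasoning and avoid known mistakes. Through iterative self-reflection and memory editing, the framework keeps updating its memory, so prompt optimization improves over time rather than restarting from scratch for each task.

Details

Motivation: Existing automatic prompt optimization methods typically search for a specific prompt for a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time.

Result: Experiments on multiple benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.

Insight: The core contribution is reframing prompt optimization as a generalizable, self-evolving, memory-driven process: a dual-memory mechanism (strategy templates and error patterns) accumulates and reuses experience, and iterative self-reflection drives continual improvement, pointing toward more general and efficient prompt optimization systems.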

Abstract: Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.


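[31] DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing cs.CL

Nasser-Eddine Monir, Zakaria Baou

TL;DR: DATASHI is a new parallel English-Tashlhiyt corpus of 5,000 sentence pairs, 1,500 of which come with both expert-standardized and non-standard user-generated versions. It aims to fill the gap in computational resources for Amazigh languages and to support research on orthography normalization and low-resource language processing.

Details

Motivation: To address the lack of computational resources for Amazigh languages, Tashlhiyt in particular, to provide a data foundation for the systematic study of orthographic diversity and normalization, and to support text-based NLP tasks and multimodal alignment.

Result: Evaluations with state-of-the-art large language models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting. Gemini-2.5-Pro achieves the lowest word-level and character-level error rates and shows robust cross-lingual generalization.

Insight: The corpus’s dual design (standardized and non-standard versions) supports research on orthography normalization. A fine-grained analysis of edit operations (deletions, substitutions, insertions) across phonological classes (geminates, emphatics, uvulars, pharyngeals) reveals how sensitive models are to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.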

Abstract: DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.


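[32] TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression cs.CL

Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng

TL;DR: This paper proposes TAMTRL, a method for the credit assignment problem in multi-turn reinforcement learning that arises when large language models must process long documents chunk by chunk because of context window limits. It uses relevant documents as teacher signals and, in a self-supervised manner, provides a fine-grained learning signal for each turn’s memory update, improving long-context processing.

Details

Motivation: When a long document exceeds the model’s context window, it must be read in chunks over multiple turns with memory updates. Supervision, however, usually comes only from the final outcome, which makes it hard to assess the quality of each turn’s memory update; this is a temporal credit assignment challenge. Existing approaches such as LLM-as-a-judge or process reward models incur heavy computational overhead and suffer from estimation noise.

Result: Experiments with multiple models of varying scales on seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness.

Insight: The core contribution is a teacher-aligned reward reshaping mechanism that aligns relevant documents with each turn’s model input and assigns rewards as self-supervised normalized probabilities, providing a fine-grained learning signal for multi-turn memory updates. This offers an efficient, low-noise solution to credit assignment in multi-turn training.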

Abstract: The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model’s context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.


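[33] Probing How Scalable Table Data Enhances General Long-Context Reasoning cs.CL

Huaibing Xie, Guoliang Zhao, Yang Liu, Shihan Dou, Siming Huang

TL;DR: This paper investigates how structured table data improves the long-context reasoning of large language models. It finds that table data, owing to its periodic structure, shows potential for long-context reasoning, and a mutual-information analysis reveals periodic, non-vanishing dependencies. Building on this, the authors propose TableLong, a scalable data synthesis pipeline that generates high-quality, diverse, verifiable table data and strengthens long-context reasoning via reinforcement learning. Experiments show average gains of 8.24% across multiple long-context benchmarks and 8.06% on out-of-domain benchmarks.

Details

Motivation: To address the limited long-context reasoning of large language models on complex real-world tasks, and to explore which data types help and why, with a particular focus on the potential of structured table data.

Result: Model performance improves by 8.24% on average across multiple long-context benchmarks and by 8.06% on average on out-of-domain benchmarks, showing that table data is effective for strengthening long-context reasoning.

Insight: The novelty lies in the finding that the periodic structure of table data provides non-vanishing long-range dependencies that support long-context reasoning, and in TableLong, a scalable data synthesis pipeline that uses table data with reinforcement learning to improve model performance, giving practical guidance for choosing post-training data.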

Abstract: As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
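
The periodic dependency mentioned in this abstract can be illustrated with a toy example. The sketch below is not the authors' analysis: it assumes a synthetic table in which each cell tends to repeat the value directly above it, serializes the table row by row, and estimates mutual information between tokens at different lags. The helper names, the persistence probability, and the vocabulary are all hypothetical.

```python
import math
import random
from collections import Counter


def lagged_mutual_information(tokens, lag):
    """Empirical mutual information (bits) between tokens[i] and tokens[i + lag]."""
    pairs = list(zip(tokens, tokens[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # p(a,b) * log2(p(a,b) / (p(a) p(b))), written with raw counts.
    return sum(
        (count / n) * math.log2(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


def make_toy_table(num_rows, num_cols, vocab, persistence, rng):
    """Hypothetical table: each cell repeats the cell above it with probability
    `persistence`; otherwise it is resampled uniformly from a shared vocabulary."""
    rows = [[rng.choice(vocab) for _ in range(num_cols)]]
    for _ in range(num_rows - 1):
        previous = rows[-1]
        rows.append([
            value if rng.random() < persistence else rng.choice(vocab)
            for value in previous
        ])
    return rows


rng = random.Random(0)
NUM_COLS = 5
table = make_toy_table(2000, NUM_COLS, ["a", "b", "c", "d"], 0.9, rng)
flat = [cell for row in table for cell in row]  # row-major serialization

# Mutual information for lags 1..10; columns are independent of each other,
# so only lags that are multiples of NUM_COLS pair cells from the same column.
mi_by_lag = {
    lag: lagged_mutual_information(flat, lag)
    for lag in range(1, 2 * NUM_COLS + 1)
}
```

In this toy setting the estimate is large only at lags that are multiples of the column count and close to zero elsewhere, which is the kind of periodic pattern the abstract refers to. The toy dependency does weaken over many rows, so it illustrates periodicity only, not the non-vanishing property analyzed in the paper.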


[34] SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models cs.CL | cs.AI [PDF]

Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu

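TL;DR: This paper presents SemEval-2026 Task 12, Abductive Event Reasoning (AER), which aims to advance direct-cause inference by large language models in evidence-rich, real-world settings. Systems must identify the most plausible direct cause of a target event from supporting evidence. The organizers built an evidence-grounded multiple-choice benchmark that captures key challenges such as distributed evidence, indirect background factors, and semantically related but non-causal distractors.

Details

Motivation: Inferring the direct causes of real-world events matters for both natural language processing and practical decision-making, yet it remains underexplored in evidence-rich settings.

Result: The shared task attracted 122 participants and received 518 submissions. The paper covers the task formulation, dataset construction pipeline, evaluation setup, and system results; the abstract gives no specific quantitative results and does not say whether any system reached SOTA.

Insight: The contribution is the formalization of AER as an evidence-driven multiple-choice benchmark focused on causal reasoning about real-world events. It highlights the challenges of handling distributed evidence and separating causal from non-causal relations, and it gives future work on causal reasoning and multi-document understanding a clear evaluation framework.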

Abstract: Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER) (task data: https://github.com/sooo66/semeval2026-task12-dataset.git). The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.


[35] The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems cs.CL | cs.DB [PDF]

Lars Vogt

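TL;DR: This paper proposes the Semantic Ladder, an architectural framework that addresses the representational gap between natural language and formal semantic models. Using modular semantic units, the framework supports the progressive formalization of data and knowledge, from natural language snippets up to ontology-based and higher-order logical models. This enables incremental construction of semantic knowledge and integration of heterogeneous representations, and it lays a foundation for scalable, interoperable, AI-ready data and knowledge infrastructures.

Details

Motivation: Semantic data and knowledge infrastructures must reconcile natural language, the main form in which knowledge is created and communicated, with formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap is a central challenge, especially when full semantic formalization is required at the point of data entry.

Result: The abstract reports no quantitative experiments or benchmarks. It instead describes, at the conceptual and architectural level, how the framework supports semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability.

Insight: The contribution is a general framework for progressive formalization. It organizes representations into levels of increasing semantic explicitness and supports transformations between levels, which reduces the semantic parsing burden and integrates heterogeneous representations, including natural language, structured semantic models, and vector embeddings. The result is a systematic approach to building scalable, AI-ready knowledge infrastructure.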

Abstract: Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.


[36] TiCo: Time-Controllable Training for Spoken Dialogue Models cs.CL | cs.AI | eess.AS [PDF]

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass

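TL;DR: TiCo is a simple post-training method that enables spoken dialogue models to follow time-constrained instructions and generate responses of controllable duration. It introduces Spoken Time Markers (STM), which help the model estimate elapsed time during generation and adjust the remaining content to meet the target duration. TiCo needs only a small amount of data and no additional question-answer pairs, relying on self-generation and reinforcement learning. Experiments show that it significantly improves adherence to duration constraints while preserving response quality.

Details

Motivation: Existing spoken dialogue models produce natural spoken responses but lack time awareness and struggle with duration-related instructions (e.g., “Please generate a response lasting about 15 seconds”). This limits interaction quality in real-world spoken systems such as voice assistants and interactive agents.

Result: Experiments show that TiCo significantly improves adherence to duration constraints while preserving response quality. An empirical evaluation of open-source and commercial spoken dialogue models shows that existing models frequently fail to meet time-control requirements, a limitation that TiCo addresses.

Insight: The contribution is the use of Spoken Time Markers (STM) to strengthen the model’s time awareness, combined with a simple post-training recipe (a small amount of data plus self-generation and reinforcement learning). Time control is achieved without complex architectures or large annotated datasets, giving spoken systems an efficient way to control interaction.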

Abstract: We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., “Please generate a response lasting about 15 seconds”). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
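
To make the idea of Spoken Time Markers concrete, here is a minimal sketch of how such markers could be interleaved with a transcript. It is an illustration only: the abstract gives the marker format (e.g., <10.6 seconds>) but not how the markers are placed, so the fixed `interval`, the word-level end times, and the function name are assumptions.

```python
def insert_time_markers(timed_words, interval=2.0):
    """Interleave elapsed-time markers with words.

    timed_words: list of (word, end_time_in_seconds) pairs, in speaking order.
    interval: hypothetical spacing between markers, in seconds.
    """
    pieces = []
    next_mark = interval
    for word, end_time in timed_words:
        pieces.append(word)
        if end_time >= next_mark:
            # Marker format follows the example in the abstract.
            pieces.append(f"<{end_time:.1f} seconds>")
            # Skip ahead so at most one marker is emitted per word.
            while next_mark <= end_time:
                next_mark += interval
    return " ".join(pieces)


timed_words = [
    ("Sure,", 0.4), ("here", 0.7), ("is", 0.9), ("a", 1.0),
    ("short", 1.4), ("answer", 2.1), ("for", 2.4), ("you.", 3.0),
]
marked = insert_time_markers(timed_words, interval=2.0)
```

With the sample timings above, `marked` is `Sure, here is a short answer <2.1 seconds> for you.`, so a model reading its own output can see how much speaking time has elapsed.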


cs.CV [Back]

[37] Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection cs.CV | cs.AI [PDF]

Saeed Khaki, Nima Safaei, Kamal Ginotra

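TL;DR: This paper studies structured pruning of decoder layers in Transformer-based vision-language models (VLMs). Using domain-aware activation similarity, it shows how each layer’s representation transformation differs between math and non-math inputs, and it derives three pruning ranking criteria. Pruning exhibits three regimes: at low budgets, performance is sensitive to which layers are removed; at moderate budgets, methods converge; at high budgets, structural continuity dominates. Domain-aware rankings are stable in the sensitive regime and match or exceed structure-aware baselines at larger budgets.

Details

Motivation: VLM decoders contain depth redundancy, but the effect of removing specific layers is poorly understood, particularly in domains such as math that tightly couple perception with multi-step reasoning.

Result: Across two SOTA VLMs and a broad suite of math and general multimodal benchmarks, domain-aware rankings achieve the strongest stability in the ranking-sensitive regime and match or exceed structure-aware baselines at larger budgets.

Insight: The contribution is the use of domain-aware activation similarity to guide pruning. This reveals how VLM depth contributes to domain-specific behavior and yields an interpretable, practical way to reduce model depth without sacrificing mathematical or general vision-language capabilities.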

Abstract: Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs. This yields simple math-aware, non-math-aware, and mixed ranking criteria that identify layers whose input-output activations change least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three-regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing-aware strategies. Our domain-aware rankings achieve the strongest stability in the ranking-sensitive regime, while matching or exceeding structure-aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain-specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
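
The ranking criterion in this abstract (prefer layers whose input-output activations change least within a target domain) can be sketched as follows. This is a minimal illustration, not the paper's implementation: cosine similarity is assumed as the similarity measure, and `layer_io`, the function names, and the toy activations are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_layers_for_pruning(layer_io, budget):
    """Pick `budget` layers whose outputs stay closest to their inputs.

    layer_io: dict mapping layer index -> list of (input_vec, output_vec)
              pairs collected on inputs from the target domain (e.g., math).
    Returns (layers_to_prune, mean_similarity_per_layer).
    """
    scores = {}
    for layer, pairs in layer_io.items():
        sims = [cosine_similarity(x, y) for x, y in pairs]
        scores[layer] = sum(sims) / len(sims)
    # Highest similarity first: these layers transform representations least.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget], scores


# Toy activations: layer 1 is nearly an identity map, layer 0 changes a lot.
layer_io = {
    0: [([1.0, 0.0], [0.0, 1.0])],
    1: [([1.0, 0.0], [1.0, 0.01])],
    2: [([1.0, 0.0], [1.0, 1.0])],
}
to_prune, scores = rank_layers_for_pruning(layer_io, budget=2)
```

Running the same function on activations from math inputs, non-math inputs, or a mixture of both would give analogues of the math-aware, non-math-aware, and mixed rankings described in the abstract.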


[38] Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs cs.CV | cs.AR | cs.LG [PDF]

Danial Monachan, Samira Nazari, Mahdi Taheri, Ali Azarpeyvand, Milos Krstic

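TL;DR: This paper proposes Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework for compressing deep neural networks (DNNs). It combines sensitivity scores with simple architectural rules to generate diverse, high-quality pruning configurations for different layers, achieving strong compression while preserving accuracy for deployment on edge devices.

Details

Motivation: Different layers and architectures respond differently to pruning, so single-strategy methods are suboptimal. A framework is needed that coordinates different pruning signals and generates well-suited configurations.

Result: Experiments on CNNs and Vision Transformers show Pareto-optimal results; on Swin-Tiny, for example, accuracy degradation is reduced by 40% relative to standard single-criterion pruning.

Insight: The contribution lies in coordinating existing pruning signals (magnitude, gradient, or their combination) through architecture-aware sparsity ranges (e.g., preserving normalization layers while pruning classifiers more aggressively) and systematic sampling. This yields deployment-ready accuracy-sparsity trade-offs without repeated pruning runs, which the authors find more reliable and efficient than introducing new criteria.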

Abstract: Deploying deep neural networks (DNNs) on edge devices requires strong compression with minimal accuracy loss. This paper introduces Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that leverages sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations. The framework addresses a key limitation that different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. Mix-and-Match derives architecture-aware sparsity ranges, e.g., preserving normalization layers while pruning classifiers more aggressively, and systematically samples these ranges to produce ten strategies per sensitivity signal (magnitude, gradient, or their combination). This eliminates repeated pruning runs while offering deployment-ready accuracy-sparsity trade-offs. Experiments on CNNs and Vision Transformers demonstrate Pareto-optimal results, with Mix-and-Match reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning. These findings show that coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria.
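
The idea of deriving per-layer sparsity ranges and sampling them into ten strategies can be sketched as below. Everything specific here is an assumption made for illustration: the range values, the layer names, the linear sampling rule, and the way sensitivity scales the sparsity are not taken from the paper.

```python
# Hypothetical (min, max) sparsity ranges per layer type; values are illustrative.
SPARSITY_RANGES = {
    "norm": (0.0, 0.0),        # normalization layers are preserved
    "attention": (0.2, 0.6),
    "mlp": (0.3, 0.7),
    "classifier": (0.5, 0.9),  # classifier is pruned more aggressively
}


def build_strategies(layers, sensitivity, num_strategies=10):
    """Sample each layer's sparsity range at evenly spaced aggressiveness levels.

    layers: dict mapping layer name -> layer type (a key of SPARSITY_RANGES).
    sensitivity: dict mapping layer name -> score in [0, 1]; higher means
                 the layer is more sensitive to pruning.
    """
    strategies = []
    for k in range(num_strategies):
        level = k / (num_strategies - 1)  # 0.0 (mild) to 1.0 (aggressive)
        strategy = {}
        for name, kind in layers.items():
            low, high = SPARSITY_RANGES[kind]
            # More sensitive layers stay closer to the low end of their range.
            strategy[name] = low + (high - low) * level * (1.0 - sensitivity[name])
        strategies.append(strategy)
    return strategies


layers = {
    "block1.norm": "norm",
    "block1.attn": "attention",
    "block1.mlp": "mlp",
    "head": "classifier",
}
sensitivity = {"block1.norm": 0.9, "block1.attn": 0.5, "block1.mlp": 0.25, "head": 0.0}
strategies = build_strategies(layers, sensitivity)
```

Each entry of `strategies` is one candidate configuration (layer name to target sparsity), so a single sensitivity pass yields a family of accuracy-sparsity trade-offs to evaluate rather than one pruning run per target.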


[39] Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects cs.CV [PDF]

Heng Zhou, Xiaoxiong Liu, Zhenxi Zhang, Jieheng Yun, Chengyang Li

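TL;DR: This paper presents the first systematic survey of remote sensing image dehazing, covering methodological evolution, benchmark assessment, and physical consistency analysis. It groups existing methods into three stages: handcrafted physical priors, data-driven deep restoration, and hybrid physical-intelligent generation, and it summarizes more than 30 representative methods based on CNNs, GANs, Transformers, and diffusion models. Large-scale quantitative experiments and a physical consistency analysis assess the performance and limitations of these methods, and the paper closes with open challenges and future research directions.

Details

Motivation: Remote sensing images are often degraded by atmospheric conditions such as haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. The field has lacked a systematic survey of its methodological evolution, benchmark assessment, and physical consistency. This paper fills that gap and offers a reference for developing trustworthy, controllable, and efficient dehazing systems.

Result: Large-scale quantitative experiments were run on five public datasets with 12 metrics (including PSNR, SSIM, CIEDE, LPIPS, and FID). Cross-domain comparison shows that recent Transformer- and diffusion-based models improve SSIM by 12% to 18% and reduce perceptual errors by 20% to 35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further shows that models with explicit transmission or airlight constraints reduce color bias by up to 27%.

Insight: The paper claims to be the first systematic, unified survey of remote sensing image dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. Viewed objectively, its three-stage account of how methods developed, its large-scale multi-metric benchmark, and its emphasis on physical constraints for radiometric consistency give the field a clear picture of its evolution and a reliable empirical reference. The proposed direction of trustworthy, controllable, and efficient (TCE) systems is also forward-looking.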

Abstract: Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSI dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression: from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12% to 18% and reduce perceptual errors by 20% to 35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges: dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations, and outline promising research directions for future development of trustworthy, controllable, and efficient (TCE) dehazing systems. All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations, are publicly available at https://github.com/VisionVerse/RemoteSensing-Restoration-Survey.


[40] Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly cs.CV | cs.RO | eess.IV [PDF]

Qihao Lin, Borui Chen, Yuping Zhou, Jianing Wu, Yulan Guo

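TL;DR: This paper proposes a general framework for estimating the contours of transparent fragments through visual-tactile fusion, aimed at autonomous reassembly. The framework comprises the transparent fragment dataset TransFrag27K, the visual grasping position detection network TransFragNet, a visual-tactile fusion material classifier, and a contour matching and reassembly algorithm based on multi-dimensional similarity metrics. It performs well in real-world validation.

Details

Motivation: Estimating the contours of transparent fragments is difficult because of their strict optical properties and their irregular shapes and edges. The problem matters in areas such as precision optical instrument repair and cultural relic restoration.

Result: Experiments demonstrate the effectiveness of the proposed framework, which shows strong performance in real-world validation. The work also provides a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly.

Insight: The contributions include a large-scale synthetic dataset of transparent fragments, a visual-tactile fusion perception framework that mimics how humans combine vision and touch to estimate contours, and multi-dimensional similarity metrics for contour matching and reassembly. Together they offer a new approach to handling transparent objects.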

Abstract: Contour estimation of transparent fragments is important for autonomous reassembly, especially in precision optical instrument repair, cultural relic restoration, and the analysis of breakage accidents involving other precious devices. Unlike intact transparent objects, transparent fragments pose greater challenges for contour estimation because of their strict optical properties and their irregular shapes and edges. To address this issue, this paper proposes a general transparent fragment contour estimation framework based on visual-tactile fusion. First, we construct a transparent fragment dataset named TransFrag27K, which includes multi-scene synthetic data of broken fragments from multiple types of transparent objects, together with a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment the sampling grasping position. We then use a two-finger gripper with GelSight Mini sensors to obtain reconstructed tactile information about the lateral edges of the fragments. By fusing this tactile information with visual cues, we build a visual-tactile fusion material classifier. Inspired by the way humans estimate a fragment’s contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, we propose a contour matching and reassembly algorithm based on multi-dimensional similarity metrics, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and code are available at https://github.com/Keithllin/Transparent-Fragments-Contour-Estimation.


[41] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control cs.CV | cs.AI | cs.MM | cs.SD [PDF]

Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He

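TL;DR: This paper proposes EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. It introduces a frame-by-frame, in-context, audio-driven streaming generation paradigm. The Sink Frame Window Attention (SFA) mechanism supports variable-length video generation with identity consistency, and the streaming Frame Condition In-Context (FCIC) scheme efficiently injects diverse control signals, enabling interactive control at every frame and at arbitrary moments.

Details

Motivation: Existing autoregressive methods rely on intermediate facial representations, which limits expressiveness and realism. Diffusion-based methods generate clip by clip, lack fine-grained control, and incur inherent latency because the whole window is denoised together. The paper aims to remove these limitations and enable flexible, efficient interactive generation.

Result: Experiments show that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods.

Insight: The contributions are a streaming, in-context autoregressive control paradigm, the SFA mechanism for variable-length generation with identity consistency, and the FCIC scheme for handling diverse control signals in a unified way. Together they open a scalable direction for flexible, efficient generation.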

Abstract: Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.


[42] GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems cs.CV | cs.GR [PDF]

Xiaojian Lin, Yaomin Shen, Junyuan Ma, Yujie Sun, Chengqing Bu

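TL;DR: This paper proposes GraphiContact, a framework for vertex-level human-scene contact perception from monocular images. It transfers complementary human priors from two pretrained Transformer encoders and uses 3D human mesh reconstruction as a scaffold to predict per-vertex contact on the reconstructed mesh. To improve robustness under occlusion and perceptual noise, the paper also introduces a Single-Image Multi-Infer Uncertainty (SIMU) training strategy.

Details

Motivation: Existing methods either predict contact without fully exploiting explicit 3D human priors, or focus on pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and noise. The paper fills this gap by jointly addressing single-image 3D human mesh reconstruction and robust vertex-level contact prediction.

Result: Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction.

Insight: The main contribution is a pose-aware framework that transfers complementary human priors from pretrained encoders to contact reasoning, together with the SIMU training strategy, which simulates real-world occlusion and noise. Robustness improves while inference remains an efficient single branch.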

Abstract: Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at https://github.com/Aveiro-Lin/GraphiContact.


[43] VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs cs.CV | cs.LG [PDF]

Govinda Kolli, Adinath Madhavrao Dukre, Behzad Bozorgtabar, Dwarikanath Mahapatra, Imran Razzak

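TL;DR: This paper proposes VGS-Decoding, a training-free method for mitigating hallucinations in medical vision-language models during inference. Its core is the Visual Grounding Score, which measures how strongly each token depends on visual information by comparing the model’s output distributions for the original image and a distorted image. During decoding, token probabilities are adjusted dynamically according to this score to strengthen generation grounded in visual evidence and suppress hallucinations.

Details

Motivation: Medical vision-language models pose serious risks in clinical use because they often generate answers from language priors rather than visual evidence, that is, they hallucinate. The paper targets this problem with a method that mitigates hallucinations at inference time without additional training.

Result: On the MIMIC-Diff-VQA and VQA-RAD benchmarks, with LLaVA-Med, CheXagent, and MedGemma, VGS-Decoding yields consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and requiring no additional training.

Insight: The paper’s claimed novelty is the observation that hallucinated tokens and visually grounded tokens behave differently in probability when visual information is degraded, and the design of a dynamic, token-level adaptive decoding control (VGS) based on it, in contrast to contrastive methods with fixed weights. Viewed objectively, the method is a lightweight, plug-and-play, deployment-friendly strategy and a novel, practical route to reducing VLM hallucinations.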

Abstract: Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token’s visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
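
The reweighting step in this abstract can be sketched on a toy next-token distribution. The abstract does not give the exact formula for the Visual Grounding Score, so the normalized probability difference, the exponential reweighting, and the `alpha` parameter below are assumptions chosen for illustration, not the authors' definition.

```python
import math


def visual_grounding_scores(p_original, p_distorted):
    """Hypothetical per-token score in [-1, 1].

    Positive when a token loses probability once the image is distorted
    (visually grounded); negative when it keeps or gains probability.
    """
    return [
        (po - pd) / (po + pd) if (po + pd) > 0 else 0.0
        for po, pd in zip(p_original, p_distorted)
    ]


def reweight(p_original, scores, alpha=2.0):
    """Amplify grounded tokens, suppress the rest, then renormalize."""
    weighted = [p * math.exp(alpha * s) for p, s in zip(p_original, scores)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Toy next-token distributions over three candidate tokens.
tokens = ["effusion", "normal", "the"]
p_original = [0.40, 0.45, 0.15]   # model output given the real image
p_distorted = [0.10, 0.70, 0.20]  # model output given a distorted image

scores = visual_grounding_scores(p_original, p_distorted)
reweighted = reweight(p_original, scores)
best_before = tokens[max(range(len(tokens)), key=lambda i: p_original[i])]
best_after = tokens[max(range(len(tokens)), key=lambda i: reweighted[i])]
```

In this toy case "normal" is the top token before reweighting, but it gains probability when the image is distorted, so it is suppressed and "effusion" becomes the top token. Because each decoding step needs one forward pass per image version, a scheme of this kind is consistent with the roughly 2× inference overhead reported in the abstract.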


[44] NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation cs.CV [PDF]

Quang Dang Huynh, Xuefei Yin, Andrew Busch, Hugo G. Espinosa, Alan Wee-Chung Liew

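TL;DR: This paper proposes NCSTR, a novel node-centric, decoupled spatio-temporal reasoning framework for video-based human pose estimation. It fuses visual, temporal, and structural information through a visuo-temporal velocity-based joint embedding, an attention-driven pose-query encoder, a dual-branch decoupled spatio-temporal attention graph, and a node-space expert fusion module. The design targets motion blur, occlusion, and complex spatio-temporal dynamics, and it achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing video pose estimation methods rely on heatmaps or implicit spatio-temporal feature aggregation, which limits how well joint topology can be expressed and weakens cross-frame consistency. The paper uses explicit node-centric reasoning to integrate visual, temporal, and structural information and improve accuracy and robustness.

Result: Extensive experiments on three widely used video pose benchmarks show that the method outperforms existing state-of-the-art methods.

Insight: The contribution is an explicit node-centric reasoning framework in which decoupled spatio-temporal attention graphs model temporal propagation and spatial constraints separately, and local and global cues are fused adaptively. It offers a new perspective on video pose estimation and underlines the value of explicitly modeling spatio-temporal relations between joints.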

Abstract: Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.


[45] DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis cs.CV [PDF]

Getamesay Dagnaw, Xuefei Yin, Muhammad Hassan Maqsood, Yanming Zhu, Alan Wee-Chung Liew

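TL;DR: This paper proposes DCG-Net, an end-to-end interpretable framework for medical diagnosis that integrates multimodal alignment with structured concept reasoning, addressing the opaque decision processes of deep learning models in medical image analysis. The framework introduces a Dual Cross-Attention module and a Parametric Concept Graph to capture contextual dependencies among clinical concepts, and it achieves state-of-the-art classification performance on white blood cell morphology and skin lesion diagnosis.

Details

Motivation: Existing Concept Bottleneck Models (CBMs) for medical image analysis usually ignore contextual dependencies among concepts, which limits their interpretability. The paper aims to build an interpretable diagnostic model that is better aligned with clinical domain knowledge through structured reasoning and spatially localized evidence attribution.

Result: In experiments on white blood cell morphology and skin lesion diagnosis, DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.

Insight: The contributions are: 1) a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention, enabling spatially localized evidence attribution; 2) a Parametric Concept Graph, initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing, that models dependencies among concepts. These designs improve interpretability and are consistent with the structure of clinical knowledge.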

Abstract: Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address these issues, we propose DCG-Net, an end-to-end interpretable framework that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.


[46] Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation cs.CV | cs.RO | eess.IV | eess.SP [PDF]

Preeti Meena, Himanshu Kumar, Sandeep Yadav

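TL;DR: This paper proposes a novel 360° saliency graph for scene representation, which explicitly encodes the visual, contextual, semantic, and geometric information of a scene as nodes, edges, edge weights, and angular positions in the graph. The representation is robust to changes in scene view and addresses indoor challenges such as varied illumination, occlusions, and shadows. The authors apply it to vision-based indoor navigation, including localizing a query scene in a topological map and estimating the next movement direction towards the target.

Details

Motivation: Existing scene representations (RGB-D, LiDAR scans, keypoints, and so on) usually embed information implicitly, which can be inefficient for applications such as scene indexing and vision-based navigation. The paper proposes a more efficient and robust explicit scene representation to overcome the difficulties that traditional methods face in indoor navigation.

Result: Experiments show that the proposed 360° saliency graph representation is effective at improving scene localization and vision-based indoor navigation. It is compared with existing navigation methods that use 360° scenes.

Insight: The contribution is the unified encoding of several kinds of information (visual, contextual, semantic, geometric) in a compact 360° graph structure that is robust to view changes and environmental disturbances, giving scene understanding and navigation tasks a richer and more direct representation.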

Abstract: Scenes are represented visually in many formats, such as RGB-D, LiDAR scans, keypoints, and rectangular, spherical, or multi-view images, but in these formats the information relevant to applications such as scene indexing and vision-based navigation is only implicitly embedded. These representations may therefore be inefficient for such applications. This paper proposes a novel 360° saliency graph representation of scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular positions in the 360° graph. The representation is also robust to changes in scene view and addresses challenges of indoor environments that affect existing traditional methods, such as varied illumination, occlusions, and shadows. We use this rich and efficient representation for vision-based navigation and compare it with existing navigation methods that use 360° scenes; these existing methods suffer from poor scene representation that lacks scene-specific information. We first use the proposed representation to localize the query scene in a given topological map, and then facilitate 2D navigation by estimating the next required movement directions towards the target destination in the map, using the geometric information embedded in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.


[47] Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier cs.CV [PDF]

Yujie Zhou, Pengyang Ling, Jiazi Bu, Bingjie Gao, Li Niu

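TL;DR: This paper proposes Uni-Classifier (Uni-C), a plug-and-play module that leverages video diffusion priors to address the loss of generation quality caused by distributional mismatch in workflows that chain multiple models. It guides the denoising process of the upstream model so that its outputs align with the downstream model’s expected inputs, and it can also be used on its own to improve a single generative model. Experiments show that it improves generation quality on both video and 3D generation tasks.

Details

Motivation: In practical AI workflows that chain several generative models (for example, 2D image generation followed by video or 3D generation), the upstream output often does not match the distribution the downstream model expects, which lowers overall generation quality.

Result: Extensive experiments on video and 3D generation tasks show that Uni-C consistently improves generation quality in both workflow-based and standalone settings, demonstrating its versatility and strong generalization.

Insight: The contribution is the use of video diffusion priors as a universal guidance classifier that aligns distributions between models in a plug-and-play manner, improving both workflow coherence and single-model quality. The method is simple, effective, and generalizes well.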

Abstract: In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.


[48] Jigsaw Regularization in Whole-Slide Image Classification cs.CV [PDF]

So Won Jeong, Veronika Ročková

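TL;DR: This paper proposes a whole-slide image classification method that combines vision foundation-model embeddings with graph neural networks. By adding awareness of local spatial structure and a novel jigsaw regularization, it markedly improves classification on benchmark datasets in breast, head-and-neck, and colon cancer.

Details

Motivation: Existing WSI classification methods based on multiple instance learning (MIL) usually treat patches as exchangeable and ignore the rich spatial and topological structure of tissue images. Spatial information needs to be integrated more effectively to improve classification accuracy.

Result: On benchmark datasets in breast, head-and-neck, and colon cancer, the method markedly outperforms current state-of-the-art attention-based MIL approaches.

Insight: The contribution is the use of vision foundation-model embeddings to capture local spatial structure within each patch, combined with graph neural networks and a novel jigsaw regularization to achieve spatial awareness across patches, offering a new approach to spatial modeling in computational pathology.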

Abstract: Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision foundation-model embeddings to incorporate local spatial structure within each patch, and (2) we achieve across-patch spatial awareness using graph neural networks together with a novel jigsaw regularization. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.


[49] Monocular Models are Strong Learners for Multi-View Human Mesh Recovery cs.CV [PDF]

Haoyu Xie, Shengkai Xu, Cheng Guo, Muhammad Usama Saleem, Wenhan Wu

TL;DR: 本文提出了一种无需训练的多视角人体网格恢复框架,利用预训练的单视角模型作为强先验,通过构建一致的多视角初始化和基于多视角一致性及解剖约束的测试时优化,实现了无需相机标定且泛化性强的重建。

Details

Motivation: 解决现有几何方法依赖繁琐相机标定,以及学习方法因缺乏多视角训练数据而泛化性差的问题,旨在实现无需标定且适应任意相机配置的多视角人体网格恢复。

Result: 在标准基准测试中取得了最先进的性能,超越了使用显式多视角监督训练的多视角模型。

Insight: 创新点在于利用预训练单视角模型作为先验,避免多视角数据训练,并通过测试时优化结合多视角一致性和解剖约束,实现了校准无关的强泛化重建。

Abstract: Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.


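[50] PEARL: Personalized Streaming Video Understanding Model cs.CV | cs.AI | cs.IR

Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang

TL;DR: This paper introduces a new task, Personalized Streaming Video Understanding (PSVU), and builds PEARL-Bench, the first benchmark dedicated to evaluating it. It also proposes PEARL, a plug-and-play, training-free method that serves as a strong baseline and achieves state-of-the-art performance across multiple models.

Details

Motivation: Current multimodal personalization methods are largely limited to static images or offline videos and cannot handle continuous visual input with real-time feedback, which limits the ability of future AI assistants to give real-time, interactive personalized responses. The paper aims to close this gap.

Result: Eight offline and online models are evaluated extensively on the proposed PEARL-Bench. PEARL achieves state-of-the-art performance and brings consistent PSVU improvements when applied to three distinct architectures.

Insight: The novelty lies in the first formal definition of the PSVU task and a comprehensive benchmark with frame-level and video-level evaluation modes. PEARL is a general, training-free strategy that effectively improves existing models on streaming personalized video understanding, pointing toward streaming personalized AI assistants.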

Abstract: Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model’s ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.


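[51] CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models cs.CV

Kaizhen Tan

TL;DR: This paper proposes CREG (Compass Relational Evidence Graph), a training-free interpretability framework that reveals how vision-language models (VLMs) encode directional relations between objects during spatial reasoning. It projects multi-layer contrastive gradient-times-activation attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors.

Details

Motivation: Existing attribution methods such as GradCAM and attention rollout show only where a model attends, not the directional relation it infers between objects. The paper aims to explain more faithfully how VLMs encode directional relations in spatial reasoning tasks.

Result: Evaluated on Qwen2-VL-7B with the VSR and COCO-Pairs datasets, CREG outperforms standard attribution baselines on Direction Alignment Error (DAE) and Edge Accuracy (EA). On COCO-Pairs, for example, prediction-targeted CREG reaches a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees and 0.120 respectively. Causal occlusion experiments (COS ≥ +0.42) further support the faithfulness of the directional explanations.

Insight: The novelty lies in combining multi-layer contrastive attribution with a polar-coordinate projection to produce a quantifiable directional evidence distribution, together with metrics designed for directional explanations (DAE, EA, COS). The results indicate that contrastive, multi-layer attribution exposes directional evidence in VLM spatial reasoning more faithfully than standard saliency-based explanations, and that the method benefits from the more structured spatial representations of larger models.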

Abstract: Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
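The projection step described in the abstract can be illustrated with a minimal sketch. It assumes the attribution map has already been computed and is given as a 2D grid of non-negative scores. The 8-way sector layout, the function names, and the wrap-around angular error are illustrative assumptions, not the paper's exact definitions of its evidence graph or of DAE.

```python
import math

# Assumed 8-way compass; index i covers the 45-degree sector centered on 45*i degrees.
SECTORS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]


def compass_evidence(attr, ref_xy):
    """Bin a 2D attribution map into compass sectors around a reference point.

    attr   -- list of rows of non-negative attribution scores
    ref_xy -- (x, y) pixel coordinates of the reference object's center
    Returns a dict mapping sector name to its share of the total attribution.
    """
    rx, ry = ref_xy
    mass = [0.0] * len(SECTORS)
    for y, row in enumerate(attr):
        for x, score in enumerate(row):
            if score <= 0 or (x == rx and y == ry):
                continue  # no evidence, or no direction at the reference itself
            # Image y grows downward, so flip it to make "north" point up.
            angle = math.degrees(math.atan2(ry - y, x - rx)) % 360.0
            idx = int(((angle + 22.5) % 360.0) // 45.0)
            mass[idx] += score
    total = sum(mass)
    if total == 0:
        return {name: 0.0 for name in SECTORS}
    return {name: m / total for name, m in zip(SECTORS, mass)}


def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


# Toy 3x3 map: evidence directly above and directly right of the center pixel.
toy_map = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
toy_evidence = compass_evidence(toy_map, ref_xy=(1, 1))
```

In the toy map the evidence splits evenly between the N and E sectors. A metric in the spirit of DAE would compare a direction derived from such a distribution with the ground-truth direction using a wrap-around difference like `angular_error`.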


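[52] Lessons and Open Questions from a Unified Study of Camera-Trap Species Recognition Over Time cs.CV

Sooyoung Jeon, Hongjie Tian, Lemeng Wang, Zheda Mai, Vidhi Bakshi

TL;DR: This is the first unified study of camera-trap species recognition over time. It introduces a streaming evaluation benchmark covering 546 camera traps and reports key findings: biological foundation models adapt poorly to specific sites, and model updates can hurt performance under realistic deployment conditions. It also proposes improvement strategies and open questions.

Details

Motivation: In ecological practice, species recognition must remain reliable at a fixed camera-trap site over time. Existing research mostly focuses on cross-domain generalization and overlooks the temporal shift caused by changing ecosystems.

Result: On the proposed streaming benchmark, biological foundation models (e.g., BioCLIP 2) underperform at many sites even in the initial intervals; under update evaluation that simulates real deployment, naive adaptation can fall below zero-shot performance; combining model updates with post-processing substantially improves accuracy, though a gap from the upper bounds remains.

Insight: The study highlights the decisive effect of site-specific adaptation and temporal shift (in species distribution and background) on model performance, shows that integrating model updates with post-processing is effective, provides actionable deployment guidelines for ecological practitioners, and points out new directions for computer vision and machine learning research.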

Abstract: Camera traps are vital for large-scale biodiversity monitoring, yet accurate automated analysis remains challenging due to diverse deployment environments. While the computer vision community has mostly framed this challenge as cross-domain generalization, this perspective overlooks a primary challenge faced by ecological practitioners: maintaining reliable recognition at a fixed site over time, where the dynamic nature of ecosystems introduces profound temporal shifts in both background and animal distributions. To bridge this gap, we present the first unified study of camera-trap species recognition over time. We introduce a realistic benchmark comprising 546 camera traps with a streaming protocol that evaluates models over chronologically ordered intervals. Our end-user-centric study yields four key findings. (1) Biological foundation models (e.g., BioCLIP 2) underperform at numerous sites even in initial intervals, underscoring the necessity of site-specific adaptation. (2) Adaptation is challenging under realistic evaluation: when models are updated using past data and evaluated on future intervals (mirroring real deployment lifecycles), naive adaptation can even degrade below zero-shot performance. (3) We identify two drivers of this difficulty: severe class imbalance and pronounced temporal shift in both species distribution and backgrounds between consecutive intervals. (4) We find that effective integration of model-update and post-processing techniques can substantially improve accuracy, though a gap from the upper bounds remains. Finally, we highlight critical open questions, such as predicting when zero-shot models will succeed at a new site and determining whether/when model updates are necessary. Our benchmark and analysis provide actionable deployment guidelines for ecological practitioners while establishing new directions for future research in vision and machine learning.
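The streaming protocol, in which a model is updated on past intervals and scored on the next one, can be sketched as a short loop. The majority-class "model", the function names, and the toy data below are hypothetical stand-ins chosen only to show the chronological train/evaluate split. They are not the benchmark's actual models or data.

```python
from collections import Counter


def streaming_eval(intervals, fit, predict):
    """Evaluate over chronologically ordered intervals.

    intervals -- list of intervals, each a list of (features, label) pairs
    fit       -- callable building a model from all past samples
    predict   -- callable (model, features) -> predicted label
    Returns one accuracy per interval after the first; the first interval
    has no past data to train on in this sketch.
    """
    accuracies = []
    history = []
    for t, batch in enumerate(intervals):
        if t > 0 and batch:
            model = fit(history)  # only data from earlier intervals
            correct = sum(predict(model, x) == y for x, y in batch)
            accuracies.append(correct / len(batch))
        history.extend(batch)  # the current interval becomes "past" data
    return accuracies


def fit_majority(samples):
    """Toy model: remember the most frequent label seen so far."""
    return Counter(label for _, label in samples).most_common(1)[0][0]


def predict_majority(model, features):
    """Toy predictor: always output the remembered label."""
    return model


# Toy site where the species mix shifts from deer to fox over time.
toy_intervals = [
    [(None, "deer"), (None, "deer"), (None, "fox")],
    [(None, "deer"), (None, "fox")],
    [(None, "fox"), (None, "fox")],
]
toy_accuracies = streaming_eval(toy_intervals, fit_majority, predict_majority)
```

With this toy data the accuracy falls from 0.5 to 0.0 as the species distribution shifts, a simplified picture of how a model fitted on past intervals can fail on future ones.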


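[53] End-to-End Optimization of Polarimetric Measurement and Material Classifier cs.CV

Ryota Maeda, Naoki Arikawa, Yutaka No, Shinsaku Hiura

TL;DR: This paper proposes an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combination of rotation angles for the polarization elements controlling the incident and reflected light states, enabling efficient material classification.

Details

Motivation: Polarimetric measurement for material classification requires multiple modulations and is therefore time-consuming. The paper explores the optimal measurement configuration for accurate classification with a limited number of measurements.

Result: On a Mueller-matrix material dataset, the method achieves high-accuracy material classification with a limited number of measurements.

Insight: The novelty lies in jointly learning the material classifier and the polarimetric measurement angles end to end, which reduces the number of measurements while preserving classification performance and offers a new approach to designing efficient polarization sensing systems.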

Abstract: Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.
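To make the role of the rotation angles concrete, the sketch below simulates one intensity measurement with standard Mueller calculus. It assumes unpolarized incident light and ideal linear polarizers on both the illumination and detection sides. The paper's actual optical elements and its learned angle selection are not specified in the abstract, so the setup, the function names, and the fixed angle pairs are illustrative assumptions.

```python
import math


def linear_polarizer(theta):
    """Mueller matrix of an ideal linear polarizer at angle theta (radians)."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return [
        [0.5, 0.5 * c, 0.5 * s, 0.0],
        [0.5 * c, 0.5 * c * c, 0.5 * c * s, 0.0],
        [0.5 * s, 0.5 * c * s, 0.5 * s * s, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]


def matvec(matrix, vector):
    """Multiply a 4x4 matrix by a length-4 Stokes vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def measure(sample, theta_in, theta_out):
    """Intensity for one (incident, analyzer) pair of polarizer angles."""
    stokes = [1.0, 0.0, 0.0, 0.0]  # unpolarized light of unit intensity
    stokes = matvec(linear_polarizer(theta_in), stokes)   # set incident state
    stokes = matvec(sample, stokes)                       # material interaction
    stokes = matvec(linear_polarizer(theta_out), stokes)  # analyze reflection
    return stokes[0]  # a detector sees only the intensity component


def measurement_vector(sample, angle_pairs):
    """Feature vector a classifier would receive for the chosen angle pairs."""
    return [measure(sample, a, b) for a, b in angle_pairs]


IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
DEPOLARIZER = [[1.0 if i == j == 0 else 0.0 for j in range(4)] for i in range(4)]

pairs = [(0.0, 0.0), (0.0, math.pi / 2)]  # parallel and crossed polarizers
features_identity = measurement_vector(IDENTITY, pairs)
features_depolarizer = measurement_vector(DEPOLARIZER, pairs)
```

Two angle pairs already separate the two toy "materials": a polarization-preserving sample gives intensities of 0.5 and 0, while an ideal depolarizer gives 0.25 for both. An end-to-end approach would treat the angles as trainable parameters instead of fixing them by hand as done here.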


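[54] When Negation Is a Geometry Problem in Vision-Language Models cs.CV

Fawaz Sammani, Tzoulio Chamiti, Paul Gavrikov, Nikos Deligiannis

TL;DR: Joint vision-language embedding models such as CLIP struggle with negation in text queries (e.g., "a plain blue shirt with no logos"). This paper proposes a new evaluation framework that uses multimodal large language models (MLLMs) as judges to assess negation understanding more reliably. It finds a direction in the CLIP embedding space associated with negation, and shows that test-time intervention via representation engineering can steer the model toward negation-aware behavior without fine-tuning.

Details

Motivation: Existing CLIP models struggle to understand negation in text. Prior work mainly uses data-driven approaches, such as fine-tuning on large-scale synthetic negation datasets, but the retrieval-based metrics used for evaluation cannot reliably show whether the model actually understands negation. The paper aims to establish a fairer evaluation framework and to explore interventions that need no fine-tuning.

Result: The study finds evidence of a direction in the CLIP embedding space associated with negation, and test-time intervention via representation engineering effectively improves negation understanding. Generalization is tested on uncommon image-text samples under distribution shift.

Insight: The novelty lies in an MLLM-as-a-judge evaluation framework that measures negation understanding more accurately, and in the discovery of a negation direction inherent in the CLIP embedding space, which enables test-time intervention via representation engineering without fine-tuning and offers a new approach to model controllability.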

Abstract: Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish “no” in the query: “a plain blue shirt with no logos”. Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.
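The idea of a negation direction that can be manipulated at test time can be sketched on toy vectors. The abstract does not say how the direction is estimated, so the difference-of-means estimate below is an assumption (a common choice in representation engineering), and the 2D vectors are made-up stand-ins for real CLIP text embeddings.

```python
import math


def mean(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    nu, nv = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(nu, nv))


def negation_direction(negated, affirmative):
    """Assumed estimator: mean of negated embeddings minus mean of affirmative ones."""
    diff = [a - b for a, b in zip(mean(negated), mean(affirmative))]
    return normalize(diff)


def steer(embedding, direction, alpha):
    """Shift an embedding along the direction, then re-normalize."""
    return normalize([e + alpha * d for e, d in zip(embedding, direction)])


# Toy embeddings: negated captions differ from affirmative ones along axis 1.
toy_negated = [[1.0, 1.0], [1.0, 3.0]]
toy_affirmative = [[1.0, 0.0], [1.0, 0.0]]
toy_direction = negation_direction(toy_negated, toy_affirmative)
steered = steer([1.0, 0.0], toy_direction, alpha=1.0)
```

Here `alpha` is a hypothetical steering strength. No weights change: the intervention only moves the query embedding before it is compared with image embeddings.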


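[55] RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction cs.CV

Feiran Wang, Zezhou Shang, Gaowen Liu, Yan Yan

TL;DR: RayMap3R is a training-free streaming framework for dynamic scene reconstruction. Building on the observation that RayMap-based predictions exhibit a static-scene bias, it constructs a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions and suppresses their interference during memory updates. It also introduces reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize trajectory prediction.

Details

Motivation: Existing streaming feed-forward 3D reconstruction methods estimate scene geometry and camera poses in real time but lack explicit dynamic reasoning, so moving objects cause reconstruction artifacts and drift.

Result: The method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.

Insight: The core novelty is using the static bias of RayMap predictions as an internal cue for identifying dynamic regions, together with a training-free dual-branch inference framework and the reset metric alignment and state-aware smoothing mechanisms, which handle dynamic interference online and keep the reconstruction stable and consistent.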

Abstract: Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.


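[56] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework cs.CV

Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang

TL;DR: This paper proposes ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of ScaleEdit-12M, a large-scale, high-quality image editing dataset. Three key components (source image expansion with knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism) overcome the limitations of relying on closed-source models or fixed synthesis pipelines. Experiments show that fine-tuning unified multimodal models on the dataset significantly improves performance on general and knowledge-infused editing benchmarks.

Details

Motivation: Building large-scale, diverse, high-quality datasets for instruction-based image editing is difficult; the paper aims to avoid both costly closed-source APIs and fixed synthesis pipelines with limited quality and generalizability.

Result: The authors build ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families. After fine-tuning UniWorld-V1 and Bagel, performance improves by up to 10.4% on ImgEdit and 35.1% on GEdit (general editing benchmarks), and by up to 150.0% on RISE and 26.5% on KRIS-Bench (knowledge-infused benchmarks).

Insight: The novelty lies in a fully open-source, multi-agent hierarchical data generation framework that uses knowledge infusion, adaptive instruction synthesis, and quality verification to build datasets at scale and low cost with quality approaching commercial-grade data. The multi-agent collaboration and task-aware verification are system design ideas worth borrowing.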

Abstract: Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.


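[57] A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation cs.CV | cs.AI

Ling Xiao, Toshihiko Yamasaki

TL;DR: This paper proposes MCL-FIR, a multihead continual learning framework for class-incremental learning in fine-grained fashion image retrieval (FIR). It combines contrastive learning with exponential moving average (EMA) distillation: a multi-head design accommodates new classes, triplet inputs are reformulated as doublets trained with the InfoNCE loss to simplify training, and EMA distillation provides efficient knowledge transfer.

Details

Motivation: Most fine-grained FIR methods assume a static setting and need full retraining when new attributes appear, which is costly and unsuitable for dynamic scenarios. Pretrained models support zero-shot inference, but their accuracy drops without supervision, and no prior work explores class-incremental learning for fine-grained FIR.

Result: Experiments on four datasets show that, beyond its scalability, MCL-FIR strikes a good balance between efficiency and accuracy. It significantly outperforms CIL baselines at similar training cost and delivers performance comparable to static methods while using only about 30% of the training cost.

Insight: The contributions include bringing continual learning to fine-grained FIR, handling incremental classes with a multi-head architecture and EMA distillation, and reformulating the triplet loss as doublet contrastive learning with InfoNCE, which simplifies training and improves results. Together these give an efficient, scalable solution for fine-grained retrieval in dynamic environments.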

Abstract: Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available at https://github.com/Dr-LingXiao/MCL-FIR.
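Two building blocks named in the abstract, InfoNCE over doublets and EMA updates, have standard textbook forms that can be sketched briefly. The in-batch negatives, the temperature, the decay value, and the function names are assumptions. How MCL-FIR actually forms doublets and which parameters its EMA teacher tracks is not stated in the abstract.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over (anchor, positive) doublets.

    anchors[i] is paired with positives[i]; the other positives in the
    batch act as negatives (an assumed, commonly used setup).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the match
    return total / len(anchors)


def ema_update(teacher, student, decay=0.99):
    """Move teacher parameters a small step toward the student's."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy batch of unit-length embeddings.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(toy_anchors, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce(toy_anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each anchor matches its own positive and large when the pairs are swapped. A doublet needs no explicitly mined negative, which is one reason such a formulation can be simpler to train than a triplet loss.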


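[58] Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models cs.CV | cs.AI

Yifan Yang, Lei Zou, Wendy Jepson

TL;DR: This paper proposes synthesizing post-disaster street views from satellite imagery to address the lack of ground-level data after disasters. It introduces two generative strategies, a vision-language model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) approach, and compares them with general-purpose baselines (Pix2Pix and ControlNet) using a structure-aware evaluation framework. Experiments on 300 disaster scenarios reveal a trade-off between realism and fidelity in generative models.

Details

Motivation: After a natural disaster, rapid access to ground-level information is critical for assessing specific structural damage, but satellite imagery lacks ground-level detail and street-view data is hard to obtain during time-sensitive events. The study aims to bridge this data gap by synthesizing post-disaster street views from satellite imagery.

Result: Experiments use the proposed structure-aware evaluation framework, which comprises pixel-level quality assessment, ResNet-based semantic consistency verification, and a novel VLM-as-a-Judge for perceptual alignment. Quantitatively, standard ControlNet achieves the highest semantic accuracy (0.71), whereas the VLM-enhanced and MoE models do better on textural plausibility but fall short on semantic clarity.

Insight: The novelty lies in two generation strategies tailored to post-disaster scenes (VLM-guided and damage-sensitive MoE) and a multi-tier evaluation framework. The paper stresses that visually realistic generations may fail to preserve the critical structural information needed for disaster assessment, and it provides a baseline for trustworthy cross-view synthesis.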

Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism–fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.


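[59] Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs cs.CV | cs.CL

Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu

TL;DR: This paper proposes CogAlign, a new clinical cognition alignment framework that addresses two key limitations of multimodal large language models in gastrointestinal endoscopy diagnosis: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. The framework applies supervised fine-tuning on a hierarchical clinical cognition dataset and adds a counterfactual-driven reinforcement learning strategy for causal rectification, improving diagnostic accuracy in complex clinical scenarios.

Details

Motivation: The work targets two key problems in applying multimodal large language models to gastrointestinal endoscopy diagnosis: model reasoning is misaligned with clinical cognitive pathways, and visual features lack a causal association with diagnostic outcomes. Both hinder effective use in medical image analysis.

Result: The method achieves state-of-the-art performance on multiple benchmarks and significantly improves diagnostic accuracy in complex clinical scenarios.

Insight: The contributions include internalizing experts' hierarchical diagnostic logic into the model through a hierarchical clinical cognition dataset and SFT; a theoretical analysis arguing that standard supervised fine-tuning leads to spurious background correlations; and a reinforcement learning strategy based on counterfactual sample generation and clinical-cognition-centric rewards, which forces the model to ground its diagnosis strictly in causal lesion features and thereby removes visual bias.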

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.


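[60] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation cs.CV

Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang

TL;DR: This paper proposes Premier, a framework that uses learnable user embeddings and a preference adapter to modulate personalized preferences in text-to-image generation, addressing the difficulty existing methods have in capturing nuanced user preferences.

Details

Motivation: Existing methods rely on multimodal large language models to infer user preferences, but the resulting prompts or latent codes rarely reflect those preferences faithfully, leading to poor personalization.

Result: Under the same history length, Premier outperforms prior methods on preference alignment, text consistency, ViPer proxy metrics, and expert evaluations, achieving stronger personalization.

Insight: The contributions include representing user preferences as learnable embeddings, introducing a preference adapter that fuses the user embedding with the text prompt, and using a dispersion loss to make user embeddings more distinct. New users are represented as linear combinations of existing preference embeddings, which enables effective generalization and improves the accuracy of fine-grained preference control.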

Abstract: Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user’s preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.


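[61] Weakly supervised multimodal segmentation of acoustic borehole images with depth-aware cross-attention cs.CV | cs.AI | physics.geo-ph

Jose Luis Lima de Jesus Silva

TL;DR: This paper proposes a weakly supervised multimodal segmentation framework for acoustic borehole images. It combines two-dimensional image texture with depth-aligned one-dimensional well logs and refines threshold-guided pseudo-labels through learned models. The framework keeps the annotation-free character of classical thresholding and clustering workflows and extends them with denoising, confidence-aware pseudo-supervision, and physically structured fusion.

Details

Motivation: Large-scale interpretation of acoustic borehole images is difficult because dense expert annotations are hard to obtain and subsurface information is intrinsically multimodal, which calls for weakly supervised methods that combine two-dimensional images with one-dimensional well logs.

Result: Against the weak supervisory reference, the confidence-gated depth-aware cross-attention (CG-DCA) model consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Its performance is broadly stable across wells, and targeted ablations show that its advantage depends on confidence-aware fusion and structured local depth interaction.

Insight: The contributions include threshold-guided learned refinement and multimodal fusion strategies such as depth-aware cross-attention, gated fusion, and confidence-aware modulation. The results show that multimodal improvement is greatest when auxiliary well logs are incorporated selectively and in a depth-aware manner.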

Abstract: Acoustic borehole images provide high-resolution borehole-wall structure, but large-scale interpretation remains difficult because dense expert annotations are rarely available and subsurface information is intrinsically multimodal. The challenge is developing weakly supervised methods combining two-dimensional image texture with depth-aligned one-dimensional well-logs. Here, we introduce a weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. This preserves the annotation-free character of classical thresholding and clustering workflows while extending them with denoising, confidence-aware pseudo-supervision, and physically structured fusion. We establish that threshold-guided learned refinement provides the most robust improvement over raw thresholding, denoised thresholding, and latent clustering baselines. Multimodal performance depends strongly on fusion strategy: direct concatenation provides limited gains, whereas depth-aware cross-attention, gated fusion, and confidence-aware modulation substantially improve agreement with the weak supervisory reference. The strongest model, confidence-gated depth-aware cross-attention (CG-DCA), consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Targeted ablations show its advantage depends specifically on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm this performance is broadly stable. These results establish a practical, scalable framework for annotation-free segmentation, showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and depth-aware.


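[62] VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation cs.CV

Jun Du

TL;DR: This paper proposes VSD-MOT, an end-to-end multi-object tracking framework that addresses the drop in tracking performance caused by information loss in low-quality video. Guided by visual semantic distillation, it uses the CLIP image encoder to extract global visual semantic information and applies knowledge distillation and a dynamic weight regulation module to make tracking on low-quality video more robust while preserving efficiency.

Details

Motivation: Existing multi-object tracking algorithms degrade significantly on real-world low-quality video, mainly because they cannot handle the information loss in low-quality images. The paper addresses this challenge.

Result: Extensive experiments show that the method is effective and superior in real-world low-quality video scenarios while maintaining good performance in conventional scenarios.

Insight: The contributions include: 1) using semantic information from a vision-language model (CLIP) to compensate for information loss in low-quality images; 2) a Dual-Constraint Semantic Distillation method that transfers semantic extraction ability to an efficient tracking model through a teacher-student framework; 3) a Dynamic Semantic Weight Regulation module that adaptively allocates fusion weights based on real-time frame quality assessment to cope with changing video quality.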

Abstract: Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms’ inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.


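[63] Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding cs.CV

Jincen Jiang, Qianyu Zhou, Yuhang Li, Kui Su, Meili Wang

TL;DR: This paper proposes SADG, a Mamba-based structure-aware domain generalization framework for multi-task point cloud understanding. It generates transformation-invariant sequences through structure-aware serialization, stabilizes cross-domain reasoning with hierarchical domain-aware modeling, and introduces a lightweight spectral graph alignment for structure-preserving feature shifting at test time. Experiments show that it improves structural fidelity and outperforms state-of-the-art methods on tasks including reconstruction, denoising, and registration.

Details

Motivation: Existing Transformer and Mamba architectures are usually designed for a single task or a single domain, and their performance degrades when applied directly to multi-task domain generalization. Transformers model global dependencies but are computationally expensive and lack explicit structural ordering. Mamba has linear complexity but depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions and causes structural drift and unstable sequence modeling.

Result: Comprehensive experiments on multiple tasks, including reconstruction, denoising, and registration, show that the method improves structural fidelity and consistently outperforms state-of-the-art methods.

Insight: The contributions include: 1) structure-aware serialization (SAS), which generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity; 2) hierarchical domain-aware modeling (HDM), which stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations; 3) lightweight spectral graph alignment (SGA), which shifts target features toward source prototypes in the spectral domain at test time without updating model parameters, giving structure-preserving feature shifting; 4) MP3DObject, a real-scan object dataset for evaluating multi-task domain generalization.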

Abstract: While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.


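[64] CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration cs.CV

Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang

TL;DR: This paper proposes Cross-Timestep Self-Calibration (CTCal), a new method for the difficulty of precisely aligning text prompts with generated images in text-to-image diffusion models. It uses the reliable text-image alignment (i.e., cross-attention maps) formed at less noisy timesteps to calibrate representation learning at noisier timesteps, providing explicit supervision during training.

Details

Motivation: Existing diffusion models struggle with fine-grained text-image alignment, mainly because the conventional diffusion loss provides only implicit supervision, so text prompts and generated images are not aligned precisely.

Result: Extensive experiments on the T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of CTCal, which integrates seamlessly into existing text-to-image diffusion models such as SD 2.1 and SD 3.

Insight: The core novelty is the observation that text-image alignment in diffusion models becomes harder as the timestep increases, which motivates a cross-timestep self-calibration mechanism that uses reliable alignment from smaller, less noisy timesteps to calibrate representations at larger, noisier timesteps, plus a timestep-aware adaptive weighting that balances CTCal against the diffusion loss.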

Abstract: Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.


[65] Smart Operation Theatre: An AI-based System for Surgical Gauze Counting cs.CVPDF

Saraf Krish, Cai Yiyu, Huang Li Hui

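TL;DR: This paper presents an AI-based surgical gauze counting system that uses real-time video monitoring and YOLOv5 object recognition to automatically track the gauzes used during surgery, keeping the count accurate and preventing gauzes from being left inside the patient.

Details

Motivation: Gauzes left inside a patient during surgery (gossypiboma) cause serious complications and lead to medical disputes. Existing manual counting is time-consuming and diverts nursing resources, so a more efficient automated prevention method is needed.

Result: The integrated model, which detects both humans and gauzes, improves on the previous iteration: the training set grew from 2,800 to 11,000 images, accuracy improved, and the frame rate rose from 8 FPS to 15 FPS. Following doctors’ feedback, the system also supports manual count adjustments to improve reliability.

Insight: The contributions include replacing multiple models with a single integrated model for human and gauze detection, which improves efficiency and accuracy; applying real-time video and object recognition in actual surgical settings; and improving the system’s practicality through data augmentation and clinical feedback.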

Abstract: During surgeries, there is a risk of medical gauzes being left inside patients’ bodies, a condition known as “gossypiboma” that can cause serious complications for patients and expose hospitals to malpractice lawsuits and regulatory penalties. Diagnosis depends on imaging methods such as X-rays or CT scans, and the usual treatment involves surgical excision. Prevention methods, such as manual counts and RFID-integrated gauzes, aim to minimize gossypiboma risks. However, manual tallying of hundreds of gauzes by nurses is time-consuming and diverts resources from patient care. In partnership with Singapore General Hospital (SGH), we have developed a new prevention method: an AI-based system for gauze counting in surgical settings. Utilizing real-time video surveillance and object recognition technology powered by YOLOv5, a deep learning model was designed to monitor gauzes on two designated trays labelled “In” and “Out”. Gauzes are tracked in the “In” tray prior to their use in the patient’s body and in the “Out” tray post-use, ensuring accurate counting and verifying that no gauze remains inside the patient at the end of the surgery. We trained the model on numerous images from operation theatres and augmented them to cover as many scenarios as possible. This study also addresses the shortcomings of previous project iterations. Previously, the project employed two models, one for human detection and another for gauze detection, trained on a total of 2,800 images. We now have an integrated model capable of identifying both humans and gauzes, using a training set of 11,000 images. This has improved accuracy and increased the frame rate from 8 FPS to 15 FPS. Incorporating doctors’ feedback, the system now also supports manual count adjustments, enhancing its reliability in actual surgeries.
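The In/Out tray bookkeeping described in the abstract can be illustrated with a small reconciliation sketch. This is not the authors’ implementation: the `GauzeTally` class, its field names, and the convention that a manual adjustment is a signed correction to the unaccounted count are all assumptions made for illustration, and the per-tray counts are assumed to come from an upstream detector.

```python
from dataclasses import dataclass


@dataclass
class GauzeTally:
    """Hypothetical bookkeeping for the "In" and "Out" trays."""
    initial_in: int             # gauzes on the "In" tray at the start of surgery
    current_in: int             # gauzes currently detected on the "In" tray
    current_out: int            # gauzes currently detected on the "Out" tray
    manual_adjustment: int = 0  # signed correction entered by staff (assumption)

    @property
    def taken(self) -> int:
        # Gauzes that have left the "In" tray and may be inside the patient.
        return self.initial_in - self.current_in

    @property
    def unaccounted(self) -> int:
        # Gauzes taken but not yet seen on the "Out" tray, after manual correction.
        return self.taken - self.current_out + self.manual_adjustment

    def safe_to_close(self) -> bool:
        # Every gauze that left the "In" tray has been accounted for.
        return self.unaccounted == 0


tally = GauzeTally(initial_in=20, current_in=12, current_out=7)
print(tally.unaccounted, tally.safe_to_close())
```

In this sketch a non-zero `unaccounted` value would be the signal to recount before closing; how the real system surfaces that signal is not stated in the abstract.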


[66] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization cs.CVPDF

Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan

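TL;DR: PiLoT is a unified framework for UAV ego-localization and target geo-localization. By registering the live video stream directly against a geo-referenced 3D map, it removes the dependence on GNSS and expensive active sensors. The framework achieves robust, accurate, real-time performance through a dual-thread engine, a large-scale synthetic dataset, and a joint neural-guided stochastic-gradient optimizer.

Details

Motivation: Conventional UAV localization methods fail easily in GNSS-denied environments, and their reliance on expensive, complex active sensors such as laser rangefinders raises hardware cost and system complexity.

Result: On a comprehensive set of public and newly collected benchmarks, PiLoT outperforms state-of-the-art (SOTA) methods while running in real time at over 25 FPS on the NVIDIA Jetson Orin platform.

Insight: The contributions are: 1) a unified paradigm of neural pixel-to-3D registration that registers the live video stream directly against a geo-referenced 3D map; 2) a dual-thread engine that decouples map rendering from the core localization thread, combining low latency with drift-free accuracy; 3) a lightweight network trained on a large-scale synthetic dataset that generalizes zero-shot from simulation to real data; 4) a joint neural-guided stochastic-gradient optimizer that converges robustly under aggressive motion.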

Abstract: We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps), which enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: https://github.com/Choyaa/PiLoT.


[67] ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking cs.CVPDF

Kanglong Fan, Tianhe Wu, Wen Wen, Jianzhao Liu, Le Yang

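TL;DR: ME-IQA is a plug-and-play, test-time memory-enhanced re-ranking framework that addresses the discrete collapse of scores produced by reasoning-induced vision-language models (VLMs) in image quality assessment (IQA). The method builds a memory bank, retrieves semantically and perceptually aligned neighbors using reasoning summaries, reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities, fuses this ordinal evidence with the initial score under Thurstone’s Case V model, and finally applies gated reflection and memory consolidation to produce denser, more distortion-sensitive predictions.

Details

Motivation: The scalar scores that reasoning-induced VLMs produce for image quality assessment lack sensitivity and tend to collapse to a few discrete values (discrete collapse).

Result: Experiments on multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.

Insight: The contribution is the combination of test-time re-ranking with a memory-enhancement mechanism: retrieving aligned neighbors, comparing probabilistically, and fusing the evidence refines the VLM’s initial output, which mitigates discrete collapse and improves the discriminative power of the predictions. Viewed objectively, pairing a classical psychometric model (the Thurstone model) with memory-based retrieval offers a reusable framework for making VLMs more robust on IQA tasks.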

Abstract: Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuse this ordinal evidence with the initial score under Thurstone’s Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
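Step (ii) of the abstract can be made concrete with a minimal sketch of Thurstone Case V fusion. Under Case V, the probability that item i is preferred over item j is Φ((s_i − s_j) / scale), so a pairwise probability against a neighbor with a known score implies a score for the query. The function name `fuse_score`, the equal per-neighbor weighting, the `prior_weight` parameter, and the `scale` parameter (which absorbs the σ√2 factor) are assumptions for illustration, not the paper’s actual fusion rule.

```python
from statistics import NormalDist


def fuse_score(initial_score, neighbors, prior_weight=1.0, scale=1.0, eps=1e-6):
    """Fuse an initial scalar score with pairwise evidence (illustrative sketch).

    neighbors: list of (neighbor_score, p_query_better) pairs, where
    p_query_better is the comparator's probability that the query image
    is of higher quality than the neighbor.
    """
    std_normal = NormalDist()  # standard normal, used for the probit transform
    total = prior_weight * initial_score
    weight = prior_weight
    for neighbor_score, p in neighbors:
        # Clamp so the inverse CDF stays finite at p = 0 or p = 1.
        p = min(max(p, eps), 1.0 - eps)
        # Case V: s_query - s_neighbor = scale * Phi^{-1}(p).
        implied_score = neighbor_score + scale * std_normal.inv_cdf(p)
        total += implied_score
        weight += 1.0
    return total / weight


# A query initially scored 3.0, judged likely better than two neighbors.
print(fuse_score(3.0, [(3.0, 0.9), (3.5, 0.7)]))
```

Because the implied scores are continuous, even a VLM that emits only a handful of distinct initial scores yields a denser set of fused values, which is the intuition behind mitigating discrete collapse.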


[68] Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation cs.CV | cs.ROPDF

Qunchao Jin, Yiliao Song, Qi Wu

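TL;DR: This paper proposes Co-VLN, a framework for studying whether multiple agents in a vision-language navigation task can navigate better by sharing their observations at locations they have both traversed. The framework is model-agnostic and is validated on the R2R benchmark under two paradigms, DUET and MapGPT.

Details

Motivation: Agents in vision-language navigation are limited by partial observability, since they accumulate knowledge only from locations they have visited themselves. The paper explores whether multiple agents navigating a shared environment at the same time can benefit from each other’s observations.

Result: On the R2R benchmark, models under both the learning-based paradigm (DUET) and the zero-shot paradigm (MapGPT) improve substantially once vision sharing is enabled.

Insight: The contribution is a minimalist, model-agnostic collaboration framework in which agents exchange structured perceptual memory when they identify commonly traversed locations. This expands each agent’s receptive field at no additional exploration cost and lays a foundation for future research on collaborative embodied navigation.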

Abstract: Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other’s observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent’s receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that vision-sharing-enabled models yield substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.


[69] Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification cs.CVPDF

Yifeng Zheng

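TL;DR: This paper proposes Clifford-M, a lightweight backbone for multi-label fundus image classification. The model implements sparse geometric interaction through a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture.

Details

Motivation: Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Existing multi-scale medical vision models usually address this through explicit frequency decomposition, but ablation studies show that these heuristics bring limited benefit in this setting and can add computational overhead without improving accuracy.

Result: On ODIR-5K, Clifford-M reaches a mean AUC-ROC of 0.8142 and a mean macro-F1 of 0.5481 without pre-training and with only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. Evaluated on RFMiD without fine-tuning, it attains a macro AUC of 0.7425 ± 0.0198 and a micro AUC of 0.7610 ± 0.0344, indicating reasonable robustness to cross-dataset shift.

Insight: The contribution is to replace the conventional explicit frequency-decomposition and feed-forward expansion modules with sparse geometric interaction (the Clifford rolling product), capturing multi-scale structure directly and reaching competitive performance with fewer parameters and less computation. This reflects a “less is more” design principle in semantic space: a carefully designed core feature interaction, rather than elaborate frequency engineering, is enough for efficient and accurate fundus diagnosis.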

Abstract: Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 ± 0.0198 macro AUC and 0.7610 ± 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.


[70] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models cs.CV | cs.LGPDF

Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan

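TL;DR: This paper reveals that multimodal large language models (MLLMs) suffer visual representation degradation during language-driven training: the visual representations in the model’s intermediate layers degrade in both global function and local structure. The authors attribute this to a visual sacrifice caused by the single text-generation objective and propose Predictive Regularization (PRe), which forces the degraded intermediate features to predict the initial visual features so that the model’s internal representations keep their inherent visual attributes. Extensive experiments show that mitigating this degradation improves performance on vision-language tasks.

Details

Motivation: MLLMs perform well on vision-language tasks, but the effect of their language-driven training on internal visual foundational competence is unclear. The paper aims to diagnose and address the pervasive visual representation degradation in MLLMs so that the model retains both strong cross-modal reasoning and core visual competence.

Result: Extensive experiments confirm that mitigating visual degradation with the proposed predictive regularization improves MLLM performance on vision-language tasks, underscoring the importance of robust internal visual representations for comprehensive multimodal understanding.

Insight: The contribution is the first systematic diagnosis of visual representation degradation in MLLM training, together with a simple and effective predictive regularization (PRe) that constrains intermediate features to preserve visual fidelity. Viewed objectively, the method offers a new regularization perspective for balancing the text-generation objective against visual foundational competence, and may be useful for making multimodal understanding more robust.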

Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representation in the middle layers of the LLM exhibits degradation in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM’s internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.


[71] EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis cs.CVPDF

Xiefan Guo, Xinzhu Ma, Haoxiang Ma, Zihao Zhou, Di Huang

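TL;DR: This paper proposes EruDiff, which refactors the knowledge structure within diffusion models to address the counterfactual images that existing text-to-image diffusion models generate for implicit prompts requiring deep world knowledge. The method uses Diffusion Knowledge Distribution Matching (DK-DM) to align the knowledge distribution of implicit prompts with explicit anchors and applies a Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. It significantly improves mainstream models such as FLUX and Qwen-Image on the scientific knowledge benchmark (Science-T2I) and the world knowledge benchmark (WISE).

Details

Motivation: When handling implicit prompts that depend on deep world knowledge (such as natural sciences and cultural commonsense), existing text-to-image diffusion models generate counterfactual images because their underlying knowledge structures are dislocated. The paper aims to address this limitation.

Result: Rigorous empirical evaluation on the scientific knowledge benchmark (Science-T2I) and the world knowledge benchmark (WISE) shows that the method significantly improves leading diffusion models such as FLUX and Qwen-Image, demonstrating its effectiveness and generalizability.

Insight: The contributions are Diffusion Knowledge Distribution Matching (DK-DM), which aligns the knowledge distributions of implicit and explicit prompts, and the Negative-Only Reinforcement Learning (NO-RL) strategy, which corrects the inherent biases in explicit prompt rendering. Together they offer a new approach to improving how diffusion models understand and render complex world knowledge.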

Abstract: Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring the effectiveness and generalizability. Our code is available at https://github.com/xiefan-guo/erudiff.


[72] MERIT: Multi-domain Efficient RAW Image Translation cs.CV | cs.AIPDF

Wenjun Huang, Shenghao Fu, Yian Jin, Yang Ni, Ziteng Cui

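TL;DR: MERIT is the first unified framework for multi-domain RAW image translation, using a single model to translate images between arbitrary camera domains. It proposes a sensor-aware noise modeling loss and a multi-scale large kernel attention module to handle noise discrepancies and feature modeling across camera sensors, and validates its performance on the newly built MDRAW dataset.

Details

Motivation: RAW images captured by different camera sensors exhibit substantial domain shifts due to differences in spectral response, noise characteristics, and tone behavior, which hinders their direct use in downstream computer vision tasks. Existing methods usually train a dedicated RAW-to-RAW translator for each source-target pair, an approach that does not scale to real-world scenarios involving many commercial cameras.

Result: Extensive experiments on MDRAW, the first dataset designed for multi-domain RAW image translation, with paired and unpaired RAW images from five different camera sensors, show that MERIT improves quality by 5.56 dB over prior models and reduces training iterations by 80%.

Insight: The contributions are: a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of generated images with those of the target domain; a conditional multi-scale large kernel attention module that strengthens context-aware and sensor-aware feature modeling; and the MDRAW dataset, which provides a standardized evaluation benchmark for multi-domain RAW image translation.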

Abstract: RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).


[73] Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking cs.CV | cs.AI | cs.HC | cs.LGPDF

Yujin Park, Haejun Chung, Ikbeom Jang

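TL;DR: Dodgersort is an uncertainty-aware, VLM-guided, human-in-the-loop pairwise ranking method. It reduces the number of human comparisons and improves ranking reliability through CLIP-based hierarchical pre-ordering, a neural ranking head with a probabilistic ensemble, epistemic-aleatoric uncertainty decomposition, and information-theoretic pair selection. On visual ranking tasks in medical imaging, historical dating, and aesthetics, it reduces annotation by 11-16% while improving inter-rater reliability.

Details

Motivation: Exhaustive pairwise comparison labeling has a prohibitive quadratic cost. The paper aims to reduce that cost while improving the reliability of the resulting rankings.

Result: On the FG-NET dataset (which has ground-truth age labels), the framework extracts 5-20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy-efficiency trade-offs. Cross-domain ablations on four datasets show that neural adaptation and ensemble uncertainty are the key sources of the gain.

Insight: The contributions are hierarchical pre-ordering with a vision-language model (CLIP) to reduce the number of comparisons, and information-theoretic pair selection driven by a probabilistic ensemble and uncertainty decomposition, which together give efficient and reliable human-in-the-loop ranking.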

Abstract: Pairwise comparison labeling is gaining traction because it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic–aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11–16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. On FG-NET with ground-truth ages, the framework extracts 5–20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy–efficiency trade-offs.
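The idea of information-theoretic pair selection can be illustrated with a deliberately simplified sketch: using only Elo ratings (one member of the ensemble named in the abstract), ask the human about the pair whose predicted outcome has the highest binary entropy, i.e. the comparison the model is least sure about. The function names, the 400-point Elo scale, and the use of plain outcome entropy are assumptions for illustration; the paper’s actual criterion, ensemble, and epistemic–aleatoric decomposition are not reproduced here.

```python
import math
from itertools import combinations


def elo_win_probability(rating_i, rating_j):
    """Standard Elo expected score of item i against item j."""
    return 1.0 / (1.0 + 10 ** ((rating_j - rating_i) / 400.0))


def binary_entropy(p):
    """Entropy in bits of a Bernoulli outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_informative_pair(ratings):
    """Return the pair of item ids whose predicted outcome is most uncertain.

    ratings: dict mapping item id -> current Elo rating.
    """
    def outcome_entropy(pair):
        i, j = pair
        return binary_entropy(elo_win_probability(ratings[i], ratings[j]))

    # Sorting the ids makes tie-breaking deterministic.
    return max(combinations(sorted(ratings), 2), key=outcome_entropy)


ratings = {"a": 1500.0, "b": 1510.0, "c": 1900.0}
print(most_informative_pair(ratings))
```

Enumerating all pairs is itself quadratic, so this sketch only shows the selection criterion; a pre-ordering step such as the CLIP-based one in the abstract would be needed to restrict the candidate pairs in practice.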


[74] Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves cs.CV | cs.ROPDF

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani

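TL;DR: This paper proposes Glove2Hand, a framework that translates hand-object interaction videos captured with multi-modal sensing gloves into photorealistic bare-hand interaction videos while faithfully preserving the physical interaction dynamics. The framework uses a novel 3D Gaussian hand model to ensure temporally consistent rendering and a diffusion-based hand restorer to integrate the rendered hand seamlessly into the scene. Using this framework, the authors build HandSense, the first multi-modal hand-object interaction dataset with synchronized tactile and IMU signals. Experiments show that HandSense significantly improves downstream bare-hand applications such as video-based contact estimation and hand tracking under severe occlusion.

Details

Motivation: Conventional hand interaction videos lack essential physical information (such as contact forces and motion signals) and suffer from frequent occlusions, which limits their use in computer vision, robotics, and AR/VR. The paper aims to address this problem.

Result: The HandSense dataset synthesized with the Glove2Hand framework significantly improves downstream tasks such as video-based contact estimation and hand tracking under severe occlusion.

Insight: The main contributions are: 1) a framework that translates multi-modal glove data into photorealistic bare-hand videos; 2) a novel 3D Gaussian hand model that ensures temporal consistency; 3) a diffusion model that handles complex interactions and non-rigid deformations; 4) the first multi-modal hand-object interaction dataset with synchronized tactile and IMU signals, a valuable resource for related research.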

Abstract: Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.


[75] Restoring Neural Network Plasticity for Faster Transfer Learning cs.CV | cs.AIPDF

Xander Coetzer, Arné Schreuder, Anna Sergeevna Bosman

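TL;DR: This paper proposes restoring neural network plasticity before transfer learning through a targeted weight re-initialization strategy, addressing the difficulty pretrained models have in adapting to downstream tasks when saturated weights yield insignificant gradients during fine-tuning. The method applies to both convolutional neural networks and vision transformers, converges faster and reaches higher test accuracy on several image classification benchmarks, and adds negligible computational overhead.

Details

Motivation: In transfer learning, the weights of ImageNet-pretrained models can become saturated and yield insignificant gradients, so the model adapts poorly to the downstream task, especially when the downstream dataset is atypical. This is known as loss of neural plasticity. The problem has been widely studied in continual learning but remains relatively unexplored in transfer learning, which is the gap the paper addresses.

Result: On several image classification benchmarks, both convolutional neural networks and vision transformers benefit from the method, reaching higher test accuracy with faster convergence. The method is compatible with existing transfer learning pipelines.

Insight: The contribution is a targeted weight re-initialization strategy that restores neural plasticity, helping the model adapt to the downstream task at low computational cost and with easy integration into existing pipelines. Viewed objectively, it is a simple and effective answer to the plasticity problem in transfer learning and may stimulate further research in the area.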

Abstract: Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.


[76] Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning cs.CVPDF

Xu Zhang, Jin Yuan, BinHong Yang, Xuan Liu, Qianjun Zhang

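TL;DR: This paper proposes a new task, Controllable Video Segmentation and Captioning (SegCaptioning), in which users supply prompts such as bounding boxes and the model simultaneously generates masks and captions that precisely match the user’s intent. For this task the authors design the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which captures user intent with a Prompt-guided Temporal Graph Former and uses a Fine-grained Mask-linguistic Decoder to jointly predict high-quality mask-caption pairs with fine-grained alignment between masks and caption tokens.

Details

Motivation: Existing video multimodal interpretation methods focus mainly on global understanding and offer limited user interaction. To improve users’ understanding of and control over video content, a method is needed that generates precisely aligned masks and captions from specific user prompts, such as a bounding box around an object of interest.

Result: Comprehensive experiments on two benchmark datasets show that SG-FSCFormer performs strongly, capturing user intent and generating precise multimodal outputs that follow user specifications.

Insight: The contributions are: 1) the new SegCaptioning task, which emphasizes multimodal generation driven by user intent; 2) the SG-FSCFormer framework, which combines a Prompt-guided Temporal Graph Former with a Fine-grained Mask-linguistic Decoder and captures intent and fine-grained alignment through an adaptive prompt adaptor and a Multi-entity Contrastive loss; 3) a fine-grained alignment mechanism between masks and caption tokens, which makes video understanding more precise.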

Abstract: Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users’ understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design an innovative framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user’s requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder that collaboratively predicts high-quality caption-mask pairs using a Multi-entity Contrastive loss and provides fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users’ comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG-FSCFormer.


[77] GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies cs.CVPDF

Uzair Shah, Marco Agus, Mahmoud Gamal, Mahmood Alzubaidi, Corrado Cali

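TL;DR: This paper proposes GraPHFormer, a multimodal architecture that unifies topological and graph-structure analysis through CLIP-style contrastive learning for the study of neuronal morphology. It combines persistence images with a TreeLSTM encoder and achieves state-of-the-art performance on five of six benchmarks.

Details

Motivation: Existing methods analyze the topology or the graph structure of neuronal morphology in isolation and cannot exploit their complementary information. A method that unifies the two views is needed to encode information about circuit function, development, and disease more completely.

Result: Across six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG), GraPHFormer reaches state-of-the-art performance on five, significantly outperforming topology-only, graph-only, and morphometrics baselines.

Insight: The contributions are the multimodal fusion of topological and graph features, a CLIP-style contrastive learning framework, a three-channel persistence image encoding, and persistence-space transformations that preserve topological semantics, which together offer a new approach to multimodal representation learning.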

Abstract: Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: https://github.com/Uzshah/GraPHFormer
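The symmetric InfoNCE objective named in the abstract is the standard CLIP-style contrastive loss, and a dependency-free sketch may help make it concrete. Here `image_emb` and `graph_emb` stand in for the outputs of the two branches for the same batch of neurons (row i of each describes the same neuron); the function names and the temperature value of 0.07 are assumptions for illustration, not values from the paper.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        loss += log_partition - row[i]
    return loss / len(logits)


def symmetric_info_nce(image_emb, graph_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE over two batches of paired embeddings."""
    a = [_normalize(v) for v in image_emb]
    b = [_normalize(v) for v in graph_emb]
    # Cosine similarity of every image-branch row with every graph-branch row.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_transposed = [list(column) for column in zip(*logits)]
    # Average the image-to-graph and graph-to-image directions.
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_transposed))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when each embedding is closest to its own partner and large when partners are swapped, which is what pulls the two views of the same neuron together in the shared space.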


[78] Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models cs.CVPDF

Binesh Sadanandan, Vahid Behzadan

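TL;DR: This paper exposes the flaw in using consistency as a proxy for reliability in medical vision-language models (VLMs): a model can be perfectly consistent by relying on text patterns rather than the input image, which hides the underlying risk. The authors propose a per-sample four-quadrant safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed), classifying samples as Ideal, Fragile, Dangerous, or Worst. Evaluating five medical VLM configurations on two chest X-ray datasets, MIMIC-CXR and PadChest, they find that LoRA fine-tuning sharply reduces flip rates but pushes most samples into the Dangerous quadrant; these samples have high accuracy and low entropy and are hard to catch with standard confidence-based screening.

Details

Motivation: When deploying medical VLMs, using consistency alone (semantically equivalent prompts yield the same prediction) as a proxy for reliability is limited, because a model may achieve consistency through text patterns rather than image content, hiding safety risks.

Result: Evaluating five medical VLM configurations on MIMIC-CXR and PadChest, the authors find that flip rates drop sharply after LoRA fine-tuning (for example, 1.5% for LLaVA-Rad Base on PadChest), yet 98.5% of its samples are classified as Dangerous. Dangerous samples reach up to 99.6% accuracy with low entropy, so standard confidence-based screening cannot identify them. Flip rate and the Dangerous fraction are negatively correlated (r = -0.89).

Insight: The contribution is the per-sample four-quadrant safety taxonomy, which jointly evaluates consistency and image reliance and exposes the false reliability trap of models that depend on text patterns rather than image content. The authors recommend that deployment evaluations always pair consistency checks with a text-only baseline (a single additional forward pass) to identify the hidden risk.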

Abstract: Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
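The four-quadrant taxonomy maps directly onto a small decision function, sketched below. The quadrant names and their definitions come from the abstract; the function name, the argument names, and the choice to treat a sample as consistent only when every paraphrase yields the same prediction are assumptions made for illustration.

```python
from typing import Sequence


def classify_sample(paraphrase_preds: Sequence[str],
                    pred_with_image: str,
                    pred_text_only: str) -> str:
    """Assign one sample to a quadrant of the safety taxonomy.

    paraphrase_preds: predictions for semantically equivalent prompts.
    pred_with_image:  prediction for the original prompt with the image.
    pred_text_only:   prediction for the same prompt with the image removed.
    """
    # Consistent: all paraphrased prompts yield the same prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Image-reliant: removing the image changes the prediction.
    image_reliant = pred_with_image != pred_text_only

    if consistent and image_reliant:
        return "Ideal"
    if not consistent and image_reliant:
        return "Fragile"
    if consistent and not image_reliant:
        return "Dangerous"
    return "Worst"


# Stable across paraphrases, yet unchanged when the image is removed.
print(classify_sample(["effusion", "effusion", "effusion"], "effusion", "effusion"))
```

The text-only prediction is the single additional forward pass the abstract recommends: without it, the Dangerous and Ideal quadrants are indistinguishable by consistency alone.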


[79] SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis cs.CVPDF

Zhixiang Lu, Shijie Xu, Kaicheng Yan, Xuyue Cai, Chong Zhang

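TL;DR: This paper proposes SkinCLIP-VL, a resource-efficient, trustworthy vision-language learning framework for multimodal skin cancer diagnosis. The framework follows a frozen perception, adaptive reasoning paradigm, combining a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL model through low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, the paper proposes the Consistency-aware Focal Alignment (CFA) loss. On the ISIC and Derm7pt benchmarks the method outperforms large baselines with fewer parameters, and expert evaluation supports its interpretability advantage.

Details

Motivation: Deploying vision-language models (VLMs) in dermatology faces a trilemma of high computational cost, extreme data scarcity, and the black-box nature of deep learning. The paper aims to build a resource-efficient and trustworthy framework for skin cancer diagnosis.

Result: On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Blinded expert evaluation and out-of-distribution testing confirm that its visually grounded rationales increase clinical trust significantly more than traditional saliency maps.

Insight: The main contributions are: 1) the “frozen perception, adaptive reasoning” paradigm, which pairs a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL model for resource efficiency; 2) the Consistency-aware Focal Alignment (CFA) loss, which combines focal re-weighting, semantic alignment, and calibration to achieve strict semantic alignment under long-tailed distributions; 3) a focus not only on performance but also on building clinical trust through interpretable visual rationales, a key direction for medical AI applications.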

Abstract: The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.


[80] SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments cs.CV | cs.AIPDF

Wen Jiang, Kangyao Huang, Li Wang, Wang Xu, Wei Fan

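TL;DR: This paper proposes SpatialFly, a geometry-guided spatial representation framework for UAV vision-and-language navigation. Through a geometric prior injection module and a geometry-aware reparameterization module, the framework aligns 2D visual perception with the 3D trajectory decision space, addressing the structural representation mismatch in complex 3D environments and improving spatial reasoning.

Details

Motivation: UAV vision-and-language navigation in complex 3D environments is challenging. The main difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning.

Result: Experiments show that SpatialFly outperforms state-of-the-art UAV VLN baselines in both seen and unseen environments. On the unseen Full split, it reduces navigation error by 4.03 m and improves success rate by 1.27%. Trajectory-level analysis shows that its trajectories have better path alignment and smoother, more stable motion.

Insight: The contribution is a geometry-guided 2D representation alignment mechanism that requires no explicit 3D reconstruction. It injects global structural cues, aligns 2D and 3D representations through cross-modal attention, and uses gated residual fusion to preserve semantic discrimination, bridging the gap between 2D perception and 3D decision-making.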

Abstract: UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV VLN in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing NE by 4.03m and improving SR by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.


[81] When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound cs.CVPDF

Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam, Moein Heidari, Mojan Izadkhah

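TL;DR: This paper proposes a scalable evaluation framework that uses a large language model (LLM) to generate clinically plausible adversarial prompt variants, in order to assess the adversarial robustness of medical vision-language models (Med-VLMs) on ultrasound multiple-choice question answering. It reveals realistic robustness gaps that must be addressed for safe clinical translation.

Details

Motivation: Med-VLMs operate via natural-language instructions, so prompt formulation is a realistic and exploitable point of vulnerability: small linguistic variations can shift outputs substantially. Their adversarial robustness therefore needs to be evaluated to ensure clinical safety.

Result: On ultrasound multiple-choice QA benchmarks, the paper systematically evaluates the vulnerability of SOTA Med-VLMs to such attacks, analyzes how attacker LLM capacity affects attack success, examines the relationship between attack success and model confidence, and identifies consistent failure patterns across models.

Insight: The novelty lies in using an LLM to produce "humanized" rewrites and minimal edits that mimic routine clinical communication. This yields clinically plausible adversarial prompts and a realistic, scalable framework for evaluating Med-VLM adversarial robustness.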

Abstract: Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via “humanized” rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.


[82] A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors cs.CV | cs.AIPDF

Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho, Khanh-Thanh-Khoa Nguyen, Thanh-Hai Le

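TL;DR: This paper proposes a two-stage Transformer framework for localizing the time intervals of distracted driving behaviors in in-cabin video streams. It combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector and uses a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features, aiming to balance detection accuracy and computational efficiency.

Details

Motivation: Existing temporal action localization techniques struggle to balance accuracy with computational efficiency. The paper aims to build an efficient and accurate framework tailored to driver monitoring scenarios such as transportation safety checkpoints or fleet management assessment systems.

Result: At the feature extraction stage, the ViT-Giant backbone reaches 88.09% Top-1 test accuracy, while the ViT-based variant reaches 82.55% at a much lower computational cost (101.85 GFLOPs/segment vs. 1584.06 GFLOPs/segment). In the downstream localization task, adding SPPF consistently improves all configurations: ViT-Giant + SPPF reaches a peak mAP of 92.67%, and the lightweight ViT configuration also maintains robust results.

Insight: The novelty is a two-stage Transformer framework tailored to driver monitoring, with an Augmented Self-Mask Attention detector and an SPPF module for multi-scale temporal features. Viewed objectively, the paper gives an explicit trade-off analysis between model capacity and efficiency and offers configurations ranging from high-accuracy to lightweight, which is of practical value for deployment.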

Abstract: The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers stronger representations, with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
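
The abstract does not describe how SPPF is adapted to temporal features. As a rough sketch, the snippet below applies the SPPF pattern known from YOLOv5 (the same max-pool applied three times in sequence, with all stages concatenated) along a 1D time axis. The function names, the kernel size of 5, and the omission of the surrounding convolutions are assumptions for illustration.

```python
def max_pool1d(seq, k=5):
    """Stride-1 max pooling over time with 'same' output length.

    Windows are simply truncated at the sequence borders.
    """
    pad = k // 2
    n = len(seq)
    return [max(seq[max(0, i - pad): min(n, i + pad + 1)]) for i in range(n)]


def sppf_1d(seq, k=5):
    """Stack the input with three successively pooled copies.

    Each extra pooling pass widens the temporal receptive field
    (5, 9, then 13 steps for k=5), giving multi-scale context per time step.
    Returns one 4-channel feature list per time step.
    """
    y1 = max_pool1d(seq, k)
    y2 = max_pool1d(y1, k)
    y3 = max_pool1d(y2, k)
    return [list(channels) for channels in zip(seq, y1, y2, y3)]


# A single activation spike at t=6 becomes visible further and further away
# in the deeper channels.
features = sppf_1d([0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0])
```

Reusing one small kernel three times covers the same receptive fields as parallel pools of increasing size, which is the efficiency argument usually given for SPPF.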


[83] SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM cs.CVPDF

Pengchong Hu, Zhizhong Han

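TL;DR: This paper proposes SGAD-SLAM, which uses pixel-aligned Gaussians and lets each Gaussian adjust its position along its ray to improve rendering quality. It also models the depth distribution around each pixel as a Gaussian distribution to speed up camera tracking, yielding higher-quality radiance fields and a faster system for RGBD SLAM.

Details

Motivation: Existing 3D Gaussian Splatting (3DGS) methods for RGBD SLAM use Gaussians that are either too flexible or too restricted in movement, which leads to slow convergence or limited rendering quality.

Result: Evaluated on widely used benchmarks, the method outperforms the latest approaches in view rendering, camera tracking, runtime, and storage complexity.

Insight: The novelty is the combination of a position adjustment mechanism for pixel-aligned Gaussians with Gaussian-distribution depth modeling, which preserves rendering quality while improving system scalability and tracking speed.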

Abstract: 3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project .
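
The idea of a pixel-aligned Gaussian that may slide along its ray can be illustrated with a pinhole camera model. This is a sketch under assumptions: the abstract does not state the parameterization, so the `offset` term, the function name, and the use of z-depth are hypothetical.

```python
def gaussian_center(u, v, depth, offset, fx, fy, cx, cy):
    """Camera-frame center of a pixel-aligned Gaussian.

    (u, v)         pixel coordinates
    depth          sensor depth (z-depth) at that pixel
    offset         hypothetical learnable adjustment along the ray
    fx, fy, cx, cy pinhole intrinsics
    """
    # Ray through the pixel, scaled so that its z-component is 1.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Changing the z-depth moves the point along this ray only,
    # so the Gaussian stays tied to its pixel.
    z = depth + offset
    return tuple(z * r for r in ray)


# Same pixel, two different offsets: both centers lie on one ray.
near = gaussian_center(820, 240, 2.0, 0.0, 500.0, 500.0, 320.0, 240.0)
far = gaussian_center(820, 240, 2.0, 0.5, 500.0, 500.0, 320.0, 240.0)
```

Restricting each Gaussian to one degree of freedom in position is what separates this design from free 3D Gaussians (three degrees) and from fully fixed view-tied Gaussians (none).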


[84] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving cs.CVPDF

Haixi Zhang, Aiyinsi Zuo, Zirui Li, Chunshu Wu, Tong Geng

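TL;DR: This paper proposes LRHPerception, a real-time monocular perception system for autonomous driving that interprets the environment from single-camera video. It integrates object tracking and prediction, road segmentation, and depth estimation into a unified framework, outputs a five-channel tensor containing RGB, road segmentation, and pixel-level depth estimation, and runs in real time at 29 FPS on a single GPU.

Details

Motivation: Current camera-based autonomous driving work often prioritizes effectiveness while neglecting computational efficiency. The paper aims to build a real-time monocular perception system that offers both computational efficiency and rich representational detail.

Result: Experiments show that the system reaches real-time processing at 29 FPS on a single GPU, a 555% speedup over the fastest mapping-based approach, with strong performance.

Insight: The novelty is combining the computational efficiency of end-to-end learning with the rich representational detail of local mapping methods, while unifying object tracking and prediction, road segmentation, and depth estimation for efficient real-time monocular perception. Viewed objectively, the unified multi-task framework and the large real-time speedup are engineering optimizations worth borrowing.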

Abstract: Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.


[85] Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting cs.CVPDF

Hwasik Jeong, Seungryong Lee, Gyeongjin Kang, Seungkwon Yang, Xiangyu Sun

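TL;DR: This paper proposes 2Xplat, a pose-free feed-forward 3D Gaussian Splatting framework based on a two-expert design. By explicitly separating geometry estimation from Gaussian generation, it produces high-quality 3D Gaussian representations quickly from uncalibrated multi-view images.

Details

Motivation: Mainstream methods use a unified architecture to jointly estimate camera poses and synthesize 3DGS representations, but entangling geometric reasoning and appearance modeling in a shared representation may be suboptimal for high-fidelity 3DGS generation. This motivates a decoupled design.

Result: In fewer than 5,000 training iterations, the method substantially outperforms prior pose-free feed-forward 3DGS approaches and performs on par with state-of-the-art posed methods.

Insight: The novelty is a modular two-expert design that explicitly separates geometry estimation from appearance synthesis. It challenges the prevailing unified paradigm and suggests that modular design may suit complex 3D geometry estimation and appearance synthesis tasks better.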

Abstract: Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such “all-in-one” designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.


[86] NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection cs.CVPDF

Yupeng Zhang, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

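TL;DR: This paper proposes NoOVD, a new training framework that addresses the significant gap between the training and testing phases in open-vocabulary object detection (OVD). The framework integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs): K-FPN guides the model to discover novel-category objects and facilitates knowledge distillation, while R-RPN adjusts proposal confidence scores at inference to improve the recall of novel-category objects.

Details

Motivation: During OVD training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background. As a result, these proposals are filtered out prematurely during training or removed by post-processing at test time, which significantly lowers recall and weakens novel-category detection.

Result: In cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365, the method achieves superior performance across multiple metrics.

Insight: The contributions are: 1) a self-distillation mechanism that uses the pretrained knowledge of frozen VLMs to guide novel-category object discovery, avoiding forced alignment of novel objects with background without requiring additional data; 2) R-RPN, which adjusts proposal confidence at inference to raise novel-category recall. Viewed objectively, the combination of knowledge distillation and inference-time adjustment narrows the training-testing gap in OVD and has practical value.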

Abstract: Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.


[87] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models cs.CVPDF

Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang

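TL;DR: This paper proposes CoVFT (Context-aware Visual Fine-tuning), a framework that addresses unstable vision-encoder fine-tuning in multimodal large language models (MLLMs). The study finds that existing visual fine-tuning methods behave inconsistently because of visual preference conflicts. CoVFT introduces Context Vector Extraction and a Contextual Mixture-of-Experts module to bring multimodal context into visual adaptation, enabling stable and efficient fine-tuning. On 12 multimodal benchmarks, CoVFT achieves state-of-the-art performance and shows a notable gain in model efficiency.

Details

Motivation: It is unresolved whether the vision encoder in an MLLM should be fine-tuned or frozen. Under diverse multimodal contexts, existing methods suffer from visual preference conflicts, perform unstably, and fail to consistently outperform the frozen baseline.

Result: In extensive experiments on 12 multimodal benchmarks, CoVFT achieves state-of-the-art performance with superior stability. A 7B-parameter MLLM fine-tuned with CoVFT exceeds the average performance of its 13B-parameter counterpart.

Insight: The novelty is explicitly integrating multimodal context into visual adaptation: Context Vector Extraction and the Contextual Mixture-of-Experts module decompose conflicting optimization signals and enable context-sensitive visual updates. The work suggests that vision-encoder optimization in MLLMs has substantial untapped potential and that context-aware design can markedly improve efficiency and stability.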

Abstract: Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal contexts. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.


[88] Hierarchical Text-Guided Brain Tumor Segmentation via Sub-Region-Aware Prompts cs.CVPDF

Bahram Mohammadi, Ta Duc Huy, Afrouz Sheikholeslami, Qi Chen, Vu Minh Hieu Phan

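TL;DR: This paper proposes TextCSP, a hierarchical text-guided brain tumor segmentation framework that targets the ambiguous boundaries between tumor sub-regions (whole tumor, tumor core, enhancing tumor). Built on the TextBraTS baseline, it adds three new components: a text-modulated soft cascade decoder, sub-region-aware prompt tuning, and text-semantic channel modulators. Together they turn radiological description text into specialized representations for each sub-region that guide image segmentation.

Details

Motivation: Existing methods usually compress an entire radiology report into a single global text embedding shared across all tumor sub-regions. This ignores the distinct clinical characteristics of the sub-regions (whole tumor, tumor core, enhancing tumor) and the challenge of their ambiguous visual boundaries.

Result: Experiments on the TextBraTS dataset show that the method outperforms state-of-the-art approaches on all sub-regions, with improvements of 1.7% and 6% on the main metrics Dice and HD95, respectively.

Insight: The novelty is a hierarchical text-guided soft cascade architecture. Sub-region-aware prompt tuning (with a LoRA-adapted BioBERT encoder) produces a specialized text representation for each tumor sub-region, and text-semantic channel modulators convert these representations into channel-wise feature refinement signals, giving a coarse-to-fine segmentation process aligned with clinically described patterns.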

Abstract: Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT->TC->ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions against state-of-the-art methods by 1.7% and 6% on the main metrics Dice and HD95.
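
The WT->TC->ET containment hierarchy can be sketched with chained conditional probabilities. This is only one plausible reading of a "soft cascade": the abstract does not specify how the decoder couples the three predictions, so the multiplicative form and the function name below are assumptions.

```python
def soft_cascade(p_wt, p_tc_given_wt, p_et_given_tc):
    """Per-voxel probabilities that respect ET within TC within WT.

    Each finer region is predicted conditionally on the coarser one,
    so multiplying through guarantees p_wt >= p_tc >= p_et.
    """
    p_tc = p_wt * p_tc_given_wt
    p_et = p_tc * p_et_given_tc
    return p_wt, p_tc, p_et


# One voxel: likely tumor, probably core, uncertain about enhancement.
wt, tc, et = soft_cascade(0.9, 0.8, 0.5)
```

Because every factor lies in [0, 1], a voxel can never be scored as enhancing tumor more confidently than as tumor core, which matches the anatomical nesting described in the abstract.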


[89] Representation-Level Adversarial Regularization for Clinically Aligned Multitask Thyroid Ultrasound Assessment cs.CV | cs.AIPDF

Dina Salama, Mohamed Mahmoud, Nourhan Bayasi, David Liu, Ilker Hacihaliloglu

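TL;DR: This paper proposes a clinically guided multitask framework that jointly predicts the segmentation mask and TI-RADS risk category of thyroid nodules. It introduces a representation-level adversarial regularizer (RLAR) to mitigate the gradient competition that annotator variability causes in multitask learning, improving risk stratification while maintaining segmentation quality.

Details

Motivation: In thyroid ultrasound, radiologists differ in both contouring style and TI-RADS risk grading. The resulting inconsistent supervision can degrade standard learning pipelines.

Result: On a public TI-RADS dataset, the clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality, compared with single-task training and conventional multitask baselines.

Insight: The contributions are: 1) guiding the classification embedding with a compact, TI-RADS-aligned radiomics target so that risk prediction is grounded in clinically meaningful evidence; 2) RLAR, which uses each task's normalized adversarial direction at the representation level as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. This makes multitask gradient competition explicit and controllable without parameter-level gradient surgery.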

Abstract: Thyroid ultrasound is the first-line exam for assessing thyroid nodules and determining whether biopsy is warranted. In routine reporting, radiologists produce two coupled outputs: a nodule contour for measurement and a TI-RADS risk category based on sonographic criteria. Yet both contouring style and risk grading vary across readers, creating inconsistent supervision that can degrade standard learning pipelines. In this paper, we address this workflow with a clinically guided multitask framework that jointly predicts the nodule mask and TI-RADS category within a single model. To ground risk prediction in clinically meaningful evidence, we guide the classification embedding using a compact TI-RADS aligned radiomics target during training, while preserving complementary deep features for discriminative performance. However, under annotator variability, naive multitask optimization often fails not because the tasks are unrelated, but because their gradients compete within the shared representation. To make this competition explicit and controllable, we introduce RLAR, a representation-level adversarial gradient regularizer. Rather than performing parameter-level gradient surgery, RLAR uses each task’s normalized adversarial direction in latent space as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. On a public TI-RADS dataset, our clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines. Code and pretrained models will be released.
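
Penalizing "excessive angular alignment" between two task-specific directions can be sketched with a hinge on cosine similarity. The hinge form, the `margin` value, and the function names are assumptions; the abstract does not give RLAR's actual formula, and computing the adversarial directions themselves is omitted here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def alignment_penalty(dir_seg, dir_cls, margin=0.5):
    """Hypothetical regularizer on two latent-space adversarial directions.

    dir_seg / dir_cls stand for the directions in the shared representation
    that most increase the segmentation / classification loss.
    Only cosine similarity above `margin` is penalized.
    """
    a, b = normalize(dir_seg), normalize(dir_cls)
    cosine = sum(x * y for x, y in zip(a, b))
    return max(0.0, cosine - margin)


aligned = alignment_penalty([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = alignment_penalty([1.0, 0.0], [0.0, 3.0])   # unrelated directions
```

Normalizing first means the penalty depends only on the angle between the directions, not on how sensitive each task is in absolute terms.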


[90] Learning Progressive Adaptation for Multi-Modal Tracking cs.CV | cs.AIPDF

He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler

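TL;DR: This paper proposes Progressive Adaptation for Multi-Modal Tracking (PATrack). It introduces modality-dependent, modality-entangled, and task-level adapters that progressively adapt a pre-trained RGB model to multi-modal data, addressing the shortcomings of existing fine-tuning methods in modality-specific information, cross-modal interaction, and prediction-head adaptation.

Details

Motivation: Because paired multi-modal data is limited, multi-modal trackers usually combine a pre-trained RGB model with parameter-efficient fine-tuning modules. These methods overlook more advanced adaptation strategies and fail to modulate individual modalities, cross-modal interactions, and the prediction head effectively.

Result: Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks show that the method performs strongly against state-of-the-art approaches.

Insight: The novelty is a progressive strategy that integrates intra-modal, inter-modal, and task-level adapters. It enhances modality-specific information by decomposing high- and low-frequency components, introduces cross-modal interaction through cross-attention, and adapts the prediction head's strong inductive bias, giving a unified way to adapt RGB pre-trained networks to multi-modal data.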

Abstract: Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.


[91] CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs cs.CVPDF

Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur

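TL;DR: This paper proposes CVT-Bench, a diagnostic benchmark for evaluating how stable the spatial representations of multimodal large language models (MLLMs) are under counterfactual viewpoint transformations. Although MLLMs perform well on single-view spatial reasoning, their relational consistency degrades systematically under hypothetical camera viewpoint changes.

Details

Motivation: The paper asks whether MLLMs maintain stable spatial state representations under counterfactual viewpoint changes, since strong single-view performance may overestimate the robustness of their spatial representations.

Result: Experiments on 100 synthetic scenes and 6,000 relational queries show that state-of-the-art MLLMs degrade systematically under counterfactual viewpoint transformations, frequently violate cycle consistency, and lose relational stability quickly. Adding representational structure (for example, scene graphs) improves stability.

Insight: The novelty is a diagnostic benchmark that evaluates relational consistency under controlled camera orbit transformations without re-rendering images. The core insight is that single-view spatial accuracy overestimates the robustness of the spatial representations a model induces, and that representation structure plays a key role in counterfactual spatial reasoning.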

Abstract: Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene graphs, and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
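
The 360° cycle agreement idea can be illustrated with a small geometric oracle: after a full orbit, every left/right relation must match the starting view. This is a sketch only. The 2D top-down setup, the function names, and the left/right-only relation set are assumptions, not the benchmark's actual protocol.

```python
import math


def relation(a, b, orbit_deg):
    """Ground-truth relation of object a relative to object b.

    The camera sits on a circle around the origin at angle orbit_deg
    (top-down view, z up) and looks at the origin.
    Returns 'left' or 'right'; ties along the view axis count as 'right'.
    """
    theta = math.radians(orbit_deg)
    forward = (-math.cos(theta), -math.sin(theta))
    right = (forward[1], -forward[0])  # forward x up, projected to the plane
    offset = (a[0] - b[0]) * right[0] + (a[1] - b[1]) * right[1]
    return "left" if offset < 0 else "right"


def cycle_agrees(answer_fn, a, b, start_deg=0.0):
    """True if the answer after a full 360-degree orbit matches the start."""
    return answer_fn(a, b, start_deg) == answer_fn(a, b, start_deg + 360.0)


obj_a, obj_b = (1.0, -1.0), (-1.0, 1.0)
views = [relation(obj_a, obj_b, deg) for deg in (0, 90, 180, 270, 360)]
```

In an evaluation, `answer_fn` would be replaced by a wrapper around the model's answer to the same query, and disagreement with this oracle or with itself after 360° would count as an inconsistency.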


[92] One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation cs.CV | cs.AIPDF

Yu-Wen Tseng, Xingyi Zheng, Ya-Chen Wu, I-Bin Liao, Yung-Hui Li

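TL;DR: This paper addresses Practical Test-Time Adaptation (PTTA), where test streams are non-i.i.d. and temporally correlated, and argues that the single unstructured memory pool used by existing methods is structurally mismatched to this setting. The authors propose Multi-Cluster Memory (MCM), which organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors and introduces three complementary mechanisms: descriptor-based cluster assignment, adjacent cluster consolidation, and uniform cluster retrieval. As a plug-and-play module, MCM yields consistent gains across multiple datasets and TTA methods.

Details

Motivation: Existing PTTA methods store samples in a single unstructured memory pool. A stream clusterability analysis shows that real test streams are inherently multi-modal, with an optimal number of mixture components far greater than one, so a single-cluster design is fundamentally mismatched to what PTTA needs.

Result: On CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, integrating MCM with three contemporary TTA methods gives consistent improvements in all 12 configurations, with gains of up to 5.00% on ImageNet-C and 12.13% on DomainNet. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage.

Insight: The core contribution is establishing memory organization as a key design axis for PTTA and proposing a structured multi-cluster memory that matches the multi-modal nature of test streams. Its three mechanisms (descriptor-based clustering, adjacent consolidation, uniform retrieval) jointly address memory efficiency and balanced supervision across modes, offering a reusable system design idea for TTA.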

Abstract: Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
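
The three mechanisms named in the abstract can be sketched as a toy memory class. Everything concrete here is an assumption for illustration: the (mean, standard deviation) descriptor, the distance threshold, the class and method names, and the merge rule are not taken from the paper.

```python
import math
import random
from statistics import mean, pstdev


def descriptor(pixels):
    """Hypothetical pixel-level descriptor: (mean, std) of intensities."""
    return (mean(pixels), pstdev(pixels))


class MultiClusterMemory:
    def __init__(self, max_clusters=3, threshold=0.2):
        self.max_clusters = max_clusters
        self.threshold = threshold
        self.clusters = []  # kept in creation order, i.e. temporal order

    def add(self, pixels):
        """Descriptor-based assignment: join the nearest cluster or open a new one."""
        d = descriptor(pixels)
        if self.clusters:
            nearest = min(self.clusters, key=lambda c: math.dist(c["centroid"], d))
            if math.dist(nearest["centroid"], d) <= self.threshold:
                n = len(nearest["samples"])
                nearest["centroid"] = tuple(
                    (c * n + x) / (n + 1) for c, x in zip(nearest["centroid"], d))
                nearest["samples"].append(pixels)
                return
        self.clusters.append({"centroid": d, "samples": [pixels]})
        if len(self.clusters) > self.max_clusters:
            self._consolidate()

    def _consolidate(self):
        """Adjacent consolidation: merge the most similar neighboring pair."""
        i = min(range(len(self.clusters) - 1),
                key=lambda j: math.dist(self.clusters[j]["centroid"],
                                        self.clusters[j + 1]["centroid"]))
        a, b = self.clusters[i], self.clusters[i + 1]
        na, nb = len(a["samples"]), len(b["samples"])
        centroid = tuple((x * na + y * nb) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"]))
        self.clusters[i:i + 2] = [{"centroid": centroid,
                                   "samples": a["samples"] + b["samples"]}]

    def retrieve(self, per_cluster, rng):
        """Uniform retrieval: draw the same number of samples from every cluster."""
        batch = []
        for c in self.clusters:
            batch.extend(rng.sample(c["samples"], min(per_cluster, len(c["samples"]))))
        return batch


mem = MultiClusterMemory(max_clusters=2, threshold=0.05)
for img in ([0.1] * 4, [0.9] * 4, [0.1, 0.12, 0.1, 0.08], [0.5] * 4):
    mem.add(img)
batch = mem.retrieve(per_cluster=1, rng=random.Random(0))
```

This sketch bounds only the number of clusters. A practical memory would also cap the samples stored per cluster, which is left out to keep the example short.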


[93] Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues cs.CVPDF

Wenjin Hou, Xiaoxiao Sun, Hehe Fan

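TL;DR: This paper proposes RLVC, a generative zero-shot learning (ZSL) framework based on outcome-reward reinforcement learning with visual cues. It targets two problems in existing generative ZSL: synthesized features are task-agnostic, and relying on semantic prototypes alone makes it hard to separate visually similar categories. Reinforcement learning lets the generative model self-evolve, an outcome-based reward encourages task-relevant features, and class-wise visual cues align synthesized features with visual prototypes and stabilize training.

Details

Motivation: Visual features synthesized by existing generative ZSL methods are often unrelated to the downstream task, and inferring the data distribution from semantic prototypes alone works poorly for classes that are semantically similar but visually distinct, which degrades performance. The paper addresses both issues with reinforcement learning and visual cues.

Result: Comprehensive experiments on three prevalent ZSL benchmarks show that RLVC achieves state-of-the-art performance with a 4.7% gain.

Insight: The novelty is a generative ZSL framework that combines outcome-reward reinforcement learning with class-wise visual cues: the self-evolving RL mechanism and outcome-based reward synthesize task-relevant features, and the visual cues align features and stabilize training. The proposed cold-start training strategy is also a practice worth borrowing.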

Abstract: Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.


[94] GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing cs.CVPDF

Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue

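TL;DR: This paper proposes GIDE, a framework that addresses the precision challenge of training-free image editing with Diffusion Large Language Models (DLLMs). Using a discrete noise inversion mechanism and a multi-stage editing pipeline, it supports multiple kinds of editing instructions while preserving the background, and it clearly outperforms existing methods on the newly built GIDE-Bench benchmark.

Details

Motivation: DLLMs excel at multi-modal generation, but their discrete tokenization prevents the use of standard noise inversion techniques and leads to structural degradation during editing. A training-free method capable of precise editing is therefore needed.

Result: On GIDE-Bench (805 compositional editing scenarios), GIDE improves Semantic Correctness by 51.83% and Perceptual Quality by 50.39%, significantly outperforming prior training-free methods. On ImgEdit-Bench it also shows broad applicability and is on par with leading models.

Insight: The contributions include a discrete noise inversion mechanism that captures noise patterns in the discrete token space, and the decomposition of the editing pipeline into grounding, inversion, and refinement stages, which together provide an extensible framework for precise DLLM-based editing.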

Abstract: While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.


[95] Boundary-Aware Instance Segmentation in Microscopy Imaging cs.CVPDF

Thomas Mendelson, Joshua Francois, Galit Lahav, Tammy Riklin-Raviv

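TL;DR: This paper proposes BAISeg, a prompt-free, boundary-aware instance segmentation framework for precisely segmenting dense cells in microscopy videos. It models cell contours by predicting signed distance functions (SDFs) and trains with a Modified Hausdorff Distance loss to address the difficulty of separating adjacent cell instances.

Details

Motivation: In dense microscopy scenes, existing foundation segmentation models such as SAM struggle to separate adjacent or overlapping cell instances even with extensive prompting. Accurate separation is needed for precise studies of cellular dynamics.

Result: Evaluations on public and private high-throughput microscopy datasets show improved boundary accuracy and instance-level performance compared with recent SAM-based and other foundation-model approaches.

Insight: The novelty is predicting signed distance functions (SDFs) instead of binary masks to model contours, combined with a learned sigmoid mapping and a unified Modified Hausdorff Distance loss. This gives prompt-free, geometry-consistent, boundary-aware segmentation and more robust separation of dense instances.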

Abstract: Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
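
The step from a signed distance value to a foreground probability can be sketched as follows. The sign convention (negative inside the cell), the parameter names `sharpness` and `shift`, and their default values are assumptions standing in for whatever the learned mapping in the paper actually uses.

```python
import math


def circle_sdf(x, y, cx, cy, radius):
    """Signed distance to a circular cell: negative inside, positive outside."""
    return math.hypot(x - cx, y - cy) - radius


def sdf_to_prob(sdf, sharpness=4.0, shift=0.0):
    """Sigmoid mapping from signed distance to foreground probability.

    sharpness controls how steeply the probability falls across the contour;
    shift moves the 0.5 level set. Both would be learned in training.
    """
    z = sharpness * (sdf - shift)
    z = max(-60.0, min(60.0, z))  # keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(z))


inside = sdf_to_prob(circle_sdf(0.0, 0.0, 0.0, 0.0, 3.0))     # cell center
boundary = sdf_to_prob(circle_sdf(3.0, 0.0, 0.0, 0.0, 3.0))   # on the contour
outside = sdf_to_prob(circle_sdf(10.0, 0.0, 0.0, 0.0, 3.0))   # background
```

Because the SDF varies smoothly across the contour, two touching cells produce two distinct basins of negative distance, which a binary mask alone would merge into one blob.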


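[96] A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification cs.CV

Ting Han, Xiangyi Xie, Yiping Chen, Yumeng Du, Jin Ma

TL;DR: This paper presents SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. RoadReasoner enhances frequency-sensitive cues and multi-scale context to produce robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments, and it infers road grades by querying vision-language models with geometric descriptors and geometry-aware textual prompts.

Details

Motivation: Automatically extracting road networks from remote sensing imagery and classifying their hierarchy levels at fine granularity, in order to support automated transport infrastructure mapping and management.

Result: Experiments on SYSU-HiRoads and CHN6-CUG show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate, semantically consistent road hierarchy maps, with 72.6% overall accuracy (OA), 64.2% F1 score, and 60.6% segmentation accuracy (SegAcc).

Insight: The main contributions are a large-scale hierarchical road dataset with dense masks, vectorized centerlines, and three-level hierarchy labels, and a unified framework that combines vision, language, and geometry. Frequency enhancement, multi-scale context, and VLM-based semantic reasoning over geometric descriptions together address road extraction, topology reconstruction, and hierarchy classification.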

Abstract: In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km² in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.


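[97] Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species cs.CV

Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao

TL;DR: This paper presents TPC-268, the first plant counting benchmark that incorporates plant taxonomy. It contains 10,000 images with 678,050 point annotations, covers 268 countable plant categories across 242 species, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy.

Details

Motivation: Fine-grained, taxonomy-aware plant counting is underexplored in vision. Unlike crowds, plants have nonrigid morphologies and vary markedly in appearance across growth stages and environments.

Result: The paper benchmarks state-of-the-art regression-based and detection-based class-agnostic counting methods on TPC-268. The abstract reports no specific quantitative results.

Insight: The novelty is a fine-grained dataset that couples point annotations with Linnaean labels (kingdom -> species) and organ categories, supporting hierarchical reasoning and species-aware evaluation. It offers a biologically grounded testbed for fine-grained class-agnostic counting.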

Abstract: Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.


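[98] QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression cs.CV | cs.AI

Zhongyang Li, Yaqian Li, Faming Fang, Rinyoichi Takezoe, Zi-Hao Bo

TL;DR: This paper proposes QMoP (Query Guided Mixture-of-Projector), an adaptive framework for efficient visual token compression. It compresses visual tokens through three collaborative branches (pooling, resampling, and pruning) and introduces a Query Guided Router (QGR) that coordinates them dynamically based on the visual input and the textual query. The authors also build VTCBench to systematically evaluate the information loss caused by visual token compression. Experiments show that QMoP outperforms strong baselines while saving memory, computation, and inference time.

Details

Motivation: In multimodal large language models, visual tokens far outnumber textual tokens, creating severe computational and memory bottlenecks. Existing projector modules rely on fixed heuristics and adapt poorly across scenarios.

Result: Extensive experiments, including on the newly built VTCBench, show that QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.

Insight: The novelty is an adaptive, query-guided hybrid compression framework (QMoP with QGR) that combines coarse-grained global semantics, high-level semantic representations, and fine-grained detail preservation, together with the dedicated VTCBench benchmark. Viewed objectively, applying Mixture-of-Experts ideas with query-aware routing to token compression is a flexible and efficient architectural choice.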

Abstract: Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
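
The routing idea can be sketched as a softmax-weighted combination of branch outputs. This is only an illustration under strong assumptions: the abstract does not specify how the router logits are computed or how outputs of different lengths are aligned, so the sketch assumes each branch already produces a vector of the same size and takes the logits as given. The names `softmax` and `fuse_branches` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(branch_outputs, router_logits):
    """Weighted sum of equally sized branch outputs.
    router_logits would come from a query-conditioned router; here they are inputs."""
    weights = softmax(router_logits)
    dim = len(branch_outputs[0])
    fused = [sum(w * out[d] for w, out in zip(weights, branch_outputs))
             for d in range(dim)]
    return fused, weights

# Toy outputs for the pooling, resampler, and pruning branches (illustrative values).
fused, weights = fuse_branches([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
```

With equal logits every branch receives weight 1/3; a router that raises one logit shifts the fused representation toward that branch.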


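[99] Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset cs.CV

Faisal Ahmed

TL;DR: This study proposes a deep learning framework based on Vision Transformers (ViT) with colormap-based feature representation to improve multi-class brain tumor classification. It uses the transformer's ability to capture long-range dependencies and applies color mapping to emphasize structural and intensity variations in MRI scans. Experiments on the BRISC2025 dataset show that it outperforms several baseline CNN models on accuracy, AUC, and related metrics.

Details

Motivation: Improving the accuracy of brain tumor classification from MRI by combining transformers with color mapping, in support of early diagnosis and treatment planning.

Result: On the BRISC2025 dataset (four classes: glioma, meningioma, pituitary tumor, and non-tumor), the model reaches 98.90% classification accuracy and 99.97% AUC, outperforming baseline CNNs including ResNet50, ResNet101, and EfficientNetB2.

Insight: The novelty is the combination of colormap-based feature representation with a Vision Transformer to strengthen the extraction of structural and intensity features from MRI. Viewed objectively, this fusion may improve sensitivity to subtle differences in medical images and has potential for clinical use.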

Abstract: Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.


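[100] CornOrb: A Multimodal Dataset of Orbscan Corneal Topography and Clinical Annotations for Keratoconus Detection cs.CV

Mohammed El Amine Lazouni, Leila Ryma Lazouni, Zineb Aziza Elaouaber, Mohammed Ammar, Sofiane Zehar

TL;DR: This paper introduces CornOrb, a publicly available multimodal dataset of Orbscan corneal topography images and clinical annotations from patients in Algeria, intended for keratoconus detection. It contains 1,454 eyes from 744 patients (889 normal eyes and 565 keratoconus cases). Each eye has four corneal maps (axial curvature, anterior elevation, posterior elevation, and pachymetry) plus structured tabular data such as demographics and key clinical parameters. The data are anonymized and pre-processed into standardized PNG and CSV formats to support AI research.

Details

Motivation: Large-scale, multimodal Orbscan corneal topography datasets from Africa are lacking, which hinders robust AI-based detection and analysis of keratoconus.

Result: The dataset provides annotated data for 1,454 eyes and is one of the first large-scale Orbscan resources from Africa, offering a basis for training and evaluating AI models.

Insight: The contribution is a public dataset that combines multimodal images (four corneal maps) with structured clinical data, built specifically for keratoconus detection. It fills a gap in data resources from Africa and should help in developing more robust AI diagnostic tools.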

Abstract: In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.


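[101] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning cs.CV | cs.AI

Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang

TL;DR: This paper proposes an unsupervised self-evolution training framework, trained with Group Relative Policy Optimization (GRPO), to improve the reasoning performance of multimodal large language models. It needs no human-annotated answers or external reward models. It samples multiple reasoning trajectories per input, uses the model's own self-consistency signal as a prior, applies a bounded Judge-based modulation to reweight trajectories of different quality, and converts absolute scores into within-group relative advantages for more robust policy updates.

Details

Motivation: Improvements in multimodal reasoning currently depend mainly on high-quality annotated data or teacher-model distillation, both of which are costly and hard to scale. The paper explores a self-evolution training method that requires no human supervision.

Result: On five mathematical reasoning benchmarks, the method consistently improves reasoning performance and generalization, with stable gains.

Insight: The novelty is a fully unsupervised self-evolution framework that uses the model's own reasoning consistency (self-consistency) as the training signal and compares relative advantages within each group to avoid the instability of absolute scores, offering a scalable path toward self-evolving multimodal models.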

Abstract: Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor’s self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
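
The scoring pipeline in the abstract (self-consistency prior, bounded Judge modulation, group-relative advantages) can be sketched as follows. The majority-vote consistency score, the clamping range `[low, high]`, and the mean/standard-deviation normalization are assumptions chosen for illustration; the abstract does not give the exact formulas, and all function names are hypothetical.

```python
import statistics
from collections import Counter

def self_consistency(answers):
    """Score each sampled answer by the fraction of the group that agrees with it."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

def modulated_scores(consistency, judge, low=0.5, high=1.5):
    """Reweight consistency scores by a Judge factor clamped to [low, high] (assumed bounds)."""
    return [c * min(max(j, low), high) for c, j in zip(consistency, judge)]

def group_relative_advantages(scores, eps=1e-8):
    """Convert absolute scores to advantages relative to the group mean and spread."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]

# Three sampled trajectories for one input: two agree on "4", one says "5".
scores = modulated_scores(self_consistency(["4", "4", "5"]), [1.2, 0.9, 1.0])
advantages = group_relative_advantages(scores)
```

Because advantages are centered within each group, a group in which all trajectories score equally contributes no gradient signal, which is the usual behaviour of group-relative normalization.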


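[102] Text-Image Conditioned 3D Generation cs.CV

Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang

TL;DR: This paper studies 3D generation conditioned jointly on text and image, addressing the limits of single-modality 3D generators. It introduces TIGON, a dual-branch baseline with separate image- and text-conditioned branches and lightweight cross-modal fusion, for more flexible and faithful 3D content generation.

Details

Motivation: Existing 3D generators usually condition on a single modality. Image-conditioned models achieve high visual fidelity but suffer from viewpoint bias, while text-conditioned models offer broad semantic guidance but lack low-level visual detail. This limits how users can express intent and motivates combining the two modalities.

Result: A diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models. In extensive experiments, text-image conditioning consistently improves over single-modality methods, confirming cross-modal complementarity.

Insight: The core contribution is formalizing the Text-Image Conditioned 3D Generation task and showing the value of joint reasoning over a visual exemplar and a textual specification. As a minimalist dual-branch baseline with lightweight cross-modal fusion, TIGON indicates that complementary vision-language guidance is a promising direction for future 3D generation research.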

Abstract: High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page


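[103] Identity-Consistent Video Generation under Large Facial-Angle Variations cs.CV

Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang

TL;DR: This paper proposes Mv^2ID, a multi-view conditioned video generation framework that addresses the difficulty single-view reference-to-video methods have in preserving identity under large facial-angle variations. A region-masking training strategy and a reference decoupled-RoPE mechanism mitigate the view-dependent copy-paste artifact introduced by multi-view references, improving identity consistency while keeping facial motion natural.

Details

Motivation: Single-view reference-to-video methods struggle to preserve identity under large facial-angle variations, yet simply adding multi-view reference images worsens the view-dependent copy-paste problem and makes facial motion unnatural. Cross-paired data can alleviate this but is costly to collect, so a method is needed that balances consistency and naturalness without cross-paired supervision.

Result: On the authors' large-scale multi-angle facial dataset, the method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches, including those trained with cross-paired data.

Insight: The contributions are: 1) a region-masking training strategy that prevents shortcut learning and encourages complementary aggregation of identity cues across views; 2) a reference decoupled-RoPE mechanism that assigns distinct positional encodings to video and conditioning tokens to better model their heterogeneous properties; 3) a dedicated dataset and evaluation metrics that serve as a benchmark for related research.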

Abstract: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose Mv^2ID, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.


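[104] Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication cs.CV

Idris Zakariyya, Pai Chet Ng, Kaushik Bhargav Sivangi, S. Mohammad Sheikholeslami, Konstantinos N. Plataniotis

TL;DR: This paper proposes FedDP-STECAR, a federated learning method that tackles two challenges in video action recognition: privacy leakage and communication overhead. Using differentially private selective tuning and efficient communication, it fine-tunes and perturbs only task-relevant layers, preserving temporal features in video while sharply reducing leakage risk and communication cost.

Details

Motivation: Federated video action recognition faces two key problems: model exposure through exchanged gradients, which can leak private motion patterns, and the heavy communication cost of synchronizing full high-dimensional video networks.

Result: Experiments on UCF-101 with the MViT-B-16x4 transformer show up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings, and 48% faster training with 73.1% accuracy in federated settings. Communication traffic is reduced by over 99% compared with full-model updates.

Insight: The claimed novelty is combining differential privacy with selective layer tuning, so that only key layers are perturbed and transmitted, which preserves privacy and model performance while greatly cutting communication cost. Viewed objectively, the core contribution is an efficient, privacy-aware parameter update and communication strategy tailored to federated learning on temporal video features.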

Abstract: Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp
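
A minimal sketch of selective, differentially private updates follows. It assumes the standard Gaussian mechanism (clip each tuned layer's update to an L2 norm bound, then add Gaussian noise scaled by a noise multiplier); the abstract does not state which mechanism or clipping granularity FedDP-STECAR uses. Layer names, sizes, and function names are hypothetical.

```python
import math
import random

def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update vector to L2 norm <= clip_norm, then add Gaussian noise."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [v * scale + rng.gauss(0.0, sigma) for v in update]

def selective_private_update(updates, tuned_layers, clip_norm=1.0,
                             noise_multiplier=1.0, seed=0):
    """Perturb and return only the tuned layers; frozen layers are never transmitted."""
    rng = random.Random(seed)
    return {name: clip_and_noise(updates[name], clip_norm, noise_multiplier, rng)
            for name in tuned_layers}

def traffic_fraction(layer_sizes, tuned_layers):
    """Fraction of parameters sent when only the tuned layers are communicated."""
    return sum(layer_sizes[n] for n in tuned_layers) / sum(layer_sizes.values())

# Toy model: a large frozen backbone and a small tuned head (illustrative sizes).
sizes = {"backbone": 990, "head": 10}
fraction = traffic_fraction(sizes, ["head"])
```

In this toy configuration the client sends 1% of the parameters, which shows how tuning a small subset of layers can yield the kind of traffic reduction the abstract reports.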


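[105] Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos cs.CV

Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli

TL;DR: This paper proposes TTA-CaP, a cache-based test-time adaptation method for facial expression recognition in videos. Using three coordinated caches (a personalized source cache, a positive target cache, and a negative target cache) and a tri-gate mechanism, it personalizes vision-language models efficiently and without gradients to handle inter-subject distribution shifts, improving performance at low computational and memory cost.

Details

Motivation: Video facial expression recognition needs model personalization to handle large variations across subjects. Existing TTA methods based on unsupervised parameter optimization are computationally expensive and impractical to deploy, while cache-based TTA methods rely only on dynamic memories of test samples and tend to accumulate errors and drift because of noisy pseudo-labels.

Result: Experiments on three challenging video FER datasets (BioVid, StressID, and BAH) indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts while keeping computational and memory overhead low.

Insight: The novelty is an architecture of three coordinated caches (personalized source, positive target, and negative target) with a tri-gate mechanism that controls cache updates and replacement based on temporal stability, confidence, and consistency with the personalized cache. This reduces the impact of noisy pseudo-labels, and embedding fusion refines predictions, yielding efficient and robust personalization.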

Abstract: Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.


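[106] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization cs.CV

Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu

TL;DR: This paper presents EmoTaG, a few-shot, emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm and on 3D Gaussian Splatting (3DGS). It reformulates motion prediction in a structured FLAME parameter space and introduces a Gated Residual Motion Network (GRMN) that captures emotional prosody from audio while supplementing head pose and upper-face cues, producing expressive and coherent facial motion with stable geometry.

Details

Motivation: Existing few-shot audio-driven 3D talking head methods, whether based on NeRF or 3DGS, often suffer from geometric instability and audio-emotion mismatch under expressive facial motion, so more effective emotion-aware motion modeling is needed.

Result: Extensive experiments show that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.

Insight: The main contributions are: 1) predicting motion in FLAME parameter space instead of directly deforming 3D Gaussians, which introduces explicit geometric priors that improve stability; 2) the GRMN, which captures emotional prosody from audio and supplements head pose and upper-face motion cues absent from audio, enabling more expressive generation.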

Abstract: Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.


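[107] Respiratory Status Detection with Video Transformers cs.CV

Thomas Savage, Evan Madill

TL;DR: This study evaluates whether video transformers can recognize signs of respiratory distress. The authors collected videos of healthy volunteers recovering after strenuous exercise, built a labeled dataset from them, and designed a temporal ordering challenge. A ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, achieves an F1 score of 0.81, suggesting that modern video transformers can recognize subtle changes in respiratory mechanics.

Details

Motivation: Recognizing respiratory distress by visual inspection is a critical clinical skill. AI systems that detect early signs of respiratory deterioration could create a window for earlier intervention.

Result: On the authors' own respiratory distress video dataset, the augmented ViViT encoder reaches an F1 score of 0.81 on the temporal ordering task.

Insight: The contributions are augmenting a video transformer with LieRE and Motion Guided Masking, and designing a temporal ordering challenge to assess respiratory distress detection. Viewed objectively, the work applies video understanding to medical monitoring and has potential clinical value.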

Abstract: Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found that a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.


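[108] Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation cs.CV

Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng

TL;DR: This paper proposes Relax Forcing, a structured temporal memory mechanism that improves autoregressive video diffusion models for long video generation. It decomposes historical context into three functional roles (Sink, Tail, and dynamically selected History) and selectively incorporates only the most relevant past information, mitigating error accumulation during extrapolation while preserving motion evolution.

Details

Motivation: Autoregressive video diffusion models suffer progressive temporal degradation when extended to minute-scale generation. The authors' empirical analysis shows that the limiting factor is not insufficient memory but how temporal memory is used during inference, which calls for a more structured memory mechanism.

Result: Experiments on VBench-Long show that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead.

Insight: The core idea is to treat temporal memory as a heterogeneous rather than homogeneous buffer, decomposing it by functional role and using it selectively. This complements forcing-based training strategies and suggests that structured temporal memory is essential for scalable long video generation.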

Abstract: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
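
The Sink/Tail/History decomposition can be illustrated with a small index-selection sketch. It assumes each cached frame has a scalar relevance score and that History keeps the top-scoring frames not already covered by Sink or Tail; the abstract does not specify the selection criterion, so `relevance`, the default sizes, and the function name are hypothetical.

```python
def select_memory(num_frames, relevance, sink_size=1, tail_size=2, history_size=2):
    """Return the sorted indices of cached frames to attend to.

    Sink: the earliest frames (global stability).
    Tail: the most recent frames (short-term continuity).
    History: the highest-relevance frames among the rest (assumed criterion).
    """
    sink = set(range(min(sink_size, num_frames)))
    tail = set(range(max(num_frames - tail_size, 0), num_frames))
    reserved = sink | tail
    candidates = [i for i in range(num_frames) if i not in reserved]
    history = sorted(candidates, key=lambda i: relevance[i], reverse=True)[:history_size]
    return sorted(reserved | set(history))

# Eight cached frames with illustrative relevance scores.
selected = select_memory(8, [0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.0])
```

Here attention covers five of the eight cached frames, which illustrates how selective memory can reduce attention overhead compared with attending to the dense history.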


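[109] HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis cs.CV | cs.LG

Mohamed A Mabrok

TL;DR: HamVision is a framework for medical image analysis that uses the damped harmonic oscillator as a structured inductive bias for segmentation and classification. Its phase-space decomposition yields three functionally distinct representations: position q (feature content), momentum p (spatial gradients encoding boundary and texture information), and energy H (a parameter-free saliency map). These representations emerge from the dynamics rather than from supervision and can be used directly by task-specific heads. For segmentation, energy gates the skip connections and momentum injects boundary information at every decoder level (HamSeg); for classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls).

Details

Motivation: Exploring whether a fundamental building block of signal processing, the damped harmonic oscillator, can serve as a structured inductive bias that produces representations useful for both segmentation and classification without task-specific changes to the oscillator.

Result: Across ten medical imaging benchmarks spanning five modalities, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), with competitive results on the remaining MedMNIST datasets against MedMamba and MedViT.

Insight: The novelty is using Hamiltonian dynamics as an inductive bias: phase-space decomposition produces functionally distinct representations (position, momentum, energy) that different task heads can use without supervision, giving a task-adaptive and parameter-efficient design. Viewed objectively, the method combines a physics-inspired dynamical model with deep learning to provide an interpretable and efficient feature extraction mechanism, strengthening boundary awareness and saliency detection in medical image analysis.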

Abstract: We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator’s phase-space decomposition yields three functionally distinct representations: position $q$ (feature content), momentum $p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator’s momentum consistently encodes an interior > boundary > exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.
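
A scalar damped harmonic oscillator makes the three quantities concrete. The sketch assumes the dynamics dq/dt = p and dp/dt = -ω²q - γp, a phase-space state z = q + ip so that H = ½|z|² = ½(q² + p²), and a semi-implicit Euler integrator. How HamVision applies the oscillator to feature maps is not described in the abstract, so this is only a toy illustration with hypothetical names.

```python
def oscillator_step(q, p, omega=1.0, gamma=0.1, dt=0.01):
    """One semi-implicit Euler step of a damped harmonic oscillator."""
    p = p + dt * (-(omega ** 2) * q - gamma * p)  # momentum update with damping
    q = q + dt * p                                # position update uses the new momentum
    return q, p

def energy(q, p):
    """H = 0.5 * |z|^2 with z = q + i*p (assumed reading of the abstract's formula)."""
    return 0.5 * (q * q + p * p)

def rollout(q0, p0, steps, **params):
    """Integrate for a number of steps and record the energy at each step."""
    q, p = q0, p0
    energies = [energy(q, p)]
    for _ in range(steps):
        q, p = oscillator_step(q, p, **params)
        energies.append(energy(q, p))
    return q, p, energies

q_final, p_final, energies = rollout(1.0, 0.0, steps=1000)
```

Energy requires no learned parameters once q and p are known, which matches the abstract's description of it as a parameter-free saliency map; with positive damping it decays over the rollout.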


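[110] Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation cs.CV

Nikolay Kormushev, Josip Šarić, Matej Kristan

TL;DR: This paper proposes OVRCOAT, a framework with two modules, CLIP-conditioned objectness adjustment (COAT) and open-vocabulary mask-to-text refinement (OVR), that jointly address mask selection bias and weak region-text alignment in open-vocabulary panoptic segmentation, with clear gains on several benchmarks.

Details

Motivation: Open-vocabulary panoptic segmentation faces two coupled problems: mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories unseen in training, and the limited regional understanding of vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation.

Result: OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively).

Insight: The contributions are: 1) COAT, which adjusts background/foreground probabilities conditioned on CLIP to preserve high-quality masks for unseen categories; 2) OVR, which strengthens CLIP's region-level alignment at lower memory cost and improves classification of both seen and unseen classes. Viewed objectively, the simple modular design jointly improves objectness estimation and mask recognition and should be easy to extend and apply.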

Abstract: Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT


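[111] Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER cs.CV

Feng Xu, Xun Li, Lars Petersson, Yulei Sui, David Ahmedt Aristizabal

TL;DR: This paper proposes a two-stage framework for privacy-preserving facial expression recognition in open-set video settings. It first trains an identity-suppression network using knowledge priors derived from real-world videos without identity labels, anonymizing identity while preserving expressive cues. A denoising module then restores expression-related information to maintain recognition performance. The paper also introduces a falsification-based validation method that uses recognition priors to evaluate privacy robustness without identity annotations.

Details

Motivation: Existing privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown and identity labels are unavailable. The goal is identity-disentangled, privacy-preserving expression recognition under these conditions.

Result: Experiments on three video datasets show that the method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.

Insight: The contributions are using intra- and inter-video knowledge priors from unlabeled videos for identity suppression, and a falsification-based validation method that evaluates privacy robustness without identity annotations, offering a new approach to open-set privacy protection.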

Abstract: Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.


[112] Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models cs.CV

Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin

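TL;DR: This paper proposes Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively balances how much the student learns from data versus teacher guidance. Taking a Bayesian view, it interprets teacher supervision as a Gibbs prior over student activations, which yields a closed-form, uncertainty-aware weighting mechanism. Experiments on multimodal visual question answering benchmarks show that Beta-KD improves student vision-language models and outperforms existing knowledge distillation methods.

Details

Motivation: Knowledge distillation must balance learning from data supervision against learning from teacher guidance, but some samples may be noisy and the teacher may be uncertain on others, so the balance needs to be adjusted adaptively.

Result: On multimodal VQA benchmarks, distilling student vision-language models from a large teacher VLM consistently improves performance, and Beta-KD outperforms existing knowledge distillation methods.

Insight: The novelty is formalizing teacher-student learning from a unified Bayesian perspective and treating teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism that supports arbitrary distillation objectives and their combinations.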

Abstract: Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.


[113] Image-Based Structural Analysis Using Computer Vision and LLMs: PhotoBeamSolver cs.CV

Altamirano-Muñiz Emilio Fernando

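TL;DR: This paper describes the development of PhotoBeamSolver, a program that solves idealized beam models from hand-drawn sketches. The system combines computer vision and statistical learning to detect and visually interpret structural elements, and the paper discusses the current state and challenges of computer vision in civil engineering for structural analysis, infrastructure inspection, and engineering decision-support systems.

Details

Motivation: To automatically recognize and parse structural elements in hand-drawn sketches for idealized beam analysis, and to advance the reliable application of computer vision to structural analysis in civil engineering.

Result: The abstract reports no quantitative results or benchmarks; it describes the implementation of PhotoBeamSolver and discusses the current state of computer vision in civil engineering.

Insight: The contribution is a system that combines computer vision with statistical learning to parse structural models automatically from hand drawings. Viewed objectively, applying AI techniques to traditional engineering drawings offers a new route toward automation in civil engineering.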

Abstract: This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.


[114] PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences cs.CV

Lanbo Xu, Liang Guo, Caigui Jiang, Cheng Wang

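TL;DR: PAS3R is a pose-adaptive streaming 3D reconstruction framework for long video sequences that targets the stability-adaptation dilemma in online monocular 3D reconstruction. It dynamically modulates state updates to balance rapid integration of novel viewpoints against preservation of existing scene structure, improving trajectory accuracy and geometric consistency over long sequences.

Details

Motivation: Existing streaming methods use uniform or attention-based update mechanisms that handle abrupt viewpoint transitions poorly, causing trajectory drift and geometric inconsistency over long sequences. The paper addresses this fundamental tension between stability and adaptation.

Result: Extensive experiments on multiple benchmarks show that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality on long video sequences while remaining competitive on short ones.

Insight: The core contribution is a pose-adaptive mechanism that modulates state updates according to camera motion and scene structure. The key insight is that frames contributing significant geometric novelty should influence the reconstruction state more strongly, while frames with little viewpoint change should prioritize preserving historical context. Trajectory-consistent training objectives and a lightweight online stabilization module further improve long-horizon stability.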

Abstract: Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
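As a rough illustration of the pose-adaptive idea in this abstract, the sketch below blends a new frame's features into a running state with a gate that grows with inter-frame pose change. The function names, the exponential gate, and the `scale` parameter are assumptions made for illustration; the abstract does not specify PAS3R's actual update rule, which also draws on image frequency cues.

```python
import math

def frame_gate(pose_change: float, scale: float = 1.0) -> float:
    """Map a non-negative pose-change magnitude to a gate in [0, 1).

    Hypothetical choice: 1 - exp(-scale * change), so small motion gives
    a gate near 0 (keep history) and large motion a gate near 1.
    """
    return 1.0 - math.exp(-scale * max(pose_change, 0.0))

def update_state(state, frame_feat, pose_change, scale=1.0):
    """Convex blend of the old state and the new frame features."""
    g = frame_gate(pose_change, scale)
    return [(1.0 - g) * s + g * f for s, f in zip(state, frame_feat)]

state = [0.0, 0.0]
# A nearly static frame barely moves the state ...
small = update_state(state, [1.0, 1.0], pose_change=0.01)
# ... while a large viewpoint change almost overwrites it.
large = update_state(state, [1.0, 1.0], pose_change=5.0)
```

A convex blend is used here only because it makes the stability-versus-adaptation trade-off visible in a single scalar; a learned gate would be the more likely choice in practice.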


[115] ALADIN: Attribute-Language Distillation Network for Person Re-Identification cs.CV

Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou

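TL;DR: This paper proposes ALADIN (Attribute-Language Distillation Network), a method for person re-identification (ReID). It distills knowledge from a frozen CLIP teacher into a lightweight ReID student and introduces fine-grained attribute-local alignment and scene-aware soft prompt generation to improve robustness to occlusion and cross-modal representation learning.

Details

Motivation: Existing CLIP-based ReID methods rely mainly on global features and fixed prompts, which limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. The paper uses attribute-language distillation to improve fine-grained alignment and adaptivity.

Result: On the Market-1501, DukeMTMC-reID, and MSMT17 benchmarks, ALADIN surpasses CNN-, Transformer-, and CLIP-based methods, shows better generalization and interpretability, and reaches state-of-the-art (SOTA) performance.

Insight: The contributions include a fine-grained attribute-local alignment mechanism, a scene-aware soft prompt generator, the use of multimodal large language models (MLLMs) to generate structured attribute descriptions as supervision, and cross-modal contrastive and relation distillation to preserve structural relationships among attributes. Together these give a more robust and interpretable ReID solution.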

Abstract: Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.


[116] Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models cs.CV

Hyundong Jin, Dongyoon Han, Eunwoo Kim

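TL;DR: This paper proposes a continual unlearning framework for large vision-language models. It decomposes deletion targets into fine-grained visual and textual concepts and uses a concept modulator together with a mixture of refusal experts to generate precise, concept-grounded refusal responses while preserving the model's general capabilities.

Details

Motivation: During continual unlearning, sequential updates distort shared representations and create spurious associations between vision-language pairs and refusal behavior, making it hard to identify deletion targets precisely and leading to inappropriate refusals.

Result: Extensive experiments on vision-language benchmarks show that the framework outperforms existing methods, generating concept-grounded refusal responses and preserving general utility across unlearning sequences.

Insight: The novelty lies in decomposing deletion targets into fine-grained concepts and combining a concept modulator, a mixture of refusal experts, and a multimodal concept-driven routing scheme to achieve concept-aligned refusals while retaining model capability.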

Abstract: Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.


[117] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation cs.CV

Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng

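TL;DR: This paper proposes TrajSeg, a simple, unified framework built on multimodal large language models (MLLMs) for video reasoning segmentation. It introduces bidirectional text-trajectory alignment and combines a frame-level content integration module with a unified mask decoder to segment dynamic objects in video more accurately from human instructions.

Details

Motivation: Existing video reasoning segmentation methods rely on unidirectional, implicit text-trajectory alignment and struggle to perceive object trajectories under severe video dynamics. The paper aims to improve the model's perception of object trajectories.

Result: Extensive experiments on referring and reasoning video segmentation datasets show that TrajSeg outperforms existing video reasoning segmentation methods on all metrics.

Insight: The core contribution is bidirectional text-trajectory alignment (text-to-trajectory and trajectory-to-text), which strengthens cross-modal correspondence. Combined with frame-level content integration and a unified mask decoder, it simplifies the framework and enables end-to-end training.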

Abstract: The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.


[118] StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding cs.CV | cs.MM

Guowei Tang, Tianwen Qian, Huanran Zheng, Yifei Wang, Xiaoling Wang

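TL;DR: This paper introduces StreamingEval, a unified framework for evaluating the streaming video understanding capabilities of video large language models (Video-LLMs) under realistic resource constraints. Using a standardized protocol, it benchmarks mainstream offline models and recent online video models and explicitly characterizes the trade-off among efficiency, storage, and accuracy.

Details

Motivation: Research on streaming video understanding typically focuses on question-answering accuracy under limited visual context or on encoding efficiency, and overlooks practical deployability under realistic resource constraints. The paper fills this gap with a systematic evaluation scheme.

Result: Extensive experiments on multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research.

Insight: The contribution is a unified evaluation framework that normalizes accessible historical visual context through a fixed-capacity memory bank and jointly evaluates visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability, bringing evaluation closer to real applications.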

Abstract: Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off among efficiency, storage, and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Code will be released at https://github.com/wwgTang-111/StreamingEval1.
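To make the fixed-capacity memory bank concrete, here is a minimal sketch that caps the historical context a model may access. The `MemoryBank` class and the first-in-first-out eviction policy are assumptions for illustration only; the abstract does not say how StreamingEval chooses which frames to keep.

```python
from collections import deque

class MemoryBank:
    """Fixed-capacity FIFO store of per-frame features (hypothetical)."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._frames = deque(maxlen=capacity)

    def add(self, timestamp: float, feature) -> None:
        self._frames.append((timestamp, feature))

    def context(self):
        """Return the stored (timestamp, feature) pairs, oldest first."""
        return list(self._frames)

bank = MemoryBank(capacity=3)
for t in range(5):
    bank.add(float(t), f"feat{t}")  # strings stand in for feature tensors
```

Because every model under test sees at most `capacity` entries, differences in accuracy or latency can be attributed to the model rather than to how much history it was allowed to store.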


[119] Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification cs.CV

Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge

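TL;DR: This paper proposes a parameter-efficient prompt tuning method and a hierarchical textual guidance strategy for few-shot whole slide image (WSI) classification. The method scales and shifts text encoder features to reduce trainable parameters and computational overhead, and uses soft hierarchical textual guidance to avoid the information loss caused by hard instance filtering. Experiments on several pathology datasets show that it outperforms existing methods in both classification and weakly supervised tumor localization.

Details

Motivation: In few-shot weakly supervised WSI classification, existing methods based on vision-language models have many trainable parameters and high inference overhead, and they lose information by discarding instances with low alignment.

Result: On pathology datasets covering breast, lung, and ovarian cancer, classification accuracy improves by up to 10.9%, 7.8%, and 13.8%, respectively, over the previous state of the art. Trainable parameters fall by 18.1% on the breast and lung cancer datasets and by 5.8% on the ovarian cancer dataset, and the method also performs well on weakly supervised tumor localization.

Insight: The contributions are a parameter-efficient prompt tuning method (scaling and shifting text features) and a soft hierarchical textual guidance strategy. Together they exploit the knowledge in pre-trained vision-language models and the inherent hierarchical structure of WSIs while avoiding the information loss of hard filtering.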

Abstract: Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter-efficient prompt tuning method that scales and shifts features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy that avoids hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and by 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
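The scale-and-shift idea mentioned in this abstract can be sketched as a per-dimension affine transform applied to frozen features. This is a generic illustration, not the authors' implementation: the names `scale_shift`, `gamma`, and `beta`, and the choice to apply the transform to a single feature vector, are assumptions.

```python
def scale_shift(features, gamma, beta):
    """Apply a per-dimension affine transform: gamma * x + beta."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("all vectors must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

dim = 4
gamma = [1.0] * dim  # identity scale at initialization
beta = [0.0] * dim   # zero shift at initialization
frozen_text_feat = [0.5, -1.0, 2.0, 0.0]  # stand-in for a frozen encoder output
out = scale_shift(frozen_text_feat, gamma, beta)
```

In this sketch only `gamma` and `beta` would be trained, which is 2 × `dim` values per place the transform is applied, while the encoder weights stay frozen. Starting from the identity leaves the pre-trained features unchanged at the beginning of tuning.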


[120] Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection cs.CV

Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han

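TL;DR: This paper proposes BTP (Back To Point), a zero-shot 3D anomaly detection framework that applies pre-trained point-language models (PLMs) directly to anomaly detection and localization in 3D point clouds. It aligns multi-granularity point cloud patch features with textual representations, adds geometric descriptors to increase sensitivity to structural anomalies, and uses auxiliary point cloud data for joint representation learning to improve robustness and enrich anomaly semantics.

Details

Motivation: Existing zero-shot 3D anomaly detection methods usually render point clouds into 2D images and rely on pre-trained vision-language models (VLMs), which discards geometric detail and is insensitive to local anomalies. The paper explores pre-trained PLMs that process point clouds directly, to better preserve geometric information and improve detection.

Result: Extensive experiments on the Real3D-AD and Anomaly-ShapeNet benchmarks show that BTP achieves superior performance in zero-shot 3D anomaly detection.

Insight: The novelty is applying point-language models directly to 3D anomaly detection, using multi-granularity feature alignment and geometric descriptors to increase sensitivity to local and structural anomalies, and adding a joint representation learning strategy driven by auxiliary data to enrich semantic representations. This avoids the information loss of 2D rendering and offers a new direction for point-cloud-based zero-shot anomaly detection.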

Abstract: Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.


[121] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection cs.CV

Xinghan Li, Junhao Xu, Jingjing Chen

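TL;DR: This paper proposes VIGIL, a structured, facial-part-centric framework for generalizable deepfake detection built on multimodal large language models (MLLMs). It follows a plan-then-examine pipeline: the model first plans which facial regions to inspect, then examines each region independently to gather forensic evidence, with a stage-gated injection mechanism ensuring that region selection is not influenced by external evidence. The paper also proposes a progressive three-stage training paradigm that includes reinforcement learning, and builds the OmniFake benchmark for rigorous evaluation.

Details

Motivation: Current MLLM-based deepfake detectors merge evidence generation and manipulation localization into a single step, which blurs the line between faithful observation and hallucinated explanation and makes conclusions unreliable. The paper aims to mirror expert forensic practice with a structured, interpretable framework that improves reliability and generalization.

Result: On OmniFake (a hierarchical benchmark with five difficulty levels) and in cross-dataset evaluation, VIGIL consistently outperforms expert detectors and concurrent MLLM-based methods at every generalizability level.

Insight: The core contribution is decoupling detection into separate planning and examination stages and using stage-gated injection to isolate region selection from external evidence, which makes the reasoning more faithful and interpretable. The progressive training paradigm (in particular the reinforcement learning stage with part-aware rewards) and the hierarchical OmniFake benchmark for testing generalization are also useful reference points.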

Abstract: Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model’s own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.


[122] PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation cs.CV

Gensheng Pei, Xiruo Jiang, Xinhao Cai, Tao Chen, Yazhou Yao

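TL;DR: PEARL is a training-free, plug-and-play method for open-vocabulary semantic segmentation. Its two steps, Procrustes alignment and text-aware Laplacian propagation, exploit cross-modal geometry without post-processing, auxiliary backbones, or extra data, and achieve state-of-the-art performance on standard benchmarks.

Details

Motivation: Existing training-free open-vocabulary semantic segmentation methods either depend on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underused; methods that add extra vision backbones or multi-model pipelines increase complexity and latency. The paper addresses these problems.

Result: PEARL achieves state-of-the-art performance on standard benchmarks under both with-background and without-background protocols, without extra data or auxiliary backbones.

Insight: The contribution is a compact two-step inference procedure (align, then propagate). First, an orthogonal projection inside the self-attention block rotates keys toward the query subspace using a stable polar iteration. Second, a confidence-weighted, text-guided graph solve refines per-pixel logits on a small grid, where text supplies both a data-trust signal and neighbor gating and image gradients preserve boundaries. The method is fully training-free, uses only fixed constants, and adds minimal latency through a lightweight per-head projection and a few conjugate-gradient steps.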

Abstract: Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL, Procrustes alignment with text-aware Laplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. PEARL sets a new state of the art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
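The Procrustes alignment step can be illustrated with the classical orthogonal Procrustes problem: find the orthogonal matrix R that minimizes ||K R - Q||_F, which is the orthogonal polar factor of K^T Q. The sketch below computes that factor with a Newton-Schulz iteration. The abstract says only "a stable polar iteration", so the choice of Newton-Schulz, the Frobenius scaling, the iteration count, and all function names are assumptions, and the code works on plain matrices rather than per-head attention tensors.

```python
import numpy as np

def polar_orthogonal(m: np.ndarray, iters: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor of a full-rank square matrix
    with the Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    x = m / np.linalg.norm(m)  # Frobenius scaling keeps singular values <= 1
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def procrustes_rotation(keys: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||keys @ R - queries||_F."""
    return polar_orthogonal(keys.T @ queries)

# Synthetic check: queries are an exact rotation of the keys.
rng = np.random.default_rng(0)
keys = rng.standard_normal((50, 4))
r_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
queries = keys @ r_true
r_est = procrustes_rotation(keys, queries)
aligned_keys = keys @ r_est
```

The iteration needs only matrix multiplications, which suits GPU execution inside an attention block; an SVD-based solution would give the same rotation but is usually slower for many small per-head matrices.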


[123] PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models cs.CV

Yiwei Xie, Zheng Zhang, Ping Liu

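TL;DR: This paper proposes PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in text-to-video (T2V) diffusion models. The protocol optimizes a lightweight pseudo-token embedding with a denoising reconstruction objective and a new latent alignment constraint. Evaluating three T2V architectures, three concept categories, and three erasure strategies, it finds that existing methods achieve output-level suppression rather than representational removal, and it identifies temporal re-emergence as a video-specific failure mode.

Details

Motivation: Current evaluation of concept erasure in T2V diffusion models only checks whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal and leaving residual concept capacity undiagnosed.

Result: Systematic experiments across three T2V architectures, three concept categories, and three erasure strategies show that every tested method leaves measurable residual capacity whose robustness correlates with intervention depth, and they reveal temporal re-emergence as a video-specific failure mode.

Insight: The contributions are a multi-level evaluation framework (classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation), a diagnostic protocol that quantifies reactivation potential, and the finding that current erasure methods suppress outputs without removing the underlying representation.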

Abstract: Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.


[124] From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy cs.CV

Bi’an Du, Daizong Liu, Pufan Li, Wei Hu

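TL;DR: This paper proposes a generative world model with an adaptive part-whole hierarchy for producing 3D models from a single image. The model discovers latent structural slots autonomously by inferring soft compositional masks from image tokens, uses an adaptive slot-gating mechanism to set activation probabilities and merge redundant slots, and aligns the distilled slots to a learnable, class-agnostic prototype bank for cross-category shape sharing and denoising.

Details

Motivation: Existing single-image 3D generation methods struggle under sparse supervision to generalize across diverse semantic categories and highly variable structural complexity. Monolithic modeling or a fixed part count leads to overfitting, fragmented or missing structure, and limited compositional generalization.

Result: Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm the complementary benefits of the prototype bank for shape-prior sharing and of slot gating for structural adaptation.

Insight: The novelty is recasting single-image 3D generation as learning an adaptive part-whole hierarchy in a flexible 3D latent space. Adaptive slot gating and a class-agnostic prototype bank enable dynamic structure discovery and cross-category geometry sharing, improving generalization and structural expressiveness.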

Abstract: Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.


[125] Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning cs.CV

Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim

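TL;DR: This paper improves weakly-supervised video scene graph generation with a learnable pair affinity, addressing the noise from noninteractive objects introduced by off-the-shelf object detectors. Through Pair Affinity Learning and Scoring (PALS) and Pair Affinity Modulation (PAM), the model suppresses noninteractive pairs and focuses on meaningful visual relations. Relation-Aware Matching (RAM) uses vision-language grounding to provide cleaner supervision for affinity learning.

Details

Motivation: Weakly-supervised video scene graph generation (WS-VSGG) aims to reduce annotation cost, but existing methods rely on off-the-shelf detectors for object proposals. These detectors detect every visible object indiscriminately, flooding the relation model with noisy, noninteractive pairs, which is a fundamental difference from fully-supervised pipelines.

Result: Extensive experiments on Action Genome show substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.

Insight: The novelty is a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. PALS and PAM integrate it into inference-time ranking and contextual reasoning, and RAM uses vision-language grounding to resolve class-level ambiguity in pseudo-label generation, so noisy pairs are handled more effectively.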

Abstract: Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out noninteractive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress noninteractive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
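One way to picture how a pair affinity could enter inference-time ranking is to scale each candidate triplet's predicate score by the affinity of its subject-object pair. The multiplicative combination, the dictionary layout, and the function name below are assumptions for illustration; the abstract does not give the PALS scoring formula.

```python
def rank_triplets(candidates):
    """Sort candidate triplets by affinity * predicate score, highest first."""
    return sorted(
        candidates,
        key=lambda c: c["affinity"] * c["pred_score"],
        reverse=True,
    )

candidates = [
    # Likely interacting pair with a moderately confident predicate.
    {"triplet": ("person", "holding", "cup"), "affinity": 0.9, "pred_score": 0.6},
    # Confident predicate, but the pair is unlikely to interact at all.
    {"triplet": ("person", "near", "shelf"), "affinity": 0.1, "pred_score": 0.8},
]
ranked = rank_triplets(candidates)
```

Here the low-affinity pair drops below the interacting pair despite its higher predicate score, which is the suppression effect the abstract describes.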


[126] Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection cs.CV

Mingle Zhou, Jiahui Liu, Jin Wan, Gang Li, Min Li

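TL;DR: This paper proposes an unsupervised continuous anomaly detection framework based on multimodal prompts. It builds a continual multimodal prompt memory bank that progressively extracts and retains prototypical normal patterns from the visual and textual domains, and designs a defect-semantic-guided adaptive fusion mechanism to improve detection accuracy and adversarial robustness.

Details

Motivation: Existing unsupervised continuous anomaly detection methods rely only on visual information and cannot fully capture normal patterns in complex scenes, which limits further gains in detection accuracy.

Result: Benchmark experiments on the MVTec AD and VisA datasets show state-of-the-art performance on image-level AUROC and pixel-level AUPR.

Insight: The novelty is bringing multimodal (visual and textual) information into the representation of normality and using a memory bank with an adaptive fusion mechanism to achieve effective anomaly detection under continual learning.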

Abstract: Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.


[127] CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation cs.CV | cs.AI | cs.DB | cs.LG | cs.RO

Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland

TL;DR: 本文提出了CataractSAM-2,这是一个基于Meta的Segment Anything Model 2进行领域适配的模型,旨在以高精度对白内障眼科手术视频进行实时语义分割。该模型位于计算机视觉与医疗机器人技术的交叉点,可为机器人辅助和计算机引导的手术系统提供精确的术中感知。此外,为了减轻人工标注的负担,作者引入了一个结合稀疏提示与基于视频的掩码传播的交互式标注框架,显著减少了标注时间,促进了高质量真实掩码的可扩展创建,加速了眼科前节手术数据集的开发。模型还展示了对青光眼小梁切除术的强零样本泛化能力,证实了其跨手术程序的实用性和更广泛的手术应用潜力。

Details

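Motivation: To address the challenge of real-time, high-accuracy semantic segmentation in cataract surgery videos in support of robot-assisted and computer-guided surgical systems, and to develop an efficient annotation tool that reduces the heavy manual labeling burden of building high-quality surgical video segmentation datasets.

Result: The model achieves high-accuracy real-time segmentation on anterior segment surgery videos, cataract surgery in particular; the annotation tool significantly reduces annotation time; and the model shows strong zero-shot generalization to glaucoma trabeculectomy, indicating cross-procedural utility.

Insight: The main novelty is specializing a general-purpose large-scale segmentation model (SAM 2) to a specific surgical video setting (cataract surgery) through domain adaptation, yielding high-accuracy real-time segmentation. The paper also proposes an interactive annotation framework that combines sparse user prompts with temporal mask propagation across video frames, an efficient and scalable way to speed up medical video dataset construction. More broadly, pairing a strong vision foundation model with domain-specific needs (medical robotics, surgical understanding) and shipping tools that lower the annotation barrier is an effective pattern for deploying AI in specialized fields.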

Abstract: We present CataractSAM-2, a domain-adapted extension of Meta’s Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model’s strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.


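[128] Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs cs.CV

Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Sai Praneeth Karimireddy, Rahul Gupta

TL;DR: This paper proposes the Compositional Privacy Risk Taxonomy (CPRT), a framework that treats visual privacy as a compositional property rather than a binary classification, grades attributes by standalone identifiability and compositional harm potential, and is accompanied by a dataset of 6.7K images. Frontier vision-language models (VLMs) assess compositional privacy risk fairly well when given structured guidance but systematically underestimate composition-driven risk, while smaller models struggle with graded privacy reasoning. The authors therefore also develop a deployable 8B supervised fine-tuned model that closely matches frontier-model performance on compositional privacy assessment.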

Details

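Motivation: Most existing visual privacy benchmarks treat privacy as a binary attribute (private or non-private) and ignore its compositional nature: attributes that are harmless on their own can combine into severe privacy violations. A finer-grained assessment framework is therefore needed.

Result: On the constructed 6.7K-image dataset, frontier VLMs align well with compositional privacy severity when given structured guidance but systematically underestimate composition-driven risk; smaller models perform poorly on graded privacy reasoning; and the proposed 8B SFT model comes close to frontier-model performance on compositional privacy assessment.

Insight: The novelty lies in treating privacy as a compositional property, proposing the graded-severity framework CPRT, and pairing it with an interpretable continuous privacy scoring function. The framework offers a new paradigm for fine-grained privacy risk assessment, and the lightweight SFT model balances performance with deployability.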

Abstract: Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.


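[129] ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model cs.CV | cs.AI | cs.CL | cs.LG | cs.RO

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen

TL;DR: This paper proposes ThinkJEPA, a framework that combines a dense-frame latent world model (e.g., V-JEPA2) with a sparsely sampled vision-language model (VLM). A dual-temporal pathway (a dense JEPA branch and a sparse VLM thinker branch) strengthens long-horizon semantic prediction, and the approach is validated on hand-manipulation trajectory prediction.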

Details

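Motivation: Existing latent world models (e.g., V-JEPA2) make dense predictions from a short observation window, so their temporal context is limited, they drift toward local, low-level extrapolation, and they struggle to capture long-horizon semantics. VLMs offer strong semantic grounding and general knowledge, but compute-driven sparse sampling, a language-output bottleneck, and a data-regime mismatch with small action-conditioned datasets make them ill-suited as standalone dense predictors.

Result: In hand-manipulation trajectory prediction experiments, the method outperforms both a VLM-only baseline and a JEPA-predictor-only baseline and produces more robust long-horizon rollouts.

Insight: Contributions include: 1) a dual-temporal pathway framework that combines dense-frame dynamics modeling with long-horizon semantic guidance; 2) a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction, effectively transferring the VLM’s progressive reasoning signals.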

Abstract: Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision–language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM’s progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.


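[130] SARe: Structure-Aware Large-Scale 3D Fragment Reassembly cs.CV

Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo

TL;DR: This paper proposes SARe, a structure-aware framework for large-scale 3D fragment reassembly, consisting of SARe-Gen for assembly generation in Euclidean space and SARe-Refine for inference-time refinement. The method jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies, and uses a query-point-based conditioning scheme to extract aligned local geometric tokens. It then verifies candidate contact edges with geometric-consistency checks, selects reliable substructures, and resamples uncertain regions, which yields more stable and consistent assemblies as the number of fragments grows.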

Details

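Motivation: In large-scale 3D fragment reassembly the target shape is unknown and fragments offer weak semantic cues, so existing end-to-end methods are prone to cascading failures caused by unreliable contact reasoning, especially inaccurate fragment adjacencies.

Result: Evaluated in three settings (synthetic fractures, simulated fractures of scanned real objects, and scans of real physical fractures), SARe achieves state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.

Insight: Contributions include: 1) jointly predicting fracture-surface token probabilities and an inter-fragment contact graph to model contact explicitly; 2) a query-point-based conditioning scheme that extracts aligned local geometric tokens from a frozen geometry encoder, with no additional structural pretraining; 3) an inference-time refinement stage that verifies candidate contact edges with geometric-consistency checks, dynamically selects reliable substructures, and resamples uncertain regions, improving the stability of large-scale fragment assembly.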

Abstract: 3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.


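[131] WorldCache: Content-Aware Caching for Accelerated Video World Models cs.CV | cs.AI | cs.CL | cs.LG

Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan

TL;DR: This paper proposes WorldCache, a content-aware caching framework for accelerating inference in video world models. By introducing motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps, it improves both when and how features are reused, reducing artifacts and motion inconsistencies in dynamic scenes while delivering a substantial speedup with preserved quality.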

Details

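Motivation: Video world models based on Diffusion Transformers are computationally expensive. Existing training-free feature caching methods rely mainly on a zero-order hold assumption and tend to produce ghosting, blur, and motion inconsistencies in dynamic scenes, so a smarter caching mechanism is needed to improve when and how features are reused.

Result: On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching methods.

Insight: The novelty lies in formulating caching as a perception-constrained dynamical system in which motion-adaptive and content-aware mechanisms (such as saliency weighting and phase-aware scheduling) decide feature reuse dynamically. This avoids the limits of the static-snapshot assumption and offers a reusable optimization idea for efficient video generation.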

Abstract: Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
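The zero-order hold baseline that the abstract criticizes can be illustrated with a minimal sketch. This is not WorldCache and not the authors' code: the class name `ZeroOrderHoldCache`, the relative L1 drift measure, and the fixed `threshold` are all assumptions chosen for illustration. The sketch only shows the static-snapshot behavior, where a block's cached output is returned unchanged whenever the input has drifted little since the last recompute.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_drift(current: Vector, reference: Vector) -> float:
    """Relative L1 change between the current input and a reference input."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1.0  # guard against an all-zero reference
    return num / den


class ZeroOrderHoldCache:
    """Hypothetical wrapper: reuse a block's output as a static snapshot."""

    def __init__(self, block: Callable[[Vector], Vector], threshold: float) -> None:
        self.block = block          # the expensive computation being cached
        self.threshold = threshold  # fixed drift threshold (an assumption)
        self.ref_input: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.recomputes = 0

    def __call__(self, x: Vector) -> Vector:
        # Reuse the cached output while drift since the last recompute is small.
        if self.ref_input is not None and relative_drift(x, self.ref_input) < self.threshold:
            return self.cached_output
        # Otherwise recompute and refresh the snapshot.
        self.ref_input = list(x)
        self.cached_output = self.block(x)
        self.recomputes += 1
        return self.cached_output
```

Because the snapshot is returned unchanged, any motion that stays below the threshold is simply ignored, which is one way to picture the ghosting and blur the abstract attributes to this assumption.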


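[132] 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video cs.CV

Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak

TL;DR: This paper proposes 4DGS360, a framework for 360° 4D reconstruction of dynamic objects from monocular video. A novel 3D-native initialization strategy addresses the geometric inconsistency in occluded regions that existing methods suffer from because of their over-reliance on 2D priors, and a new benchmark dataset, iPhone360, enables full 360° evaluation.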

Details

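Motivation: Existing methods rely heavily on 2D priors, so initial points overfit to the visible surfaces in each training view and consistent 360° geometry of dynamic objects is hard to reconstruct. This paper aims to resolve that geometric ambiguity.

Result: On the newly proposed iPhone360 benchmark as well as the iPhone and DAVIS datasets, 4DGS360 achieves state-of-the-art performance both qualitatively and quantitatively.

Insight: The core contribution is AnchorTAP3D, a 3D tracker that uses confident 2D track points as anchors to produce reinforced 3D point trajectories, suppressing drift and providing a reliable initialization that preserves geometry in occluded regions. The iPhone360 dataset additionally supports 360° evaluation that existing datasets cannot provide.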

Abstract: We introduce 4DGS360, a diffusion-free framework for 360° dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360° geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to the visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360° 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views, enabling 360° evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.


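[133] PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation cs.CV

Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo

TL;DR: This paper proposes PGR-Net, a brain tumor MRI segmentation network that targets the redundant computation existing methods spend on large background regions when tumor lesions are small and spatially sparse. The network introduces a data-driven spatial prior set to capture the distribution and scale of tumors, uses a hierarchical Top-K ROI decision mechanism to progressively select the most confident lesion candidate regions, employs the WinGS-ROI module to generate center-enhanced guidance maps that steer feature learning, and adopts a windowed RetNet backbone to improve localization reliability.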

Details

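Motivation: Existing brain tumor MRI segmentation networks often overlook clinically observed spatial priors of tumors and therefore compute redundant features over extensive background regions, even though lesions occupy only a small fraction of the volume, which is a severe spatial sparsity problem.

Result: On the BraTS-2019, BraTS-2023, and MSD Task01 benchmarks, PGR-Net uses only 8.64M parameters and reaches Whole Tumor Dice scores of 89.02%, 91.82%, and 89.67%, respectively, consistently outperforming existing methods and achieving state-of-the-art results.

Insight: Contributions include: 1) an explicit ROI-aware framework in which a data-driven spatial prior set provides global guidance; 2) a hierarchical Top-K ROI decision mechanism that progressively improves lesion localization precision; 3) the WinGS-ROI module, which uses multi-window Gaussian templates and a spatial decay function to generate center-enhanced guidance maps that steer feature learning. Overall, the method combines clinical prior knowledge with hierarchical decisions and guidance-map generation, offering a new approach to class imbalance and sparse targets in medical images.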

Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network), an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M parameters, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.
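For readers unfamiliar with the Dice metric reported above, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. It is not taken from the PGR-Net repository; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total
```

With sparse lesions, a few misclassified voxels move the score noticeably because both the numerator and the denominator only count foreground voxels, which is why Dice is commonly preferred over plain accuracy in this setting.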


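[134] Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition cs.CV

Wen Guo, Pengfei Zhao, Zongmeng Wang, Yufan Hu, Junyu Gao

TL;DR: This paper proposes TCEI (Test-time Calibration from Experience and Intuition), a dual-level adaptation framework that addresses the performance drop in multi-object tracking (MOT) caused by distribution shift between training and test data. Modeled on human decision-making, the framework makes fast predictions with an intuitive system and calibrates them with an experiential system, adapting to environmental changes at test time.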

Details

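Motivation: Existing test-time adaptation (TTA) methods work poorly for MOT, mainly because they focus only on frame-level adaptation and neglect temporal consistency and identity association across frames and videos. This paper aims to remove that limitation and improve robustness under distribution shift.

Result: Extensive experiments on multiple benchmark datasets show that TCEI consistently achieves superior performance and significantly improves adaptability under distribution shift.

Insight: The novelty lies in bringing the intuition and experience mechanisms of human decision-making into test-time adaptation. The two systems (intuitive and experiential) work together, and confident and uncertain objects seen during testing serve as historical priors and reflective cases, respectively, so that the framework handles both rapid adaptation and long-term calibration and improves spatio-temporal consistency in MOT.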

Abstract: Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model’s adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.


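[135] FedCVU: Federated Learning for Cross-View Video Understanding cs.CV | cs.LG

Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang

TL;DR: FedCVU is a federated learning framework for cross-view video understanding that addresses heterogeneous data distributions, misaligned representations, and high communication overhead in cross-view scenarios. It has three core components: VS-Norm handles view-specific statistics, CV-Align improves cross-view representation alignment through a lightweight contrastive regularization module, and SLA reduces communication cost through selective layer aggregation.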

Details

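Motivation: Federated learning for cross-view video understanding faces three challenges: heterogeneous viewpoints and backgrounds lead to non-IID data distributions and overfitting; local distribution biases cause misaligned representations that hinder consistent cross-view semantics; and large video models incur prohibitive communication overhead.

Result: In extensive experiments on action understanding and person re-identification under a cross-view protocol, FedCVU consistently improves unseen-view accuracy while maintaining strong seen-view performance, outperforms state-of-the-art federated learning baselines, and is robust to domain heterogeneity and communication constraints.

Insight: Contributions include a mechanism that preserves view-specific normalization parameters (VS-Norm) to handle non-IID data, a lightweight cross-view contrastive alignment module (CV-Align) that improves representation consistency, and a selective layer aggregation strategy (SLA) that lowers communication overhead while maintaining accuracy. These ideas offer a reusable approach for applying federated learning to heterogeneous vision tasks.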

Abstract: Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
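The idea of keeping normalization parameters on each client while averaging only a subset of layers can be sketched as below. This is a generic illustration under stated assumptions, not FedCVU's VS-Norm or SLA: the function names, the rule that any parameter whose name contains "norm" stays local, and the weighted-average aggregation are all hypothetical choices.

```python
from typing import Callable, Dict, List

State = Dict[str, List[float]]


def aggregate_shared(
    client_states: List[State],
    weights: List[float],
    is_local: Callable[[str], bool] = lambda name: "norm" in name,
) -> State:
    """Weighted average of the parameters that are NOT marked as client-local."""
    total = sum(weights)
    update: State = {}
    for name, values in client_states[0].items():
        if is_local(name):
            continue  # view-specific parameters are never sent or averaged
        update[name] = [
            sum(w * state[name][i] for state, w in zip(client_states, weights)) / total
            for i in range(len(values))
        ]
    return update


def apply_update(local_state: State, update: State) -> State:
    """Overwrite shared parameters; client-local parameters are left untouched."""
    merged = dict(local_state)
    merged.update(update)
    return merged
```

Only the entries returned by `aggregate_shared` would need to travel between clients and server, so restricting which layers are shared reduces communication as well as protecting per-view statistics.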


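[136] OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging cs.CV

Meilin Liu, Jiaying Wang, Jing Shan

TL;DR: This paper proposes OmniFM, a modality-robust and task-agnostic federated learning framework for heterogeneous medical imaging. Building on the frequency-domain observation that low-frequency spectral components are consistent across modalities, it uses Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix-Suffix Spectral Prompting to support classification, segmentation, super-resolution, visual question answering, and multimodal fusion in one framework without re-engineering the optimization pipeline.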

Details

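Motivation: Existing federated learning frameworks are usually tightly coupled to task-specific models and fragile under heterogeneous imaging modalities, which limits deployment in real clinical settings where modality distributions vary widely and many downstream tasks must be supported.

Result: Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art federated learning baselines under both intra- and cross-modality heterogeneity, with superior results in both fine-tuning and training-from-scratch settings.

Insight: The novelty lies in using the observation that low-frequency components encode modality-invariant anatomical structure to build a unified, task- and modality-agnostic federated learning framework. The specific techniques are Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, Prefix-Suffix Spectral Prompting, and a Spectral-Proximal Alignment objective, which together stabilize aggregation and align representations.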

Abstract: Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.


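[137] Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis cs.CV | cs.AI | cs.GR | cs.LG | cs.MM

Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan

TL;DR: This paper proposes a cross-scenario deraining adaptation framework. Using superpixel structural priors and multi-stage pseudo-rain synthesis, it generates pseudo-paired data from rain-free background images of the target domain, bridging the domain gap between synthetic training data and real rain scenes and improving generalization to unseen distributions.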

Details

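Motivation: Existing deep deraining models train well on synthetic data but generalize poorly to complex real rain scenes, mainly because of the domain gap between synthetic data and the physical dynamics of real rain. This paper aims to achieve cross-scenario adaptation using only rain-free backgrounds, without paired rainy images from the target domain.

Result: Experiments on several state-of-the-art deraining models show PSNR gains of 32% to 59% on out-of-distribution (OOD) domains, along with much faster training convergence.

Insight: Contributions include a superpixel generation module that extracts stable structural priors, a resolution-adaptive fusion strategy that aligns source-domain structures with target-domain backgrounds, and a pseudo-label re-synthesis mechanism based on multi-stage noise generation. The framework can be plugged into any deraining architecture to improve cross-scenario adaptability.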

Abstract: Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.
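Since the headline result is stated as a PSNR gain, a minimal sketch of how PSNR is conventionally computed may help. The code below is the textbook definition on flattened pixel lists, not the authors' evaluation script. The helper `relative_gain_percent` is a hypothetical illustration of one way a percentage gain could be derived; the abstract does not say how its percentages were calculated.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_gain_percent(before_db: float, after_db: float) -> float:
    """Relative change between two PSNR values, in percent (assumed convention)."""
    return (after_db - before_db) / before_db * 100.0
```

Because PSNR is logarithmic, a relative gain in dB is not the same as a relative reduction in mean squared error, so percentage figures are best read alongside the absolute dB values.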


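[138] HumanOmni-Speaker: Identifying Who said What and When cs.CV

Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma

TL;DR: Targeting the "illusion of competence" that existing omni-modal large language models show on multi-person conversational dynamics, this paper proposes the HumanOmni-Speaker model and the VR-SDR paradigm. A Visual Delta Encoder samples raw video at 25 fps and compresses inter-frame motion residuals into only 6 tokens per frame, which lets the model capture fine-grained visemes and speaker trajectories, perform end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and do well across a wide range of speaker-centric tasks.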

Details

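Motivation: Existing omni-modal large language models have fundamental weaknesses in understanding complex multi-person conversations, that is, in accurately answering "who said what and when". These include reliance on visual biases and low-frame-rate sampling that discards high-frequency dynamics.

Result: On the proposed HumanOmni-Speaker benchmark, the model shows strong multimodal synergy and achieves superior performance across a wide range of speaker-centric tasks.

Insight: The core contributions are the Visual-Registered Speaker Diarization and Recognition paradigm with its accompanying benchmark, which removes visual shortcuts, and the Visual Delta Encoder, which captures fine-grained dynamics by efficiently compressing inter-frame motion residuals. This avoids token explosion and enables true end-to-end spatio-temporal identity binding.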

Abstract: While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
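The token budget implied by the stated sampling rate and per-frame token count is easy to work out, and the notion of an inter-frame residual can be shown with plain frame differencing. The sketch below is only arithmetic and a toy difference operator; it says nothing about how the Visual Delta Encoder actually compresses residuals, and both function names are hypothetical.

```python
from typing import List

Frame = List[float]


def frame_residuals(frames: List[Frame]) -> List[Frame]:
    """Element-wise difference between each frame and the one before it."""
    return [
        [cur - prev for cur, prev in zip(cur_frame, prev_frame)]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]


def visual_token_budget(duration_s: float, fps: int = 25, tokens_per_frame: int = 6) -> int:
    """Visual tokens for a clip, using the 25 fps and 6 tokens/frame from the abstract."""
    return int(round(duration_s * fps)) * tokens_per_frame
```

At these settings one second of video costs 150 visual tokens and a ten-second clip costs 1500, which gives a sense of why a small per-frame token count matters when sampling at full video rate.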


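[139] PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma cs.CV

Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan

TL;DR: PPGL-Swarm is an agent-driven system for diagnosing pheochromocytomas and paragangliomas (PPGLs). It decomposes diagnosis into micro-tasks assigned to specialized agents and automatically generates a comprehensive report containing GAPP scoring, genotype risk alerts, and multimodal evidence, with the aim of improving diagnostic efficiency and accuracy.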

Details

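Motivation: In clinical PPGL diagnosis, GAPP scoring is labor-intensive and subjective and does not account for key gene-mutation risk factors, while existing intelligent diagnostic systems lack traceable reasoning and do not integrate domain knowledge.

Result: The abstract reports no quantitative results or benchmark comparisons. It states that the system generates a multimodal report with automated GAPP scoring (including quantified cellularity and Ki-67), genotype risk alerts, and integrated evidence.

Insight: The novelty lies in a swarm-of-agents architecture that decomposes a complex diagnosis into auditable micro-tasks, combined with knowledge enhancement (for the gene and table agents) and reinforcement learning to refine tool selection and task assignment, yielding an automated diagnostic workflow that is interpretable, multimodal, and grounded in domain knowledge.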

Abstract: Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring imposes a high workload on clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.


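[140] Rethinking Token Reduction for Large Vision-Language Models cs.CV | cs.AI

Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang

TL;DR: This paper proposes MetaCompress, a learning-based, prompt-agnostic method that tackles the high inference cost caused by excessive visual tokens when large vision-language models perform multi-turn visual question answering. It unifies token reduction as a learnable compression mapping, learns compression strategies through an efficient training paradigm, and achieves a superior efficiency-accuracy trade-off while generalizing well across dialogue turns.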

Details

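Motivation: Existing token reduction methods mainly target single-turn visual question answering. In the multi-turn setting, follow-up questions are unknown in advance and may refer to arbitrary image regions, so methods that depend on the prompt or on heuristic metrics perform poorly. A general token compression method suited to multi-turn dialogue is needed.

Result: Extensive experiments on multi-turn VQA benchmarks and across multiple large vision-language model architectures show that MetaCompress achieves a superior efficiency-accuracy trade-off while maintaining strong generalization across dialogue turns.

Insight: The novelty lies in unifying token reduction under a learnable compression-mapping framework, which avoids the limitations of heuristic designs, and in introducing a data-efficient training paradigm. The result can be viewed as a general prompt-agnostic compression method suited to dynamic multi-turn interaction.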

Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
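One way to picture how pruning and merging can both be written as a single compression mapping is to treat the mapping as an M × N matrix applied to N visual tokens. The sketch below uses hand-built matrices and plain Python lists; it is an illustration of the general idea only, not MetaCompress, whose mapping is learned. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def compress(tokens: Matrix, mapping: Matrix) -> Matrix:
    """Apply an M x N mapping to N tokens of dimension D, giving M tokens."""
    dim = len(tokens[0])
    return [
        [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        for row in mapping
    ]


def pruning_mapping(keep: List[int], n: int) -> Matrix:
    """Each output row is one-hot: it copies a single kept token."""
    return [[1.0 if j == k else 0.0 for j in range(n)] for k in keep]


def merging_mapping(groups: List[List[int]], n: int) -> Matrix:
    """Each output row averages the tokens in one group."""
    return [[1.0 / len(g) if j in g else 0.0 for j in range(n)] for g in groups]
```

Under this view pruning and merging differ only in the entries of the mapping, so replacing the hand-built matrix with trainable parameters is a natural next step, which is consistent with the abstract's description of a single learning objective.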


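[141] Getting to the Point: Why Pointing Improves LVLMs cs.CV

Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

TL;DR: This paper studies how adding a pointing mechanism to large vision-language models affects zero-shot counting. Comparing two fine-tuning approaches, direct counting and point-then-count, it finds that pointing improves out-of-distribution generalization and reveals spatial biases in coordinate prediction.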

Details

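Motivation: Pointing has been shown to improve the accuracy and explainability of LVLMs, but the mechanism behind the gains, its relevance in cognitive tasks, and the reliability of the intermediate points remain unclear, which limits its use as a visual explanation.

Result: On zero-shot counting, Point-then-Count generalizes better out of distribution than Direct Counting. Coordinate prediction reaches an F1 score above 89% but exhibits spatial biases.

Insight: By modeling grounding and reasoning as explicit sequential steps, pointing helps the model learn skills rather than overfit to narrow tasks. The spatial information encoded in the coordinates is the key mechanism behind the counting gains, but prediction reliability varies across image regions, exposing spatial biases in the model.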

Abstract: Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs’ accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects’ coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.


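[142] Let’s Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts cs.CV | cs.AI

Xu Liu, Yongheng Zhang, Qiguang Chen, Yao Li, Sheng Wang

TL;DR: This paper proposes DaP-ICoT, a new interleaved-modal chain-of-thought reasoning framework that addresses the static, inefficient insertion of visual information and the incoherent visual representations in existing methods. Its two core components, Dynamic Visual Thought Integration and Precise Visual Thought Guidance, introduce visual inputs adaptively according to reasoning needs and keep visual semantics coherent. Experiments show state-of-the-art performance on multiple benchmarks with far fewer inserted images and much lower token consumption.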

Details

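Motivation: Current interleaved-modal chain-of-thought methods have two main limitations. The first is static visual thought positioning: visual information is inserted at fixed steps, which makes reasoning inefficient and inflexible. The second is broken visual thought representation: visual tokens are discontinuous and semantically incoherent. This paper aims to fix both and thereby improve the efficiency and quality of multimodal reasoning.

Result: Experiments across multiple benchmarks and models show that DaP-ICoT achieves state-of-the-art performance. It also significantly reduces the number of inserted images, cutting token consumption by 72.6% and enabling more efficient interleaved-modal chain-of-thought reasoning.

Insight: The novelty lies in the Dynamic Visual Thought Integration and Precise Visual Thought Guidance mechanisms: the former introduces visual information adaptively and on demand, and the latter keeps visual representations semantically coherent and contextually aligned. This dynamic and precise handling of visual thoughts gives multimodal reasoning a more flexible and efficient framework, and the idea may transfer to other tasks that combine visual and linguistic information.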

Abstract: Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption and enabling more efficient ICoT reasoning.


[143] Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends cs.CVPDF

Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza

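TL;DR: This paper proposes an image-conditioned reinforcement learning framework for online adaptive tuning of visual odometry frontend parameters. A lightweight CNN encoder maps image content to feature detection and tracking parameters, allowing dynamic adjustment to scene texture, illumination, and similar factors. Trained in simulation, it markedly lengthens feature tracks and lowers computational cost.

Details

Motivation: Existing sparse direct and semi-direct visual odometry systems depend on hand-tuned, fixed hyperparameters that cannot adapt to changes in texture density, illumination, motion blur, and other scene conditions, leading to brittle performance in real deployments. The paper aims to embed expert knowledge via reinforcement learning so that frontend parameters are optimized adaptively online.

Result: Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, with training done entirely in simulation.

Insight: The novelty lies in formulating frontend parameter configuration as a sequential decision-making problem and, for the first time, using image content as the reinforcement learning observation, so the system can adapt ahead of scene changes rather than rely on after-the-fact VO statistics. The lightweight texture-aware CNN encoder and the privileged-critic training scheme improve the policy's generalization and efficiency.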

Abstract: Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.


[144] The Universal Normal Embedding cs.CV | eess.IVPDF

Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

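TL;DR: This paper proposes the Universal Normal Embedding (UNE) hypothesis, which holds that generative models and vision encoders share an approximately Gaussian latent space. Using the newly built NoiseZoo dataset, it verifies that inverted diffusion noise and encoder representations align along semantic linear directions, enabling controllable image editing without model modification.

Details

Motivation: To address the separation between generative models and vision encoders that arises from their different objectives, the paper explores the Gaussianity their latent spaces have in common and hypothesizes a shared latent source that unifies the two.

Result: On CelebA, linear probes in both the diffusion-noise space and the encoder-representation space yield strong, aligned attribute predictions, and orthogonalization reduces spurious entanglement, supporting the UNE hypothesis.

Insight: The novelty lies in proposing the UNE hypothesis and empirically verifying the geometric consistency of generative and encoding latent spaces, using linear directions for controllable editing and offering a new perspective on unifying generation and understanding tasks.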

Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
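The "simple orthogonalization" mentioned in the abstract is not specified further. A minimal sketch, assuming it means removing from one attribute direction its projection onto another (a single Gram-Schmidt step), might look like the following. The vectors, attribute names, and function names are hypothetical toy values, not taken from the paper's code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(direction, nuisance):
    """Remove from `direction` its component along `nuisance` (one Gram-Schmidt step)."""
    scale = dot(direction, nuisance) / dot(nuisance, nuisance)
    return [d - scale * n for d, n in zip(direction, nuisance)]


def edit_latent(latent, direction, strength):
    """Move a latent along the unit-normalized direction by `strength`."""
    norm = math.sqrt(dot(direction, direction))
    return [z + strength * d / norm for z, d in zip(latent, direction)]


# Toy 3-D example: a hypothetical "smile" direction entangled with "age".
smile = [1.0, 1.0, 0.0]
age = [0.0, 1.0, 0.0]
smile_clean = orthogonalize(smile, age)

# Editing along the cleaned direction leaves the "age" coordinate untouched.
edited = edit_latent([0.0, 0.0, 0.0], smile_clean, 2.0)
```

In this toy case the cleaned direction has zero inner product with the age direction, so an edit along it does not move the latent along age.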


[145] Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing cs.CVPDF

Yaelle Zribi, Florian Cafiero, Vincent Lépinay, Chahan Vidal-Gorène

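TL;DR: This paper introduces TIC-TALK, a multimodal dataset and processing pipeline for studying comedic timing in stand-up comedy. The dataset covers 90 professionally filmed comedy specials (2015-2024) and provides more than 5,400 temporally aligned topic segments that integrate language, gesture, and audience response (laughter) data. The pipeline combines BERTopic for topic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose estimation. As a use case, the paper analyzes laughter dynamics across 24 topics and finds that kinetic energy correlates negatively with laughter rate, that personal and bodily topics elicit more laughter than geopolitical ones, and that close-up shot proportion correlates positively with laughter.

Details

Motivation: Existing research on stand-up comedy and humor relies too heavily on textual content and overlooks key elements of live performance such as the performer's physical presence and audience feedback. The paper aims to provide a resource for studying comedic timing, a core element of performance, by building a multimodal dataset that integrates language, pose, and audience response.

Result: The case study on the dataset shows that performer kinetic energy is markedly negatively correlated with audience laughter rate (r = -0.75), consistent with a stillness-before-punchline pattern; personal and bodily topics elicit more laughter than geopolitical ones; and close-up shot proportion correlates positively with laughter rate (r = +0.28).

Insight: The novelty lies in building a large-scale, multimodal, temporally aligned stand-up comedy dataset together with a reproducible processing pipeline. Viewed objectively, using continuous kinematic signals (arm spread, kinetic energy, trunk lean) as proxies for performance dynamics, and retaining raw skeletal coordinates rather than pre-clustering them, leaves flexibility for later fine-grained analysis and usefully complements traditional approaches based on text or discrete action categories.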

Abstract: Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
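The abstract does not give the exact formula for the kinetic-energy signal. As a rough sketch, assuming unit mass per joint and finite differences between consecutive 1 fps poses, a kinetic-energy proxy and the Pearson correlation used to relate it to laughter rate could be computed as below. The poses and per-topic values are made-up toy numbers, not data from TIC-TALK.

```python
import math


def kinetic_energy(prev_pose, next_pose, dt=1.0):
    """0.5 * sum of squared joint speeds, assuming unit mass per joint.

    Each pose is a list of (x, y) keypoints; dt is the frame spacing in seconds.
    """
    return 0.5 * sum(
        ((x1 - x0) / dt) ** 2 + ((y1 - y0) / dt) ** 2
        for (x0, y0), (x1, y1) in zip(prev_pose, next_pose)
    )


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy two-joint poses one second apart (the real data has 17 joints).
pose_a = [(0.0, 0.0), (1.0, 1.0)]
pose_b = [(1.0, 0.0), (1.0, 3.0)]
energy = kinetic_energy(pose_a, pose_b)

# Made-up per-topic averages: more movement paired with less laughter.
topic_energy = [1.0, 2.0, 3.0, 4.0]
topic_laughter = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(topic_energy, topic_laughter)
```

With these toy values the two series are perfectly anti-correlated, which only illustrates the sign convention behind the reported negative r.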


[146] Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction cs.CVPDF

Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le, Hyunseung Choo

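TL;DR: This paper proposes Clinical Graph-Mediated Distillation (CGMD), a framework for transferring hypertension knowledge extracted from MRI to a fundus image model when paired multimodal data are unavailable. It builds a cross-modal clinical-similarity kNN graph with shared structured biomarkers as the bridge, trains an MRI teacher and propagates its representations over the graph to produce brain-informed representation targets for fundus patients, and finally trains a fundus student with a joint objective.

Details

Motivation: Fundus imaging is low-cost and scalable, but hypertension-related retinal cues are subtle and predictions have high variance. Brain MRI provides stronger vascular and small-vessel-disease markers, yet it is expensive and rarely acquired alongside fundus images, leaving the modalities siloed. The paper studies this unpaired MRI-fundus setting and transfers MRI knowledge to a fundus model to improve prediction.

Result: Experiments on a newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based hypertension prediction over standard distillation and non-graph imputation baselines, and ablations confirm the importance of clinically grounded graph connectivity.

Insight: The novelty lies in using a clinical similarity graph as the bridge for cross-modal knowledge transfer, imputing representations for unpaired data through graph propagation, and combining supervision, target distillation, and relational distillation in a joint training objective. Viewed objectively, the method offers a new graph-mediated distillation strategy for modality-siloed data that could carry over to other multimodal medical imaging tasks.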

Abstract: Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.
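The imputation step can be illustrated with a deliberately simplified sketch: each fundus patient is linked to the k MRI patients with the closest biomarker vectors, and the teacher embeddings of those neighbors are averaged to form a representation target. One-hop neighbor averaging with Euclidean distance is an assumption made here for brevity; the paper's actual graph construction and propagation rule may differ, and all names and numbers below are hypothetical.

```python
import math


def knn(query, candidates, k):
    """Indices of the k candidates closest to `query` in Euclidean distance."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))
    return order[:k]


def impute_targets(fundus_biomarkers, mri_biomarkers, mri_embeddings, k=2):
    """Average the teacher embeddings of each fundus patient's k nearest MRI patients."""
    dim = len(mri_embeddings[0])
    targets = []
    for biomarkers in fundus_biomarkers:
        neighbors = knn(biomarkers, mri_biomarkers, k)
        mean = [sum(mri_embeddings[i][d] for i in neighbors) / k for d in range(dim)]
        targets.append(mean)
    return targets


# Toy data: one shared biomarker, two-dimensional teacher embeddings.
mri_biomarkers = [[0.0], [1.0], [10.0]]
mri_embeddings = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
fundus_biomarkers = [[0.4]]

targets = impute_targets(fundus_biomarkers, mri_biomarkers, mri_embeddings, k=2)
```

The fundus patient with biomarker 0.4 is nearest to the first two MRI patients, so its target is the mean of their embeddings; a fundus student could then be trained to regress toward such targets alongside the hypertension label.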


[147] Ctrl-A: Control-Driven Online Data Augmentation cs.CV | cs.AI | cs.LG | eess.SYPDF

Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas, Hüsnü Aslan, David Balslev-Harder

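TL;DR: Ctrl-A is an automated data augmentation algorithm for image-vision tasks that uses control theory to adjust augmentation strength distributions online. It requires no manual initialization of individual augmentation strengths and adapts dynamically during training through a control-loop architecture and relative operation response curves.

Details

Motivation: Conventional data augmentation requires manually designing augmentation policies and initializing augmentation strengths. The paper automates the adjustment of augmentation strength distributions to improve model performance and to avoid augmentation styles that hurt it.

Result: Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmarks with the WideResNet-28-10 architecture show that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.

Insight: The novelty lies in applying control theory to data augmentation: online adjustment of augmentation strength distributions and an operation-dependent update procedure automate the optimization of augmentation policies, and the dynamic adaptation mechanism could be borrowed to reduce manual intervention.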

Abstract: We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.


[148] SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection cs.CV | cs.AIPDF

Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui

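TL;DR: This paper presents SteelDefectX, a vision-language dataset and benchmark for generalizable steel surface defect detection. The dataset contains 7,778 images across 25 defect categories, with textual descriptions ranging from coarse-grained (category information, visual attributes, industrial causes) to fine-grained (sample-specific attributes such as shape, size, depth, position, and contrast). The authors establish a four-task benchmark, and experiments show that these coarse-to-fine annotations significantly improve interpretability, generalization, and transferability.

Details

Motivation: Current steel surface defect detection methods typically rely on simple image classification models trained only on labels, which limits interpretability and generalization.

Result: On the benchmark, which comprises four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer), experiments with several baseline models show that coarse-to-fine textual annotations significantly improve performance.

Insight: The core novelty is a vision-language dataset with multi-level textual descriptions from coarse to fine. This structured semantic annotation lets models learn richer and more detailed defect representations and supports research on explainable, generalizable industrial defect detection. Public release of the dataset and benchmark should also help the field.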

Abstract: Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.


[149] Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation cs.CVPDF

Xiaochan Yuan, Pai Zeng

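TL;DR: This paper proposes MDSVM-UNet, a new two-stage framework for accurately segmenting coronary arteries from CTA images. It combines multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM) to handle slender vessel structures, complex branching, and foreground-background class imbalance.

Details

Motivation: Coronary artery segmentation is essential for diagnosing and treating cardiovascular disease, but existing CNN methods struggle to capture long-range vascular dependencies, while ViT methods are too computationally expensive for resource-constrained clinical settings. A method is needed that models long-range dependencies efficiently at low computational complexity.

Result: The proposed MDSVM-UNet framework achieves strong performance on coronary artery segmentation. Its two-stage strategy (coarse segmentation guiding intelligent block extraction, then fine segmentation recovering vascular detail) improves segmentation accuracy while preserving linear computational complexity.

Insight: The main contributions are: (1) the MDSConv module, which learns adaptive offsets along three orthogonal anatomical planes and fuses multi-view features to capture the slender geometry of vessels; (2) an RVM-based upsampling decoder block that uses selective state space mechanisms to model inter-slice long-range dependencies while keeping linear computational complexity; and (3) a progressive two-stage segmentation strategy that combines coarse and fine segmentation to refine results. Viewed objectively, combining state space models (SSMs) with deformable convolution for medical image segmentation is a promising direction.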

Abstract: Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes (sagittal, coronal, and axial), thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity. Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives.


[150] Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning cs.CVPDF

Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo

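TL;DR: This paper proposes climate prompting, a video-diffusion-based approach for generating Madden-Julian Oscillation (MJO) sequences. Trained on reanalysis data, the model synthesizes long MJO sequences from low-dimensional conditioning (such as key metrics), and the generated sequences capture composites, power spectra, and multiscale structures such as convectively coupled waves. With idealized low-dimensional conditionings (for example a perpetual MJO, or modulation by seasons or the El Niño-Southern Oscillation), the model generates more tractable MJOs, making it possible to deconstruct the underlying processes and identify physical drivers. The approach offers a practical framework for bridging low-dimensional MJO theory and high-resolution atmospheric complexity and should help tropical atmosphere prediction.

Details

Motivation: Generative deep learning shows promise for simulating the tropical MJO, but its relationship to traditional theoretical frameworks remains unclear. The paper uses a video diffusion model with low-dimensional conditioning to generate MJO sequences that connect low-dimensional theory with high-dimensional atmospheric complexity, addressing this gap in understanding.

Result: The generated MJO sequences capture key features in composites, power spectra, and multiscale structures (including convectively coupled waves), despite some bias. Under idealized conditionings (perpetual MJO, seasonal or ENSO modulation), the model generates more tractable sequences for physical analysis.

Insight: The novelty lies in combining a video diffusion model with low-dimensional conditioning to generate the MJO, bridging theory and complex simulation. Viewed objectively, the "climate prompting" mechanism allows controlled generation that deconstructs physical processes, giving climate modeling added interpretability and flexibility.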

Abstract: Generative deep learning is a powerful tool for modeling the Madden-Julian Oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra, and multiscale structures such as convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO or an isolated modulation by seasons and/or the El Niño-Southern Oscillation. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.


[151] Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation cs.CV | cs.AIPDF

Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen

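TL;DR: This paper proposes an adaptive distillation framework designed specifically for video diffusion models, targeting the oversaturation, temporal inconsistency, and mode collapse that arise when image distillation methods are applied directly to video generation. With an adaptive regression loss, a temporal regularization loss, and an inference-time frame interpolation strategy, it achieves stable few-step video synthesis and significantly improves perceptual fidelity and motion realism.

Details

Motivation: Video generation is computationally expensive, and model distillation is a key technique for efficient deployment. Existing methods mostly transplant image distillation techniques directly, producing artifacts such as oversaturation, temporal inconsistency, and mode collapse, so a distillation method designed specifically for video diffusion models is needed.

Result: Extensive experiments on the VBench and VBench2 benchmarks show that the method achieves stable few-step video synthesis, consistently outperforms existing distillation baselines across multiple metrics, and significantly improves perceptual fidelity and motion realism.

Insight: The contributions are an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts caused by excessive distribution shift; a temporal regularization loss that counteracts temporal collapse and promotes smooth, physically plausible sampling trajectories; and an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Viewed objectively, the work tailors distillation losses to the properties of video sequences and effectively addresses temporal consistency in video generation.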

Abstract: Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.


[152] Manifold-Aware Exploration for Reinforcement Learning in Video Generation cs.CV | cs.AIPDF

Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma

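TL;DR: This paper addresses the instability of reinforcement learning post-training alignment in video generation and proposes a new method, SAGE-GRPO. It treats the pre-trained model as defining a valid video data manifold and applies constraints at the micro and macro levels to keep exploration near that manifold, which stabilizes sampling, improves generation quality, and makes reward estimates more reliable.

Details

Motivation: Existing Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, are far less reliable than their counterparts for language models and image generation. Video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration injects excess noise, which lowers generation quality, makes reward estimates unreliable, and destabilizes post-training alignment.

Result: Evaluated on HunyuanVideo1.5 with VideoAlign as the reward model, SAGE-GRPO outperforms previous methods in VQ, MQ, and TA as well as the visual metrics CLIPScore and PickScore, showing superior performance in both reward maximization and overall video quality.

Insight: The core novelty is constraining reinforcement learning exploration to the vicinity of the video data manifold defined by the pre-trained model. At the micro level, the authors derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, they use a dual trust region with a periodic moving anchor and stepwise constraints, so that the trust region tracks checkpoints closer to the manifold and limits long-horizon drift.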

Abstract: Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
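For readers unfamiliar with GRPO, the reason noisy rollouts hurt is that each sample's advantage is computed relative to the other samples generated for the same prompt. The sketch below shows the group-relative normalization commonly used in GRPO-style methods; it is background only, not the SAGE-GRPO contribution, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a rollout
    is scored only relative to its siblings.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy rewards for four rollouts of one prompt (made-up values).
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])

# If every rollout gets the same reward, no rollout is preferred.
flat = group_relative_advantages([5.0, 5.0])
```

Because the signal is purely relative, degraded rollouts that all score poorly for unrelated reasons yield unreliable advantages, which is the failure mode the abstract attributes to excess exploration noise.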


[153] Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems cs.CVPDF

Chengyin Hu, Yikun Guo, Yuxian Dong, Qike Zhang, Kalibinuer Tiliwalidi

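TL;DR: This paper proposes UPPA, a universal physical adversarial attack on infrared pedestrian detection systems. It models perturbations with geometrically constrained Bezier blocks, optimizes them globally with Particle Swarm Optimization, and realizes the digital perturbations as physical cold patches, achieving a high attack success rate with no online computational overhead along with good cross-domain generalization and black-box transferability.

Details

Motivation: Existing infrared physical adversarial attacks rely on instance-specific online optimization and rigid pattern design, which makes deployment costly and physical robustness insufficient. The paper aims to overcome these limitations.

Result: Extensive experiments show that UPPA achieves an outstanding physical attack success rate without online computational overhead, while exhibiting strong cross-domain generalization and reliable black-box transferability.

Insight: The novelty lies in proposing the first universal physical attack in the infrared domain, using parameterized Bezier blocks and Particle Swarm Optimization to model topologically stable perturbations, and using cold patches to produce a low-temperature distribution that aligns naturally with the thermal radiation characteristics of infrared imaging.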

Abstract: Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.


[154] CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal cs.CVPDF

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

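TL;DR: This paper proposes CLEAR, a mask-free end-to-end video subtitle removal method that achieves efficient inference through context-aware adaptive learning. It uses a two-stage design: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints, and Stage II applies LoRA-based adaptation with generation feedback for dynamic context adjustment.

Details

Motivation: Existing diffusion-based methods require explicit mask sequences during both training and inference, which limits practical deployment. The paper aims to develop a mask-free end-to-end framework that makes video subtitle removal more practical and efficient.

Result: On Chinese subtitle benchmarks, CLEAR improves on mask-dependent baselines by +6.77 dB PSNR and reduces VFID by 74.7%. It also shows strong zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German).

Insight: The contributions are: (1) mask-free end-to-end inference, which lowers deployment complexity; (2) a two-stage decoupled framework combining self-supervised learning with a generation feedback mechanism; (3) training only 0.77% as many parameters as the base diffusion model, giving efficient adaptation; and (4) a generation-driven feedback mechanism that ensures cross-lingual robustness.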

Abstract: Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.


[155] FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection cs.CV | cs.MMPDF

Zhilin Tu, Kemou Li, Fengpeng Li, Jianwei Fei, Jiamin Zhang

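TL;DR: This paper proposes FeatDistill, a framework that strengthens a multi-expert ensemble with feature distillation for robust AI-generated image detection. It ensembles four ViT backbones (CLIP and SigLIP variants) and uses two-stage training (classification loss optimization followed by dense feature self-distillation) to improve feature representation and generalization.

Details

Motivation: To address the information security challenges posed by deepfake technology, targeting three bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization.

Result: Extensive evaluation in the setting of the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild shows that the framework achieves strong robustness and generalization under diverse in-the-wild conditions.

Insight: The contributions are a multi-expert ensemble that captures complementary forensic cues, comprehensive degradation modeling that broadens data coverage, and feature-level self-distillation in the second training stage for representation alignment, which mitigates overfitting and improves the semantic consistency of features.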

Abstract: The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse "in-the-wild" conditions, offering an effective and practical solution for real-world deepfake image detection.
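The inference rule stated in the abstract, averaging the probabilities of four independently trained experts, is simple enough to sketch directly. The probabilities below are made-up values, and the 0.5 decision threshold is an assumption that the abstract does not state.

```python
def ensemble_probability(expert_probs):
    """Mean of the per-expert probabilities that an image is AI-generated."""
    return sum(expert_probs) / len(expert_probs)


def is_ai_generated(expert_probs, threshold=0.5):
    """Binary decision from the averaged probability (threshold is an assumption)."""
    return ensemble_probability(expert_probs) >= threshold


# Hypothetical outputs of four experts for one image.
probs = [0.9, 0.7, 0.4, 0.8]
mean_prob = ensemble_probability(probs)
decision = is_ai_generated(probs)
```

Here one expert disagrees with the other three, but the averaged probability still lands on the majority side, which is the stabilizing effect the abstract describes.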


[156] Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection cs.CVPDF

Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park

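TL;DR: This paper proposes Group3D, a multi-view framework for open-vocabulary 3D object detection. It uses a multimodal large language model (MLLM) to build a scene-adaptive vocabulary and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. During instance construction, semantic compatibility is combined with geometric consistency as a constraint, avoiding the over-merging and fragmentation errors that arise when 3D fragments are merged on geometry alone. The method supports both pose-known and pose-free settings and requires only RGB observations.

Details

Motivation: In open-vocabulary 3D object detection, existing methods decouple geometric instance construction from semantic labeling. When geometric evidence is incomplete, merging based on geometric consistency alone produces irreversible errors such as over-merging or fragmentation.

Result: Experiments on ScanNet and ARKitScenes show that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection and generalizes strongly in zero-shot scenarios.

Insight: The core novelty is integrating semantic constraints directly into instance construction: MLLM-driven semantic compatibility groups guide the merging of 3D fragments, adding semantic guidance on top of geometric consistency and mitigating the errors of geometry-only methods when evidence is view-dependent and incomplete. The approach couples high-level semantics more tightly with low-level geometric reconstruction.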

Abstract: Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
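The merge rule described above (associate two fragments only when they are both semantically compatible and geometrically consistent) can be sketched as a two-part gate. In this illustration geometric consistency is approximated by the IoU of axis-aligned 3D boxes with a fixed threshold; that choice, the threshold value, the fragment layout, and the example groups are all assumptions rather than details from the paper.

```python
def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def can_merge(frag_a, frag_b, groups, iou_threshold=0.25):
    """Merge only if labels share a compatibility group AND the boxes overlap enough."""
    compatible = any(frag_a["label"] in g and frag_b["label"] in g for g in groups)
    return compatible and iou_3d(frag_a["box"], frag_b["box"]) >= iou_threshold


# Hypothetical compatibility groups, e.g. as an MLLM might propose for one scene.
groups = [{"sofa", "couch"}, {"table", "desk"}]
sofa = {"label": "sofa", "box": (0.0, 0.0, 0.0, 2.0, 1.0, 1.0)}
couch = {"label": "couch", "box": (1.0, 0.0, 0.0, 3.0, 1.0, 1.0)}
table = {"label": "table", "box": (1.0, 0.0, 0.0, 3.0, 1.0, 1.0)}

merge_same_object = can_merge(sofa, couch, groups)
merge_different_objects = can_merge(sofa, table, groups)
```

The second pair overlaps exactly as much as the first, yet it is rejected because the labels fall in different groups; this is the over-merging case that a geometry-only rule would get wrong.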


[157] Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention cs.CVPDF

Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu

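TL;DR: This paper proposes a unified spatiotemporal token compression method for video large language models (Video-LLMs), targeting the loss of visual evidence and unbalanced allocation that existing two-stage compression strategies cause at extremely low retention ratios. It reformulates compression as a spatiotemporal allocation task within a global token retention pool, uses a unified selection mechanism that integrates attention weights and semantic similarity to select high-contribution, low-redundancy tokens globally, and merges unselected tokens by clustering and refills them to preserve information integrity. Inside the LLM, text-aware merging performs a secondary compression based on query relevance. The method requires no retraining and works as a plug-and-play module with existing Video-LLMs.

Details

Motivation: Video-LLMs incur high computational costs because of the large number of visual tokens. Existing token compression methods typically use a two-stage spatiotemporal strategy that depends on stage-specific metrics and an implicit assumption of spatiotemporal separability. At extremely low retention ratios this often leads to unbalanced allocation and loss of the visual evidence needed for question answering.

Result: Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks while reducing FLOPs to roughly 2.6%. The benefits generalize across backbones, reduce end-to-end inference latency and memory consumption, and set the state of the art for video understanding under ultra-low token retention.

Insight: The novelty lies in reformulating token compression as a global spatiotemporal allocation task, with a unified selection mechanism (combining attention and semantic similarity) and text-aware secondary compression, yielding training-free, plug-and-play compression that preserves model performance even at extremely low retention ratios.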

Abstract: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
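The abstract does not give the selection formula. One way to picture "high contribution and low redundancy" over a single global pool is a greedy rule in the style of maximal marginal relevance: repeatedly pick the token whose attention weight, minus its highest cosine similarity to any already-selected token, is largest. This is a swapped-in illustrative technique rather than the authors' mechanism, and the features, weights, and `redundancy_weight` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def select_tokens(features, attention, budget, redundancy_weight=1.0):
    """Greedily pick `budget` tokens from one global pool spanning all frames.

    Score = attention weight - redundancy_weight * (max similarity to selected).
    """
    selected = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < budget:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return attention[i] - redundancy_weight * redundancy

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool: tokens 0 and 1 are duplicates, token 2 is distinct but less attended.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attention = [0.9, 0.8, 0.5]
kept = select_tokens(features, attention, budget=2)
```

With a budget of two, the rule keeps the most attended token and then prefers the distinct token over the near-duplicate, even though the duplicate has the higher attention weight.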


[158] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model cs.CVPDF

SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng

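TL;DR: This paper presents daVinci-MagiHuman, an open-source, human-centric audio-video generative foundation model. It uses a single-stream Transformer that processes text, video, and audio tokens in one sequence via self-attention only, jointly generating synchronized video and audio. The model performs well in human-centric scenarios, supports multilingual speech generation, and achieves efficient inference through model distillation, latent-space super-resolution, and a Turbo VAE decoder.

Details

Motivation: Existing audio-video generation models have complex architectures (such as multi-stream or cross-attention designs) that are hard to optimize. The goal is a simple single-stream architecture that delivers high-quality, efficient, synchronized human-centric audio-video generation.

Result: In automatic evaluation, the model achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate for speech intelligibility (14.60%). In pairwise human evaluation over 2000 comparisons, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3. On a single H100 GPU, it generates a 5-second 256p video in 2 seconds.

Insight: The core innovation is a single-stream Transformer that relies on self-attention alone to process multimodal inputs, which simplifies the architecture and eases optimization. Combined with model distillation, latent-space super-resolution, and an efficient decoder, it substantially speeds up inference while maintaining output quality, offering a simple design template for efficient multimodal generative models.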

Abstract: We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
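
The word error rate cited above is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below shows that standard definition; the speech recognizer and text normalization behind the reported 14.60% are not described in this entry, so this illustrates the metric rather than the paper's evaluation pipeline.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is dropped, so `wer` is one sixth.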


[159] Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models cs.CVPDF

Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun

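TL;DR: This paper proposes ResFlow-Tuner, an image restoration framework built on the flow matching model FLUX.1-dev. It combines unified multi-modal fusion (UMMF) with test-time scaling (TTS) to steer the denoising direction dynamically at inference, reporting state-of-the-art restoration performance on multiple standard benchmarks.

Details

Motivation: Although diffusion-based real-world image restoration has made substantial progress, efficiently leveraging ultra-large-scale pre-trained text-to-image models and fully exploiting their potential remains a major challenge. This paper aims to address that problem.

Result: Extensive experiments show that the method achieves state-of-the-art performance on multiple standard benchmarks.

Insight: The main contributions are: (1) the UMMF mechanism, which encodes multi-modal conditions into a unified sequence that guides high-quality image synthesis; and (2) a training-free test-time scaling paradigm tailored to image restoration, which uses reward-model feedback to steer denoising at inference and obtains significant gains at a controllable computational overhead. The work validates flow matching models on low-level vision tasks and, more importantly, proposes an efficient inference-time scaling paradigm suitable for large pre-trained models.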

Abstract: Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
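
One common way to realize reward-guided test-time scaling is to propose several candidate next states at each step and keep the one the reward model prefers. The sketch below shows that generic pattern under stated assumptions: `propose` and `reward` are hypothetical callables, and the abstract does not specify that ResFlow-Tuner steers denoising in exactly this way.

```python
def reward_guided_sampling(initial_state, propose, reward, num_steps):
    """Keep the highest-reward candidate at every step.

    propose(state, step) -> list of candidate next states (hypothetical)
    reward(state)        -> float score from a reward model (hypothetical)
    The number of candidates per step is the knob that trades extra
    compute for restoration quality.
    """
    state = initial_state
    for step in range(num_steps):
        candidates = propose(state, step)
        state = max(candidates, key=reward)
    return state

# Toy stand-ins: states are integers and the reward peaks at 3.
result = reward_guided_sampling(
    initial_state=0,
    propose=lambda s, step: [s - 1, s, s + 1],
    reward=lambda s: -abs(s - 3),
    num_steps=5,
)
```

In the toy run the state climbs toward the reward peak and then stays there, since the unchanged state is always among the candidates.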


[160] Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models cs.CV | cs.AIPDF

Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun

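TL;DR: This paper proposes UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) to improve how hyperbolic vision-language models capture part-to-whole semantic representativeness. The method uses hyperbolic uncertainty to model how representative each part is of the whole and incorporates it into the contrastive objective, learning more accurate part-whole hierarchical representations.

Details

Motivation: Existing hyperbolic vision-language models capture hierarchical relations better, but they do not account for the fact that each part represents the whole scene to a different degree. This limits their understanding of multi-object compositional scenes.

Result: UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.

Insight: The novelty is the use of hyperbolic uncertainty to quantify part-to-whole semantic representativeness, calibrated through uncertainty weighting and an entropy-regularized loss, which allows finer-grained modeling of the compositional structure in an image.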

Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., a whole scene and its part images) through entailment. However, existing approaches do not model the fact that each part has a different level of semantic representativeness with respect to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, assigning lower uncertainty to parts that are more representative of the whole scene and higher uncertainty to less representative ones. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.


[161] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning cs.CVPDF

Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin

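TL;DR: This paper proposes SpatialBoost, a framework that enhances visual representations through language-guided reasoning. It converts dense 3D spatial information from 2D images into linguistic descriptions and, through a large language model and multi-turn chain-of-thought reasoning, progressively injects that knowledge into pre-trained vision encoders to improve their spatial awareness.

Details

Motivation: Existing large-scale pre-trained vision encoders are trained mainly on 2D image data and fail to capture real-world 3D spatial relationships between objects and backgrounds, which limits their effectiveness on downstream tasks.

Result: On benchmarks requiring 3D perception and general vision abilities, such as ADE20K, SpatialBoost raises DINOv3 from 55.9 to 59.7 mIoU, a gain of 3.8 points, reaching state-of-the-art performance.

Insight: The novelty lies in injecting 3D spatial knowledge into vision encoders in the form of linguistic descriptions and in using multi-turn chain-of-thought reasoning to build hierarchical spatial understanding, offering a scalable way to strengthen the 3D awareness of pre-trained models.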

Abstract: Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a gain of 3.8 mIoU points over the pre-trained DINOv3.


[162] Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning cs.CVPDF

Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu

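TL;DR: This paper proposes BayesMM, a test-time adaptation framework based on multimodal Bayesian distribution learning, to counter the performance drop of point cloud analysis models under domain shift. It models textual priors and streaming visual features as Gaussian distributions and fuses the modalities via Bayesian model averaging, enabling continual adaptation without training.

Details

Motivation: Existing cache-based test-time adaptation methods store only limited historical information and fuse predictions heuristically, which makes adaptation unstable. A more robust adaptation mechanism is needed to handle domain shift in point cloud analysis.

Result: Experiments on multiple point cloud benchmarks show that BayesMM remains robust under distribution shift, with an average improvement of over 4%.

Insight: The novelty is modeling multimodal information as probability distributions and fusing them dynamically via Bayesian model averaging, which avoids the instability of heuristic fusion. Continually updating the visual distribution parameters lets the model adapt to the test data stream.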

Abstract: Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
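
The abstract's recipe (per-class Gaussians, online visual updates, evidence-weighted fusion) can be sketched in one dimension. This is an illustrative simplification rather than the BayesMM implementation: scalar features, a uniform class prior, a pseudo-count of one for the initial visual statistics, and the class names are all assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

class RunningGaussian:
    """Gaussian whose mean and variance are updated online (Welford)."""

    def __init__(self, mean, var, count=1):
        self.mean = mean
        self.count = count       # pseudo-count backing the initial statistics
        self.m2 = var * count    # running sum of squared deviations

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return max(self.m2 / self.count, 1e-6)

def bma_predict(x, text_params, visual_params):
    """Fuse text and visual class posteriors, weighted by model evidence."""
    classes = list(text_params)
    text_lik = {c: gaussian_pdf(x, *text_params[c]) for c in classes}
    vis_lik = {
        c: gaussian_pdf(x, visual_params[c].mean, visual_params[c].var)
        for c in classes
    }
    # Evidence of each modality under a uniform class prior.
    evidence = [sum(lik.values()) / len(classes) for lik in (text_lik, vis_lik)]
    total = sum(evidence)
    weights = [e / total for e in evidence] if total > 0 else [0.5, 0.5]

    def posterior(lik):
        z = sum(lik.values())
        return {c: (lik[c] / z if z > 0 else 1 / len(classes)) for c in classes}

    p_text, p_vis = posterior(text_lik), posterior(vis_lik)
    return {c: weights[0] * p_text[c] + weights[1] * p_vis[c] for c in classes}

text = {"chair": (0.0, 1.0), "table": (4.0, 1.0)}
visual = {c: RunningGaussian(m, v) for c, (m, v) in text.items()}
probs = bma_predict(0.5, text, visual)
visual["chair"].update(1.0)  # adapt the visual statistics with a new sample
```

Only sufficient statistics are stored per class, so, unlike a fixed-size cache, no past sample has to be evicted as the test stream grows.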


[163] P-Flow: Prompting Visual Effects Generation cs.CVPDF

Rui Zhao, Mike Zheng Shou

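TL;DR: P-Flow is a training-free video generation framework for customizing dynamic visual effects such as object crushing or explosion. It uses a vision-language model for test-time prompt optimization, iteratively refining the text prompt to match the effect in a reference video, and achieves high-fidelity, diverse effect customization on both text-to-video and image-to-video tasks.

Details

Motivation: Although video generation models have become better at following text prompts, customizing dynamic visual effects (which involve high-level semantics and temporal evolution) remains underexplored, and hand-crafting precise prompts is time-consuming. The problem is how to customize these effects efficiently without modifying the underlying model.

Result: Experiments show that P-Flow achieves high-fidelity and diverse visual effect customization on text-to-video and image-to-video generation tasks and outperforms other models.

Insight: The novelty is a training-free test-time prompt optimization framework that exploits the semantic and temporal reasoning of vision-language models, refining the prompt by iteratively comparing the effects in the reference video and the generated output. This allows flexible, efficient customization of dynamic visual effects without the cost of fine-tuning.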

Abstract: Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.
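
The refinement loop described in the abstract has a simple control flow, sketched below. The three callables (`generate`, `compare`, `revise`) are hypothetical stand-ins for the video generator and the vision-language model; their names and signatures are assumptions, not P-Flow's interface.

```python
def refine_prompt(prompt, reference, generate, compare, revise,
                  max_rounds=5, good_enough=1.0):
    """Iteratively revise a prompt until the output matches the reference.

    generate(prompt)            -> generated output (hypothetical)
    compare(reference, output)  -> (score, feedback) (hypothetical)
    revise(prompt, feedback)    -> new prompt (hypothetical)
    Returns the best-scoring prompt seen and its score.
    """
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(max_rounds):
        output = generate(prompt)
        score, feedback = compare(reference, output)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= good_enough:
            break
        prompt = revise(prompt, feedback)
    return best_prompt, best_score

# Toy stand-ins: the "effect" is a set of words the prompt must mention.
target = {"glass", "shatters", "slowly"}

def toy_compare(reference, output):
    missing = sorted(reference - output)
    return 1 - len(missing) / len(reference), missing

final_prompt, final_score = refine_prompt(
    "glass",
    target,
    generate=lambda p: set(p.split()),
    compare=toy_compare,
    revise=lambda p, missing: p + " " + missing[0],
)
```

Only the prompt changes between rounds; the generator itself is never updated, which is what makes the approach training-free.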


[164] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models cs.CVPDF

Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao

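TL;DR: This paper proposes NullSteer, a null-space projected activation defense framework that balances safety and utility in vision-language models, resisting visual jailbreak attacks while preserving performance on benign inputs.

Details

Motivation: In open-world scenarios, vision-language models are easily induced by visual jailbreak attacks to generate harmful content. Existing activation steering methods strengthen refusal but can cause over-refusal and lack theoretical interpretability, so a defense that better balances safety and utility is needed.

Result: Under various jailbreak attacks, NullSteer significantly reduces harmful outputs (average attack success rate reduced by more than 15% on MiniGPT-4) while maintaining performance comparable to the original model on general benchmarks.

Insight: The novelty is constructing refusal directions in model activations through a linear transformation, using null-space projection to guarantee zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions. This enhances safety without harming general capabilities and is theoretically interpretable.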

Abstract: As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model’s general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
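
The key property claimed above, zero perturbation on the benign subspace, follows from projecting activations onto the orthogonal complement of that subspace before applying the steering map. The sketch below illustrates the idea with plain lists; the particular linear map (push along `refusal_dir` in proportion to the off-subspace component) and all names are assumptions, not the transformation NullSteer actually learns.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of (benign) activation vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def null_space_component(h, basis):
    """Part of h orthogonal to the benign subspace: (I - B B^T) h."""
    out = list(h)
    for b in basis:
        coeff = dot(h, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out

def steer(h, basis, refusal_dir, strength=1.0):
    """Shift h along refusal_dir in proportion to its off-subspace part.

    The shift is linear in h and vanishes for every h in the benign
    subspace, because null_space_component returns zero there.
    """
    residual = null_space_component(h, basis)
    gain = strength * dot(residual, refusal_dir)
    return [hi + gain * ri for hi, ri in zip(h, refusal_dir)]

benign_basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
refusal = [0.0, 0.0, 1.0]
benign_out = steer([2.0, 3.0, 0.0], benign_basis, refusal)
harmful_out = steer([1.0, 0.0, 2.0], benign_basis, refusal)
```

The benign activation passes through unchanged, while the activation with an off-subspace component is pushed further along the refusal direction.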


[165] FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario cs.CV | cs.GR | cs.ROPDF

Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang

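TL;DR: FreeArtGS is a new method for reconstructing articulated objects in the free-moving scenario, taking only a monocular RGB-D video as input. It combines free-moving part segmentation, joint estimation, and end-to-end optimization. Using priors from off-the-shelf point-tracking and feature models, it identifies rigid parts from unconstrained captures, calibrates unified object-to-camera poses, and robustly recovers joint type and axis. Finally, end-to-end optimization based on 3D Gaussian Splatting (3DGS) jointly reconstructs the visual texture, geometry, and joint angles of the articulated object.

Details

Motivation: Augmented reality and robotics increasingly need highly scalable articulated object reconstruction, but existing settings (reconstruction from discrete articulation states or from casual monocular videos) require non-trivial axis alignment or suffer from insufficient coverage, which limits their applicability. FreeArtGS addresses articulated object reconstruction in the free-moving scenario with a simple setup and high scalability.

Result: Experiments on two benchmarks and on real-world free-moving articulated objects show that FreeArtGS excels at reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, supporting its practicality for realistic asset generation.

Insight: The contributions include introducing the free-moving scenario as a new setting, combining free-moving part segmentation with joint estimation, optimizing with priors from off-the-shelf point-tracking and feature models, and jointly reconstructing texture, geometry, and joint angles through 3DGS-based end-to-end optimization. Viewed objectively, the method achieves scalable reconstruction from a simple input (a monocular RGB-D video) and reduces the dependence on axis alignment and coverage found in earlier methods, giving it practical potential.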

Abstract: The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under the free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/


[166] StreamingClaw Technical Report cs.CVPDF

Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He

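TL;DR: StreamingClaw is a unified agent framework for streaming video understanding and embodied intelligence, designed to address the fragmented capabilities of existing agents in the real-time perception-decision-action loop. It integrates real-time streaming reasoning, future-event prediction with proactive interaction, multimodal long-term memory, a perception-decision-action closed loop, and compatibility with the OpenClaw framework.

Details

Motivation: Existing agents for streaming video understanding have fragmented capabilities: some support only offline video understanding, some lack long-term multimodal memory, and some struggle with real-time reasoning and proactive interaction. This hinders their ability to perceive continuously, decide in real time, and act in real environments.

Result: The abstract reports no quantitative results or benchmarks. It claims that, by integrating these core capabilities, StreamingClaw supports real-time streaming reasoning, proactive interaction, and control of the physical world, enabling practical deployment of embodied interaction.

Insight: The novelty is integrating real-time streaming reasoning, multimodal long-term memory, and proactive interaction in one framework, and adding streaming tools and action-centric skills that directly control the physical environment. Compatibility with the OpenClaw framework lets it draw on open-source community resources, improving extensibility and practicality.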

Abstract: Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a perception-decision-action closed loop. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.


[167] Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding cs.CV | cs.AIPDF

Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu

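TL;DR: This paper proposes Mamba-VMR, a two-stage framework that improves precise temporal grounding in text-driven video moment retrieval (VMR). It first uses a large language model (LLM) to match video subtitles and combines them with the query text to generate auxiliary short videos as temporal priors. A multi-modal controlled Mamba network then efficiently fuses the generated priors with the long video sequence while filtering noise.

Details

Motivation: Existing text-driven video moment retrieval methods struggle to capture the hidden temporal dynamics of untrimmed videos, which leads to imprecise grounding in long sequences. Traditional methods rely on natural language queries or static image augmentation, overlook motion sequences, and incur high computational cost with Transformer-based architectures.

Result: Experimental evaluation on the TVR benchmark shows significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.

Insight: The novelty is a two-stage framework that combines LLM-guided subtitle matching with text-to-video (T2V) generation to produce temporal priors, together with a multi-modal controlled Mamba network for efficient fusion and noise filtering. The framework is agnostic to the base retrieval model and therefore broadly applicable.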

Abstract: Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles; these cues are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.


[168] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation cs.CV | cs.AIPDF

Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song

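TL;DR: This paper proposes Bearing-UAV, a purely vision-driven cross-view navigation method for UAVs in GNSS-denied environments. It jointly predicts the UAV's absolute location and heading, leveraging global and local structural features and explicitly encoding relative spatial relationships. This overcomes the accuracy-versus-storage trade-off of existing cross-view geo-localization methods and their neglect of heading information.

Details

Motivation: Existing cross-view geo-localization methods focus on matching UAV views to onboard map tiles, which creates an inherent trade-off between accuracy and storage overhead and overlooks the importance of the UAV's heading during navigation. In addition, the substantial discrepancies and varying overlaps in cross-view scenarios have not been sufficiently considered, limiting generalization to real-world scenarios.

Result: Extensive experiments on the proposed multi-city benchmark, Bearing-UAV-90k, show that Bearing-UAV achieves lower localization error than previous matching/retrieval paradigms across diverse terrains.

Insight: The novelty is the shift from a matching paradigm to a joint prediction paradigm. By fusing global and local features and explicitly modeling relative spatial relationships, the model becomes robust to cross-view variation, misalignment, and feature-sparse conditions. Viewed objectively, the joint prediction framework and the new benchmark dataset are meaningful contributions toward practical vision-only UAV navigation.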

Abstract: Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV's heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show that Bearing-UAV yields lower localization error than previous matching/retrieval paradigms across diverse terrains. Our code and dataset will be made publicly available.


[169] ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints cs.CVPDF

Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang

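TL;DR: This paper addresses the likelihood displacement problem that Direct Preference Optimization (DPO) exhibits when aligning large vision-language models (LVLMs) and proposes Asymmetric Constrained Preference Optimization (ACPO). Through dynamic, target-oriented scaling, ACPO asymmetrically suppresses the gradient flow on the rejected response, mitigating Visual Anchor Collapse and reducing hallucination. Experiments show that ACPO outperforms baselines on multiple benchmarks.

Details

Motivation: When used to align vision-language models, DPO causes likelihood displacement: the probabilities of both the chosen and the rejected responses collapse. This triggers Visual Anchor Collapse, in which the model relies on language priors and ignores visual evidence, producing severe hallucinations.

Result: Experiments on InternVL models show that ACPO reverses the chosen-reward degradation of standard DPO and generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2), while also improving general capabilities.

Insight: The novelty is an asymmetric gradient constraint: a complexity-aware scaling coefficient is applied only to the rejected reward, breaking gradient symmetry and protecting visual tokens from suppression by language priors. The mechanism is modality-agnostic and can extend to other multimodal tasks.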

Abstract: While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods – a failure we term Visual Anchor Collapse – causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
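
For orientation, the standard DPO loss is $-\log\sigma(r_c - r_r)$, where $r_c$ and $r_r$ are the $\beta$-scaled log-probability ratios of the chosen and rejected responses against a reference model. The sketch below adds a scaling coefficient on the rejected reward only, which is the asymmetry the abstract describes. How ACPO derives its complexity-aware coefficient is not stated in this entry, so `rejected_scale` is a hypothetical plain input here.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(logp_chosen, ref_chosen, logp_rejected, ref_rejected,
                    beta=0.1, rejected_scale=1.0):
    """DPO-style loss with an asymmetric scale on the rejected reward.

    rejected_scale = 1.0 recovers standard DPO. A value below 1 shrinks
    the gradient reaching the rejected term by that factor while leaving
    the chosen term unscaled (hypothetical stand-in for ACPO's coefficient).
    """
    chosen_reward = beta * (logp_chosen - ref_chosen)
    rejected_reward = beta * (logp_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - rejected_scale * rejected_reward)

standard = preference_loss(-1.0, -1.0, -3.0, -2.0, beta=1.0)
asymmetric = preference_loss(-1.0, -1.0, -3.0, -2.0, beta=1.0,
                             rejected_scale=0.5)
```

With a scale of 0.5, the same drop in rejected log-probability buys only half the margin, so the optimizer has less incentive to keep pushing the rejected likelihood down.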


[170] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement cs.CV | cs.AIPDF

Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie

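TL;DR: This paper proposes VFLM, an iterative text layout refinement framework driven by visual feedback. Using reinforcement learning with a visual reward model that incorporates OCR accuracy, it achieves adaptive reflective generation and outperforms existing MLLMs and layout models on multiple benchmarks.

Details

Motivation: Existing layout methods based on code generation cannot perceive the rendered visual result, which makes readability and aesthetics hard to guarantee. Visual feedback is therefore needed to improve layout generation quality.

Result: On multiple benchmarks, VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, indicating that visual feedback is critical for design-oriented MLLMs.

Insight: The novelty is placing visual feedback at the core of iterative refinement. Reinforcement learning with a reward that incorporates OCR accuracy stimulates the model's reflective and iterative generation abilities, improving the readability and aesthetics of generated layouts.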

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM performs adaptive reflective generation: it uses visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.


[171] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation cs.CVPDF

Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding

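TL;DR: This paper proposes PAM (Pose-Appearance-Motion Engine), a unified framework for controllable hand-object interaction (HOI) video generation. The engine integrates pose, appearance, and motion information and generates high-resolution videos from input conditions such as depth maps, segmentation masks, and keypoints. It is validated on the DexYCB and OAKINK2 datasets, and the synthetic data is shown to improve a downstream hand pose estimation task.

Details

Motivation: Existing HOI generation research is fragmented: pose synthesis produces no pixels, single-image generation lacks dynamics, and video generation requires the full pose sequence and a ground-truth first frame. This paper aims to build a unified framework that integrates pose, appearance, and motion for true sim-to-real controllable HOI video generation.

Result: On DexYCB, PAM achieves an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution (480x720) videos. On OAKINK2, the full multi-condition model reduces FVD from 68.76 to 46.31. Ablations show that combining depth, segmentation, and keypoint inputs works best. In the downstream task, augmenting with 3,400 synthetic videos (207k frames) lets a model trained on only 50% of the real data plus the synthetic data match the 100% real-data baseline.

Insight: The main contribution is a unified pose-appearance-motion engine that resolves the fragmentation of HOI generation and produces high-resolution dynamic videos directly from multimodal conditions (such as depth, segmentation, and keypoints). The framework supports true sim-to-real deployment, and the synthetic data it generates effectively improves downstream performance, showing the potential of unified generative models for embodied AI and AR/VR.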

Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we argue that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. We therefore introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. We validate the engine in four ways: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to the 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.


[172] Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models cs.CVPDF

Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang

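TL;DR: This paper proposes Omni-WorldBench, a comprehensive benchmark designed to evaluate the interactive response capabilities of 4D world models. It comprises Omni-WorldSuite, a systematic prompt suite spanning different interaction levels and scene types, and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capability by measuring the causal impact of interaction actions on final outcomes and intermediate state evolution trajectories. The authors evaluate 18 representative world models and reveal key limitations of current models in interactive response.

Details

Motivation: Existing world model benchmarks either focus narrowly on the visual fidelity and text-video alignment of generative models or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. The authors argue that the future of world modeling lies in 4D generation (jointly modeling spatial structure and temporal evolution), whose core capability is interactive response, yet no benchmark systematically evaluates this dimension.

Result: An extensive evaluation of 18 representative world models reveals key limitations of current world models in interactive response.

Insight: The novelty is a comprehensive benchmark dedicated to the interactive response capabilities of 4D world models, presented as the first of its kind. Its core is the agent-based evaluation framework (Omni-Metrics), which evaluates models by quantifying the causal impact of interaction actions. This goes beyond traditional visual quality or static reconstruction metrics and directly targets the core function of a world model: simulating state evolution.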

Abstract: Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.


[173] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation cs.CV | cs.AIPDF

Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu

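TL;DR: This paper proposes SpatialReward, a verifiable reward model for evaluating spatial layouts in text-to-image generation. It improves fine-grained spatial consistency through a multi-stage pipeline (prompt decomposition, expert detectors, and vision-language model reasoning) and introduces the SpatRelBench benchmark for comprehensive evaluation.

Details

Motivation: Existing reward models pay limited attention to fine-grained spatial relationships when evaluating text-to-image generation, so generated images look plausible overall yet position objects inaccurately. Spatial consistency therefore needs to be addressed directly.

Result: Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into reinforcement learning training consistently improves spatial consistency and overall generation quality, with results that align more closely with human judgments.

Insight: The novelty lies in a verifiable, multi-stage reward model for assessing complex spatial relations, together with SpatRelBench, a benchmark covering several spatial aspects. Together they give text-to-image generation a more precise optimization target.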

Abstract: Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
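
To make the idea of a verifiable spatial check concrete, here is a minimal Python sketch of the simple, rule-based end of such a pipeline: given bounding boxes that a detector might return, it scores how many requested relations hold. The box format, the relation names, and the functions `relation_holds` and `spatial_reward` are illustrative assumptions, not the paper's implementation, which additionally relies on a vision-language model for relations that rules like these cannot capture.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in image pixels, y grows downward.
Box = Tuple[float, float, float, float]
Constraint = Tuple[str, str, str]  # (subject, relation, object), hypothetical schema


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relation_holds(subject: Box, relation: str, obj: Box) -> bool:
    """Check a simple relation by comparing box centers."""
    sx, sy = center(subject)
    ox, oy = center(obj)
    if relation == "left of":
        return sx < ox
    if relation == "right of":
        return sx > ox
    if relation == "above":
        return sy < oy
    if relation == "below":
        return sy > oy
    raise ValueError(f"unsupported relation: {relation}")


def spatial_reward(detections: Dict[str, Box], constraints: List[Constraint]) -> float:
    """Fraction of constraints satisfied; a missing entity counts as a failure."""
    if not constraints:
        return 0.0
    satisfied = 0
    for subj, rel, obj in constraints:
        if subj in detections and obj in detections:
            if relation_holds(detections[subj], rel, detections[obj]):
                satisfied += 1
    return satisfied / len(constraints)
```

A score of this kind is verifiable in the sense that every point of reward can be traced back to a detected box and an explicit rule.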


[174] EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild cs.CV

Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham

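TL;DR: This paper introduces EgoGroups, a benchmark dataset for detecting social groups in the wild. The dataset captures social dynamics from a first-person view in cities around the world, spanning 65 countries and a range of crowd densities and weather/time-of-day conditions, and provides detailed person and social-group annotations along with geographic and scene metadata. Using it, the authors extensively evaluate the group detection capabilities of state-of-the-art VLMs/LLMs and supervised models. They find that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region significantly affect model performance.

Details

Motivation: Existing social group detection benchmarks are limited by low scene diversity and reliance on third-person cameras (e.g., surveillance footage). They lack real-world evaluation of how groups form and evolve across diverse cultural contexts and unconstrained settings, so a new benchmark is needed to fill this gap.

Result: Evaluating SOTA VLMs/LLMs and supervised models on EgoGroups shows that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region clearly influence model performance.

Insight: The contribution is EgoGroups, the first large-scale, multi-country, multi-condition first-person dataset for social group detection, which fills a gap in existing benchmarks. Viewed objectively, the dataset supports research on social intelligence in cross-cultural, real-world scenes, and it shows both the potential of VLMs/LLMs for zero-shot social understanding and the sensitivity of their performance to environmental factors.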

Abstract: Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLMs/LLMs and supervised models on their group detection capabilities. Our findings include that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.


[175] GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning cs.CV

Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs

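TL;DR: This paper proposes GenOpticalFlow, a generative approach to unsupervised optical flow learning that trains optical flow models on synthesized large-scale, perfectly aligned frame-flow data pairs, with no human annotation. The method uses a pre-trained depth estimation network to generate pseudo optical flows, which condition a next-frame generation model that produces high-fidelity, pixel-aligned subsequent frames, yielding high-quality synthetic data. An inconsistent pixel filtering strategy removes unreliable pixels from the generated frames, which improves fine-tuning performance on real-world datasets.

Details

Motivation: Optical flow estimation is a fundamental problem in computer vision, but reliance on expensive ground-truth annotations limits the scalability of supervised methods. Existing unsupervised and semi-supervised methods often rest on brightness constancy and smoothness assumptions, which produce unreliable supervision signals in complex real-world scenes and lead to inaccurate motion estimation.

Result: Extensive experiments on KITTI2012, KITTI2015, and Sintel show that GenOpticalFlow achieves competitive or superior results compared with existing unsupervised and semi-supervised methods.

Insight: The novelty is a generative framework that enables unsupervised optical flow learning by synthesizing perfectly aligned frame-flow data pairs, avoiding the unreliable supervision signals of conventional unsupervised methods. The inconsistent pixel filtering strategy improves the quality of the generated data and its transfer to real data, offering a scalable, annotation-free route to optical flow learning.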

Abstract: Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame–flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.


[176] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution cs.CV

Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong

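TL;DR: DUO-VSR is a three-stage framework for one-step video super-resolution. Its dual-stream distillation strategy unifies distribution matching and adversarial supervision to address the high sampling cost of diffusion-based VSR models.

Details

Motivation: Diffusion-based video super-resolution achieves remarkable fidelity but suffers from prohibitive sampling costs, and applying distribution matching distillation (DMD) directly to VSR leads to training instability and insufficient supervision.

Result: Extensive experiments show that DUO-VSR outperforms previous one-step VSR methods in visual quality and efficiency.

Insight: The contributions are: 1) a progressive guided distillation initialization that stabilizes training; 2) dual-stream distillation that jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, where the latter uses discriminative features from the real and fake score models to provide complementary adversarial supervision; 3) a preference-guided refinement stage that further aligns the student with perceptual quality preferences.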

Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.


[177] The Dual Mechanisms of Spatial Reasoning in Vision-Language Models cs.CV | cs.LG

Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba

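TL;DR: This paper shows that vision-language models (VLMs) perform spatial reasoning through two concurrent mechanisms. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens, but this plays only a secondary role. The dominant spatial information originates in the vision encoder, whose representations encode object layout and are used directly by the language model. The study finds that enhancing the vision-derived spatial representations across all image tokens improves spatial reasoning on naturalistic images.

Details

Motivation: The paper investigates how VLMs, in multimodal tasks such as image captioning and visual question answering, compute the association between objects and their properties and spatial relations, and aims to pin down where and how these associations are computed inside the model.

Result: Globally enhancing the spatial representations produced by the vision encoder improves the model's spatial reasoning performance on naturalistic images.

Insight: The novelty is the identification and validation of the dual mechanisms of spatial reasoning in VLMs, with emphasis on the central role of the vision encoder. Its spatial signal is distributed globally across visual tokens and extends beyond object regions into the background, which suggests a new direction for improving the spatial abilities of VLMs.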

Abstract: Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.


[178] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing cs.CV | cs.AI

Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu

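TL;DR: This paper proposes 3D-Layout-R1, a structured reasoning framework that addresses the weak spatial understanding and layout consistency of large language models and vision-language models in fine-grained visual editing. The framework reasons over scene graphs: it edits an input scene graph according to a natural-language instruction and generates an updated scene graph that satisfies the text condition while remaining spatially coherent.

Details

Motivation: Large language models and vision-language models reason well, but they struggle to maintain spatial understanding and layout consistency when performing fine-grained visual editing. A method is needed that handles spatial relations explicitly and improves interpretability and control.

Result: On a new text-guided layout editing benchmark (covering sorting, spatial alignment, and room-editing tasks), the method yields an average 15% improvement in IoU and a 25% reduction in center-distance error over Chain of Thought fine-tuning and vanilla GRPO baselines. Compared with SOTA zero-shot large language models, the best models achieve up to 20% higher mIoU, showing markedly improved spatial precision.

Insight: The novelty is a structured reasoning framework that uses scene graphs as an intermediate representation to guide the reasoning process explicitly, which strengthens interpretability and control over spatial relations. Viewed objectively, combining scene-graph reasoning compensates for the weaknesses of existing models in spatial editing and gives language-instructed spatial editing a more reliable and precise solution.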

Abstract: Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
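
The two reported metrics, IoU and center-distance error, can be illustrated with a short Python sketch. It assumes axis-aligned 3D boxes given as (min corner, max corner); the abstract does not state the box parameterization the benchmark uses, so the `Box3D` format and the function names are assumptions made for illustration only.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]
Box3D = Tuple[Point, Point]  # assumed format: (min corner, max corner), axis-aligned


def volume(box: Box3D) -> float:
    """Volume of an axis-aligned box; degenerate boxes have zero volume."""
    lo, hi = box
    return (max(0.0, hi[0] - lo[0])
            * max(0.0, hi[1] - lo[1])
            * max(0.0, hi[2] - lo[2]))


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[1][k], b[1][k]) - max(a[0][k], b[0][k])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box3D, b: Box3D) -> float:
    """Euclidean distance between the centers of two boxes."""
    ca = [(a[0][k] + a[1][k]) / 2.0 for k in range(3)]
    cb = [(b[0][k] + b[1][k]) / 2.0 for k in range(3)]
    return math.dist(ca, cb)
```

Higher IoU and lower center distance between the predicted and reference boxes both indicate a more accurate edit; the two are complementary because IoU drops to zero as soon as boxes stop overlapping, while center distance keeps measuring how far off the placement is.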


[179] DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models cs.CV | cs.RO

Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong

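TL;DR: This paper proposes DualCoT-VLA, a parallel reasoning method for Vision-Language-Action (VLA) models. It achieves comprehensive multimodal reasoning by integrating a visual chain of thought (for low-level spatial perception) with a linguistic chain of thought (for high-level task planning), and it uses a parallel mechanism that turns autoregressive reasoning into single-step forward reasoning. This addresses the limited reasoning ability and high inference latency of existing CoT-VLA models on complex tasks.

Details

Motivation: Standard VLA models struggle with complex multi-step tasks that require logical planning and with precise manipulations that require fine-grained spatial perception. Existing CoT-based VLA models have two key limitations: they rely on isolated, single-modal CoT and so cannot capture low-level visual details and high-level logical planning at the same time, and their step-by-step autoregressive decoding causes high inference latency and compounding errors.

Result: Extensive experiments on the LIBERO and RoboCasa GR1 benchmarks and on real-world platforms show that DualCoT-VLA achieves state-of-the-art (SOTA) performance.

Insight: The novelty is a parallel visual-linguistic chain-of-thought (DualCoT) framework that separates and then integrates the visual and linguistic reasoning paths for more complete multimodal understanding. Learnable query tokens enable parallel reasoning, moving the reasoning process from sequential autoregressive decoding to a single forward step, which significantly reduces latency and limits error propagation.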

Abstract: Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.


[180] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation cs.CV | cs.AI

Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu

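TL;DR: UniMotion is a unified framework that, for the first time, handles understanding and generation of human motion, natural language, and RGB images within a single architecture. By treating motion as a continuous modality on equal footing with RGB, it overcomes two limitations of existing models: they handle only restricted modality subsets (e.g., motion-text or static pose-image), and they rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity.

Details

Motivation: Existing unified models handle only restricted modality subsets and rely mainly on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion aims to overcome these limits and unify understanding and generation across motion, text, and vision.

Result: UniMotion achieves state-of-the-art performance on seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.

Insight: The core contributions are: 1. treating motion as a continuous modality on equal footing with RGB; 2. a novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders that build parallel continuous pathways within a shared LLM backbone; 3. Dual-Posterior KL Alignment (DPA), which injects visual-semantic priors into motion representations without requiring images at inference; 4. Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that resolves the cold-start problem caused by overly sparse text-only supervision and establishes a stable motion-aware foundation for all downstream tasks.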

Abstract: We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder’s richer posterior into the motion-only encoder. To address the cold-start problem – where text supervision alone is too sparse to calibrate the newly introduced motion pathway – we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.


[181] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding cs.CV

Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu

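TL;DR: This paper proposes VideoDetective, a framework that localizes clues efficiently in long-video question answering by combining query-to-segment relevance with the intrinsic affinity between segments. The method splits a video into segments, builds a visual-temporal affinity graph, and runs a Hypothesis-Verification-Refinement loop to estimate a global distribution of segment relevance to the query, which guides the localization of key segments.

Details

Motivation: Existing methods for long video understanding localize clues mainly from the query alone and overlook the video's intrinsic structure and the varying relevance across segments. As a result, multimodal large language models, constrained by limited context windows, handle sparse query-relevant segments poorly.

Result: On representative benchmarks such as VideoMME-long, the method delivers substantial gains across a range of mainstream multimodal large language models, with accuracy improvements of up to 7.5%.

Insight: The novelty is the joint use of extrinsic query-to-segment relevance and intrinsic inter-segment affinity (a graph built from visual similarity and temporal proximity), together with a Hypothesis-Verification-Refinement loop that propagates relevance globally, enabling efficient clue localization under sparse observation.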

Abstract: Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
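
The graph-based propagation step can be sketched in a few lines of Python. The sketch below is a toy illustration under stated assumptions: affinity is taken as cosine similarity multiplied by an exponential decay in segment index distance, and relevance is propagated to unseen segments as an affinity-weighted average of observed scores. The abstract only says the graph is built from visual similarity and temporal proximity, so the exact formulas, the decay constant `tau`, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def affinity_matrix(features: List[Sequence[float]], tau: float = 2.0) -> List[List[float]]:
    """Assumed affinity: visual similarity times temporal proximity."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(0.0, cosine(features[i], features[j]))
                temporal = math.exp(-abs(i - j) / tau)
                A[i][j] = visual * temporal
    return A


def propagate(A: List[List[float]], observed: Dict[int, float]) -> List[float]:
    """Spread relevance from observed segments to unseen ones.

    Observed segments keep their scores; each unseen segment receives the
    affinity-weighted mean of the observed scores (0.0 if it has no affinity
    to any observed segment).
    """
    scores = []
    for i in range(len(A)):
        if i in observed:
            scores.append(observed[i])
            continue
        weight = sum(A[i][j] for j in observed)
        total = sum(A[i][j] * s for j, s in observed.items())
        scores.append(total / weight if weight > 0 else 0.0)
    return scores
```

In a full loop, the highest-scoring unseen segment would presumably be inspected next and its verified score added to `observed` before propagating again.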


cs.NI

[182] OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields cs.NI | cs.CV | cs.MM | cs.RO | eess.IV

Aizierjiang Aiersilan, Zhangfei Yang

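TL;DR: OrbitStream is a training-free adaptive streaming framework for 360-degree video that combines semantic scene understanding with robust control theory to handle viewport prediction and bitrate adaptation in teleoperation.

Details

Motivation: The paper targets two challenges of 360-degree video streaming for teleoperation, uncertain viewport prediction and volatile wireless channels, while avoiding the black-box nature and training-data dependence that limit data-driven and deep reinforcement learning methods in safety-critical systems.

Result: On object-rich teleoperation traces, OrbitStream achieves 94.7% zero-shot viewport prediction accuracy, approaching trajectory-extrapolation baselines (about 98.5%). Across 3,600 Monte Carlo simulations, it yields a mean QoE of 2.71, ranking second among 12 algorithms, close to the top-performing BOLA-E (2.80) and ahead of FastMPC (1.84), with an average decision latency of 1.01 ms and very few rebuffering events.

Insight: The contributions include formulating viewport prediction as a Gravitational Viewport Prediction problem, in which semantic objects generate potential fields that attract user gaze, and using a Saturation-Based Proportional-Derivative controller for buffer regulation. The result is interpretable, has zero training overhead, and delivers competitive QoE.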

Abstract: Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their “black-box” nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (approximately 98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
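
A proportional-derivative controller with output saturation is a standard control pattern, and a generic version is easy to sketch in Python. The class below is a minimal illustration of that pattern applied to a buffer level; the gains, the limits, the class name `SaturatedPD`, and the interpretation of the output are assumptions, since the abstract does not give the controller's parameters or how its output maps to a bitrate decision.

```python
from typing import Optional


class SaturatedPD:
    """Generic PD controller whose output is clamped to [u_min, u_max]."""

    def __init__(self, kp: float, kd: float, u_min: float, u_max: float) -> None:
        self.kp = kp
        self.kd = kd
        self.u_min = u_min
        self.u_max = u_max
        self.prev_error: Optional[float] = None

    def step(self, target: float, measured: float, dt: float) -> float:
        """Return the clamped control signal for one decision interval."""
        error = target - measured
        # No derivative term on the first call, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        u = self.kp * error + self.kd * derivative
        # Saturation keeps a large transient error from producing an extreme action.
        return max(self.u_min, min(self.u_max, u))
```

For example, with a target buffer of 10 seconds and a measured buffer of 4 seconds, `SaturatedPD(kp=0.5, kd=0.1, u_min=-1.0, u_max=1.0).step(10.0, 4.0, 1.0)` computes a raw signal of 3.0 and clamps it to 1.0. A hypothetical bitrate selector could then translate the signed signal into a step up or down the bitrate ladder.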


cs.RO

[183] Your Robot Will Feel You Now: Empathy in Robots and Embodied Agents cs.RO | cs.AI | cs.CV

Angelica Lim, Ö. Nilay Yalçin

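TL;DR: This paper reviews research from human-robot interaction (HRI) and embodied conversational agents (ECAs) on implementing empathy in robots. It examines how machines have been given multimodal social and emotional intelligence by mimicking human and animal behavior, and it aims to apply these lessons to today's language-based agents such as ChatGPT.

Details

Motivation: The goal is to summarize what the HRI and ECA fields know about implementing empathic behavior in machines, in order to inform how empathy can be built into AI agents that interact with people, and in particular to offer lessons for current language-based agents.

Result: The paper is a review chapter. It reports no quantitative experiments or benchmark results and focuses on surveying and summarizing existing research.

Insight: The contribution is a systematic transfer of empathy models and behavior research from traditional HRI and ECAs to modern language-driven agents (such as large language models), highlighting the potential to extend empathic capability from multimodal interaction to language interaction.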

Abstract: The fields of human-robot interaction (HRI) and embodied conversational agents (ECAs) have long studied how empathy could be implemented in machines. One of the major drivers has been the goal of giving multimodal social and emotional intelligence to these artificially intelligent agents, which interact with people through facial expressions, body, gesture, and speech. What empathic behaviors and models have these fields implemented by mimicking human and animal behavior? In what ways have they explored creating machine-specific analogies? This chapter aims to review the knowledge from these studies, towards applying the lessons learned to today’s ubiquitous, language-based agents such as ChatGPT.


[184] Rheos: Modelling Continuous Motion Dynamics in Hierarchical 3D Scene Graphs cs.RO | cs.CV

Iacopo Catalano, Francesco Verdoja, Javier Civera, Jorge Peña-Queralta, Julio A. Placed

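TL;DR: This paper presents Rheos, a framework that embeds continuous directional motion models into a dynamics layer of a hierarchical 3D scene graph to enhance the graph's navigational properties. It uses semi-wrapped Gaussian mixture models to capture multimodal directional flow, and it relies on reservoir sampling, parallel updates, and a Bayesian Information Criterion sweep to support online operation. In a simulated pedestrian environment, Rheos outperforms the discrete baseline on both continuous and discrete metrics.

Details

Motivation: Existing 3D scene graphs mainly track the dynamics of individual agents, while Maps of Dynamics rely on uniform grid discretizations that lack semantic grounding and scale poorly. Rheos aims to bring continuous motion dynamics modeling into hierarchical 3D scene graphs and overcome the semantic and scaling limits of discrete methods.

Result: Evaluated at four spatial resolutions in a simulated pedestrian environment, Rheos consistently outperforms the discrete baseline under continuous metrics as well as under discrete metrics that are unfavorable to it.

Insight: The contributions are: embedding continuous directional motion models as probability distributions in the dynamics layer of a 3D scene graph; replacing discrete histograms with semi-wrapped Gaussian mixture models; and supporting online operation through reservoir sampling, parallel updates, and a BIC sweep, which reduces per-update initialization cost from quadratic to linear.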

Abstract: 3D Scene Graphs (3DSGs) provide hierarchical, multi-resolution abstractions that encode the geometric and semantic structure of an environment, yet their treatment of dynamics remains limited to tracking individual agents. Maps of Dynamics (MoDs) complement this by modeling aggregate motion patterns, but rely on uniform grid discretizations that lack semantic grounding and scale poorly. We present Rheos, a framework that explicitly embeds continuous directional motion models into an additional dynamics layer of a hierarchical 3DSG that enhances the navigational properties of the graph. Each dynamics node maintains a semi-wrapped Gaussian mixture model that captures multimodal directional flow as a principled probability distribution with explicit uncertainty, replacing the discrete histograms used in prior work. To enable online operation, Rheos employs reservoir sampling for bounded-memory observation buffers, parallel per-cell model updates and a principled Bayesian Information Criterion (BIC) sweep that selects the optimal number of mixture components, reducing per-update initialization cost from quadratic to linear in the number of samples. Evaluated across four spatial resolutions in a simulated pedestrian environment, Rheos consistently outperforms the discrete baseline under continuous as well as unfavorable discrete metrics. We release our implementation as open source.
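
Two of the ingredients named above are textbook techniques that a short Python sketch can illustrate: reservoir sampling (here the classic Algorithm R) for a bounded-memory buffer, and the Bayesian Information Criterion for comparing models with different numbers of components. This is a generic sketch, not the released Rheos code; the class and function names are hypothetical, and how Rheos fits its mixtures or counts parameters is not specified in the abstract.

```python
import math
import random
from typing import Any, List, Optional


class Reservoir:
    """Bounded buffer holding a uniform random sample of a stream (Algorithm R)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.samples: List[Any] = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item: Any) -> None:
        """Offer one observation; memory use never exceeds `capacity` items."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(item)
            return
        # Keep the new item with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.samples[j] = item


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood
```

In a sweep, one would fit a mixture for each candidate number of components on the buffered samples, compute `bic` for each fit, and keep the candidate with the lowest value.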


[185] Memory Over Maps: 3D Object Localization Without Reconstruction cs.RO | cs.CV

Rui Zhou, Xander Yap, Jianwen Cao, Allison Lau, Boyang Sun

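TL;DR: This paper proposes a target localization method that needs no explicit 3D reconstruction. It stores only posed RGB-D keyframes as a lightweight visual memory. At query time it retrieves candidate views, re-ranks them with a vision-language model, and builds a sparse 3D estimate of the target through depth backprojection and multi-view fusion, which significantly reduces preprocessing cost and storage overhead.

Details

Motivation: Conventional target localization depends on building explicit 3D scene representations (such as point clouds or voxel grids), which leads to long mapping times, high storage overhead, and limited scalability. The paper asks whether complete 3D reconstruction is necessary for target localization and proposes a lightweight, map-free alternative.

Result: The method performs well on multiple benchmarks. Its preprocessing is more than two orders of magnitude faster than reconstruction-based pipelines and its storage needs are substantially lower. Localization is also validated on downstream object-goal navigation tasks, where the method achieves strong performance without task-specific training.

Insight: The novelty is to use a vision-language model for semantic reasoning directly on 2D observations and to replace dense global reconstruction with sparse, on-demand 3D estimates. This shows that reasoning over image-based scene memory can effectively replace 3D reconstruction for object-centric robot navigation.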

Abstract: Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory, without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: https://ruizhou-cn.github.io/memory-over-maps/
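
Depth backprojection follows directly from the pinhole camera model, so the final step of the query pipeline can be sketched in Python. The sketch assumes known intrinsics (`fx`, `fy`, `cx`, `cy`), a camera-to-world pose given as a rotation matrix and translation, and simple averaging as the fusion rule; the averaging and all function names are assumptions, since the abstract does not describe how the views are fused.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(p_cam: Point3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Point3:
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][k] * p_cam[k] for k in range(3)) + t[i] for i in range(3)
    )


def fuse(points: List[Point3]) -> Point3:
    """Assumed fusion rule: average the per-view world-frame estimates."""
    if not points:
        raise ValueError("need at least one point to fuse")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

Each retrieved keyframe contributes one world-frame estimate of the target, and the fused point is the sparse 3D answer to the query. A more robust rule, such as a median or outlier rejection, could replace the mean when some views are mismatched.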


[186] GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories cs.RO | cs.CV

Tomasz Frelek, Rohan Patil, Akshar Tumu, Henrik I. Christensen

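TL;DR: This paper proposes GHOST, a self-supervised method for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Using large-scale dashcam videos, it recovers camera trajectories with monocular structure-from-motion and projects them onto the ground plane to generate spatial masks that serve as training labels. The trained deep segmentation network predicts motion-conditioned path proposals from a single RGB image, without explicit modeling of roads or lane markings. Evaluated on NuScenes, the model shows reliable trajectory prediction and transfers to an electric scooter platform with light fine-tuning.

Details

Motivation: The paper addresses the problem of segmenting feasible trajectories from monocular images for autonomous driving in complex urban environments. It avoids manual annotation and explicit road modeling by using large-scale unlabeled video in a self-supervised way.

Result: On NuScenes, the model shows reliable trajectory prediction and produces structured, generalizable path proposals. It also transfers to an electric scooter platform through light fine-tuning, which indicates that the method generalizes across platforms.

Insight: The novelty is to use camera trajectories extracted from large-scale driving video as implicit supervision and to generate spatial mask labels by ground-plane projection, which removes the need for manual annotation. The model implicitly captures scene layout, lane topology, and intersection structure, generalizes across camera configurations, and turns trajectory hypothesis estimation into an image segmentation task.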

Abstract: We present a scalable self-supervised approach for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Leveraging large-scale dashcam videos, we treat recorded ego-vehicle motion as implicit supervision and recover camera trajectories via monocular structure-from-motion, projecting them onto the ground plane to generate spatial masks of traversed regions without manual annotation. These automatically generated labels are used to train a deep segmentation network that predicts motion-conditioned path proposals from a single RGB image at run time, without explicit modeling of road or lane markings. Trained on diverse, unconstrained internet data, the model implicitly captures scene layout, lane topology, and intersection structure, and generalizes across varying camera configurations. We evaluate our approach on NuScenes, demonstrating reliable trajectory prediction, and further show transfer to an electric scooter platform through light fine-tuning. Our results indicate that large-scale ego-motion distillation yields structured and generalizable path proposals beyond the demonstrated trajectory, enabling trajectory hypothesis estimation via image segmentation.


[187] ToFormer: Towards Large-scale Scenario Depth Completion for Lightweight ToF Camera cs.RO | cs.CV

Juncheng Chen, Tiancheng Lai, Xingpeng Wang, Bingxin Liao, Baozhe Zhang

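TL;DR: This paper proposes ToFormer, a framework for depth completion with short-range Time-of-Flight (ToF) cameras in large-scale scenarios. It comprises LASER-ToF, the first ToF depth completion dataset for large-scale scenarios, and a sensor-aware depth completion network that combines a 3D-2D Joint Propagation Pooling module with Multimodal Cross-Covariance Attention to model long-range relationships and handle non-uniform ToF depth sparsity. The network can also use the sparse point cloud from visual SLAM as a supplement to improve prediction accuracy. Experiments show an 8.6% lower mean absolute error than the second-best method while keeping a lightweight design, and the method runs in real time at 10Hz on a quadrotor, supporting reliable large-scale mapping and long-range planning.

Details

Motivation: ToF cameras are widely used in robotics thanks to their compact design and high measurement precision, but their limited sensing range restricts deployment in large-scale scenarios. Existing depth completion research lacks dedicated datasets and generalizes poorly to ToF measurements, so a ToF depth completion solution for large-scale scenarios is needed.

Result: Experiments on the LASER-ToF dataset show an 8.6% lower mean absolute error than the second-best method, reaching state-of-the-art (SOTA) performance while keeping a lightweight design that supports onboard deployment. Deployed on a real robot, the method runs at 10Hz on a quadrotor and enables reliable large-scale mapping and long-range planning.

Insight: The contributions include LASER-ToF, the first ToF depth completion dataset for large-scale scenarios, and a sensor-aware network whose 3D-2D Joint Propagation Pooling module and Multimodal Cross-Covariance Attention handle long-range relationships and non-uniform sparsity, with optional fusion of visual SLAM point clouds for higher accuracy. Viewed objectively, the work addresses the practical challenge of deploying ToF cameras in large-scale scenarios with a full-stack framework that spans data, algorithm, and system integration.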

Abstract: Time-of-Flight (ToF) cameras offer a compact design and high measurement precision, which makes them suitable for various robot tasks. However, their limited sensing range restricts deployment in large-scale scenarios. Depth completion has emerged as a potential solution to expand the sensing range of ToF cameras, but existing research lacks dedicated datasets and struggles to generalize to ToF measurements. In this paper, we propose a full-stack framework that enables depth completion in large-scale scenarios for short-range ToF cameras. First, we construct a multi-sensor platform with a reconstruction-based pipeline to collect real-world ToF samples with dense large-scale ground truth, yielding the first LArge-ScalE scenaRio ToF depth completion dataset (LASER-ToF). Second, we propose a sensor-aware depth completion network that incorporates a novel 3D branch with a 3D-2D Joint Propagation Pooling (JPP) module and Multimodal Cross-Covariance Attention (MXCA), enabling effective modeling of long-range relationships and efficient 3D-2D fusion under non-uniform ToF depth sparsity. Moreover, our network can utilize the sparse point cloud from visual SLAM as a supplement to ToF depth to further improve prediction accuracy. Experiments show that our method achieves an 8.6% lower mean absolute error than the second-best method, while maintaining a lightweight design to support onboard deployment. Finally, to verify the system’s applicability on real robots, we deploy the proposed method on a quadrotor at a 10Hz runtime, enabling reliable large-scale mapping and long-range planning in challenging environments for short-range ToF cameras.


[188] CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation cs.RO | cs.CV

Bowen Jing, Ruiyang Hao, Weitao Zhou, Haibao Yu

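TL;DR: CounterScene is a generative world model framework for safety-critical closed-loop evaluation that generates safety-critical driving scenarios through structured counterfactual causal reasoning. It first identifies the causally critical agent and classifies the conflict type, then builds a conflict-aware interactive world model in which a causal interaction graph explicitly models dynamic dependencies between agents. Finally, stage-adaptive counterfactual guidance applies minimal interventions to the critical agent, raising the collision rate while preserving trajectory realism.

Details

Motivation: Existing methods rely on heuristic adversarial agent selection and unstructured perturbations. They lack explicit modeling of interaction dependencies, which produces a trade-off between realism and adversarial effectiveness. The paper aims to use counterfactual causal reasoning to understand why dangerous interactions arise, rather than merely forcing collisions.

Result: Extensive experiments on nuScenes show that CounterScene achieves the strongest adversarial effectiveness while maintaining trajectory realism across all horizons. It raises the long-horizon collision rate from 12.3% for the strongest baseline to 22.7%, with better realism (ADE 1.88 vs. 2.09). The advantage widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.

Insight: The novelty is to bring structured counterfactual causal reasoning into generative world models: a causal interaction graph explicitly models dynamic dependencies between agents, and stage-adaptive counterfactual guidance applies minimal interventions. This offers a new paradigm for safety-critical scenario generation that balances realism and adversarial effectiveness, and it may carry over to the evaluation of other interactive systems that require causal reasoning.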

Abstract: Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism–adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.
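
The realism figure quoted above, ADE, is the average displacement error: the mean Euclidean distance between corresponding points of a generated trajectory and a reference trajectory. A minimal Python sketch follows; it assumes 2D positions sampled at matching timesteps, and which reference trajectories the paper compares against is not stated in the abstract.

```python
import math
from typing import Sequence, Tuple

Point2 = Tuple[float, float]


def ade(pred: Sequence[Point2], ref: Sequence[Point2]) -> float:
    """Average displacement error between two equally long trajectories."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equally long")
    # Mean over timesteps of the pointwise Euclidean distance.
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)
```

A lower ADE means the generated trajectory stays closer to the reference, which is why the abstract reads 1.88 versus 2.09 as better realism.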


[189] Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition cs.RO | cs.CVPDF

Zhiyan Cao, Zhengxi Wu, Yiwei Wang, Pei-Hsuan Lin, Li Zhang

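TL;DR: This study proposes an anatomical prior-driven framework for autonomous robotic acquisition of standard cardiac ultrasound views. It integrates a YOLO-based multi-class segmentation model, augmented by a spatial-relation graph module, to embed anatomical priors into the feature pyramid, and extracts quantifiable anatomical features of standard views. The priors over these features are fitted to Gaussian distributions to build probabilistic anatomical priors. Probe adjustment during robotic ultrasound scanning is formalized as a reinforcement learning problem whose state is built from real-time anatomical features and whose reward reflects how well the view matches the anatomical priors.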

Details

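Motivation: To address the strong operator dependence of standard view acquisition in cardiac ultrasound diagnosis, the tendency of existing medical segmentation models to produce anatomically inconsistent results on images with little textural difference between feature classes, and the reliance of existing autonomous probe adjustment methods on simplistic heuristic rules or black-box learning.

Result: On the Special Case dataset, SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8%. The reinforcement learning agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.

Insight: The main novelty is the systematic integration of anatomical priors (via the spatial-relation graph module and probabilistic distribution modeling) into both segmentation and autonomous control, with probe adjustment formalized as a reinforcement learning problem based on anatomical feature matching, improving segmentation accuracy and robotic autonomy together.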

Abstract: Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
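The idea of fitting feature priors to Gaussians and rewarding views that match them can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the feature names, the sample values, and the choice of an unnormalized Gaussian density as the matching score are all assumptions.

```python
import math
from statistics import mean, stdev


def fit_gaussian_prior(samples):
    """Fit a 1-D Gaussian (mean, std) to feature values measured on standard views."""
    return mean(samples), stdev(samples)


def ap_matching_reward(features, priors):
    """Mean unnormalized Gaussian density over features, in (0, 1].

    A value near 1.0 means every feature sits close to its prior mean.
    """
    scores = []
    for name, value in features.items():
        mu, sigma = priors[name]
        z = (value - mu) / sigma
        scores.append(math.exp(-0.5 * z * z))
    return sum(scores) / len(scores)


# Hypothetical feature measurements taken from expert-acquired standard views.
priors = {
    "lv_area_ratio": fit_gaussian_prior([0.30, 0.32, 0.28, 0.31, 0.29]),
    "valve_angle_deg": fit_gaussian_prior([44.0, 46.0, 45.0, 47.0, 43.0]),
}

on_target = ap_matching_reward({"lv_area_ratio": 0.30, "valve_angle_deg": 45.0}, priors)
off_target = ap_matching_reward({"lv_area_ratio": 0.45, "valve_angle_deg": 60.0}, priors)
print(on_target, off_target)
```

A view whose features sit at the prior means scores close to 1, while a view far from the priors scores close to 0, which gives an RL agent a smooth signal to climb.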


[190] PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing cs.RO | cs.CVPDF

Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang

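TL;DR: This paper proposes PRM-as-a-Judge, a dense evaluation paradigm that uses Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences, addressing current robotic evaluation’s over-reliance on binary success rates that ignore execution details.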

Details

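Motivation: Current robotic evaluation relies mainly on binary success rates, which compress rich execution processes into a single outcome and obscure key qualities such as progress, efficiency, and stability. A denser, finer-grained evaluation method is therefore needed to reveal behavioral signatures and failure modes during execution.

Result: On RoboPulse, a purpose-built diagnostic benchmark, several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges at micro-scale progress discrimination, validating the micro-resolution property.

Insight: The novelty is a dense evaluation paradigm built on the OPD (Outcome-Process-Diagnosis) metric system, which formalizes execution quality through a task-aligned progress potential and defines two axiomatic properties, macro-consistency and micro-resolution. PRM judges naturally instantiate this paradigm and enable a structured audit of mainstream policy paradigms on long-horizon tasks, revealing behavioral signatures that outcome-only metrics cannot show.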

Abstract: Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
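Why macro-consistency follows from a scalar potential can be seen in a small sketch: if per-step scores are differences of a progress potential, they telescope, so the total depends only on the start and end states. The progress values below are hypothetical, and the helper name is illustrative rather than taken from the paper.

```python
def dense_rewards(potentials):
    """Per-step progress scores as differences of a scalar progress potential."""
    return [b - a for a, b in zip(potentials, potentials[1:])]


# Hypothetical progress estimates in [0, 1] for two executions of the same task.
direct = [0.0, 0.3, 0.6, 1.0]
hesitant = [0.0, 0.4, 0.2, 0.5, 0.3, 1.0]

total_direct = sum(dense_rewards(direct))
total_hesitant = sum(dense_rewards(hesitant))

# Both runs succeed, but only the dense view exposes the regressions.
regress_steps = sum(1 for r in dense_rewards(hesitant) if r < 0)
print(total_direct, total_hesitant, regress_steps)
```

Both trajectories accumulate the same total progress, yet the step-level scores show that the second one backtracked twice, which is the kind of signal a binary success rate discards.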


cs.AI [Back]

[191] Knowledge Boundary Discovery for Large Language Models cs.AI | cs.CL | cs.LGPDF

Ziquan Wang, Zhongqi Lu

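TL;DR: This paper proposes Knowledge Boundary Discovery (KBD), a reinforcement learning-based framework for exploring the knowledge boundaries of large language models (LLMs). It defines the knowledge boundary by automatically generating questions the model can answer confidently (within the boundary) and questions it cannot (beyond the boundary), and uses a reinforcement learning agent that interacts with the LLM to explore these boundaries iteratively. Experiments show that KBD automatically produces non-trivial sets of answerable and unanswerable questions comparable to manually crafted benchmark datasets.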

Details

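Motivation: Hallucination makes it hard to delineate an LLM’s knowledge boundary accurately; the paper aims to explore and assess the scope of LLM knowledge through an automated method.

Result: Validated against manually crafted LLM benchmark datasets, the KBD-generated question set is comparable to human-generated datasets, indicating that the method detects LLM knowledge boundaries effectively.

Insight: The novelty is modeling knowledge boundary discovery as reinforcement learning in a partially observable environment, where an entropy-reduction reward guides the agent to generate progressive questions and thereby identify the LLM’s knowledge limits automatically. This offers a new way to evaluate LLMs with less reliance on manual annotation.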

Abstract: We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM’s responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM’s response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.
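An entropy-reduction reward over a belief state can be illustrated with a two-hypothesis toy case. The abstract does not specify how beliefs are represented or updated, so the Bayes update, the hypothesis labels, and the likelihood values below are assumptions made purely for illustration.

```python
import math


def entropy(belief):
    """Shannon entropy (bits) of a discrete belief distribution."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def update_belief(belief, likelihood):
    """Bayes update: posterior is proportional to prior times observation likelihood."""
    unnormalized = {h: belief[h] * likelihood[h] for h in belief}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


def entropy_reduction_reward(prior, posterior):
    """Reward the agent for questions that make its belief less uncertain."""
    return entropy(prior) - entropy(posterior)


# Hypothetical belief about whether a topic lies within or beyond the LLM's knowledge.
prior = {"within": 0.5, "beyond": 0.5}
# Assumed likelihood of the observed response under each hypothesis.
likelihood = {"within": 0.9, "beyond": 0.1}

posterior = update_belief(prior, likelihood)
reward = entropy_reduction_reward(prior, posterior)
print(posterior, round(reward, 3))
```

A question whose answer leaves the belief unchanged earns zero reward, so the agent is pushed toward questions that actually discriminate between the two sides of the boundary.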


[192] LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning cs.AI | cs.CLPDF

Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li

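TL;DR: This paper presents LongCat-Flash-Prover, a 560-billion-parameter open-source Mixture-of-Experts model that advances native formal reasoning in Lean4 through agentic tool-integrated reasoning. It decomposes the task into three capabilities (auto-formalization, sketching, and proving) and proposes a Hybrid-Experts Iteration Framework to expand high-quality task trajectories. Training uses a Hierarchical Importance Sampling Policy Optimization algorithm to stabilize long-horizon training and incorporates theorem consistency and legality detection mechanisms.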

Details

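Motivation: To advance native formal reasoning and tackle the complex, long-horizon task of turning informal problems into formal statements and completing their proofs in theorem provers such as Lean4.

Result: Sets a new state of the art among open-weights models on several benchmarks: a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem, and 70.8% of ProverBench and 41.5% of PutnamBench solved with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

Insight: The novelties are: 1) decoupling native formal reasoning into three independent formal capabilities; 2) a Hybrid-Experts Iteration Framework that generates high-quality task trajectories; 3) the Hierarchical Importance Sampling Policy Optimization algorithm, whose gradient masking strategy handles policy staleness and train-inference engine discrepancies to stabilize MoE training on long-horizon tasks; 4) theorem consistency and legality detection mechanisms that prevent reward hacking.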

Abstract: We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. We also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.


[193] The Library Theorem: How External Organization Governs Agentic Reasoning Capacity cs.AI | cs.CL | cs.DS | cs.LGPDF

Zachary F. Mainen

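TL;DR: This paper presents the “Library Theorem”, which formalizes the transformer context window as an I/O page and proves that tool-augmented agents with indexed external memory have an exponential advantage in retrieval cost over agents restricted to sequential scanning. A controlled lookup benchmark confirms the theoretical predictions and reveals a competition in language models between semantic understanding and following navigational protocols; the paper argues for separating index construction (which uses the model’s semantic ability) from index traversal (which uses deterministic algorithms).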

Details

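Motivation: Structured retrieval, that is, indexing over an agent’s own reasoning state, remains underexplored in the externalized reasoning of transformer-based agents. The paper aims to quantify the fundamental advantage that indexed external memory gives to an agent’s reasoning efficiency.

Result: On a controlled lookup benchmark covering random hashes, ordered integers, and encyclopedia entries, the indexed agent achieves a median of 1 page read on abstract content regardless of store size, matching the O(1) prediction. Sorted pages without an index fail to close the gap: even when the stronger model achieves near-optimal log₂N search, it still loses to the index by a factor of 5. On familiar content (encyclopedia entries), the model bypasses the retrieval protocol and answers directly from parametric memory, causing catastrophic token expenditure.

Insight: The core novelty is a formal proof that indexed external memory yields an exponential gain in retrieval efficiency (the Library Theorem), together with evidence that two cognitive operations in language models, understanding content and following navigational protocols, compete and can be dissociated. The proposed separation of concerns, using language models for index construction (where semantic understanding helps) and deterministic algorithms for index traversal (where understanding tempts the model to short-circuit the protocol), is presented as a key design principle for making agentic reasoning systems more reliable and efficient.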

Abstract: Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval – indexing over one’s own reasoning state – remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $Ω(N)$ page reads per query, and $O(T \log_b T)$ versus $Θ(T^2)$ cumulative cost over $T$ reasoning steps – a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types – random hashes, ordered integers, and encyclopedia entries – varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
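The gap between the two cost regimes can be made concrete with a small counting sketch. It assumes a worst-case sequential scan (the target sits on the last page), a b-ary index whose depth is the smallest number of levels covering N items, and a store that grows by one item per reasoning step; the function names and parameter values are illustrative, not taken from the paper.

```python
def sequential_scan_reads(page_size, target_index):
    """Pages read when scanning in order until the page holding the target."""
    return target_index // page_size + 1


def indexed_reads(n_items, branching):
    """Pages read descending a b-ary index over n_items (at least one page)."""
    reads, span = 0, 1
    while span < n_items:
        span *= branching
        reads += 1
    return max(1, reads)


def cumulative_cost(steps, reads_at_size):
    """Total page reads over `steps` queries, with the store holding t items at step t."""
    return sum(reads_at_size(t) for t in range(1, steps + 1))


# Single worst-case query over 5,000 items with 50 items per page.
scan_single = sequential_scan_reads(page_size=50, target_index=4999)
index_single = indexed_reads(n_items=5000, branching=50)

# Cumulative cost over 1,000 reasoning steps with 10 items per page.
scan_total = cumulative_cost(1000, lambda t: sequential_scan_reads(10, t - 1))
index_total = cumulative_cost(1000, lambda t: indexed_reads(t, 10))
print(scan_single, index_single, scan_total, index_total)
```

Under these assumptions the scan's cumulative cost grows quadratically in the number of steps, while the indexed cost grows only slightly faster than linearly, which mirrors the $Θ(T^2)$ versus $O(T \log_b T)$ comparison in the abstract.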


[194] EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning cs.AI | cs.CLPDF

Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng

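TL;DR: This paper proposes EvoIdeator, a framework that drives the iterative evolution of scientific ideas through reinforcement learning combined with checklist-grounded feedback. A structured judge model generates lexicographic rewards and fine-grained language feedback, so that the policy model systematically uses precise feedback during both optimization and inference to turn initial concepts into high-quality research proposals.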

Details

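Motivation: LLMs struggle to evolve initial concepts iteratively into high-quality research proposals. Existing reinforcement learning methods rely on coarse scalar rewards, while language-based refinement methods are usually confined to inference-time prompting and do not optimize the model to internalize the feedback.

Result: EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models on key scientific metrics, and the learned policy generalizes strongly to diverse external feedback sources without further fine-tuning.

Insight: The novelty is aligning the reinforcement learning objective with checklist-grounded feedback: lexicographic rewards enable multi-dimensional optimization, and fine-grained, span-level language feedback (on grounding, feasibility, and methodological rigor) complements them, opening a path toward autonomous, scalable, self-refining ideation.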

Abstract: Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.
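A lexicographic reward compares candidates dimension by dimension in priority order instead of collapsing the checklist into one number. The sketch below is a generic illustration: the priority order, the binary checklist scores, and the draft names are hypothetical and not the paper's actual rubric.

```python
def lexicographic_key(checklist_scores, priority):
    """Order checklist scores by priority so tuples compare lexicographically."""
    return tuple(checklist_scores[dimension] for dimension in priority)


# Hypothetical priority order over checklist dimensions.
priority = ["grounding", "feasibility", "rigor"]

# Hypothetical pass/fail checklist results for three candidate ideas.
candidates = {
    "draft_a": {"grounding": 1, "feasibility": 0, "rigor": 1},
    "draft_b": {"grounding": 1, "feasibility": 1, "rigor": 0},
    "draft_c": {"grounding": 0, "feasibility": 1, "rigor": 1},
}

ranked = sorted(
    candidates,
    key=lambda name: lexicographic_key(candidates[name], priority),
    reverse=True,
)
scalar_sums = {name: sum(scores.values()) for name, scores in candidates.items()}
print(ranked, scalar_sums)
```

All three drafts have the same summed score, so a single scalar cannot separate them, whereas the lexicographic ordering still ranks them by the highest-priority dimension on which they differ.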


[195] The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures cs.AI | cs.CLPDF

Yiling Wu

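TL;DR: This paper proposes a framework for analyzing the structural demands that different types of reasoning (such as induction, analogy, causal inference, deduction, and formal logic) place on representational systems, identifying four key structural properties: operability, consistency, structural preservation, and compositionality. The framework reveals a principal structural boundary between reasoning types: those below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied.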

Details

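Motivation: Psychology, AI, and philosophy of mind currently lack a systematic account of the structural demands that different reasoning types place on representational systems. The author aims to fill this gap with a unified framework and to explain why scaling statistical learning alone cannot achieve certain advanced forms of reasoning.

Result: The framework is supported by evidence of varying directness from AI evaluation, developmental psychology, and cognitive neuroscience, and yields three testable predictions: compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling.

Insight: The novelty is a necessary-condition framework that is agnostic about representational format. It emphasizes the structural boundary of reasoning and argues that the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means, offering a way to reorganize existing debates rather than close them.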

Abstract: Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.


[196] Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning cs.AI | cs.CVPDF

Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu

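TL;DR: This paper uses mechanistic interpretability to study the role of attention heads in vision-language models (VLMs) for spatial reasoning. It introduces the CogVSR dataset, which decomposes complex spatial reasoning questions, and develops a probing framework to identify attention heads specialized for different cognitive functions such as spatial perception and relational reasoning. The functional heads are found to be universally sparse, with comparatively few spatially specialized heads, and activating latent spatial heads improves the model’s spatial understanding.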

Details

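Motivation: Despite remarkable progress in large vision-language models, spatial reasoning remains a persistent challenge. The paper investigates, through a mechanistic interpretability lens, how attention heads in VLMs contribute to spatial reasoning, in order to understand their functional roles.

Result: Analysis across multiple VLM families shows that functional heads are universally sparse and vary in number and distribution across functions; spatially specialized heads are fewer than those for other cognitive functions. Intervention experiments show that removing functional heads degrades performance while emphasizing them improves accuracy, confirming their critical role in spatial reasoning.

Insight: The novelties include the CogVSR dataset, which simulates human step-by-step reasoning, and a probing framework for identifying functional heads. The study also reveals the scarcity of spatially specialized attention heads and proposes methods to activate latent spatial heads, providing interpretability-driven insights for enhancing complex spatial reasoning in multimodal models.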

Abstract: Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.


[197] A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment cs.AI | cs.CVPDF

Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei

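TL;DR: This paper presents Cerebra, a multidisciplinary AI system for multimodal dementia characterization and risk assessment. It is an interactive multi-agent team that coordinates specialized agents for electronic health records, clinical notes, and medical imaging, and synthesizes their outputs into a clinician-facing dashboard that combines visual analytics with a conversational interface to support clinical decisions.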

Details

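Motivation: Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data, yet most existing multimodal foundation models are static, opaque, and disconnected from real clinical workflows. An AI system is therefore needed that integrates multiple data types, supports interaction, and fits clinical workflows.

Result: Evaluated on a large multi-institutional dataset spanning 3 million patients from four independent healthcare systems, Cerebra outperforms state-of-the-art single-modality models and large multimodal language model baselines in dementia risk prediction (AUROC up to 0.80), dementia diagnosis (AUROC 0.86), and survival prediction (C-index 0.81). In a reader study with experienced physicians, Cerebra increased expert accuracy in prospective dementia risk estimation by 17.5 percentage points.

Insight: The novelty is an interactive multi-agent AI team architecture that coordinates specialized analysis agents for multiple modalities and delivers interpretable decision support through a visual dashboard combined with a conversational interface. Its emphasis on privacy-preserving deployment (by operating on structured representations), robustness when modalities are incomplete, and fit with real clinical workflows makes it a system design worth borrowing from.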

Abstract: Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra’s potential for interpretable, robust decision support in clinical care.


[198] Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning cs.AI | cs.CVPDF

Xi Wang, Xu Yang, Donghao Sun, Cheng Deng

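TL;DR: This paper proposes a long-tail class incremental learning method that uses stratified language guidance to compensate for insufficient visual information. By analyzing the data distribution, it guides a large language model to generate a stratified semantic tree from coarse to fine granularity, and on top of this introduces stratified adaptive language guidance and stratified alignment language guidance. These dynamically adjust supervision for tail classes and ease data imbalance, and use the structural stability of the language tree to constrain optimization and strengthen semantic-visual alignment, mitigating catastrophic forgetting.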

Details

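Motivation: In long-tail class incremental learning, the scarcity of tail-class samples not only hampers learning of those classes but also exacerbates catastrophic forgetting under continuously evolving, imbalanced data distributions.

Result: Extensive experiments on multiple benchmarks show that the method achieves state-of-the-art performance.

Insight: The novelty is exploiting the richness and scalability of language knowledge: a stratified language tree is used to adjust supervision dynamically and to constrain optimization, compensating for insufficient visual information and easing the core challenges of long-tail and incremental learning.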

Abstract: Long-tail class incremental learning (LT CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information from coarse to fine granularity. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic-visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.


cs.CR [Back]

[199] SecureBreak – A dataset towards safe and secure models cs.CR | cs.AI | cs.CL | cs.LGPDF

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera

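TL;DR: This paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. Careful manual annotation makes the dataset highly reliable, and it is effective at detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak.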

Details

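Motivation: As large language models become pervasive in real-world applications, security alignment is a critical requirement for their safe deployment. Prior work has focused mainly on model architectures and alignment methods, which cannot fully eliminate harmful generations, and attacks such as jailbreaking and prompt injection can bypass existing safety mechanisms. Additional security strategies are therefore needed to assess the robustness of alignment at the training stage and to provide a final layer of defense.

Result: The SecureBreak dataset performs well at detecting unsafe content across multiple risk categories, and tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak.

Insight: The novelty is a highly reliable, conservatively annotated safety dataset that can be used for post-generation safety filtering and for guiding further model alignment and security improvements, addressing the inability of existing methods to eliminate harmful generations.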

Abstract: Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an “ultimate” defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.


[200] Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning cs.CR | cs.CV | cs.LGPDF

Yunbei Zhang, Yingqiang Ge, Weijie Xu, Yuhui Xu, Jihun Hamm

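TL;DR: This paper proposes a new type of multimodal red-teaming attack, the Visual Exclusivity attack, which uses image content (such as technical schematics) as the basis for reasoning that induces a model to generate harmful content, rather than relying on adversarial noise or overlaid text. To exploit this threat systematically, the authors propose the Multimodal Multi-turn Agentic Planning framework, which trains an attacker planner to synthesize global multi-turn attack strategies, and evaluate it on the newly built VE-Safety dataset.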

Details

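Motivation: Existing multimodal red-teaming attacks based on typography or adversarial noise are structurally brittle: once the payload is identified, standard defenses neutralize them easily. The paper explores a more resilient form of threat in which harm arises only through reasoning over visual content (such as technical diagrams), thereby bypassing current safety alignment.

Result: Experiments on the VE-Safety dataset show that the proposed MM-Plan achieves an attack success rate of 46.3% against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2 to 5 times, while existing methods largely fail on this reasoning-dependent threat.

Insight: The core novelty is establishing Visual Exclusivity as a new threat model and reframing jailbreaking from turn-by-turn reaction to global plan synthesis. Optimization with GRPO lets the attack strategies be self-discovered without human supervision, which suggests a new direction for automated, adaptive multimodal red teaming and shows that frontier models still have major safety vulnerabilities against agentic multimodal attacks.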

Abstract: Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner to synthesize comprehensive, multi-turn strategies, optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. To rigorously benchmark this reasoning-dependent threat, we introduce VE-Safety, a human-curated dataset filling a critical gap in evaluating high-risk technical visual understanding. MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5x where existing methods largely fail. These findings reveal that frontier models remain vulnerable to agentic multimodal attacks, exposing a critical gap in current safety alignment. Warning: This paper contains potentially harmful content.


cs.IR [Back]

[201] OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis cs.IR | cs.AI | cs.CLPDF

Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie

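TL;DR: This paper presents OpenResearcher, a fully open-source, reproducible pipeline for generating long-horizon deep research trajectories. It decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and runs the search-and-browse loop over an offline corpus of 15 million documents using three browser primitives: search, open, and find. With GPT-OSS-120B as the teacher model, it synthesizes more than 97K trajectories, including many long-horizon trajectories with over 100 tool calls. Supervised fine-tuning of a 30B-A3B backbone on these trajectories reaches 54.8% accuracy on BrowseComp-Plus, a 34.0-point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch.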

Details

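Motivation: Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. Existing data collection pipelines, however, typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and hard to reproduce.

Result: After supervised fine-tuning, the 30B-A3B backbone reaches 54.8% accuracy on BrowseComp-Plus, 34.0 points above the base model, and remains competitive on BrowseComp, GAIA, and xbench-DeepSearch.

Insight: The main novelty is a fully offline, reproducible trajectory synthesis pipeline that decouples corpus bootstrapping from trajectory synthesis and uses explicit browser primitives, which removes the cost, stability, and reproducibility problems of depending on proprietary APIs. The offline environment also supports controlled analysis, yielding practical insights into deep research pipeline design, such as data filtering strategies and agent configuration choices.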

Abstract: Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER-AI-Lab/OpenResearcher.
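To make the three browser primitives concrete, here is a toy offline environment. It is only a sketch under simple assumptions: the `OfflineBrowser` class, its term-count ranking, and the two-document corpus are hypothetical stand-ins and do not reflect the released OpenResearcher code or its retriever.

```python
class OfflineBrowser:
    """Toy offline environment exposing search, open, and find over a fixed corpus."""

    def __init__(self, corpus):
        self.corpus = corpus      # maps doc_id -> document text
        self.current = None       # doc_id of the page currently open

    def search(self, query, k=3):
        """Rank documents by how often the query terms occur (assumed scoring)."""
        terms = query.lower().split()
        scored = []
        for doc_id, text in self.corpus.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score:
                scored.append((score, doc_id))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [doc_id for _, doc_id in scored[:k]]

    def open(self, doc_id):
        """Load a document and make it the current page."""
        self.current = doc_id
        return self.corpus[doc_id]

    def find(self, pattern):
        """Return sentences of the current page that contain the pattern."""
        if self.current is None:
            return []
        sentences = self.corpus[self.current].split(".")
        return [s.strip() for s in sentences if pattern.lower() in s.lower()]


corpus = {
    "doc1": "The Eiffel Tower is in Paris. It opened in 1889.",
    "doc2": "The Louvre is a museum in Paris.",
}
browser = OfflineBrowser(corpus)
hits = browser.search("eiffel tower")
page = browser.open(hits[0])
matches = browser.find("opened")
print(hits, matches)
```

Because every call runs against a fixed local corpus, the same sequence of tool calls always returns the same observations, which is what makes trajectories synthesized this way reproducible.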


[202] ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval cs.IR | cs.CVPDF

Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie

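TL;DR: This paper proposes ADaFuSE, a lightweight adaptive fusion model that improves how diffusion-generated images are fused with text feedback in interactive text-to-image retrieval. Using dynamic gating and a semantic-aware mixture-of-experts branch, it calibrates the multi-modal views, works as a plug-in, and achieves state-of-the-art performance on four standard benchmarks.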

Details

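Motivation: Existing interactive text-to-image retrieval frameworks fuse diffusion-generated multi-modal feedback views by simple embedding addition. This static, undifferentiated fusion indiscriminately incorporates the generative noise produced by the diffusion model and degrades performance for a large share of samples.

Result: Evaluation on four standard I-TIR benchmarks shows that ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% increase in parameters, and is more robust to noisy and longer interactive queries.

Insight: The novelty is a dual-branch fusion mechanism: an adaptive gating branch dynamically balances modality reliability, and a semantic-aware mixture-of-experts branch captures fine-grained cross-modal nuances. The core insight is that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.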

Abstract: Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
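The difference between static addition and gated fusion can be shown on two-dimensional toy embeddings. This is a deliberately simplified sketch: in the paper the gate is learned and a mixture-of-experts branch is also present, whereas here the gate logit is passed in by hand and the embeddings are made-up values.

```python
import math


def additive_fusion(text_emb, image_emb):
    """Static fusion: element-wise sum of the two views."""
    return [t + i for t, i in zip(text_emb, image_emb)]


def gated_fusion(text_emb, image_emb, gate_logit):
    """Convex combination of the views; the sigmoid gate is the weight on text."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * t + (1.0 - gate) * i for t, i in zip(text_emb, image_emb)]


# Hypothetical embeddings: the generated image drifts away from the text intent.
text_emb = [1.0, 0.0]
noisy_image_emb = [-1.0, 1.0]

added = additive_fusion(text_emb, noisy_image_emb)
# Assumed gate logit of 3.0, i.e. the gate judges the text view far more reliable.
gated = gated_fusion(text_emb, noisy_image_emb, gate_logit=3.0)
print(added, gated)
```

With plain addition the noisy image view cancels the text signal along the first dimension, while the gated version keeps the fused vector close to the text embedding when the gate trusts text more.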


cs.MA [Back]

[203] Measuring Reasoning Trace Legibility: Can Those Who Understand Teach? cs.MA | cs.AI | cs.CLPDF

Dani Roytburg, Shreya Sridhar, Daphne Ippolito

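TL;DR: This paper argues for evaluating the legibility of the reasoning produced by Reasoning Language Models (RLMs) and introduces transfer utility as a measure: how useful an RLM’s reasoning traces are for guiding a weaker, non-reasoning model to the correct answer. It finds that the highest-performing models have less legible reasoning, and that efficiency-based legibility measures (such as trace length) trade off against transfer utility, forming a legibility Pareto frontier.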

Details

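Motivation: Language models often improve answer correctness by emitting long reasoning traces, but the legibility of that reasoning is rarely evaluated. The paper aims to fill this gap and argues that reasoning traces should be able to guide other models.

Result: An evaluation of 90k reasoning traces from 12 RLMs finds that the reasoning traces of the highest-performing models rank among the lowest for legibility, and reveals tensions between efficiency-based legibility measures and transfer utility that establish a legibility Pareto frontier.

Insight: The novelty is proposing transfer utility as a legibility metric and showing that legibility is a task- and audience-dependent goal. The finding that reward models used to train RLMs do not intrinsically reward legibility points to a direction for optimizing reasoning traces in multi-agent collaboration.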

Abstract: Language models are increasingly being trained to “reason” before answering users’ queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models’ ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM’s reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM’s ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.
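One way to picture transfer utility and the resulting Pareto frontier is sketched below. The abstract does not give a formula, so treating transfer utility as the weaker model's accuracy gain when shown the trace is an assumption, and the model names and numbers are invented for illustration.

```python
def transfer_utility(acc_with_trace, acc_without_trace):
    """Assumed definition: accuracy gain of the weaker model when given the trace."""
    return acc_with_trace - acc_without_trace


def pareto_frontier(models):
    """Keep models not dominated on (fewer trace tokens, higher transfer utility)."""
    frontier = []
    for name, (tokens, utility) in models.items():
        dominated = any(
            other_tokens <= tokens
            and other_utility >= utility
            and (other_tokens < tokens or other_utility > utility)
            for other, (other_tokens, other_utility) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical (mean trace length in tokens, transfer utility) per model.
models = {
    "rlm_a": (400, 0.12),
    "rlm_b": (1500, 0.20),
    "rlm_c": (1600, 0.10),
    "rlm_d": (900, 0.12),
}
frontier = pareto_frontier(models)
gain = transfer_utility(0.62, 0.50)
print(frontier, round(gain, 2))
```

In this toy data the short-trace model and the high-utility model both survive on the frontier, which illustrates why neither brevity nor utility alone identifies the most legible reasoner.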


q-bio.MN [Back]

[204] GIP-RAG: An Evidence-Grounded Retrieval-Augmented Framework for Interpretable Gene Interaction and Pathway Impact Analysis q-bio.MN | cs.AI | cs.CL [PDF]

Fujian Jia, Jiwen Gu, Cheng Lu, Dezhi Zhao, Mengjiang Huang

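TL;DR: GIP-RAG is a computational framework that combines biomedical knowledge graphs with large language models (LLMs) to infer and interpret gene interactions and their impact on biological pathways. It builds a unified gene interaction knowledge graph by integrating several public databases, and uses retrieval-augmented generation (RAG) for interpretable multi-step reasoning that identifies direct and indirect regulatory relationships and assesses the potential impact of gene perturbations on pathway states.

Details

Motivation: Although public databases hold extensive molecular interaction and pathway data, integrating heterogeneous knowledge sources and performing interpretable multi-step reasoning across biological networks remains challenging. The paper addresses this problem in order to advance the understanding of disease mechanisms and precision medicine.

Result: Evaluated across diverse biological scenarios, the framework produces consistent, interpretable, and evidence-supported insights into gene regulatory mechanisms. The abstract reports no specific quantitative results (such as benchmarks or SOTA comparisons).

Insight: The contributions include: 1) integrating several authoritative databases into a unified gene interaction knowledge graph; 2) combining RAG with LLMs for interpretable stepwise reasoning; 3) extending to pathway-level functional impact analysis that simulates how gene perturbations propagate through the network. Viewed objectively, the method combines knowledge graphs with LLMs to provide a general, interpretable framework for mechanistic reasoning in complex molecular systems.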

Abstract: Understanding mechanistic relationships among genes and their impacts on biological pathways is essential for elucidating disease mechanisms and advancing precision medicine. Despite the availability of extensive molecular interaction and pathway data in public databases, integrating heterogeneous knowledge sources and enabling interpretable multi-step reasoning across biological networks remain challenging. We present GIP-RAG (Gene Interaction Prediction through Retrieval-Augmented Generation), a computational framework that combines biomedical knowledge graphs with large language models (LLMs) to infer and interpret gene interactions. The framework constructs a unified gene interaction knowledge graph by integrating curated data from KEGG, WikiPathways, SIGNOR, Pathway Commons, and PubChem. Given user-specified genes, a query-driven module retrieves relevant subgraphs, which are incorporated into structured prompts to guide LLM-based stepwise reasoning. This enables identification of direct and indirect regulatory relationships and generation of mechanistic explanations supported by biological evidence. Beyond pairwise interactions, GIP-RAG includes a pathway-level functional impact module that simulates propagation of gene perturbations through signaling networks and evaluates potential pathway state changes. Evaluation across diverse biological scenarios demonstrates that the framework generates consistent, interpretable, and evidence-supported insights into gene regulatory mechanisms. Overall, GIP-RAG provides a general and interpretable approach for integrating knowledge graphs with retrieval-augmented LLMs to support mechanistic reasoning in complex molecular systems.


cs.LG [Back]

[205] AE-LLM: Adaptive Efficiency Optimization for Large Language Models cs.LG | cs.CL [PDF]

Kaito Tanaka, Masato Ito, Yuji Nishimura, Keisuke Matsuda, Aya Nakayama

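TL;DR: This paper proposes AE-LLM, a unified framework for adaptive efficiency optimization of large language models. Through a multi-objective optimization procedure, it automatically selects and combines the best efficiency techniques (such as efficient attention, mixture-of-experts, parameter-efficient fine-tuning, and quantization) for a given task, resource budget, and hardware constraint, balancing accuracy, latency, memory footprint, and energy consumption.

Details

Motivation: Deploying large language models involves substantial computational cost, memory requirements, and energy consumption. The effectiveness of any single efficiency technique (efficient attention, MoE, quantization, and so on) varies with the task, the available resources, and the model scale, so no technique is universally optimal.

Result: In extensive experiments covering 15 models (0.5B-70B parameters) and 10 diverse tasks, AE-LLM achieves an average 2.8x improvement in efficiency metrics over static efficiency configurations while maintaining competitive accuracy (within 1.2% of baseline). The framework achieves similar efficiency gains on vision-language models.

Insight: The main contribution is a unified adaptive framework that uses multi-objective optimization and an efficient search algorithm to automatically explore and combine efficiency techniques across the architecture, fine-tuning, and inference stages. It finds Pareto-optimal configurations for a specific scenario and gives practitioners an automated tool for navigating the trade-offs of LLM efficiency optimization.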

Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse applications, yet their deployment remains challenging due to substantial computational costs, memory requirements, and energy consumption. Recent empirical studies have demonstrated that no single efficiency technique is universally optimal; instead, the effectiveness of methods such as efficient attention mechanisms, mixture-of-experts (MoE), parameter-efficient fine-tuning, and quantization varies significantly depending on task characteristics, resource constraints, and model scales. Building upon these insights, we propose AE-LLM, a unified framework that automatically selects and combines optimal efficiency techniques tailored to specific deployment scenarios. Our approach introduces a multi-objective optimization framework that jointly considers accuracy, latency, memory footprint, and energy consumption, while accounting for hardware constraints and task requirements. We develop an efficient search algorithm that explores the combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages, identifying Pareto-optimal configurations. Extensive experiments across 15 models (0.5B-70B parameters) and 10 diverse tasks demonstrate that AE-LLM achieves an average of $2.8\times$ improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline), compared to static efficiency configurations. Furthermore, our framework generalizes effectively to vision-language models, achieving similar efficiency gains. Our contributions provide practitioners with an automated tool for navigating the complex trade-off landscape of LLM efficiency optimization.
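
To make "Pareto-optimal configurations" concrete, here is a minimal sketch of a Pareto filter over candidate configurations scored on the four objectives the abstract names. The `Config` type, the configuration names, and all numbers are invented for illustration. This brute-force filter is not the paper's search algorithm, which the abstract does not specify.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A candidate deployment configuration (all fields are illustrative)."""
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_gb: float   # lower is better
    energy_j: float    # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.memory_gb <= b.memory_gb
        and a.energy_j <= b.energy_j
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.memory_gb < b.memory_gb
        or a.energy_j < b.energy_j
    )
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical candidates; the values are made up for this example.
candidates = [
    Config("fp16-dense", 0.80, 100.0, 16.0, 50.0),
    Config("int8-quant", 0.79, 60.0, 8.0, 30.0),
    Config("int8-slow", 0.78, 70.0, 9.0, 35.0),   # dominated by int8-quant
    Config("int4-quant", 0.74, 45.0, 5.0, 22.0),
]
front_names = [c.name for c in pareto_front(candidates)]
```

Here `int8-slow` is dropped because `int8-quant` is at least as good on all four objectives. The other three survive because each wins on at least one axis, which is why a single "best" configuration does not exist without a deployment-specific preference.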


[206] Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge cs.LG | cs.CL [PDF]

Bhavya Vasudeva, Puneesh Deora, Alberto Bietti, Vatsal Sharan, Christos Thrampoulidis

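TL;DR: This paper studies a specific form of in-context learning (ICL) in Transformer models, contextual recall: the ability to use pairwise examples to recall specific facts in novel prompt formats. Using a controlled synthetic framework, it finds that pretraining alone is insufficient for contextual recall, but that finetuning triggers its emergence, accompanied by the formation of low-dimensional latent encodings.

Details

Motivation: The paper asks whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations, in order to understand how finetuning enables Transformers to reason in context over pretraining knowledge.

Result: In the synthetic framework, pretrained models successfully acquire factual knowledge but fail to implicitly infer attribute types when grammar statistics are removed from ICL prompts. After finetuning, models achieve contextual recall across all subjects, and a construction for an attention-only transformer replicates the transition from factual to contextual recall.

Insight: The novelty is showing that contextual recall depends on finetuning rather than pretraining alone, and identifying low-dimensional latent encodings as a key mechanism. Viewed objectively, this offers an interpretable synthetic experimental framework and a theoretical construction for understanding ICL mechanisms in Transformers.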

Abstract: Transformer-based language models excel at in-context learning (ICL), where they can adapt to new tasks based on contextual examples, without parameter updates. In a specific form of ICL, which we refer to as contextual recall, models pretrained on open-ended text leverage pairwise examples to recall specific facts in novel prompt formats. We investigate whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations. For this, we introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples, with attribute types tied to grammar statistics. We demonstrate that while such pretraining successfully yields factual knowledge, it is insufficient for contextual recall: models fail to implicitly infer attribute types when the grammar statistics are removed in ICL prompts. However, we show that finetuning on tasks requiring implicit inference, distinct from the ICL evaluation, using a subset of subjects, triggers the emergence of contextual recall across all subjects. This transition is accompanied by the formation of low-dimensional latent encodings of the shared attribute type. For mechanistic insight, we derive a construction for an attention-only transformer that replicates the transition from factual to contextual recall, corroborated by empirical validation.


[207] PLR: Plackett-Luce for Reordering In-Context Learning Examples cs.LG | cs.CL [PDF]

Pawel Batorski, Paul Swoboda

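TL;DR: This paper proposes PLR, a probabilistic method for optimizing the order of examples in large language model in-context learning (ICL). It uses the Plackett-Luce model to learn a probability distribution over orderings and iteratively updates the parameters to concentrate probability mass on high-performing orderings, avoiding exhaustive search over the n! possible orderings.

Details

Motivation: ICL performance is often highly sensitive to example order, but exhaustive search over all orderings is infeasible. Existing methods either use model confidence measures or search directly for the best ordering, and are limited in efficiency or applicability. The paper aims for a more efficient, general probabilistic method for ordering ICL examples.

Result: On multiple classification benchmarks, PLR consistently improves accuracy in few-shot settings with k ∈ {4, 8, 16, 32} examples. It also yields gains on mathematical reasoning tasks where label-based ordering methods are not applicable.

Insight: The main contribution is recasting discrete ordering search as learning a probability distribution over orderings, using the Plackett-Luce model with a Gumbel perturb-and-sort procedure for efficient sampling. This gives a general, learnable framework for optimizing ICL example order, particularly for tasks without explicit labels.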

Abstract: In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the $n!$ possible orderings is infeasible. More efficient ordering methods therefore either use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for $k \in \{4, 8, 16, 32\}$ examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.
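
The Gumbel perturb-and-sort step mentioned in the abstract can be sketched as follows. This is a minimal, standard-library illustration of sampling from a Plackett-Luce distribution and scoring an ordering under it. The function names are hypothetical, and the sketch deliberately omits PLR's parameter update rule, which the abstract does not describe.

```python
import math
import random
from typing import List, Sequence


def sample_ordering(log_weights: Sequence[float], rng: random.Random) -> List[int]:
    """Draw one ordering from a Plackett-Luce distribution.

    Adding independent Gumbel(0, 1) noise to each log-weight and sorting in
    descending order yields an exact Plackett-Luce sample.
    """
    keys = []
    for index, log_w in enumerate(log_weights):
        u = max(rng.random(), 1e-300)      # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keys.append((log_w + gumbel, index))
    keys.sort(reverse=True)
    return [index for _, index in keys]


def log_prob(ordering: Sequence[int], log_weights: Sequence[float]) -> float:
    """Plackett-Luce log-probability of an ordering.

    At each position, the chosen item's weight is divided by the total weight
    of the items not yet placed.
    """
    remaining = list(ordering)
    total = 0.0
    for index in ordering:
        log_denominator = math.log(sum(math.exp(log_weights[j]) for j in remaining))
        total += log_weights[index] - log_denominator
        remaining.remove(index)
    return total


# Illustrative parameters: example 0 is strongly preferred for the first slot.
rng = random.Random(0)
log_weights = [5.0, 0.0, 0.0, 0.0]
samples = [sample_ordering(log_weights, rng) for _ in range(2000)]
first_is_zero = sum(1 for s in samples if s[0] == 0)
```

With these weights, example 0 should land in the first position in roughly 98% of samples, since its first-place probability is e^5 / (e^5 + 3). Sampling costs one sort per ordering, which is what makes it practical compared with enumerating all n! orderings.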


[208] Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization cs.LG | cs.AI | cs.CL [PDF]

Hung-Hsuan Chen

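TL;DR: This paper proposes a depth-recurrent Transformer that iterates a shared-weight Transformer block in latent space, decoupling computational depth from parameter count so that the model can reason more deeply at inference time by adding recurrence steps. Three mechanisms (a silent thinking objective, LayerScale initialization, and identity-biased recurrence) stabilize deep recurrence (20+ steps), and the architecture is evaluated on three compositional reasoning tasks: graph reachability, nested boolean logic, and unstructured relational text.

Details

Motivation: Standard Transformers have a fixed computational depth, which limits generalization on tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. The paper aims to let the model adjust its reasoning depth to task complexity.

Result: On the three compositional reasoning tasks (graph reachability, nested boolean logic, unstructured relational text), performance jumps from chance to near-perfect once the number of thinking steps matches task complexity, revealing a clear "computational frontier". The model shows different generalization behaviors across tasks: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text).

Insight: The main contribution is an architecture that pairs a task-invariant recurrent reasoning core with task-specific perceptual interfaces, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm. Its mechanisms for stabilizing deep recurrence (such as the silent thinking objective) are useful reference points for designing models that need multi-step reasoning.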

Abstract: Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier: a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
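
The idea of applying one shared block for a variable number of steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's architecture: the residual form `state + scale * block(state)` is one plausible reading of "identity-biased recurrence" combined with a small LayerScale-style branch scale, and `toy_block` is a hypothetical stand-in for the shared-weight Transformer block.

```python
from typing import Callable, List

Vector = List[float]


def recur(state: Vector, block: Callable[[Vector], Vector],
          steps: int, scale: float = 0.1) -> Vector:
    """Apply the same block `steps` times with a scaled residual update.

    Assumption: each step computes state + scale * block(state). With a small
    `scale`, every step stays close to the identity map, so depth can grow
    without adding parameters.
    """
    for _ in range(steps):
        update = block(state)
        state = [h + scale * u for h, u in zip(state, update)]
    return state


def toy_block(state: Vector) -> Vector:
    """Hypothetical stand-in for the shared block: pulls the state toward zero."""
    return [-h for h in state]


shallow = recur([4.0], toy_block, steps=1, scale=0.5)
deeper = recur([4.0], toy_block, steps=2, scale=0.5)
frozen = recur([4.0, -1.0], toy_block, steps=20, scale=0.0)
```

The same `toy_block` is reused at every step, so `steps` is purely an inference-time knob. With `scale=0.0` the recurrence is exactly the identity, which is the limiting case that an identity bias keeps the untrained model near.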


[209] Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe cs.LG | cs.CL [PDF]

Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu

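TL;DR: Using TravelPlanner, a complex multi-turn tool-use environment, this paper conducts a systematic empirical study of the design space for reinforcement learning in long-horizon agent training and distills a practical recipe.

Details

Motivation: The paper addresses the practical problem of applying reinforcement learning effectively in complex, multi-turn environments to train autonomous agents with long-horizon planning ability, for which a scalable, practical recipe is still lacking.

Result: Models trained with RL using the distilled recipe achieve state-of-the-art performance on the TravelPlanner testbed, significantly outperforming leading large language models.

Insight: The novelty is decomposing the agentic RL design space along five axes (reward shaping, model scaling, data composition, algorithm selection, and environmental stability) and deriving seven key takeaways from controlled experiments, for example that reward and algorithm choices are scale-dependent and that about 1K training samples with a balanced difficulty mixture mark a sweet spot for performance.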

Abstract: Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent, as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.


[210] ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention cs.LG | cs.AI | cs.CL [PDF]

Xinyan Wang, Xiaogeng Liu, Chaowei Xiao

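TL;DR: This paper proposes ROM, which mitigates overthinking in Large Reasoning Models (LRMs) in real time through streaming detection and intervention. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone, monitors tokens in real time, and triggers an early transition to the final answer once overthinking is detected, reducing reasoning steps, latency, and compute cost.

Details

Motivation: Large Reasoning Models overthink when generating long chains of thought: they keep producing redundant reasoning steps after reaching the correct answer, which increases latency and compute cost and can cause answer drift. Existing mitigation methods either need training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns.

Result: Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, ROM reduces response length by 47.2% and improves efficiency by 121%.

Insight: The novelty is being the first to formulate overthinking mitigation as a streaming prediction-and-control problem, together with token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Viewed objectively, ROM's lightweight detection head and real-time intervention detect and mitigate overthinking without modifying the backbone, offering a new route to more efficient large reasoning models.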

Abstract: Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
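
The streaming prediction-and-control loop described above can be sketched as follows. Everything here is a hypothetical stand-in: `token_stream` plays the role of the frozen backbone yielding a token plus a hidden-state summary, `overthinking_score` plays the role of the lightweight detection head, and the `threshold` and `answer_marker` values are illustrative rather than taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def generate_with_early_exit(
    token_stream: Iterable[Tuple[str, float]],
    overthinking_score: Callable[[float], float],
    threshold: float = 0.5,
    answer_marker: str = "<answer>",
) -> List[str]:
    """Consume tokens one at a time and cut reasoning short when flagged.

    After each token, the detector scores the current hidden state. Once the
    score reaches `threshold`, the loop appends `answer_marker` (a hypothetical
    cue to switch to the final answer) and stops reading reasoning tokens.
    """
    emitted: List[str] = []
    for token, hidden in token_stream:
        emitted.append(token)
        if overthinking_score(hidden) >= threshold:
            emitted.append(answer_marker)
            break
    return emitted


# Toy stream: the second element of each pair stands in for a hidden state.
stream = [("step1", 0.1), ("step2", 0.2), ("recheck", 0.9), ("recheck-again", 0.95)]
trimmed = generate_with_early_exit(stream, overthinking_score=lambda h: h)
untrimmed = generate_with_early_exit(stream, overthinking_score=lambda h: 0.0)
```

Because the detector only reads hidden states and the intervention only truncates generation, the backbone weights are never touched, which matches the frozen-backbone setting the abstract describes.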


[211] Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations cs.LG | cs.AI | cs.CV [PDF]

Liu hung ming

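TL;DR: This paper proposes AI Mother Tongue (AIM), a passive quantization probing framework that converts the continuous latent vectors learned by the V-JEPA 2 video world model into discrete symbol sequences, revealing the physical structure in its latent representations without modifying or supervising the encoder.

Details

Motivation: Video world models trained with Joint Embedding Predictive Architectures (JEPA), such as V-JEPA 2, have a structural interpretability gap: the rich spatiotemporal physical structure learned by the encoder cannot be attributed or explained through the visual verification pathway of generative models, nor through existing probing methods (continuous-space probes or probes with attached generative components).

Result: Category-contrast experiments on Kinetics-mini along three physical dimensions (grasp angle, object geometry, and motion temporal structure) show that AIM symbol distributions differ significantly in every experiment (chi-square test p < 10^{-4}; mutual information of 0.036 to 0.117 bits; normalized mutual information of 1.2% to 3.9% of the 3-bit maximum; Jensen-Shannon divergence up to 0.342; codebook active ratio 62.5%). This indicates that the V-JEPA 2 latent space is markedly compact, with semantic differences encoded as graded distributional variations rather than categorical boundaries.

Insight: The novelty is a lightweight, vocabulary-free passive quantization probe (AIM) that surfaces discrete symbols and structured manifolds from the JEPA latent space without modifying the frozen encoder or introducing task-specific supervision. This lays groundwork for an action-conditioned symbolic world model and shows that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.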

Abstract: Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations, not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments ($\chi^2$ test, $p < 10^{-4}$; MI 0.036–0.117 bits, NMI 1.2–3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
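
The statistics quoted in the abstract (MI, NMI, JSD) can be computed from a category-by-symbol count table. The sketch below uses the textbook definitions in bits. The function names are hypothetical, the tiny count tables are invented, and reading the "3-bit maximum" as the normalizer for NMI is an assumption inferred from the reported numbers (0.036 / 3 ≈ 1.2% and 0.117 / 3 ≈ 3.9%), not something the abstract states as a formula.

```python
import math
from typing import List, Sequence


def entropy_bits(p: Sequence[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two symbol distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy_bits(mixture) - (entropy_bits(p) + entropy_bits(q)) / 2


def mutual_information_bits(table: List[List[int]]) -> float:
    """MI between category (rows) and symbol (columns) from raw counts."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(row[s] for row in table) / total for s in range(len(table[0]))]
    mi = 0.0
    for c, row in enumerate(table):
        for s, count in enumerate(row):
            if count > 0:
                joint = count / total
                mi += joint * math.log2(joint / (row_p[c] * col_p[s]))
    return mi


# Invented count tables: identical usage vs. fully separated usage.
independent = mutual_information_bits([[10, 10], [10, 10]])
separated = mutual_information_bits([[20, 0], [0, 20]])

# Consistency check of the reported ranges under the 3-bit normalizer.
nmi_low_percent = round(0.036 / 3 * 100, 1)
nmi_high_percent = round(0.117 / 3 * 100, 1)
```

On this scale, reported MI values of 0.036 to 0.117 bits sit far closer to the "independent" end than the "separated" end, which is consistent with the abstract's reading that categories differ by graded shifts in symbol usage rather than by disjoint symbol sets.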


[212] SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models cs.LG | cs.CV [PDF]

Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif

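TL;DR: This paper proposes SSAM (Singular Subspace Alignment and Merging), a training-free model merging framework that merges independently trained multimodal large language models (such as vision-language and audio-language models) into a single model able to handle any combination of input modalities. It identifies and aligns a shared low-rank subspace of language-related parameters, preserving complementary knowledge while avoiding parameter interference, and needs no multimodal training data.

Details

Motivation: Building or extending multimodal large language models usually requires large paired datasets and substantial compute, while publicly available pretrained MLLMs (vision-language, audio-language) each focus on different modalities. The paper asks how to merge these independently trained specialists into a single multimodal model, overcoming differences in learned representations and interference between parameter spaces.

Result: Experiments on four datasets show that SSAM, without any multimodal training data, achieves state-of-the-art performance, surpassing prior training-free merging methods and even jointly trained multimodal models.

Insight: The novelty is a training-free merging framework based on parameter-space alignment. By identifying a shared low-rank subspace of language-related parameters and aligning within it, SSAM reduces cross-modality parameter interference and offers a scalable, resource-efficient alternative for multimodal model integration.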

Abstract: Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately, identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.


[213] dynActivation: A Trainable Activation Family for Adaptive Nonlinearity cs.LG | cs.CV [PDF]

Alois Bachmann

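TL;DR: This paper proposes dynActivation, a family of trainable activation functions of the form $f_i(x) = \mathrm{BaseAct}(x)(α_i - β_i) + β_i x$, where $α_i$ and $β_i$ are learnable scalars that interpolate between a base nonlinearity (such as a ReLU-like function) and a linear path. The activation is evaluated on multiple vision tasks, language modeling tasks, and ablation studies, and the results suggest that it linearizes deep layers and improves training efficiency.

Details

Motivation: The aim is an adaptive activation function whose lightweight learnable parameters let the network adjust its degree of nonlinearity by layer depth and task, addressing the performance degradation or inefficient training that fixed activations (such as ReLU) can show in deep networks.

Result: On CIFAR-10, dynActivation(Mish) improves over static Mish by up to +14.02% on AttentionCNN, with an average gain of +6.00% and a 24% reduction in convergence AUC. In an MNIST depth-scaling experiment (1 to 75 layers), dynActivation stays above 95% test accuracy throughout (95.3%–99.3%), while ReLU collapses below 80% at 25 layers. Under an FGSM attack ($ε=0.08$), the accuracy drop of dynActivation(Mish) is 7.40 percentage points smaller than that of ReLU (55.39% vs. 62.79%). In language modeling, the proposed dynActGLU variant achieves a 10.3% relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), although the gap vanishes by 34300 steps.

Insight: The novelty is a general, lightweight framework for trainable activations that uses interpolation parameters to balance nonlinearity and linearity dynamically, improving the stability and efficiency of deep networks. Viewed objectively, the method shows robust performance gains and potential training speedups across several benchmarks, most notably under adversarial attack and depth scaling, and offers a new direction for adaptive activation design.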

Abstract: This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(α_i - β_i) + β_i x$, where $α_i$ and $β_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3\%$–$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40$ percentage-point advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
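
The activation formula above is simple enough to evaluate by hand, so a scalar sketch may help. The code implements $f(x) = \mathrm{BaseAct}(x)(α - β) + β x$ exactly as written in the abstract, with `alpha` and `beta` passed as plain floats rather than trained parameters. The function names are hypothetical, and the final lines only re-derive three percentages from the figures quoted in the abstract.

```python
import math
from typing import Callable


def relu(x: float) -> float:
    return max(0.0, x)


def mish(x: float) -> float:
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


def dyn_activation(x: float, alpha: float, beta: float,
                   base_act: Callable[[float], float] = relu) -> float:
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x, per the abstract."""
    return base_act(x) * (alpha - beta) + beta * x


# alpha=1, beta=0 recovers the base activation.
pure_relu = dyn_activation(-2.0, alpha=1.0, beta=0.0)
# alpha == beta removes the nonlinearity: f(x) = beta * x.
fully_linear = dyn_activation(-2.0, alpha=1.0, beta=1.0)
# In between, a ReLU base gives slope alpha for x > 0 and slope beta for x < 0.
negative_side = dyn_activation(-2.0, alpha=1.0, beta=0.5)
positive_side = dyn_activation(3.0, alpha=1.0, beta=0.5)

# Re-deriving reported percentages from the quoted raw figures.
perplexity_reduction = round((4.514 - 4.047) / 4.514 * 100, 1)
auc_reduction = round((2785 - 2120) / 2785 * 100)
fgsm_gap_points = round(62.79 - 55.39, 2)
```

With a ReLU base, the formula reduces to a two-slope piecewise-linear function, so "linearizing a layer" corresponds to `beta` drifting toward `alpha`. The arithmetic at the end also shows that the quoted FGSM advantage is a difference in percentage points, not a relative change.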