cs.CL

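[1] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models cs.CL | cs.AI | cs.LG

Hongji Li, Junchi Yao, Manjiang Yu, Priyanka Singh, Xue Li

TL;DR: This paper introduces RMLLMU-Bench, the first unlearning benchmark for reasoning multimodal large language models (RMLLMs), which evaluates how well methods balance suppressing information leakage in reasoning chains against preserving general reasoning ability. To address the shortcomings of existing methods on this benchmark, the authors propose R-MUSE, a training-free, inference-time intervention framework that uses subspace guidance and adaptive steering to forget both answers and reasoning traces while explicitly protecting general reasoning ability.

Details

Motivation: Machine unlearning for RMLLMs poses a distinctive challenge: even when the final answer is forgotten, intermediate chain-of-thought steps can still leak sensitive information, while overly aggressive interventions easily damage the model's general reasoning ability. No existing benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning ability.

Result: A systematic evaluation on RMLLMU-Bench shows that existing unlearning methods for MLLMs or large reasoning models either leave substantial leakage in the reasoning process or severely degrade reasoning performance. R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention on this benchmark.

Insight: The main contributions are: 1) RMLLMU-Bench, the first benchmark dedicated to evaluating RMLLM unlearning, which extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention; 2) R-MUSE, a training-free inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly protecting general reasoning, offering a new approach to reasoning-preserving unlearning.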

Abstract: Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.


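[2] Graph-O1: Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning cs.CL | cs.AI

Lihui Liu

TL;DR: This paper proposes Graph-O1, a framework that combines Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning so that large language models (LLMs) can reason step by step and interactively over text-attributed graphs. It targets the fragmented reasoning and reduced accuracy that existing graph question answering methods suffer because they either ignore graph structure or run into context-length limits.

Details

Motivation: In question answering over text-attributed graphs, existing retrieval-augmented generation methods either treat text passages as isolated units and ignore graph structure, or serialize large subgraphs into long text that exceeds LLM context-length limits, leading to incomplete reasoning and lower accuracy.

Result: Extensive experiments across multiple LLM backbones show that Graph-O1 consistently outperforms state-of-the-art baselines, producing more accurate, reliable, and interpretable answers.

Insight: The core innovation is modeling graph reasoning as a multi-turn interaction between an agent and a graph environment, and integrating MCTS with reinforcement learning so the model can selectively explore and retrieve the most informative subgraph components, yielding more efficient and structured graph reasoning.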

Abstract: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.
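The abstract does not say which selection rule Graph-O1 uses inside MCTS. As a rough illustration only, the sketch below applies the standard UCT rule to pick which neighboring graph node to expand next. The function names, the exploration constant `c=1.4`, and the `(total_reward, visits)` bookkeeping are all assumptions made for this example, not details from the paper.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Standard UCT: average reward plus an exploration bonus.

    Unvisited children get an infinite score so each is tried once.
    """
    if visits == 0:
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Pick the child with the highest UCT score.

    `children` maps a (hypothetical) graph node id to (total_reward, visits).
    """
    return max(
        children,
        key=lambda name: uct_score(*children[name], parent_visits, c),
    )
```

With equal visit counts the rule prefers the child with the higher average reward; a child that has never been visited is always selected first.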


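[3] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression cs.CL | cs.MA

Boris Kriuk, Logic Ng

TL;DR: This paper proposes Q-KVComm, a protocol for efficient communication between multi-agent LLM systems via adaptive KV cache compression. It combines adaptive layer-wise quantization, hybrid information extraction, and heterogeneous model calibration to transmit compressed key-value cache representations directly, avoiding redundant text transmission and repeated computation. Experiments on several question-answering datasets show that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity.

Details

Motivation: Multi-agent LLM systems consume excessive bandwidth and compute because they transmit contextual information redundantly: traditional approaches discard internal semantic representations and send raw text, forcing the receiving agent to recompute similar representations from scratch.

Result: In extensive experiments on three different QA datasets, Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 in all scenarios. The protocol is robust across model sizes (1.1B-1.5B parameters) and applies to practical use cases such as conversational QA and multi-hop reasoning.

Insight: The contributions are: 1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling; 2) hybrid information extraction that preserves critical facts across content domains; 3) heterogeneous model calibration that establishes cross-architecture communication. The shift from text exchange to representation-based exchange sets up a new paradigm for LLM agent communication.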

Abstract: Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
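To make the idea of sensitivity-based bit allocation concrete, here is a minimal sketch. It is not the Q-KVComm algorithm, whose details the abstract does not give. It assumes a simple rule (the most sensitive quarter of layers gets 8 bits, the rest 4 bits) and plain min-max uniform quantization; the function names, bit-widths, and `keep_fraction` are hypothetical.

```python
def allocate_bits(sensitivities, low=4, high=8, keep_fraction=0.25):
    """Assign `high` bits to the most sensitive layers and `low` bits to the rest."""
    n_high = max(1, round(len(sensitivities) * keep_fraction))
    ranked = sorted(
        range(len(sensitivities)),
        key=lambda i: sensitivities[i],
        reverse=True,
    )
    high_layers = set(ranked[:n_high])
    return [high if i in high_layers else low for i in range(len(sensitivities))]


def quantize(values, bits):
    """Uniform min-max quantization; returns (integer codes, minimum, step)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * step for c in codes]
```

Each reconstructed value differs from the original by at most half a quantization step, so layers given more bits are reconstructed more precisely.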


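[4] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression cs.CL | cs.AI

Rahul Baxi

TL;DR: The paper introduces the Compression-Decay Comprehension Test (CDCT), a new benchmark that independently evaluates the constraint compliance and semantic accuracy of large language models under prompt compression. It finds that constraint violations are most severe at medium compression and that RLHF-trained “helpfulness” behavior is the main cause.

Details

Motivation: To investigate the mechanisms behind LLM performance degradation under prompt compression and to understand how constraint compliance and semantic accuracy behave as separate dimensions, with the aim of improving instruction following in deployed systems.

Result: Nine frontier LLMs were evaluated on CDCT. Constraint compliance shows a universal U-curve pattern (97.2% of cases), with violations peaking at medium compression (c=0.5). RLHF ablation experiments confirm that removing “helpfulness” signals improves constraint compliance by 598% on average. Reasoning models outperform efficient models by 27.5%.

Insight: The contribution lies in decomposing instruction following into two orthogonal dimensions, constraint compliance and semantic accuracy, and measuring them independently. This reveals a fundamental tension between RLHF alignment objectives and strict instruction following, and points to concrete directions for improving systems (for example, adjusting alignment signals).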

Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss’ $\kappa$=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing “helpfulness” signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen’s d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
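For readers unfamiliar with the agreement statistic quoted above, the following sketch computes Fleiss' kappa from a table of rating counts. The formula is the standard one; the input layout (one row per rated response, one column per label such as compliant or non-compliant) is an assumption for illustration and is not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)
```

With three judges, each row sums to 3; for example, `[[3, 0], [0, 3]]` describes two items on which all judges agree, which gives a kappa of 1.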


Shubham Kumar Nigam, Tanuj Tyagi, Siddharth Shukla, Aditya Kumar Guru, Balaramamahanthi Deepak Patnaik

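TL;DR: This paper proposes ReGal, the first exploration of PPO-based reinforcement learning for Indian legal AI, applied to judgment prediction and legal document summarization. The framework combines multi-task instruction tuning with reinforcement learning from AI feedback. Although it underperforms supervised and proprietary models on evaluation metrics, it provides important insights into applying RL to legal text.

Details

Motivation: To address the challenges of applying reinforcement learning to legal AI, particularly in the Indian legal context, and to explore how RL methods can be used for high-stakes, long-document legal tasks such as judgment prediction and document summarization.

Result: On Court Judgment Prediction and Explanation (CJPE) and legal document summarization, ReGal underperforms supervised and proprietary models on standard evaluation metrics, but empirical and qualitative analysis demonstrates the potential of RL for long-document legal tasks.

Insight: The contribution is the first application of a PPO-based RLAIF framework to Indian legal AI, which exposes key challenges including reward model alignment, legal language complexity, and domain-specific adaptation, and lays a foundation for building interpretable and adaptive legal AI systems.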

Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.


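[6] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation cs.CL | cs.CY

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes

TL;DR: By analyzing vocabulary growth in conversations between strangers on video chat and between fictional characters in movies, this study finds that vocabulary size grows with conversation length following Heaps' law, and that the growth rate differs across parts of speech.

Details

Motivation: To investigate the statistical regularities of vocabulary growth in naturalistic and fictional conversation, in particular whether Heaps' law holds in conversational settings and how language features affect vocabulary growth patterns.

Result: The rate of vocabulary growth differs by part of speech, and this is observed in both conversational mediums. No specific benchmarks or comparisons with the state of the art are mentioned.

Insight: The contribution is applying Heaps' law to conversation analysis and showing that part of speech affects vocabulary growth, offering a statistical-linguistics perspective on conversational dynamics.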

Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps’ law, which holds that vocabulary size scales with document length. Little work on Heaps’ law has looked at conversation and considered how language features impact scaling. We measure Heaps’ law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
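Heaps' law is usually written as V(n) = K * n^beta, where n is the number of tokens seen so far and V(n) the number of distinct word types. The sketch below estimates K and beta by ordinary least squares in log-log space. It is one common way to fit the law, not necessarily the procedure used in the paper, and the function names are made up for this example.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: tokens seen so far and distinct types seen so far."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta).

    Needs at least two points with different n.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

To compare parts of speech, one could filter the token stream to a single tag (for example nouns only) before calling `vocabulary_growth`, then compare the fitted exponents; the tagging step is outside this sketch.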


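[7] Training LLMs with LogicReward for Faithful and Rigorous Reasoning cs.CL

Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang

TL;DR: This paper proposes LogicReward, a new reward system that uses a theorem prover to enforce step-level logical correctness when training large language models for faithful and rigorous reasoning. It also introduces Autoformalization with Soft Unification to reduce natural-language ambiguity and improve formalization quality. Experiments show that an 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini on natural language inference and logical reasoning tasks, improves reasoning faithfulness and generalization, and provides a reliable reward signal even without ground-truth labels.

Details

Motivation: Existing LLM training methods rely mainly on outcome-based feedback, which can yield correct answers with flawed reasoning. Prior work adds supervision on intermediate steps but still lacks guarantees of logical soundness, which matters most in high-stakes scenarios where logical consistency is paramount.

Result: On natural language inference and logical reasoning tasks, an 8B model trained on data constructed with LogicReward, using a simple training procedure, surpasses GPT-4o and o4-mini by 11.6% and 2%, respectively. Further analysis shows improved reasoning faithfulness and generalization to unseen math and commonsense reasoning tasks.

Insight: The core innovation is integrating a theorem prover into the reward system to supervise step-level logical correctness during training. The proposed Autoformalization with Soft Unification reduces natural-language ambiguity and improves formalization quality, making fuller use of the theorem prover. This offers a reusable technical path toward more rigorous and reliable LLM reasoning.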

Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.


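[8] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation cs.CL

Guo Chen, Junjie Huang, Huaijin Xie, Fei Sun, Tao Jia

TL;DR: This paper proposes LiR$^3$AG, a lightweight rerank reasoning strategy framework that targets the high computational overhead reasoning models incur on multi-hop question answering in retrieval-augmented generation (RAG). By restructuring retrieved evidence into coherent reasoning chains, the framework lets non-reasoning models transfer reasoning strategies, improving performance while substantially reducing output-token overhead and inference time.

Details

Motivation: To address the trade-off between the high computational cost (token consumption and inference latency) that reasoning models introduce in RAG multi-hop QA and the performance gains they bring, and to find a more efficient alternative.

Result: On RAG multi-hop QA tasks, LiR$^3$AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving the F1 of an 8B non-reasoning model by 6.2% to 22.5%, surpassing a 32B reasoning model.

Insight: The contribution is identifying two main reasoning modes of reasoning models (Context-Grounded Reasoning and Knowledge-Reconciled Reasoning) and designing a lightweight framework that restructures retrieved evidence into reasoning chains so that non-reasoning models can inherit reasoning ability, keeping performance high while sharply reducing compute cost and offering a scalable option for RAG systems.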

Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving the F1 score of an 8B non-reasoning model by 6.2% to 22.5%, surpassing a 32B reasoning model in RAG and offering a practical and efficient path forward for RAG systems.


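[9] Towards Efficient Agents: A Co-Design of Inference Architecture and System cs.CL

Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu

TL;DR: This paper proposes AgentInfer, a framework that accelerates LLM-based agents by co-designing the inference architecture and the system. It comprises four synergistic components: AgentCollab for hierarchical dual-model reasoning, AgentSched for cache-aware hybrid scheduling, AgentSAM, a suffix-automaton-based speculative decoding method, and AgentCompress, an asynchronous semantic compression mechanism. Together they form a Self-Evolution Engine that improves efficiency on long-horizon reasoning tasks.

Details

Motivation: LLM-based agents are severely inefficient in real-world deployment. The bottleneck is not isolated model inference but the systemic latency that accumulates across reasoning loops, context growth, and heterogeneous tool interactions.

Result: On the BrowseComp-zh and DeepDiver benchmarks, the combined methods in AgentInfer reduce ineffective token consumption by over 50% and achieve an overall 1.8-2.5x speedup while preserving accuracy.

Insight: The contribution is shifting the optimization target for agents from per-token throughput to task-completion efficiency, achieving system-level acceleration through architecture-system co-design (dynamic role assignment, cache-aware scheduling, semantic memory reuse, and asynchronous memory compression), and building a Self-Evolution Engine that sustains cognitive stability.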

Abstract: The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy. These results underscore that optimizing for agentic task completion, rather than merely per-token throughput, is the key to building scalable, efficient, and self-improving intelligent systems.
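The idea behind suffix-based speculative decoding can be shown with a much simpler stand-in for AgentSAM. The sketch below does not build a suffix automaton; it uses a brute-force scan (an explicit simplification) to find the longest suffix of the current context that already occurred in earlier token history, and proposes the tokens that followed it as a draft. A second helper keeps only the draft prefix that the target model confirms. All names and default lengths are hypothetical.

```python
def propose_draft(history, context, max_suffix=8, draft_len=4):
    """Propose draft tokens by matching the longest context suffix in history.

    Brute-force search; a suffix automaton would answer the same query
    without rescanning the history each time.
    """
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-k:]
        # Scan from the most recent occurrence backwards.
        for start in range(len(history) - k, -1, -1):
            if history[start:start + k] == suffix:
                follow = history[start + k:start + k + draft_len]
                if follow:
                    return follow
    return []


def accept_prefix(draft, verified):
    """Keep draft tokens that match the target model's tokens, stopping at
    the first mismatch."""
    accepted = []
    for drafted, confirmed in zip(draft, verified):
        if drafted != confirmed:
            break
        accepted.append(drafted)
    return accepted
```

In a real decoder, `verified` would come from a single forward pass of the target model over the drafted positions; here it is just a list supplied by the caller.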


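[10] An Agentic AI Framework for Training General Practitioner Student Skills cs.CL | cs.AI

Victor De Marez, Jens Van Nooten, Luna De Bruyne, Walter Daelemans

TL;DR: This paper proposes an agentic framework for training general practitioner student skills. It integrates configurable, evidence-based vignette generation, controlled persona-driven patient dialogue, and standards-based assessment and feedback, aiming to address the shortcomings of current virtual simulated patients in medical accuracy, roleplay consistency, and educational feedback.

Details

Motivation: Current virtual simulated patients suffer from insufficient medical accuracy, inconsistent roleplaying, difficulty generating scenarios, and poorly structured educational feedback, so a more dependable and pedagogically valuable training tool is needed.

Result: In an evaluation with 14 medical students in an interactive spoken consultation setting, participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability.

Insight: The contribution is separating scenario control, interaction control, and standards-based assessment into distinct agents, a pattern for building dependable and pedagogically valuable virtual simulated patient training tools that emphasizes configurability, evidence grounding, and standards alignment.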

Abstract: Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students ($N=14$). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.


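[11] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling cs.CL | cs.AI | cs.LG

Christopher Román Jaimes

TL;DR: This paper proposes an automated, scalable pipeline to reduce the reliance of natural language inference (NLI) models on spurious correlations. The pipeline introduces Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts, generates a high-quality synthetic contrast set through an LLM-based synthesis pipeline with multi-judge verification, and uses Dynamic Balanced Sampling as a training strategy to prevent catastrophic forgetting.

Details

Motivation: NLI models often rely on spurious correlations rather than semantic reasoning, and existing mitigation strategies typically incur high annotation costs or trigger catastrophic forgetting during fine-tuning. This paper aims to address these limitations.

Result: On a challenging benchmark, the method improves consistency from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.

Insight: The contribution is combining automated spurious-correlation detection (LF-LMI), LLM-synthesized high-quality contrast data, and Dynamic Balanced Sampling to balance new and existing knowledge, improving robustness while avoiding forgetting. This provides an efficient and scalable way to mitigate spurious correlations in NLI.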

Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.


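[12] Teaching and Critiquing Conceptualization and Operationalization in NLP cs.CL

Vagrant Gautam

TL;DR: This paper describes a seminar designed for NLP students that examines how abstract concepts commonly used in the field (such as “interpretability,” “bias,” “reasoning,” and “stereotypes”) are defined and measured. Through an interdisciplinary reading list and an emphasis on discussion and critique, the seminar leads students to consider what these concepts mean, how they should be defined, and how to operationalize their measurement.

Details

Motivation: NLP researchers often use abstract concepts without defining them explicitly. Each subfield shares an understanding of these concepts and bases operational decisions on it, but critical reflection on what the concepts are and how they are measured is lacking. The paper addresses this through education.

Result: The paper reports no quantitative experiments or benchmarks; it describes the design and implementation of an educational seminar.

Insight: The contribution is bringing the social-science framework of conceptualization and operationalization into NLP education, using interdisciplinary methods and critical discussion to help students understand the field's core concepts more deeply and to strengthen the rigor and reflexivity of research.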

Abstract: NLP researchers regularly invoke abstract concepts like “interpretability,” “bias,” “reasoning,” and “stereotypes,” without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.


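[13] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction cs.CL | cs.AI

Jensen Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng

TL;DR: This paper proposes LLM-CAS, a framework that models real-time hallucination correction as a hierarchical reinforcement learning task. It trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context, enabling adaptive, fine-grained correction without permanently modifying model parameters.

Details

Motivation: Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, which limits their reliability in critical applications. Existing supervised fine-tuning and RLHF approaches are data-intensive and computationally expensive, while static parameter editing methods struggle with context-dependent errors and suffer from catastrophic forgetting.

Result: Experiments across multiple language models show that LLM-CAS consistently improves factual accuracy: by 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform static editing methods (such as ITI and CAA) and the dynamic SADI framework.

Insight: The core innovation is formalizing real-time hallucination correction as a policy-driven hierarchical reinforcement learning problem, using dynamic, temporary neuron perturbations for context-aware correction and avoiding permanent parameter changes and their side effects. This offers an efficient and scalable approach to improving LLM reliability, with potential for multimodal extension.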

Abstract: Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.


Pierre Colombo, Malik Boudiaf, Allyn Sweet, Michael Desa, Hongxi Wang

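TL;DR: This paper proposes capitalization tie-out in venture capital as a real-world benchmark task for legal AI, analyzes the performance limitations of existing agentic systems, and designs a world model architecture toward automating tie-out, laying a foundation for applied legal intelligence.

Details

Motivation: To automate a complex legal workflow: before venture capital financings close, lawyers must tie out the capitalization table. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs, which existing LLMs and agentic systems cannot deliver reliably.

Result: The paper analyzes how existing agentic systems perform on this task, but the abstract gives no specific quantitative results or benchmark comparisons.

Insight: Defining a specialized legal workflow (capitalization tie-out) as a real-world AI benchmark task, and proposing a dedicated world model architecture to support multi-document reasoning and evidence traceability, gives a new technical direction for applying agents in the legal domain.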

Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation and, more broadly, as a foundation for applied legal intelligence.


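[15] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure cs.CL | physics.acc-ph

Thorsten Hellert, Nikolay Agladze, Alex Giovannone, Jan Jug, Frank Mayet

TL;DR: This paper proposes a conceptual framework for semantic channel finding in complex experimental infrastructure: mapping natural-language intent to concrete control-system signals, which addresses the difficulty of locating control and diagnostic channels in large experimental platforms. The framework covers four paradigms: dictionary-based in-context lookup, structured hierarchical navigation, interactive agent exploration, and ontology-grounded semantic search. Proof-of-concept implementations at four operational facilities achieve 90-97% query accuracy.

Details

Motivation: To remove the signal-location bottleneck in modern experimental platforms (such as particle accelerators and fusion devices), caused by the sheer number of control channels, inconsistent naming, and fragmented documentation, and thereby improve reliability, scalability, and the efficiency of language-model-driven interfaces.

Result: In proofs of concept at four operational facilities (from compact free-electron lasers to large synchrotron light sources), the approach achieves 90-97% accuracy on expert-curated operational queries, demonstrating the framework across control systems of different scales and architectures.

Insight: The contribution is formalizing semantic channel finding as a general problem and proposing a four-paradigm framework that ranges from simple lookup to advanced semantic search. Its ontology-grounded semantic search decouples channel meaning from facility-specific naming conventions, an idea transferable to natural-language interfaces for other complex industrial systems.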

Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale (from compact free-electron lasers to large synchrotron light sources) and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.


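[16] From Word to World: Can Large Language Models be Implicit Text-based World Models? cs.CL

Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang

TL;DR: This paper studies whether large language models can serve as implicit world models in text-based environments to improve agent learning efficiency. The authors propose a three-level evaluation framework and show across five representative environments that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning.

Details

Motivation: Real-world environments are difficult to scale and limited in coverage; the paper explores whether LLMs can reliably serve as world models to improve agent learning efficiency.

Result: Across five text-based environments, sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. These gains, however, depend heavily on behavioral coverage and environment complexity.

Insight: The paper proposes a three-level framework for systematically evaluating LLM-based world models and shows that how well world models support agent learning depends on behavioral coverage and environment complexity, establishing boundary conditions for using LLMs as implicit world models.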

Abstract: Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.


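[17] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models cs.CL

Tung Duong Ta, Tim Oates

TL;DR: This paper proposes MDToC (Metacognitive Dynamic Tree of Concepts), which constructs a concept tree, develops accuracy-verified calculations for each concept, and uses majority voting to evaluate competing solutions, in order to improve calculation verification when large language models solve math problems.

Details

Motivation: Despite progress in mathematical reasoning, LLMs still struggle to verify calculations effectively with existing prompting techniques, so a new method is needed to make calculation verification more accurate.

Result: On the CHAMP, MATH, and Game-of-24 benchmarks, GPT-4-Turbo with MDToC reaches 58.1%, 86.6%, and 85% accuracy, improving over GoT by 5%, 5.4%, and 4%, respectively. MDToC outperforms existing prompting methods on all backbone models, with gains of up to 7.6% over ToT and 6.2% over GoT.

Insight: The contribution is a metacognitive mechanism that verifies calculation steps by dynamically constructing a concept tree and applying majority voting. It offers a direction for stronger mathematical reasoning without hand-engineered hints; its structured verification and ensemble decision strategy are reusable.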

Abstract: Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate MDToC’s effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24, outperforming GoT by 5%, 5.4%, and 4% on these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
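The majority-voting phase mentioned above can be illustrated in a few lines. This sketch only covers the voting step, not concept-tree construction or calculation verification, and the tie-breaking rule (earliest-seen answer wins) is an assumption rather than something stated in the abstract.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent answer and its vote share.

    Ties go to the answer that appeared first in `candidates`.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidates)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(candidates)
```

In practice the candidate answers would first be normalized (for example, stripping whitespace or converting to a canonical numeric form) so that equivalent answers are counted together.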


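[18] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction cs.CL | cs.AI | cs.CY

Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao

TL;DR: Through a large-scale empirical analysis, the paper studies whether large language models (LLMs) can estimate student struggles. It finds a systematic misalignment between models and humans in difficulty prediction, that scaling up model size is not reliably helpful, and that models converge toward a machine consensus rather than simulating human cognitive struggles.

Details

Motivation: To address the cold-start problem of item difficulty estimation in educational assessment and to explore whether LLMs can perceive the cognitive struggles of human learners.

Result: Models and humans are misaligned on difficulty assessment, and scaling up model size does not always help. Models struggle to simulate the capability limitations of students, lack introspection, and fail to predict their own limitations. The findings hold across multiple domains, including medical knowledge and mathematical reasoning.

Insight: The contribution is exposing the limits of LLMs in difficulty estimation: general problem-solving capability does not imply an understanding of human cognitive struggles, which highlights the challenge of using current models for automated difficulty prediction.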

Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.


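[19] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations cs.CL

Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya

TL;DR: This paper proposes Remedy-R, a reasoning-based generative metric for machine translation evaluation that is trained with reinforcement learning from pairwise translation preferences, with no error annotations and no distillation from closed-source LLMs. It produces step-by-step analyses of accuracy, fluency, and completeness followed by a final score, which makes its assessments more interpretable. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4 judges on the WMT22-24 meta-evaluation, generalizes to other languages, and is strongly robust on out-of-distribution stress tests. Its self-reflective feedback can also be reused for translation improvement: the Remedy-R Agent built on it consistently improves the translation quality of diverse models through an evaluate-revise pipeline.

Details

Motivation: Existing automatic MT metrics perform strongly on benchmarks but remain black boxes with opaque decision-making, and they often fail on real-world out-of-distribution inputs.

Result: On the WMT22-24 meta-evaluation, Remedy-R is competitive with top scalar metrics and GPT-4 judges. It shows strong robustness on out-of-distribution stress tests, and the Remedy-R Agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash.

Insight: The novelty lies in training a generative reasoning model with reinforcement learning from pairwise preferences, without error annotations, to obtain interpretable step-by-step analyses. Its self-reflective feedback can be used directly to build a translation-improvement agent, which suggests that the reasoning captures translation-relevant information and is practically useful.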

Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R’s evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R’s reasoning captures translation-relevant information and is practically useful.


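[20] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models cs.CL | cs.AI | cs.LG

Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang

TL;DR: This paper proposes a training-free, context-aware initialization method that reduces the generative path length of diffusion large language models (DLLMs). It shortens the denoising trajectory by injecting prompt-conditioned priors from a lightweight auxiliary model, using two mechanisms (discrete token injection and representation-level embedding interpolation) plus a confidence-based remasking mechanism that handles imperfect priors. Preliminary experiments on GSM8K show that the method cuts denoising iterations by about 35%, but they also reveal that naive warm-starting can degrade final accuracy.

Details

Motivation: DLLMs support fully parallel token decoding, but inference is inefficient because many denoising iterations are needed to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods traverse the generative trajectory more efficiently through improved solvers or sampling strategies. This paper instead shortens the trajectory itself, using context-aware initialization to start closer to the target distribution.

Result: Preliminary evidence on GSM8K shows that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in the experimental setting), while exposing a key challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines.

Insight: The novelty is a training-free interface that initializes the diffusion process with prompt-conditioned priors from a lightweight auxiliary model, instantiated with discrete token injection and representation-level embedding interpolation, together with a confidence-based remasking mechanism as a form of prior skepticism. Viewed objectively, the method offers a new angle on accelerating diffusion decoding, namely shortening the generative path by improving the initialization rather than only the sampler, and it underlines the importance of research on calibration, revision mechanisms, and representation alignment.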

Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
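
Two of the mechanisms named in the abstract, representation-level embedding interpolation and confidence-based remasking, can be sketched in a few lines. This is only an illustration under stated assumptions: the names `interpolate`, `remask_low_confidence`, `MASK`, the mixing weight `lam`, and the fixed `threshold` are all hypothetical, and the abstract does not say how the paper parameterizes either step.

```python
MASK = "[MASK]"  # placeholder mask symbol, chosen for illustration


def interpolate(mask_vec, prior_vec, lam):
    """Blend the mask embedding with a prior embedding.

    lam = 0 keeps the uninformative mask embedding; lam = 1 fully
    trusts the auxiliary model's prior.
    """
    return [(1.0 - lam) * m + lam * p for m, p in zip(mask_vec, prior_vec)]


def remask_low_confidence(tokens, confidences, threshold):
    """Reset tokens whose confidence is below `threshold` back to MASK."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


print(interpolate([0.0, 2.0], [1.0, 0.0], 0.25))  # [0.25, 1.5]
print(remask_low_confidence(["3", "+", "5"], [0.9, 0.4, 0.8], 0.5))
# ['3', '[MASK]', '5']
```

Remasking gives the decoder a way to revise an injected token it does not trust, which is the "prior skepticism" the abstract refers to.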


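[21] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs cs.CL

Ziyan Zhang, Chao Wang, Zhuo Chen, Lei Chen, Chiyi Li

TL;DR: This paper proposes ROG, an LLM-based method for complex logical reasoning over knowledge graphs. It combines query-aware knowledge graph neighborhood retrieval with LLM chain-of-thought reasoning: complex first-order logic queries are decomposed into sequences of simpler sub-queries, and the retrieved query-relevant subgraphs serve as contextual evidence for step-by-step logical inference.

Details

Motivation: Existing embedding-based methods generalize poorly to complex logical queries that involve multiple operators, deeper reasoning chains, or heterogeneous knowledge graph schemas.

Result: On standard knowledge graph reasoning benchmarks, ROG consistently outperforms strong embedding-based baselines in mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types.

Insight: The novelty lies in combining structured knowledge graph retrieval with LLM-driven logical reasoning, which avoids task-specific embedding optimization and offers a robust and effective alternative for complex knowledge graph reasoning tasks.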

Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
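
For readers unfamiliar with the reported metric, mean reciprocal rank averages 1/rank of the first correct answer over all queries, scoring 0 when no correct answer is ranked. The sketch below is a generic implementation, not code from the paper; the function name and the toy rankings are illustrative.

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1/rank of the first correct candidate per query.

    ranked_lists: one ranked candidate list per query (best first).
    gold_answers: one set of correct answers per query.
    A query with no correct candidate contributes 0.
    """
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_answers):
        for rank, candidate in enumerate(ranking, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
golds = [{"a"}, {"z"}, {"r"}]
# Reciprocal ranks are 1, 1/3, and 0, so the mean is 4/9.
print(mean_reciprocal_rank(rankings, golds))
```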


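[22] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards cs.CL

Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao

TL;DR: This paper proposes advantage-weighted policy optimization (AWPO), a reinforcement learning framework that enhances the tool-use capability of large language models (LLMs) by explicitly integrating reasoning rewards. The framework uses variance-aware gating and difficulty-aware weighting to adaptively modulate the advantages from reasoning signals, combined with a tailored clipping mechanism for stable optimization. Experiments show that AWPO achieves state-of-the-art performance on standard tool-use benchmarks and, in multi-turn scenarios in particular, significantly outperforms strong baselines and leading closed-source models.

Details

Motivation: Existing methods for training tool-use LLMs with reinforcement learning rely mainly on verifiable outcome rewards and overlook the potential of explicit reasoning rewards to strengthen reasoning and tool utilization. In addition, directly combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective.

Result: AWPO achieves state-of-the-art performance on standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models. In challenging multi-turn scenarios, its parameter-efficient 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization on the out-of-distribution MMLU-Pro benchmark.

Insight: The novelty is a principled reinforcement learning framework that adaptively integrates explicit reasoning rewards through variance-aware gating and difficulty-aware weighting, with a tailored clipping mechanism that keeps optimization stable. Viewed objectively, the method addresses the difficulty of combining reasoning and outcome rewards and improves tool-use performance while preserving model efficiency and generalization.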

Abstract: While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO), a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.


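[23] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs cs.CL

Alessandro Lucca, Francesco Pierri

TL;DR: By evaluating four state-of-the-art automatic speech recognition (ASR) models on subtitle generation for Italian television programs, this paper examines the practical potential of ASR systems for real-world, non-English, long-form video content. It finds that current models cannot fully replace professional human subtitlers but can serve as efficient assistive tools that raise productivity, and it designs a cloud-based subtitle production system that supports human-machine collaboration on that basis.

Details

Motivation: ASR models have not been evaluated in real production settings for subtitling non-English (specifically Italian) long-form video. The paper fills this gap to provide empirical grounding for building professional subtitling systems in the media industry.

Result: Four state-of-the-art models, including Whisper Large v2, were evaluated on a 50-hour dataset of Italian television programs against professional human subtitles. Current models cannot meet the accuracy required for full automation, but they can significantly improve human efficiency.

Insight: The novelty is a systematic empirical study of the Italian long-form subtitling scenario, together with a cloud-based production architecture with a human in the loop, which offers a deployable engineering pattern for multilingual media content processing.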

Abstract: Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry’s accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.


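[24] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation cs.CL

Bingyang Kelvin Liu, Ziyu Patrick Chen

TL;DR: This paper proposes JEPA-Reasoner, a JEPA model enhanced with generative ability. It reasons in latent space and uses a separate Talker model to produce readable sentences, decoupling latent reasoning from token generation.

Details

Motivation: The JEPA architecture lacks generative ability, while existing latent-space reasoning approaches for Transformer models (such as COCONUT) still rely on token-by-token generation and therefore accumulate compounding error.

Result: JEPA-Reasoner shows greater robustness to compounding error in autoregressive generation, and its decoupled design may lay the foundation for multi-threaded reasoning.

Insight: The core innovation is decoupling latent-space reasoning from token generation. This keeps JEPA's representation-learning strengths while achieving controllable, robust text generation through a separate generation module (Talker), and it may open a new direction toward multi-threaded reasoning.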

Abstract: While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.


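[25] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models cs.CL | cs.AI

Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun

TL;DR: ChemATP is a training-free chemical reasoning framework. By constructing the first atom-level textual knowledge base, it lets frozen large language models dynamically retrieve and reason over chemical knowledge, improving performance on chemistry tasks while preserving the model's general reasoning ability.

Details

Motivation: LLMs perform poorly in molecular science because they lack explicit chemical priors. Training-based methods cause static coupling and degrade general capabilities, while existing training-free methods rely on surface-level prompting and cannot supply fine-grained atom-level priors.

Result: Experiments show that ChemATP significantly outperforms training-free baselines and performs on par with state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.

Insight: The novelty is decoupling chemical knowledge from the reasoning engine: an atom-level knowledge base enables dynamic retrieval and reasoning, which improves interpretability and adaptability while preserving the LLM's general intelligence. This offers a scalable, training-free paradigm for domain-specific reasoning.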

Abstract: Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model’s general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM’s intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.


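[26] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics cs.CL | cs.AI

Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh

TL;DR: This paper proposes a new prompt optimization pipeline for frame detection in logistics texts that combines retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate effective task-specific prompts. At its core is an LLM-based prompt optimizer agent that iteratively refines prompts using retrieved examples, performance feedback, and internal self-evaluation. On a real-world logistics text annotation task, the optimized prompts (particularly those enhanced via Auto-CoT and RAG) improve inference accuracy by up to 15% over baseline zero-shot or static prompts. The system shows consistent improvements across multiple LLMs, supporting its generalizability and practical value.

Details

Motivation: To adapt large language models to frame detection in logistics texts through prompt engineering rather than extensive fine-tuning, improving reasoning accuracy and labeling efficiency.

Result: On a real-world logistics text annotation task, the optimized prompts raise inference accuracy by up to 15% over baseline zero-shot or static prompts, with consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5, and LLaMA 3.1.

Insight: The novelty is integrating RAG, few-shot prompting, CoT, and Auto-CoT into one structured prompt optimization pipeline and introducing an LLM-based agent for iterative refinement. Viewed objectively, the method provides a scalable way to deploy domain-specific NLP applications without full fine-tuning.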

Abstract: Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts, particularly those enhanced via Auto-CoT and RAG, improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.


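[27] CodeSimpleQA: Scaling Factuality in Code Large Language Models cs.CL

Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang

TL;DR: This paper presents CodeSimpleQA, a bilingual benchmark for evaluating the factual accuracy of code large language models when answering programming-related questions, and builds CodeSimpleQA-Instruct, an instruction dataset with 66M samples. A post-training framework that combines supervised fine-tuning and reinforcement learning substantially improves model performance on code factuality.

Details

Motivation: Existing code-related benchmarks focus mainly on code execution correctness and overlook the factual accuracy of programming knowledge, so a dedicated way to evaluate and improve the factuality of code LLMs is needed.

Result: Evaluation on CodeSimpleQA shows that even frontier LLMs struggle with code factuality. The proposed post-training framework yields substantial improvements over the base model.

Insight: The novelty is the first bilingual benchmark and large-scale instruction dataset focused on code factual accuracy, together with a post-training framework that combines supervised fine-tuning and reinforcement learning to improve factuality-aware alignment.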

Abstract: Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.


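[28] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation cs.CL

Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr

TL;DR: This paper presents SiamGPT-32B, an open-weights large language model based on Qwen3-32B and optimized for Thai text generation. Its core is a "Quality-First" fine-tuning strategy that combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework to improve generation stability under complex instructions, without continual pretraining or corpus expansion.

Details

Motivation: Open-weights LLMs perform strongly in English but generate unstably and perform poorly on Thai tasks, especially under complex instructions.

Result: On the SEA-HELM benchmark, SiamGPT-32B achieves the strongest overall performance among open-weights Thai models of similar scale, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

Insight: The novelty is the "Quality-First" fine-tuning strategy, which emphasizes data quality over scale and directly improves language-specific generation stability and instruction following by combining translated high-quality instruction data with a language-adapted constraint framework. It offers a template for resource-efficient, language-specific optimization.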

Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, we present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.


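[29] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting cs.CL

Ivan Decostanzi, Yelena Mejova, Kyriaki Kalimeri

TL;DR: This paper proposes an automated framework based on large language models (LLMs) that turns heterogeneous humanitarian documents into structured, evidence-grounded situation reports. The system integrates semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning.

Details

Motivation: Current workflows for producing humanitarian situation reports are largely manual, resource-intensive, inconsistent, and slow. An automated approach is needed to deliver timely, accurate, and actionable information.

Result: In an evaluation covering 13 humanitarian events, including natural disasters and conflicts, and more than 1,100 documents from verified sources such as ReliefWeb, the generated questions reached 84.7% relevance, 84.0% importance, and 76.4% urgency. The extracted answers reached 86.3% relevance, with citation precision and recall both above 76%. Agreement between LLM-based and human evaluations exceeded an F1 score of 0.80. Compared with existing baselines, the framework produces reports that are more structured, interpretable, and actionable.

Insight: The novelty is combining LLM reasoning with transparent citation linking and multi-level evaluation to automatically generate accurate, verifiable, and operationally useful humanitarian reports. At its core is an end-to-end, evidence-driven pipeline whose internal evaluation metrics emulate expert judgment, improving the reliability and usefulness of the generated content.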

Abstract: Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource-intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM-based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
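
Citation precision and recall are reported without a definition in the abstract. One common reading, assumed here, treats precision as the share of cited passages that actually support the answer and recall as the share of supporting passages that were cited. The function name and the document IDs below are illustrative.

```python
def citation_precision_recall(cited, supporting):
    """Set-based citation precision and recall for one answer.

    cited: passage IDs the system attached to the answer.
    supporting: passage IDs judged to support the answer.
    Empty inputs yield 0.0 rather than a division error.
    """
    cited, supporting = set(cited), set(supporting)
    hits = len(cited & supporting)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(supporting) if supporting else 0.0
    return precision, recall


# Four citations, two of which are among the two supporting passages.
print(citation_precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d3"]))
# (0.5, 1.0)
```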


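[30] Event Extraction in Large Language Model cs.CL

Bobo Li, Xudong Han, Jiang Liu, Yuzhe Ding, Liqiang Jing

TL;DR: This survey reviews the applications and challenges of large language models (LLMs) and multimodal LLMs in event extraction (EE), and argues that EE should be viewed as a system component that provides a cognitive scaffold for LLM-centered solutions. It covers EE tasks in text and multimodal settings, the evolution of methods from rule-based to generative frameworks, and the associated datasets and evaluation. It also discusses cross-lingual, low-resource, and domain-specific settings, and closes with future directions for developing EE into a reliable, agent-ready perception and memory layer.

Details

Motivation: LLM-based pipelines can produce structured outputs in zero-shot or few-shot settings, but deployment still faces hallucinations, fragile temporal and causal linking over long contexts, and limited long-horizon knowledge management. EE therefore needs to be repositioned as a system component that provides a cognitive scaffold.

Result: The paper is a survey and reports no specific quantitative experiments or benchmark results, but it systematically summarizes the task taxonomy, method evolution, datasets, and evaluation schemes of the EE field.

Insight: The novelty is the proposal to use EE components (event schemas, slot constraints, event-centric structures, and event stores) as a cognitive scaffold for LLMs, enabling graph-based retrieval-augmented generation (RAG) and updatable memory beyond the context window, and thereby moving EE from static extraction toward a reliable, agent-ready system layer.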

Abstract: Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero-shot or few-shot settings. Yet LLM-based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long-horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM-centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event-centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation-aware retrieval with graph-based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule-based and neural models to instruction-driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross-lingual, low-resource, and domain-specific settings. Finally, we outline open challenges and future directions for reliable event-centric systems that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent-ready perception and memory layer for open-world systems.


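[31] Algerian Dialect cs.CL

Zakaria Benmounah, Abdennour Boulesnane

TL;DR: This paper introduces the Algerian Dialect dataset, a large-scale sentiment-annotated dataset of 45,000 YouTube comments written in Algerian Arabic dialect. Each comment is manually labeled with one of five sentiment categories (very negative, negative, neutral, positive, or very positive) and comes with metadata such as timestamps and like counts. The dataset addresses the scarcity of public resources for Algerian dialect and supports research in sentiment analysis, dialectal Arabic NLP, and social media analytics.

Details

Motivation: To address the scarcity of publicly available data for Algerian Arabic dialect and to support research in sentiment analysis, dialectal Arabic natural language processing, and social media analytics.

Result: The paper reports no model experiments or benchmark results and makes no state-of-the-art comparison. Its main contribution is the creation and public release of the dataset.

Insight: The novelty is a large-scale, sentiment-annotated dataset for Algerian Arabic dialect, built from collected YouTube comments with manual annotation and rich metadata. It fills a gap in dialectal NLP resources and can advance research on low-resource language processing.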

Abstract: We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at https://doi.org/10.17632/zzwg3nnhsz.2.


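[32] Increasing the Thinking Budget is Not All You Need cs.CL

Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Riad Souissi

TL;DR: This paper systematically studies how compute is allocated during large language model reasoning, examining how increasing the thinking budget and adopting configurations such as self-consistency and reflection affect model performance. It finds that simply increasing the thinking budget is not the best option.

Details

Motivation: Thinking-capable LLMs raise the question of how efficiently compute is used in reasoning tasks. The paper explores how the thinking budget interacts with different configurations in order to balance performance against computational cost.

Result: More accurate responses can be obtained through alternative configurations such as self-consistency and self-reflection rather than by simply increasing the thinking budget. The conclusion is validated on multiple reasoning benchmarks.

Insight: The novelty is a systematic framework for analyzing the thinking budget, which shows that optimizing the configuration improves model performance more effectively than adding compute alone and points to a new direction for efficient LLM reasoning.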

Abstract: Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.


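[33] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery cs.CL | eess.AS

Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux

TL;DR: This paper introduces MauBERT, a multilingual extension of HuBERT that uses articulatory features for robust cross-lingual phonetic representation learning. HuBERT pre-training is continued with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. The models learn to predict articulatory features or phones, yielding language-independent representations that capture multilingual phonetic properties.

Details

Motivation: Multilingual speech representation learning lacks robust, language-independent representations. The paper introduces articulatory features as an inductive bias to improve self-supervised speech models in cross-lingual and few-shot settings.

Result: In comprehensive ABX discriminability testing, MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. The models also adapt effectively to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech).

Insight: The novelty is using articulatory features as a supervision signal in HuBERT pre-training to learn speech representations in a language-independent way, which strengthens cross-lingual generalization and few-shot adaptation and provides an effective route for instilling linguistic inductive biases in self-supervised speech models.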

Abstract: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
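
The ABX discriminability test used in the evaluation asks, for a triple (A, B, X) where A and X share a category and B does not, whether X lies closer to A than to B. The sketch below is a deliberately simplified version: it assumes each item is a single fixed-length vector compared with Euclidean distance, whereas speech ABX evaluations typically compare frame sequences with an alignment step. The function name and the toy data are illustrative, not taken from the paper.

```python
import math
from itertools import permutations


def abx_error_rate(items):
    """Fraction of (A, B, X) triples where X is closer to B than to A.

    items: list of (label, vector) pairs. A and X are distinct items with
    the same label; B has a different label. Exact ties count as half an error.
    """
    errors = 0.0
    trials = 0
    for (label_a, a), (label_x, x) in permutations(items, 2):
        if label_a != label_x:
            continue
        for label_b, b in items:
            if label_b == label_a:
                continue
            trials += 1
            d_ax, d_bx = math.dist(a, x), math.dist(b, x)
            if d_ax > d_bx:
                errors += 1.0
            elif d_ax == d_bx:
                errors += 0.5
    if trials == 0:
        raise ValueError("need two same-label items and one other-label item")
    return errors / trials


# Two well-separated phone categories: every triple is resolved correctly.
separated = [("p", (0.0, 0.0)), ("p", (0.1, 0.0)),
             ("b", (1.0, 0.0)), ("b", (1.1, 0.0))]
print(abx_error_rate(separated))  # 0.0
```

A lower error rate means the representation separates the categories better; 0.5 corresponds to chance.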


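[34] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting cs.CL

Filippos Ventirozos, Peter Appleby, Matthew Shardlow

TL;DR: This paper proposes a new way of building chain-of-thought (CoT) prompts with Unified Meaning Representation (UMR) for zero-shot aspect-category sentiment analysis (ACSA). The method targets the scarcity and high cost of annotated data in new domains and structures the reasoning process to improve the zero-shot performance of large language models (LLMs).

Details

Motivation: Supervised approaches to ACSA are held back by scarce and costly annotated data in new domains. The paper explores zero-shot use of LLMs as a practical alternative when annotation resources are limited.

Result: An evaluation across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four datasets suggests that the effectiveness of the UMR-based approach relative to a standard CoT baseline may be model-dependent. Preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, but further research is needed to establish whether this carries over to smaller model architectures.

Insight: The novelty is integrating UMR into CoT prompting as an intermediate representation that structures the reasoning process, giving LLMs an interpretable, semantics-based reasoning framework for zero-shot ACSA. Viewed objectively, the work underlines the value of structured semantic representations in prompt engineering, while its model dependence indicates that future work should address adaptation across model scales.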

Abstract: Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.


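[35] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators cs.CL

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang

TL;DR: GenEnv is a framework that trains LLM agents through a difficulty-aligned co-evolutionary game. The agent co-evolves with a scalable, generative environment simulator, which addresses the high cost and static nature of real-world interaction data.

Details

Motivation: The main bottleneck in training large language model agents is that real-world interaction data is expensive and static, so a dynamic, data-efficient way of improving agent capability is needed.

Result: On five benchmarks (API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner), GenEnv improves performance over 7B baselines by up to 40.3% and matches or exceeds the average performance of larger models. It also outperforms Gemini 2.5 Pro-based offline data augmentation while using 3.3 times less data.

Insight: The novelty is a co-evolutionary framework in which the simulator acts as a dynamic curriculum policy that keeps generating tasks tailored to the agent's "zone of proximal development", with an α-Curriculum Reward that aligns task difficulty with agent capability. This shifts training from static supervision to adaptive simulation and offers a data-efficient path to scaling agent capabilities.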

Abstract: Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv evolves the data itself: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent’s "zone of proximal development". This process is guided by a simple but effective α-Curriculum Reward, which aligns task difficulty with the agent’s current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.


cs.CV

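[36] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction cs.CV | cs.AI

Karthik Prabhakar

TL;DR: This paper proposes NystagmusNet, an explainable deep learning system that predicts photosensitivity risk for nystagmus patients. A dual-branch convolutional neural network estimates a risk score from environmental brightness and eye movement variance, explainability techniques such as SHAP and GradCAM highlight environmental risk zones, and a rule-based recommendation engine provides adaptive filter suggestions.

Details

Motivation: Nystagmus patients face daily challenges because environmental brightness exacerbates involuntary eye movements, and current assistive solutions lack predictive, personalized treatment.

Result: The model reaches 75% validation accuracy on synthetic data, and the explainability techniques improve clinical trust and model interpretability.

Insight: The novel elements are a dual-branch CNN architecture trained on synthetic and augmented datasets, integrated explainable-AI techniques that visualize risk zones, and a rule-based real-time recommendation engine. Its way of combining explainability with clinical decision support is worth borrowing.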

Abstract: Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.


[37] SuperFlow: Training Flow Matching Models with RL on the Fly cs.CVPDF

Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao

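TL;DR: This paper proposes SuperFlow, a reinforcement learning training framework for flow-based generative models. It adjusts per-prompt group sizes dynamically through variance-aware sampling and computes step-level advantages consistent with continuous-time flow dynamics, addressing the inefficient sampling and biased credit assignment of existing methods.

Details

Motivation: Current RL training for flow-based generative models has two main problems: fixed per-prompt group sizes ignore differences in sampling importance across prompts, which is inefficient, and trajectory-level advantages are reused as per-step estimates, which biases credit assignment.

Result: On standard text-to-image tasks, SuperFlow needs only 5.4% to 56.3% of the original training steps and cuts training time by 5.2% to 16.7%, while improving over SD3.5-M by 4.6% to 47.2% and over Flow-GRPO by 1.7% to 16.0%.

Insight: The novelties are variance-aware sampling that adjusts group sizes for efficiency and step-level advantage computation grounded in continuous-time flow dynamics for better credit assignment. Both deliver faster training and better performance without architectural changes.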

Abstract: Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
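The abstract does not specify how variance-aware sampling sets group sizes. As a minimal sketch of the general idea, the code below assumes a fixed total sampling budget is split across prompts in proportion to the standard deviation of their recent rewards, with a floor so every prompt keeps a usable group. The function name, the proportional rule, and the `min_size` floor are assumptions for illustration, not SuperFlow's actual procedure.

```python
import statistics


def allocate_group_sizes(reward_history, total_budget, min_size=2):
    """Split a sampling budget across prompts in proportion to reward std.

    reward_history: one list of recent rewards per prompt (hypothetical input).
    Returns one integer group size per prompt; the sizes sum to total_budget.
    """
    n = len(reward_history)
    if total_budget < n * min_size:
        raise ValueError("budget too small for the minimum group size")
    stds = [statistics.pstdev(r) if len(r) > 1 else 0.0 for r in reward_history]
    spare = total_budget - n * min_size
    total_std = sum(stds)
    if total_std == 0.0:
        # No prompt shows any variance: fall back to a uniform split.
        shares = [spare / n] * n
    else:
        shares = [spare * s / total_std for s in stds]
    sizes = [min_size + int(share) for share in shares]
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = total_budget - sum(sizes)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    history = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.2, 0.4]]
    print(allocate_group_sizes(history, total_budget=12))
```

A prompt whose samples all receive the same reward contributes no learning signal to a group-relative advantage, so under this rule it keeps only the minimum group while high-variance prompts receive the remaining budget.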


[38] Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition cs.CV | cs.AIPDF

Ellie Zhou, Jihoon Chung, Olga Russakovsky

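TL;DR: The paper systematically analyzes background bias in action recognition models and finds it in classification models, contrastive text-image pretrained models, and video large language models alike. For classification models, adding segmented human input reduces the bias by 3.78%; for video LLMs, prompt tuning increases human-focused reasoning by 9.85%.

Details

Motivation: Action recognition models rely too heavily on background cues rather than on human motion. The paper sets out to analyze and mitigate this background bias.

Result: The mitigation strategy for classification models reduces background bias by 3.78%, and prompt tuning for video LLMs increases human-focused reasoning by 9.85%.

Insight: The contributions are a systematic analysis of background bias across action recognition models and mitigation strategies for classification models and video LLMs. Viewed objectively, the novelty lies in the cross-model bias analysis and the use of prompt tuning.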

Abstract: Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLMs) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.


[39] SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries cs.CV | cs.LGPDF

Bin Wang, Fadi Dornaika

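TL;DR: This paper proposes SCS-SupCon, a sigmoid-based common and style supervised contrastive learning framework for image classification with subtle inter-class differences and large intra-class variation. It introduces a sigmoid pairwise contrastive loss with learnable temperature and bias parameters to obtain adaptive decision boundaries, and adds an explicit style-distance constraint that disentangles style from content representations, improving discriminative power in fine-grained recognition.

Details

Motivation: Existing supervised contrastive methods based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, which limits their discriminative power in fine-grained recognition. The paper aims to remove these limitations and make the model more robust to subtle inter-class differences and substantial intra-class variation.

Result: Experiments on six benchmark datasets (including CIFAR-100, CUB200-2011, and Stanford Dogs) show that SCS-SupCon achieves state-of-the-art performance on both CNN and Transformer backbones. On CIFAR-100 with ResNet-50 under five-fold cross-validation, it improves over SupCon by about 3.9 percentage points and over CS-SupCon by about 1.7 points; on fine-grained datasets it outperforms CS-SupCon by 0.4 to 3.0 points.

Insight: The novelties are (1) a sigmoid-based pairwise contrastive loss whose learnable temperature and bias yield adaptive decision boundaries, emphasize hard negatives, and mitigate negative-sample dilution, and (2) an explicit style-distance constraint that promotes disentanglement of style and content for more robust features. Viewed objectively, combining an adaptive-boundary mechanism with representation disentanglement is an effective new route to improving supervised contrastive learning.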

Abstract: Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4–3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.
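The sigmoid pairwise loss described above can be sketched as follows. This is a minimal illustration modeled on the SigLIP-style pairwise sigmoid loss applied to class labels; the paper's exact formulation and its style-distance term are not given in the abstract and are not reproduced here. In a real model `temperature` and `bias` would be learnable parameters; here they are plain floats with assumed default values.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sigmoid_pairwise_loss(embeddings, labels, temperature=10.0, bias=-5.0):
    """Mean of -log sigmoid(sign * (temperature * cos_sim + bias)) over ordered pairs.

    sign is +1 for same-label pairs and -1 otherwise, so each pair is an
    independent binary decision. The bias shifts the similarity at which a
    pair switches from "different class" to "same class".
    """
    z = [normalize(v) for v in embeddings]
    total, pairs = 0.0, 0
    for i in range(len(z)):
        for j in range(len(z)):
            if i == j:
                continue
            sim = sum(a * b for a, b in zip(z[i], z[j]))
            sign = 1.0 if labels[i] == labels[j] else -1.0
            logit = temperature * sim + bias
            total += softplus(-sign * logit)  # equals -log(sigmoid(sign * logit))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(sigmoid_pairwise_loss(emb, [0, 0, 1, 1]))  # labels agree with geometry
    print(sigmoid_pairwise_loss(emb, [0, 1, 0, 1]))  # labels contradict geometry
```

Because each pair is scored on its own rather than through a softmax over all negatives, adding many easy negatives does not shrink the gradient of a hard one, which is one way to read the abstract's claim about negative-sample dilution.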


[40] Name That Part: 3D Part Segmentation and Naming cs.CVPDF

Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille

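TL;DR: The paper proposes ALIGN-Parts for semantic 3D part segmentation and naming, that is, decomposing 3D objects into parts with meaningful names. It formulates part naming as a direct set alignment task, combines geometric, appearance, and semantic knowledge for efficient one-shot segmentation and naming, and builds a unified ontology that aligns multiple datasets.

Details

Motivation: Part definitions are inconsistent across existing 3D part annotation datasets, which limits robust training; prior methods produce only unlabeled decompositions or retrieve single parts, without complete shape annotations.

Result: The method supports zero-shot matching to arbitrary descriptions; with human verification, the authors create a unified ontology of 1,794 unique 3D parts that aligns PartNet, 3DCoMPaT++, and Find3D.

Insight: The novelty lies in formalizing part naming as a set alignment task and fusing 3D part fields, multi-view vision features, and language-model-generated affordance descriptions. The paper also proposes new evaluation metrics suited to named 3D part segmentation.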

Abstract: We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset. We also introduce 2 novel metrics appropriate for the named 3D part segmentation task.
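To make the "bipartite assignment" step concrete, the sketch below matches a few partlet embeddings to part-description embeddings one-to-one by maximizing total cosine similarity. It is an illustration only: the embeddings are made-up toy vectors, the function names are hypothetical, and the exhaustive search is practical only for a handful of parts (a real system would more likely use the Hungarian algorithm).

```python
import math
from itertools import permutations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def match_partlets(partlet_embs, text_embs):
    """One-to-one assignment of partlets to descriptions with maximum total similarity.

    Assumes there are at least as many descriptions as partlets.
    Returns (assignment, score), where assignment[i] is the description
    index chosen for partlet i.
    """
    if len(partlet_embs) > len(text_embs):
        raise ValueError("need at least as many descriptions as partlets")
    best, best_score = None, -math.inf
    for perm in permutations(range(len(text_embs)), len(partlet_embs)):
        score = sum(cosine(p, text_embs[j]) for p, j in zip(partlet_embs, perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score


if __name__ == "__main__":
    names = ["leg", "seat", "back"]                      # toy part names
    texts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    partlets = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]
    assignment, _ = match_partlets(partlets, texts)
    print([names[j] for j in assignment])
```

The one-to-one constraint is what distinguishes set alignment from per-part retrieval: two partlets cannot both claim the same name, even if it is the nearest description for each.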


Shubham Kumar Nigam, Parjanya Aditya Shukla, Noel Shallum, Arnab Bhattacharya

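TL;DR: This paper studies the translation of handwritten Marathi legal documents, comparing the conventional two-stage OCR plus machine translation pipeline with end-to-end vision large language models (VLMs). The work is motivated by the urgent need in India's court system for efficient, accurate translation of handwritten legal records (such as FIRs, charge sheets, and witness statements) to improve access to legal information.

Details

Motivation: Handwritten text recognition and translation remain hard for low-resource languages such as Marathi, particularly for legal documents that lack large digitized corpora and show highly variable handwriting. The goal is to meet the urgent need of India's district and high courts for scalable, accurate translation systems that digitize legal records.

Result: Both the conventional OCR-MT pipeline and end-to-end VLMs are evaluated on a curated dataset of handwritten Marathi legal documents. The abstract reports no quantitative results (such as accuracy or BLEU scores) or benchmark comparisons; it states only that the findings offer actionable insights toward robust, edge-deployable solutions.

Insight: The novelty is the comparison of the conventional two-stage pipeline with end-to-end VLMs on handwritten document translation, emphasizing that VLMs may unify the OCR and translation steps and simplify processing. Viewed objectively, the study provides a methodological comparison for legal document processing in low-resource settings and supports the development of edge-deployable solutions.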

Abstract: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India’s district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.


[42] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis cs.CVPDF

Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon

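TL;DR: This paper presents FPBench, the first comprehensive benchmark of multimodal large language models for fingerprint analysis. It evaluates 20 open-source and proprietary MLLMs on 8 biometric and forensic tasks across 7 real and synthetic datasets, and discusses performance, explainability, and open challenges.

Details

Motivation: MLLMs have been applied to iris and face image analysis, but their fingerprint understanding capabilities remain unexplored, so a comprehensive benchmark is needed to fill this gap.

Result: The study evaluates 20 MLLMs with zero-shot and chain-of-thought prompting strategies and analyzes their performance across multiple datasets and tasks, laying groundwork for fingerprint foundation models.

Insight: The novelty is the first comprehensive MLLM benchmark for fingerprint understanding, FPBench, which covers diverse datasets and tasks and helps reveal both the potential and the limitations of MLLMs in biometrics.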

Abstract: Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are as yet unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance and explainability, and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs, paving the path for foundation models for fingerprints.


[43] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping cs.CVPDF

Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad

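TL;DR: SERA-H is an end-to-end deep learning model that combines a super-resolution module (EDSR) with temporal attention encoding (UTAE). Supervised by high-density airborne laser scanning (ALS) data, it produces 2.5 m resolution canopy height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series.

Details

Motivation: Existing approaches to high-resolution canopy height mapping face a trade-off between data accessibility and spatial resolution. The aim is to use free satellite data with high revisit frequency to go beyond the native resolution of the input sensors.

Result: On an open-source benchmark dataset in France, SERA-H achieves an MAE of 2.6 m and a coefficient of determination of 0.82. It outperforms standard Sentinel-1/2 baselines and matches or exceeds methods that rely on commercial very-high-resolution imagery (e.g., SPOT-6/7, PlanetScope, Maxar).

Insight: The novelty is combining high-resolution supervision with the spatiotemporal information embedded in time series to reconstruct detail beyond the native resolution of the input sensors. The method shows that accurate, frequent forest mapping from free satellite time series is feasible, offering a low-cost alternative for resource-constrained applications.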

Abstract: High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors’ native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at https://github.com/ThomasBoudras/SERA-H#
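For readers who want to reproduce the two reported metrics, the sketch below computes mean absolute error and the coefficient of determination with their standard definitions on flattened height values. The sample numbers are invented for illustration and are unrelated to the paper's data. Note also that going from 10 m inputs to a 2.5 m output is a 4x upsampling per axis, so each input pixel corresponds to 16 output pixels.

```python
def mean_absolute_error(pred, ref):
    """Average of |prediction - reference| over all pixels (same units as input)."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def r_squared(pred, ref):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and equally long")
    mean_ref = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    if ss_tot == 0.0:
        raise ValueError("reference values have zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    als_height = [10.0, 20.0, 30.0, 40.0]   # made-up reference heights in metres
    predicted = [12.0, 18.0, 33.0, 39.0]    # made-up model outputs in metres
    print(mean_absolute_error(predicted, als_height))
    print(r_squared(predicted, als_height))
```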


[44] EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams cs.CVPDF

Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz

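TL;DR: This paper presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It produces accurate per-frame depth maps with sharp anatomical boundaries, keeps predictions temporally consistent across frames, and runs in real time. It processes frames one at a time and propagates inter-frame information through a temporal module; the design comprises a single-frame depth network, multi-level Mamba temporal modules, and a hierarchical structure with multi-scale supervision.

Details

Motivation: Existing endoscopic video depth estimation methods typically use batched inputs, lack temporal consistency, blur boundaries, and cannot run in real time. The aim is to support downstream tasks such as robotic surgery.

Result: On two public colonoscopy depth estimation datasets, the method substantially outperforms state-of-the-art monocular depth estimation methods and produces depth maps with sharp, anatomically aligned boundaries.

Insight: The novelties are the use of the Mamba architecture for temporal modeling to improve accuracy and stability, a framework that processes individual frames while propagating temporal information, and a hierarchical design with multi-scale supervision that jointly optimizes local boundary sharpness and global geometric consistency.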

Abstract: This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth


[45] Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation cs.CVPDF

Ziyu Zhang, Yi Yu, Simeng Zhu, Ahmed Aly, Yunhe Gao

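TL;DR: This paper proposes AtlasSegFM, a framework that customizes foundation models for medical image segmentation to a specific clinical context from a single annotated example. It uses atlas registration to supply context-aware prompts and fuses the predictions of atlas registration and the foundation model, markedly improving segmentation accuracy, especially for small, delicate structures.

Details

Motivation: Interactive foundation models generalize better thanks to large-scale multimodal pretraining, but they still depend on precise prompts and underperform in clinical contexts that are underrepresented in their training data. A lightweight, deployable one-shot customization method is therefore needed.

Result: Extensive experiments on public and in-house datasets spanning multiple modalities and organs show that AtlasSegFM consistently improves segmentation, particularly on small, delicate structures, and offers an effective customization method for real clinical workflows.

Insight: The novelties are a pipeline that generates context-aware prompts by registering an atlas to the query image, and a test-time adapter that fuses atlas-registration and foundation-model predictions. Together they form a lightweight framework that can be reused for rapid clinical customization of segmentation foundation models.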

Abstract: Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution for one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.


[46] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation cs.CVPDF

Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang

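TL;DR: MACE-Dance is a music-driven dance video generation framework built on a cascaded Mixture-of-Experts (MoE) architecture. A Motion Expert generates 3D dance motion from music, and an Appearance Expert synthesizes video conditioned on that motion and a reference image. The method achieves state-of-the-art performance on both 3D dance generation and pose-driven image animation, and introduces a new evaluation protocol and dataset to benchmark the task.

Details

Motivation: Existing methods cannot be directly adapted to music-driven dance video generation, and the limited work in this area struggles to achieve high-quality visual appearance and realistic human motion at the same time. The paper addresses this joint optimization problem.

Result: The Motion Expert achieves state-of-the-art (SOTA) performance on 3D dance generation, and the Appearance Expert achieves SOTA performance on pose-driven image animation. Under the authors' new evaluation protocol, MACE-Dance as a whole also achieves SOTA performance.

Insight: The novelty is decoupling the task into two cascaded expert modules, motion generation and appearance synthesis, and optimizing each separately. The Motion Expert uses a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy; the Appearance Expert uses a decoupled kinematic-aesthetic fine-tuning strategy. A large-scale dataset and a dedicated evaluation protocol establish a better benchmark for the task.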

Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: https://macedance.github.io/


[47] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching cs.CVPDF

Junho Lee, Kwanseok Kim, Joonseok Lee

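TL;DR: The paper examines the choice of source distribution in flow matching. It proposes a 2D simulation for analyzing learning dynamics under high-dimensional geometric properties and, based on that analysis, designs a framework that combines norm-aligned training with directionally-pruned sampling to improve generation quality and sampling efficiency.

Details

Motivation: A Gaussian source distribution may not be optimal for high-dimensional data generation. The paper explores better alternatives to improve flow matching performance.

Result: Empirical evaluations show consistent improvements in both generation quality and sampling efficiency.

Insight: The novelties are an in-depth analysis of learning dynamics (the effects of density approximation, directional alignment, and norm alignment) and a directionally-pruned sampling strategy that can be applied directly to existing flow matching models trained with a Gaussian source, improving performance without retraining.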

Abstract: Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian’s omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at https://github.com/kwanseokk/SourceFM.
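The abstract does not state the pruning criterion. One way to picture "eliminating initializations in data-sparse regions during inference" is rejection sampling on direction: draw Gaussian noise as usual, but keep a sample only if its direction is close enough to at least one reference data direction. The cosine threshold, the reference directions, and every name below are assumptions for illustration; the paper's actual rule may differ.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in v]


def pruned_gaussian_samples(data_dirs, n, min_cos=0.0, rng=None, max_tries=100000):
    """Draw n standard-normal vectors whose direction is near some data direction.

    data_dirs: reference vectors standing in for directions where data lives.
    A draw is kept only if its max cosine similarity to a reference is >= min_cos,
    so the kept samples still follow the Gaussian shape inside the allowed cone.
    """
    rng = rng or random.Random(0)
    refs = [unit(d) for d in data_dirs]
    dim = len(refs[0])
    kept, tries = [], 0
    while len(kept) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("acceptance rate too low for this threshold")
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if max(dot(unit(x), r) for r in refs) >= min_cos:
            kept.append(x)
    return kept


if __name__ == "__main__":
    starts = pruned_gaussian_samples([[1.0, 0.0], [0.0, 1.0]], n=5, min_cos=0.5)
    for s in starts:
        print([round(c, 3) for c in s])
```

Because only the sampler changes, a scheme of this kind can be wrapped around an already trained Gaussian-source model, which matches the abstract's point that pruning needs no retraining.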


[48] Multi-Part Object Representations via Graph Structures and Co-Part Discovery cs.CVPDF

Alex Foo, Wynne Hsu, Mong Li Lee

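TL;DR: This paper proposes a new method that uses explicit graph representations and a co-part discovery algorithm to discover object-centric representations of multi-part objects from images, and evaluates robustness under occlusion and out-of-distribution conditions. Experiments on simulated, realistic, and real-world images show that it outperforms existing methods and predicts key object properties more accurately in a downstream task.

Details

Motivation: Existing methods based on implicit object representations struggle to recognize multi-part objects under occlusion or out-of-distribution conditions because they assume part-whole relations are encoded implicitly through indirect training objectives. The paper addresses this limitation with explicit graph representations and a co-part discovery algorithm.

Result: Experiments on simulated, realistic, and real-world images show markedly better quality of discovered objects than state-of-the-art methods, as well as accurate recognition of multi-part objects in occluded and out-of-distribution settings.

Insight: The novelty is introducing explicit graph representations to model part relations and developing a co-part discovery algorithm, which improves the robustness and interpretability of object-centric representations and yields more accurate predictions in downstream tasks.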

Abstract: Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fails to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages explicit graph representations for parts and by presenting a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.


[49] Investigating Spatial Attention Bias in Vision-Language Models cs.CV | cs.CLPDF

Aryan Chaudhary, Sanchit Goyal, Pratik Narang, Dhruv Kumar

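TL;DR: The paper identifies and systematically studies a spatial attention bias in vision-language models (VLMs): given horizontally concatenated images, models consistently describe the left content before the right content. Controlled experiments on open-source and closed-source models of different architectures confirm that the bias is widespread and rule out language reading direction as the primary cause.

Details

Motivation: Although VLMs understand visual content well, systematic biases in their spatial processing remain largely unexplored. The study aims to identify and characterize this spatial attention bias and expose a fundamental limitation in how current VLMs process spatial information.

Result: Under neutral prompting, models describe the left content first in about 97% of cases. The bias persists in a model fine-tuned on Arabic (read right to left), indicating that reading direction is not the primary cause.

Insight: The novelty is the first systematic identification and quantification of a widespread spatial attention bias in VLMs. Viewed objectively, the study suggests that model architecture, rather than explicit instructions in the training data, may be the root cause of the bias, which matters for understanding VLM internals and guiding future improvements.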

Abstract: Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.


[50] Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction cs.CV | cs.ROPDF

Shahram Najam Syed, Yitian Hu, Yuchao Yao

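TL;DR: This paper proposes a joint learning framework for large-scale 3D reconstruction from monocular video. It optimizes depth, camera pose, and local radiance fields together, resolving the scale ambiguity, pose drift, and limited scene capacity that arise when these factors are handled separately. The system comprises a ViT depth network trained with metric-scale supervision, a pose refinement layer that performs multi-scale feature bundle adjustment in feature space, and an incrementally allocated hierarchy of local hash-grid NeRFs, enabling high-quality reconstruction and novel-view synthesis at city-block scale.

Details

Motivation: In large-scale monocular 3D reconstruction, optimizing depth, pose, and radiance fields in isolation leads to scale ambiguity, long-horizon pose drift, and a single global NeRF that cannot model large extents of content.

Result: On eight indoor and outdoor sequences of the Tanks and Temples benchmark, Absolute Trajectory Error drops to 0.001-0.021 m, up to 18x lower than BARF and 2x lower than NoPe-NeRF, while Relative Pose Error stays at sub-pixel level.

Insight: The novelty is end-to-end joint optimization of depth, pose, and radiance fields, together with a ViT depth network trained with metric-scale supervision, multi-scale bundle adjustment on learned features for pose refinement, and an incrementally allocated hierarchy of local hash-grid NeRFs. The result is large-scale, drift-free, high-fidelity monocular 3D reconstruction.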

Abstract: Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space, leveraging learned pyramidal descriptors instead of brittle keypoints, to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences (up to 18x lower than BARF and 2x lower than NoPe-NeRF) while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
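As a reference for the Absolute Trajectory Error figures quoted above, the sketch below computes ATE in its common form, the root-mean-square distance between estimated and ground-truth camera positions. It assumes the two trajectories are already expressed in the same frame and scale; the alignment step that usually precedes this, and the exact variant used in the paper, are not covered. The sample trajectories are invented.

```python
import math


def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of per-frame position error between two already-aligned trajectories.

    Each trajectory is a list of (x, y, z) camera positions in metres.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    est = [(0.0, 0.0, 0.01), (1.0, 0.01, 0.0), (2.0, 0.0, 0.0)]
    print(absolute_trajectory_error(est, gt))
```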


[51] SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality cs.CVPDF

Pan Ben Wong, Chengli Wu, Hanyue Lu

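TL;DR: SG-RIFE is a semantic-guided real-time video frame interpolation method. It efficiently fine-tunes a pre-trained RIFE backbone and injects semantic priors from a DINOv3 Vision Transformer to improve perceptual quality in complex scenes.

Details

Motivation: Flow-based real-time interpolation methods such as RIFE fall short in scenes with large motion and occlusion, while diffusion-based methods such as Consec. BB deliver high quality but at latencies too high for real-time use.

Result: On the SNU-FILM benchmark, SG-RIFE outperforms the diffusion-based LDMVFI in FID/LPIPS and reaches quality comparable to Consec. BB on complex benchmarks while running significantly faster, achieving diffusion-competitive perceptual quality in near real time.

Insight: The novelties are a parameter-efficient fine-tuning strategy that extracts semantic priors from a frozen DINOv3; a Split-Fidelity Aware Projection Module (Split-FAPM) that compresses and refines high-dimensional features; and a Deformable Semantic Fusion (DSF) module that aligns semantic priors with pixel-level motion fields. Together they show that semantic consistency lets flow-based methods compete with diffusion models on perceptual quality at near-real-time speed.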

Abstract: Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.


[52] Loom: Diffusion-Transformer for Interleaved Generation cs.CVPDF

Mingcheng Ye, Jiaming Liu, Yiren Song

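TL;DR: The paper presents Loom, a unified diffusion-transformer framework for interleaved text-image generation. It extends the Bagel model through full-parameter fine-tuning and an interleaved architecture, and uses a language planning strategy that decomposes a user instruction into stepwise prompts and frame embeddings for temporally consistent, controllable long-sequence generation.

Details

Motivation: Interleaved text-image generation must jointly produce coherent visual frames and aligned textual descriptions to support applications such as style transfer, compositional synthesis, and procedural tutorials. The paper aims to improve temporal consistency and text-image alignment in this setting.

Result: On style transfer, compositional generation, and tutorial-like tasks, Loom substantially outperforms the open-source baseline Anole on temporal and semantic metrics, with an average gain of 2.6 points on a 5-point scale. On the authors' curated 50K interleaved tutorial dataset it also improves over unified and diffusion editing baselines.

Insight: The novelties are (1) an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning; (2) a language planning strategy that decomposes instructions into stepwise prompts and frame embeddings; and (3) conditioning on a small set of sampled prior frames rather than the full history, which yields efficient, controllable long-horizon generation.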

Abstract: Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.


[53] Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks cs.CV | cs.AI | cs.CRPDF

Yucheng Fan, Jiawei Chen, Yu Tian, Zhaoxia Yin

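TL;DR: This paper proposes an adversarial shielding method that protects against attribute inference attacks based on vision-language models (VLMs), jointly optimizing privacy suppression and utility preservation under a visual consistency constraint. To address the lack of evaluation benchmarks, the authors also build and release VPI-COCO, a benchmark for fine-grained evaluation of protection methods in terms of privacy protection and user experience.

Details

Motivation: As VLMs spread, VLM-based attribute inference attacks have become a serious privacy threat. Existing protection methods often degrade image quality or interfere with vision-based functions on social media, so they fail to balance privacy protection and user experience.

Result: Experiments on multiple VLMs show that the method reduces the privacy attribute recognition rate (PAR) to below 25% and keeps the non-privacy attribute recognition rate (NPAR) above 88%, while maintaining high visual consistency and generalizing well to unseen and paraphrased privacy questions.

Insight: The novelty is an adversarial shielding framework that jointly optimizes privacy and utility under a visual consistency constraint, together with VPI-COCO, the first public benchmark with hierarchically structured privacy questions and non-private counterparts, which enables fair, fine-grained evaluation.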

Abstract: As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.


[54] UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations cs.CV [PDF]

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma

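TL;DR: This paper proposes UniMPR, a unified framework for multimodal place recognition that adapts to any sensor combination (e.g., camera, LiDAR, radar) with a single trained model. It unifies heterogeneous data in a polar bird's-eye-view (BEV) feature space, extracts discriminative features with a multi-branch network, and achieves state-of-the-art performance on seven datasets.

Details

Motivation: Existing multimodal place recognition methods face three key challenges: they cannot dynamically adapt to arbitrary modality inputs, they lack robustness to missing or degraded modalities, and they generalize poorly across different sensor configurations.

Result: Experiments on seven datasets show that UniMPR achieves state-of-the-art performance across sensor configurations, modality combinations, and environmental conditions.

Insight: The innovations are unifying heterogeneous inputs in a polar BEV feature space to handle data heterogeneity, and large-scale pre-training with an adaptive label assignment strategy to strengthen generalization and robustness. Viewed objectively, the unified design is valuable for the flexibility and fault tolerance of sensor configurations in practical autonomous driving systems.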

Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-modal and inter-modal features from any modality combination. To fully exploit the network’s generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at https://github.com/QiZS-BIT/UniMPR.
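
To make the idea of a polar BEV representation concrete, the following is a minimal sketch of binning 2D points into a ring-by-sector grid. It is not the UniMPR implementation: the function name `polar_bev_occupancy`, the grid size, the range limit, and the use of simple hit counts as cell features are all assumptions chosen for illustration.

```python
import math

def polar_bev_occupancy(points, num_rings=4, num_sectors=8, max_range=40.0):
    """Bin (x, y) points into a polar grid and count hits per cell.

    Hypothetical helper: grid size and max_range are illustrative defaults,
    not values taken from the paper.
    """
    grid = [[0] * num_sectors for _ in range(num_rings)]
    for x, y in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue  # drop points beyond the assumed sensing range
        theta = math.atan2(y, x) % (2.0 * math.pi)  # angle mapped to [0, 2*pi)
        # Clamp both indices to guard against floating-point edge cases.
        ring = min(int(r / max_range * num_rings), num_rings - 1)
        sector = min(int(theta / (2.0 * math.pi) * num_sectors), num_sectors - 1)
        grid[ring][sector] += 1
    return grid

points = [(1.0, 0.5), (1.0, 2.0), (-30.0, -1.0), (100.0, 0.0)]
grid = polar_bev_occupancy(points)
```

Because every sensor's measurements can be projected to ground-plane coordinates first, a grid like this gives camera, LiDAR, and radar branches a shared spatial layout, which is the property the abstract relies on.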


[55] Pyramidal Adaptive Cross-Gating for Multimodal Detection cs.CV [PDF]

Zidong Gu, Shoufu Tian

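TL;DR: This paper proposes the Pyramidal Adaptive Cross-Gating Network (PACGNet) for multimodal object detection. It targets two problems in existing feature fusion: susceptibility to cross-modal noise and disruption of the feature pyramid's hierarchical structure. Two core modules, Symmetrical Cross-Gating (SCG) and Pyramidal Feature-aware Multimodal Gating (PFMG), perform deep fusion inside the backbone to selectively absorb complementary information, suppress noise, and preserve the feature hierarchy, which improves fine-grained detection of small objects.

Details

Motivation: For multimodal object detection in aerial imagery, existing methods usually rely on simple feature fusion strategies. These tend to introduce cross-modal noise and disrupt the hierarchical structure of the feature pyramid, which hurts fine-grained detection of small objects.

Result: On the DroneVehicle and VEDAI datasets, PACGNet sets a new state of the art, with mAP50 scores of 81.7% and 82.1%, respectively.

Insight: The innovations are: 1. the Symmetrical Cross-Gating (SCG) module, which uses a bidirectional, symmetrical "horizontal" gating mechanism to selectively fuse complementary information and suppress noise; 2. the Pyramidal Feature-aware Multimodal Gating (PFMG) module, which reconstructs the feature hierarchy through progressive hierarchical gating, using detailed features from higher-resolution levels to guide fusion at lower-resolution levels and thereby preserving fine-grained detail. Viewed objectively, this deep, hierarchical fusion inside the backbone directly addresses the noise and structural disruption problems in multimodal detection and is worth borrowing.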

Abstract: Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical “horizontal” gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 81.7% and 82.1% respectively.
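
The bidirectional gating idea behind SCG can be illustrated with a toy, elementwise sketch. This is not the PACGNet module: in a real network the gates would be produced by learned layers over feature maps, whereas here the gate is an assumed fixed sigmoid of the other modality's activation, and the names `symmetric_cross_gate`, `rgb`, `ir`, and `scale` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def symmetric_cross_gate(rgb, ir, scale=1.0):
    """Toy symmetric cross-gating over two equal-length feature vectors.

    Each branch keeps its own features through a residual path and adds the
    other branch's features weighted by a sigmoid gate. The gate form is an
    illustrative assumption, not the learned gate from the paper.
    """
    rgb_out = [r + sigmoid(scale * i) * i for r, i in zip(rgb, ir)]
    ir_out = [i + sigmoid(scale * r) * r for r, i in zip(rgb, ir)]
    return rgb_out, ir_out

rgb_out, ir_out = symmetric_cross_gate([1.0, -2.0], [0.0, 0.0])
```

The residual path is what keeps each modality's own semantics intact: when the other branch contributes nothing (all zeros above), the RGB features pass through unchanged.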


[56] MatSpray: Fusing 2D Material World Knowledge on 3D Geometry cs.CV | cs.GR [PDF]

Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

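TL;DR: This paper proposes MatSpray, a framework that combines learning-based and projection-based approaches to fuse material parameters predicted by 2D diffusion models (such as albedo, roughness, and metallicity) into 3D geometry reconstructed with Gaussian Splatting. A lightweight neural refinement step improves fine-scale accuracy and multi-view consistency, yielding relightable, photorealistic renderings.

Details

Motivation: Existing 3D reconstruction methods fall short in relighting scenarios because they lack precise, spatially varying material parameters, and 2D material maps are hard to transfer effectively onto 3D geometry.

Result: The method outperforms existing techniques in both quantitative metrics and perceived visual realism, improving the rendering accuracy and realism of reconstructed scenes.

Insight: The innovations are a framework that fuses 2D materials into 3D geometry by combining learning and projection, direct projection based on Gaussian ray tracing, and a lightweight neural refiner (Neural Merger) that enhances detail and consistency. The ideas of multimodal data fusion and neural post-processing refinement are worth borrowing.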

Abstract: Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.


[57] MCVI-SANet: A lightweight semi-supervised model for LAI and SPAD estimation of winter wheat under vegetation index saturation cs.CV | cs.AI [PDF]

Zhiheng Zhang, Jiajun Yang, Hong Sun, Dong Wang, Honghua Jiang

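TL;DR: This paper proposes MCVI-SANet, a lightweight semi-supervised vision model for accurately estimating leaf area index (LAI) and chlorophyll content (SPAD) of winter wheat during the vegetation index saturation stage. A newly designed vegetation index saturation-aware block and a VICReg-based semi-supervised strategy improve feature expressiveness and generalization, giving state-of-the-art prediction accuracy while keeping inference fast.

Details

Motivation: The work addresses vegetation index saturation during the dense canopy stage of winter wheat and the scarcity of ground-truth annotations. Existing machine learning methods based on vegetation indices and texture have limited feature expressiveness, and deep learning baselines generalize poorly because of domain gaps and high data demands.

Result: Over 10 repeated runs, MCVI-SANet achieves state-of-the-art accuracy for both LAI and SPAD estimation. For LAI, the average R² is 0.8123 and the RMSE is 0.4796; for SPAD, the average R² is 0.6846 and the RMSE is 2.4222. This surpasses the best baselines, with improvements of 8.95% in average LAI R² and 8.17% in average SPAD R². The model has only 0.10M parameters and maintains high inference speed.

Insight: The main innovations are: 1) a newly designed vegetation index saturation-aware block for adaptive channel-spatial feature enhancement; 2) a VICReg-based semi-supervised strategy to improve generalization; 3) a vegetation height-informed dataset partitioning strategy that keeps each growth stage represented. Viewed objectively, combining semi-supervised learning with agronomic priors (such as the vegetation index saturation problem) offers a promising lightweight approach for remote sensing-based precision agriculture.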

Abstract: Vegetation index (VI) saturation during the dense canopy stage and limited ground-truth annotations of winter wheat constrain accurate estimation of LAI and SPAD. Existing VI-based and texture-driven machine learning methods exhibit limited feature expressiveness. In addition, deep learning baselines suffer from domain gaps and high data demands, which restrict their generalization. Therefore, this study proposes the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet), a lightweight semi-supervised vision model. The model incorporates a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. It also integrates a VICReg-based semi-supervised strategy to further improve generalization. Datasets were partitioned using a vegetation height-informed strategy to maintain representativeness across growth stages. Experiments over 10 repeated runs demonstrate that MCVI-SANet achieves state-of-the-art accuracy. The model attains an average R² of 0.8123 and RMSE of 0.4796 for LAI, and an average R² of 0.6846 and RMSE of 2.4222 for SPAD. This performance surpasses the best-performing baselines, with improvements of 8.95% in average LAI R² and 8.17% in average SPAD R². Moreover, MCVI-SANet maintains high inference speed with only 0.10M parameters. Overall, the integration of semi-supervised learning with agronomic priors provides a promising approach for enhancing remote sensing-based precision agriculture.
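
For readers less familiar with the two regression metrics reported here, the sketch below computes the coefficient of determination and the root mean squared error using their standard textbook definitions. It assumes the paper uses these definitions; the function name `r2_and_rmse` and the sample values are hypothetical and are not data from the study.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for paired ground-truth and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Illustrative values only: predicting the mean everywhere gives R^2 = 0.
r2_mean, rmse_mean = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
r2_exact, rmse_exact = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```

Note that RMSE is in the units of the target, so the LAI and SPAD RMSE values in the abstract are not directly comparable with each other, while the R² values are.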


[58] Enhancing 3D Semantic Scene Completion with a Refinement Module cs.CV [PDF]

Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song

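TL;DR: This paper proposes ESSC-RM, a plug-and-play enhancement framework for 3D semantic scene completion (SSC). It contains a refinement module that integrates seamlessly into existing SSC models. The workflow has two phases: a baseline SSC network first produces a coarse voxel prediction, which is then refined under multiscale supervision by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and a Voxel-level Local Geometry Module (VLGM).

Details

Motivation: The goal is to improve existing 3D semantic scene completion models, whose predictions can be noisy and lack geometric detail, by providing a general refinement framework that can enhance many SSC models.

Result: Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, mean IoU (mIoU) rises from 16.87% to 17.27% and from 11.08% to 11.51%, respectively.

Insight: The claimed innovation is ESSC-RM, a general plug-and-play refinement framework whose core is a two-module refinement strategy combining prediction noise awareness (PNAM) and local geometry modeling (VLGM). Viewed objectively, modularizing what is usually a post-processing refinement step and folding it into end-to-end training, together with multiscale supervision to refine details, is an idea worth borrowing.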

Abstract: We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.


[59] RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion cs.CV [PDF]

Wenhao Hu, Haonan Zhou, Zesheng Li, Liu Liu, Jiacheng Dong

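TL;DR: This paper proposes RecurGS, an interactive scene modeling framework based on discrete-state recurrent Gaussian fusion. It addresses the difficulty existing 3D scene representations have in adapting to discrete scene changes and building interactive 3D environments. By detecting object-level changes between consecutive states, aligning geometric motion, and applying recurrent updates, the framework progressively fuses multiple discrete Gaussian scene states into a single evolving, interactive representation.

Details

Motivation: Existing 3D scene representations achieve high-quality novel view synthesis but cannot adapt to discrete scene changes (such as moved objects) or support building interactive environments. Existing approaches either update only a single scene without supporting novel-state synthesis, or rely on diffusion-based object-background decoupling and cannot fuse information across multiple observations.

Result: Extensive experiments on synthetic and real-world datasets show that the framework delivers high-quality reconstructions with substantially improved update efficiency, offering a scalable step toward continuously interactive Gaussian worlds.

Insight: The innovations are: a recurrent fusion framework that incrementally integrates discrete scene states; geometric motion alignment using semantic correspondence and Lie-algebra based SE(3) refinement; replay supervision to preserve historical structures; and a voxelized, visibility-aware fusion module that selectively incorporates newly observed regions. Together these mitigate catastrophic forgetting, support efficient long-horizon updates, and allow novel scene states to be synthesized without additional scans.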

Abstract: Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.
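
The abstract mentions Lie-algebra based SE(3) refinement without giving details. As background, the sketch below shows the standard exponential map that turns a 6D twist (rotation vector `omega`, translation part `v`) into a rotation matrix and translation, which is the usual way an optimizer updates a rigid pose in the Lie algebra. It is a generic textbook construction, not the RecurGS code; the helper names are hypothetical.

```python
import math

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def se3_exp(omega, v):
    """Exponential map from a twist (omega, v) to (R, t) via Rodrigues' formula."""
    theta = math.sqrt(sum(c * c for c in omega))
    K = hat(omega)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b, c = 1.0, 0.5, 1.0 / 6.0  # small-angle limits of the coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return R, t

# A quarter turn about the z-axis with no translation.
R, t = se3_exp((0.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0))
moved = [sum(R[i][j] * p for j, p in enumerate((1.0, 0.0, 0.0))) for i in range(3)]
```

Optimizing over the six twist parameters rather than the nine entries of a rotation matrix keeps every intermediate pose a valid rigid transform, which is the usual motivation for this parameterization.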


[60] Automated Mosaic Tesserae Segmentation via Deep Learning Techniques cs.CV | cs.LG [PDF]

Charilaos Kapelonis, Marios Antonakakis, Konstantinos Politof, Aristomenis Antoniadis, Michalis Zervakis

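TL;DR: This paper proposes a method that uses the Segment Anything Model 2 (SAM 2) foundation model to automatically segment the tesserae in mosaics. Because public datasets in this field are limited, the authors also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation shows that the fine-tuned SAM 2 model improves on both baseline SAM 2 and prior methods across several metrics.

Details

Motivation: Mosaics are an important part of cultural heritage, but their age and fragility make them prone to damage, so digital preservation is needed. The paper addresses a key problem in mosaic digitization: automatically segmenting the tesserae to separate them from the background, which is an image segmentation task in computer vision.

Result: On the authors' own test set, the fine-tuned SAM 2 model improves over baseline SAM 2, with Intersection over Union rising from 89.00% to 91.02% and recall from 92.12% to 95.89%. On a benchmark proposed by a prior approach, the model's F-measure is 3% higher than previous methods, and the absolute error between predicted and actual tesserae counts drops from 0.20 to 0.02.

Insight: The main innovation is adapting and fine-tuning SAM 2, a strong general-purpose segmentation foundation model, for a specific cultural heritage digitization task (mosaic tesserae segmentation), and addressing data scarcity in the field by creating a dedicated annotated dataset. This offers a feasible path toward real-time segmentation of mosaic images.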

Abstract: Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.
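
The segmentation metrics quoted above (Intersection over Union, recall, F-measure) can be computed from pixel-level counts as sketched below. The definitions are the standard ones and are assumed to match the paper's; the function name `mask_metrics` and the tiny example masks are hypothetical.

```python
def mask_metrics(pred, truth):
    """Compute IoU, recall, and F-measure for flat binary masks (1 = tessera)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return {"iou": iou, "recall": recall, "f_measure": f_measure}

# One true positive, one false positive, one false negative.
metrics = mask_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

IoU penalizes false positives and false negatives together, which is why it can sit below recall, as it does in the reported numbers.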


[61] Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval cs.CV [PDF]

Dimitrios Georgoulopoulos, Nikolaos Chaidos, Angeliki Dimitriou, Giorgos Stamou

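TL;DR: The paper proposes PRISm, a framework that combines relational reasoning with visual representation for image retrieval through an importance prediction module and an edge-aware graph neural network, producing semantic retrieval that aligns more closely with human perception.

Details

Motivation: Traditional image retrieval methods struggle to capture the relational and contextual nuances of a scene. PRISm addresses this by modeling the semantic importance of objects and their interactions.

Result: Extensive experiments on benchmark and real-world datasets show that PRISm consistently outperforms existing methods in top-ranked performance, and qualitative analyses show that it accurately captures key objects and interactions.

Insight: The innovations are an Importance Prediction Module that prunes irrelevant elements, and an Edge-Aware Graph Neural Network that explicitly encodes relational structure and combines it with global visual features to produce semantically informed image embeddings.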

Abstract: Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.


[62] AmPLe: Supporting Vision-Language Models via Adaptive-Debiased Ensemble Multi-Prompt Learning cs.CV | cs.AI [PDF]

Fei Song, Yi Li, Jiangmeng Li, Rui Wang, Changwen Zheng

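TL;DR: This paper proposes AmPLe, an adaptive-debiased ensemble multi-prompt learning method that addresses model-prompt matching bias and sample-prompt matching bias in multi-prompt learning for vision-language models. It uses ensemble learning to aggregate the benefits of diverse predictions and, guided by an information-theoretic analysis, extracts prompt-relevant semantics to adaptively compute debiased ensemble weights, improving downstream performance.

Details

Motivation: Existing multi-prompt learning methods focus on using carefully designed prompts within a single foundation vision-language model. They overlook model-prompt matching bias (the same prompt conveys different semantics in different models) and sample-prompt matching bias (input samples contain prompt-irrelevant semantics). These biases hold back multi-prompt learning, so a method is needed that mitigates both at once.

Result: Extensive experiments on three representative tasks (generalization to novel classes, to new target datasets, and to unseen domain shifts) show that AmPLe broadly outperforms existing methods and reaches state-of-the-art (SOTA) performance.

Insight: The innovations are: 1) the first systematic identification and mitigation of model-prompt and sample-prompt matching bias in multi-prompt learning; 2) an adaptive debiased ensemble weighting method, based on information-theoretic analysis, that extracts prompt-relevant semantics; 3) theoretical validation from a causal perspective that supports the method's effectiveness. Viewed objectively, the ensemble and debiasing strategy gives multi-prompt learning a more robust and general solution.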

Abstract: Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions for an identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach to sufficiently aggregate the benefits of diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples for generating weights of ensemble learning can lead to suboptimal performance. In response, we extract prompt-relevant semantics from input samples by leveraging the guidance of the information theory-based analysis, adaptively calculating debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble Multi-Prompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe can widely outperform existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.


[63] E-RGB-D: Real-Time Event-Based Perception with Structured Light cs.CV | eess.IV [PDF]

Seyed Ehsan Marjani Bajestani, Giovanni Beltrame

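TL;DR: This paper proposes E-RGB-D, a new perception system that pairs an event camera with a Digital Light Processing projector to form an active structured light system, enabling real-time, high-frame-rate, per-pixel color and depth sensing.

Details

Motivation: Traditional monochrome event cameras are limited in detecting static or slowly moving objects and lack the color information that many applications need. This work aims to develop a method that captures color and depth simultaneously with high dynamic range and high temporal resolution.

Result: The method achieves a color detection speed equivalent to 1400 fps and pixel depth detection at 4 kHz, a substantial performance advance for real-time RGB-D perception.

Insight: The core innovation is combining an event camera with active structured light projection. Dynamic projection adjustments optimize bandwidth and enable selective color data acquisition, producing colored point clouds without sacrificing spatial resolution. This offers a new high-frame-rate, high-dynamic-range perception solution for robotics, 3D reconstruction, and related fields.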

Abstract: Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of color and the depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color detection speed equivalent to 1400 fps and 4 kHz of pixel depth detection, significantly advancing the realm of computer vision across diverse fields from robotics to 3D reconstruction methods. Our code is publicly available: https://github.com/MISTLab/event_based_rgbd_ros


[64] Object-Centric Framework for Video Moment Retrieval cs.CV [PDF]

Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli

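TL;DR: This paper proposes an object-centric framework for video moment retrieval. It extracts query-relevant objects and builds scene graphs to generate object-level feature sequences, then models spatio-temporal correlations among objects with a relational tracklet transformer, allowing more accurate localization of moments aligned with object-oriented queries.

Details

Motivation: Existing video moment retrieval methods rely mainly on frame- or clip-level feature sequences. These encode global visual and semantic information but often miss fine-grained object semantics and appearance, and object-level temporal dynamics are overlooked, which limits effectiveness in scenarios that require detailed object-level reasoning.

Result: On the three benchmarks Charades-STA, QVHighlights, and TACoS, the method outperforms existing state-of-the-art methods.

Insight: The innovation is an object-centric representation: scene graph parsing extracts query-relevant objects and builds object-level feature sequences, and a relational tracklet transformer captures spatio-temporal changes in object state, improving localization accuracy for object-oriented queries.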

Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.


[65] Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models cs.CV [PDF]

Xiaoyang Guo, Keze Wang

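TL;DR: This paper proposes Adaptive-VoCo, a framework that adds a lightweight predictor to dynamically select the best visual token compression rate. It addresses the inability of fixed-rate methods to adapt to varying image complexity, reducing computational overhead while preserving strong cross-modal alignment.

Details

Motivation: Existing vision-language models such as VoCo-LLaMA compress visual patch tokens to reduce computational and memory costs, but a fixed compression rate limits their ability to adapt to different levels of visual complexity, so an adaptive compression method is needed.

Result: Experiments show that the method consistently outperforms fixed-rate baselines across multiple multimodal tasks, demonstrating the potential of adaptive visual compression for building more efficient and robust vision-language models.

Insight: The innovation is quantifying an image's visual complexity from statistical cues in the vision encoder (such as patch token entropy and attention map variance), and introducing a joint loss that combines rate regularization with complexity alignment to dynamically balance inference efficiency against representational capacity.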

Abstract: In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image’s visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.
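
As a rough illustration of how statistical cues could drive a compression rate, the sketch below maps an entropy score and a variance score to a token budget. Everything here is an assumption: the paper uses a learned lightweight predictor, whereas this toy uses a fixed blend; the names `choose_num_tokens`, `budgets`, and `alpha`, the normalization of the variance, and the candidate budgets are all hypothetical.

```python
import math

def shannon_entropy(weights):
    """Entropy of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def choose_num_tokens(patch_scores, attn_values, budgets=(1, 2, 4, 8), alpha=0.5):
    """Pick a compressed-token budget from a hand-made complexity score in [0, 1].

    Stand-in for a learned predictor: higher entropy and higher attention
    variance are treated as signs of a more complex image.
    """
    max_entropy = math.log(len(patch_scores))
    norm_entropy = shannon_entropy(patch_scores) / max_entropy if max_entropy > 0 else 0.0
    var = variance(attn_values)
    norm_var = var / (1.0 + var)  # squash variance into [0, 1)
    complexity = alpha * norm_entropy + (1.0 - alpha) * norm_var
    index = min(int(complexity * len(budgets)), len(budgets) - 1)
    return budgets[index]

simple_budget = choose_num_tokens([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
complex_budget = choose_num_tokens([1.0, 1.0, 1.0, 1.0], [0.0, 10.0, 0.0, 10.0])
```

The design point this illustrates is the trade-off named in the abstract: a simple image gets few tokens (cheaper inference), while a complex one keeps more tokens (more representational capacity).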


[66] GTMA: Dynamic Representation Optimization for OOD Vision-Language Models cs.CV | cs.AI [PDF]

Jensen Zhang, Ningyuan Liu, Keze Wang

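TL;DR: This paper proposes GTMA, a dynamic representation optimization framework that addresses the cross-modal alignment collapse vision-language models suffer in open-world applications when they meet out-of-distribution (OOD) concepts. The method constructs a continuous pseudo-word embedding to bypass the text encoder's vocabulary limits and optimizes the representation with an adaptive gradient-based algorithm, improving zero-shot and few-shot OOD performance.

Details

Motivation: In open-world applications, vision-language models suffer cross-modal alignment collapse triggered by OOD concepts. The root cause is modal asymmetry: the visual encoder can extract discriminative features from unseen images, but the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA are only partial remedies because they remain confined to the pre-trained semantic space.

Result: On ImageNet-R and the VISTA-Beyond benchmark, GTMA improves the base vision-language model's zero-shot and few-shot OOD accuracy by up to 15-20% while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.

Insight: The innovation is the GTMA dynamic representation optimization framework, which constructs a continuous pseudo-word embedding to bypass vocabulary limits and uses an adaptive gradient-based representation policy optimization algorithm with semantic regularization to keep the representation plausible and compatible with prior knowledge. This offers a new way to handle OOD concepts by sidestepping the text encoder's vocabulary bottleneck.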

Abstract: Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image’s visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model’s prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.
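
The abstract does not specify the adaptive gradient-based representation policy optimization algorithm, so the sketch below substitutes plain gradient ascent to show the underlying idea: move a continuous embedding toward a visual anchor while a regularizer keeps it near its starting point. The objective, the L2 regularizer, and the names `optimize_pseudo_word`, `lr`, and `reg` are all assumptions, and the 3D vectors stand in for real encoder outputs.

```python
import math

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def optimize_pseudo_word(init_embedding, visual_anchor, steps=100, lr=0.1, reg=0.01):
    """Gradient ascent on cos(e, anchor) - reg * ||e - init||^2.

    Plain gradient ascent is swapped in for the paper's optimizer, and the
    L2 pull toward the initial embedding is an assumed form of regularization.
    """
    e = list(init_embedding)
    v_norm = norm(visual_anchor)
    for _ in range(steps):
        e_norm = norm(e)
        cos = cosine(e, visual_anchor)
        grad = [
            v / (e_norm * v_norm) - cos * x / (e_norm ** 2)  # d cos / d e
            - 2.0 * reg * (x - x0)                           # regularizer gradient
            for x, v, x0 in zip(e, visual_anchor, init_embedding)
        ]
        e = [x + lr * g for x, g in zip(e, grad)]
    return e

start = [1.0, 0.0, 0.0]
anchor = [0.0, 1.0, 0.0]
pseudo_word = optimize_pseudo_word(start, anchor)
```

Because the optimized vector lives in the continuous embedding space rather than the discrete vocabulary, it can represent a concept that no existing token describes, which is the point of bypassing the vocabulary.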


[67] WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring cs.CV [PDF]

Moses Kiprono

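TL;DR: This paper presents WoundNet-Ensemble, an Internet of Medical Things (IoMT) ensemble system that fuses three deep learning architectures (ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer) to automatically classify six clinical wound types, reaching 99.90% ensemble accuracy on a comprehensive dataset of 5,175 images. The system also implements longitudinal wound healing tracking that computes healing rates and severity scores and generates clinical alerts.

Details

Motivation: Chronic wounds such as diabetic foot ulcers impose a heavy clinical and economic burden, yet current wound assessment relies mainly on subjective judgment, which leads to inconsistent classification and delayed interventions. The paper aims to use artificial intelligence to build an automated, high-accuracy system for wound classification and healing progression monitoring, addressing key needs in telemedicine and remote patient monitoring.

Result: On a dataset of 5,175 wound images covering diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors, the ensemble reaches 99.90% accuracy. The weighted fusion strategy improves on previous state-of-the-art methods by 3.7%.

Insight: The main innovation is combining self-supervised learning (DINOv2) with supervised models (ResNet-50, Swin Transformer) through a weighted fusion strategy, building a complementary multi-model system that markedly improves classification accuracy. The system also integrates longitudinal healing tracking, giving a complete clinical toolchain from classification to monitoring, and the authors commit to releasing code and models to support reproducibility.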

Abstract: Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.
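
A weighted fusion of several classifiers is commonly implemented as a weighted average of their class probabilities. The sketch below shows that generic scheme; the abstract does not state how WoundNet-Ensemble sets its weights or combines outputs, so the function name `weighted_fusion`, the weights, and the two-class probabilities are illustrative assumptions.

```python
def weighted_fusion(prob_lists, weights):
    """Fuse per-model class probabilities with a normalized weighted average.

    Returns the fused distribution and the index of the most likely class.
    """
    total = sum(weights)
    num_classes = len(prob_lists[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    return fused, label

# Three hypothetical models voting on two classes.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
fused_equal, label_equal = weighted_fusion(model_probs, [1.0, 1.0, 1.0])
fused_skewed, label_skewed = weighted_fusion(model_probs, [8.0, 1.0, 1.0])
```

With equal weights the two agreeing models win; weighting the first model heavily flips the decision, which shows why the choice of weights matters for an ensemble's reported accuracy.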


[68] Hierarchical Bayesian Framework for Multisource Domain Adaptation cs.CV [PDF]

Alexander M. Glandon, Khan M. Iftekharuddin

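TL;DR: This paper proposes a hierarchical Bayesian framework for multisource domain adaptation (MDA). The framework uses the similarity between the data distributions of different source domains to optimize pretraining, improving recognition accuracy on the target domain.

Details

Motivation: Existing MDA methods treat source model pretraining in an ad hoc way, either sharing weights or using independently trained models, and lack a unifying theoretical framework. The paper aims to build a Bayesian framework that systematically guides MDA pretraining by exploiting the fact that different source domain distributions are typically similar.

Result: Experiments on a large benchmark dataset show that the framework improves accuracy on recognition tasks. On human action recognition with the challenging multi-domain benchmark Daily-DA RGB video, the proposed Bayesian framework achieves a 17.29% accuracy improvement over existing state-of-the-art (SOTA) MDA methods.

Insight: The innovation is modeling the similarity between source domain distributions as a hierarchical Bayesian prior, which enables knowledge sharing and regularization during pretraining. This gives multisource domain adaptation a principled probabilistic framework rather than relying on a specific network structure or heuristic.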

Abstract: Multisource domain adaptation (MDA) aims to use multiple source datasets with available labels to infer labels on a target dataset without available labels for target supervision. Prior work on MDA in the literature is ad hoc, as the pretraining of source models is either based on weight sharing or uses independently trained models. This work proposes a Bayesian framework for pretraining in MDA by considering that the distributions of different source domains are typically similar. The Hierarchical Bayesian Framework uses similarity between the different source data distributions to optimize the pretraining for MDA. Experiments using the proposed Bayesian framework for MDA show that our framework improves accuracy on recognition tasks for a large benchmark dataset. Performance comparison with state-of-the-art MDA methods on the challenging problem of human action recognition in multi-domain benchmark Daily-DA RGB video shows the proposed Bayesian Framework offers a 17.29% improvement in accuracy when compared to the state-of-the-art methods in the literature.


[69] Enhancing Medical Large Vision-Language Models via Alignment Distillation cs.CV | cs.AI [PDF]

Aofei Chang, Ting Wang, Fenglong Ma

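TL;DR: This paper proposes MEDALIGN, a lightweight alignment distillation framework that addresses hallucinated outputs in medical large vision-language models (Med-LVLMs) caused by misaligned visual understanding. The framework distills visual alignment knowledge from a domain-specific CLIP model and introduces a spatial-aware visual alignment loss and an attention-aware distillation loss to strengthen visual representation learning and attention alignment.

Details

Motivation: In clinical applications, Med-LVLMs often produce hallucinated outputs due to misaligned visual understanding, which stems mainly from insufficient visual representation learning and poor visual attention alignment. A method is therefore needed to improve their visual alignment.

Result: Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves performance and interpretability, yielding more visually grounded outputs.

Insight: The innovation is a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific CLIP model through spatial-aware and attention-aware distillation losses, strengthening the visual representations and attention of Med-LVLMs and reducing hallucinated outputs.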

Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.


[70] OpenView: Empowering MLLMs with Out-of-view VQA cs.CVPDF

Qixiang Chen, Cheng Zhang, Chi-Wing Fu, Jingwen Ye, Jianfei Cai

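TL;DR: This paper proposes OpenView, a framework for strengthening the out-of-view (OOV) visual question answering (VQA) ability of multimodal large language models (MLLMs). By using panoramic imagery to generate a high-quality multi-choice VQA dataset (OpenView-Dataset) and building an evaluation benchmark (OpenView-Bench), the work presents the first systematic study of how well MLLMs reason about content beyond the visible image frame. Experiments show that MLLMs enhanced with OpenView improve on OOV VQA from 48.6% to 64.1% on average.

Details

Motivation: Current MLLMs are mainly good at reasoning about content inside the visible image frame and lack the ability to reason about out-of-view (OOV) objects, activities, and scenes. This paper targets that limitation with the first systematic study of OOV understanding in MLLMs.

Result: On the OpenView-Bench benchmark, multiple MLLMs enhanced with OpenView raise their average accuracy on OOV VQA answer selection from 48.6% to 64.1%, though a large gap to human performance remains.

Insight: The contributions are: 1) a four-stage pipeline (OpenView) that uses panoramic imagery to generate context-rich, spatially grounded multi-choice VQA data; 2) a high-quality synthetic dataset (OpenView-Dataset) for supervised fine-tuning; 3) an interpretable benchmark (OpenView-Bench) that jointly evaluates choice and rationale accuracy. Together these offer a systematic data-generation and evaluation approach for extending the spatial reasoning of MLLMs.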

Abstract: Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that generates multi-choice VQA at scale by leveraging panoramic imagery to enable context-rich and spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, although a large gap to human performance remains in OOV VQA answer selection, multiple MLLMs empowered by OpenView consistently improve, rising from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.


[71] Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model cs.CV | cs.AI | cs.LGPDF

Sumaiya Ali, Areej Alhothali, Ohoud Alzamzami, Sameera Albasri, Ahmed Abduljabbar

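TL;DR: This study proposes an MRI-based hybrid 3D deep learning model for automated detection of Placenta Accreta Spectrum (PAS). The model combines a 3D DenseNet121 to capture local features with a 3D Vision Transformer (ViT) to model global spatial context, aiming to address the diagnostic difficulty caused by variability in radiologists' interpretations.

Details

Motivation: PAS is a serious obstetric condition that is hard to diagnose because radiologists interpret MRI images differently, so automated tools are needed to improve diagnostic consistency and accuracy.

Result: On a retrospective dataset of 1,133 MRI volumes, the DenseNet121-ViT model performed best on an independent test set, with a five-run average accuracy of 84.3%, ahead of the other 3D deep learning architectures compared.

Insight: The novelty is the combination of a 3D CNN (DenseNet121) with a 3D Transformer (ViT) to exploit local features and global context together. This offers a hybrid-model approach to computer-aided diagnosis in medical image analysis and may improve diagnostic robustness and consistency.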

Abstract: Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists’ interpretations. To overcome this challenge, this study proposes a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model’s performance demonstrates a clear potential to assist radiologists by providing robust decision support that improves diagnostic consistency across interpretations and ultimately enhances the accuracy and timeliness of PAS diagnosis.


[72] Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach cs.CV | cs.GRPDF

Zhe Li, Kun Cheng, Hanyue Mo, Jintao Lu, Ziwen Kuang

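TL;DR: This paper proposes a vision-based trajectory analysis method to address the “zero-speed braking” problem that inaccurate CAN signals cause in commercial vehicle automatic emergency braking systems during low-speed operation. Running on the NVIDIA Jetson AGX Xavier platform, the method processes sequential video frames from a blind spot camera, using adaptive CLAHE-enhanced SIFT feature extraction and KNN-RANSAC matching to classify the vehicle's motion state precisely.

Details

Motivation: To resolve the frequent false triggering of “zero-speed braking” in commercial vehicle AEB systems under low-speed conditions caused by inaccurate CAN signals, and thereby improve system reliability.

Result: On a real-world dataset, the F1-score is 99.96% for static detection and 97.78% for moving state recognition, with a processing delay of 14.2 ms. On-site deployment shows an 89% reduction in false braking events, a 100% emergency braking success rate, and a fault rate below 5%.

Insight: The innovations are: 1) multi-frame trajectory displacement statistics; 2) a dual-threshold state decision matrix; 3) OBD-II driven dynamic ROI configuration. Through visual trajectory analysis, the method suppresses environmental interference and false detection of dynamic objects, directly targeting low-speed false activation in commercial vehicle safety systems.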

Abstract: A vision-based trajectory analysis solution is proposed to address the “zero-speed braking” issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm runs on the NVIDIA Jetson AGX Xavier platform and processes sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle’s motion state (static, vibration, moving). Key innovations include 1) multiframe trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation on a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). On-site deployment shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.
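
The sliding-window and dual-threshold ideas in this abstract can be illustrated with a minimal sketch. It assumes that each frame has already been reduced to one displacement value (for example, the median pixel shift of the matched features); the function names and the thresholds `low` and `high` are hypothetical, and the paper's actual decision matrix is not given in the abstract.

```python
from collections import deque
from statistics import mean

WINDOW = 5  # frames in the sliding window, as stated in the abstract


def classify_window(displacements, low=0.5, high=2.0):
    """Map a window of per-frame displacements (pixels) to a motion state.

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    avg = mean(displacements)
    if avg < low:
        return "static"
    if avg < high:
        return "vibration"
    return "moving"


def classify_stream(per_frame_displacement, low=0.5, high=2.0):
    """Emit one state per frame once the 5-frame window is full."""
    window = deque(maxlen=WINDOW)
    states = []
    for d in per_frame_displacement:
        window.append(d)
        if len(window) == WINDOW:
            states.append(classify_window(window, low, high))
    return states
```

Averaging over a short window is one simple way to keep a single noisy frame from flipping the state, which is the kind of false activation the abstract targets.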


[73] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback cs.CVPDF

Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez

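TL;DR: This paper proposes SimpleCall, a lightweight image restoration agent that addresses the inefficiency and label dependence of existing methods on complex image restoration tasks. The framework uses policy optimization to learn an agent that decides tool-calling sequences in label-free environments from the perceptual feedback of multimodal large language models, substantially speeding up inference while maintaining high restoration quality.

Details

Motivation: Existing restoration agents based on vision-language models and large language models suffer efficiency bottlenecks (reflection, rollback, and iterative tool searching), and their performance depends heavily on degradation recognition models that need extensive annotated training data, which limits their use in label-free environments.

Result: Experiments across diverse degradation scenarios show that, despite using no supervision, the method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics.

Insight: The novelty is a lightweight agent framework based on policy optimization, together with a reward mechanism driven by multimodal large language models that act as human-aligned evaluators and supply perceptual feedback for policy improvement in label-free environments, yielding efficient, high-quality deterministic restoration plans.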

Abstract: Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plan without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.


[74] Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments cs.CV | cs.AIPDF

Saeideh Yousefzadeh, Hamidreza Pourreza

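TL;DR: This paper proposes Text2Graph VPR, an explainable visual place recognition system that converts image sequences into textual descriptions, parses them into scene graphs, and matches places on the graph structure, to cope with lighting, weather, and seasonal change in long-term deployment.

Details

Motivation: Long-term visual place recognition needs to go beyond pixel similarity, make transparent and interpretable decisions, and stay robust under drastic appearance change.

Result: Validation on the Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks shows robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries.

Insight: The novelty is converting images to text and then to scene graphs for structured reasoning, combined with a dual-similarity mechanism that fuses Graph Attention Network embeddings with a shortest-path kernel. This enables both learned semantic matching and topology-aware comparison, and provides human-readable intermediate representations that make decisions more transparent.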

Abstract: Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and – critically – produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
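
To make the structural half of the dual-similarity mechanism concrete, here is a minimal sketch of a textbook shortest-path graph kernel on labeled, unweighted scene graphs. It is not the authors' implementation: the graph encoding (adjacency dict plus label dict), the toy graphs, and the exact-match comparison of labels and distances are all assumptions made for illustration.

```python
from collections import Counter, deque


def sp_features(adj, labels):
    """Count (label_a, label_b, shortest_path_length) triples in one graph."""
    feats = Counter()
    for src in adj:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if src < dst:  # count each unordered node pair once
                a, b = sorted((labels[src], labels[dst]))
                feats[(a, b, d)] += 1
    return feats


def sp_kernel(g1, g2):
    """Similarity = number of matching labeled shortest paths in both graphs."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())


# Hypothetical toy scene graphs: node ids are ints, edges are undirected.
chain = {0: [1], 1: [0, 2], 2: [1]}
graph_a = (chain, {0: "car", 1: "road", 2: "tree"})
graph_b = (chain, {0: "car", 1: "road", 2: "building"})
```

With these toy graphs, `sp_kernel(graph_a, graph_a)` counts three matching paths, while `sp_kernel(graph_a, graph_b)` counts only the shared car-road edge, so the score drops when the scene content differs.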


[75] PTTA: A Pure Text-to-Animation Framework for High-Quality Creation cs.CV | cs.AIPDF

Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang

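TL;DR: This paper proposes PTTA, a pure text-to-animation framework for high-quality animation creation. The method builds on the pretrained text-to-video model HunyuanVideo and fine-tunes it on a small-scale but high-quality paired dataset of animation videos and textual descriptions to adapt it to animation-style generation. Extensive visual evaluations show that it consistently outperforms comparable baselines in animation video synthesis.

Details

Motivation: Traditional animation production involves complex pipelines and high manual labor cost. Recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis but show clear limitations on animation generation. Work such as AniSora shows promise by fine-tuning image-to-video models for animation styles, yet similar exploration in the pure text-to-video setting remains limited.

Result: Extensive visual evaluations show that the proposed method consistently outperforms comparable baselines in animation video synthesis.

Insight: The novelty is the construction of a small-scale but high-quality paired animation video-text dataset and fine-tuning of a pretrained text-to-video model on it, achieving purely text-driven animation-style generation and an efficient framework for high-quality animation creation.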

Abstract: Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.


[76] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers cs.CVPDF

Xiyue Bai, Ronghao Yu, Jia Xiu, Pengfei Zhou, Jie Xia

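TL;DR: This paper proposes Uni-Neur2Img, a unified framework based on diffusion transformers for generating, editing, and stylizing images directly from neural signals such as EEG. The framework injects neural signals through a parameter-efficient LoRA module and uses a causal attention mechanism to handle long-sequence conditions, achieving efficient and scalable performance across multiple tasks.

Details

Motivation: Existing research relies mainly on the text modality as an intermediary for generating images from neural signals and has explored little the use of visual modalities as direct conditioning signals. This paper aims to bridge neural signals and visual content generation, enabling more direct and flexible neural-signal-driven image processing.

Result: The method is evaluated on the public CVPR40 benchmark (EEG-driven image generation), the public Loongx dataset (neural signal-guided image editing), and the self-collected EEG-Style dataset (EEG-driven style transfer). The results show significant improvements in generation fidelity, editing consistency, and style transfer quality, with low computational overhead and strong scalability to additional modalities.

Insight: The contributions are: 1) a parameter-efficient LoRA-based neural signal injection module that processes each conditioning signal independently as a pluggable component, supporting flexible multi-modal conditioning without altering base model parameters; 2) a causal attention mechanism suited to the long-sequence modeling demands of conditional generation; 3) the EEG-Style dataset, which fills the gap in research on visual modalities as direct conditioning signals. Viewed objectively, the framework's unified and extensible design offers a practical way to combine neural signals with visual generation tasks.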

Abstract: Generating or editing images directly from neural signals has immense potential at the intersection of neuroscience, vision, and brain-computer interaction. In this paper, we present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism to accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.


[77] SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse cs.CVPDF

Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, Min Yang

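TL;DR: This paper proposes SmartSight, a training-free method for mitigating hallucination in video large language models. It generates multiple candidate responses and scores the hallucination level of each with the model's own "Temporal Attention Collapse" score, and it introduces the "Visual Attention Vanishing point" to make the assessment more efficient and terminate hallucinated responses early. Experiments show that SmartSight substantially lowers the hallucination rate while also improving video understanding and reasoning.

Details

Motivation: Existing video large language models suffer from severe perceptual hallucination, which limits their practical use. Several mitigation methods exist, but they often sacrifice the model's video understanding and reasoning ability. This paper aims to mitigate hallucination in a training-free manner without harming video understanding.

Result: On the VRIPT-HAL benchmark, SmartSight lowers the hallucination rate of Qwen2.5-VL-7B by 10.59%. On the VideoMMMU benchmark, video understanding and reasoning performance improves by up to 8.86%.

Insight: The core idea is to use the model's own introspective capabilities for training-free hallucination mitigation. Specifically: 1) the Temporal Attention Collapse score serves as a hallucination metric that measures whether the model over-focuses on trivial temporal regions of the video; 2) the Visual Attention Vanishing point enables more accurate hallucination estimation and early termination, reducing decoding cost. It is a novel analysis and intervention method built on the model's internal attention.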

Abstract: Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model’s capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model’s own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight’s effectiveness in improving the reliability of open-source Video-LLMs.


[78] brat: Aligned Multi-View Embeddings for Brain MRI Analysis cs.CV | cs.CLPDF

Maxime Kayser, Maksim Gridnev, Wanting Wang, Max Bain, Aneesh Rangnekar

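TL;DR: This paper proposes brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. To handle the numerous, highly varied, and often subtle abnormalities in brain MRI, which are typically confined to a few slices of a 3D volume, the authors introduce a brain MRI dataset 10 times larger than existing ones (about 80,000 3D scans with corresponding radiology reports) and propose a multi-view pre-training approach inspired by advances in document retrieval. The approach develops an implicit query-feature matching mechanism and adopts concepts from quality-diversity to obtain multi-view MRI embeddings aligned with the clinical features given by report sentences. Evaluation across multiple vision-language and vision tasks shows substantial performance improvements. The brat foundation models are publicly released.

Details

Motivation: To address the unique challenges of brain MRI analysis, where abnormalities are numerous, highly varied, often subtle, and confined to a few slices, and to learn better representations from large-scale paired data (MRIs and clinical reports).

Result: Evaluation across multiple vision-language and vision tasks shows substantial performance improvements (the abstract does not name specific benchmarks but implies gains over existing methods).

Insight: The contributions are: 1) a large-scale paired brain MRI-report dataset; 2) a multi-view pre-training framework inspired by document retrieval; 3) an implicit query-feature matching mechanism and quality-diversity concepts for producing multi-view embeddings aligned with clinical features. Viewed objectively, applying document retrieval and quality-diversity ideas to medical image-report alignment is a promising direction.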

Abstract: We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.


[79] A Study of Finetuning Video Transformers for Multi-view Geometry Tasks cs.CVPDF

Huimin Wu, Kwang-Ting Cheng, Stephen Lin, Zhirong Wu

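TL;DR: This paper studies applying vision transformers to multi-view geometry tasks such as optical flow estimation by fine-tuning video foundation models. It finds that general-purpose models pretrained on videos transfer to multi-view problems with minimal adaptation, because general-purpose attention between patches learns the temporal and spatial information needed for geometric reasoning. With a linear decoder followed by iterative refinement, the approach reaches state-of-the-art performance on several datasets.

Details

Motivation: Multi-view geometry tasks such as optical flow estimation usually require custom architectural designs and task-specific pretraining. This paper explores whether general-purpose video-pretrained models can transfer to these tasks through simple fine-tuning, simplifying the pipeline and improving generalization.

Result: For optical flow estimation, the method achieves end-point error (EPE) of 0.69, 1.78, and 3.15 on Sintel clean, Sintel final, and KITTI, respectively, a state-of-the-art level; on the online test benchmark it sets a new record with EPE of 0.79 and 1.88 and an F1 value of 3.79. It also performs strongly on 3D depth estimation and stereo matching.

Insight: The key finding is that the attention mechanism of general-purpose video-pretrained Transformers captures the temporal and spatial information needed for geometric reasoning, so a simple linear decoder plus iterative refinement achieves high performance without complex custom design, demonstrating the generalization and versatility of video foundation models on geometric vision tasks.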

Abstract: This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
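
For readers unfamiliar with the metric quoted above, end-point error is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below shows that standard definition on a flat list of `(u, v)` vectors; the function name and input layout are illustrative choices, and it does not cover the "F1 value" the abstract also reports.

```python
import math


def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    `pred` and `gt` are equal-length lists of (u, v) displacements in pixels.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred, gt)
    )
    return total / len(pred)


# One vector is off by (3, 4) pixels, the other is exact: (5 + 0) / 2 = 2.5
print(end_point_error([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

Lower is better, so an EPE of 0.69 means predictions are on average less than one pixel from the ground truth.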


[80] $M^3-Verse$: A “Spot the Difference” Challenge for Large Multimodal Models cs.CV | cs.AIPDF

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu

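TL;DR: The paper proposes the M^3-Verse benchmark for evaluating how well large multimodal models understand object state changes between two videos. The benchmark contains 270 scenes and 2,932 questions, organized into more than 50 subtasks that probe 4 core capabilities, and the paper also proposes a simple yet effective baseline that improves performance.

Details

Motivation: Existing large multimodal models excel at static image understanding, but their understanding of dynamic change remains underexplored, even though it is crucial to progress in spatial intelligence.

Result: Evaluating 16 state-of-the-art large multimodal models reveals their limitations in tracking state transitions; the proposed baseline significantly improves multi-state perception on the M^3-Verse benchmark.

Insight: The novelty is the multi-modal, multi-state, multi-dimensional M^3-Verse benchmark and a simple yet effective baseline, intended to catalyze next-generation models with a more holistic understanding of the dynamic visual world.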

Abstract: Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.


[81] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation cs.CVPDF

Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang

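TL;DR: This paper proposes the Memorize-and-Generate (MAG) framework to address the catastrophic forgetting and scene inconsistency that window attention causes when frame-level autoregressive models generate long videos. The framework decouples historical information compression from frame generation: a memory model compresses history into a compact KV cache, and a separate generator model synthesizes subsequent frames from this compressed representation, improving long-term consistency while keeping generation real-time.

Details

Motivation: Frame-level autoregressive models face a trade-off in long video generation: window attention discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency, whereas retaining the full history incurs prohibitive memory costs. This paper aims to resolve that trade-off.

Result: Extensive experiments show that MAG maintains competitive performance on standard video generation benchmarks while achieving superior historical scene consistency on MAG-Bench, a new benchmark that strictly evaluates historical memory retention.

Insight: The core idea is to decouple historical information compression (memorize) from frame generation (generate) into two separate tasks and to introduce a dedicated memory model for efficient KV cache compression. This offers real-time video generation systems a new architecture for balancing memory overhead against long-term consistency.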

Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.


[82] InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search cs.CV | cs.CL | cs.LGPDF

Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen

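TL;DR: This paper proposes InSight-o3, a framework for strengthening the visual reasoning of multimodal foundation models. It consists of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher); the latter performs generalized visual search, that is, locating relational, fuzzy, or conceptual regions in an image from free-form language descriptions. The vSearcher is trained with reinforcement learning and serves as a plug-and-play module for frontier multimodal models, significantly improving their performance on multiple benchmarks.

Details

Motivation: Current open multimodal agents still fall short in reasoning on real-world tasks that require multi-step reasoning over visual details, such as analyzing documents with dense charts/diagrams or navigating maps. The work aims to close this gap by improving the visual reasoning of AI agents.

Result: On the new O3-Bench benchmark, even a frontier system such as OpenAI o3 reaches only 40.8% accuracy. By integrating the vSearcher, InSight-o3 significantly improves the performance of several frontier multimodal models on a wide range of benchmarks, a step towards powerful open systems.

Insight: The main contribution is defining the new task of generalized visual search and presenting a multimodal LLM (vSearcher) purpose-trained for it via reinforcement learning. The search agent can interpret complex, fuzzy language descriptions to locate visual regions and can be plugged into existing visual reasoning models, and this decoupled, specialized-agent design is worth borrowing.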

Abstract: The ability for AI agents to “think with images” requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search – locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .


[83] IPCV: Information-Preserving Compression for MLLM Visual Encoders cs.CV | cs.AIPDF

Yuan Chen, Zichen Wen, Yuzhou Wu, Xuyang Liu, Shuang Chen

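TL;DR: This paper proposes IPCV, a training-free, information-preserving compression framework that reduces the computational overhead of the visual encoder in multimodal large language models (MLLMs). It prunes tokens aggressively inside the ViT via Neighbor-Guided Reconstruction (NGR), which temporarily reconstructs pruned tokens so they can participate in attention and then fully restores them before passing to the LLM; it also introduces Attention Stabilization (AS) to further reduce the negative effects of pruning.

Details

Motivation: Existing token pruning strategies fall short: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning lacks language guidance, may discard visual cues that are critical to the text, and introduces feature distortions that the ViT's bidirectional attention amplifies. IPCV aims to solve these problems by compressing the visual encoder efficiently while preserving key information.

Result: Extensive experiments on diverse image and video benchmarks show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods.

Insight: The novelty is pruning tokens inside the ViT while temporarily reconstructing them to preserve information (NGR), and stabilizing attention by approximating the K/V of pruned tokens (AS). The latter can be applied directly to earlier LLM-side token pruning methods to improve their performance, offering a modular way to boost performance.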

Abstract: Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT’s overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT’s bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at https://github.com/Perkzi/IPCV.


[84] Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos cs.CVPDF

Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Qingsong Fei

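TL;DR: This paper proposes the Context-Aware Network (CAN) for video action recognition, which uses multi-scale spatio-temporal attention to capture the multi-granularity spatio-temporal cues of actions comprehensively. CAN has two core modules: the Multi-scale Temporal Cue Module (MTCM), which extracts motion details and overall action flow at different temporal scales, and the Group Spatial Cue Module (GSCM), which extracts multi-scale spatial cues by grouping feature maps.

Details

Motivation: Existing action recognition methods often overlook the multi-granularity nature of actions and fail to capture spatio-temporal cues at different scales. This paper addresses that gap to make action recognition more robust.

Result: Experiments on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) show that CAN achieves competitive performance and outperforms most mainstream methods, with accuracies of 50.4%, 63.9%, 88.4%, 74.9%, and 86.9%, respectively.

Insight: The novelty is the explicit design of dedicated modules (MTCM and GSCM) for extracting multi-scale spatio-temporal cues, which emphasizes the multi-granularity nature of actions and integrates information across scales through grouping and attention, giving action recognition more comprehensive context modeling.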

Abstract: Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.


[85] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation cs.CVPDF

Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu

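TL;DR: This paper proposes MaskFocus, a novel reinforcement learning framework that achieves effective policy optimization for masked generative models by focusing on critical steps. It determines step-level information gain by measuring the similarity between the intermediate image at each sampling step and the final generated image, identifies the most critical steps, and performs focused policy optimization on them. It also designs an entropy-based dynamic routing sampling mechanism that encourages the model to explore more valuable masking strategies for low-entropy samples. Extensive experiments on multiple text-to-image benchmarks validate the method.

Details

Motivation: Reinforcement learning has shown great potential for post-training language models and autoregressive visual generative models, but adapting it to masked generative models remains challenging. Policy optimization requires the likelihood of every step, which depends on the entire sampling trajectory and is computationally expensive, while naively optimizing random steps usually works poorly.

Result: Extensive experiments on multiple text-to-image benchmarks validate the effectiveness of MaskFocus, indicating that it improves the performance of masked generative models.

Insight: The novelty is identifying critical steps through step-level information gain for focused policy optimization, and introducing an entropy-based dynamic routing sampling mechanism to explore more valuable masking strategies, giving an efficient and effective way to train masked generative models with reinforcement learning.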

Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core difficulty is that policy optimization must account for the likelihood of each step, because generation is a multi-step, iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
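
One possible reading of the critical-step selection described above is sketched below. It assumes that each intermediate image is available as a flat feature vector, that similarity is cosine similarity, and that a step's information gain is the increase in similarity to the final image over the previous step. None of these choices is confirmed by the abstract, and `critical_steps` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def critical_steps(intermediates, final, k=2):
    """Return indices of the k steps with the largest similarity gain, plus all gains."""
    sims = [cosine(x, final) for x in intermediates]
    # Assumption: the gain of step 0 is measured against a baseline similarity of 0.
    gains = [sims[0]] + [sims[i] - sims[i - 1] for i in range(1, len(sims))]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return sorted(ranked[:k]), gains


# Toy trajectory: the second step moves the sample most of the way to the final image.
final = [1.0, 0.0]
trajectory = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
steps, gains = critical_steps(trajectory, final, k=1)
```

Restricting the policy update to the selected indices is what would avoid computing likelihoods over the whole sampling trajectory.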


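[86] In-Context Audio Control of Video Diffusion Transformers cs.CV

Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang

TL;DR: This paper proposes ICAC, a framework that integrates audio signals into video diffusion Transformers for speech-driven video generation. It explores three mechanisms for injecting audio conditions and, to address the training difficulties of 3D self-attention, proposes a Masked 3D Attention mechanism that enforces temporal alignment, yielding stable training and superior performance.

Details

Motivation: Current unified Transformer-based video generation foundation models focus mainly on multimodal inputs such as text, images, and depth maps, while strictly time-synchronous audio signals remain underexplored. The paper studies how to integrate audio into a unified full-attention architecture to generate high-quality video synchronized with the audio.

Result: Experiments show that the proposed Masked 3D Attention achieves strong lip synchronization and video quality when conditioned on an audio stream and reference images.

Insight: The contribution is a systematic exploration of audio-condition injection mechanisms, together with Masked 3D Attention, which enforces temporal alignment to overcome the training difficulty of 3D self-attention. This offers a new approach to handling time-synchronous multimodal signals in a unified Transformer architecture.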

Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
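
One way to picture an attention pattern constrained for temporal alignment is a boolean mask over the joint video and audio token sequence. The sketch below is an assumption about what such a mask could look like, not the mask used in ICAC: it leaves same-modality attention unrestricted and allows cross-modal attention only between tokens whose frame indices differ by at most a hypothetical `window`.

```python
def build_masked_3d_attention(num_frames, video_per_frame, audio_per_frame, window=0):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Token order (assumed): all video tokens frame by frame, then all audio tokens.
    """
    # Frame index and modality of every token in the joint sequence.
    frame = [f for f in range(num_frames) for _ in range(video_per_frame)]
    is_audio = [False] * len(frame)
    frame += [f for f in range(num_frames) for _ in range(audio_per_frame)]
    is_audio += [True] * (num_frames * audio_per_frame)

    n = len(frame)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            cross_modal = is_audio[q] != is_audio[k]
            # Block video<->audio pairs that are too far apart in time.
            if cross_modal and abs(frame[q] - frame[k]) > window:
                mask[q][k] = False
    return mask

# 3 frames, 2 video tokens and 1 audio token per frame -> 9 tokens in total.
strict = build_masked_3d_attention(3, 2, 1, window=0)
relaxed = build_masked_3d_attention(3, 2, 1, window=1)
```

With `window=0` a video token sees only the audio token of its own frame; widening the window trades strict alignment for more temporal context.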


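[87] FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation cs.CV | cs.AI | cs.MM

Ziyuan Tao, Chuanzhi Xu, Sandaru Jayawardana, Wei Bao, Kanchana Thilakarathna

TL;DR: This paper proposes FedVideoMAE, an efficient privacy-preserving federated learning framework for video violence detection. It combines self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection to address the privacy risks, high bandwidth cost, and inference latency of cloud-based video moderation.

Details

Motivation: The growth of short-form video platforms creates a need for privacy-preserving moderation that avoids the privacy leakage, bandwidth overhead, and latency incurred when raw videos are processed in the cloud.

Result: On RWF-2000 with 40 clients, accuracy reaches 77.25% without privacy protection and stays at 65-66% under strong differential privacy, while communication cost is 28.3× lower than in full-model federated learning.

Insight: The contributions are: 1) combining self-supervised VideoMAE video representations with federated learning; 2) LoRA-based parameter-efficient fine-tuning, with trainable parameters amounting to only about 3.5% of the backbone; 3) integrating DP-SGD with configurable privacy budgets and secure aggregation for defense-in-depth privacy protection.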

Abstract: The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by 28.3× compared to full-model federated learning. The code is available at: https://github.com/zyt-599/FedVideoMAE
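
The privacy mechanism can be illustrated with the clip-then-add-Gaussian-noise step that underlies DP-SGD. The sketch below applies it to whole client updates before averaging, which is a simplification chosen for brevity: where FedVideoMAE actually clips and adds noise, and how it calibrates the privacy budget, is not stated in the abstract. The helper names are hypothetical. The last line only re-derives the quoted trainable fraction from the two parameter counts in the abstract.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale an update vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def private_average(updates, clip_norm, noise_multiplier, rng):
    """Clip each update, sum, add Gaussian noise, and average.

    Noise standard deviation is noise_multiplier * clip_norm, the usual
    DP-SGD parameterization; no privacy accounting is done here.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    sigma = noise_multiplier * clip_norm
    dim = len(updates[0])
    noisy_sum = [sum(u[i] for u in clipped) + rng.gauss(0.0, sigma) for i in range(dim)]
    return [s / len(updates) for s in noisy_sum]

rng = random.Random(0)
client_updates = [[3.0, 4.0], [0.0, 0.5]]  # toy adapter updates from two clients
noiseless = private_average(client_updates, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_average(client_updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)

# Sanity check on the abstract's figures: 5.5M trainable out of a 156M backbone.
trainable_fraction = 5.5e6 / 156e6  # about 0.035, i.e. roughly 3.5%
```

Because only the small adapter update is transmitted, both the clipping cost and the upload size scale with the 5.5M trainable parameters rather than the full backbone.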


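[88] Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction cs.CV

Guangtao Lyu, Xinyi Cheng, Chenghao Xu, Qi Liu, Muli Yang

TL;DR: This paper systematically analyzes how visual perception and token generation evolve inside Large Vision-Language Models (LVLMs). It finds that perception follows a three-stage GATE process (global scan, focusing on core content, exploring supplementary regions), while generation follows an SAD pattern in which accumulated subdominant tokens lead to hallucinations. Based on these findings, the authors propose the Validated Dominance Correction (VDC) strategy, which detects unsupported hallucinated tokens and replaces them to improve output reliability.

Details

Motivation: Hallucinations remain a persistent problem in LVLMs. The paper seeks to understand their root cause through an in-depth analysis of the model's internal perception and generation mechanisms.

Result: Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.

Insight: The contribution is the discovery of the GATE process in perception and the SAD pattern in generation, and the design of VDC, a lightweight intervention based on them that corrects hallucinations without retraining. This gives a new perspective for understanding and mitigating model hallucinations.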

Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.


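[89] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer cs.CV

Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang

TL;DR: This paper proposes EchoMotion, a framework that uses a dual-modality diffusion Transformer to model the joint distribution of appearance and human motion, improving the quality of complex human action video generation. It extends the DiT architecture with a dual-branch design that processes concatenated video and motion tokens, introduces MVS-RoPE as a unified positional encoding that promotes temporal alignment, and uses a two-stage training strategy to support both joint generation and cross-modal conditional generation. To support training, the authors build HuMoVe, a large-scale dataset of about 80,000 high-quality video-motion pairs.

Details

Motivation: Existing video generation models rely only on pixel-level training objectives and therefore struggle to synthesize complex human motion; their bias toward appearance fidelity limits the learning of kinematic principles. The paper addresses this by jointly modeling the distribution of appearance and human motion.

Result: Experiments show that explicitly representing human motion is complementary to appearance modeling and significantly improves the coherence and plausibility of human-centric video generation.

Insight: The contributions are: 1) a dual-modality diffusion Transformer that jointly processes video and motion; 2) MVS-RoPE, a unified positional encoding that builds in an inductive bias for temporal alignment; 3) a two-stage training strategy supporting joint and cross-modal generation; 4) the large-scale paired video-motion dataset HuMoVe. Viewed objectively, treating motion as an explicit modality and modeling it jointly with video gives human action generation a more structured representation, which may ease the difficulty of generating high-degree-of-freedom articulated motion.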

Abstract: Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.


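[90] VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference cs.CV | cs.HC

Sicheng Song, Yanjie Zhang, Zixin Chen, Huamin Qu, Changbo Wang

TL;DR: This paper presents VizDefender, a framework for detecting and analyzing tampering in data visualization images. It has two core components: a semi-fragile watermark module that precisely localizes tampered regions, and an intent analysis module that uses Multimodal Large Language Models (MLLMs) to interpret the manipulation and infer the attacker's intent and the misleading effect.

Details

Motivation: The integrity of data visualizations is threatened by image editing techniques that enable subtle yet deceptive tampering. The paper addresses this challenge and groups tampering techniques into two main types: data manipulation and visual encoding manipulation.

Result: Extensive evaluations and user studies show that the method is effective at detecting and analyzing visualization tampering. The abstract does not report specific benchmarks or quantitative comparisons with existing methods.

Insight: The contribution is the combination of proactive localization (via semi-fragile watermarking) with intent inference (via MLLMs): the framework not only detects tampering but also analyzes its likely intent and misleading impact, giving a more complete solution for visualization security.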

Abstract: The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker’s intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.


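[91] Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding cs.CV | cs.CL

Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu

TL;DR: This paper proposes Hallucination Disentangled Decoding (HDD), a training-free method for mitigating the severe hallucinations that Large Vision-Language Models (LVLMs) produce in object recognition tasks. It segments and augments the input image and uses a blank image to remove language-prior hallucinations, improving the model's visual performance.

Details

Motivation: LVLMs hallucinate severely in object recognition tasks, producing fluent text that does not match the visual content, which can have serious consequences in real applications. Most existing methods only reduce hallucinations in the language modality; this paper aims to mitigate hallucinations in both the language and visual modalities.

Result: The abstract reports no specific quantitative results or benchmarks, but claims that HDD reduces the model's dependence on language priors and enhances its visual performance.

Insight: The contribution is a training-free disentangled decoding method that combines image segmentation and augmentation with the use of a blank image to mitigate hallucinations from both the visual and the language side, offering a new way to improve LVLM reliability.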

Abstract: Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce the Hallucination Disentangled Decoding (HDD) method, which requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model’s dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
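
The role of the blank image can be illustrated with a contrastive adjustment of next-token logits. The formula below, `(1 + alpha) * logits_image - alpha * logits_blank`, is a common contrastive-decoding form and is an assumption here; the abstract does not give HDD's actual combination rule or how the segmented images enter it. The idea it shows is that a token the model would predict even from a blank image is being driven by the language prior and gets down-weighted.

```python
def contrastive_logits(logits_image, logits_blank, alpha=1.0):
    """Down-weight tokens that are equally likely without any visual evidence.

    alpha (hypothetical parameter) controls how strongly the blank-image
    logits are subtracted.
    """
    return [(1.0 + alpha) * li - alpha * lb
            for li, lb in zip(logits_image, logits_blank)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Toy vocabulary of 3 tokens. Token 0 scores high even for a blank image,
# so it is likely a language-prior guess; token 1 depends on the real image.
logits_image = [2.0, 1.5, 0.0]
logits_blank = [2.0, 0.0, 0.0]

adjusted = contrastive_logits(logits_image, logits_blank, alpha=1.0)
choice_before = argmax(logits_image)
choice_after = argmax(adjusted)
```

In this toy case the greedy choice moves from token 0 to token 1 once the blank-image prior is subtracted.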


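[92] CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis cs.CV | cs.AI

Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin

TL;DR: This paper presents CrashChat, a multimodal large language model built on VideoLLaMA3 for multitask traffic crash video analysis. The model acquires domain knowledge through instruction fine-tuning and uses a novel multitask learning strategy based on task decoupling and grouping to handle crash recognition, temporal grounding, and high-level video understanding in one framework. Experiments on public datasets show that it outperforms existing MLLMs and traditional vision-based methods across the board, reaching state-of-the-art performance.

Details

Motivation: Automated crash video analysis is essential for using the growing volume of driving video data in traffic safety research and in accountability attribution for autonomous driving. Existing models cannot perform all the relevant tasks within a unified framework, and effective training strategies for such models remain underexplored.

Result: On consolidated public datasets, CrashChat consistently outperforms existing MLLMs and traditional vision-based methods across model scales, reaching state-of-the-art performance. It achieves near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared with general MLLMs, it substantially improves textual accuracy and content coverage in crash description and reasoning tasks, with BLEU gains of 0.18-0.41 and ROUGE gains of 0.18-0.42.

Insight: The claimed contribution is a unified MLLM framework dedicated to multitask traffic crash analysis, together with a multitask learning strategy based on task decoupling and grouping that maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Viewed objectively, combining an MLLM with domain-specific instruction fine-tuning and a carefully designed multitask strategy is an effective and generalizable paradigm for domain-specific video analysis with complex spatiotemporal dynamics.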

Abstract: Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.


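[93] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs cs.CV

Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer

TL;DR: This paper proposes FiNDR, a vocabulary-free fine-grained image recognition framework that uses reasoning-capable large multi-modal models (LMMs). It identifies visually similar categories through three automated steps (generating candidate labels, filtering and ranking them, and instantiating a classifier) without relying on a predefined human label set.

Details

Motivation: Existing vocabulary-free fine-grained recognition methods are limited either by a fixed vocabulary or by complex, error-prone multi-stage pipelines. The paper explores reasoning-capable LMMs as a more principled and effective alternative.

Result: The method achieves state-of-the-art performance on popular fine-grained classification benchmarks, with a relative improvement of up to 18.8% over previous approaches in the vocabulary-free setting, and even surpasses zero-shot baselines that use predefined ground-truth names.

Insight: The contribution is FiNDR, the first automated framework based on reasoning-augmented LMMs, whose core idea is to use the LMM's reasoning ability to generate and verify category names. The results challenge the assumption that human-curated vocabularies define an upper bound on performance, and show that open-source LMMs with carefully designed prompts can match proprietary models, laying a foundation for scalable, fully automated, open-world fine-grained visual recognition.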

Abstract: Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.


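[94] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models cs.CV

Mohamad Zamini, Diksha Shukla

TL;DR: The paper proposes Delta-LLaVA, a token-efficient visual projector that uses a base alignment followed by specialization layers to reduce the cost of processing visual tokens in multimodal large language models, improving computational efficiency and scalability.

Details

Motivation: Processing dense visual tokens is a major computational bottleneck in multimodal large language models; standard projector designs scale poorly with high-resolution inputs and introduce redundancy.

Result: The method yields consistent gains across multiple benchmarks with only 144 tokens. Inference throughput improves by up to 55%, and end-to-end training is accelerated by nearly 4-5x in pretraining and over 1.5x in finetuning.

Insight: The contribution is the base-then-specialize design: a low-rank DeltaProjection aligns vision features into a compact subspace, and lightweight Transformer blocks then capture global and local structure. The results highlight the importance of token formation before scaling interaction capacity.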

Abstract: Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
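
The two ingredients named in the abstract, a low-rank projection and a small token budget, can be sketched in a few lines. Everything concrete below is an assumption for illustration: the factorization into `down` and `up` matrices, the 2×2 average pooling used to go from a 24×24 patch grid to 144 tokens, and the dimensions passed to `param_counts` are not taken from the paper.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (n x d) and w (d x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def low_rank_project(x, down, up):
    """Project tokens through a rank-r bottleneck: x @ down @ up."""
    return matmul(matmul(x, down), up)

def param_counts(d_in, d_out, rank):
    """Parameters of a dense projection versus its rank-r factorization."""
    return d_in * d_out, rank * (d_in + d_out)

def pool_grid_tokens(tokens, grid, factor):
    """Average-pool a grid x grid token map (row-major list) by factor x factor."""
    out_grid = grid // factor
    dim = len(tokens[0])
    pooled = []
    for r in range(out_grid):
        for c in range(out_grid):
            cell = [tokens[(r * factor + i) * grid + (c * factor + j)]
                    for i in range(factor) for j in range(factor)]
            pooled.append([sum(t[d] for t in cell) / len(cell) for d in range(dim)])
    return pooled

# A 24 x 24 patch grid (576 tokens) pooled 2 x 2 gives 12 x 12 = 144 tokens.
patch_tokens = [[float(i)] for i in range(24 * 24)]
compact = pool_grid_tokens(patch_tokens, grid=24, factor=2)

# Hypothetical sizes: 1024-d vision features, 4096-d LLM embeddings, rank 64.
dense_params, low_rank_params = param_counts(1024, 4096, 64)
```

With these illustrative sizes the factorized projection needs 327,680 parameters instead of 4,194,304, and the language model attends over a quarter as many visual tokens.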


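[95] Point What You Mean: Visually Grounded Instruction Policy cs.CV | cs.RO

Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma

TL;DR: This paper proposes Point-VLA, a plug-and-play vision-language-action policy that adds explicit visual cues (such as bounding boxes) to language instructions to resolve ambiguity in object reference and achieve precise object-level visual grounding.

Details

Motivation: When existing vision-language-action models rely on text prompts alone, their ability to refer to objects is limited in cluttered or out-of-distribution scenes. Resolving this referential ambiguity is necessary for more precise embodied control.

Result: In evaluations on diverse real-world referring tasks, Point-VLA consistently outperforms VLA models driven by text-only instructions in cluttered or unseen-object scenarios and shows robust generalization.

Insight: The contribution is augmenting language instructions with pixel-level visual grounding to resolve referential ambiguity, together with an automatic data annotation pipeline that scales visually grounded datasets with minimal human effort.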

Abstract: Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
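
A bounding-box cue of the kind mentioned in the abstract can be added to an observation by drawing the box outline directly onto the image. The sketch below assumes this overlay form; how Point-VLA actually encodes the cue (overlay, extra channel, or coordinates in the prompt) is not stated in the abstract, and `overlay_box` is a hypothetical helper.

```python
def overlay_box(image, box, color=(255, 0, 0), thickness=1):
    """Return a copy of image with the outline of box drawn in color.

    image: list of rows, each a list of (r, g, b) tuples.
    box: (x0, y0, x1, y1) with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]  # copy so the input stays untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            on_border = (y - y0 < thickness or y1 - 1 - y < thickness or
                         x - x0 < thickness or x1 - 1 - x < thickness)
            if on_border:
                out[y][x] = color
    return out

# 10 x 10 black frame with a red box marking the referred object.
frame = [[(0, 0, 0)] * 10 for _ in range(10)]
cued = overlay_box(frame, box=(2, 2, 8, 8))
instruction = "pick up the object in the red box"
```

The instruction can then refer to the marked region instead of describing the object in words, which is what removes the ambiguity in cluttered scenes.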


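[96] VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion cs.CV

Zaidao Han, Risa Higashita, Jiang Liu

TL;DR: This paper proposes VOIC, a monocular 3D semantic scene completion method that explicitly decouples visible-region perception from occluded-region reasoning to address the feature dilution and error propagation seen in existing methods. It introduces an offline visible-region label extraction strategy and builds a dual-decoder network, achieving state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI360 benchmarks.

Details

Motivation: Existing camera-based 3D semantic scene completion methods typically perform end-to-end 2D-to-3D feature lifting and voxel completion, but overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning that arises from single-image input. This interference leads to feature dilution and error propagation.

Result: Extensive experiments on SemanticKITTI and SSCBench-KITTI360 show that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, reaching state-of-the-art performance.

Insight: The core contribution is explicitly decoupling SSC into two complementary sub-tasks, visible-region semantic perception and occluded-region scene reasoning, purifying the supervisory space with an offline visible-region label extraction strategy, and designing a dual-decoder framework that uses cross-modal interaction for global scene reasoning. This decoupled design improves the robustness and accuracy of reasoning about occluded regions.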

Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.


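[97] ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation cs.CV | cs.AI

Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin

TL;DR: The paper proposes ICP-4D, a simple and effective training-free framework for 4D LiDAR panoptic segmentation. Using the Iterative Closest Point (ICP) algorithm and Sinkhorn soft matching, it unifies spatial and temporal reasoning through geometric relations among instance-level point sets, improving computational efficiency and robustness.

Details

Motivation: Existing 4D LiDAR panoptic segmentation methods usually train deep neural networks on large point clouds or design dedicated instance association modules. They are computationally expensive and ignore the geometric priors in raw point clouds.

Result: Experiments on SemanticKITTI and panoptic nuScenes show that the method consistently outperforms state-of-the-art approaches without additional training or extra point cloud inputs.

Insight: The contributions are using ICP to directly associate temporally consistent instances, introducing Sinkhorn soft matching to stabilize association, and designing a pipeline that accounts for static, dynamic, and missing instances, which together provide computational efficiency and occlusion-aware matching.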

Abstract: Dominant paradigms for 4D LiDAR panoptic segmentation usually need to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types (static, dynamic, and missing), offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
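
The Sinkhorn-based soft matching can be sketched as entropy-regularized assignment between two small point sets, followed by a motion estimate from the soft correspondences. This is a simplified illustration under stated assumptions: the cost is squared Euclidean distance, the marginals are uniform, and the alignment step estimates a translation only, whereas a full ICP step would also solve for rotation. The function names and the `eps` value are hypothetical.

```python
import math

def sinkhorn(cost, eps=0.05, iters=50):
    """Soft assignment matrix from a cost matrix via Sinkhorn normalization."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns toward unit sums.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def soft_translation(src, dst, eps=0.05):
    """Estimate the translation src -> dst from soft correspondences."""
    plan = sinkhorn([[sq_dist(p, q) for q in dst] for p in src], eps)
    total = sum(sum(row) for row in plan)
    dim = len(src[0])
    shift = [sum(plan[i][j] * (dst[j][d] - src[i][d])
                 for i in range(len(src)) for j in range(len(dst))) / total
             for d in range(dim)]
    return shift, plan

# One instance seen in two consecutive scans, moved 0.1 along x.
src = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dst = [[p[0] + 0.1, p[1], p[2]] for p in src]
shift, plan = soft_translation(src, dst)
matches = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

Compared with hard nearest-neighbor matching, the soft plan spreads weight over plausible partners, so a few noisy points perturb the estimate less.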


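[98] Towards AI-Guided Open-World Ecological Taxonomic Classification cs.CV

Cheng Yaw Low, Heejoon Koo, Jaewoo Park, Kaleb Mesfin Asfaw, Meeyoung Cha

TL;DR: This paper introduces Open-World Ecological Taxonomy Classification, a unified framework for the challenges of ecological classification: long-tailed distributions, fine-grained variation, spatiotemporal domain shift, and closed-set assumptions. The authors design TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens the learning signal from rare taxa and suppresses the dominance of overrepresented ones. Evaluated on several ecological datasets, including Google Auto-Arborist, iNat-Plantae, and NAFlora-Mini, the method outperforms baselines, especially on rare taxa, and provides a solid foundation for open-world plant taxonomic monitoring.

Details

Motivation: Ecological classification underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Current methods are held back by long-tailed distributions, fine-grained variation, test-time spatiotemporal domain shifts, and closed-set assumptions that only recognize known taxa, all of which hinder progress in AI-guided classification.

Result: On datasets from several ecological domains, namely Google Auto-Arborist (urban trees), iNat-Plantae (plant observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection), TaxoNet consistently outperforms baselines, particularly on rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring.

Insight: The contribution is a unified open-world ecological classification framework and the TaxoNet model, whose dual-margin penalization loss handles long-tailed distributions and class imbalance and directly confronts the interrelated challenges of ecological classification. Viewed objectively, the work stresses the importance of addressing these challenges jointly in realistic ecological settings and reveals the limitations of general-purpose multimodal foundation models in plant-domain applications, which informs the development of domain-specific models.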

Abstract: AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.


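[99] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization cs.CV | cs.LG

Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang

TL;DR: CETCAM is a camera-controllable video generation framework that needs no camera pose annotations, relying instead on a consistent and extensible tokenization scheme. It uses geometry foundation models such as VGGT to estimate depth and camera parameters, converts them into unified geometry-aware tokens, and integrates these into a pretrained video diffusion model through lightweight context blocks. Training proceeds in two progressive stages: the model first learns robust camera controllability from diverse raw video data, then refines visual quality on high-quality datasets. Experiments show that CETCAM achieves state-of-the-art geometric consistency, temporal stability, and visual realism on multiple benchmarks and extends flexibly to other modalities such as inpainting and layout control.

Details

Motivation: Existing video generation methods depend on camera pose annotations that are hard to obtain at scale and are often inconsistent with depth estimation, causing train-test discrepancies and limiting precise camera control.

Result: On multiple benchmarks, CETCAM achieves state-of-the-art geometric consistency, temporal stability, and visual realism.

Insight: The contributions are: estimating parameters automatically with a geometry foundation model and converting them into unified tokens, which removes the dependence on annotations; a two-stage progressive training strategy that balances controllability and visual quality; and a lightweight integration design that keeps the pretrained backbone intact and can be extended to additional control modalities.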

Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.


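[100] VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation cs.CV | cs.RO

Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang

TL;DR: This paper introduces VLNVerse, a large-scale, extensible benchmark for Vision-Language Navigation (VLN). It addresses the limitations of existing benchmarks in scale, physical simulation realism, and task unification, and redefines VLN as a scalable, full-stack embodied AI problem by providing versatile, embodied, realistic simulation and evaluation.

Details

Motivation: Existing VLN benchmarks are confined to fixed, small-scale datasets with naive physical simulation, which limits the insight they give into sim-to-real generalization and leaves a research gap. Task fragmentation also prevents unified progress in the field, and limited data scale cannot meet the demands of modern LLM-based pretraining.

Result: Using the scale and diversity of VLNVerse, the paper conducts a comprehensive evaluation of existing methods, from classic models to MLLM-based agents, and proposes a novel unified multi-task model that can handle all tasks in the benchmark.

Insight: The contribution is redefining VLN as a scalable, full-stack embodied AI problem, unifying previously fragmented tasks in a single framework, and providing an extensible research toolkit. Its embodied design moves beyond intangible "ghost" agents and supports full kinematics in a realistic simulation backed by a robust physics engine, with the aim of narrowing the gap between simulated navigation and real-world generalization.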

Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting “ghost” agents to support full kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.


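[101] Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection cs.CV

Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan

TL;DR: This paper proposes SVLP-IL, a rehearsal-free incremental learning framework for face presentation attack detection built on vision-language pre-trained models. It balances stability and plasticity through Multi-Aspect Prompting and Selective Elastic Weight Consolidation, adapting to evolving attacks and domains while complying with privacy regulations by retaining no historical data.

Details

Motivation: Face presentation attack detection needs incremental learning to keep up with changing spoofing tactics and domains, but privacy regulations forbid retaining past data, so a rehearsal-free incremental learning method is required.

Result: Comprehensive experiments on multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and improves performance on unseen domains.

Insight: The contribution is exploiting the prompt-tunable cross-modal representations of vision-language pre-trained models: Multi-Aspect Prompting isolates domain dependencies and increases sensitivity to distribution shift, and Selective Elastic Weight Consolidation selectively preserves critical weights, giving stable yet flexible incremental learning in the rehearsal-free setting.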

Abstract: Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose SVLP-IL, a VLP-based RF-IL framework that balances stability and plasticity via Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
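
The idea behind a selective form of Elastic Weight Consolidation can be sketched as the standard EWC quadratic penalty applied only to the most important weights. The selection rule below (keep the top fraction of weights ranked by Fisher information) and the parameter names `keep_ratio` and `lam` are assumptions; the abstract says only that SEWC "selectively preserves critical weights".

```python
def selective_ewc_penalty(theta, theta_old, fisher, keep_ratio=0.5, lam=1.0):
    """EWC penalty restricted to the weights with the highest Fisher values.

    theta: current weights; theta_old: weights after the previous task;
    fisher: per-weight importance estimates (diagonal Fisher information).
    """
    k = max(1, round(keep_ratio * len(fisher)))
    threshold = sorted(fisher, reverse=True)[k - 1]
    penalty = 0.0
    for t, t_old, f in zip(theta, theta_old, fisher):
        if f >= threshold:  # only "critical" weights are anchored
            penalty += f * (t - t_old) ** 2
    return 0.5 * lam * penalty

theta_old = [0.0, 0.0, 0.0, 0.0]
theta_new = [1.0, 1.0, 1.0, 1.0]
fisher = [10.0, 0.1, 5.0, 0.01]

# Only weights 0 and 2 are anchored: 0.5 * (10 + 5) = 7.5.
penalty = selective_ewc_penalty(theta_new, theta_old, fisher, keep_ratio=0.5)
```

Weights outside the selected set incur no penalty, which is where the plasticity for new spoofing styles would come from.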


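[102] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency cs.CV

Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang

TL;DR: WaTeRFlow is a robust watermarking framework designed for image-to-video (I2V) conversion. It introduces a flow-guided unified synthesis engine, optical-flow warping with a temporal consistency loss, and a semantic preservation loss to address the weakening of per-frame watermark detection during I2V conversion, which makes watermarks easy to bypass.

Details

Motivation: Existing deep learning-based watermarking methods lose robustness under advanced image-to-video conversion because per-frame watermark detection degrades. Closing this gap matters for content authentication, provenance tracing, and world-modeling and simulation workflows.

Result: Experiments on several representative I2V models show that the method accurately recovers watermarks from video frames, with higher first-frame and per-frame bit accuracy and stronger robustness when various distortions are applied before or after video generation.

Insight: The contributions are: 1) a flow-guided unified synthesis engine that simulates realistic distortions during training; 2) optical-flow warping and a temporal consistency loss that stabilize per-frame predictions; 3) a semantic preservation loss that maintains the conditioning signal. Viewed objectively, combining temporal consistency constraints with watermark training offers a new approach to cross-modal watermark recovery.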

Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
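
The first-frame and per-frame bit accuracy mentioned above can be illustrated with a small sketch. The message, the decoder outputs, and the simple averaging over frames are made-up assumptions; the paper's exact evaluation protocol may differ.

```python
def bit_accuracy(decoded, embedded):
    """Fraction of decoded watermark bits that match the embedded message."""
    matches = sum(d == e for d, e in zip(decoded, embedded))
    return matches / len(embedded)


message = [1, 0, 1, 1, 0, 0, 1, 0]   # embedded watermark (toy 8-bit message)
frames = [                            # hypothetical per-frame decoder outputs
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
]

per_frame = [bit_accuracy(f, message) for f in frames]
first_frame_acc = per_frame[0]
mean_per_frame_acc = sum(per_frame) / len(per_frame)
```

The toy frames lose accuracy as the video moves away from the watermarked first frame, which is the weakening of per-frame detection that the abstract refers to.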


[103] Decoupled Generative Modeling for Human-Object Interaction Synthesis cs.CV

Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam

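TL;DR: This paper proposes a decoupled generative modeling approach (DecHOI) for synthesizing realistic human-object interaction (HOI) motion. It separates path planning from action synthesis: a trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator then synthesizes detailed motions conditioned on those trajectories. Adversarial training improves contact realism, and the method supports long-sequence planning in dynamic scenes.

Details

Motivation: Existing methods usually require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. The paper aims to address these problems and improve the realism and efficiency of HOI synthesis.

Result: On the two benchmarks FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies also favor its results.

Insight: The novelty lies in decoupling HOI synthesis into trajectory generation and action generation, which removes the need for manually specified waypoints; in adversarial training with a discriminator focused on distal-joint dynamics, which improves contact realism; and in support for consistent long-sequence planning in dynamic scenes, which makes the method more flexible and practical.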

Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.


[104] 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation cs.CV

Jihui Guo, Zongmin Zhang, Zhen Sun, Yuhao Yang, Jinlin Wu

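TL;DR: This paper proposes 6DAttack, a 3D backdoor attack framework targeting six-degree-of-freedom (6DoF) object pose estimation. By using 3D objects as triggers, the framework induces the model to output controlled erroneous poses when the trigger is present, while maintaining normal predictions on clean samples.

Details

Motivation: The aim is to expose a largely unexplored security threat in the critical task of 6DoF pose estimation. Existing backdoor attack research focuses mainly on 2D vision tasks such as classification; attacks on 6DoF tasks, which must control continuous translation and rotation parameters, remain unexplored, and 2D methods do not transfer directly.

Result: PVNet, DenseFusion, and PoseDiffusion are evaluated on the LINEMOD, YCB-Video, and CO3D datasets. Attack success rates (ASR) are high, reaching up to 100%, without compromising clean performance (clean ADD accuracy of up to 100%). Triggered samples reach 97.70% ADD-P. In addition, a representative defense is shown to be ineffective.

Insight: The novelty lies in being the first to systematically bring backdoor attacks to 6DoF pose estimation, a continuous-output task, and in designing an attack framework that uses 3D objects as triggers. The core insight is that attacks on a continuous parameter space require methods different from those for discrete classification; the work also demonstrates the severity of such attacks and the inadequacy of existing defenses.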

Abstract: Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.
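
As background for the ADD accuracy figures, here is a minimal sketch of the standard ADD metric: the mean distance between object model points transformed by the ground-truth pose and by the predicted pose. The toy model points, the poses, and the 10%-of-diameter threshold are illustrative assumptions; ADD-P is specific to the paper and is not reproduced here.

```python
import math


def transform(R, t, p):
    """Apply the rigid transform R @ p + t to a 3D point."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_metric(R_gt, t_gt, R_pred, t_pred, points):
    """Mean distance between model points under the true and predicted poses."""
    total = 0.0
    for p in points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
    return total / len(points)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]  # toy model, metres

# Prediction that is off by a pure 3 cm translation along x.
add = add_metric(identity, [0.0, 0.0, 0.5], identity, [0.03, 0.0, 0.5], points)
diameter = 0.1 * math.sqrt(2)        # largest pairwise distance in the toy model
is_correct = add < 0.1 * diameter    # commonly used 10%-of-diameter threshold
```

Because the metric is continuous in both rotation and translation, a backdoor has to steer the whole pose rather than flip a class label, which is the distinction the abstract draws from 2D attacks.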


[105] Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges cs.CV | cs.LG

Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns

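TL;DR: This paper introduces RankInsight, an open-source toolkit that addresses three main problems with medical AI challenge leaderboards: the lack of statistical significance testing, the use of a single averaged metric that hides clinically important errors, and the neglect of intersectional demographic fairness.

Details

Motivation: Medical AI challenges are the de facto standard for comparing methods, but they have three limitations: ranking gaps are not tested for statistical significance, so rank stability is unknown; a single averaged metric is applied to every organ, hiding clinically important boundary errors; and performance across intersecting demographics is rarely reported, masking fairness gaps.

Result: Experiments with RankInsight show that the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; that using organ-appropriate metrics (replacing Dice with NSD for tubular structures) reverses the order of the top four models; and that, on the proprietary Johns Hopkins Hospital dataset, more than half of the MONAI-based entries show the largest gender-race discrepancy.

Insight: The novelty is a comprehensive toolkit that combines statistical significance testing, organ-appropriate metrics, and intersectional fairness auditing to make medical AI challenge rankings more reliable, clinically relevant, and fair; it can be applied to existing and future challenges.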

Abstract: Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
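
To make the idea of testing a score gap for significance concrete, the sketch below runs an exact paired sign-flip permutation test on per-case scores. The abstract does not say which test RankInsight uses, so the choice of test, the function name, and the scores are all assumptions for illustration.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip test on the summed per-case score difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Made-up per-case Dice scores for two hypothetical submissions.
model_a = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94]
model_b = [0.89, 0.85, 0.90, 0.88, 0.86, 0.90, 0.86, 0.91]

p_value = paired_sign_flip_pvalue(model_a, model_b)
```

Repeating such a test for every pair of submissions gives a pair-wise significance map of the kind the abstract describes; with many pairs, a multiple-comparison correction would normally be applied as well.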


[106] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval? cs.CV

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

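TL;DR: This paper studies the paradox that multimodal large language models (MLLMs) excel at generative tasks yet perform poorly at zero-shot multimodal retrieval. By decomposing model output representations with sparse autoencoders, the authors find that the MLLM representation space is dominated by textual semantics, with visual information making up only a small portion, and that the models' heavy focus on cross-modal alignment homogenizes embeddings and weakens the discriminative power retrieval requires.

Details

Motivation: To identify the root causes of why multimodal large language models succeed at generative tasks yet perform poorly at zero-shot multimodal retrieval.

Result: The analysis shows that the MLLM representation space is heavily biased toward textual semantics, that visual information makes up only a small portion, and that the feature components contributing most to similarity computation are in fact distractors that degrade retrieval performance.

Insight: The work reveals that MLLMs' poor retrieval performance stems from text dominance in the representation space and from embedding homogenization, and it suggests that analyzing representation components is a possible direction for diagnosing and improving the multimodal retrieval capabilities of MLLMs.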

Abstract: Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.


[107] PEDESTRIAN: An Egocentric Vision Dataset for Obstacle Detection on Pavements cs.CV | cs.LG

Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Vassilis Vassiliades, Loizos Michael

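TL;DR: This paper introduces PEDESTRIAN, an egocentric vision dataset for obstacle detection on pavements. It contains 340 videos covering 29 common obstacles and is intended to improve pedestrian safety through deep learning algorithms.

Details

Motivation: Urban sidewalks are often obstructed by obstacles that hinder free movement and endanger pedestrians. The work uses pervasive computing and egocentric vision techniques toward real-time obstacle detection systems that enhance pedestrian safety.

Result: Several state-of-the-art deep learning algorithms were trained on the proposed dataset, and the results can serve as a benchmark for obstacle detection and recognition tasks; the abstract does not report specific quantitative metrics or comparisons with the state of the art.

Insight: The novelty lies in building an egocentric vision dataset dedicated to pavement obstacles, which fills a gap in available datasets, provides a basis for developing efficient recognition algorithms, and highlights the potential for improving pedestrian safety in cities.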

Abstract: Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian’s path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian’s point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.


[108] InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training cs.CV

Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang

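TL;DR: This paper proposes InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. The method inverts the pre-trained model to generate synthetic images that approximate the training distribution of previous tasks, mitigating catastrophic forgetting without access to previous real data while respecting data privacy and storage constraints.

Details

Motivation: Existing continual self-supervised learning methods for medical imaging still replay real data from previous stages, which risks privacy leakage and conflicts with restrictions on cross-site data transfer. The goal is continual learning with strictly no access to historical real data.

Result: Extensive experiments on nine downstream tasks show that InvCoSS matches or even outperforms prior methods that rely on data replay, while significantly reducing storage requirements and eliminating data privacy constraints.

Insight: The contributions are: 1) using model inversion to generate synthetic images that replace real-data replay and protect privacy; 2) InvUNet, a multi-scale fusion architecture that improves the fidelity of synthetic images; 3) a repulsive representation-learning mechanism that increases the diversity of synthetic images and avoids mode collapse.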

Abstract: Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.


[109] Towards Minimal Fine-Tuning of VLMs cs.CV | cs.AI

Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

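TL;DR: This paper proposes Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) method for transformer-based vision-language models (VLMs). It applies low-rank adaptation only to the value path of attention layers within the visual-token span, and further adapts only a subset of attention heads selected by influence scores, which substantially reduces trainable parameters and compute.

Details

Motivation: The goal is a lighter, more compute-efficient fine-tuning method that reduces the computational burden and parameter count of PEFT for vision-language models while maintaining model performance.

Result: On screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy with fewer trainable parameters and lower adapter-only training FLOPs. Results on GSM8K further show that the method preserves the VLMs' pure-text reasoning performance before and after fine-tuning.

Insight: The novelty lies in restricting low-rank adaptation (LoRA) to the value path of attention layers over visual tokens, and in introducing influence-score-based head selection together with selection-size normalization to stabilize updates. This offers a new approach to more parameter-efficient fine-tuning focused on a specific modality (vision).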

Abstract: We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
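
The core idea of applying a low-rank update only to the value path and only over visual tokens can be sketched in a few lines. This is a toy, single-head illustration under assumed shapes; the matrices, the `is_visual` layout, and the `scale` parameter are hypothetical, and head selection and selection-size normalization are not modeled.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def value_projection(tokens, is_visual, W_v, A, B, scale=1.0):
    """Value projection where the low-rank update B @ A is applied to visual tokens only."""
    values = []
    for x, visual in zip(tokens, is_visual):
        v = matvec(W_v, x)                    # frozen base projection
        if visual:
            delta = matvec(B, matvec(A, x))   # rank-r adapter path
            v = [vi + scale * di for vi, di in zip(v, delta)]
        values.append(v)
    return values


W_v = [[1.0, 0.0], [0.0, 1.0]]   # frozen value weights (toy 2x2)
A = [[1.0, 1.0]]                 # rank-1 down-projection, shape (1, 2)
B = [[0.5], [0.0]]               # rank-1 up-projection, shape (2, 1)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [False, True, True]  # hypothetical layout: one text token, two image tokens

values = value_projection(tokens, is_visual, W_v, A, B)
visual_fraction = sum(is_visual) / len(is_visual)
```

The adapter path runs only for the visual tokens, so its cost scales with `visual_fraction`, in line with the abstract's statement that adapter-only FLOPs shrink roughly in proportion to the visual-token fraction. Text tokens pass through the frozen projection unchanged.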


[110] VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis cs.CV

Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang

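TL;DR: This paper introduces VisionDirector, a training-free vision-language supervisor that addresses the performance bottleneck of generative models on long, complex prompts containing many tightly coupled goals. The authors first propose Long Goal Bench, a 2,000-task benchmark, and find that existing state-of-the-art models satisfy fewer than 72% of the goals and are especially brittle on localized edits. VisionDirector extracts structured goals, dynamically decides on a generation strategy, runs micro-grid sampling with semantic verification and rollback, and logs goal-level rewards, which markedly improves the alignment of generated images with complex instructions.

Details

Motivation: Current generative models perform poorly on the long, complex prompts with many tightly coupled goals that professional designers use, and existing benchmarks do not adequately evaluate model performance in real-world settings.

Result: VisionDirector improves GenEval by 7% overall and ImgEdit by 0.07 absolute, setting a new state of the art; the fine-tuned planner reduces the average number of edit steps from 4.2 to 3.1, and there are qualitative improvements in typography, multi-object scenes, and pose editing.

Insight: The novelty is a training-free, vision-language guided closed-loop refinement framework that combines structured goal parsing, dynamic decision-making, and micro-grid sampling with verification and rollback; the planner is additionally fine-tuned with Group Relative Policy Optimization. Together these improve generation robustness and alignment accuracy on long, complex prompts.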

Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.


[111] 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory cs.CV

Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun

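TL;DR: 3SGen is a task-aware unified image generation framework that handles subject-, style-, and structure-driven conditional generation within a single model. It uses a multimodal large language model equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core is an Adaptive Task-specific Memory module that, through a lightweight gating mechanism and scalable memory items, dynamically disentangles, stores, and retrieves condition-specific priors, which mitigates inter-task interference and supports compositional inputs.

Details

Motivation: Current image generation methods usually handle subject-, style-, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. The paper addresses this with a unified framework that integrates the three conditioning modes.

Result: Extensive experiments on the proposed 3SGen-Bench, which evaluates cross-task fidelity and controllability, and on other public benchmarks show superior performance across diverse image-driven generation tasks.

Insight: The main contributions are: 1) a task-aware unified framework that integrates subject, style, and structure conditioning into a single model; 2) an Adaptive Task-specific Memory module that mitigates feature entanglement by dynamically disentangling and retrieving condition-specific priors; 3) 3SGen-Bench, a benchmark for standardized evaluation of cross-task performance. Viewed objectively, the lightweight gating and scalable memory design gives multi-condition image generation effective disentanglement and composition.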

Abstract: Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.


[112] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation cs.CV | cs.AI

Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu

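TL;DR: This study evaluates the gait biometric fidelity of four state-of-the-art generative AI human animation models. Although the models generate animation of high visual quality, biometric fidelity in identification tasks is low, indicating that current models struggle to disentangle identity from motion and rely on visual attributes rather than temporal dynamics.

Details

Motivation: Generative AI models achieve high visual fidelity in human animation synthesis, but subtle inconsistencies can make animation look unnatural. In behavioral biometric evaluation in particular, the subtle motion cues that define identity are easily lost or distorted, so the study investigates whether these models preserve the spatio-temporal details that gait biometrics requires.

Result: Across the two main evaluation tasks (restoring gait patterns from reference videos and transferring gait patterns to different visual identities), the models perform well on visual quality but show low biometric fidelity, with degraded recognition performance in identification tasks. The identity transfer task also reveals a fundamental flaw in appearance-based gait recognition when texture is disentangled from motion.

Insight: The novelty lies in the first systematic evaluation of gait biometric fidelity in generative AI human animation models. It exposes the models' reliance on visual attributes rather than temporal dynamics and points toward models that better disentangle identity from motion, which is relevant to both behavioral biometrics and animation generation.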

Abstract: Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.


[113] Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context cs.CV

Kyungwon Cho, Hanbyul Joo

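TL;DR: This paper proposes HaMoS, a hand-aware, sequence-level diffusion framework for reconstructing the wearer's full-body motion from egocentric video. The method conditions directly on the head trajectory and on hand cues that are only intermittently visible because of field-of-view limitations and occlusions, uses a novel data augmentation method to model real-world conditions, and employs local attention to infer long sequences efficiently.

Details

Motivation: Estimating full-body motion from egocentric video is difficult because most body parts are not visible. Existing methods either rely on head trajectories, which leads to ambiguity, or assume continuously tracked hands, which is unrealistic.

Result: On public benchmarks, the method achieves state-of-the-art accuracy and temporal smoothness, a practical step toward reliable in-the-wild egocentric 3D motion understanding.

Insight: The contributions include the first sequence-level diffusion framework that combines head trajectory with intermittent hand cues, a data augmentation method that models real-world conditions, and evidence that sequence-level context such as body shape and field of view is important for accurate motion reconstruction, with local attention used to handle long sequences efficiently.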

Abstract: Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer’s full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.


[114] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning cs.CV

Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang

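TL;DR: The paper proposes RMLer, a framework that synthesizes novel objects across categories through reinforcement mixing learning. It formulates concept fusion as a reinforcement learning problem, with mixed features as states, mixing strategies as actions, and visual outcomes as rewards, and optimizes the policy to generate high-quality images.

Details

Motivation: Existing text-to-image methods suffer from conceptual imbalance, superficial combinations, or mere juxtaposition when fusing concepts across categories, so a new method is needed to synthesize novel visual concepts effectively.

Result: Extensive experiments show that RMLer outperforms existing methods at synthesizing coherent, high-fidelity objects.

Insight: The contributions include formulating cross-category concept fusion as a reinforcement learning problem, designing an MLP policy network that predicts dynamic blending coefficients, introducing visual rewards based on semantic similarity and compositional balance, and optimizing with proximal policy optimization, which together give a robust framework for generating novel visual concepts.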

Abstract: Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer’s superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.


[115] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing cs.CV

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang

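TL;DR: This paper proposes Think2Seg-RS, a decoupled framework for reasoning segmentation in remote sensing imagery. It trains a large vision-language model (LVLM) prompter to generate structured geometric prompts that control a frozen Segment Anything Model (SAM), separating semantic reasoning from pixel prediction to address the weak geometric grounding and limited generalization caused by existing end-to-end fine-tuning approaches.

Details

Motivation: Existing LVLM-based reasoning segmentation frameworks for remote sensing analysis usually couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, which leads to weak geometric grounding and limited generalization across tasks. The paper aims to solve this and achieve more robust, generalizable semantic-level reasoning segmentation.

Result: The method achieves state-of-the-art performance on the EarthReason dataset. In addition, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding.

Insight: The core innovation is a decoupled framework that separates semantic reasoning (handled by the LVLM) from geometric segmentation (handled by the frozen SAM) and trains the LVLM prompter with a mask-only reinforcement learning objective, so that it learns to translate abstract semantics into spatially grounded actions. The study also finds that compact segmenters outperform larger ones under semantic-level supervision and that negative prompts are ineffective against heterogeneous aerial backgrounds. This opens a path toward unified, interpretable LVLM-driven Earth observation.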

Abstract: Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.


[116] Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome cs.CV

Moamal Fadhil Abdul, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad

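TL;DR: This paper extends the public OpenTTGames table tennis video dataset with fine-grained annotations of shot type (forehand and backhand, with subtypes), player posture (body lean and leg stance), and rally outcome, to support training models that move from event detection toward tactical understanding.

Details

Motivation: Table tennis video analysis lacks public, finely annotated data. Such data would support automatic stroke detection, classification, and tactical analysis, which in turn can streamline training workflows, enrich broadcast content, and enable fine-grained performance analytics.

Result: The paper extends the OpenTTGames dataset and provides a new annotation scheme and baselines, but the abstract reports no specific quantitative results or comparisons with existing methods.

Insight: The novelty lies in a compact coding scheme and a code-assisted labeling procedure that support reproducible annotation for fine-grained stroke understanding, filling the community's gap in public, clearly licensed fine-grained table tennis video datasets.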

Abstract: Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either “bounce”, “net”, or “empty_event” in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.


[117] ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining cs.CV

Zhenyang Huang, Xiao Yu, Yi Zhang, Decheng Wang, Hang Ruan

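TL;DR: This paper proposes ReasonCD, a multimodal reasoning change detection model that tackles implicit semantic mining of users' change regions of interest (CRoI) in remote sensing change detection. The model uses the strong reasoning ability of pre-trained large language models to mine the user's task intent from implicit textual descriptions and produces different change detection results accordingly.

Details

Motivation: Existing semantics-guided remote sensing change detection methods rely too heavily on explicit textual descriptions of the CRoI, so their performance fails almost completely when given implicit CRoI descriptions. A model that can understand and mine users' implicit task intent is therefore needed.

Result: Experiments on public datasets show excellent change detection performance, with an F1 score of 92.1% on the BCDD dataset. Experiments on a reasoning subset annotated from the SECOND dataset further show that the model not only handles basic reasoning-based change detection well but can also explain its reasoning process to support human decision-making.

Insight: The novelty lies in bringing the reasoning ability of large language models to remote sensing change detection, enabling the mining of users' implicit task intent and overcoming the dependence of traditional methods on explicit textual descriptions. This offers a new approach to incorporating high-level semantic reasoning into multimodal tasks.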

Abstract: Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users’ CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users’ implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users’ implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.


[118] Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis cs.CV

Xiaoming Zhang, Chunli Li, Jiacheng Hao, Yuan Gao, Danyang Tu

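TL;DR: This paper proposes MOON++, a novel multimodal framework for non-invasive grading of esophageal varices (EV) from non-contrast CT (NCCT) scans. The framework jointly analyzes imaging features of the esophagus, liver, and spleen and incorporates clinical prior knowledge (the association between organ volumetric relationships and liver disease severity), with the aim of offering an alternative to conventional endoscopy.

Details

Motivation: Esophageal varices are a serious complication of portal hypertension, and conventional diagnosis relies on invasive endoscopy. Non-contrast CT (NCCT) is a potential non-invasive alternative that is not yet fully utilized in clinical practice. The paper aims to develop an NCCT-based multi-organ analysis framework enhanced with clinical priors for more accurate, non-invasive grading of EV severity.

Result: The method was evaluated on a dataset of 1,631 patients, with EV severity divided into four grades. On a validation set of 239 cases and an independent test set of 289 cases, MOON++ outperformed conventional single-organ methods: AUC 0.894 versus 0.803 for severe EV classification (G3 versus <G3) and 0.921 versus 0.793 for distinguishing moderate-to-severe EV (>=G2 versus <G2). A reader study involving experienced radiologists further validated its performance.

Insight: The main novelty is the first comprehensive multi-organ NCCT analysis framework (MOON++) for EV assessment that incorporates clinical prior knowledge about organ volumetric relationships. Viewed objectively, it combines multimodal learning (integrating imaging features of several organs) with clinical domain knowledge (pathophysiological associations) to improve both interpretability and performance in medical image analysis, offering a new approach to non-invasive diagnosis.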

Abstract: Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1,631 patients; those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single organ methods, achieving an AUC of 0.894 versus 0.803 for the severe grade EV classification (G3 versus <G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (>=G2 versus <G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative.
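
The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way on made-up scores; the data and the function name are illustrative assumptions and are unrelated to the study's cohort.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up model scores; label 1 = severe EV (G3), label 0 = below G3.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
```

For a binary split such as G3 versus <G3, the four severity grades are first collapsed into two classes, and the AUC is then computed on the model's score for the positive class.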


[119] dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models cs.CVPDF

Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu

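TL;DR: This paper proposes dMLLM-TTS, a framework for efficiently improving the test-time generation performance of diffusion multimodal large language models (dMLLMs). It scales along two complementary axes, trajectory exploration and iterative refinement, to improve generation diversity and stability, and introduces a hierarchical search algorithm and a self-verified feedback mechanism that substantially reduce computational cost and remove the dependence on an external verifier.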

Details

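Motivation: Diffusion multimodal large language models (dMLLMs) unify image generation and understanding, but the effectiveness and efficiency of test-time scaling (TTS) for them remain underexplored. Existing TTS methods typically use linear search, which is computationally expensive (O(NT)) and relies on an external verifier for best-of-N selection, limiting how fully the model's generative potential can be realized.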

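Result: Extensive experiments on the GenEval benchmark with three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that the framework substantially improves generation quality while being up to 6x more efficient than linear search.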

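Insight: The contributions are: 1) a hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories; 2) a self-verified feedback mechanism that uses the dMLLM's intrinsic image understanding to assess text-image alignment without an external verifier. Together these improve performance while enabling efficient self-supervised optimization.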

Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs’ intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
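
The gap between O(NT) and O(N+T) can be made concrete by counting denoising-step evaluations. The sketch below is only an illustration: the halving schedule and both function names are assumptions, not the paper's algorithm. It shows one simple pruning scheme whose total cost stays below 2N + T.

```python
def linear_search_cost(n_trajectories: int, n_steps: int) -> int:
    """Every trajectory is refined for every step: N * T evaluations."""
    return n_trajectories * n_steps


def hierarchical_search_cost(n_trajectories: int, n_steps: int) -> int:
    """Count evaluations when half of the surviving trajectories are pruned
    after each step (hypothetical schedule, at least one always survives)."""
    alive = n_trajectories
    total = 0
    for _ in range(n_steps):
        total += alive              # one evaluation per surviving trajectory
        alive = max(1, alive // 2)  # prune the weaker half
    return total


if __name__ == "__main__":
    print(linear_search_cost(16, 50), hierarchical_search_cost(16, 50))
```

With 16 trajectories and 50 steps, the linear count is 800 while the halving schedule needs 76, because the geometric series 16 + 8 + 4 + 2 + 1 is bounded by 2N and the remaining steps cost one evaluation each.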


[120] D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning cs.CVPDF

Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan

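TL;DR: This paper proposes D2Pruner, a token pruning framework for multimodal large language models (MLLMs) that targets the high computational cost of processing long visual token sequences. By combining debiased importance estimation with structural diversity pruning, it improves performance on fine-grained localization tasks while remaining highly efficient on general understanding tasks.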

Details

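Motivation: Current MLLM token pruning methods perform acceptably on general understanding tasks but fail catastrophically on fine-grained localization. The authors attribute this to inherent flaws in the two prevailing strategies: importance-based methods are distorted by the model's own positional bias, while diversity-based methods ignore the user's prompt and spatial redundancy, which makes them structurally blind.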

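Result: On general understanding tasks with LLaVA-1.5-7B, D2Pruner reduces FLOPs by 74.2% while retaining 99.2% of the original performance. On more challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% of performance at a 90% token reduction rate, an improvement of up to 63.53% over existing methods.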

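Insight: The core idea is to combine a debiased importance score (based on attention scores) with a structural pruning mechanism (a hybrid graph that models spatial proximity and semantic similarity, followed by maximal independent set selection). The most critical tokens are retained as pivots, and the supplementary tokens are chosen to maximize both importance and diversity, which addresses the bias and structural blindness of existing methods.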

Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user’s prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods.
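
The selection step described above (keep the most important available token, then remove its neighbors) is a greedy maximal independent set procedure. The following is a minimal sketch of that greedy rule only. The function name, the `budget` parameter, and the toy graph are assumptions for illustration; the debiasing of attention scores, the pivot step, and the construction of the hybrid graph are not shown.

```python
def greedy_mis(importance, neighbors, budget):
    """Greedily pick tokens in order of importance, discarding graph
    neighbors of each picked token so the kept set stays independent.

    importance: list of floats, one score per token (assumed precomputed)
    neighbors:  dict mapping token index -> set of adjacent token indices
    budget:     maximum number of tokens to keep (hypothetical parameter)
    """
    available = set(range(len(importance)))
    kept = []
    while available and len(kept) < budget:
        # Highest importance wins; the lower index breaks ties.
        best = max(available, key=lambda i: (importance[i], -i))
        kept.append(best)
        # Remove the chosen token and everything adjacent to it.
        available -= neighbors.get(best, set()) | {best}
    return kept


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.1, 0.7]
    chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # toy path graph 0-1-2-3
    print(greedy_mis(scores, chain, budget=3))
```

On the toy path graph, token 1 is dropped despite its high score because it is adjacent to the already selected token 0, which is how the rule trades raw importance for diversity.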


[121] Sign Language Recognition using Parallel Bidirectional Reservoir Computing cs.CV | cs.ROPDF

Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh

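TL;DR: This paper proposes a lightweight sign language recognition system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe provides real-time hand tracking and extracts hand joint coordinates, which serve as input features for PBRC. The PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing modules arranged in parallel to capture temporal dependencies and produce a rich feature representation for classification.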

Details

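Motivation: Deep learning based sign language recognition models require extensive computational resources and are hard to deploy on edge devices. The paper aims to provide a lightweight, low-cost solution for real-time sign language recognition on edge devices.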

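Result: On the Word-Level American Sign Language (WLASL) video dataset, the system achieves top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Owing to the intrinsic properties of reservoir computing, training time drops to 18.67 seconds, whereas deep learning based methods such as Bi-GRU take over 55 minutes.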

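Insight: The novelty lies in combining a parallel bidirectional reservoir computing architecture with MediaPipe's real-time hand tracking to build a computationally efficient, lightweight system. It exploits the fast training of reservoir computing and the ability of the parallel bidirectional structure to capture temporal dependencies, offering a promising alternative for real-time sign language recognition on edge devices.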

Abstract: Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
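
To make the architecture easier to picture, here is a rough sketch of how two bidirectional echo state reservoirs could be run in parallel to produce one feature vector per sequence. Everything in it is an assumption for illustration: the function names, the reservoir sizes, the weight scaling, the reuse of one reservoir for both directions, and the choice to keep only the final states. The reservoir weights stay fixed and only a readout on top of these features would be trained, which is the usual reason reservoir computing trains quickly.

```python
import math
import random


def make_reservoir(n_in, n_res, scale, seed):
    """Create fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_res)]
    # Dividing by n_res keeps the recurrent update contractive in this toy setup.
    w = [[rng.uniform(-scale, scale) / n_res for _ in range(n_res)]
         for _ in range(n_res)]
    return w_in, w


def run_reservoir(reservoir, sequence):
    """Apply x_t = tanh(W_in u_t + W x_{t-1}) and return the final state."""
    w_in, w = reservoir
    state = [0.0] * len(w)
    for u in sequence:
        state = [
            math.tanh(sum(a * b for a, b in zip(w_in[i], u))
                      + sum(a * b for a, b in zip(w[i], state)))
            for i in range(len(w))
        ]
    return state


def bidirectional_features(reservoir, sequence):
    """Concatenate final states of a forward pass and a time-reversed pass."""
    return run_reservoir(reservoir, sequence) + run_reservoir(reservoir, sequence[::-1])


def pbrc_features(reservoirs, sequence):
    """Run several bidirectional modules in parallel and concatenate them."""
    features = []
    for reservoir in reservoirs:
        features += bidirectional_features(reservoir, sequence)
    return features


if __name__ == "__main__":
    modules = [make_reservoir(3, 8, 0.9, seed=0), make_reservoir(3, 8, 0.5, seed=1)]
    joints = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # stand-in for joint coordinates
    print(len(pbrc_features(modules, joints)))
```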


[122] FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors cs.CVPDF

Georgios Voulgaris

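TL;DR: This paper proposes FusionNet, a physics-aware representation learning framework for multi-spectral and thermal infrared data. Using trainable differential signal-processing priors, mixed pooling strategies, and wider receptive fields, it fuses a geological short-wave infrared (SWIR) ratio with thermal infrared (TIR) data at an intermediate stage to model stable signatures of long-term physical processes.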

Details

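Motivation: Existing deep learning models for multi-modal visual signals rely on inductive biases that are poorly aligned with the physical processes governing signal formation, which leads to brittle performance under cross-spectral and real-world conditions. In particular, methods that rely directly on thermal cues struggle to capture indirect yet persistent environmental changes caused by sustained heat emissions.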

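Result: Systematic ablations show that every architectural component contributes to the performance gains: DGCNN reaches 88.7% accuracy on the SWIR ratio and FusionNet reaches 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, underscoring the importance of modality-aware training for cross-spectral learning.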

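Insight: The novelty lies in combining physics-aware feature selection (such as a geological SWIR ratio sensitive to changes in soil properties) with a principled deep learning architecture (such as convolutional layers that embed trainable differential signal-processing priors), yielding robust and generalizable multi-spectral representations under challenging conditions. This shows how first-principles signal modeling can improve cross-spectral learning.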

Abstract: Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.


[123] Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation cs.CV | cs.AIPDF

Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi

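TL;DR: This paper proposes Anatomy-R1, a method for strengthening the reasoning of multimodal large language models on medical anatomical images. It introduces two strategies, Anatomical Similarity Curriculum Learning and Group Diversity Question Augmentation, to address the insufficient knowledge sharing and narrow reasoning paths of existing GRPO methods in anatomy recognition, improving understanding and reasoning on clinical anatomical images.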

Details

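Motivation: Multimodal large language models have made notable progress in natural image reasoning, but their potential in medical imaging, especially clinical anatomical surgical images, remains underexplored. Anatomy understanding tasks require precise, clinically coherent answers, yet the complexity of medical data and the scarcity of high-quality expert annotations limit the effectiveness of conventional supervised fine-tuning. Although GRPO can enhance MLLM reasoning without large amounts of data, in anatomy recognition it suffers from insufficient knowledge sharing and overly rapid convergence to a single reasoning path, which limits its performance.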

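Result: Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show significant improvements on both, demonstrating that the method effectively enhances the medical reasoning capabilities of MLLMs.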

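Insight: The contributions are: 1) Anatomical Similarity Curriculum Learning, which controls question difficulty through the similarity of answer choices to enable progressive learning; 2) Group Diversity Question Augmentation, which expands the search space for difficult queries and mitigates the model's tendency to produce uniform responses. These strategies optimize the reinforcement learning process from the angles of curriculum learning and diversity augmentation, and may transfer to other domains that require fine-grained reasoning and knowledge transfer.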

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO’s reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model’s search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in https://github.com/tomato996/Anatomy-R1
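
As background for the second weakness, the sketch below shows the group-relative advantage normalization commonly used in GRPO: each sampled response is scored against the mean and standard deviation of its own group. It is not the Anatomy-R1 code, and the function name and `eps` value are assumptions. It illustrates one way uniform responses can stall learning: when every response in a group earns the same reward, all advantages are zero.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and (population) standard
    deviation of its group; eps guards against division by zero."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed outcomes
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # uniform outcomes
```

A group with mixed outcomes yields positive and negative advantages, while a uniform group yields none, so augmenting difficult questions to diversify responses is one plausible way to restore a learning signal.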


[124] Multi-Modal Soccer Scene Analysis with Masked Pre-Training cs.CVPDF

Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo

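TL;DR: This paper proposes a multi-modal architecture for analyzing soccer matches from tactical camera footage, focusing on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. It integrates three input modalities (player trajectories, player types, and image crops of individual players) into a unified framework that processes spatial and temporal dynamics with a cascade of sociotemporal transformer blocks. Unlike prior methods that depend on accurate ball tracking or handcrafted heuristics, it infers the ball trajectory without direct access to the ball's past or future positions and robustly identifies the ball state and possessor under the noisy or occluded conditions of real top league matches.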

Details

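Motivation: Conventional soccer scene analysis depends heavily on accurate ball tracking or handcrafted heuristics. The goal is robust multi-modal analysis that does not need direct ball position information, particularly under realistic match conditions such as noise and occlusion.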

Result: On a large-scale dataset, the method delivers substantial improvements over state-of-the-art baselines on all three core tasks.

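Insight: The contributions are: 1) a unified transformer-based multi-modal framework that combines structured data (trajectories, types) with visual data (image crops); 2) CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages learning cross-modal patterns; 3) a demonstration of robust inference under realistic noise and occlusion.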

Abstract: In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.


[125] SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates cs.CVPDF

Chi Zhang, Braedon Gunn, Andrew M. Read-Fuller

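TL;DR: This paper introduces SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that supports interactive virtual registration, evaluation, and quantitative comparison of multiple preformed orbital plates in a patient-specific virtual planning environment.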

Details

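Motivation: Poorly fitting preformed orbital plates lead to postoperative complications and revision surgery, and there are currently no publicly available tools or standardized metrics for quantitatively comparing plate fit across vendors, sizes, and patient anatomy.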

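Result: The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools, supporting both patient-specific planning and population-level statistical analysis of plate adaptability.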

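Insight: The novelty is an open-source, standardized platform for virtual registration and quantitative comparison. By enabling objective comparison of implant designs and placement strategies, it aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education.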

Abstract: Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.
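
The abstract does not define the plate-to-orbit distance metrics, so the following is only a guess at the general shape of such a measure: for every point sampled on the registered plate, take the distance to the nearest point sampled on the orbital surface, then summarize. The function names, the point-cloud representation, and the mean/max summary are all assumptions and do not describe the extension's actual API.

```python
import math


def plate_to_orbit_distances(plate_points, orbit_points):
    """For each plate point, return the Euclidean distance to the nearest
    orbit point (brute force; fine for small illustrative point sets)."""
    return [min(math.dist(p, q) for q in orbit_points) for p in plate_points]


def summarize(distances):
    """Collapse per-point distances into simple fit statistics."""
    return {"mean": sum(distances) / len(distances), "max": max(distances)}


if __name__ == "__main__":
    plate = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]  # hypothetical plate samples (mm)
    orbit = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # hypothetical orbit samples (mm)
    print(summarize(plate_to_orbit_distances(plate, orbit)))
```

Computing the same statistics for several candidate plates on one patient model would give a like-for-like comparison of fit, which is the kind of use the abstract describes.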


[126] CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion cs.CV | cs.AIPDF

Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez

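TL;DR: This paper proposes CASA (Cross-Attention via Self-Attention), an efficient multimodal fusion paradigm that addresses two problems in existing vision-language models (VLMs): the high computational cost of full token insertion and the weaker performance of cross-attention. By introducing local text-to-text interaction in the dedicated cross-attention layers, CASA substantially improves performance on fine-grained visual understanding tasks while keeping the scalability of cross-attention models on long contexts such as streaming video.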

Details

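Motivation: In existing vision-language models, full token insertion lets text and image fully attend to one another but is extremely costly in compute and memory for high-resolution images, long conversations, or streaming video. Efficient alternatives based on cross-attention show a clear performance gap, particularly on tasks involving fine-grained visual details.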

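Result: On common image understanding benchmarks, CASA substantially narrows the performance gap with full token insertion while retaining the same scalability as cross-attention models, making it suitable for long-context multimodal tasks such as streaming video captioning.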

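Insight: The key finding is that cross-attention models improve when local text-to-text interaction is also enabled in their dedicated cross-attention layers. CASA builds on this as a simple and efficient fusion paradigm that achieves a better balance between performance and efficiency.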

Abstract: Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .


[127] StoryMem: Multi-shot Long Video Storytelling with Memory cs.CVPDF

Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi

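TL;DR: StoryMem is a memory-based paradigm for long-form video storytelling. It turns pre-trained single-shot video diffusion models into multi-shot storytellers, using a dynamically updated memory bank of keyframes to maintain cross-shot consistency in long videos, and supports smooth shot transitions and customized story generation.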

Details

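Motivation: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, the work aims to improve the coherence of long video generation.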

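Result: On the proposed ST-Bench benchmark, StoryMem outperforms previous methods in cross-shot consistency while preserving high aesthetic quality and prompt adherence, a significant step toward coherent minute-long video storytelling.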

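Insight: The contributions include the Memory-to-Video (M2V) design, memory injection via latent concatenation and negative RoPE shifts, and a semantic keyframe selection strategy combined with aesthetic preference filtering. Together these mechanisms improve the coherence and quality of long video generation.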

Abstract: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.


[128] No Data? No Problem: Robust Vision-Tabular Learning with Missing Values cs.CVPDF

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

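TL;DR: This paper proposes RoVTL (Robust Vision-Tabular Learning), a framework for handling missing values in vision-tabular data. It has two key stages: contrastive pretraining, which uses tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, which uses a gated cross-attention module for multimodal fusion. A Tabular More vs. Fewer loss combined with disentangled gradient learning keeps performance consistent across all levels of tabular data completeness, from 0% to 100%.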

Details

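Motivation: Large-scale medical biobanks provide imaging data together with rich tabular information, but real-world datasets often have only a subset of the tabular attributes. This gap calls for a method that can leverage all tabular data during training while remaining robust to missing values at inference.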

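Result: Evaluated on cardiac MRI scans from the UK Biobank, RoVTL is more robust to missing tabular data than prior methods. It also generalizes to an external cardiac MRI dataset for multimodal disease classification and extends to the natural image domain, achieving robust performance on a car advertisements dataset.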

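Insight: The contributions are: using tabular attribute missingness as data augmentation to promote robustness; a gated cross-attention module for effective multimodal fusion; and a Tabular More vs. Fewer loss combined with disentangled gradient learning to keep performance consistent across levels of data completeness. Viewed objectively, these techniques offer a reusable approach to handling missing values.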

Abstract: Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at https://github.com/marteczkah/RoVTL.
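
The idea of treating missingness as a data augmentation can be sketched as random attribute masking applied to each tabular row during pretraining. The code below is a minimal illustration under assumptions: the function name, the zero fill value, and the explicit observed/missing mask are choices made here, not details taken from RoVTL.

```python
import random


def mask_tabular(row, missing_rate, rng):
    """Randomly hide attributes of one tabular row.

    Returns (values, mask), where mask[i] == 1 means attribute i is still
    observed and hidden attributes are replaced by 0.0 (assumed fill value).
    """
    values, mask = [], []
    for v in row:
        if rng.random() < missing_rate:
            values.append(0.0)
            mask.append(0)
        else:
            values.append(v)
            mask.append(1)
    return values, mask


if __name__ == "__main__":
    rng = random.Random(0)
    row = [63.0, 1.0, 27.4, 120.0]  # hypothetical age, sex, BMI, blood pressure
    for rate in (0.0, 0.5, 1.0):
        print(rate, mask_tabular(row, rate, rng))
```

Sampling a different missing rate per batch, anywhere from 0.0 to 1.0, would expose the model to the full range of completeness levels that the abstract says the framework is designed to handle.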


[129] MapTrace: Scalable Data Generation for Route Tracing on Maps cs.CV | cs.AIPDF

Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal

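TL;DR: To address the weakness of multimodal large language models in fine-grained spatial understanding (such as route tracing on maps), this paper proposes MapTrace, a scalable synthetic data generation pipeline that automatically produces pixel-accurate path annotations. The pipeline is used to build a dataset of 23k path samples for fine-tuning both open-source and proprietary MLLMs, which substantially improves robustness and accuracy on map route tracing.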

Details

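Motivation: Current multimodal large language models perform poorly on fine-grained spatial understanding tasks such as route tracing on maps and often violate basic path constraints, in part because large-scale, pixel-accurate path annotations are costly and difficult to collect.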

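Result: On the MapBench benchmark, fine-tuning substantially improves robustness, raising success rates by up to 6.4 points while also reducing path-tracing error (NDTW).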

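Insight: The novelty is a scalable synthetic data generation pipeline that produces precise annotations automatically from synthetic map images and pixel-level parsing. It shows that fine-grained spatial reasoning can be explicitly taught to pretrained models through synthetic supervision.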

Abstract: While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.


[130] 4D Gaussian Splatting as a Learned Dynamical System cs.CVPDF

Arnold Caleb Asiimwe, Carl Vondrick

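TL;DR: This paper proposes EvoGS, which reinterprets 4D Gaussian Splatting as a continuous-time dynamical system in which scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. The Gaussian representation is treated as an evolving physical system whose state evolves continuously under a learned motion law. This enables capabilities that deformation-based methods lack, such as sample-efficient learning from sparse temporal supervision, temporal extrapolation, and compositional injection of localized dynamics.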

Details

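Motivation: Deformation-based dynamic scene representations are limited in motion coherence, temporal consistency, and controllability. Modeling scene motion as a continuous dynamical system is intended to overcome these limitations.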

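Result: Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency than deformation-field baselines while maintaining real-time rendering.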

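Insight: The main contribution is reframing 4D Gaussian Splatting as a learned dynamical system. Modeling the underlying motion law enables sample-efficient learning from sparse supervision, temporal extrapolation, and compositional control of localized dynamics, giving a more flexible and powerful framework for representing and synthesizing dynamic scenes.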

Abstract: We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches: (1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering.
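
The core idea of obtaining motion by integrating a velocity field, rather than looking up a per-frame deformation, can be sketched with a standard ODE solver. In the code below the `velocity` function is a hand-written rotation that stands in for the learned neural field, and the RK4 integrator, step count, and function names are assumptions rather than the paper's implementation. Integrating with an end time earlier than the start time gives the backward prediction mentioned in the abstract.

```python
import math


def velocity(position, t):
    # Hypothetical stand-in for a learned neural field: rigid rotation about z.
    x, y, z = position
    return (-y, x, 0.0)


def _step(p, k, h):
    # Return p + h * k for 3-vectors.
    return tuple(pi + h * ki for pi, ki in zip(p, k))


def integrate(position, t0, t1, n_steps=100):
    """Classical RK4 integration of dp/dt = velocity(p, t) from t0 to t1.

    t1 may be smaller than t0, which integrates backward in time.
    """
    h = (t1 - t0) / n_steps
    p, t = tuple(position), t0
    for _ in range(n_steps):
        k1 = velocity(p, t)
        k2 = velocity(_step(p, k1, h / 2), t + h / 2)
        k3 = velocity(_step(p, k2, h / 2), t + h / 2)
        k4 = velocity(_step(p, k3, h), t + h)
        slope = tuple((a + 2 * b + 2 * c + d) / 6
                      for a, b, c, d in zip(k1, k2, k3, k4))
        p, t = _step(p, slope, h), t + h
    return p


if __name__ == "__main__":
    center = (1.0, 0.0, 0.0)                     # e.g. one Gaussian's mean
    later = integrate(center, 0.0, math.pi / 2)  # forward: quarter turn
    print(later)
    print(integrate(later, math.pi / 2, 0.0))    # backward: returns near start
```

Because the position at any time is defined by the integral, the same function answers queries between, after, or before the observed timestamps.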


[131] Over++: Generative Video Compositing for Layer Interaction Effects cs.CVPDF

Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta

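TL;DR: This paper proposes Over++, a video effect generation framework that synthesizes realistic environmental interaction effects (such as shadows, reflections, and dust) while preserving the original scene of the input video. It makes no assumptions about camera pose, scene stationarity, or depth supervision, and is conditioned on text prompts and input video layers.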

Details

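Motivation: In professional video compositing, artists must manually create environmental interaction effects between foreground and background layers. Existing video generative models struggle to preserve the input video, and video inpainting methods either require costly per-frame masks or yield implausible results. The paper aims to address these problems.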

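Result: Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.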

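Insight: The contributions include introducing the new task of "augmented compositing", constructing a paired effect dataset, adopting an unpaired augmentation strategy that preserves text-driven editability, and supporting optional mask control and keyframe guidance without dense annotations.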

Abstract: In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.


[132] Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis cs.CV | cs.AIPDF

Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra

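TL;DR: This paper proposes a knowledge-enhanced multimodal transformer framework for cross-modal alignment in diabetic retinopathy (DR) diagnosis. The framework integrates retinal fundus images, clinical text, and structured patient data, and uses multi-objective training (contrastive, reconstruction, and classification losses) to improve image-text alignment in the medical domain. Experiments on the BRSET dataset show that it significantly outperforms baselines such as fine-tuned CLIP in both text-to-image retrieval and DR severity classification, and it shows strong zero-shot generalization on the unseen DeepEyeNet dataset.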

Details

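Motivation: General-domain vision-language models such as CLIP perform poorly in the medical domain, particularly in cross-modal image-text retrieval for ophthalmology. The goal is to improve both the accuracy of automated diabetic retinopathy diagnosis and cross-modal alignment.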

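Result: On the BRSET dataset, text-to-image retrieval reaches a Recall@1 of 99.94% (versus 1.29% for fine-tuned CLIP), and DR severity classification accuracy reaches 97.05% under the SDRG scheme and 97.97% under the ICDR scheme (state of the art). In zero-shot evaluation on the DeepEyeNet dataset, Recall@1 is 93.95% (versus 0.22% for fine-tuned CLIP), indicating strong generalization.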

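Insight: The contributions are: 1) integrating multimodal data (images, text, structured features) and fusing it through a joint transformer with modality-specific embeddings; 2) a multi-objective training strategy (contrastive, reconstruction, and classification losses) to strengthen the learning of cross-modal relationships; 3) effective image-text alignment and diagnostic performance in the medical domain, which provides a reference for designing multimodal models in specialized domains.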

Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP’s 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
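
Since the headline numbers are Recall@1 scores, a short sketch of how text-to-image Recall@K is typically computed may help in reading them. The function name and the convention that query i matches image i are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching image ranks in the top k.

    similarity[i][j] is the score between text query i and image j, and the
    ground-truth match for query i is assumed to be image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.0],  # query 0: correct image ranked first
        [0.2, 0.8, 0.1],  # query 1: correct image ranked first
        [0.7, 0.2, 0.3],  # query 2: correct image ranked second
    ]
    print(recall_at_k(scores, k=1), recall_at_k(scores, k=2))
```

In this toy matrix two of three queries retrieve their image first, so Recall@1 is 2/3, and all three succeed within the top two.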


[133] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning cs.CV | physics.med-phPDF

Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex

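TL;DR: This paper proposes Efficient Vision Mamba, an efficient MRI super-resolution framework that combines multi-head selective state-space models (MHSSM) with a lightweight channel MLP and uses a hybrid scanning strategy to capture long-range dependencies. The model achieves state-of-the-art performance on 7T brain T1 and 1.5T prostate T2w MRI datasets with very low parameter count and computational cost, significantly outperforming baselines that include GANs, transformers, Mamba, and diffusion models.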

Details

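Motivation: High-resolution MRI is critical for diagnosis, but long acquisition times limit its clinical use. Existing deep learning super-resolution methods face a trade-off between fidelity and computational efficiency. The paper aims to develop a computationally efficient and accurate deep learning framework for MRI super-resolution that preserves anatomical detail and eases clinical integration.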

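Result: On 7T brain data, the model achieves SSIM=0.951, PSNR=26.90 dB, LPIPS=0.076, and GMSD=0.083, significantly outperforming all baselines (p<0.001). On prostate data, it achieves SSIM=0.770 and PSNR=27.15 dB. The framework uses only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% relative to Res-SRDiff, while outperforming SwinIR and MambaIR in both accuracy and efficiency.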

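Insight: The novelty claimed in the abstract is a new architecture that combines multi-head selective state-space models (MHSSM) with a lightweight channel MLP and uses 2D patch extraction with hybrid scanning to capture long-range dependencies efficiently. Viewed objectively, the core contribution is pairing the efficient sequence modeling of Mamba (selective state-space models) with vision-specific design choices (such as depthwise convolutions and gated channel mixing), cutting parameters and computation by orders of magnitude while maintaining or improving performance, which makes it a promising option for deployment in resource-constrained clinical settings.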

Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.


[134] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion cs.CV | cs.AI

Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang

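TL;DR: The paper proposes WorldWarp, a framework for generating long-range, geometrically consistent video. To address the difficulty existing methods have with occluded regions and complex camera trajectories, it combines an online geometric cache built with 3D Gaussian Splatting (3DGS), which serves as a structural anchor, with a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective, which serves as a 2D generative refiner.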

Details

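Motivation: To resolve a fundamental dilemma in generating long-range, geometrically consistent video: consistency demands strict adherence to 3D geometry in pixel space, whereas state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occlusions and complex camera trajectories.

Result: The paper claims that WorldWarp achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture.

Insight: The main innovation is coupling a 3D geometric cache (the structural anchor) with a 2D generative refiner (the ST-Diff model) and introducing a spatio-temporally varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. This offers a new way to combine explicit 3D geometric guidance with the capabilities of generative models in video generation.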

Abstract: Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporally varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/.
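
The region-dependent noise schedule described in this abstract can be illustrated with a minimal sketch. It assumes a boolean validity mask from the warp (True where warped content exists, False in holes), a hypothetical `partial_level` of 0.5, and a simple linear blend between pixel and Gaussian noise. None of these details come from the paper, whose actual diffusion forward process is not given in the abstract.

```python
import random


def noise_levels(valid_mask, partial_level=0.5):
    """Per-pixel noise strength: full noise (1.0) in holes, partial noise elsewhere.

    valid_mask[i][j] is True where the warp produced content (assumed input).
    """
    return [[partial_level if valid else 1.0 for valid in row] for row in valid_mask]


def add_region_noise(warped, valid_mask, partial_level=0.5, seed=0):
    """Blend each pixel with Gaussian noise according to its region.

    The linear blend (1 - level) * pixel + level * eps is an illustrative
    assumption, not the schedule used by ST-Diff.
    """
    rng = random.Random(seed)
    levels = noise_levels(valid_mask, partial_level)
    noisy = []
    for pixel_row, level_row in zip(warped, levels):
        noisy_row = []
        for pixel, level in zip(pixel_row, level_row):
            eps = rng.gauss(0.0, 1.0)
            # level == 1.0 discards the pixel entirely, so the model must generate it
            noisy_row.append((1.0 - level) * pixel + level * eps)
        noisy.append(noisy_row)
    return noisy
```

With this blend, whatever value sits in a hole pixel has no influence on the noised input, while warped pixels keep part of their signal so that a denoiser could refine rather than regenerate them.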


[135] VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation cs.CV

Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li

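TL;DR: This paper proposes VA-π, a lightweight post-training framework that addresses the misalignment between the generator and the tokenizer in autoregressive visual generation. The method unifies pixel reconstruction and autoregressive modeling through variational optimization and introduces a reinforcement-learning-based alignment strategy that uses pixel-space reconstruction quality as an intrinsic reward to optimize the autoregressive model directly.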

Details

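Motivation: Autoregressive visual generation relies on a tokenizer to map between images and discrete sequences. However, the tokenizer is trained to reconstruct clean images from ground-truth tokens, while the autoregressive generator is optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, with no direct supervision from pixel space.

Result: With only 1% of ImageNet-1K data and 25 minutes of fine-tuning, the method reduces the FID of LlamaGen-XXL from 14.36 to 7.65 and raises IS from 86.55 to 116.70. On the GenEval text-to-image task, the visual generation model LlamaGen improves from 0.306 to 0.339 and the unified multimodal model Janus-Pro improves from 0.725 to 0.744.

Insight: The innovation lies in formalizing generator-tokenizer alignment as a variational optimization problem, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling, and designing a reinforcement-learning-based alignment strategy that uses pixel reconstruction quality under teacher forcing as an intrinsic reward. This gives the model direct pixel-level guidance without expensive free-running sampling or tokenizer retraining.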

Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$π$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$π$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$π$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$π$ enables rapid adaptation of existing AR generators, with neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
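
The idea of treating the generator as a policy with a pixel-space reward can be sketched with a generic REINFORCE-style surrogate. This is an illustration under stated assumptions, not the VA-π objective: the reward is taken to be the negative mean squared error between the original and reconstructed pixels, the baseline is a hypothetical constant, and the ELBO regularization term is omitted.

```python
import math


def pixel_reward(original, reconstructed):
    """Intrinsic reward: negative mean squared error in pixel space (assumed form)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return -mse


def reinforce_surrogate(token_log_probs, reward, baseline=0.0):
    """Generic REINFORCE surrogate loss for one sampled token sequence.

    Minimizing this loss raises the likelihood of sequences whose reward
    exceeds the baseline and lowers it for the others.
    """
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)


# Hypothetical two-token sequence, each token sampled with probability 0.5.
log_probs = [math.log(0.5), math.log(0.5)]
reward = pixel_reward([0.0, 1.0], [1.0, 1.0])  # one of two pixels is off by 1
loss = reinforce_surrogate(log_probs, reward, baseline=-1.0)
```

In a real setup the reconstruction would come from decoding predicted tokens with the tokenizer's decoder, and the gradient of the surrogate would be taken with respect to the generator's parameters by an autodiff framework.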


[136] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs cs.CV

Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys

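TL;DR: To address the weak spatial reasoning of multimodal large language models (MLLMs), this paper introduces a large-scale benchmark built from outdoor pedestrian-perspective videos. The dataset is captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors, provides precise 3D information, and supports automatic generation of hierarchical spatial reasoning questions ranging from qualitative relations to quantitative metric and kinematic understanding. Evaluation shows that the performance gains existing MLLMs achieve on structured indoor benchmarks vanish in open-world environments, and analysis indicates that they rely heavily on linguistic priors rather than visually grounded reasoning.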

Details

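Motivation: Existing MLLMs perform well on semantic tasks, but their spatial intelligence, which is crucial for building robust and grounded AI systems, remains underdeveloped. Existing benchmarks either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, and outdoor datasets with verifiable metric ground truth are lacking, so a new benchmark is needed to diagnose this limitation.

Result: Evaluation shows that the performance gains MLLMs achieve on structured indoor benchmarks vanish in open-world settings. Further analysis with synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors rather than visually grounded reasoning.

Insight: The innovation is a large-scale, metrically precise outdoor spatial reasoning benchmark that uses multi-sensor data to generate hierarchical questions automatically. It reveals the performance gap of MLLMs in open-world spatial reasoning and their reliance on linguistic priors, and provides a principled platform for diagnosing and improving physically grounded spatial intelligence.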

Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, crucial for robust and grounded AI systems, remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum, from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.


[137] Zero-shot Reconstruction of In-Scene Object Manipulation from Video cs.CV | cs.RO

Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu

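TL;DR: This paper builds the first system for reconstructing in-scene object manipulation from monocular RGB video. It uses data-driven foundation models to initialize the object mesh and poses, the scene point cloud, and the hand poses, then applies a two-stage optimization to recover the complete hand-object motion from grasping to interaction, keeping it consistent with the scene information in the input video.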

Details

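Motivation: To address the challenges of reconstructing in-scene object manipulation from monocular video: ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, which limits metric accuracy and practical use.

Result: The abstract reports no specific quantitative results or benchmarks. It claims to be the first system to address this problem, recovering hand-object motion that is consistent with the scene through optimization.

Insight: The innovation lies in initializing with data-driven foundation models and applying a two-stage optimization that integrates scene information, improving the physical plausibility and metric accuracy of the reconstruction and offering a new framework for reconstructing object manipulation from monocular video.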

Abstract: We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. The problem is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.


[138] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models cs.CV

Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang

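TL;DR: This paper proposes Visual-Aware CoT, which incorporates visual context consistency into the reasoning process of unified models through adaptive visual planning and iterative visual correction, addressing the poor preservation of visual features (such as human ID, object attributes, and style) in multimodal generation.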

Details

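Motivation: The chain of thought in current unified models focuses mainly on consistency with the text prompt during generation and neglects visual context consistency with the reference images in multimodal generation (for example, multi-reference generation), so key visual features are not preserved.

Result: Experiments show that the method outperforms both zero-shot unified models and models using text CoT on multimodal generation tasks, with higher visual context consistency.

Insight: The innovation is to integrate visual consistency explicitly into reasoning through a structured visual checklist and an iterative self-reflection and refinement mechanism, optimized with supervised fine-tuning and flow-GRPO using a customized visual checking reward, which improves the fidelity of multimodal generation.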

Abstract: Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on text consistency with the text prompt and ignores visual context consistency with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in failure to maintain key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be kept, and 2) Iterative Visual Correction: performing self-reflection guided by the checklist and refining the generated result iteratively. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking and conduct self-reflection and self-refinement, and use flow-GRPO to further enhance visual consistency through a customized visual checking reward. Experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.


cs.DB [Back]

[139] A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback cs.DB | cs.AI | cs.CL | cs.HC | cs.MA

Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen

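TL;DR: This paper proposes MATS, a multi-agent Text2SQL framework designed for small language models (SLMs). Through division of labor among agents and reinforcement learning with execution feedback, it achieves SQL generation accuracy comparable to large-scale LLMs on a single-GPU server, addressing the limited generalization of SLMs on complex Text2SQL tasks.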

Details

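Motivation: Privacy and cost considerations prevent companies from using external LLM services, so they adopt locally deployable SLMs, which generalize poorly on complex Text2SQL tasks. The paper aims to overcome this limitation.

Result: Evaluation on benchmark datasets shows that MATS, deployed on a single-GPU server with significantly fewer parameters, achieves accuracy on par with large-scale LLMs.

Insight: The innovation is a multi-agent mechanism designed for SLMs that reduces the burden on each model through role specialization and aligns the agents through reinforcement learning on execution feedback, staying competitive at limited model scale. Viewed objectively, it is an efficient adaptation scheme that combines task decomposition with feedback-based learning.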

Abstract: Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy that is on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats-sql.


eess.IV [Back]

[140] SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion eess.IV | cs.CV

Hyeonjin Lee, Jun-Hyuk Kim, Jong-Seok Lee

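TL;DR: This paper proposes SLIM, a diffusion-based, semantics-aware low-bitrate image compression framework optimized for machine vision tasks. Using a pretrained latent diffusion model, the method compresses only the latent representation of the regions of interest (RoI) for machine vision, then enhances the decompressed latent with a text caption carrying the image's semantic information, achieving better machine vision task performance at low bitrates.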

Details

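Motivation: Existing image compression models are designed mainly for human vision and retain excessive perceptual detail, so they cannot reduce bitrate optimally for machine vision tasks. An efficient low-bitrate compression framework designed for machine vision is therefore needed.

Result: Experiments show that, at the same bitrate, SLIM achieves higher classification accuracy than conventional image compression models for machines.

Insight: The innovation is combining a pretrained diffusion model with semantically guided, RoI-focused compression and latent enhancement. Low-bitrate compression is achieved without a guide mask at inference, and the enhanced latent both optimizes machine vision task performance and retains perceptual detail for human vision.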

Abstract: In recent years, the demand for image compression models for machine vision has increased dramatically. However, training frameworks for image compression still focus on human vision and retain excessive perceptual detail, which limits how far they can reduce bits per pixel when the goal is a machine vision task. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new, effective training framework for image compression for machine vision that uses a pretrained latent diffusion model. The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, in order to compress it compactly. The pretrained Unet model then enhances the decompressed latent, using an RoI-focused text caption that contains semantic information about the image. SLIM is therefore able to focus on the RoI areas of the image without any guide mask at the inference stage, achieving a low bitrate when compressing. SLIM is also able to enhance the decompressed latent through denoising steps, so the final image reconstructed from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves higher classification accuracy at the same bits per pixel than conventional image compression models for machines. Code will be released upon acceptance.


cs.CR [Back]

[141] From Retrieval to Reasoning: A Framework for Cyber Threat Intelligence NER with Explicit and Adaptive Instructions cs.CR | cs.CL

Jiaren Peng, Hongda Sun, Xuan Tian, Cheng Huang, Zeqing Li

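TL;DR: This paper proposes TTPrompt, a framework for named entity recognition (NER) in cyber threat intelligence (CTI). It abandons the conventional retrieval-based in-context learning (ICL) paradigm in favor of explicit instruction. It maps the core CTI concepts of Tactics, Techniques, and Procedures (TTPs) onto an instruction hierarchy and adaptively optimizes the instructions through Feedback-driven Instruction Refinement (FIR) to handle different annotation conventions. Experiments on five CTI NER benchmarks show that TTPrompt outperforms retrieval-based baselines and, after refinement on only 1% of the training data, rivals models fine-tuned on the full dataset.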

Details

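Motivation: Large language models (LLMs) currently handle CTI NER mainly through retrieval-based in-context learning (ICL). The paper's analysis finds that the success of this mainstream paradigm stems not from global semantic similarity but largely from incidental overlap of entity types in the retrieved examples, exposing the limits of relying on unreliable implicit induction. A shift from implicit induction to explicit instruction is therefore needed to solve the task more reliably.

Result: Experiments on five CTI NER benchmarks (including LADDER and CTINexus) show that TTPrompt consistently outperforms retrieval-based baselines. After instruction refinement on only 1% of the training data, it rivals models fine-tuned on the full dataset. For example, on LADDER its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the more complex CTINexus its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.

Insight: The claimed innovations are: 1) the TTPrompt framework, which systematically maps the CTI concepts of TTPs (Tactics, Techniques, and Procedures) onto an explicit, hierarchical set of instructions that replaces unreliable implicit retrieval-based induction; and 2) the Feedback-driven Instruction Refinement (FIR) mechanism, which lets the LLM learn from its errors on a small amount of labeled data and adapt the instructions to different annotation conventions (dialects). Viewed objectively, structuring domain knowledge (TTPs) into prompt engineering and combining it with adaptive instruction optimization offers a novel and effective paradigm for domain-specific NER.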

Abstract: The automation of Cyber Threat Intelligence (CTI) relies heavily on Named Entity Recognition (NER) to extract critical entities from unstructured text. Currently, Large Language Models (LLMs) primarily address this task through retrieval-based In-Context Learning (ICL). This paper analyzes this mainstream paradigm, revealing a fundamental flaw: its success stems not from global semantic similarity but largely from the incidental overlap of entity types within retrieved examples. This exposes the limitations of relying on unreliable implicit induction. To address this, we propose TTPrompt, a framework shifting from implicit induction to explicit instruction. TTPrompt maps the core concepts of CTI’s Tactics, Techniques, and Procedures (TTPs) into an instruction hierarchy: formulating task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. Furthermore, to handle the adaptability challenge of static guidelines, we introduce Feedback-driven Instruction Refinement (FIR). FIR enables LLMs to self-refine guidelines by learning from errors on minimal labeled data, adapting to distinct annotation dialects. Experiments on five CTI NER benchmarks demonstrate that TTPrompt consistently surpasses retrieval-based baselines. Notably, with refinement on just 1% of training data, it rivals models fine-tuned on the full dataset. For instance, on LADDER, its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the complex CTINexus, its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.
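
The abstract reports Micro F1 on LADDER and Macro F1 on CTINexus, and the two averages can diverge sharply when entity types are imbalanced. The sketch below shows the standard definitions on hypothetical per-type counts. The entity names and numbers are illustrative only and are not taken from either benchmark.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps an entity type to its (tp, fp, fn) tuple.

    Micro F1 pools the counts over all types, so frequent types dominate.
    Macro F1 averages the per-type F1, so every type weighs the same.
    """
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: one frequent, well-recognized type and one rare, hard type.
example_counts = {"Malware": (90, 10, 10), "Tool": (1, 3, 5)}
micro, macro = micro_macro_f1(example_counts)
```

Here the rare type pulls Macro F1 well below Micro F1, which is why a gain on Macro F1 says more about performance on infrequent entity types.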


q-bio.QM [Back]

[142] Standardized Evaluation of Automatic Methods for Perivascular Spaces Segmentation in MRI – MICCAI 2024 Challenge Results q-bio.QM | cs.CV | eess.IV

Yilei Wu, Yichi Zhang, Zijian Dong, Fang Ji, An Sen Tan

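TL;DR: This paper presents the EPVS Challenge organized at MICCAI 2024, which aims to advance automatic segmentation of enlarged perivascular spaces (EPVS) on multi-site MRI data. The challenge provided a diverse dataset of 200 scans, and seven teams submitted solutions based on deep learning architectures such as U-Net; the best method used MedNeXt with a dual 2D/3D strategy. Models performed well on data from seen datasets but degraded significantly on the unseen Shanghai cohort, highlighting the generalization challenge posed by domain shift.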

Details

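Motivation: Enlarged perivascular spaces (EPVS) are important imaging markers of cerebral small vessel disease, but automatic segmentation remains challenging because of their small size, variable morphology, similarity to other pathological features, and limited annotated datasets. The challenge aims to advance robust, generalizable automatic EPVS segmentation through standardized evaluation.

Result: In the MICCAI 2024 EPVS Challenge, seven teams were evaluated using the Dice similarity coefficient, absolute volume difference, recall, and precision. The winning method used the MedNeXt architecture with a dual 2D/3D strategy for handling varying slice thicknesses. The top solutions performed relatively well on test data from seen datasets but degraded significantly on the previously unseen Shanghai cohort, revealing the difficulty of cross-site generalization.

Insight: The paper's contribution is a standardized, multi-site EPVS segmentation challenge that establishes an important benchmark. Viewed objectively, the results underline the critical impact of domain shift on generalization and show the potential of deep learning innovations such as dual 2D/3D strategies, multi-modal processing, and transformer-based components for small-object segmentation, while robustness across clinical settings still needs further work.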

Abstract: Perivascular spaces (PVS), when abnormally enlarged and visible in magnetic resonance imaging (MRI) structural sequences, are important imaging markers of cerebral small vessel disease and potential indicators of neurodegenerative conditions. Despite their clinical significance, automatic enlarged PVS (EPVS) segmentation remains challenging due to their small size, variable morphology, similarity with other pathological features, and limited annotated datasets. This paper presents the EPVS Challenge organized at MICCAI 2024, which aims to advance the development of automated algorithms for EPVS segmentation across multi-site data. We provided a diverse dataset comprising 100 training, 50 validation, and 50 testing scans collected from multiple international sites (UK, Singapore, and China) with varying MRI protocols and demographics. All annotations followed the STRIVE protocol to ensure standardized ground truth and covered the full brain parenchyma. Seven teams completed the full challenge, implementing various deep learning approaches primarily based on U-Net architectures with innovations in multi-modal processing, ensemble strategies, and transformer-based components. Performance was evaluated using dice similarity coefficient, absolute volume difference, recall, and precision metrics. The winning method employed MedNeXt architecture with a dual 2D/3D strategy for handling varying slice thicknesses. The top solutions showed relatively good performance on test data from seen datasets, but significant degradation of performance was observed on the previously unseen Shanghai cohort, highlighting cross-site generalization challenges due to domain shift. This challenge establishes an important benchmark for EPVS segmentation methods and underscores the need for the continued development of robust algorithms that can generalize in diverse clinical settings.
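
The four evaluation metrics named in the abstract can be sketched for flattened binary masks as follows. This is a minimal illustration: the challenge's exact definitions are not given here (for example, whether absolute volume difference is reported in physical units or as a percentage), and the `voxel_volume` parameter and the convention of returning 1.0 for empty masks are assumptions.

```python
def segmentation_metrics(pred, truth, voxel_volume=1.0):
    """Dice, absolute volume difference, recall, and precision for binary masks.

    pred and truth are equal-length sequences of 0/1 voxel labels.
    voxel_volume is a hypothetical scale factor (e.g. mm^3 per voxel).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    pred_count = tp + fp
    true_count = tp + fn
    total = pred_count + true_count
    return {
        # Dice = 2|P ∩ T| / (|P| + |T|); defined as 1.0 when both masks are empty
        "dice": 2 * tp / total if total else 1.0,
        "avd": abs(pred_count - true_count) * voxel_volume,
        "recall": tp / true_count if true_count else 1.0,
        "precision": tp / pred_count if pred_count else 1.0,
    }


# Toy example: 5 voxels, 2 overlapping, 1 false positive, 1 false negative.
scores = segmentation_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Note that the volume difference in this toy case is zero even though the masks disagree on two voxels, which is one reason overlap and volume metrics are reported together for small structures such as EPVS.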


cs.RO [Back]

[143] Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation cs.RO | cs.CL | cs.CV

Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura

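TL;DR: This paper proposes Affordance RAG, a zero-shot hierarchical multimodal retrieval framework for open-vocabulary mobile manipulation. It builds an Affordance-Aware Embodied Memory from pre-explored images, retrieves candidate targets based on regional and visual semantics, and reranks them with affordance scores, helping a robot carry out object pick-and-place tasks from free-form natural language instructions in real environments.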

Details

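Motivation: In open-vocabulary mobile manipulation, a robot must understand both visual semantics and the affordance of manipulation actions in order to carry a wide range of objects to suitable receptacles based on free-form natural language instructions.

Result: In large-scale indoor environments, the method outperforms existing approaches in retrieval performance for mobile manipulation instructions. In real-world experiments on indoor mobile manipulation from free-form instructions, it achieved a task success rate of 85%, surpassing existing methods in both retrieval performance and overall task success.

Insight: The innovation is an affordance-aware embodied memory and a hierarchical retrieve-then-rerank mechanism that integrates affordance scores into retrieval, improving in a zero-shot manner the robot's ability to understand and execute complex instructions.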

Abstract: In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
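
The retrieve-then-rerank pattern described in this abstract can be sketched as a two-stage ranking. The sketch assumes each candidate already carries a semantic similarity score and an affordance score, and it combines them with a hypothetical weighted sum. How Affordance RAG actually computes and combines these scores is not stated in the abstract.

```python
def retrieve_and_rerank(candidates, top_k=3, weight=0.5):
    """Shortlist by semantic score, then rerank the shortlist with affordance.

    candidates: list of dicts with 'name', 'semantic', and 'affordance' keys.
    weight: hypothetical mixing factor between the two scores (an assumption).
    """
    # Stage 1: keep the top_k candidates by semantic similarity alone.
    shortlist = sorted(candidates, key=lambda c: c["semantic"], reverse=True)[:top_k]

    # Stage 2: reorder the shortlist by a combined semantic and affordance score.
    def combined(c):
        return (1 - weight) * c["semantic"] + weight * c["affordance"]

    return sorted(shortlist, key=combined, reverse=True)


# Hypothetical candidates for an instruction such as "put the mug on the shelf".
candidates = [
    {"name": "mug_behind_glass", "semantic": 0.9, "affordance": 0.1},
    {"name": "mug_on_table", "semantic": 0.8, "affordance": 0.9},
    {"name": "cup_on_counter", "semantic": 0.7, "affordance": 0.6},
    {"name": "bowl_in_sink", "semantic": 0.2, "affordance": 1.0},
]
ranked = [c["name"] for c in retrieve_and_rerank(candidates)]
```

In this toy case the best semantic match drops to last place in the shortlist because it is unlikely to be graspable, which is the kind of correction the affordance score is meant to provide.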


[144] Robotic VLA Benefits from Joint Learning with Motion Image Diffusion cs.RO | cs.CV

Yu Fang, Kanchana Ranasinghe, Le Xue, Honglu Zhou, Juntao Tan

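TL;DR: This paper proposes a method that strengthens the motion reasoning of vision-language-action (VLA) models through joint learning with motion image diffusion. It adds a Diffusion Transformer-based motion head to the standard VLA architecture to predict optical-flow-based future motion images and trains it jointly with the action head, so that the shared vision-language model backbone learns representations that couple robot control with motion knowledge, improving performance without changing inference latency.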

Details

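Motivation: Existing VLA models typically mimic expert trajectories without predictive motion reasoning, which limits their decision-making, so their motion reasoning needs to be strengthened.

Result: In simulation and real-world experiments, the method raises the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, and improves real-world performance by 23%.

Insight: The innovation is a dual-head joint learning architecture (an action head plus a motion diffusion head) that brings motion prediction into VLA training as an auxiliary task, learning temporally coherent and physically grounded representations without affecting inference efficiency and strengthening the motion reasoning of large-scale VLAs.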

Abstract: Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. However, they typically mimic expert trajectories without predictive motion reasoning, which limits their ability to reason about what actions to take. To address this limitation, we propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Our method extends the VLA architecture with a dual-head design: while the action head predicts action chunks as in vanilla VLAs, an additional motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future dynamics. The two heads are trained jointly, enabling the shared VLM backbone to learn representations that couple robot control with motion knowledge. This joint learning builds temporally coherent and physically grounded representations without modifying the inference pathway of standard VLAs, thereby maintaining test-time latency. Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, yielding a 23% improvement in real-world performance and validating its effectiveness in enhancing the motion reasoning capability of large-scale VLAs.


[145] Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation cs.RO | cs.CV

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

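TL;DR: This paper introduces the Embodied4C benchmark for evaluating the core reasoning capabilities of vision-language models in embodied navigation. Using approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks across three heterogeneous platforms (autonomous vehicles, aerial drones, and robotic manipulators), it systematically assesses reasoning along four dimensions (semantic, spatial, temporal, and physical) and adds domain-far queries to prevent platform overfitting.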

Details

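Motivation: Current benchmarks offer limited understanding of how embodiment (the physical platform, sensor configuration, and modality alignment) influences perception, reasoning, and control, so a new closed-loop benchmark is needed to evaluate embodied reasoning comprehensively.

Result: A comprehensive evaluation of ten state-of-the-art vision-language models and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than model scale, while spatial and temporal reasoning are the primary bottleneck for reliable embodied competence.

Insight: The innovation is a closed-loop benchmark with heterogeneous embodied platforms, dynamic sensor configurations, and diverse environment variations, together with domain-far queries for assessing generalization. Viewed objectively, the benchmark systematically disentangles embodiment factors and offers a new perspective on the reasoning bottlenecks of models in the real world.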

Abstract: Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment – i.e., the choice of physical platform, sensor configuration, and modality alignment – influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments – autonomous vehicles, aerial drones, and robotic manipulators – through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remains the primary bottleneck for reliable embodied competence.


[146] STORM: Search-Guided Generative World Models for Robotic Manipulation cs.RO | cs.CV

Wenjun Lin, Jensen Zhang, Kaitong Cai, Keze Wang

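TL;DR: STORM is a new framework for spatio-temporal reasoning in robotic manipulation that combines diffusion-based action generation, conditional video prediction, and search-based planning. It plans through explicit visual rollouts, enabling interpretable and foresight-driven decisions. On the SimplerEnv benchmark, STORM reaches an average success rate of 51.0%, outperforming baselines such as CogACT, and shows robust re-planning and failure recovery.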

Details

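Motivation: Existing vision-language-action (VLA) models rely on abstract latent dynamics or delegate reasoning to language components. The paper uses explicit visual rollouts to make decisions more interpretable and foresight-driven and to improve long-horizon robotic manipulation.

Result: On the SimplerEnv manipulation benchmark, STORM achieves an average success rate of 51.0%, a new state of the art that outperforms baselines such as CogACT. Reward-augmented video prediction reduces Frechet Video Distance by over 75% and substantially improves spatio-temporal fidelity and task relevance.

Insight: The innovation is unifying diffusion-based action generation, a generative video world model, and Monte Carlo Tree Search (MCTS) in one framework. The search-guided generative world model enables interpretable visual planning and robust re-planning, providing a more reliable approach to spatio-temporal reasoning for robotic manipulation.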

Abstract: We present STORM (Search-Guided Generative World Models), a novel framework for spatio-temporal reasoning in robotic manipulation that unifies diffusion-based action generation, conditional video prediction, and search-based planning. Unlike prior Vision-Language-Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight-driven decision-making. A diffusion-based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state-of-the-art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward-augmented video prediction substantially improves spatio-temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re-planning and failure recovery behavior, highlighting the advantages of search-guided generative world models for long-horizon robotic manipulation.


[147] Offline Reinforcement Learning for End-to-End Autonomous Driving cs.RO | cs.CV

Chihiro Noguchi, Takaki Yamamoto

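TL;DR: This paper proposes a camera-only offline reinforcement learning framework for end-to-end autonomous driving. It uses behavior regularization to address the failure modes of imitation learning and, in a neural rendering environment built from the nuScenes dataset, substantially reduces collision rate and improves route completion.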

Details

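Motivation: End-to-end autonomous driving models that rely on imitation learning exhibit persistent failure modes, while online reinforcement learning is computationally expensive. The paper therefore exploits the data efficiency and fast iteration of offline reinforcement learning while addressing overestimation on out-of-distribution actions.

Result: In closed-loop evaluation in a neural rendering environment based on the nuScenes dataset, the method achieves substantial improvements in collision rate and route completion over imitation learning baselines.

Insight: The innovation is constructing pseudo ground-truth trajectories from expert driving logs and using them as a behavior regularization signal, which both suppresses imitation of unsafe behavior and stabilizes value learning, yielding a purely offline reinforcement learning framework that needs no additional exploration.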

Abstract: End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].


[148] WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving cs.RO | cs.CV

Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao

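TL;DR: WorldRFT is a planning-oriented latent world model framework for end-to-end autonomous driving. It aligns scene representation learning with the planning task through hierarchical planning decomposition and a local-aware interactive refinement mechanism, and uses reinforcement learning fine-tuning to improve safety-critical policy performance.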

Details

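Motivation: Existing latent world models use reconstruction-oriented representation learning that entangles perception with planning, leading to suboptimal optimization for planning. The paper addresses this with a planning-oriented framework.

Result: WorldRFT achieves state-of-the-art performance on both the open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces the collision rate by 83% (from 0.30% to 0.05%). On NavSim, using camera-only sensor input, it is competitive with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).

Insight: The main innovations are: 1) a planning-oriented latent world model framework that aligns representation learning with planning through hierarchical task decomposition and local-aware iterative refinement; 2) a vision-geometry foundation model that strengthens 3D spatial awareness; and 3) a GRPO-based reinforcement learning fine-tuning method that combines trajectory Gaussianization with collision-aware rewards to improve safety systematically.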

Abstract: Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (from 0.30% to 0.05%). On NavSim, using camera-only sensor input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
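
Two quantities in this abstract lend themselves to a short worked sketch: the reported 83% relative reduction in collision rate, and the group-relative advantage that gives GRPO its name. The advantage below follows the commonly used formulation, in which each reward is normalized by the mean and standard deviation of its sampled group. The reward values are hypothetical, and the paper's trajectory Gaussianization and collision-aware reward design are not reproduced.

```python
import statistics


def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled rollouts.

    Rollouts that score above the group mean get a positive advantage,
    those below get a negative one. eps guards against a zero std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Collision rate on nuScenes as reported: 0.30% before, 0.05% after.
reduction_pct = round(relative_reduction(0.30, 0.05) * 100, 1)

# Hypothetical rewards for four trajectories sampled for the same scene,
# e.g. 1.0 for a collision-free rollout and 0.0 for a collision.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The first calculation gives about 83.3%, consistent with the rounded figure in the abstract. The second shows why no learned value function is needed: the group itself serves as the baseline.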


[149] TwinAligner: Visual-Dynamic Alignment Empowers Physics-aware Real2Sim2Real for Robotic Manipulation cs.RO | cs.CV | cs.GR

Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan

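TL;DR: The paper proposes TwinAligner, a Real2Sim2Real system that closes the visual and dynamic gaps between simulation and reality. Its visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, and its dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction, improving the scalability of robot learning and the effectiveness of policy transfer.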

Details

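Motivation: Robotics depends on expensive real-world data, and the visual and dynamic gaps between simulation and reality limit progress in data-driven end-to-end learning. Closing these gaps is needed to use simulation data more effectively.

Result: Quantitative evaluations show that TwinAligner performs strongly in visual and dynamic real-to-sim alignment. Policies trained in simulation achieve strong zero-shot generalization to the real world, and policy performance is highly consistent between the real world and simulation, indicating its potential to advance scalable robot learning.

Insight: The innovations include SDF reconstruction and editable 3DGS rendering in the visual alignment module and rigid physics identification in the dynamic alignment module. The overall system establishes a trustworthy iterative cycle that accelerates algorithm development and offers a new approach to physics-aware Real2Sim2Real.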

Abstract: The robotics field is evolving towards data-driven, end-to-end learning, inspired by multimodal large models. However, reliance on expensive real-world data limits progress. Simulators offer cost-effective alternatives, but the gap between simulation and reality challenges effective policy transfer. This paper introduces TwinAligner, a novel Real2Sim2Real system that addresses both visual and dynamic gaps. The visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, while the dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction. TwinAligner improves robot learning by providing scalable data collection and establishing a trustworthy iterative cycle, accelerating algorithm development. Quantitative evaluations highlight TwinAligner’s strong capabilities in visual and dynamic real-to-sim alignment. This system enables policies trained in simulation to achieve strong zero-shot generalization to the real world. The high consistency between real-world and simulated policy performance underscores TwinAligner’s potential to advance scalable robot learning. Code and data will be released on https://twin-aligner.github.io


[150] Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface cs.RO | cs.CV | cs.GR

Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen

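TL;DR: This paper proposes Real2Edit2Real, a framework that generates new robot demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. The method first reconstructs scene geometry from multi-view RGB observations, then performs depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting robot poses to recover physically consistent depth, and finally synthesizes spatially augmented multi-view manipulation videos with a multi-conditional video generation model that uses depth as the primary control signal.

Details

Motivation: Robot learning depends on large-scale datasets, but collecting diverse demonstrations is costly, which particularly limits spatial generalization in manipulation tasks. To reduce repetitive data collection, the paper generates new demonstrations through 3D editing and thereby improves data efficiency.

Result: Experiments on four real-world manipulation tasks show that policies trained on data generated from only 1-5 source demonstrations can match or outperform policies trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x.

Insight: The novelty lies in combining 3D reconstruction, 3D editing, and video generation, using depth as a reliable condition for spatial augmentation to obtain an efficient data generation framework. Viewed objectively, geometric correction and multi-conditional control improve the physical consistency and generalization of the synthesized data, offering a unified and scalable solution for generating robot demonstrations.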

Abstract: Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework’s flexibility and extensibility, indicating its potential to serve as a unified data generation framework.


cs.LG

[151] Stable and Efficient Single-Rollout RL for Multimodal Reasoning cs.LG | cs.AI | cs.CL | cs.CV

Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou

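TL;DR: The paper proposes MSSR (Multimodal Stabilized Single-Rollout), a framework for stable and efficient single-rollout reinforcement learning in multimodal reasoning. An entropy-based advantage-shaping mechanism resolves training instability, yielding gains in both compute efficiency and performance.

Details

Motivation: Existing group-based RLVR algorithms such as GRPO require multi-rollout sampling, which is inefficient, whereas single-rollout variants are unstable in multimodal settings and prone to training collapse. The trade-off between efficiency and stability needs to be resolved.

Result: In in-distribution evaluations, MSSR reaches validation accuracy similar to the group-based baseline with half the training steps. With the same number of steps it surpasses the baseline and shows consistent generalization improvements across five reasoning-intensive benchmarks.

Insight: The novelty lies in applying an entropy-based advantage-shaping mechanism to the multimodal single-rollout setting and showing that it is essential for stability, resulting in a stable, compute-efficient, and effective RLVR method.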

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR’s performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.


[152] MAGIC: Achieving Superior Model Merging via Magnitude Calibration cs.LG | cs.AI | cs.CL | cs.CV

Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi

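TL;DR: This paper proposes MAGIC (Magnitude Calibration), a plug-and-play framework that addresses the performance degradation in model merging caused by neglecting perturbations to feature magnitude. It corrects the feature and weight magnitudes of the merged model through Feature Space Calibration (FSC), Weight Space Calibration (WSC), and their combination, Dual Space Calibration (DSC), improving performance without additional training.

Details

Motivation: Model merging aims to integrate the capabilities of multiple specialised models into a single unified model. Existing methods focus mainly on aligning feature direction and overlook that feature magnitude is easily perturbed by merging operations such as parameter fusion and sparsification, which causes the merged model to deviate from the behaviour of the original models and degrades performance.

Result: MAGIC consistently improves performance across a wide range of computer vision tasks (an average gain of 4.3% over eight datasets) and NLP tasks (+8.0% on Llama models) without additional training.

Insight: The paper is the first to highlight the key role of feature magnitude in calibrating model merging and proposes calibration methods that apply flexibly to both feature space and weight space. Viewed objectively, decomposing features into direction and magnitude and correcting the magnitude perturbation gives the model merging field a novel and effective direction for optimization.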

Abstract: The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model’s features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC
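To make the magnitude problem concrete, here is a minimal sketch of why averaging can shrink magnitude and how a rescaling step could restore it. This is not the paper's FSC or WSC procedure, whose exact formulas the abstract does not give. The function `calibrate_magnitude`, the choice of the mean L2 norm as the target, and the toy vectors are all assumptions made for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def calibrate_magnitude(merged, specialised, eps=1e-12):
    """Rescale `merged` so its norm matches the mean norm of the specialised vectors.

    Hypothetical calibration rule: the direction of `merged` is kept,
    only its magnitude is changed.
    """
    target = sum(l2_norm(v) for v in specialised) / len(specialised)
    scale = target / (l2_norm(merged) + eps)
    return [x * scale for x in merged]


# Two toy "layer weight" vectors from specialised models, both with norm 2.
task_a = [2.0, 0.0]
task_b = [0.0, 2.0]

# Plain averaging keeps a direction between the two but shrinks the norm to sqrt(2).
merged = [(a + b) / 2 for a, b in zip(task_a, task_b)]

calibrated = calibrate_magnitude(merged, [task_a, task_b])
```

In this toy case the averaged vector loses about 29% of its norm, and the rescaling restores it without changing direction. A real implementation would operate per layer on tensors rather than on short lists.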


[153] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies cs.LG | cs.AI | cs.CL

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao

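TL;DR: This paper proposes Bottom-up Policy Optimization (BuPO), a new reinforcement learning paradigm. By decomposing the internal policies of a large language model (LLM), it reveals how different layers and modules (such as self-attention and feed-forward networks) contribute to the final policy, and finds that early layers keep high entropy for exploration while top layers converge to near-zero entropy for refinement. Building on this, BuPO directly optimizes internal layer policies during early training and achieves superior performance on complex reasoning benchmarks.

Details

Motivation: Existing reinforcement learning methods treat a large language model as a single unified policy and overlook its internal mechanisms. Understanding how the policy evolves across layers and modules is crucial for more targeted optimization and for uncovering complex reasoning mechanisms.

Result: Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of the method, which achieves superior performance.

Insight: The novelty lies in combining the intrinsic split of the Transformer residual stream with policy decomposition, introducing the concepts of Internal Layer Policies and Internal Modular Policies, and using an analysis of internal policy entropy to show that different model families (such as LLaMA and Qwen) have different convergence patterns. BuPO reconstructs foundational reasoning capabilities by directly optimizing the internal policies of early layers, which is a novel and more targeted view of optimization.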

Abstract: Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) Early layers keep high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama’s prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
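The idea of reading a policy off an intermediate hidden state can be sketched as follows: project the hidden state through the unembedding matrix, apply a softmax, and measure the entropy of the resulting distribution. This is a toy illustration under simplifying assumptions, not the authors' code. The helper names, the tiny 2x3 unembedding matrix, and the omission of any final normalization layer are all choices made here for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def internal_policy(hidden, unembed):
    """Map a hidden state (length d) to a token distribution via a d x V unembedding matrix."""
    vocab_size = len(unembed[0])
    logits = [
        sum(hidden[i] * unembed[i][v] for i in range(len(hidden)))
        for v in range(vocab_size)
    ]
    return softmax(logits)


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical toy setup: hidden size 2, vocabulary size 3.
unembed = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
early_hidden = [0.1, 0.1]   # small activations give a nearly uniform policy
late_hidden = [8.0, 0.0]    # a sharp activation gives a nearly deterministic policy

early_entropy = entropy(internal_policy(early_hidden, unembed))
late_entropy = entropy(internal_policy(late_hidden, unembed))
```

With these toy values the early-layer policy has entropy close to the maximum of ln 3, while the late-layer policy has entropy near zero, mirroring the exploration-then-refinement pattern the abstract describes.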


cs.SE

[154] Toward Training Superintelligent Software Agents through Self-Play SWE-RL cs.SE | cs.AI | cs.CL | cs.LG

Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang

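TL;DR: This paper proposes Self-play SWE-RL (SSR), a training paradigm that aims to train superintelligent software agents through self-play reinforcement learning. The method only requires access to sandboxed repositories, with no human-labeled issues or tests. A single LLM agent iteratively injects and repairs software bugs of increasing complexity in self-play, and training is grounded in each bug's formal test patch rather than a natural language description.

Details

Motivation: The training data (e.g., GitHub issues) and environments (e.g., tests) of current software agents based on LLMs and reinforcement learning depend heavily on human knowledge or curation, which is a fundamental barrier to superintelligence. The paper explores a training method that minimizes dependence on human data.

Result: On the SWE-bench Verified and SWE-bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, even though it is evaluated on natural language issues that never appear in self-play.

Insight: The novelty is a fully self-play training paradigm for software agents that relies only on repositories and their dependencies and performs reinforcement learning on bugs defined by formal test patches, removing the need for natural language issue descriptions. This offers a possible path for agents to accumulate learning experience autonomously from real software repositories and eventually reach superintelligent systems that exceed human capabilities.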

Abstract: While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach makes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.


[155] Code2Doc: A Quality-First Curated Dataset for Code Documentation cs.SE | cs.AI | cs.CL

Recep Kaan Karaman, Meftun Akarsu

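TL;DR: This paper introduces Code2Doc, a quality-first curated dataset for code documentation generation. The dataset contains 13,358 high-quality function-documentation pairs covering five programming languages: Python, Java, TypeScript, JavaScript, and C++. A four-stage curation pipeline enforces documentation completeness and clarity and filters out duplicate code and AI-generated content. Experiments show that fine-tuning a large language model on the dataset significantly improves code documentation generation.

Details

Motivation: Existing code documentation datasets are usually built by large-scale scraping of public repositories with limited quality control. They suffer from noisy documentation, extensive duplication, and contamination from AI-generated content, which weakens the supervision signal for learning-based models and complicates evaluation.

Result: Fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance.

Insight: The novelty is a rigorous, quality-first dataset construction process that prioritizes documentation quality over scale and improves data purity through systematic filtering (such as structural and complexity criteria, deduplication, and identification of AI-generated content), providing a more reliable source of supervision for code documentation generation.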

Abstract: The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI-generated. Starting from 52,069 extracted candidates, only 25.6% satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9% are flagged as potentially AI-generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
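One stage of the pipeline, the removal of exact and near-duplicate code, might look roughly like the sketch below. The abstract does not say how near-duplicates are detected, so the token-set Jaccard similarity, the 0.8 threshold, and every function name here are assumptions for illustration only.

```python
import re


def tokens(code):
    """Split source text into a set of identifiers, integers, and punctuation marks."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(functions, threshold=0.8):
    """Keep a function only if it is below the similarity threshold to everything kept so far."""
    kept, kept_tokens = [], []
    for source in functions:
        toks = tokens(source)
        if all(jaccard(toks, seen) < threshold for seen in kept_tokens):
            kept.append(source)
            kept_tokens.append(toks)
    return kept


candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b  ",          # differs only in trailing whitespace
    "def add(x, b):\n    return x + b",            # one parameter renamed
    "def area(r):\n    return 3.14159 * r * r",    # genuinely different function
]
kept = drop_near_duplicates(candidates)
```

Here the whitespace variant and the renamed variant are both dropped, leaving the first and last candidates. The pairwise comparison is quadratic in the number of kept functions, so at the scale of tens of thousands of candidates a hashing scheme such as MinHash would be a more practical choice.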


cs.CY

[156] Epistemological Fault Lines Between Human and Artificial Intelligence cs.CY | cs.CL | cs.HC

Walter Quattrociocchi, Valerio Capraro, Matjaž Perc

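TL;DR: This paper examines the fundamental epistemological differences between large language models (LLMs) and human cognition. It argues that LLMs are not genuine epistemic agents but stochastic pattern-completion systems that operate on high-dimensional graphs of linguistic transitions, and that their linguistic plausibility substitutes for genuine epistemic evaluation, producing a structural condition the authors call 'Epistemia'.

Details

Motivation: LLMs are widely described as artificial intelligence, yet their epistemic profile is deeply and structurally mismatched with human cognition. The paper aims to clarify the nature of LLMs and to identify where exactly they diverge from humans in the cognitive process.

Result: By systematically mapping the human and artificial epistemic pipelines, the paper identifies seven epistemic fault lines: divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. It does not report quantitative benchmarks or SOTA comparisons.

Insight: The novelty lies in proposing the concept of 'Epistemia', which emphasizes that LLMs are in essence stochastic pattern-completion systems, and in systematically identifying seven specific epistemic fault lines. This provides a theoretical framework for evaluation, governance, and epistemic literacy and supports a more objective understanding of the limitations of generative AI.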

Abstract: Large language models (LLMs) are widely described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. Here we show that the apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. Tracing the historical shift from symbolic AI and information filtering systems to large-scale generative transformers, we argue that LLMs are not epistemic agents but stochastic pattern-completion systems, formally describable as walks on high-dimensional graphs of linguistic transitions rather than as systems that form beliefs or models of the world. By systematically mapping human and artificial epistemic pipelines, we identify seven epistemic fault lines, divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. We call the resulting condition Epistemia: a structural situation in which linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. We conclude by outlining consequences for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.


cs.MM

[157] Asynchronous Pipeline Parallelism for Real-Time Multilingual Lip Synchronization in Video Communication Systems cs.MM | cs.AI | cs.CV | cs.DC | cs.NI

Eren Caglar, Amirkia Rafiei Oskooei, Mehmet Kutanoglu, Mustafa Keles, Mehmet S. Aktas

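TL;DR: This paper proposes a parallel, asynchronous Transformer framework for efficient multilingual lip synchronization in real-time video conferencing systems. The architecture integrates translation, speech processing, and lip-synchronization modules in a pipeline-parallel design, uses message-queue-based decoupling to run modules concurrently, and reduces end-to-end latency by up to 3.1x compared with sequential approaches. The inference workflow is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion, which substantially improves computational efficiency and throughput while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and cross-lingual temporal alignment. Experiments show that the parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization.

Details

Motivation: The paper addresses the latency and efficiency problems of multilingual lip synchronization in real-time video communication systems, in particular the need for low-latency, accurate synchronization in resource-constrained AIoT scenarios such as telemedicine and multilingual kiosks.

Result: In real-time video conferencing systems, the approach reduces end-to-end latency by up to 3.1x compared with sequential pipelines and performs better in processing speed, synchronization stability, and resource utilization, making it suitable for resource-constrained IoT communication scenarios.

Insight: The novelties are: 1) an asynchronous pipeline-parallel architecture based on message queues that decouples modules and runs them concurrently; 2) inference optimizations that combine low-level graph compilation, mixed-precision quantization, and kernel fusion; 3) context-adaptive silence detection that improves the temporal alignment and consistency of multilingual translation. The framework offers a scalable, low-latency multimodal communication solution for AIoT systems.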

Abstract: This paper introduces a parallel and asynchronous Transformer framework designed for efficient and accurate multilingual lip synchronization in real-time video conferencing systems. The proposed architecture integrates translation, speech processing, and lip-synchronization modules within a pipeline-parallel design that enables concurrent module execution through message-queue-based decoupling, reducing end-to-end latency by up to 3.1 times compared to sequential approaches. To enhance computational efficiency and throughput, the inference workflow of each module is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion. These optimizations provide substantial gains in efficiency while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. Experimental results demonstrate that the proposed parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization. The modular, message-oriented design makes this work applicable to resource-constrained IoT communication scenarios including telemedicine, multilingual kiosks, and remote assistance systems. Overall, this work advances the development of low-latency, resource-efficient multimodal communication frameworks for next-generation AIoT systems.
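The message-queue decoupling described above can be illustrated with a small sketch in which three stages are connected by queues, so that a later stage can work on one segment while an earlier stage handles the next. This is only a structural illustration using Python's `asyncio`. The stage functions are placeholder string transformations rather than real translation, speech, or lip-sync models, and a production system would more plausibly use separate processes or a message broker for true parallelism.

```python
import asyncio

STOP = object()  # sentinel that tells each stage to shut down


async def stage(inbox, outbox, work):
    """Consume items from inbox, apply work, and forward results to outbox."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # propagate shutdown downstream
            return
        await outbox.put(work(item))


async def run_pipeline(segments):
    """Push speech segments through translate -> speech -> lipsync stages."""
    q_in, q_text, q_speech, q_out = (asyncio.Queue() for _ in range(4))
    workers = [
        asyncio.create_task(stage(q_in, q_text, lambda s: f"translated({s})")),
        asyncio.create_task(stage(q_text, q_speech, lambda s: f"speech({s})")),
        asyncio.create_task(stage(q_speech, q_out, lambda s: f"lipsync({s})")),
    ]
    for segment in segments:
        await q_in.put(segment)
    await q_in.put(STOP)

    results = []
    while True:
        item = await q_out.get()
        if item is STOP:
            break
        results.append(item)
    await asyncio.gather(*workers)
    return results


results = asyncio.run(run_pipeline(["seg-1", "seg-2"]))
```

Because each stage has a single worker and the queues are first-in, first-out, segments leave the pipeline in the order they entered, which matters for keeping audio and video aligned.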


cs.SD

[158] Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning cs.SD | cs.CV | cs.LG

Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao

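TL;DR: This paper introduces PE-AV, a family of encoders that use large-scale contrastive learning for audio and video understanding, support joint embeddings across audio-video, audio-text, and video-text, and set a new state of the art (SOTA) on standard audio and video benchmarks.

Details

Motivation: The paper addresses the limitations of existing methods for cross-modal audio-video representation, in particular the lack of unified multimodal embeddings and of high-quality, large-scale supervised data.

Result: PE-AV sets a new SOTA on standard audio and video benchmarks, and PE-A-Frame enables fine-grained audio-frame-to-text alignment, improving tasks such as sound event detection.

Insight: The novelties include building a high-quality audiovisual data engine that generates large-scale annotations, using ten pairwise contrastive objectives to strengthen cross-modal alignment, and achieving fine-grained alignment through frame-level contrastive fine-tuning, which avoids the single-domain limitations of prior work.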

Abstract: We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV’s unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
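A single pairwise contrastive objective of the kind mentioned above is commonly written as a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below shows that generic form for one modality pair. Whether PE-AV uses exactly this loss, this temperature, or a plain sum over its ten pairs is not stated in the abstract, so treat those details and the toy embeddings as assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE terms for one modality pair."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Toy batch of two audio clips with matching and mismatched video embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]
video_swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_contrastive_loss(audio, video_aligned)
swapped_loss = symmetric_contrastive_loss(audio, video_swapped)
```

The loss is small when each audio clip is closest to its own video and large when the pairing is swapped. A multi-pair objective could combine one such term per modality or caption-type pair.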


cs.AI

[159] External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning cs.AI | cs.CL | cs.LG

Jian Yan

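TL;DR: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. The framework builds topological cognitive maps through dimensionality-reduction projection, enabling precise navigation and intervention of the energy flow at test time without additional training, and effectively resolves the cognitive deadlock problem in multi-step reasoning for small models.

Details

Motivation: Traditional weight-space optimization methods are hard to intervene in at test time and computationally expensive. The paper targets effective intervention for the cognitive deadlock that small models (≤7B parameters) exhibit in multi-step reasoning, such as the "Cognitive Vortex" and low-entropy potential wells.

Result: Experiments on models with ≤7B parameters show that the map-guided method reaches 81.20% accuracy on 500 challenging problems (+16.80% relative to the baseline), reduces reasoning time by ≥15x, and that temperature perturbations effectively restart the energy flow.

Insight: The novelty lies in modeling the reasoning process as an energy flow and building interpretable topological cognitive maps for real-time intervention. The framework can grow autonomously and provides an efficient, controllable, topology-aware solution for small-model reasoning.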

Abstract: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models with <=7B parameters show that map-guided methods achieve 81.20% accuracy on 500 challenging problems (+16.80% relative to the baseline) and reduce reasoning time by >=15x. Key findings reveal that reasoning stagnation manifests as a “Cognitive Vortex” and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topology-aware solution for small model reasoning.


[160] NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics – Explainable Medical AI cs.AI | cs.CV

Midhat Urooj, Ayan Banerjee, Sandeep Gupta

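TL;DR: NEURO-GUARD is a novel knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning based on large language models (LLMs) to improve the accuracy, interpretability, and cross-domain robustness of medical image diagnosis. It uses a retrieval-augmented generation (RAG) mechanism for self-verification, in which the LLM iteratively generates, evaluates, and refines feature-extraction code for medical images, surpassing purely data-driven baselines.

Details

Motivation: The paper addresses the core challenge that accuracy and interpretability are hard to achieve together in medical AI, especially in settings with limited data, subtle visual cues, and high-stakes clinical decisions. Most existing vision models are black boxes with poor interpretability and weak cross-domain generalization, which hinders clinical adoption.

Result: On diabetic retinopathy classification, experiments on four benchmark datasets (APTOS, EyePACS, Messidor-1, and Messidor-2) show that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline (84.69% vs. 78.4%) and improves cross-domain generalization by 5%. Further evaluation on MRI-based seizure detection confirms its cross-domain robustness: it consistently outperforms existing methods and reaches state-of-the-art (SOTA) performance on multiple datasets.

Insight: The claimed novelty is the combination of symbolic medical reasoning (guided by an LLM and clinical knowledge) with subsymbolic visual learning (ViT) in an interpretable, knowledge-aware, and generalizable diagnostic framework. Viewed objectively, the core innovation is using RAG to let the LLM dynamically generate and refine feature-extraction code, which gives the model's decision process self-verification and knowledge-based iterative improvement and opens a new path toward more reliable and transparent medical AI systems.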

Abstract: Accurate yet interpretable image-based diagnosis remains a central challenge in medical AI, particularly in settings characterized by limited data, subtle visual cues, and high-stakes clinical decision-making. Most existing vision models rely on purely data-driven learning and produce black-box predictions with limited interpretability and poor cross-domain generalization, hindering their real-world clinical adoption. We present NEURO-GUARD, a novel knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance, transparency, and domain robustness. NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. By grounding this process in clinical guidelines and expert knowledge, the framework progressively enhances feature detection and classification beyond purely data-driven baselines. Extensive experiments on diabetic retinopathy classification across four benchmark datasets (APTOS, EyePACS, Messidor-1, and Messidor-2) demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline (84.69% vs. 78.4%) and achieves a 5% gain in domain generalization. Additional evaluations on MRI-based seizure detection further confirm its cross-domain robustness, consistently outperforming existing methods. Overall, NEURO-GUARD bridges symbolic medical reasoning with subsymbolic visual learning, enabling interpretable, knowledge-aware, and generalizable medical image diagnosis while achieving state-of-the-art performance across multiple datasets.


[161] ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning cs.AI | cs.CV

Weijie Zhou, Xuangtang Xiong, Ye Tian, Lijun Yue, Xinyu Wu

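TL;DR: This paper proposes ESearch-R1, a cost-aware embodied reasoning framework that trains multimodal large language model (MLLM) agents with reinforcement learning so that, during embodied search, they actively weigh the cost of physical exploration against the cost of human interaction to optimize overall task execution efficiency.

Details

Motivation: When facing ambiguous natural language instructions, current MLLM-based embodied agents tend to treat disambiguation passively as a perception problem. They lack the reasoning ability to trade off the cost of physical exploration against the cognitive cost of interaction, which leads to high total task execution costs.

Result: Extensive experiments in the AI2-THOR simulation environment show that ESearch-R1 significantly outperforms standard ReAct-based agents, improving task success rates while reducing total operational costs by approximately 50%, which validates the approach.

Insight: The core novelty is a unified cost-aware decision framework (ESearch-R1) together with the Heterogeneous Cost-Aware Group Relative Policy Optimization algorithm (HC-GRPO). Rather than relying on a separate value critic, HC-GRPO optimizes the MLLM by sampling reasoning trajectories and reinforcing those that achieve the best trade-off between information gain and heterogeneous costs (such as navigation time and human attention), which better aligns MLLM agents with physical-world constraints.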

Abstract: Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., “fetch the tool” in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning to minimize total task execution costs. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO, which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigation time and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents. It improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical world constraints.
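The critic-free, group-relative idea can be sketched in a few lines: score each sampled trajectory with a reward that subtracts weighted costs from task success, then normalize rewards within the group to obtain advantages. The reward shape, the cost weights, and the field names below are hypothetical, since the abstract does not specify HC-GRPO's actual reward. Only the within-group normalization follows the standard GRPO recipe.

```python
import statistics


def cost_aware_reward(success, nav_seconds, questions_asked,
                      nav_weight=0.01, ask_weight=0.2):
    """Hypothetical reward: task success minus weighted navigation and interaction costs."""
    return float(success) - nav_weight * nav_seconds - ask_weight * questions_asked


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, so no value critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three trajectories sampled for the same ambiguous instruction.
group = [
    {"success": True, "nav_seconds": 60, "questions_asked": 0},   # searched for a long time
    {"success": True, "nav_seconds": 10, "questions_asked": 1},   # asked once, then went directly
    {"success": False, "nav_seconds": 30, "questions_asked": 0},  # gave up without asking
]
rewards = [cost_aware_reward(**trajectory) for trajectory in group]
advantages = group_relative_advantages(rewards)
```

With these illustrative weights, the trajectory that asks one clarifying question and then navigates directly receives the highest advantage, and the failed trajectory receives a negative one. Changing the weights shifts the balance between exploring and asking.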


econ.GN

[162] Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918) econ.GN | cs.CV | cs.DL

Niclas Griesshaber, Jochen Streb

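TL;DR: This paper presents an automated pipeline based on multimodal large language models (LLMs) for constructing datasets from archival image scans. Using Gemini-2.5-Pro and Gemini-2.5-Flash-Lite, it extracts a dataset of 306,070 German patents (1877-1918) from 9,562 images. Benchmarking indicates that the pipeline clearly outperforms manual work in quality, speed, and cost, and the authors open-source the datasets and code.

Details

Motivation: The paper addresses the difficulty of building high-quality structured datasets efficiently and cheaply from archival images with complex typography (a double-column format in Gothic and Roman fonts), in order to lower the technical barrier for research in fields such as economic history.

Result: The benchmarking exercise provides tentative evidence that the datasets built by multimodal LLMs are of higher quality than those built by research assistants, while constructing the patent dataset more than 795 times faster and more than 205 times cheaper.

Insight: The claimed novelty is the use of multimodal LLMs to process historical documents with complex fonts and layouts, which the authors describe as a paradigm shift in dataset construction. Viewed objectively, the core contribution is a reusable, LLM-based automated data extraction pipeline whose feasibility is supported by a detailed economic analysis (comparing speed, cost, and quality), offering an easily adaptable solution for similar processing of other image corpora.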

Abstract: We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.


stat.ML

[163] Disentangled representations via score-based variational autoencoders stat.ML | cs.CV | cs.LG

Benjamin S. H. Lyo, Eero P. Simoncelli, Cristina Savin

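TL;DR: This paper proposes SAMI, an unsupervised representation learning method that combines the theoretical frameworks of diffusion models and variational autoencoders. By unifying their evidence lower bounds, it formulates a principled objective based on score-guided diffusion and learns disentangled representations that automatically capture meaningful structure in the data.

Details

Motivation: The goal is to combine the theoretical frameworks of diffusion models and variational autoencoders to learn disentangled representations that automatically capture the intrinsic structure of data (such as generative factors and semantic dimensions), and to make the structural information implicit in diffusion models explicit and interpretable.

Result: SAMI recovers the ground-truth generative factors on a synthetic dataset, learns factorized semantic latent dimensions from complex natural images, encodes video sequences into latent trajectories that are straighter than those of alternative encoders despite training only on static images, and extracts useful representations from pre-trained diffusion models with minimal additional training.

Insight: The novelty lies in unifying the evidence lower bounds of diffusion models and variational autoencoders into a principled, score-guided learning objective that achieves unsupervised disentangled representation learning of multiscale structure in data. It also provides new ways to identify semantic axes without supervised labels, and its mathematical exactness permits formal statements about the properties of the learned representation.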

Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: SAMI recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.