Table of Contents
- cs.CL [Total: 31]
- cs.CV [Total: 54]
- cs.CY [Total: 1]
- cs.RO [Total: 5]
- eess.AS [Total: 1]
- cs.SI [Total: 1]
- cs.AI [Total: 3]
- cs.SD [Total: 1]
- cs.LG [Total: 4]
- cs.HC [Total: 2]
cs.CL [Back]
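[1] Likelihood-Based Reward Designs for General LLM Reasoning cs.CL [PDF]
Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, Yann Ollivier
TL;DR: This paper systematically studies reward functions based on probability or log-probability for reasoning fine-tuning of large language models (LLMs). It finds that, for chain-of-thought (CoT) learning, using the log-probability of the reference answer as the reward performs well in both verifiable and non-verifiable answer settings, outperforms probability-based methods, and is comparable to or better than standard binary rewards.
Details
Motivation: Conventional reinforcement-learning fine-tuning of LLMs relies on benchmark-specific binary reward functions, which must be designed by hand and can be sparse. The paper explores a general reward design that needs no external verifier and is available at scale.
Result: On standard mathematical reasoning benchmarks and on long-form answer generation (no external verifier), log-probability rewards achieve success rates comparable to or better than standard binary rewards in verifiable settings, with much better perplexity, and perform on par with supervised fine-tuning (SFT) in non-verifiable settings. Probability-based methods such as VeriFree fail in non-verifiable settings because the probability of the correct answer approaches zero.
Insight: The contribution lies in proposing and validating a general reward design based on the log-probability of the reference answer. It is consistent with the next-token log-likelihood loss used in pretraining and bridges verifiable and non-verifiable answer settings, offering a simple and effective alternative for CoT fine-tuning.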
Abstract: Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
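The contrast between probability and log-probability rewards described in this abstract can be illustrated with a minimal sketch. The function names and the per-token probability of 0.9 are illustrative assumptions, not values or code from the paper; the point is only that a sequence probability shrinks multiplicatively with answer length while its logarithm shrinks linearly.

```python
import math

def log_prob_reward(token_log_probs):
    """Reward = log-probability of the reference answer (sum over its tokens)."""
    return sum(token_log_probs)

def prob_reward(token_log_probs):
    """Reward = probability of the reference answer (exp of the summed log-probs)."""
    return math.exp(sum(token_log_probs))

# Hypothetical numbers: the model assigns probability 0.9 to every reference token.
short_answer = [math.log(0.9)] * 5      # e.g. a short, verifiable numeric answer
long_answer = [math.log(0.9)] * 2000    # e.g. a long-form, non-verifiable answer

for name, log_probs in [("short", short_answer), ("long", long_answer)]:
    print(name, log_prob_reward(log_probs), prob_reward(log_probs))
```

Under these assumed numbers the probability reward for the long answer is numerically negligible, so it carries almost no learning signal, whereas the log-probability reward stays on a usable scale.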
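[2] DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling cs.CL [PDF]
Jiangnan Yang, Junjie Chen, Fei Wang, Yiqi Nie, Yuxin Liu
TL;DR: The paper proposes DELTA, a reinforcement-learning-based deliberative multi-agent reasoning framework for multimodal psychological counseling. It models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation, and uses reinforcement learning to optimize emotionally attuned responses.
Details
Motivation: Existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference, whereas counseling is inherently a multimodal cognitive process that integrates verbal, visual, and vocal cues to infer mental states and respond empathically.
Result: On a multimodal counseling benchmark, DELTA improves counseling quality and emotion attunement. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
Insight: The novelty lies in modeling counseling as a structured multi-agent reasoning process and in introducing reinforcement learning guided by a distribution-level Emotion Attunement Score to optimize responses. Viewed objectively, combining explicit multimodal reasoning with structured representations is key to improving an AI system's capacity for empathic interaction.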
Abstract: Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients’ mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
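[3] From Lemmas to Dependencies: What Signals Drive Light Verbs Classification? cs.CL | cs.AI [PDF]
Sercan Karakaş, Yusuf Şimşek
TL;DR: This paper studies the classification of light verb constructions (LVCs) in Turkish, probing which signals drive classification by systematically restricting model inputs. It compares lemma-based baselines, a grammar-only model, and a full-input BERTurk baseline, evaluated on a diagnostic set containing random negatives, lexical controls, and LVC positives.
Details
Motivation: Turkish has rich morphology and productive complex predicates, which create subtle contrasts between idiomatic predicate meanings and literal verb–argument uses and make LVC classification challenging. The paper asks which signals (e.g., lemmas, grammatical features) drive LVC classification.
Result: Under controlled contrasts, coarse morphosyntactic features alone are insufficient for robust LVC detection. Lexical identity (lemmas) supports LVC judgments but is sensitive to calibration and normalization choices. The paper reports split-wise performance to expose decision-boundary behavior.
Insight: The contribution is a systematic evaluation of how different input signals (lemma vs. grammar) affect Turkish LVC classification. It finds that "lemma-only" is not a single, well-defined representation: its effectiveness depends critically on how normalization is operationalized, which motivates targeted evaluation of Turkish multiword expressions (MWEs).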
Abstract: Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb–argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF–IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, our findings motivate targeted evaluation of Turkish MWEs and show that “lemma-only” is not a single, well-defined representation, but one that depends critically on how normalization is operationalized.
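[4] The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment cs.CL | cs.LG [PDF]
Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui
TL;DR: This paper presents the first systematic study of implicit safety risks that emerge during AI model training, i.e., harmful behaviors driven by a model's internal incentives and contextual background information, as opposed to known deployment-time risks such as jailbreak attacks. The authors propose a taxonomy with five risk levels, ten fine-grained categories, and three incentive types, and show through extensive experiments that these risks are prevalent and severe; for example, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when given only background information. The study also analyzes influencing factors and shows that implicit training-time risks arise in multi-agent training settings as well, pointing to an overlooked but urgent safety challenge.
Details
Motivation: Research on AI safety risks has focused mainly on deployment time (e.g., jailbreak attacks), while implicit safety risks that emerge during training (harmful behaviors driven by internal incentives and background information) remain largely unexplored. The paper aims to fill this gap.
Result: Experiments show that implicit training-time risks are prevalent and severe: Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. The study also analyzes the factors that influence these behaviors and confirms that such risks arise in multi-agent training.
Insight: The contribution is the first systematic definition of implicit training-time safety risks, a detailed taxonomy, and empirical evidence of their prevalence. This broadens the scope of AI safety research, underscores the importance of monitoring a model's internal incentives and contextual background during training, and suggests new directions for the design of future safety mechanisms.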
Abstract: Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model’s internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
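[5] Language Models Struggle to Use Representations Learned In-Context cs.CL | cs.AI [PDF]
Michael A. Lepori, Tal Linzen, Ann Yuan, Katja Filippova
TL;DR: This paper investigates whether large language models (LLMs) can effectively use representations learned in-context to complete downstream tasks. It finds that although LLMs can induce representations of novel semantics in-context, they struggle considerably to deploy them flexibly on simple downstream tasks such as next-token prediction and adaptive world modeling; even the most performant closed-source models are unreliable.
Details
Motivation: Although current LLMs can learn in-context, their ability to adapt to radically new contexts and flexibly use what they have learned still falls short of one of the loftier goals of artificial intelligence research. The paper asks whether LLMs can use in-context representations to complete simple tasks.
Result: On next-token prediction and adaptive world modeling, open-weights LLMs struggle to deploy representations of novel semantics defined in-context, even when they encode those semantics in their latent representations. On adaptive world modeling, closed-source state-of-the-art reasoning models also fail to reliably leverage novel patterns presented in-context.
Insight: The paper reveals a gap in LLMs between learning representations and using them; its novelty lies in designing tasks such as adaptive world modeling to evaluate this ability systematically. This motivates future work on methods that lead models not only to encode in-context information but to encode it in a way that supports flexible deployment.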
Abstract: Though large language models (LLMs) have enabled great success across a wide variety of tasks, they still appear to fall short of one of the loftier goals of artificial intelligence research: creating an artificial system that can adapt its behavior to radically new contexts upon deployment. One important step towards this goal is to create systems that can induce rich representations of data that are seen in-context, and then flexibly deploy these representations to accomplish goals. Recently, Park et al. (2024) demonstrated that current LLMs are indeed capable of inducing such representations from context (i.e., in-context representation learning). The present study investigates whether LLMs can use these representations to complete simple downstream tasks. We first assess whether open-weights LLMs can use in-context representations for next-token prediction, and then probe models using a novel task, adaptive world modeling. In both tasks, we find evidence that open-weights LLMs struggle to deploy representations of novel semantics that are defined in-context, even if they encode these semantics in their latent representations. Furthermore, we assess closed-source, state-of-the-art reasoning models on the adaptive world modeling task, demonstrating that even the most performant LLMs cannot reliably leverage novel patterns presented in-context. Overall, this work seeks to inspire novel methods for encouraging models to not only encode information presented in-context, but to do so in a manner that supports flexible deployment of this information.
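[6] CoLT: Reasoning with Chain of Latent Tool Calls cs.CL [PDF]
Fangwei Zhu, Zhifang Sui
TL;DR: This paper proposes CoLT, a new framework that implements latent reasoning as "tool calls". The main model generates seed tokens carrying information about a reasoning step; when a latent tool call is triggered, a smaller external model unpacks their hidden states into a full reasoning step, improving efficiency while preserving the main model's ability to reason in the explicit token space.
Details
Motivation: Existing latent reasoning methods generally require model structure augmentation and extensive training, which limits their broader applicability. CoLT aims to improve LLM reasoning efficiency without reasoning entirely in the latent space.
Result: Experiments on four mathematical datasets show that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
Insight: The novelty is designing latent reasoning as a tool-calling mechanism in which seed tokens and an external model cooperate, reducing computational overhead while preserving the main model's reasoning ability. The summary notes that the approach needs no modification of the model structure and is reasonably general and extensible.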
Abstract: Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as “tool calls”. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information about a reasoning step. When a latent tool call is triggered, a smaller external model will take the hidden states of seed tokens as its input, and unpack the seed tokens back to a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
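[7] Scaling Agentic Verifier for Competitive Coding cs.CL [PDF]
Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang
TL;DR: This paper proposes Agentic Verifier, an execution-based agent that addresses the difficulty LLMs have in solving competitive programming problems correctly in a single attempt. Through multi-turn interaction with the execution environment, it actively reasons about program behavior and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions, rather than sampling blindly. Trained with a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning, it improves on strong baselines across five competitive programming benchmarks.
Details
Motivation: Existing execution-based re-ranking methods are constrained by difficult test case generation or inefficient random input sampling, and so cannot effectively raise LLM accuracy on competitive programming problems.
Result: Extensive experiments on five competitive programming benchmarks show consistent improvements over strong execution-based baselines, with absolute gains of up to +10-15% in Best@K accuracy.
Insight: The core innovation is framing test input generation as an active, reasoning-based agentic process that produces targeted counterexamples through iterative refinement rather than passive sampling. The analysis shows clear test-time scaling behavior, and the framework has broader potential beyond re-ranking.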
Abstract: Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier’s broader potential beyond reranking.
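The idea of preferring test inputs that expose behavioral discrepancies among candidates can be sketched in a few lines. This is only an illustrative toy, not the paper's agent: the helper names, the hand-written inputs, and the majority-agreement voting rule are all assumptions made for the example.

```python
from collections import Counter

def run_candidates(candidates, test_input):
    """Run every candidate on one input; an exception counts as its own behaviour."""
    outputs = []
    for fn in candidates:
        try:
            outputs.append(repr(fn(test_input)))
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    return outputs

def discrimination_score(outputs):
    """Number of distinct behaviours an input exposes (1 means no disagreement)."""
    return len(set(outputs))

def rerank(candidates, test_inputs):
    """Assumed rule: ignore non-discriminative inputs, then vote by majority agreement."""
    votes = Counter()
    for x in test_inputs:
        outputs = run_candidates(candidates, x)
        if discrimination_score(outputs) < 2:
            continue  # every candidate agrees, so this input cannot separate them
        majority_output, _ = Counter(outputs).most_common(1)[0]
        for idx, out in enumerate(outputs):
            if out == majority_output:
                votes[idx] += 1
    return max(range(len(candidates)), key=lambda i: votes[i])

# Hypothetical task: return the maximum of a non-empty list.
candidates = [
    lambda xs: max(xs),                        # correct
    lambda xs: sorted(xs)[-1],                 # correct
    lambda xs: xs[0],                          # bug: returns the first element
    lambda xs: max(xs) if xs[0] >= 0 else 0,   # bug: mishandles a negative first element
]
test_inputs = [[1, 2, 3], [-5, -2], [4]]
print(rerank(candidates, test_inputs))
```

In this toy, the input `[4]` is skipped because all four candidates agree on it, while `[-5, -2]` separates both buggy candidates from the correct ones. Majority agreement is a crude stand-in for correctness and would fail if most candidates shared the same bug.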
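[8] ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation cs.CL [PDF]
Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan
TL;DR: This paper proposes ECG-R1, the first reasoning multimodal large language model (MLLM) designed for reliable electrocardiogram (ECG) interpretation. Through three innovations (protocol-guided instruction data generation, a modality-decoupled architecture with interleaved modality dropout, and reinforcement learning with ECG diagnostic evidence rewards), it targets the severe hallucinations of existing MLLMs, which produce plausible but clinically incorrect ECG interpretations.
Details
Motivation: Existing MLLMs are unreliable for ECG interpretation and often produce plausible but clinically incorrect analyses (severe hallucinations), which limits their trustworthy use in clinical practice.
Result: The paper systematically evaluates proprietary, open-source, and medical MLLMs and provides the first quantitative evidence that severe hallucinations are widespread. ECG-R1 aims at more reliable, evidence-grounded ECG interpretation, but the abstract does not report specific quantitative metrics (such as accuracy) on particular benchmarks or state whether it reaches SOTA.
Insight: The main innovations are: 1) protocol-guided instruction data generation, which grounds interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic; 2) a modality-decoupled architecture with interleaved modality dropout, which improves robustness and cross-modal consistency when the ECG signal or image is missing; 3) reinforcement learning with ECG diagnostic evidence rewards, which strengthens evidence-grounded interpretation. Viewed objectively, the core idea is to integrate rigorous clinical diagnostic protocols and logic deeply into MLLM training and inference to improve trustworthiness on domain-specific medical tasks.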
Abstract: Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at https://github.com/PKUDigitalHealth/ECG-R1, and an online platform can be accessed at http://ai.heartvoice.com.cn/ECG-R1/.
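[9] Contextual Drag: How Errors in the Context Affect LLM Reasoning cs.CL | cs.AI | cs.LG [PDF]
Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora
TL;DR: This paper studies "contextual drag" in LLM reasoning: failed attempts in the context bias subsequent generations toward structurally similar errors. Evaluating 11 models on 8 reasoning tasks, it finds performance drops of 10-20%, and iterative self-refinement can collapse into self-deterioration. Structural analysis shows that subsequent reasoning trajectories inherit error patterns from the context; neither external feedback nor self-verification eliminates the effect, and existing mitigation strategies yield only partial improvement.
Details
Motivation: Many LLM self-improvement pipelines assume that models can improve by reflecting on past mistakes. The paper finds instead that failed attempts in the context persistently harm subsequent reasoning, an under-recognized failure mode.
Result: Across 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag causes 10-20% performance drops. Iterative self-refinement collapses into self-deterioration in models with severe contextual drag, and structural analysis using tree edit distance confirms that error patterns are inherited structurally.
Insight: The contribution is identifying and systematically quantifying contextual drag as a persistent failure mode. It shows that current reasoning architectures are structurally sensitive to errors in the context and that feedback or verification does not fully overcome this, an important caution for model design and self-improvement strategies.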
Abstract: Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
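[10] Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision cs.CL [PDF]
Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei
TL;DR: This paper proposes the Guided Verifier framework, which introduces a dynamic verifier that reasons collaboratively with the policy model, detecting inconsistencies in real time and providing directional signals. It targets the error propagation caused by the lack of intermediate supervision when multimodal large language models are trained with reinforcement learning.
Details
Motivation: Current RL-based multimodal reasoning paradigms typically have the model work alone, without supervision of intermediate steps, so early logical deviations propagate into irreversible failures and produce noisy optimization signals. The paper aims to address this structural limitation.
Result: Extensive experiments on MathVista, MathVerse, and MMMU indicate that, by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
Insight: The core innovation is moving from passive terminal rewards to active, dynamic process supervision. A verifier trained on the purpose-built CoRe dataset (process-level negatives and correctly guided reasoning trajectories) interacts with and steers the policy model in real time during reasoning.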
Abstract: Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
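[11] Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models cs.CL [PDF]
Sichu Liang, Hongyu Zhu, Wenwen Wang, Deyu Zhou
TL;DR: This study evaluates Qwen2.5 and Qwen2.5-VL on a spatial n-back working-memory task, comparing text-rendered and image-rendered presentation. Accuracy and discriminability (d’) are both reliably higher in the text condition than in the vision condition. Analysis of trial-wise log-probability evidence shows that in nominal 2/3-back tasks the models' computation often departs from the instructed lag comparison and instead aligns with a recency-locked comparison. In addition, grid size alters the recent-repeat structure of the stimulus stream, which changes interference and error patterns.
Details
Motivation: Working memory is a core component of intelligent behavior. Prior work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations in vision-language models when information is presented visually rather than as text. The study asks whether vision can replace text in working memory.
Result: On a controlled spatial n-back task, Qwen2.5 and Qwen2.5-VL show reliably higher accuracy and d’ with text-rendered grids than with image-rendered grids. The analysis indicates that the models' actual computation is often consistent with a recency-locked comparison rather than the instructed lag comparison. Changes in grid size systematically alter error patterns.
Insight: The novelty lies in extending the n-back working-memory probe from text to the visual modality with a matched cross-modal (text vs. image) comparison. The core insight is that current vision-language models compute differently when a working-memory task is presented visually rather than as text, and are more susceptible to recency and stimulus structure such as grid size. This underscores the need for computation-sensitive interpretations when evaluating multimodal working memory.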
Abstract: Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d’ with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
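For readers unfamiliar with d’, the sensitivity index from signal detection theory is the difference between the z-transformed hit rate and false-alarm rate. The sketch below is a generic computation, not the paper's code; the log-linear correction (adding 0.5 to each count) is an assumed convention for avoiding infinite values at rates of 0 or 1, and the paper may use a different one.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Assumed log-linear correction: keeps both rates strictly inside (0, 1).
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)

# Hypothetical n-back counts: match trials answered "match" are hits,
# non-match trials answered "match" are false alarms.
print(d_prime(hits=40, misses=10, false_alarms=10, correct_rejections=40))
print(d_prime(hits=25, misses=25, false_alarms=25, correct_rejections=25))
```

A model that answers "match" equally often on match and non-match trials gets d’ near zero regardless of its overall response bias, which is why d’ is reported alongside raw accuracy.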
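[12] Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning cs.CL [PDF]
Jie Deng, Hanshuang Tong, Jun Li, Shining Liang, Ning Wu
TL;DR: This paper proposes TrajFusion, a fine-tuning strategy for mathematical reasoning that goes beyond conventional rejection sampling by fusing correct and incorrect reasoning trajectories into structured supervision.
Details
Motivation: Conventional rejection sampling fine-tuning retains only correct reasoning trajectories and discards teacher-generated errors, so reasoning failures are not modeled during training.
Result: Across multiple math benchmarks, TrajFusion consistently outperforms rejection sampling fine-tuning (RFT), particularly on challenging and long-form reasoning problems.
Insight: Incorrect trajectories are interleaved with reflection prompts and correct trajectories to form fused trajectories, and the length of each sample is adaptively controlled based on the frequency and diversity of errors, providing richer supervision for hard problems.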
Abstract: Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
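The fused-trajectory construction described above can be sketched as simple string assembly. Everything here is an illustrative assumption: the function name, the reflection text, and in particular the fixed `max_errors` cap, which stands in for the paper's adaptive length control based on error frequency and diversity.

```python
def build_fused_sample(question, correct, incorrect, max_errors=2,
                       reflection="Wait, that attempt is wrong. Let me try again."):
    """Interleave selected incorrect attempts with reflection prompts, then the correct one."""
    # Deduplicate errors while preserving order, as a crude proxy for error diversity.
    distinct_errors = list(dict.fromkeys(incorrect))
    selected = distinct_errors[:max_errors]
    parts = [question]
    for attempt in selected:
        parts.append(attempt)
        parts.append(reflection)
    parts.append(correct)
    return "\n".join(parts)

# With no teacher errors, the sample reduces to a plain rejection-sampling example.
print(build_fused_sample("Q: 2 + 2 * 3 = ?", "2 + 6 = 8", []))
print(build_fused_sample("Q: 2 + 2 * 3 = ?", "2 + 6 = 8",
                         ["4 * 3 = 12", "4 * 3 = 12", "2 + 2 + 3 = 7"]))
```

The empty-error case mirrors the abstract's statement that the method reduces to vanilla RFT when error signals are uninformative; how the paper decides that errors are uninformative is not specified here.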
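[13] Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models cs.CL [PDF]
Isabel Tsintsiper, Sheng Wong, Beth Albert, Shaun P Brennecke, Gabriel Davis Jones
TL;DR: This paper systematically evaluates whether four general-purpose large language models (ChatGPT, Claude, Gemini, and DeepSeek) exhibit sex bias in clinical reasoning. Using 50 clinician-authored vignettes in which sex is non-informative to the initial diagnostic pathway, it finds that all models show significant, model-specific sex-assignment skew, and that permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences.
Details
Motivation: LLMs are increasingly embedded in healthcare workflows, but their training corpora encode existing biases (such as sex disparities in diagnosis and treatment), raising concerns that these patterns may be reproduced or amplified. A systematic evaluation of sex-specific bias in LLM clinical reasoning is therefore needed.
Result: At temperature 0.5, ChatGPT assigned female sex in 70% of cases, DeepSeek in 61%, and Claude in 59%, whereas Gemini showed a male skew, assigning female sex in only 36% of cases. All models showed significant, stable, model-specific sex-assignment skew.
Insight: The study shows that general-purpose LLMs exhibit stable, model-specific sex biases in clinical reasoning. Safe integration of LLMs in healthcare therefore requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight.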
Abstract: Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. We evaluated four general-purpose LLMs: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeekchat. All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning a female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
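[14] History-Guided Iterative Visual Reasoning with Self-Correction cs.CL | cs.AI | cs.MM [PDF]
Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang
TL;DR: This paper proposes H-GIVR, a history-guided iterative visual reasoning framework that aims to improve the reasoning reliability of multimodal large language models (MLLMs). By observing the image multiple times and using previously generated answers as references for subsequent steps, the framework dynamically corrects visual understanding errors and significantly improves answer accuracy on cross-modal tasks.
Details
Motivation: Existing self-consistency methods are mostly limited to a fixed "repeated sampling and voting" paradigm and do not reuse historical reasoning information, so models struggle to actively correct visual understanding errors and to adjust their reasoning dynamically during iteration. Inspired by the human behavior of repeated verification and dynamic error correction, the paper aims to address this.
Result: Comprehensive experiments on five datasets and three models show that H-GIVR significantly improves cross-modal reasoning accuracy at low computational cost. For example, with Llama3.2-vision:11b on ScienceQA, it reaches 78.90% accuracy with an average of only 2.57 responses per question, a 107% improvement over the baseline.
Insight: The core innovation is using historical reasoning information (previously generated answers) as a guiding reference during iteration. This gives a human-like self-correction mechanism that breaks from the fixed sample-and-vote paradigm of conventional self-consistency methods and offers a new approach to improving MLLM reasoning reliability.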
Abstract: Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed “repeated sampling and voting” paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using Llama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.
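A history-guided loop of the kind described in this abstract might look roughly like the sketch below. It is not the paper's algorithm: `ask_model` is a hypothetical callable that re-observes the image and receives the earlier answers, and the stopping rule (halt when an answer repeats the previous one) is an assumption chosen only to show why the average number of responses per question can stay small.

```python
def history_guided_answer(ask_model, question, max_rounds=5):
    """Query repeatedly, passing earlier answers back as references.

    ask_model(question, history) is a hypothetical stand-in for an MLLM call
    that looks at the image again and sees the list of previous answers.
    Returns (final_answer, number_of_responses_used).
    """
    history = []
    for _ in range(max_rounds):
        answer = ask_model(question, list(history))
        # Assumed stopping rule: the model confirms its previous answer.
        if history and answer == history[-1]:
            return answer, len(history) + 1
        history.append(answer)
    return history[-1], len(history)

# Stub model for illustration: wrong at first, then corrects itself and confirms.
def stub_model(question, history):
    return ["B", "A", "A"][len(history)]

print(history_guided_answer(stub_model, "Which option matches the diagram?"))
```

Unlike sample-and-vote self-consistency, each call here depends on the earlier ones, so the responses are not independent samples and cannot be issued in parallel.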
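[15] Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks cs.CL | cs.AI [PDF]
Masaya Tsunokake, Yuta Koreeda, Terufumi Morishita, Koichi Nagatsuka, Hikaru Tomonari
TL;DR: This paper studies the effectiveness and bottlenecks of micro domain-adaptive pre-training (mDAPT) for generative tasks in real-world enterprise operations. By decomposing the answering process into three subtasks (fact elicitation, reasoning, and long-form composition) and evaluating on proprietary knowledge in IT technical support, it finds that mDAPT resolves the base model's difficulty with fact elicitation, but bottlenecks remain in reasoning and composition.
Details
Motivation: Prior work evaluated mDAPT in micro domains (such as enterprise proprietary knowledge) only with multiple-choice questions, so its effectiveness on generative tasks in real operations is unknown. The paper aims to reveal mDAPT's potential and bottlenecks on generative tasks.
Result: On proprietary knowledge in IT technical support, mDAPT resolved the elicitation subtask that the base model struggled with, but did not improve the reasoning and long-form composition subtasks. Further analysis shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), highlighting the need to enhance reasoning capability.
Insight: The novelty is decoupling the LLM answering process into three separately evaluable subtasks, which exposes mDAPT's limitations at fine granularity. Viewed objectively, mDAPT mainly improves knowledge acquisition, while reasoning and composition need additional enhancement; this gives domain-adaptive training a targeted direction for improvement.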
Abstract: When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations (micro domains). A previous study shows micro domain-adaptive pre-training (mDAPT) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) eliciting facts relevant to questions from an LLM’s own knowledge, (2) reasoning over the facts to obtain conclusions, and (3) composing long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT’s effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
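[16] Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition cs.CL [PDF]
Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang
TL;DR: This paper proposes Modality-aware Consistency Reasoning (MCR), which addresses the modality bias that multimodal large language models exhibit when performing Grounded Multimodal Named Entity Recognition (GMNER) end to end. Through multi-style reasoning schema injection and constraint-guided verifiable optimization, the method enforces structured cross-modal reasoning and so moves beyond unimodal shortcuts, outperforming existing baselines on GMNER and visual grounding tasks.
Details
Motivation: To explore the potential of MLLMs to perform GMNER end to end, and to address the visual and textual modality bias that arises from their tendency to rely on unimodal shortcuts.
Result: Experiments on GMNER and visual grounding tasks show that MCR effectively mitigates modality bias and outperforms existing baselines.
Insight: The novelty is the MCR framework: MRSI transforms abstract constraints into executable reasoning chains, and CVO, combined with GRPO, lets the model dynamically align its reasoning trajectories, enforcing structured cross-modal verification rather than reliance on unimodal information. This offers a reusable constraint-guided optimization approach to mitigating modality bias in MLLMs.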
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
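[17] PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation cs.CL | cs.HC [PDF]
Saleh Afzoon, MohammadHossein Ahmadi, Usman Naseem, Amin Beheshti
TL;DR: This paper proposes PersoDPO, a scalable preference optimization framework for improving the contextual coherence and personalization of persona-grounded dialogue systems. It builds high-quality preference pairs from automatic evaluation signals over responses from closed-source and open-source LLMs, without manual annotation, and uses them to fine-tune dialogue models. Experiments on the FoCus dataset show that an open-source model fine-tuned with PersoDPO outperforms strong open-source baselines and a standard DPO variant across multiple evaluation dimensions.
Details
Motivation: Open-source LLMs still struggle to generate responses that are both contextually grounded and aligned with persona cues, despite strong general conversational abilities such as fluency and naturalness. A scalable method is therefore needed to optimize personalization and contextual coherence.
Result: On the FoCus dataset, an open-source language model fine-tuned with PersoDPO consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
Insight: The novelty is a scalable preference optimization framework that automatically constructs high-quality preference pairs from multi-LLM automatic evaluation signals (covering coherence, personalization, and instruction adherence), avoiding costly manual annotation and enabling a scalable, reproducible training pipeline. Viewed objectively, the method integrates automatic evaluation metrics directly into preference optimization, an efficient route to aligning dialogue models on specific attributes.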
Abstract: Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.
[18] Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models cs.CLPDF
Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim
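TL;DR: This paper proposes Model-Dowser, a sparse fine-tuning method that mitigates catastrophic forgetting when multimodal large language models are fine-tuned on task-specific data. It computes an importance score for each model parameter by jointly considering weight magnitudes, input activations, and output sensitivities, and selectively preserves high-importance parameters during fine-tuning.
Details
Motivation: Fine-tuning multimodal large language models on downstream tasks degrades generalization on pretrained tasks (catastrophic forgetting), and existing methods either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly as model size grows.
Result: Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, show that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
Insight: The contribution is a data-free importance probing method that computes parameter importance by jointly considering weights, inputs, and output dynamics, enabling efficient and scalable sparse fine-tuning that balances downstream task performance against retention of pretrained knowledge.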
Abstract: Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
[19] LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding cs.CL | cs.AIPDF
Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen
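TL;DR: This paper proposes LycheeDecode, an efficient decoding method for accelerating long-context LLM inference. Its fine-grained hybrid-head attention mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse those tokens for efficient computation, reducing the memory and latency overhead caused by the rapidly growing key-value cache during decoding.
Details
Motivation: The spread of long-context LLMs exposes a key bottleneck: the rapidly expanding key-value cache during decoding imposes heavy memory and latency costs. Existing methods alleviate this by sharing a single set of crucial tokens across layers, but such coarse-grained sharing ignores the functional diversity of attention heads and hurts model performance.
Result: Extensive experiments on leading models such as Llama3 and Qwen3, across long-context understanding benchmarks (LongBench, RULER) and complex reasoning benchmarks (AIME24, OlympiadBench), show that LycheeDecode matches and sometimes exceeds the generation quality of the full-attention baseline while achieving up to a 2.7x speedup at a 128K context length.
Insight: The core contribution is a fine-grained hybrid-head attention mechanism that, through a HardKuma-based mechanism and a hardware-efficient top-k selection strategy, achieves efficient computation while preserving the functional diversity of attention heads. This overcomes the performance bottleneck of existing methods and provides an effective path to efficient, high-quality long-context LLM inference.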
Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
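The retrieval-head and sparse-head split described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the head roles are fixed by hand (the paper learns the partition with a HardKuma-based mechanism, which is not reproduced here), and the function names, vectors, and `k` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(query, keys, k):
    """Retrieval head: score every cached key and keep the k best positions."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, indices):
    """Sparse head: softmax attention restricted to the shared positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in indices]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, indices)) / total
        for d in range(dim)
    ]


# Toy key-value cache with four token positions (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

shared = top_k_indices([1.0, 0.0], keys, k=2)  # computed once by a retrieval head
output = sparse_attention([0.5, 0.5], keys, values, shared)  # reused by a sparse head
```

Only the retrieval head touches the full cache; each sparse head reads `k` entries, which is where the memory and latency savings would come from.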
[20] VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration cs.CL | cs.AI | cs.CYPDF
Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
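TL;DR: VILLAIN is a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration, and it achieved the top result in the AVerImaTeC shared task.
Details
Motivation: Fact-checking multimodal (image-text) claims requires complex evidence retrieval and analysis, which the system handles through multi-agent collaboration.
Result: The system ranked first on the AVerImaTeC shared task leaderboard across all evaluation metrics.
Insight: The contribution is a staged multi-agent collaboration framework that combines modality-specific and cross-modal analysis with a final verification step based on generated question-answer pairs, improving the accuracy and robustness of multimodal fact-checking.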
Abstract: This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.
[21] Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays cs.CLPDF
Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver
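TL;DR: This paper studies trait-based automatic scoring of argumentative essays using two complementary modeling paradigms: structured in-context learning with small open-source LLMs, and a supervised model built on a BigBird encoder with a CORAL ordinal regression formulation. A systematic evaluation on the ASAP++ dataset shows that explicitly modeling score ordinality substantially improves agreement with human raters, outperforming the LLMs and the baselines. Small open-source LLMs nonetheless reach competitive performance without task-specific fine-tuning, particularly on reasoning-oriented traits.
Details
Motivation: Traditional automated essay scoring systems focus on holistic scores, which limits their pedagogical usefulness, especially for complex genres such as argumentative writing. Educational settings need interpretable, trait-level feedback aligned with instructional goals and rubrics.
Result: On the ASAP++ dataset (essay scores across five quality traits), the BigBird model that explicitly models score ordinality via the CORAL framework substantially improves agreement with human raters on all traits, outperforming LLMs as well as nominal classification and regression baselines. Small open-source LLMs achieve competitive performance without fine-tuning, particularly on reasoning-oriented traits.
Insight: The contribution is aligning the model objective with rubric semantics by explicitly modeling score ordinality through CORAL ordinal regression, which proves important for educational assessment. The study also shows that small open-source LLMs, without fine-tuning, enable transparent, privacy-preserving, and locally deployable assessment, offering methodological and practical insights for the design of AI-based educational systems.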
Abstract: Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
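To make the CORAL formulation concrete, the sketch below shows its usual ingredients: one shared score per essay, one bias per threshold, and a rank obtained by counting thresholds passed. It is an illustration under stated assumptions, not the paper's implementation; the five-level 0 to 4 scale, the bias values, and the example score are hypothetical, and the encoder that would produce the score is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode_levels(label, num_classes):
    """Extended binary targets: entry k is 1 when the label exceeds rank k."""
    return [1 if label > k else 0 for k in range(num_classes - 1)]


def coral_probs(score, biases):
    """P(label > k) for each threshold k, using one shared score."""
    return [sigmoid(score + b) for b in biases]


def coral_loss(score, biases, label):
    """Sum of binary cross-entropies over the K - 1 threshold tasks."""
    levels = encode_levels(label, len(biases) + 1)
    probs = coral_probs(score, biases)
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(levels, probs)
    )


def predict_rank(score, biases):
    """Predicted rank = number of thresholds passed with probability > 0.5."""
    return sum(p > 0.5 for p in coral_probs(score, biases))


# Hypothetical trait scored 0..4, so there are four thresholds.
biases = [2.0, 0.5, -0.5, -2.0]
score = 0.8  # stands in for the encoder output for one essay
predicted = predict_rank(score, biases)
```

Because every threshold shares the same score, the threshold probabilities move together, which is what keeps the predicted ranks consistent; a nominal classifier has no such constraint.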
[22] LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation cs.CLPDF
Ruixiao Yang, Yuanhe Tian, Xu Yang, Huiqi Li, Yan Song
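TL;DR: This paper proposes LEAD (Layer-wise Expert-aligned Decoding), a method for improving radiology report generation (RRG). It integrates expert-extracted pathological features into every decoder layer of a large vision language model (LVLM) to dynamically correct decoding biases, reducing hallucinations and producing diagnostic reports that are more faithful to the image.
Details
Motivation: Although LVLMs improve the fluency and accuracy of radiology reports, they still hallucinate, generating plausible pathological details with no grounding in the image. Existing methods mainly rely on external knowledge guidance to align generated text with visual information, but they overlook the decoding priors and vision-language alignment biases inherent in pretrained models, and they lack robustness because they depend on constructed guidance.
Result: Experiments on multiple public datasets show that LEAD improves clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
Insight: The contribution is a layer-wise expert-aligned decoding architecture: a gating mechanism integrates the pathological features extracted by multiple expert modules into each decoder layer, so the LLM can consult expert features at every inference step, dynamically correcting decoding biases and steering generation toward factual consistency. Viewed objectively, this improves robustness by modifying the model's internal decoding process instead of relying on external guidance.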
Abstract: Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostics from medical images. Although large vision language models (LVLM) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate the alignment between generated text and visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases in pretrained models and lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method to inherently modify the LVLM decoding trajectory. A multiple experts module is designed for extracting distinct pathological features which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering the generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that the LEAD method yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
[23] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models cs.CLPDF
Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang
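TL;DR: Generative reward models (GenRMs) and LLM-as-a-Judge are trained and evaluated with a heavy emphasis on Outcome Accuracy, which leads to deceptive alignment. This paper proposes a new evaluation metric, Rationale Consistency, which quantifies how well a model's reasoning process aligns with human judgment. It also proposes a hybrid training signal that combines rationale consistency with outcome accuracy; the resulting method achieves state-of-the-art performance on several benchmarks and avoids the deceptive alignment trap.
Details
Motivation: Existing generative reward models and LLM-as-a-Judge exhibit deceptive alignment: they can reach correct judgments for incorrect reasons. This stems mainly from training and evaluation that prioritize outcome accuracy, and it undermines their ability to generalize during RLHF.
Result: The hybrid-signal training method reaches 87.1% on RM-Bench and 82% on JudgeBench, surpassing outcome-only baselines by an average of 5%. When used in RLHF, it improves performance on Arena Hard v2, notably by 7% on creative writing tasks. Rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, whereas outcome accuracy does neither.
Insight: The core contribution is Rationale Consistency, a fine-grained metric that compensates for the shortcomings of outcome accuracy alone, together with a hybrid training signal that combines the two. This offers a new approach to aligning a model's reasoning process with human expectations, helps interpretability and generalization, and avoids the deceptive alignment trap in RLHF.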
Abstract: Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
[24] ERNIE 5.0 Technical Report cs.CLPDF
Haifeng Wang, Hua Wu, Tian Wu, Yu Sun, Jing Liu
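TL;DR: This report introduces ERNIE 5.0, a natively autoregressive foundation model for unified multimodal understanding and generation across text, image, video, and audio. The model is built on an ultra-sparse mixture-of-experts architecture with modality-agnostic routing, and a novel elastic training paradigm lets a single pre-training run learn a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency.
Details
Motivation: Address the practical challenges of deploying large-scale unified multimodal foundation models under diverse resource constraints, and the difficulties of scaling reinforcement learning to such models.
Result: Extensive experiments show that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of the authors' knowledge, among publicly disclosed models, ERNIE 5.0 is the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation.
Insight: The main contributions are: 1) an ultra-sparse MoE architecture with modality-agnostic expert routing, with all modalities trained from scratch under a unified framework; 2) an elastic training paradigm that produces, in a single training run, a family of sub-models suited to different resource constraints; 3) a systematic treatment of the challenges of scaling reinforcement learning post-training under ultra-sparse MoE architectures and multimodal settings, aimed at efficiency and stability.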
Abstract: In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
[25] Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases cs.CL | cs.AI | cs.HCPDF
Casey Ford, Madison Van Doren, Emily Dix
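TL;DR: This study runs a two-phase longitudinal evaluation of multimodal LLM safety, using a fixed benchmark of 726 adversarial prompts written by 26 professional red teamers. Phase 1 evaluated GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni), for a total of 82,256 human harm ratings. The study finds large, persistent differences between model families and clear alignment drift.
Details
Motivation: Multimodal LLMs are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. The study uses a longitudinal evaluation to examine how safety changes as models are updated.
Result: Pixtral models were consistently the most vulnerable, while Claude models appeared safest because of high refusal rates. Attack success rates (ASR) showed alignment drift: ASR increased across generations for GPT and Claude models and decreased modestly for Pixtral and Qwen. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities.
Insight: MLLM harmlessness is neither uniform across model families nor stable across updates; it drifts as models are revised. This underscores the need for longitudinal, multimodal benchmarks that track evolving safety behaviour. The contribution is a systematic cross-generation safety evaluation of several mainstream MLLM families that quantifies alignment drift.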
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
[26] When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond? cs.CL | cs.AIPDF
Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian
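TL;DR: This paper presents the first empirical study of training LLMs to abstain in temporal question answering. The authors propose a training pipeline that couples Chain-of-Thought supervision with reinforcement learning guided by abstention-aware rewards, and systematically analyze how different information types and training techniques affect abstention behaviour in temporal reasoning. Experiments show that the approach surpasses GPT-4o on the TimeQA benchmark and markedly improves the model's ability to recognize unanswerable questions.
Details
Motivation: LLMs rarely admit uncertainty and often produce fluent but misleading answers instead of abstaining. This is especially evident in temporal QA, where models frequently ignore time-sensitive evidence and conflate facts from different periods. Existing approaches such as calibration may be unreliable at capturing uncertainty in complex reasoning.
Result: On TimeQA-Easy and TimeQA-Hard, a model initialized from Qwen2.5-1.5B-Instruct surpasses GPT-4o in Exact Match by 3.46% and 5.80%, respectively. On unanswerable questions, its True Positive rate is 20% higher than that of a pure supervised fine-tuned variant.
Insight: Abstention is treated as a teachable skill and optimized jointly with reasoning by combining Chain-of-Thought supervision with reinforcement learning. The study finds that supervised fine-tuning induces overconfidence and harms reliability, while reinforcement learning improves prediction accuracy but carries similar risks. Implicit reasoning cues (original context, temporal sub-context, knowledge graphs) provide limited benefit for reasoning with abstention, and explicit Chain-of-Thought supervision is more effective. These findings offer new insights for building more reliable LLMs.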
Abstract: Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
[27] Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation cs.CLPDF
Luis Frentzen Salim, Esteban Carlin, Alexandre Morinvil, Xi Ai, Lun-Wei Ku
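TL;DR: This paper explores scaling in-context learning (ICL) for low-resource machine translation from the few-shot setting to a long-context setting with thousands of examples. It scales the in-context token budget to 1M tokens and compares three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data. Experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, and that scaling behaviour depends strongly on corpus type.
Details
Motivation: Building machine translation systems for low-resource languages is difficult because high-quality data is scarce. Although LLMs have improved machine translation, adapting them to underrepresented languages remains challenging. ICL, which supplies demonstrations at inference time, may offer a new way to adapt models for low-resource translation.
Result: In experiments on Javanese and Sundanese, translation quality gains from additional context saturate quickly and can even decline near the maximum context window. Scaling behaviour depends strongly on corpus type, and some forms of monolingual supervision are competitive with parallel data even though the latter provides additional supervision.
Insight: The study characterizes the effective limits and corpus-type sensitivity of long-context ICL for low-resource machine translation, showing that larger context windows do not necessarily yield proportional quality gains. Monolingual unsupervised data can in some cases substitute for parallel data, which suggests new ways to use data in low-resource scenarios.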
Abstract: Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstration at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English–target and Indonesian–target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.
[28] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models cs.CLPDF
Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen
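TL;DR: This paper proposes OmniSIFT, a modality-asymmetric token compression framework designed for omni-modal LLMs. Its two-stage compression strategy (spatio-temporal video pruning followed by vision-guided audio selection) shortens multimodal input sequences and thereby reduces computational overhead.
Details
Motivation: Omni-modal LLMs perform well on audio-video understanding tasks, but their reliance on long multimodal token sequences causes substantial computational overhead, and token compression methods designed for such models remain limited.
Result: Extensive experiments on five representative benchmarks show that OmniSIFT adds only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With 25% of the original token context, it consistently outperforms all compression baselines and even surpasses the full-token model on several tasks.
Insight: The contribution is a modality-asymmetric two-stage compression strategy that combines spatio-temporal video pruning with vision-guided audio selection and is optimized end-to-end through a differentiable straight-through estimator, substantially reducing computation while maintaining model performance.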
Abstract: Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
[29] SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization cs.CL | cs.AI | cs.LGPDF
Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu
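TL;DR: This paper proposes SE-Bench, a diagnostic benchmark for evaluating an agent's ability to internalize knowledge and self-evolve. The benchmark obfuscates the NumPy library and its API documentation into a pseudo-novel package; agents must internalize this package during training and then complete simple coding tasks closed-book, which isolates the effect from prior knowledge and from reasoning complexity. The study finds that closed-book training is required for internalization, that standard RL falls short, and that self-play is effective when combined with supervised fine-tuning.
Details
Motivation: There is no evaluation environment that rigorously measures an agent's core ability to act as a lifelong learner and solve future problems by internalizing new knowledge. The main obstacles are that "new" knowledge may be entangled with pre-training data, and that the cause of a task failure is hard to attribute (knowledge not internalized versus the problem being too hard).
Result: On SE-Bench, the study finds that open-book training with reference documentation inhibits retention, so closed-book training is needed to force knowledge to be compressed into the model weights; that standard PPO-based reinforcement learning fails to fully internalize new knowledge because of clipping and negative gradients; and that self-play combined with supervised fine-tuning is a viable route to internalization.
Insight: The contribution is a clean, controlled benchmark for diagnosing knowledge internalization, together with findings on the training paradigm that makes internalization work (closed-book over open-book training) and on algorithmic limitations (standard RL falls short, while self-play combined with SFT is effective), providing a rigorous experimental platform for studying agent self-evolution.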
Abstract: True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where “new” knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring “Closed-Book Training” to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
[30] CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation cs.CLPDF
Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu
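TL;DR: This study challenges a common assumption about LLM safety, namely that a model refusing a harmful request implies its entire reasoning process was safe. Using a unified safety-analysis framework, the paper analyzes the model's internal Chain-of-Thought (CoT) reasoning during fake news generation and finds that even when the model ultimately refuses, its internal reasoning may still contain and propagate unsafe narratives.
Details
Motivation: The study questions the practice of assessing LLM safety from final outputs alone, and in particular the assumption that a refusal response signifies safe reasoning. It aims to expose latent risks in the model's internal CoT reasoning in tasks such as fake news generation.
Result: Extensive experiments on multiple reasoning-oriented LLMs show that generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in a few contiguous mid-depth layers. By evaluating individual attention heads with Jacobian-based spectral metrics, the study quantifies how specific heads respond to or embed deceptive reasoning patterns.
Insight: The contribution is a unified safety-analysis framework that systematically deconstructs CoT generation and evaluates the role of attention heads, introducing three interpretable measures (stability, geometry, and energy) to quantify head behaviour. This gives a new view of internal reasoning risk, challenges the assumption that refusal implies safety, and provides analysis tools for mitigating latent risks.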
Abstract: From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.
[31] Reinforced Attention Learning cs.CL | cs.CV | cs.LGPDF
Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang
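TL;DR: This paper proposes Reinforced Attention Learning (RAL), a policy-gradient framework that uses reinforcement learning to directly optimize the internal attention distributions of multimodal LLMs instead of their output sequences. It targets the problem that conventional RL post-training based on verbose reasoning yields limited gains on multimodal perception tasks and can even degrade performance.
Details
Motivation: Conventional RL-based post-training improves LLM reasoning by scaling test-time reasoning, but when applied to multimodal LLMs, verbose rationales bring limited gains on perception tasks and may even reduce performance.
Result: Experiments across diverse image and video benchmarks show that RAL delivers consistent gains over GRPO and other baselines.
Insight: The core contribution is shifting the optimization target from "what to generate" (output token sequences) to "where to attend" (internal attention distributions), which promotes effective information allocation and better grounding in complex multimodal inputs. The proposed On-Policy Attention Distillation further shows that transferring latent attention behaviours yields stronger cross-modal alignment than standard knowledge distillation, positioning attention policies as a principled and general alternative for multimodal post-training.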
Abstract: Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
cs.CV [Back]
[32] TruKAN: Towards More Efficient Kolmogorov-Arnold Networks Using Truncated Power Functions cs.CV | cs.AI | cs.LGPDF
Ali Bayeh, Samira Sadaoui, Malek Mouhoub
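TL;DR: This paper proposes TruKAN, a neural network architecture based on the Kolmogorov-Arnold Network (KAN) structure that replaces the B-spline basis in KAN with truncated power functions, maintaining expressiveness while improving accuracy and training efficiency. TruKAN is integrated into an EfficientNet-V2 framework and evaluated on computer vision benchmark datasets, where it outperforms other KAN variants in accuracy, computational efficiency, and memory usage.
Details
Motivation: To address the trade-off between computational efficiency and adherence to the theoretical principles of Kolmogorov-Arnold Networks (KAN), the paper aims to design a more efficient and interpretable KAN architecture.
Result: On computer vision benchmark datasets, TruKAN outperforms other KAN models (such as KAN and SineKAN) in accuracy, computational efficiency, and memory usage on a complex vision task, going beyond the limited settings explored in prior KAN studies.
Insight: The main contribution is replacing the B-spline basis in KAN with a family of truncated power functions derived from k-order spline theory. This simplifies the basis functions and knot configurations, improving interpretability, accuracy, and training speed while maintaining expressiveness. Each TruKAN layer combines a truncated power term with a polynomial term; the paper also examines shared versus individual knot configurations and uses hybrid optimization and layer normalization to improve convergence stability.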
Abstract: To address the trade-off between computational efficiency and adherence to Kolmogorov-Arnold Network (KAN) principles, we propose TruKAN, a new architecture based on the KAN structure and learnable activation functions. TruKAN replaces the B-spline basis in KAN with a family of truncated power functions derived from k-order spline theory. This change maintains the KAN’s expressiveness while enhancing accuracy and training time. Each TruKAN layer combines a truncated power term with a polynomial term and employs either shared or individual knots. TruKAN exhibits greater interpretability than other KAN variants due to its simplified basis functions and knot configurations. By prioritizing interpretable basis functions, TruKAN aims to balance approximation efficacy with transparency. We develop the TruKAN model and integrate it into an advanced EfficientNet-V2-based framework, which is then evaluated on computer vision benchmark datasets. To ensure a fair comparison, we develop various models: MLP-, KAN-, SineKAN and TruKAN-based EfficientNet frameworks and assess their training time and accuracy across small and deep architectures. The training phase uses hybrid optimization to improve convergence stability. Additionally, we investigate layer normalization techniques for all the models and assess the impact of shared versus individual knots in TruKAN. Overall, TruKAN outperforms other KAN models in terms of accuracy, computational efficiency and memory usage on the complex vision task, demonstrating advantages beyond the limited settings explored in prior KAN studies.
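The abstract describes each TruKAN layer as combining a truncated power term with a polynomial term. A minimal sketch of that kind of learnable activation, written as the standard truncated power form of a spline, is shown below. The degree, knots, and coefficients are hypothetical values chosen for illustration, and the paper's exact parameterization may differ.

```python
def truncated_power(x, knot, degree):
    """(x - knot)_+^degree: zero left of the knot, a power function right of it."""
    return max(x - knot, 0.0) ** degree


def spline_activation(x, poly_coeffs, knot_coeffs, knots, degree):
    """Polynomial term plus a weighted sum of truncated power terms."""
    poly = sum(a * x ** i for i, a in enumerate(poly_coeffs))
    trunc = sum(
        c * truncated_power(x, t, degree) for c, t in zip(knot_coeffs, knots)
    )
    return poly + trunc


# Hypothetical cubic example (degree >= 1 is assumed) with three knots.
DEGREE = 3
KNOTS = [-0.5, 0.0, 0.5]
POLY_COEFFS = [0.1, 1.0, 0.0, -0.2]  # a_0 + a_1 x + a_2 x^2 + a_3 x^3
KNOT_COEFFS = [0.3, -0.6, 0.3]

left_of_all_knots = spline_activation(-1.0, POLY_COEFFS, KNOT_COEFFS, KNOTS, DEGREE)
```

Each truncated power term is a single clamp and a power, whereas a B-spline basis function is defined by a recursion over knot intervals; that difference is one plausible source of the simplification the abstract claims.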
[33] Explainable Computer Vision Framework for Automated Pore Detection and Criticality Assessment in Additive Manufacturing cs.CV | cs.AI | cs.CE | cs.LG
Akshansh Mishra, Rakesh Morisetty
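TL;DR: This study presents an explainable computer vision framework for pore detection and criticality assessment in 3D tomographic volumes from additive manufacturing. The framework identifies 500 pores using grayscale thresholding and connected component analysis, extracts geometric features, builds a pore interaction network, predicts criticality with machine learning models, and quantifies feature contributions with SHAP analysis.
Details
Motivation: Internal porosity is a critical defect affecting structural performance in additive manufacturing. Existing automated detection methods lack interpretability, leaving engineers unable to understand the physical basis of criticality predictions.
Result: On the 3D tomographic dataset, normalized surface distance contributes more than an order of magnitude more to criticality predictions than any other descriptor. Pore size has minimal influence and geometric parameters are negligible, revealing a boundary-driven failure mechanism.
Insight: The innovation lies in combining explainable AI (SHAP) with pore network analysis to provide a transparent defect assessment framework. Viewed objectively, it pairs conventional image processing with feature-importance quantification and yields actionable insights for additive manufacturing process optimization.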
Abstract: Internal porosity remains a critical defect mode in additively manufactured components, compromising structural performance and limiting industrial adoption. Automated defect detection methods exist but lack interpretability, preventing engineers from understanding the physical basis of criticality predictions. This study presents an explainable computer vision framework for pore detection and criticality assessment in three-dimensional tomographic volumes. Sequential grayscale slices were reconstructed into volumetric datasets, and intensity-based thresholding with connected component analysis identified 500 individual pores. Each pore was characterized using geometric descriptors including size, aspect ratio, extent, and spatial position relative to the specimen boundary. A pore interaction network was constructed using percentile-based Euclidean distance criteria, yielding 24,950 inter-pore connections. Machine learning models predicted pore criticality scores from extracted features, and SHAP analysis quantified individual feature contributions. Results demonstrate that normalized surface distance dominates model predictions, contributing more than an order of magnitude greater importance than all other descriptors. Pore size provides minimal influence, while geometric parameters show negligible impact. The strong inverse relationship between surface proximity and criticality reveals boundary-driven failure mechanisms. This interpretable framework enables transparent defect assessment and provides actionable insights for process optimization and quality control in additive manufacturing.
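The percentile-based network construction can be sketched as follows. This is an illustration under assumptions: the abstract does not state which percentile was used, and the nearest-rank cutoff and function name here are hypothetical. As a consistency check, 500 pores give 500 × 499 / 2 = 124,750 possible pairs, and the reported 24,950 connections are exactly 20% of that, which would be consistent with a 20th-percentile distance cutoff.

```python
import itertools
import math


def build_pore_network(centroids, percentile=20.0):
    """Connect pore pairs whose centroid distance is within a percentile cutoff.

    centroids: list of (x, y, z) tuples. The percentile value is an
    assumption; the abstract only says the criterion is percentile-based.
    """
    pairs = list(itertools.combinations(range(len(centroids)), 2))
    dists = [math.dist(centroids[i], centroids[j]) for i, j in pairs]
    ranked = sorted(dists)
    # Nearest-rank percentile: the k-th smallest distance is the cutoff.
    k = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    cutoff = ranked[k - 1]
    return [pair for pair, d in zip(pairs, dists) if d <= cutoff]
```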
[34] 4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping cs.CV
Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He
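TL;DR: This paper presents 4DPC$^2$hat, the first multimodal large language model designed for understanding dynamic point cloud sequences. To address the field's lack of large-scale cross-modal datasets and the difficulty of spatio-temporal motion modeling, the authors build 4DPC$^2$hat-200K, a large-scale dataset with over 44K dynamic object sequences, 700K point cloud frames, and 200K question-answer pairs. The model uses a Mamba-enhanced temporal reasoning architecture to capture long-range dependencies and dynamic patterns, and a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted supervision to keep strengthening its reasoning. Experiments show that the method significantly outperforms existing models on action understanding and temporal reasoning.
Details
Motivation: Existing point-cloud-based MLLMs focus mainly on static objects, and understanding dynamic point cloud sequences remains largely unexplored, mainly because of the lack of large-scale cross-modal datasets and the difficulty of modeling motion in spatio-temporal contexts.
Result: Extensive experiments show that 4DPC$^2$hat significantly improves action understanding and temporal reasoning over existing models, establishing a strong foundation for 4D dynamic point cloud understanding.
Insight: The main innovations are: 1) the first large-scale cross-modal dataset for dynamic point clouds, supporting several question types (counting, temporal relationship, action, spatial relationship, appearance); 2) a Mamba-enhanced temporal reasoning MLLM architecture that captures long-range dependencies and dynamic patterns in point cloud sequences; 3) a failure-aware bootstrapping learning strategy that iteratively identifies model weaknesses and generates targeted supervision data to continuously strengthen the model.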
Abstract: Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.
[35] Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation cs.CV | cs.AI | cs.LG | cs.MM | cs.SD | eess.AS
Jinxing Zhou, Yanghao Zhou, Yaoting Wang, Zongyan Han, Jiaqi Ma
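TL;DR: This paper introduces mask quality assessment for language-referred audio-visual segmentation (MQA-RefAVS), a task that evaluates candidate segmentation masks quantitatively and qualitatively without ground-truth annotations: predicting IoU, identifying the error type, and recommending a quality-control decision. The authors build MQ-RAVSBench, a benchmark covering diverse geometric and semantic error modes, and develop MQ-Auditor, an MLLM-based auditor that assesses mask quality through multimodal reasoning.
Details
Motivation: Current Ref-AVS research focuses on generating segmentation masks, while rich and interpretable diagnosis of mask quality remains underexplored. A reference-free mask quality assessment method is therefore needed to support segmentation failure detection and downstream improvement.
Result: On the MQ-RAVSBench benchmark, MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures effectively.
Insight: The innovation lies in introducing the first reference-free mask quality assessment task for Ref-AVS, building a diverse benchmark covering geometric and semantic errors, and using an MLLM for multimodal reasoning to produce quantitative and qualitative assessments, giving segmentation systems an actionable quality-control mechanism.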
Abstract: Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at https://github.com/jasongief/MQA-RefAVS.
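For reference, the quantity the auditor is asked to estimate is the standard intersection-over-union between a candidate mask and the ground truth. The sketch below computes it when both masks are available, which is the situation at benchmark-construction time rather than at inference; the nested-list mask format and the convention for two empty masks are assumptions made for illustration.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (an assumption): two empty masks agree perfectly.
    return inter / union if union else 1.0
```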
[36] Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study cs.CV | cs.AI
Hugo Markoff, Stefan Hein Bengtson, Michael Ørsted
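TL;DR: This study evaluates state-of-the-art Vision Transformer foundation models for zero-shot clustering of animal images. Combining several dimensionality reduction techniques and clustering algorithms, it achieves near-perfect species-level clustering on a dataset of 60 species and shows that the models can extract intra-specific variation such as age and sex.
Details
Motivation: To address the bottleneck of inefficient manual labeling of animal images in ecological research, the study explores whether ViT foundation models can cluster large numbers of unlabeled animal images at the species level.
Result: DINOv3 embeddings with t-SNE and supervised hierarchical clustering achieve near-perfect species-level clustering (V-measure: 0.958). Unsupervised methods reach competitive performance (0.943), with only 1.14% of images rejected as outliers requiring expert review.
Insight: The study demonstrates the strength of ViT foundation models for zero-shot animal image clustering. Intentional over-clustering can reveal ecologically meaningful intra-specific patterns (such as age and sexual dimorphism), and the authors provide ecologists with an open-source toolkit and method recommendations for different taxonomic groups.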
Abstract: Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed distributions of species and show that intentional over-clustering can reliably extract intra-specific variation including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations for ecologists to select appropriate methods for sorting their specific taxonomic groups and data.
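The V-measure reported here is the harmonic mean of homogeneity and completeness, both defined from conditional entropies of the class and cluster labels. A minimal standard-library sketch of that definition follows; it is not the authors' toolkit, and the function names are illustrative.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (nats) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(targets, given):
    """H(targets | given) computed from two paired label lists."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g])
                for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

The score ignores how clusters are named, so a clustering that recovers every species under different cluster ids still scores 1.0, while lumping all images into a single cluster scores 0.0.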
[37] Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs cs.CV | cs.LG
Xuwei Tan, Ziyu Hu, Xueru Zhang
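TL;DR: The paper introduces NH-Fair, a benchmark for systematically evaluating and comparing fairness mitigation methods for vision models and large vision-language models (LVLMs) under standardized datasets, metrics, and training protocols. It finds that many debiasing methods are no more reliable than a carefully tuned empirical risk minimization (ERM) baseline, whereas a composite data-augmentation method consistently improves fairness without sacrificing performance. LVLMs achieve higher average accuracy but still exhibit subgroup disparities.
Details
Motivation: Existing fairness mitigation methods are hard to compare because of heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision and multimodal models, and insufficient hyperparameter tuning, all of which undermine fair comparison.
Result: On NH-Fair, many debiasing methods fail to reliably outperform a well-tuned ERM baseline. The composite data-augmentation method consistently improves fairness while preserving performance. LVLMs show higher average accuracy but retain subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices.
Insight: The innovation is a unified, reproducible fairness evaluation benchmark covering both vision models and LVLMs that highlights the large influence of hyperparameter tuning on fairness and performance. Viewed objectively, the systematic tuning study yields practical guidelines and points to data augmentation as a more reliable strategy for improving fairness, which is useful for real-world deployment.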
Abstract: Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines that help practitioners narrow the expensive hyperparameter search space while achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy; and (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.
[38] Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science cs.CV | cs.AI | cs.CE | cs.LG
Levi Lingsch, Georgios Kissas, Johannes Jakubik, Siddhartha Mishra
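TL;DR: This paper proposes Phaedra, a high-fidelity discrete tokenization method for scientific images (such as physical simulation data), aimed at the shortcomings of existing image tokenizers in capturing the large dynamic range of scientific images and preserving their physical and spectral properties.
Details
Motivation: Existing tokenizers are designed for visual perception and may not suit scientific images, which require physical and spectral properties to be preserved. High-fidelity tokenization for scientific images therefore needs investigation.
Result: Phaedra consistently improves reconstruction accuracy across multiple PDE datasets and generalizes strongly on out-of-distribution tasks (known PDEs with different conditions, unknown PDEs, and real-world Earth observation and weather data).
Insight: The innovation is a tokenization method, inspired by classical shape-gain quantization and proper orthogonal decomposition, that captures both fine details and precise magnitudes, matching the particular needs of scientific data.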
Abstract: Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences that can be efficiently learned, generated, and generalized to new tasks. These have become foundational for image and video generation and, more recently, physical simulation. As existing tokenizers are designed for the explicit requirements of realistic visual perception of images, it is necessary to ask whether these approaches are optimal for scientific images, which exhibit a large dynamic range and require token embeddings to retain physical and spectral properties. In this work, we investigate the accuracy of a suite of image tokenizers across a range of metrics designed to measure the fidelity of PDE properties in both physical and spectral space. Based on the observation that these struggle to capture both fine details and precise magnitudes, we propose Phaedra, inspired by classical shape-gain quantization and proper orthogonal decomposition. We demonstrate that Phaedra consistently improves reconstruction across a range of PDE datasets. Additionally, our results show strong out-of-distribution generalization capabilities to three tasks of increasing complexity, namely known PDEs with different conditions, unknown PDEs, and real-world Earth observation and weather data.
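Classical shape-gain quantization, which the abstract cites as inspiration, splits a vector into a magnitude (gain) and a unit-norm direction (shape) and quantizes the two separately. The sketch below shows only that classical idea with tiny hand-written codebooks; it is not Phaedra's architecture, and every name and value is illustrative.

```python
import math


def shape_gain_quantize(vec, shape_codebook, gain_codebook):
    """Return (shape_index, gain_index) for one vector.

    shape_codebook holds unit-norm vectors; gain_codebook holds scalar
    magnitude levels. Both codebooks are illustrative assumptions.
    """
    gain = math.sqrt(sum(v * v for v in vec))
    shape = [v / gain for v in vec] if gain > 0 else list(vec)
    # Shape: choose the codeword with the largest inner product.
    s_idx = max(range(len(shape_codebook)),
                key=lambda i: sum(a * b for a, b in zip(shape, shape_codebook[i])))
    # Gain: choose the nearest scalar level.
    g_idx = min(range(len(gain_codebook)),
                key=lambda i: abs(gain - gain_codebook[i]))
    return s_idx, g_idx


def shape_gain_decode(s_idx, g_idx, shape_codebook, gain_codebook):
    """Reconstruct the vector as gain level times shape codeword."""
    return [gain_codebook[g_idx] * s for s in shape_codebook[s_idx]]
```

Because the gain has its own codebook, levels can be spaced widely (for example 0.1, 1.0, 100.0) to cover a large dynamic range without spending shape codewords on magnitude.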
[39] SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? cs.CV | cs.CE | cs.CL | cs.LG
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar
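TL;DR: The paper introduces SpatiaLab, a comprehensive benchmark for evaluating the spatial reasoning of vision-language models in realistic, unconstrained scenes. It contains 1,400 visual question-answer pairs spanning six major categories and 30 task types. Experiments show that current state-of-the-art VLMs still fall far short of humans in spatial reasoning.
Details
Motivation: Prior work relies mainly on synthetic or LLM-generated environments with limited, puzzle-like task designs that fail to capture real-world complexity, visual noise, and diverse spatial relationships. A new benchmark is therefore needed to evaluate VLMs' spatial reasoning in real-world scenes.
Result: In the multiple-choice setting, the best model, InternVL3.5-72B, reaches 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models drop by around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. This reveals key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry.
Insight: The innovation is SpatiaLab, a comprehensive and diverse real-world spatial reasoning benchmark that exposes core weaknesses of current VLMs and gives future research a clear evaluation framework and direction toward more robust, human-aligned spatial understanding.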
Abstract: Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
[40] Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers cs.CV
Peihao Xiang, Kaida Wu, Ou Bai
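TL;DR: This paper proposes Gardener, a data-free, one-shot, block-level pruning method for identifying redundant blocks in masked self-supervised vision transformers. It estimates block importance from the information entropy of pretrained block weights, requiring no data and negligible computation. Experiments show that the pruned model retains competitive transfer performance even after up to 91.7% of blocks are removed.
Details
Motivation: Masked self-supervised vision transformers are large, which makes resource-constrained deployment and efficient transfer learning difficult. The paper asks whether all transformer blocks matter equally for downstream performance and seeks a way to estimate block importance accurately without data.
Result: Gardener is evaluated on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. With negligible computational overhead, it consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning.
Insight: The innovation is the finding that the information entropy of pretrained block weights correlates strongly with oracle sensitivity obtained via iterative block removal and fine-tuning, which leads to a principled, efficient, information-theoretic route to model compression. It reveals substantial block-level redundancy in masked self-supervised vision transformers and suggests a new direction for resource-efficient transfer learning.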
Abstract: Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.
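A data-free entropy score of this kind can be sketched as follows. The abstract does not specify the entropy estimator, nor whether low- or high-entropy blocks are the ones removed, so the fixed-width histogram, the bin count, and the ascending ranking below are all assumptions made purely for illustration.

```python
import math


def weight_entropy(weights, bins=16):
    """Shannon entropy (bits) of a fixed-width histogram of weight values."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for w in weights:
        # Map each weight to a bin; the maximum value goes in the last bin.
        idx = min(int((w - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(weights)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def rank_blocks_by_entropy(blocks, bins=16):
    """Return block indices sorted from lowest to highest weight entropy."""
    scores = [weight_entropy(b, bins) for b in blocks]
    return sorted(range(len(blocks)), key=lambda i: scores[i])
```

Here each block is represented as a flat list of its weights; the score needs no input data or forward pass, which is the property the abstract emphasizes.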
[41] AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting cs.CV
Joanna Kaleta, Bartosz Świrta, Kacper Kania, Przemysław Spurek, Marek Kowalski
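TL;DR: This paper proposes AnyStyle, a single-pass feed-forward 3D reconstruction and stylization framework based on 3D Gaussian Splatting (3DGS). It supports pose-free, zero-shot stylization controlled by multimodal conditions such as text or images, and integrates into existing feed-forward 3D reconstruction backbones with only minimal architectural changes.
Details
Motivation: Stylization and appearance control remain underexplored in feed-forward 3D reconstruction, and existing attempts mostly rely on image-based conditioning, which limits controllability and flexibility. The paper addresses how to integrate multimodal style control into a pose-free 3D reconstruction pipeline.
Result: Experiments show that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that its stylization quality is superior to an existing state-of-the-art approach.
Insight: The innovation is a modular stylization architecture that accepts both textual and visual style inputs, enables zero-shot multimodal conditioning, and integrates easily with existing 3D reconstruction backbones, making 3D asset creation more flexible and scalable.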
Abstract: The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-forward 3D reconstruction and stylization framework that enables pose-free, zero-shot stylization through multimodal conditioning. Our method supports both textual and visual style inputs, allowing users to control the scene appearance using natural language descriptions or reference images. We propose a modular stylization architecture that requires only minimal architectural modifications and can be integrated into existing feed-forward 3D reconstruction backbones. Experiments demonstrate that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that AnyStyle achieves superior stylization quality compared to an existing state-of-the-art approach. Repository: https://github.com/joaxkal/AnyStyle.
[42] Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal cs.CV
Rio Aguina-Kang, Kevin James Blackburn-Matzen, Thibault Groueix, Vladimir Kim, Matheus Gadelha
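TL;DR: SeeingThroughClutter reconstructs structured 3D scenes from a single image by iteratively removing and modeling individual objects, thereby decomposing complex scenes. It uses vision-language models as orchestrators to run detection, segmentation, object removal, and 3D fitting in turn, requires no task-specific training, and benefits directly from advances in foundation models.
Details
Motivation: Existing methods rely on intermediate tasks such as semantic segmentation and depth estimation, which perform poorly in complex scenes, particularly under occlusion and clutter.
Result: The method demonstrates state-of-the-art robustness on the 3D-Front and ADE20K datasets.
Insight: The main innovation is an iterative object removal and reconstruction pipeline that decomposes a complex scene into a sequence of simpler subtasks. Removing objects yields cleaner segmentations of the objects that follow, and because no task-specific training is required, the approach scales well.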
Abstract: We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on 3D-Front and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/
[43] VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding cs.CV
Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen
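TL;DR: This paper proposes VideoBrain, an end-to-end framework for long video understanding. It learns adaptive frame sampling policies through two agents (a CLIP-based semantic retrieval agent and a uniform sampling agent), letting the vision-language model actively perceive frames and judge whether it has enough information, which improves understanding while reducing computation.
Details
Motivation: Long video understanding faces an inherent tension between computational constraints and information capture. Existing methods either sample uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices), so a method that acquires visual information adaptively is needed.
Result: On four long video benchmarks, VideoBrain improves over the baseline by 3.5% to 9.0% while using 30-40% fewer frames, and shows strong cross-dataset generalization to short video benchmarks.
Insight: The innovation is a dual-agent adaptive sampling framework in which the VLM itself perceives and reasons, together with a training mechanism that couples a behavior-aware reward function with a data classification pipeline. This keeps the model from invoking agents indiscriminately and teaches it when invocation is genuinely beneficial, balancing efficiency and performance.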
Abstract: Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
[44] SuperPoint-E: local features for 3D reconstruction via tracking adaptation in endoscopy cs.CV
O. Leon Barbed, José M. M. Montiel, Pascal Fua, Ana C. Murillo
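TL;DR: This paper presents SuperPoint-E, a local feature extraction method for Structure-from-Motion (SfM) in endoscopy videos. Using a novel Tracking Adaptation supervision strategy, it significantly improves feature detection and description in endoscopic scenes, producing denser 3D reconstructions with broader coverage.
Details
Motivation: The goal is to improve feature extraction in endoscopy videos in order to improve SfM reconstruction, addressing sparse feature detection and difficult matching in endoscopic scenes.
Result: Experiments on real endoscopy videos show that, compared with the original SuperPoint and the gold-standard SfM pipeline COLMAP, the method yields denser 3D reconstructions that cover more and longer video segments, with higher detection precision and more discriminative descriptors that make the guided matching step almost redundant.
Insight: The core innovation is the Tracking Adaptation supervision strategy, which is tailored to endoscopy video sequences so that the detector fires more densely and the descriptors are more discriminative, significantly improving the completeness and robustness of SfM reconstruction.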
Abstract: In this work, we focus on boosting the feature extraction to improve the performance of Structure-from-Motion (SfM) in endoscopy videos. We present SuperPoint-E, a new local feature extraction method that, using our proposed Tracking Adaptation supervision strategy, significantly improves the quality of feature detection and description in endoscopy. Extensive experimentation on real endoscopy recordings studies our approach’s most suitable configuration and evaluates SuperPoint-E feature quality. The comparison with other baselines also shows that our 3D reconstructions are denser and cover more and longer video segments because our detector fires more densely and our features are more likely to survive (i.e. higher detection precision). In addition, our descriptor is more discriminative, making the guided matching step almost redundant. The presented approach brings significant improvements in the 3D reconstructions obtained, via SfM on endoscopy videos, compared to the original SuperPoint and the gold standard SfM COLMAP pipeline.
[45] JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models cs.CV | cs.AI
Hiroshi Sasaki
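TL;DR: This paper introduces JSynFlow, a Japanese flowchart visual question answering dataset generated with large language models, intended to improve vision-language models' understanding of flowcharts. The dataset contains task descriptions for various occupations, flowchart images rendered from domain-specific language code, and related QA pairs. Fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks.
Details
Motivation: VLMs are expected to analyze complex documents containing flowcharts, but building large-scale datasets of flowchart images and text is time-consuming and labor-intensive. The author therefore uses LLMs to synthesize a Japanese flowchart QA dataset automatically and address the data scarcity.
Result: Fine-tuning on JSynFlow significantly improves VLM performance on flowchart-based QA tasks. The abstract does not state specific benchmarks or quantitative results.
Insight: The innovation is the automatic synthesis of a high-quality Japanese flowchart VQA dataset with LLMs, which lowers the cost of data construction and opens a new route for training VLMs on flowchart understanding.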
Abstract: Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset’s synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.
[46] Point2Insert: Video Object Insertion via Sparse Point Guidance cs.CV
Yu Zhou, Xiaoyan Yang, Bojia Zi, Lihan Zhang, Ruijie Sun
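TL;DR: Point2Insert is a sparse-point-based framework for video object insertion that enables flexible, user-friendly insertion from a few point prompts, addressing the need of existing methods for dense mask annotation and their difficulty with precise placement. It uses two-stage training and distills knowledge from a mask-guided teacher model, and it outperforms strong baselines in experiments.
Details
Motivation: Existing video object insertion methods face two major challenges: mask-based methods require labor-intensive annotation, while instruction-based methods struggle to place objects precisely. Point2Insert replaces dense masks with sparse point prompts to reduce user effort and provide fine-grained spatial control.
Result: Extensive experiments show that Point2Insert consistently outperforms strong baselines on video object insertion and even surpasses models with 10 times more parameters.
Insight: The innovations are: sparse positive and negative point prompts that give fine-grained spatial control without mask annotation; a two-stage training strategy that uses synthesized paired videos to adapt the model to video insertion; and knowledge distillation from a mask-guided teacher model to raise the insertion success rate of the point-guided model. Viewed objectively, the method strikes a good balance between ease of interaction and model efficiency.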
Abstract: This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with 10$\times$ more parameters.
[47] Partial Ring Scan: Revisiting Scan Order in Vision State Space Models cs.CV
Yi-Kuan Hsieh, Jun-Wei Hsieh, Xin li, Ming-Ching Chang, Yu-Chee Tseng
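TL;DR: This paper proposes PRISMamba, a scan-order redesign for vision state space models (Vision SSMs). It partitions the image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs, yielding an image serialization that is more robust to geometric transformations such as rotation. A partial channel filtering mechanism routes only the most informative channels through the recurrent ring pathway, improving efficiency.
Details
Motivation: When Vision SSMs serialize a 2D image into a 1D token sequence, the predefined scan order is a critical but often overlooked factor. It alters spatial adjacency, fractures object continuity, and amplifies degradation under geometric transformations such as rotation.
Result: On ImageNet-1K, PRISMamba achieves 84.5% Top-1 accuracy with 3.9G FLOPs and 3,054 img/s on an A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. Its performance holds under rotation, whereas fixed-path scans drop by 1-2%.
Insight: The innovation is a systematic study of how scan order affects Vision SSM performance, together with a rotation-robust scan strategy based on concentric rings (Partial Ring Scan) and an efficient architecture with partial channel filtering. It identifies scan-order design and channel filtering as crucial, underexplored factors for accuracy, efficiency, and rotation robustness in Vision SSMs.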
Abstract: State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on an A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1-2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
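Why ring-wise, order-agnostic aggregation helps under rotation can be seen in a toy sketch. It assumes square rings defined by distance to the nearest image border and a plain mean as the aggregator; the paper's actual ring geometry and aggregation may differ, and the function names are hypothetical.

```python
def ring_index(r, c, height, width):
    """Ring id of a cell: 0 is the outermost ring, increasing toward the center."""
    return min(r, c, height - 1 - r, width - 1 - c)


def ring_means(grid):
    """Order-agnostic aggregation: mean of the values in each ring."""
    height, width = len(grid), len(grid[0])
    sums, counts = {}, {}
    for r in range(height):
        for c in range(width):
            k = ring_index(r, c, height, width)
            sums[k] = sums.get(k, 0.0) + grid[r][c]
            counts[k] = counts.get(k, 0) + 1
    return [sums[k] / counts[k] for k in sorted(sums)]


def rotate90(grid):
    """Rotate a grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]
```

With square rings the per-ring summaries are exactly unchanged by 90-degree rotations, whereas a row-major scan of the same grid produces a different token order after rotation. Arbitrary rotation angles would need circular rings and interpolation, which this sketch does not attempt.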
[48] Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models cs.CV | cs.AI | cs.LG | cs.RO
Angel Martinez-Sanchez, Parthib Roy, Ross Greer
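TL;DR: This paper presents an instruction-conditioned driving planner built on the real-world doScenes dataset. By integrating passengers' free-form language instructions into the open-source end-to-end driving framework OpenEMMA, it enables language-guided trajectory planning and substantially improves the accuracy and robustness of planned trajectories.
Details
Motivation: Most existing instruction-following planners rely on simulation or fixed command vocabularies and generalize poorly to the real world. The paper aims for more flexible, scene-responsive human-in-the-loop driving planning through natural language instructions.
Result: Evaluated with average displacement error (ADE) on 849 annotated doScenes scenes, instruction conditioning substantially reduces extreme baseline failures, lowering mean ADE by 98.7%. Even with outliers removed, well-phrased instructions improve ADE by up to 5.1%.
Insight: doScenes is the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion. The contribution is integrating these instructions as conditioning in a vision-language-action model, providing a reproducible baseline for instruction-aware planning, and discussing what makes an instruction effective.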
Abstract: Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting. We present a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA’s vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a “good” instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: https://github.com/Mi3-Lab/doScenes-VLM-Planning
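The ADE metric used in this evaluation is, in its usual form, the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below assumes both trajectories are already expressed as (x, y) positions; converting OpenEMMA's speed-curvature outputs into positions is outside its scope, and the function name is illustrative.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between matched waypoints of two trajectories."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)
```

Because the reported figure is a mean over scenes, a handful of scenes with very large errors can dominate it, which fits the abstract's observation that most of the 98.7% reduction comes from preventing extreme failures.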
[49] VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents cs.CVPDF
Feng Wang, Yichun Shi, Ceyuan Yang, Qiushan Guo, Jingxiang Sun
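TL;DR: This paper proposes VTok, a unified video tokenization framework for both generation and understanding tasks. It decouples the spatial and temporal representations of a video, retaining the spatial features of a key frame and encoding each subsequent frame as a residual token, which yields compact yet expressive tokenization and substantially reduces representation complexity.
Details
Motivation: Leading vision-language systems typically tokenize videos with a naive frame-sampling strategy, which can be inefficient or under-representative. VTok addresses this by decoupling spatial and temporal information to obtain a more compact and consistent video representation.
Result: Across a range of video understanding and text-to-video generation benchmarks, VTok outperforms baselines that use naive tokenization (e.g., 3.4% higher accuracy on TV-Align and a 1.9% higher VBench score) while using shorter token sequences per video. In text-to-video generation, its more consistent temporal encoding yields more coherent motion and stronger instruction following.
Insight: The core contribution is a tokenization paradigm that decouples spatial and temporal latents, replacing dense per-frame tokens with a key frame plus residual tokens, which lowers computational complexity while preserving expressiveness. It offers a standardized, efficient video representation for future video understanding and generation research.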
Abstract: This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
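The product-versus-sum claim can be made concrete with a small token-count calculation. This is only a counting sketch under the description in the abstract (one key frame with full spatial tokens, one residual token per subsequent frame); the frame and token counts are hypothetical, and the function names are not from the paper.

```python
def naive_token_count(num_frames, tokens_per_frame):
    """Dense per-frame tokenization: every frame keeps all spatial tokens."""
    return num_frames * tokens_per_frame


def keyframe_residual_token_count(num_frames, tokens_per_frame):
    """One fully tokenized key frame plus one residual token per later frame."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return tokens_per_frame + (num_frames - 1)


# Hypothetical setting: 16 sampled frames, 256 spatial tokens per frame.
dense = naive_token_count(16, 256)
compact = keyframe_residual_token_count(16, 256)
ratio = dense / compact
```

With these numbers the dense sequence has 4096 tokens and the key-frame-plus-residual sequence has 271, roughly the sum of the frame count and the per-frame token count.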
[50] AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting cs.CV | cs.LGPDF
Chao Li, Rui Zhang, Siyuan Huang, Xian Zhong, Hongbo Jiang
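TL;DR: This paper proposes AGMA (Adaptive Gaussian Mixture Anchors) to address prior misalignment in pedestrian trajectory forecasting. It builds an expressive prior in two stages: extracting diverse behavioral patterns from training data, then distilling them into a scene-adaptive global prior. Experiments on ETH-UCY, Stanford Drone, and JRDB show state-of-the-art performance.
Details
Motivation: Existing methods, whether they learn or fix their priors, suffer from prior misalignment: the priors fail to capture the full distribution of plausible futures, limiting prediction accuracy and diversity. Theoretical analysis shows that prediction error is lower-bounded by prior quality, making prior modeling the performance bottleneck.
Result: Extensive experiments on the ETH-UCY, Stanford Drone, and JRDB datasets show that AGMA achieves state-of-the-art (SOTA) performance.
Insight: The contributions are a theoretical result that prior quality lower-bounds prediction error, and a two-stage adaptive prior construction: first extract diverse behavioral patterns, then distill them into a scene-adaptive global prior, improving prior expressiveness and prediction performance.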
Abstract: Human trajectory forecasting requires capturing the multimodal nature of pedestrian behavior. However, existing approaches suffer from prior misalignment. Their learned or fixed priors often fail to capture the full distribution of plausible futures, limiting both prediction accuracy and diversity. We theoretically establish that prediction error is lower-bounded by prior quality, making prior modeling a key performance bottleneck. Guided by this insight, we propose AGMA (Adaptive Gaussian Mixture Anchors), which constructs expressive priors through two stages: extracting diverse behavioral patterns from training data and distilling them into a scene-adaptive global prior for inference. Extensive experiments on ETH-UCY, Stanford Drone, and JRDB datasets demonstrate that AGMA achieves state-of-the-art performance, confirming the critical role of high-quality priors in trajectory forecasting.
[51] Adaptive 1D Video Diffusion Autoencoder cs.CVPDF
Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang
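TL;DR: This paper proposes the One-Dimensional Diffusion Video Autoencoder (One-DVA), an adaptive 1D video diffusion autoencoder that addresses the fixed-rate compression, inflexible architecture, and deterministic decoding of existing video autoencoders. The framework uses a transformer-based encoder for adaptive 1D encoding and a diffusion-based transformer decoder for video reconstruction. With a two-stage training strategy it matches 3D-CNN VAEs on reconstruction while supporting higher adaptive compression ratios, and it is tuned for downstream generation through latent-distribution regularization and decoder fine-tuning.
Details
Motivation: Existing video autoencoders have three main shortcomings: fixed-rate compression wastes tokens on simple videos, inflexible CNN architectures prevent variable-length latent modeling, and deterministic decoders struggle to recover details from compressed latents.
Result: At identical compression ratios, One-DVA achieves reconstruction metrics comparable to 3D-CNN VAEs; more importantly, it supports adaptive compression and can therefore reach higher compression ratios.
Insight: The contributions include a query-based vision transformer encoder that extracts spatiotemporal features and produces latents, a variable-length dropout mechanism that dynamically adjusts latent length, and a pixel-space diffusion transformer decoder. In addition, latent-distribution regularization and decoder fine-tuning improve generative modeling and reduce artifacts introduced by the generation process.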
Abstract: Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
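One way to picture a variable-length dropout mechanism is to keep a random-length prefix of the 1D latent sequence during training, so the decoder learns to reconstruct from shorter latents. The abstract does not specify how One-DVA drops tokens; tail truncation, the `min_keep` parameter, and the function name below are assumptions made purely for illustration.

```python
import random


def variable_length_dropout(latents, min_keep, rng):
    """Keep a random-length prefix of a 1D latent token sequence.

    ASSUMPTION: tokens are dropped from the tail; the paper may use a
    different scheme.
    """
    if not 1 <= min_keep <= len(latents):
        raise ValueError("min_keep must be between 1 and len(latents)")
    keep = rng.randint(min_keep, len(latents))  # inclusive on both ends
    return latents[:keep]


rng = random.Random(0)
latents = list(range(16))  # stand-in for 16 latent tokens
kept = variable_length_dropout(latents, min_keep=4, rng=rng)
```

Under this scheme, adaptive compression at inference time would amount to choosing how long a prefix to keep for a given video.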
[52] ACIL: Active Class Incremental Learning for Image Classification cs.CV | cs.AIPDF
Aditya R. Bhattacharya, Debanjan Goswami, Shayok Chakraborty
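TL;DR: This paper proposes ACIL, an active class incremental learning framework for image classification that incorporates active learning to reduce annotation cost during incremental learning while avoiding catastrophic forgetting.
Details
Motivation: Existing class incremental learning methods typically assume that all training samples in every incremental episode are annotated. This incurs a high annotation cost and wastes annotation effort, since data from earlier episodes is inaccessible in later ones.
Result: Extensive experiments on several vision datasets show that ACIL substantially reduces annotation cost and effectively avoids catastrophic forgetting, performing better than relevant baselines.
Insight: The contribution is to bring uncertainty and diversity criteria from active learning into class incremental learning, dynamically selecting which samples to annotate in each episode so as to improve annotation efficiency while maintaining model performance.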
Abstract: Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also results in a wastage of annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort in inducing a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode, and will be appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.
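A generic way to combine uncertainty and diversity when choosing samples to annotate is to shortlist the highest-entropy predictions and then pick a spread-out subset by greedy farthest-point selection in feature space. The abstract does not give ACIL's actual criterion, so the sketch below is a stand-in technique; `pool_factor`, the function names, and the toy data are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(probs, feats, budget, pool_factor=2):
    """Shortlist uncertain samples, then pick a diverse subset of them."""
    n = len(probs)
    budget = min(budget, n)
    if budget == 0:
        return []
    # Uncertainty: rank by entropy and keep a shortlist.
    ranked = sorted(range(n), key=lambda i: entropy(probs[i]), reverse=True)
    pool = ranked[: min(n, budget * pool_factor)]
    # Diversity: greedily add the shortlisted sample farthest from those chosen.
    chosen = [pool[0]]
    while len(chosen) < budget:
        best = max(
            (i for i in pool if i not in chosen),
            key=lambda i: min(math.dist(feats[i], feats[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


probs = [[0.5, 0.5], [0.5, 0.5], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01], [0.95, 0.05]]
feats = [(0.0, 0.0), (0.1, 0.0), (4.0, 0.0), (1.0, 0.0), (9.0, 9.0), (8.0, 8.0)]
picked = select_for_annotation(probs, feats, budget=2)
```

In this toy run, samples 0 and 1 are equally uncertain but nearly identical in feature space, so the diversity step prefers sample 2 as the second pick.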
[53] Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery cs.CVPDF
Jiaxin Cen, Xudong Mao, Guanghui Yue, Wei Zhou, Ruomei Wang
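TL;DR: This paper proposes a depth-guided, metric-aware temporal consistency framework for monocular video human mesh recovery, targeting the metric-consistency and temporal-stability problems caused by depth ambiguity and scale uncertainty. It has three cooperating components: a Depth-Guided Multi-Scale Fusion module, a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator, and a Motion-Depth Aligned Refinement (MoDAR) module.
Details
Motivation: Depth ambiguity and scale uncertainty cause metric inconsistency and temporal instability in monocular video human mesh recovery. Existing methods rely mainly on RGB features and temporal smoothing and struggle with depth ordering, scale drift, and occlusion-induced instability.
Result: The method achieves superior results on three challenging benchmarks, with significant gains in robustness under heavy occlusion and in spatial accuracy, while remaining computationally efficient.
Insight: Depth is incorporated systematically as a geometric prior throughout the mesh recovery pipeline: confidence-aware gating adaptively fuses RGB and depth features, depth-calibrated bone statistics provide scale-consistent initialization, and cross-modal attention between motion dynamics and geometric cues enforces temporal coherence, yielding metric-aware, temporally stable recovery.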
Abstract: Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in robustness against heavy occlusion and spatial accuracy while maintaining computational efficiency.
[54] Decoupled Hierarchical Distillation for Multimodal Emotion Recognition cs.CVPDF
Yong Li, Yuanzhi Wang, Yi Ding, Shiqing Zhang, Ke Lu
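TL;DR: This paper proposes Decoupled Hierarchical Multimodal Distillation (DHMD), a framework that addresses modality heterogeneity and unequal modality contributions in multimodal emotion recognition (MER). It uses a self-regression mechanism to decouple each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components, and applies a two-stage knowledge distillation strategy, a coarse-grained Graph Distillation Unit followed by fine-grained cross-modal dictionary matching, for flexible knowledge transfer and cross-modal feature alignment.
Details
Motivation: Existing MER methods still struggle with inherent modality heterogeneity and varying contributions from different modalities, and need more effective mechanisms for cross-modal feature alignment and knowledge transfer.
Result: On CMU-MOSI/CMU-MOSEI, DHMD achieves relative improvements of 1.3%/2.4% (ACC7), 1.3%/1.9% (ACC2), and 1.9%/1.8% (F1), consistently outperforming state-of-the-art MER methods.
Insight: The contribution is to decouple modality features into homogeneous and heterogeneous components and to combine coarse-grained graph distillation with fine-grained dictionary matching in a hierarchical distillation strategy. This offers a modular design for handling multimodal heterogeneity, and its dynamic graph mechanism and semantic-granularity alignment are broadly applicable.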
Abstract: Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality’s features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3%/2.4% (ACC$_7$), 1.3%/1.9% (ACC$_2$) and 1.9%/1.8% (F1) relative improvement on CMU-MOSI/CMU-MOSEI dataset, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.
[55] KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing cs.CVPDF
Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He
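TL;DR: This paper proposes KVSmooth, a training-free, plug-and-play method for mitigating hallucination in multimodal large language models (MLLMs). It applies attention-entropy-guided adaptive exponential moving average (EMA) smoothing to the keys and values in the KV-Cache, suppressing semantic drift during decoding so that outputs stay more faithful to the visual input.
Details
Motivation: During decoding, MLLMs often drift semantically and hallucinate objects, attributes, or relations that are inconsistent with the visual input, which hinders reliable deployment. Existing remedies either require retraining or are computationally expensive, so an efficient training-free method is needed.
Result: The CHAIR_S hallucination score drops from 41.8 to 18.2, while overall performance (F1) rises from 77.5 to 79.2, improving precision and recall simultaneously; prior methods often improve one at the expense of the other.
Insight: The contribution is a mechanism that dynamically quantifies each token's "sink" degree through its attention entropy and adapts the KV-Cache smoothing strength accordingly. It is an efficient, training-free, plug-and-play inference-time technique that offers a new angle on mitigating MLLM hallucination.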
Abstract: Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination – corresponding to the generation of visually inconsistent objects, attributes, or relations – remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
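The combination of an EMA over cached keys and values with an entropy-dependent strength can be sketched as follows. The abstract does not state how entropy maps to smoothing strength, so the rule used here (strength proportional to normalized entropy, capped at `max_strength`) is an assumption, as are all function and parameter names; vectors are plain Python lists to keep the example dependency-free.

```python
import math


def normalized_entropy(attn):
    """Entropy of an attention distribution, scaled to [0, 1]."""
    n = len(attn)
    if n < 2:
        return 0.0
    h = -sum(a * math.log(a) for a in attn if a > 0.0)
    return h / math.log(n)


def smooth_kv(prev_vec, new_vec, attn, max_strength=0.5):
    """EMA-blend a new key or value vector with the previous smoothed one.

    ASSUMPTION: higher attention entropy means stronger smoothing; the
    paper's actual mapping from "sink degree" to strength may differ.
    """
    beta = max_strength * normalized_entropy(attn)
    return [beta * p + (1.0 - beta) * x for p, x in zip(prev_vec, new_vec)]


prev_key = [1.0, 1.0]
new_key = [3.0, 5.0]
diffuse = smooth_kv(prev_key, new_key, [0.25, 0.25, 0.25, 0.25])  # uniform attention
peaked = smooth_kv(prev_key, new_key, [1.0, 0.0, 0.0, 0.0])       # one-hot attention
```

With uniform attention the new key is pulled halfway toward the previous one; with one-hot attention it is left unchanged. The same update would be applied to the value vectors.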
[56] SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization cs.CV | cs.AI | cs.GRPDF
Lifan Wu, Ruijie Zhu, Yubo Ai, Tianzhu Zhang
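TL;DR: The paper proposes SkeletonGaussian, a framework for generating editable dynamic 3D Gaussians from monocular video input. It introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion, enabling direct control and editing of the generated object's motion.
Details
Motivation: Existing 4D generation methods usually represent motion as an implicit deformation field, which limits direct control and editability. The paper aims to develop a framework that generates editable dynamic 3D objects.
Result: Experiments show that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation.
Insight: The contribution is a hierarchical articulated representation that combines explicit skeleton-driven motion (via linear blend skinning) with hexplane-based refinement of non-rigid deformations, improving interpretability and editability.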
Abstract: 4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/
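Linear blend skinning, the rigid-motion step named in the abstract, moves a point by a weighted sum of per-bone rigid transforms: v' = sum_k w_k (R_k v + t_k). The sketch below applies it to a single 3D point, such as a Gaussian center; how SkeletonGaussian assigns skinning weights or transforms Gaussian covariances is not described in the abstract and is not modeled here.

```python
def apply_rigid(transform, point):
    """Apply a (3x3 rotation, 3-vector translation) pair to a 3D point."""
    rotation, translation = transform
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def linear_blend_skinning(point, transforms, weights):
    """Blend per-bone rigid transforms of a point using skinning weights."""
    if len(transforms) != len(weights):
        raise ValueError("one weight per bone transform is required")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    blended = [0.0, 0.0, 0.0]
    for transform, w in zip(transforms, weights):
        moved = apply_rigid(transform, point)
        blended = [b + w * m for b, m in zip(blended, moved)]
    return blended


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bone_a = (identity, [0.0, 0.0, 0.0])  # bone that stays in place
bone_b = (identity, [2.0, 0.0, 0.0])  # bone translated along x
moved = linear_blend_skinning([1.0, 1.0, 1.0], [bone_a, bone_b], [0.5, 0.5])
```

A point influenced equally by both bones moves half of the second bone's translation, ending at (2, 1, 1). Editing a bone transform therefore moves every point bound to that bone, which is what makes a skeleton-driven representation directly editable.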
[57] Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement cs.CV | cs.AI | cs.CLPDF
Zipeng Zhu, Zhanghao Hu, Qinglin Zhu, Yuxi Hong, Yijun Liu
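TL;DR: This paper takes a dynamic view of visual grounding. A layer-wise sensitivity analysis shows that grounding is a dynamic process: simple tasks rely on middle layers, whereas complex tasks require visual information to be reactivated at deeper layers. Building on this, the authors introduce the VAQ metric to identify the attention layer most relevant to the query, and LASER, a training-free adaptive inference method that selects task-appropriate layers for visual localization and question answering.
Details
Motivation: Large vision-language models (LVLMs) usually resize images to a uniform resolution, which loses detail and causes hallucination. Attention-guided enhancement methods (e.g., cropping) tend to depend on a static "magic layer" chosen empirically on simple recognition benchmarks, which transfers poorly to complex reasoning tasks.
Result: Experiments on multiple VQA benchmarks show that LASER significantly improves VQA accuracy across tasks of varying complexity.
Insight: The key contribution is showing that visual grounding is a dynamic, layer-dependent process, together with the VAQ metric and the LASER method, which provide training-free visual localization and decoding enhancement that adapts to task complexity and moves beyond static cropping or fixed layer selection.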
Abstract: Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static “magic layer” empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
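The idea of choosing a layer by how strongly its attention responds to the query can be sketched with a simple divergence score. The abstract does not define VAQ precisely; the version below, which compares each layer's image-patch attention under the real query against attention under a generic prompt using KL divergence, is an assumed stand-in, and every name and number in it is hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over image patches."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def most_query_sensitive_layer(attn_with_query, attn_generic):
    """Return the index of the layer whose attention shifts most, plus all scores."""
    scores = [kl_divergence(p, q) for p, q in zip(attn_with_query, attn_generic)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Three layers, four image patches; attention under a generic prompt is uniform.
generic = [[0.25, 0.25, 0.25, 0.25]] * 3
with_query = [
    [0.25, 0.25, 0.25, 0.25],  # layer 0: unaffected by the query
    [0.70, 0.10, 0.10, 0.10],  # layer 1: focuses sharply on one patch
    [0.40, 0.20, 0.20, 0.20],  # layer 2: mild shift
]
best_layer, layer_scores = most_query_sensitive_layer(with_query, generic)
```

Here layer 1 is selected; its attention map would then be the one used for cropping or region-focused enhancement.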
[58] JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction cs.CVPDF
Zihan Lou, Jinlong Fan, Sihan Ma, Yuxiang Yang, Jing Zhang
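TL;DR: JOintGS is a framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations to reconstruct high-fidelity, animatable 3D human avatars from monocular RGB video. Through foreground-background disentanglement and a synergistic refinement mechanism, it handles inaccurate camera and pose estimates in in-the-wild scenes and supports real-time rendering.
Details
Motivation: In unconstrained in-the-wild monocular video, camera parameters and human poses estimated by off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate, which limits high-quality avatar reconstruction based on 3D Gaussian Splatting (3DGS).
Result: Experiments on the NeuMan and EMDB datasets show that JOintGS improves PSNR by 2.1 dB over state-of-the-art (SOTA) methods on NeuMan, achieving better reconstruction quality while maintaining real-time rendering and showing greater robustness to noisy initialization.
Insight: The core idea is that explicit foreground-background disentanglement enables joint refinement of cameras, bodies, and 3D Gaussians: static background Gaussians anchor camera estimation through multi-view consistency, refined cameras improve human body alignment through accurate temporal correspondence, and optimized human poses improve scene reconstruction by removing dynamic artifacts from static constraints. A temporal dynamics module additionally captures pose-dependent deformations, and a residual color field models illumination changes.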
Abstract: Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. While recent 3D Gaussian Splatting (3DGS) advances demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with 2.1 dB PSNR improvement over state-of-the-art methods on NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the baseline. Our source code is available at https://github.com/MiliLab/JOintGS.
[59] Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner cs.CV | cs.AIPDF
Qian-Wei Wang, Guanghao Meng, Ren Cai, Yaguang Song, Shu-Tao Xia
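TL;DR: This paper proposes CoFT, a human-annotation-free fine-tuning framework for vision-language models (VLMs), and its enhanced variant CoFT+. A dual-model, cross-modal collaboration mechanism uses positive and negative textual prompts to explicitly model sample-dependent pseudo-label cleanliness, avoiding hand-crafted thresholds or noise assumptions. A two-phase training scheme (from parameter-efficient fine-tuning to full fine-tuning), together with iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts in CoFT+, enables effective adaptation on unlabeled data.
Details
Motivation: Adapting large-scale VLMs such as CLIP to downstream tasks usually requires costly labeled data, and existing unsupervised self-training methods suffer from unreliable confidence filtering, confirmation bias, and underuse of low-confidence samples.
Result: Extensive experiments on multiple benchmarks show consistent gains over existing unsupervised methods and even over few-shot supervised baselines.
Insight: The core contribution is a collaborative fine-tuning framework based on dual-prompt learning (positive and negative prompts) that models pseudo-label quality explicitly and per sample, removing the need for preset thresholds or noise models. The negative prompt also acts as a regularizer that makes lightweight visual adaptation modules more robust under noisy supervision. CoFT+ further adds iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts to strengthen adaptation.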
Abstract: Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
[60] Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning cs.CV | cs.AIPDF
Qian-Wei Wang, Yaguang Song, Shu-Tao Xia
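TL;DR: This paper proposes an explicit uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. It introduces two learnable prompts in CLIP's textual branch: a positive prompt that makes task-specific textual embeddings more discriminative and thus classification more reliable, and a negative prompt, trained in a reversed manner, that explicitly models the probability that the predicted label is correct and so provides a principled uncertainty signal for active sample selection.
Details
Motivation: Pre-trained vision-language models such as CLIP transfer well, but adapting them to downstream image classification under a limited annotation budget remains challenging. Existing active learning methods typically estimate uncertainty with entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model's perspective.
Result: Extensive experiments across different fine-tuning paradigms show that the method consistently outperforms existing active learning methods under the same annotation budget.
Insight: The contribution is explicit uncertainty modeling through dual-prompt tuning: the negative prompt, trained in a reversed manner, directly estimates the probability that a prediction is correct, giving active sample selection a more reliable and principled uncertainty signal than entropy or clustering alone.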
Abstract: Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to lightweight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
[61] When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models cs.CVPDF
Jaehyun Kwak, Nam Cao, Boryeong Cho, Segyu Lee, Sumyeong Ahn
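TL;DR: This paper proposes SAGA, an adversarial attack on large vision-language models (LVLMs) that concentrates perturbations on high-attention regions stage by stage. This uses the limited per-pixel perturbation budget more efficiently, produces highly imperceptible adversarial examples, and achieves state-of-the-art attack success rates across ten LVLMs.
Details
Motivation: Existing attacks based on input transformations such as random cropping are stochastic and do not use the limited per-pixel perturbation budget efficiently. The authors observe that regional attention scores correlate positively with adversarial loss sensitivity, and that attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions, which motivates a more efficient attention-guided attack framework.
Result: Evaluated on ten LVLMs, SAGA consistently achieves state-of-the-art attack success rates under constrained perturbation budgets.
Insight: The contribution is to use the model's internal attention to guide where adversarial perturbations are placed, focusing on high-attention regions stage by stage for a more efficient attack. This offers a new perspective on model vulnerabilities and on designing more robust defenses.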
Abstract: Adversarial attacks against Large Vision-Language Models (LVLMs) are crucial for exposing safety vulnerabilities in modern multimodal systems. Recent attacks based on input transformations, such as random cropping, suggest that spatially localized perturbations can be more effective than global image manipulation. However, randomly cropping the entire image is inherently stochastic and fails to use the limited per-pixel perturbation budget efficiently. We make two key observations: (i) regional attention scores are positively correlated with adversarial loss sensitivity, and (ii) attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions. Based on these findings, we propose Stage-wise Attention-Guided Attack (SAGA), an attention-guided framework that progressively concentrates perturbations on high-attention regions. SAGA enables more efficient use of constrained perturbation budgets, producing highly imperceptible adversarial examples while consistently achieving state-of-the-art attack success rates across ten LVLMs. The source code is available at https://github.com/jackwaky/SAGA.
[62] Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture cs.CV | cs.AIPDF
Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma
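TL;DR: This paper presents the UltraSeg family of ultra-lightweight models for real-time colonoscopic polyp segmentation on commodity CPUs. Through extreme model compression (fewer than 0.3M parameters), combined with joint optimization of encoder-decoder widths, constrained dilated convolutions, and a cross-layer lightweight fusion module, the models reach 90 FPS on a single CPU core while retaining more than 94% of the Dice score of a much larger U-Net across seven public datasets, offering a ready-to-deploy solution for resource-constrained clinical settings.
Details
Motivation: Current high-precision polyp segmentation models depend on GPUs, making real-time deployment impractical in resource-constrained settings such as primary hospitals, mobile endoscopy units, and capsule robots.
Result: Evaluated on seven public datasets, UltraSeg uses only 0.4% of the parameters of a 31M-parameter U-Net yet retains more than 94% of its Dice score, and runs at 90 FPS on a single CPU core, establishing a strong, clinically viable baseline for the extreme-compression regime.
Insight: The contributions are (1) joint optimization of encoder-decoder widths in the extreme-compression regime, (2) constrained dilated convolutions that enlarge the receptive field, and (3) a cross-layer lightweight fusion module. Together they provide a reproducible, CPU-native blueprint for colonoscopy and for broader minimally invasive surgical vision applications.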
Abstract: Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (<0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains >94% of the Dice score of a 31 M-parameter U-Net while utilizing only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
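The headline numbers can be sanity-checked with simple arithmetic. The sketch below uses only the parameter counts and frame rate quoted in the abstract; the variable and function names are illustrative.

```python
UNET_PARAMS_M = 31.0  # reference U-Net size, in millions of parameters
ULTRASEG_PARAMS_M = {"UltraSeg-108K": 0.108, "UltraSeg-130K": 0.13}


def parameter_share_percent(model_params_m, reference_params_m=UNET_PARAMS_M):
    """Model size as a percentage of the reference model's parameter count."""
    return 100.0 * model_params_m / reference_params_m


shares = {
    name: parameter_share_percent(params)
    for name, params in ULTRASEG_PARAMS_M.items()
}

# At 90 FPS, each frame must be processed within this many milliseconds.
frame_budget_ms = 1000.0 / 90
```

Both variants come out at roughly 0.3% to 0.4% of the U-Net's parameters, consistent with the quoted 0.4%, and 90 FPS corresponds to a budget of about 11 ms per frame on one CPU core.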
[63] Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion cs.CV | cs.MMPDF
Yixin Zhu, Long Lv, Pingping Zhang, Xuehu Liu, Tongdan Tang
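TL;DR: This paper proposes Interactive Spatial-Frequency Fusion Mamba (ISFM), a framework for multi-modal image fusion (MMIF). A Modality-Specific Extractor (MSE) first extracts long-range dependency features with linear computational complexity; a Multi-scale Frequency Fusion (MFF) module then adaptively integrates low- and high-frequency components across scales; finally, an Interactive Spatial-Frequency Fusion (ISF) module uses frequency features to guide spatial feature fusion across modalities, enhancing complementary representations.
Details
Motivation: Existing MMIF methods bring in frequency-domain information to enhance spatial features but usually combine the two through simple serial or parallel fusion without interaction. The paper targets this lack of interaction in order to better retain texture details and significant information.
Result: Extensive experiments on six MMIF datasets show that ISFM outperforms other state-of-the-art (SOTA) methods.
Insight: The main contribution is the ISF mechanism, which uses frequency features to guide cross-modal spatial feature fusion and thereby strengthens feature complementarity. The framework also uses the linear-complexity Mamba architecture to model long-range dependencies and MFF to represent frequency-domain features robustly.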
Abstract: Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.
[64] Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare cs.CV | cs.AIPDF
Aavash Chhetri, Bibek Niroula, Pratik Shrestha, Yash Raj Shrestha, Lesley A Anderson
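TL;DR: This paper introduces Med-MMFL, the first comprehensive multimodal federated learning (MMFL) benchmark for the medical domain, intended to address the limited modality diversity, task coverage, and standardized evaluation of existing medical FL benchmarks. It spans datasets with 2 to 4 modalities, covering 10 unique medical modalities including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences, and evaluates segmentation, classification, modality alignment (retrieval), and visual question answering (VQA) under naturally federated, synthetic IID, and synthetic non-IID settings.
Details
Motivation: Existing medical FL benchmarks focus mainly on unimodal or bimodal data, cover a limited range of medical tasks, and lack standardized evaluation, which hinders a systematic understanding of medical MMFL. A comprehensive benchmark is needed to advance the field.
Result: The benchmark evaluates six representative state-of-the-art (SOTA) FL algorithms covering different aggregation strategies, loss formulations, and regularization techniques, with experiments across multiple federation scenarios that simulate real-world heterogeneity, providing a basis for reproducible and fair comparison of future MMFL methods.
Insight: The contribution is the first comprehensive medical MMFL benchmark, integrating a wide range of modalities, tasks, and federation scenarios, with the complete implementation (including data processing and partitioning pipelines) publicly released to support reproducibility and standardized evaluation.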
Abstract: Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at https://github.com/bhattarailab/Med-MMFL-Benchmark .
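For readers unfamiliar with aggregation strategies in FL, the standard FedAvg rule averages client parameters weighted by local dataset size. The abstract does not name the six benchmarked algorithms, so FedAvg is shown here only as a familiar example; parameters are flat Python lists and the client sizes are made up.

```python
def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes))
        / total
        for i in range(dim)
    ]


# Two hypothetical hospitals: the second holds three times as much data.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

The larger client dominates the average, which is one reason non-IID partitions, where clients hold differently distributed data, make federated training harder than the IID case.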
[65] TrajVG: 3D Trajectory-Coupled Visual Geometry Learning cs.CVPDF
Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Yu, Mulin Yu
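TL;DR: This paper proposes TrajVG, a framework that addresses the degradation of multi-frame 3D reconstruction on videos with object motion by explicitly predicting 3D trajectories in camera coordinates. It couples sparse trajectories, per-frame local point maps, and relative camera poses through geometric consistency objectives, and uses pseudo 2D tracks for self-supervised training.
Details
Motivation: Feed-forward multi-frame 3D reconstruction models degrade on videos with object motion: the global reference becomes ambiguous, and local point maps depend on estimated poses, leading to drift and cross-frame misalignment.
Result: Experiments on 3D tracking, pose estimation, point map reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
Insight: Cross-frame correspondence is established by explicitly predicting 3D trajectories, combined with a bidirectional trajectory-pointmap consistency constraint and a pose consistency objective driven by static track anchors; pseudo 2D tracks enable unified training with mixed supervision.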
Abstract: Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. Global-reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feedforward performance baseline.
[66] Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search cs.CVPDF
Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin
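TL;DR: This paper proposes Seg-ReSearch, a new segmentation paradigm that interleaves reasoning with external search to overcome the knowledge bottleneck of existing methods, allowing segmentation systems to handle dynamic, open-world queries beyond the frozen knowledge of MLLMs.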
Details
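Motivation: Existing segmentation systems built on multimodal large language models are limited by their frozen internal knowledge and struggle with real-world queries that involve up-to-date information or domain-specific concepts.
Result: Experiments on the purpose-built, challenging OK-VOS benchmark and on two existing reasoning segmentation benchmarks show that Seg-ReSearch substantially improves on state-of-the-art approaches.
Insight: The main contribution is interleaving an external search mechanism with the reasoning process to extend the system's knowledge boundary. A hierarchical reward design balances initial guidance with progressive incentives, easing the tension between sparse outcome signals and rigid step-wise supervision.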
Abstract: Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
[67] Temporal Slowness in Central Vision Drives Semantic Object Learning cs.CVPDF
Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
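TL;DR: This study examines the role of central vision and temporal slowness in forming semantic object representations from human-like visual experience. It simulates five months of human visual experience with the Ego4D dataset, generates gaze coordinates with a state-of-the-art gaze prediction model, extracts crops that mimic central vision, and trains a time-contrastive self-supervised learning model on them. Combining temporal slowness with central vision improves the encoding of different semantic facets of object representations.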
Details
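Motivation: The goal is to understand how humans acquire semantic object representations from egocentric visual streams with minimal supervision, and in particular what role high-resolution processing in central vision and learning based on temporal proximity play in forming semantics.
Result: In experiments that simulate human visual experience on the Ego4D dataset, combining temporal slowness with central vision improves the semantic encoding of object representations. Central vision strengthens the extraction of foreground object features, while temporal slowness (especially during fixational eye movements) helps encode broader semantic information about objects.
Insight: The contribution is combining simulated central vision with temporal slowness learning, showing that high-resolution processing of the gaze region and temporal continuity both matter for semantic learning in the human visual system. This offers a biologically inspired design direction for self-supervised learning.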
Abstract: Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
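The abstract's "time-contrastive" objective pulls together representations of inputs that occur close in time. A minimal sketch of such a loss is shown below, assuming an InfoNCE form in which frame t+1 is the positive for frame t and all other frames in the sequence act as negatives; the exact loss, temperature, and sampling used in the paper are not stated in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def time_contrastive_loss(embeddings, temperature=0.1):
    """Mean InfoNCE loss over consecutive frame pairs (t, t+1).

    embeddings: list of per-crop feature vectors in temporal order.
    """
    n = len(embeddings)
    total = 0.0
    for t in range(n - 1):
        # Similarities from frame t to every other frame, scaled by temperature.
        logits = [cosine(embeddings[t], embeddings[j]) / temperature
                  for j in range(n) if j != t]
        # With frame t removed from the list, frame t+1 sits at position t.
        positive = logits[t]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - positive
    return total / (n - 1)
```

The loss is low when temporally adjacent crops have similar embeddings, which is the slowness property the study investigates.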
[68] Vision-aligned Latent Reasoning for Multi-modal Large Language Model cs.CVPDF
Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin
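TL;DR: This paper proposes Vision-aligned Latent Reasoning (VaLR), a reasoning framework for multimodal large language models (MLLMs) that targets the progressive dilution of visual information in tasks requiring multi-step reasoning. VaLR dynamically generates vision-aligned latent tokens before each chain-of-thought step, guiding the model to reason from perceptual cues in the latent space.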
Details
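Motivation: Although existing MLLMs have advanced on many understanding tasks, they perform poorly on problems that need extensive multi-step reasoning. The main cause is that visual information is progressively diluted during long-context generation, which prevents the model from fully exploiting test-time scaling.
Result: VaLR consistently outperforms existing methods on multiple benchmarks that require long-context understanding or precise visual perception, and shows test-time scaling behavior not observed in prior MLLMs. On VSI-Bench, performance rises from 33.0% to 52.9%, a gain of 19.9 percentage points over Qwen2.5-VL.
Insight: The contribution is a simple and effective vision-aligned latent reasoning framework that preserves visual knowledge during reasoning by aligning the MLLM's intermediate embeddings with those of vision encoders. Viewed objectively, dynamically generating vision-aligned latent tokens to counter visual information dilution is a novel architectural idea that other work can borrow.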
Abstract: Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.
[69] S-MUSt3R: Sliding Multi-view 3D Reconstruction cs.CV | cs.ROPDF
Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, Jerome Revaud
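TL;DR: This paper proposes S-MUSt3R, a simple and efficient pipeline for monocular 3D reconstruction. Through sequence segmentation, segment alignment, and lightweight loop closure optimization, it works around the memory limits that foundation models face on large-scale RGB stream reconstruction, producing accurate, consistent metric-space reconstructions without retraining the model.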
Details
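Motivation: Existing 3D vision foundation models handle uncalibrated images well but are hard to extend to large-scale RGB stream 3D reconstruction because of memory limitations.
Result: Evaluated on TUM, 7-Scenes, and proprietary robot navigation datasets, S-MUSt3R runs successfully on long RGB sequences and produces accurate, consistent 3D reconstructions, with trajectory and reconstruction performance comparable to traditional methods that use more complex architectures.
Insight: The contribution is a sequence segmentation and alignment strategy that requires no retraining of the foundation model, which removes its memory scalability bottleneck and outputs predictions directly in metric space, showing potential for scalable monocular 3D scene reconstruction in practical applications.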
Abstract: The recent paradigm shift in 3D vision led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without model retraining, we benefit from the remarkable 3D reconstruction capacities of the MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architectures. We evaluate S-MUSt3R on TUM, 7-Scenes, and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene reconstruction in real-world settings, with the important advantage of making predictions directly in the metric space.
[70] Understanding Degradation with Vision Language Model cs.CVPDF
Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang
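TL;DR: This paper proposes DU-VLM, a hierarchical structured prediction approach based on a vision-language model (VLM) for understanding image degradation, covering the estimation of degradation types, parameter keys, and their continuous physical values. The sub-tasks are unified under an autoregressive next-token prediction paradigm and trained with supervised fine-tuning and reinforcement learning. The method can also act as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning. The paper also introduces DU-110k, a large-scale dataset of 110,000 clean-degraded image pairs.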
Details
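Motivation: Current vision-language models fall short in understanding the parametric physics behind image degradations. The paper recasts degradation understanding as a hierarchical structured prediction task that jointly estimates degradation types, parameter keys, and continuous physical values.
Result: In extensive experiments, the method significantly outperforms generalist baselines in accuracy and robustness, generalizes to unseen distributions, and reaches state-of-the-art (SOTA) results on the relevant benchmarks.
Insight: The contributions are unifying degradation understanding under an autoregressive next-token prediction paradigm whose error is bounded by the value-space quantization grid; the DU-VLM model, trained with supervised fine-tuning and reinforcement learning; and its use as a zero-shot controller for pre-trained diffusion models, enabling image restoration without fine-tuning. Viewed objectively, structured rewards and hierarchical prediction improve the accuracy and generalization of degradation parameter estimation.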
Abstract: Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
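The claim that the prediction error is bounded by the value-space quantization grid can be illustrated with a small sketch. It assumes a uniform grid over a known value range, so that a continuous parameter is emitted as a discrete bin token; the grid type, range, and function names are assumptions for illustration, not the paper's construction.

```python
def quantize(value, low, high, num_bins):
    """Map a continuous value to the index of the nearest uniform grid point."""
    step = (high - low) / (num_bins - 1)
    clipped = min(max(value, low), high)  # values outside the range are clipped
    return round((clipped - low) / step)

def dequantize(index, low, high, num_bins):
    """Recover the grid value that a bin index stands for."""
    step = (high - low) / (num_bins - 1)
    return low + index * step
```

Under these assumptions, a correctly predicted token reproduces any in-range value to within half a grid step, so a finer grid tightens the bound at the cost of a larger token vocabulary.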
[71] SalFormer360: a transformer-based saliency estimation model for 360-degree videos cs.CVPDF
Mahmoud Z. A. Wahba, Francesco Barbato, Sara Baldoni, Federica Battisti
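TL;DR: This paper proposes SalFormer360, a new transformer-based saliency estimation model for 360-degree videos. It combines the existing SegFormer encoder with a custom decoder and introduces a Viewing Center Bias to improve prediction accuracy.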
Details
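Motivation: Saliency estimation for 360-degree video supports applications such as viewport prediction and immersive content optimization.
Result: In extensive experiments on the three largest saliency estimation benchmark datasets (Sport360, PVS-HM, VR-EyeTracking), the model improves the Pearson Correlation Coefficient over the previous SOTA by 8.4%, 2.5%, and 18.6% respectively, setting a new SOTA.
Insight: The contribution is fine-tuning and adapting the SegFormer encoder to 360-degree content, pairing it with a custom decoder, and introducing a Viewing Center Bias that models how user attention is distributed in 360-degree environments, which clearly improves saliency prediction accuracy.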
Abstract: Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.
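The reported gains are measured with the Pearson Correlation Coefficient between predicted and ground-truth saliency maps. The sketch below shows the plain, unweighted form of that metric on maps given as nested lists; whether the paper applies any sphere-aware weighting for equirectangular frames is not stated in the abstract, so none is assumed here.

```python
import math

def pearson_cc(pred, target):
    """Pearson correlation between two equally sized 2D saliency maps.

    Undefined (division by zero) if either map is constant.
    """
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_t = sum((b - mean_t) ** 2 for b in t)
    return cov / math.sqrt(var_p * var_t)
```

The score is invariant to positive scaling and offset of either map, so it compares the spatial pattern of saliency rather than its absolute magnitude.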
[72] ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry cs.CVPDF
Marcin Możejko, Dawid Uchal, Krzysztof Gogolewski, Piotr Kupidura, Szymon Łukasik
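TL;DR: ImmuVis is an efficient convolutional foundation model for imaging mass cytometry (IMC). Its marker-adaptive hyperconvolutions let it process arbitrary subsets of measured markers without retraining, and it outperforms existing methods on virtual staining and downstream classification tasks.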
Details
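Motivation: Multiplex imaging lacks a fixed channel space because marker sets differ across real-world studies, which violates a core assumption of standard vision backbones.
Result: After pretraining on the IMC17M dataset, ImmuVis outperforms SOTA baselines on virtual staining and downstream classification at substantially lower compute cost than transformer-based alternatives, and provides calibrated uncertainty through a heteroscedastic likelihood objective.
Insight: The contribution is the marker-adaptive hyperconvolution, which generates convolutional kernels from learned marker embeddings so that a single model adapts to arbitrary marker subsets. Viewed objectively, its efficiency and calibrated uncertainty are of practical value for IMC modeling.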
Abstract: We present ImmuVis, an efficient convolutional foundation model for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest to-date dataset, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms SOTA baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical, efficient foundation model for real-world IMC modeling.
[73] A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction cs.CVPDF
Raúl Jiménez Cruz, César Torres-Huitzil, Marco Franceschetti, Ronny Seiger, Luciano García-Bañuelos
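TL;DR: This paper presents a medical AI dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. The images were extracted from high-definition videos, filtered with the Structural Similarity Index Measure (SSIM) to reduce redundancy, and automatically face-anonymized. Each image carries polygon segmentation annotations for five medically relevant classes (syringe, rubber band, disinfectant wipe, gloves, training arm) in a format compatible with modern object detection frameworks such as YOLOv8. The dataset is split into training, validation, and test subsets and is intended to advance research on medical training automation and human-object interaction.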
Details
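Motivation: Medical training lacks high-quality, finely annotated datasets of the phlebotomy procedure. Such data would support medical AI applications including tool detection, procedural step recognition, workflow analysis, and educational feedback systems.
Result: The dataset contains 11,884 images, divided into training (70%), validation (15%), and test (15%) subsets, with publicly available annotation files.
Insight: The contribution is a dataset with fine-grained polygon segmentation annotations for a specific medical procedure (phlebotomy), together with data curation steps such as face anonymization and SSIM filtering that improve its practicality and privacy protection. It can be used directly for object detection and interaction analysis.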
Abstract: This data article presents a dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. Images were extracted from high-definition videos recorded under controlled conditions and curated to reduce redundancy using Structural Similarity Index Measure (SSIM) filtering. An automated face-anonymization step was applied to all videos prior to frame selection. Each image contains polygon annotations for five medically relevant classes: syringe, rubber band, disinfectant wipe, gloves, and training arm. The annotations were exported in a segmentation format compatible with modern object detection frameworks (e.g., YOLOv8), ensuring broad usability. This dataset is partitioned into training (70%), validation (15%), and test (15%) subsets and is designed to advance research in medical training automation and human-object interaction. It enables multiple applications, including phlebotomy tool detection, procedural step recognition, workflow analysis, conformance checking, and the development of educational systems that provide structured feedback to medical trainees. The data and accompanying label files are publicly available on Zenodo.
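For readers unfamiliar with the YOLO-style segmentation label format mentioned in the abstract, the sketch below converts one pixel-space polygon into a label line: a class index followed by x and y coordinates normalized by image width and height. The class index, image size, and function name are hypothetical; the dataset's actual class-to-index mapping is defined in its own label files.

```python
def polygon_to_yolo_line(class_id, polygon, img_width, img_height):
    """Format one polygon as 'class x1 y1 x2 y2 ...' with coordinates in [0, 1].

    polygon: list of (x, y) vertex tuples in pixel coordinates.
    """
    coords = []
    for x, y in polygon:
        coords.append(f"{x / img_width:.6f}")
        coords.append(f"{y / img_height:.6f}")
    return f"{class_id} " + " ".join(coords)

# Hypothetical example: class index 0 on a 1920x1080 frame.
line = polygon_to_yolo_line(0, [(0, 0), (960, 0), (960, 540)], 1920, 1080)
```

Each image gets one text file with one such line per annotated object, which is what makes the annotations directly loadable by YOLO-family training tools.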
[74] PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective cs.CVPDF
Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan
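TL;DR: This paper proposes PIO-FVLM, a training-free method for accelerating vision-language models that rethinks visual token compression from the perspective of the inference objective. It designs a layer-local proxy loss to score token importance and selects key tokens following the non-maximum suppression principle, sharply reducing the number of visual tokens while preserving model performance.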
Details
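Motivation: Existing visual token compression methods rely on heuristics (such as similarity between visual tokens or cross-modal similarity) and are limited in both compression performance and practical deployment. This paper starts from the more fundamental inference objective, keeping the output unchanged, to design a more effective training-free compression method.
Result: On LLaVA-Next-7B, PIO-FVLM retains only 11.1% of visual tokens while maintaining 97.2% of the original performance, with a 2.67× prefill speedup, a 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× lower KV Cache overhead.
Insight: The core contribution is recasting visual token compression as preserving output invariance, with token importance ranked by the gradient of a layer-local proxy loss and tokens selected by NMS. The method needs no training, is compatible with FlashAttention, and can be used as a standalone encoder-free method or combined with encoder compression methods, which makes it practical and deployment-friendly.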
Abstract: Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67× prefill speedup, 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
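The selection step described in the abstract, ranking tokens by saliency and then applying the NMS principle, can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: the saliency scores are taken as given (in the paper they come from the gradient of the layer-local proxy loss), and suppression by cosine similarity above a threshold is an assumed criterion, not the paper's confirmed rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nms_select_tokens(features, saliency, keep, sim_threshold=0.9):
    """Greedily keep high-saliency tokens, skipping near-duplicates of kept ones.

    May return fewer than `keep` indices if most tokens are suppressed.
    """
    order = sorted(range(len(features)), key=lambda i: saliency[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == keep:
            break
        # Suppress token i if it is too similar to any already selected token.
        if all(cosine(features[i], features[j]) < sim_threshold for j in selected):
            selected.append(i)
    return sorted(selected)  # restore original token order
```

Compared with plain top-k selection, the suppression step spends the token budget on salient tokens that are also distinct from one another.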
[75] AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation cs.CV | cs.GR | cs.ROPDF
Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu
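TL;DR: AGILE is a robust framework for reconstructing hand-object interactions from monocular video. It shifts the paradigm from reconstruction to agentic generation to address two weaknesses of existing methods: incomplete geometry under heavy occlusion and reliance on brittle SfM initialization. A vision-language model guides the generation of a complete, watertight object mesh, and an anchor-and-track strategy with contact-aware optimization ensures physical plausibility.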
Details
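Motivation: Existing methods for hand-object interaction reconstruction from monocular video face two main barriers: reliance on neural rendering yields fragmented geometry that is not simulation-ready, and dependence on brittle Structure-from-Motion initialization causes frequent failures on in-the-wild videos.
Result: Extensive experiments on HO3D, DexYCB, and in-the-wild videos show that AGILE outperforms baselines in global geometric accuracy and is highly robust on challenging sequences where prior methods often fail.
Insight: The contributions are shifting the paradigm from reconstruction to agentic generation, with a VLM guiding the generation of complete assets; bypassing SfM entirely in favor of a robust anchor-and-track strategy built on a foundation model; and a contact-aware optimization that integrates semantic, geometric, and interaction stability constraints to ensure physical validity, yielding simulation-ready assets.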
Abstract: Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.
[76] DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking cs.CV | cs.AIPDF
Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu
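TL;DR: This paper proposes a new task, RGBD Referring Multi-Object Tracking (DRMOT), which fuses RGB, depth, and language modalities for 3D-aware target tracking. The authors build the DRSet dataset and propose DRTrack, an MLLM-guided depth-referring tracking framework, to improve target grounding under complex spatial semantics and trajectory association under occlusion.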
Details
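Motivation: Existing Referring Multi-Object Tracking (RMOT) models rely only on 2D RGB data and lack explicit 3D spatial information, so they struggle with complex spatial descriptions (such as "the person closest to the camera") and with keeping identities reliable under severe occlusion.
Result: Extensive experiments on the newly built DRSet dataset demonstrate the effectiveness of the proposed DRTrack framework.
Insight: The contribution is bringing depth information into referring MOT for the first time: the paper proposes the new DRMOT task and the accompanying DRSet dataset, and designs an MLLM-guided tracking framework that fuses RGB-D-L modalities to make spatial-semantic understanding and trajectory association more robust.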
Abstract: Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.
[77] Annotation Free Spacecraft Detection and Segmentation using Vision Language Models cs.CVPDF
Samet Hicsonmez, Jose Sosa, Dan Pineau, Inder Pal Singh, Arunkumar Rathinam
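TL;DR: This paper proposes an annotation-free method for spacecraft detection and segmentation. A vision-language model (VLM) automatically generates pseudo-labels, and a teacher-student label distillation framework trains lightweight models, clearly improving segmentation performance on several space target datasets.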
Details
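Motivation: Manual annotation in the space domain is difficult because of low visibility, illumination changes, and targets blending with planetary backgrounds. The goal is a spacecraft detection and segmentation method that does not need extensive manual labeling.
Result: On segmentation tasks on the SPARK-2024, SPEED+, and TANGO datasets, average precision (AP) improves by up to 10 points over direct zero-shot VLM inference.
Insight: The contribution is using the zero-shot ability of VLMs to generate pseudo-labels automatically and a distillation framework that makes effective use of noisy labels to train lightweight models, offering a practical solution for annotation-scarce domains.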
Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.
[78] SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation cs.CV | cs.AI | eess.IVPDF
David F. Ramirez, Tim Overman, Kristen Jaskie, Joe Marvin, Andreas Spanias
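TL;DR: This paper proposes SAR-RAG, a visual-context image retrieval-augmented generation method for automatic target recognition (ATR) in synthetic aperture radar (SAR) imagery. It combines a multimodal large language model (MLLM) with a vector database of semantic embeddings, retrieving past image examples with known true target types to aid comparison and recognition and thereby improve ATR prediction accuracy.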
Details
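Motivation: Military vehicle targets in SAR images look alike and are hard to distinguish and identify. The paper applies retrieval-augmented generation to improve automatic target recognition.
Result: Across search and retrieval metrics, classification accuracy, and numeric regression of vehicle dimensions, SAR-RAG improves on a baseline that uses the MLLM alone.
Insight: The contribution is a retrieval-augmented AI agent framework for SAR ATR that integrates an MLLM with a semantic vector database (acting as an ATR memory bank) and strengthens recognition through contextual search and comparison against known examples.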
Abstract: We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
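The retrieval step described in the abstract, searching a vector database of semantic embeddings for exemplars with known target types, can be sketched as a cosine-similarity top-k lookup. The in-memory list standing in for the vector database, the placeholder labels, and the function names are all assumptions for illustration; the paper's embedding model and database are not specified in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_exemplars(query_embedding, memory_bank, k=3):
    """Return the k (similarity, label) pairs most similar to the query.

    memory_bank: list of (embedding, known_label) pairs, a stand-in for
    the vector database of past examples.
    """
    scored = [(cosine(query_embedding, emb), label) for emb, label in memory_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def format_context(exemplars):
    """Turn retrieved exemplars into text that could be attached to an MLLM prompt."""
    return "\n".join(f"Similar example (score {s:.2f}): {label}" for s, label in exemplars)
```

In a retrieval-augmented setup, the formatted exemplars (and typically their images) accompany the query image so the model can compare against known cases instead of relying only on its own weights.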
[79] How to rewrite the stars: Mapping your orchard over time through constellations of fruits cs.CVPDF
Gonçalo P. Matos, Carlos Santiago, João P. Costeira, Ricardo L. Saldanha, Ernesto M. Morgado
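TL;DR: This paper proposes a new method based on constellations of 3D centroids for matching the same fruits across orchard videos taken at different times, in order to track their growth and to build an orchard map for autonomous robot navigation and selective picking.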
Details
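Motivation: Matching the same fruit across time to track growth in an orchard remains difficult. Earlier methods depend on a fixed camera starting position or on extra data such as GPS, whereas this paper targets challenging scenes with non-rigidity, occlusion, and sparse visual features.
Result: The method successfully matches fruits across videos and can build an orchard map used to locate the camera pose in 6DoF, supporting orchard robot navigation and selective picking. No specific benchmarks or quantitative comparisons are reported.
Insight: The contribution is matching with constellation descriptors over sparse 3D point clouds rather than matching individual fruits, which copes with non-rigid deformation and occlusion. Viewed objectively, the method connects computer vision with agricultural practice and offers a new approach to long-term growth monitoring in precision agriculture.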
Abstract: Following crop growth through the vegetative cycle allows farmers to predict fruit setting and yield in early stages, but it is a laborious and non-scalable task if performed by a human who has to manually measure fruit sizes with a caliper or dendrometers. In recent years, computer vision has been used to automate several tasks in precision agriculture, such as detecting and counting fruits, and estimating their size. However, the fundamental problem of matching the exact same fruits from one video, collected on a given date, to the fruits visible in another video, collected on a later date, which is needed to track fruits’ growth through time, remains to be solved. Few attempts were made, but they either assume that the camera always starts from the same known position and that there are sufficiently distinct features to match, or they used other sources of data like GPS. Here we propose a new paradigm to tackle this problem, based on constellations of 3D centroids, and introduce a descriptor for very sparse 3D point clouds that can be used to match fruits across videos. Matching constellations instead of individual fruits is key to deal with non-rigidity, occlusions and challenging imagery with few distinct visual features to track. The results show that the proposed method can be successfully used to match fruits across videos and through time, and also to build an orchard map and later use it to locate the camera pose in 6DoF, thus providing a method for autonomous navigation of robots in the orchard and for selective fruit picking, for example.
[80] Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention cs.CVPDF
Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren
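TL;DR: This paper proposes Light Forcing, a sparse attention solution designed specifically to accelerate autoregressive video diffusion models. Through a Chunk-Aware Growth mechanism and a Hierarchical Sparse Attention strategy, it addresses the performance degradation that existing sparse attention methods suffer on autoregressive models, caused by treating chunk generation in isolation and under-using historical context, and it speeds up inference considerably while preserving generation quality.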
Details
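Motivation: Autoregressive video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains the main bottleneck for efficient deployment. Existing sparse attention solutions work on bidirectional models but cause considerable degradation on autoregressive models, because they consider chunk generation in isolation and make insufficient use of informative past context.
Result: Extensive experiments show that the method outperforms existing sparse attention approaches in both quality (for example, 84.5 on VBench) and efficiency (for example, a 1.2× to 1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, Light Forcing further reaches a 2.3× speedup and 19.7 FPS on an RTX 5090 GPU.
Insight: The core contribution is the first sparse attention solution tailored to autoregressive video generation models. The Chunk-Aware Growth mechanism quantitatively estimates each chunk's contribution to allocate sparsity, letting later chunks inherit prior knowledge during generation. Hierarchical Sparse Attention uses a two-level mask selection strategy (frame level and block level) to adaptively capture informative historical and local context in a coarse-to-fine manner. Together they offer a new design direction for efficient autoregressive video generation.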
Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2× to 1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, Light Forcing further achieves a 2.3× speedup and 19.7 FPS on an RTX 5090 GPU. Code will be released at https://github.com/chengtao-lv/LightForcing.
[81] VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? cs.CVPDF
Qing’an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng
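TL;DR: This paper introduces VISTA-Bench, a benchmark for systematically evaluating how well vision-language models (VLMs) understand visualized text (text embedded in images) compared with pure text. It finds a pronounced modality gap: performance drops sharply when semantically identical content is presented as visualized text instead of pure text, and models are sensitive to rendering variations.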
Details
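Motivation: Existing VLM benchmarks focus mainly on pure-text queries, but in real-world scenarios text often appears in visualized form embedded in images. The paper asks whether current VLMs handle such input equally well, in order to expose potential limitations in cross-modal understanding.
Result: An extensive evaluation of more than 20 representative VLMs on VISTA-Bench shows that models perform well on pure-text queries but degrade substantially on visualized-text questions with equivalent semantics. The gap widens as perceptual difficulty increases, indicating sensitivity to rendering variations even though the semantics are unchanged.
Insight: The contribution is a systematic benchmark (VISTA-Bench) that contrasts pure-text and visualized-text questions under controlled rendering conditions and quantifies, for the first time, the modality gap of VLMs in visualized text understanding. It provides a framework for diagnosing this limitation and for guiding progress toward more unified language representations across tokenized text and pixels.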
Abstract: Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.
[82] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas cs.CV | cs.LGPDF
Aqsa Sultana, Rayan Afsar, Ahmed Rahu, Surendra P. Singh, Brian Shula
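TL;DR: This paper proposes XtraLight-MedMamba, an ultra-lightweight deep learning framework for classifying neoplastic tubular adenomas from whole-slide images. The model combines a ConvNext shallow feature extractor with a parallel vision Mamba, and integrates a Spatial and Channel Attention Bridge module and a Fixed Non-Negative Orthogonal Classifier, achieving high classification accuracy with very few parameters.
Details
Motivation: Accurate risk assessment of precancerous polyps during routine colonoscopy is essential for lowering colorectal cancer risk, but assessment of low-grade dysplasia is currently limited by subjective histopathologic interpretation. Advances in digital pathology and deep learning offer new opportunities to identify subtle morphologic patterns associated with malignant progression that are hard for the human eye to perceive.
Result: On a dataset of patients with low-grade tubular adenomas, stratified by subsequent colorectal cancer development, XtraLight-MedMamba achieved 97.18% accuracy and an F1-score of 0.9767 with approximately 32,000 parameters, outperforming Transformer-based and conventional Mamba architectures with significantly more parameters.
Insight: The innovations are: 1) combining ConvNext with a parallel vision Mamba to efficiently model long- and short-range dependencies and image generalization; 2) a Spatial and Channel Attention Bridge module that enhances multiscale feature extraction; 3) a Fixed Non-Negative Orthogonal Classifier that substantially reduces parameters and improves generalization. Viewed objectively, the core contribution is reaching SOTA-level performance on a medical image classification task at very low model complexity, which makes deployment in resource-constrained settings feasible.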
Abstract: Accurate risk stratification of precancerous polyps during routine colonoscopy screenings is essential for lowering the risk of developing colorectal cancer (CRC). However, assessment of low-grade dysplasia remains limited by subjective histopathologic interpretation. Advancements in digital pathology and deep learning provide new opportunities to identify subtle and fine morphologic patterns associated with malignant progression that may be imperceptible to the human eye. In this work, we propose XtraLight-MedMamba, an ultra-lightweight state-space-based deep learning framework for classifying neoplastic tubular adenomas from whole-slide images (WSIs). The architecture is a blend of ConvNext based shallow feature extractor with parallel vision mamba to efficiently model both long- and short-range dependencies and image generalization. An integration of Spatial and Channel Attention Bridge (SCAB) module enhances multiscale feature extraction, while Fixed Non-Negative Orthogonal Classifier (FNOClassifier) enables substantial parameter reduction and improved generalization. The model was evaluated on a curated dataset acquired from patients with low-grade tubular adenomas, stratified into case and control cohorts based on subsequent CRC development. XtraLight-MedMamba achieved an accuracy of 97.18% and an F1-score of 0.9767 using approximately 32,000 parameters, outperforming transformer-based and conventional Mamba architectures with significantly higher model complexity.
[83] When LLaVA Meets Objects: Token Composition for Vision-Language-Models cs.CVPDF
Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
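TL;DR: The paper proposes Mask-LLaVA, a framework that combines visual features at multiple levels (mask-based object representations, global tokens, and local patch tokens) to build a compact yet information-rich visual representation for autoregressive vision-language models, reducing the number of visual tokens at inference while maintaining performance.
Details
Motivation: Current autoregressive vision-language models rely on a large number of visual tokens to represent images, which leads to high compute requirements at inference time.
Result: On a suite of standard benchmarks, results are competitive with current token-efficient methods, and performance comparable to the original LLaVA baseline is reached with only a fraction of the visual tokens.
Insight: The contribution lies in fusing multi-level visual features for efficient learning while allowing tokens (especially mask-based object tokens) to be selected dynamically at test time to fit different inference needs, without retraining the model.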
Abstract: Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens at test time, in particular mask-based object tokens, so the number of tokens can be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.
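The test-time token dropping described in this abstract can be illustrated with a minimal sketch. The abstract does not say how object tokens are chosen when the budget shrinks, so the per-token relevance score, the function name `compose_visual_tokens`, and the `mask_budget` parameter below are all hypothetical.

```python
from typing import List, Tuple


def compose_visual_tokens(
    global_tokens: List[str],
    patch_tokens: List[str],
    mask_tokens: List[Tuple[str, float]],  # (token id, hypothetical relevance score)
    mask_budget: int,
) -> List[str]:
    """Keep all global and patch tokens, plus the `mask_budget` best object tokens.

    Ranking object tokens by a score is an assumption made for illustration;
    the abstract only states that object tokens can be dropped at test time.
    """
    if mask_budget < 0:
        raise ValueError("mask_budget must be non-negative")
    kept = sorted(mask_tokens, key=lambda t: t[1], reverse=True)[:mask_budget]
    return global_tokens + patch_tokens + [name for name, _ in kept]


# The same inputs can be re-composed under different budgets without retraining.
objects = [("m0", 0.2), ("m1", 0.9), ("m2", 0.5)]
full = compose_visual_tokens(["g0"], ["p0", "p1"], objects, mask_budget=3)
small = compose_visual_tokens(["g0"], ["p0", "p1"], objects, mask_budget=1)
```

Here `small` carries four tokens instead of six, which is the kind of inference-time trade-off the abstract refers to.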
[84] PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation cs.CVPDF
Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu
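TL;DR: PerpetualWonder is a hybrid generative simulator that generates long-horizon, action-conditioned 4D scenes from a single image. By introducing what it describes as the first true closed-loop system, it addresses the inability of existing methods to update the underlying physics for subsequent interactions, a problem caused by decoupling the physical state from the visual representation.
Details
Motivation: In long-horizon 4D scene generation, existing methods decouple the physical state from the visual representation, so generative refinements cannot update the underlying physics needed for subsequent interactions.
Result: Experiments show that, starting from a single image, PerpetualWonder successfully simulates complex multi-step interactions under long-horizon actions while maintaining physical plausibility and visual consistency.
Insight: The contribution is a closed-loop system built on a novel unified representation that creates a bidirectional link between the physical state and visual primitives, so that generative refinements can correct both dynamics and appearance. It also proposes a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity.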
Abstract: We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
[85] CoWTracker: Tracking by Warping instead of Correlation cs.CVPDF
Zihang Lai, Eldar Insafutdinov, Edgar Sucar, Andrea Vedaldi
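TL;DR: This paper proposes CoWTracker, a new dense point tracking method that abandons traditional cost-volume feature matching in favor of a warping-based iterative refinement strategy. Combined with a Transformer architecture for joint spatiotemporal reasoning, it establishes long-range correspondences without computing feature correlations.
Details
Motivation: State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach has quadratic complexity in spatial resolution, which limits scalability and efficiency. The paper aims to address this with a warping-based architecture that improves the efficiency and performance of dense point tracking.
Result: The model achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. It also performs strongly on optical flow estimation, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks.
Insight: The main innovation is replacing traditional cost-volume matching with warping-based iterative refinement, which lowers computational complexity. The architecture suggests that warping-based methods can unify dense point tracking and optical flow estimation.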
Abstract: Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose CoWTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
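The warping step mentioned in this abstract can be sketched with plain bilinear sampling. This is a generic illustration under stated assumptions, not the authors' implementation: the single-channel feature grid, the border clamping, and the names `bilinear_sample` and `warp_to_query` are all hypothetical.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # single-channel feature map, indexed as feat[y][x]


def bilinear_sample(feat: Grid, x: float, y: float) -> float:
    """Sample `feat` at a continuous location, clamping to the image border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_to_query(target_feat: Grid, tracks: List[List[Tuple[float, float]]]) -> Grid:
    """Pull target-frame features back onto the query grid.

    tracks[i][j] is the current estimate (x, y), in the target frame, of the
    point that starts at query pixel (j, i). One sample is taken per query
    pixel, so the cost grows linearly with the number of pixels.
    """
    return [[bilinear_sample(target_feat, x, y) for (x, y) in row] for row in tracks]


feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
identity = [[(float(j), float(i)) for j in range(3)] for i in range(2)]
warped = warp_to_query(feat, identity)  # identity tracks reproduce the input
```

A cost volume would instead compare every query location with every candidate target location, which is where the quadratic term in spatial resolution comes from.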
cs.CY [Back]
[86] Inference-Time Reasoning Selectively Reduces Implicit Social Bias in Large Language Models cs.CY | cs.CLPDF
Molly Apsel, Michael N. Jones
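TL;DR: This paper studies how reasoning affects implicit social bias in large language models (LLMs). Drawing on psychological theory, the authors find that enabling reasoning (such as chain-of-thought) significantly reduces the implicit social bias that some models show on evaluations resembling the Implicit Association Test (IAT), but the effect is limited to social bias domains and does not extend to non-social implicit associations.
Details
Motivation: The aim is to explore how reasoning affects implicit bias in LLMs. Although LLMs are alignment-trained to avoid explicit bias, they still exhibit significant implicit bias, and in human cognition implicit associations are theoretically linked to statistical learning, which reasoning can impair.
Result: On an IAT-style evaluation covering fifteen stereotype topics, enabling reasoning significantly lowered measured implicit bias for some model classes, while no corresponding reduction was observed for non-social implicit associations.
Insight: The contribution is applying theory from cognitive science and psychology (such as the link between implicit associations and statistical learning) to AI evaluation. The work shows that reasoning can change fairness evaluation outcomes and raises the question of how alignment procedures interact with inference-time reasoning to produce differences in bias reduction across models.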
Abstract: Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.
cs.RO [Back]
[87] Beyond the Vehicle: Cooperative Localization by Fusing Point Clouds for GPS-Challenged Urban Scenarios cs.RO | cs.CVPDF
Kuo-Yi Chao, Ralph Rasshofer, Alois Christian Knoll
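TL;DR: This paper proposes a cooperative multi-sensor, multi-modal localization method for urban environments where GPS signals are unreliable. It fuses data from vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) systems and combines it with a point cloud registration-based simultaneous localization and mapping (SLAM) algorithm. The system processes point clouds from multiple modalities, including vehicle-mounted LiDAR, stereo cameras, and sensors deployed at intersections, and is reported to significantly improve localization accuracy and robustness in complex urban scenarios.
Details
Motivation: Precise vehicle localization is difficult in urban environments where GPS signals are unreliable.
Result: The method significantly improves localization accuracy and robustness in complex urban scenarios with heavy GPS noise.
Insight: The contribution is fusing cooperative V2V and V2I data with point cloud registration SLAM, using data shared by infrastructure to strengthen localization. The multi-modal sensor fusion and cooperative perception architecture is the core idea worth borrowing.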
Abstract: Accurate vehicle localization is a critical challenge in urban environments where GPS signals are often unreliable. This paper presents a cooperative multi-sensor and multi-modal localization approach to address this issue by fusing data from vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) systems. Our approach integrates cooperative data with a point cloud registration-based simultaneous localization and mapping (SLAM) algorithm. The system processes point clouds generated from diverse sensor modalities, including vehicle-mounted LiDAR and stereo cameras, as well as sensors deployed at intersections. By leveraging shared data from infrastructure, our method significantly improves localization accuracy and robustness in complex, GPS-noisy urban scenarios.
[88] VLS: Steering Pretrained Robot Policies via Vision-Language Models cs.RO | cs.CVPDF
Shuo Liu, Ishneet Sukhvinder Singh, Yiqing Xu, Jiafei Duan, Ranjay Krishna
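TL;DR: This paper proposes Vision-Language Steering (VLS), a training-free framework that adapts frozen generative robot policies (such as diffusion or flow-matching policies) at inference time, to handle failures caused by train-test distribution shifts (obstacles, changed support surfaces, or mild clutter). VLS treats adaptation as an inference-time control problem: it uses vision-language models to synthesize trajectory-differentiable reward functions that guide the denoising process toward action trajectories satisfying test-time spatial and task requirements, without modifying policy parameters.
Details
Motivation: Pretrained diffusion or flow-matching policies fail under train-test distribution shifts (for example obstacles in the environment, a shifted support surface, or mild clutter). These failures do not come from missing motor skills but from a limitation of imitation learning, which ties behavior to training-specific spatial configurations and task specifications. Retraining or fine-tuning is costly and conceptually mismatched, because the required behaviors already exist but cannot be selectively adapted at test time.
Result: In simulation and real-world evaluations, VLS consistently outperforms prior steering methods, with a 31% improvement on the CALVIN benchmark and a 13% gain on the LIBERO-PRO benchmark. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts.
Insight: The contribution is framing inference-time adaptation as a control problem and using vision-language models to synthesize differentiable rewards that steer the sampling process of a pretrained generative model, so the policy is adjusted at test time without parameter updates. This offers a new paradigm for using existing pretrained policies efficiently in out-of-distribution scenarios, combining the semantic understanding of vision-language models with the action planning of generative models.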
Abstract: Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/
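The idea of steering a frozen sampler with a differentiable reward can be sketched on a toy problem. Everything below is an illustrative assumption rather than the VLS method itself: `frozen_step` stands in for one denoising update of a pretrained policy, `reward` stands in for a VLM-synthesized reward, and finite differences replace automatic differentiation.

```python
from typing import Callable, List

Vec = List[float]


def numeric_grad(f: Callable[[Vec], float], x: Vec, eps: float = 1e-4) -> Vec:
    """Central-difference gradient; a stand-in for autodiff through the reward."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def guided_denoise(
    x: Vec,
    denoise_step: Callable[[Vec], Vec],
    reward: Callable[[Vec], float],
    steps: int,
    guidance_scale: float,
) -> Vec:
    """Alternate the frozen policy's update with a gradient step on the reward."""
    for _ in range(steps):
        x = denoise_step(x)                      # frozen policy, parameters untouched
        g = numeric_grad(reward, x)              # direction that increases the reward
        x = [xi + guidance_scale * gi for xi, gi in zip(x, g)]
    return x


def frozen_step(x: Vec) -> Vec:
    """Toy policy that contracts toward its training-time mode at the origin."""
    return [0.5 * v for v in x]


def reward(x: Vec) -> float:
    """Toy test-time requirement: end near the shifted target (1, 0)."""
    return -((x[0] - 1.0) ** 2 + x[1] ** 2)


unguided = guided_denoise([2.0, 2.0], frozen_step, reward, steps=20, guidance_scale=0.0)
guided = guided_denoise([2.0, 2.0], frozen_step, reward, steps=20, guidance_scale=0.25)
```

With these toy choices the guided sample settles between the policy's mode and the shifted target, so its reward is higher than the unguided one even though the policy itself is never updated.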
[89] Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement cs.RO | cs.CVPDF
Weikang Qiu, Tinglin Huang, Aosong Feng, Rex Ying
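TL;DR: This paper proposes SD-VLA, a framework that disentangles visual inputs into static and dynamic tokens to address the limited context length and inefficient inference of vision-language-action models on long-horizon tasks. By reusing the KV cache of static tokens, it significantly reduces compute overhead, and it achieves notable performance gains and inference speedups on a newly proposed long-horizon benchmark.
Details
Motivation: Existing VLA models face two challenges: limited long-horizon context modeling, and inefficient inference caused by quadratic attention complexity and large parameter counts. The paper observes that much of the visual information in a trajectory (such as the background) stays static across timesteps, and seeks to exploit this property for efficiency.
Result: On the newly proposed long-horizon dependency modeling benchmark, SD-VLA achieves a 39.8% absolute improvement in success rate over baselines, and a 3.9% gain on the SimplerEnv benchmark. Inference is 2.26x faster than the base VLA model.
Insight: The core innovation is a multi-level static-dynamic disentangled representation of visual inputs, together with a lightweight recache gate that selectively updates the KV cache of static tokens. This offers a new architectural design for efficient multi-frame integration and long-horizon inference.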
Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by 39.8% absolute improvement in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
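The "recache gate" idea in this abstract, reusing cached key-value entries for static tokens and refreshing them only when necessary, can be sketched as follows. The abstract does not describe how the gate decides, so the mean-absolute-drift threshold, the class name `StaticKVCache`, and the toy `encode` function are hypothetical.

```python
from typing import Callable, List, Optional


class StaticKVCache:
    """Cache the encoding of static tokens and recompute it only on large drift.

    `encode` stands in for the expensive key-value computation. The drift
    threshold is an assumed gating rule used purely for illustration.
    """

    def __init__(self, encode: Callable[[List[float]], List[float]], threshold: float):
        self.encode = encode
        self.threshold = threshold
        self.reference: Optional[List[float]] = None  # tokens the cache was built from
        self.kv: Optional[List[float]] = None
        self.recomputes = 0

    def _drift(self, tokens: List[float]) -> float:
        assert self.reference is not None
        diffs = [abs(a - b) for a, b in zip(tokens, self.reference)]
        return sum(diffs) / len(diffs)

    def get(self, static_tokens: List[float]) -> List[float]:
        # Gate: refresh only on the first frame or when the scene has changed enough.
        if self.kv is None or self._drift(static_tokens) > self.threshold:
            self.kv = self.encode(static_tokens)
            self.reference = list(static_tokens)
            self.recomputes += 1
        return self.kv


cache = StaticKVCache(encode=lambda t: [2.0 * v for v in t], threshold=0.1)
kv_frame1 = cache.get([1.0, 2.0])    # first frame: computed
kv_frame2 = cache.get([1.0, 2.01])   # nearly identical background: reused
kv_frame3 = cache.get([5.0, 2.0])    # background changed: recomputed
```

In this toy run the expensive `encode` call happens twice over three frames; the savings in a real model would depend on how often the gate fires.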
[90] GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning cs.RO | cs.CVPDF
Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang
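TL;DR: This paper proposes GeneralVLA, a generalizable hierarchical vision-language-action model for zero-shot robot manipulation and automatic generation of robot data. The high level perceives image keypoints, the mid level performs task understanding and 3D trajectory planning, and the low level executes 3D-aware control. It requires no real-world robot data or human demonstrations, successfully generates trajectories on 14 tasks, and significantly outperforms methods such as VoxPoser.
Details
Motivation: Large foundation models generalize poorly in robotics and have limited zero-shot capability. The goal is robot manipulation that generalizes to unseen scenarios without real-world data or human demonstrations.
Result: GeneralVLA successfully generates trajectories on 14 tasks and significantly outperforms state-of-the-art methods such as VoxPoser. Behavior cloning policies trained on the generated demonstrations are more robust than those trained on human demonstrations or on data generated by VoxPoser, Scaling-up, and Code-As-Policies.
Insight: The contribution is a hierarchical VLA architecture combined with knowledge-guided 3D trajectory planning: the high level perceives keypoints, the mid level plans a 3D path, and the low level executes control. This enables zero-shot manipulation and automatic data generation without real robot data, improving scalability.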
Abstract: Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit limited zero-shot capability, which hampers their ability to generalize effectively to unseen scenarios. In this work, we propose GeneralVLA (Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning), a hierarchical vision-language-action (VLA) model that can be more effective in utilizing the generalization of foundation models, enabling zero-shot manipulation and automatically generating data for robotics. In particular, we study a class of hierarchical VLA model where the high-level ASM (Affordance Segmentation Module) is finetuned to perceive image keypoint affordances of the scene; the mid-level 3DAgent carries out task understanding, skill knowledge, and trajectory planning to produce a 3D path indicating the desired robot end-effector trajectory. The intermediate 3D path prediction is then served as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Compared to alternative approaches, our method requires no real-world robotic data collection or human demonstration, making it much more scalable to diverse tasks and viewpoints. Empirically, GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser. The generated demonstrations can train more robust behavior cloning policies than training with human demonstrations or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe GeneralVLA can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Code: https://github.com/AIGeeksGroup/GeneralVLA. Website: https://aigeeksgroup.github.io/GeneralVLA.
[91] EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models cs.RO | cs.CVPDF
Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang
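TL;DR: This paper proposes EgoActor, a unified and scalable vision-language model (VLM) aimed at the challenges of deploying humanoid robots in the real world. The model turns high-level instructions directly into a variety of precise, spatially aware humanoid actions, including locomotion primitives, head movements, manipulation commands, and human-robot interactions, to coordinate real-time perception and execution.
Details
Motivation: Deploying humanoid robots requires tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. The paper addresses these challenges by mapping high-level instructions directly to concrete, spatially aware actions.
Result: Extensive evaluations in simulated and real environments show that EgoActor effectively bridges abstract task planning and concrete motor execution and generalizes to diverse tasks and unseen environments. At both the 8B and 4B parameter scales, the model makes robust, context-aware decisions and performs fluent action inference (under 1 second).
Insight: The contribution is the proposed EgoActing task and a unified VLM trained with broad supervision from egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, yielding end-to-end spatially aware prediction and coordination of robot actions.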
Abstract: Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. Towards addressing these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real-time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
eess.AS [Back]
[92] Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection eess.AS | cs.AI | cs.CV | cs.MM | cs.SDPDF
Seohyun Joo, Yoori Oh
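TL;DR: This paper proposes DAViHD, a new dual-pathway audio encoder framework for audio-visual highlight detection. A semantic pathway extracts high-level semantic information from audio and a dynamic pathway extracts spectro-temporal dynamics; both are integrated into a full audio-visual system to make fuller use of the rich dynamic characteristics of the audio modality.
Details
Motivation: Existing audio-visual highlight detection models often underuse the audio modality, focusing on high-level semantic features while ignoring the rich dynamic characteristics of sound. The paper addresses this gap with a dual-pathway audio encoder.
Result: On the large-scale Mr.HiSum benchmark, the proposed DAViHD framework achieves new state-of-the-art (SOTA) performance.
Insight: The contribution is the dual-pathway audio encoder: the semantic pathway handles content understanding (such as speech and music), while the dynamic pathway captures spectro-temporal dynamics through a frequency-adaptive mechanism, which allows it to identify transient acoustic events. Viewed objectively, this fine-grained, multi-dimensional modeling of audio helps make audio representations more effective in audio-visual tasks.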
Abstract: Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
cs.SI [Back]
[93] Overstating Attitudes, Ignoring Networks: LLM Biases in Simulating Misinformation Susceptibility cs.SI | cs.AI | cs.CLPDF
Eun Cheol Choi, Lindsay E. Young, Emilio Ferrara
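TL;DR: This paper evaluates how well large language models (LLMs) simulate human susceptibility to misinformation. LLM-generated survey responses capture broad distributional tendencies and correlate modestly with human responses, but they systematically overstate the association between belief and sharing, and overemphasize attitudinal and behavioral features while ignoring personal network characteristics.
Details
Motivation: The aim is to test the reliability of LLMs as proxies for human judgment in computational social science, in particular whether they can accurately reproduce human patterns of misinformation belief and sharing, and thereby to clarify where LLM-simulated surveys are applicable.
Result: An evaluation against three online survey baselines shows that LLM-simulated responses correlate only modestly with human responses. Linear models fit to simulated responses have substantially higher explained variance and weight attitudinal and behavioral features heavily while neglecting network features, indicating systematic divergence from human data.
Insight: The contribution is showing that LLMs exhibit systematic representational bias when simulating misinformation susceptibility, with concept associations in their training data possibly driving the overemphasis on attitudinal features. The core takeaway is that LLMs are better suited to diagnosing systematic divergences from human judgment than to directly replacing human data.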
Abstract: Large language models (LLMs) are increasingly used as proxies for human judgment in computational social science, yet their ability to reproduce patterns of susceptibility to misinformation remains unclear. We test whether LLM-simulated survey respondents, prompted with participant profiles drawn from social survey data measuring network, demographic, attitudinal and behavioral features, can reproduce human patterns of misinformation belief and sharing. Using three online surveys as baselines, we evaluate whether LLM outputs match observed response distributions and recover feature-outcome associations present in the original survey data. LLM-generated responses capture broad distributional tendencies and show modest correlation with human responses, but consistently overstate the association between belief and sharing. Linear models fit to simulated responses exhibit substantially higher explained variance and place disproportionate weight on attitudinal and behavioral features, while largely ignoring personal network characteristics, relative to models fit to human responses. Analyses of model-generated reasoning and LLM training data suggest that these distortions reflect systematic biases in how misinformation-related concepts are represented. Our findings suggest that LLM-based survey simulations are better suited for diagnosing systematic divergences from human judgment than for substituting it.
cs.AI [Back]
[94] Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL cs.AI | cs.CL | cs.LGPDF
Xiaofeng Lin, Sirou Zhu, Yilei Chen, Mingyu Chen, Hejian Sang
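TL;DR: This paper proposes ORBIT, a framework that trains LLMs with cross-episode meta-reinforcement learning so that they can learn from interaction in context, improving their in-context learning on online decision-making tasks. Experiments show that a relatively small open-source model, once trained, matches GPT-5.2 in unseen environments and significantly outperforms standard RL fine-tuning.
Details
Motivation: LLMs struggle to reliably use in-context interaction experience in online decision-making tasks, which require real-time interaction, handle delayed feedback, and demand a balance between collecting and exploiting information.
Result: In entirely unseen environments, the ORBIT-trained Qwen3-14B model shows substantially improved in-context online learning, matching GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments show consistent gains with model size.
Insight: The contribution is a multi-task, multi-episode meta-reinforcement learning framework for training in-context online learning in LLMs. It provides a new training paradigm for agents that learn at inference time without weight updates, and shows the potential of scaling model size.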
Abstract: Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
[95] Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search cs.AI | cs.CLPDF
Hao Lu, Haoyuan Huang, Yulin Zhou, Chen Li, Ningxin Zhu
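TL;DR: This paper proposes Empirical-MCTS, a framework that turns traditional stateless Monte Carlo Tree Search (MCTS) into a continuous learning process through a dual-experience loop, combining local exploration with global memory optimization to significantly improve large language models on complex reasoning tasks.
Details
Motivation: Existing MCTS-based inference-time scaling strategies are mostly stateless: successful reasoning patterns are discarded after each problem, so experience cannot accumulate the way it does in human problem-solving. A search framework that continuously accumulates and reuses experience is therefore needed.
Result: On complex reasoning benchmarks including AIME25, ARC-AGI-2, and MathArena Apex, Empirical-MCTS significantly outperforms stateless MCTS strategies and standalone experience-driven agents, reaching new SOTA results.
Insight: The contribution is a dual-loop framework: the PE-EMP mechanism evolves meta-prompts in real time within the local search, and a Memory Optimization Agent builds a global dynamic policy prior. Coupling structured search with experience accumulation offers a new approach to non-parametric continual learning.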
Abstract: Inference-time scaling strategies, particularly Monte Carlo Tree Search (MCTS), have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). However, current approaches remain predominantly stateless, discarding successful reasoning patterns after each problem instance and failing to mimic the empirical accumulation of wisdom characteristic of human problem-solving. To bridge this gap, we introduce Empirical-MCTS, a dual-loop framework that transforms stateless search into a continuous, non-parametric learning process. The framework unifies local exploration with global memory optimization through two novel mechanisms: Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) and a Memory Optimization Agent. PE-EMP functions as a reflexive optimizer within the local search, utilizing pairwise feedback to dynamically synthesize adaptive criteria and evolve meta-prompts (system prompts) in real-time. Simultaneously, the Memory Optimization Agent manages a global repository as a dynamic policy prior, employing atomic operations to distill high-quality insights across problems. Extensive evaluations on complex reasoning benchmarks, including AIME25, ARC-AGI-2, and MathArena Apex, demonstrate that Empirical-MCTS significantly outperforms both stateless MCTS strategies and standalone experience-driven agents. These results underscore the critical necessity of coupling structured search with empirical accumulation for mastering complex, open-ended reasoning tasks.
[96] From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents cs.AI | cs.CL | cs.MAPDF
SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, HyeongYeop Kang
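TL;DR: This paper proposes PCE (Planner-Composer-Evaluator), a framework that converts the fragmented assumptions implicit in large language model (LLM) reasoning traces into a structured decision tree. This supports uncertainty-aware planning for embodied agents in multi-agent, partially observable, decentralized environments and reduces reliance on frequent communication.
Details
Motivation: Embodied agents under uncertainty rely too heavily on frequent inter-agent communication for planning, which incurs high token and time costs and can disrupt human collaborative workflows.
Result: On two multi-agent benchmarks (C-WAH and TDW-MAT) and three different LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while keeping token usage comparable. Ablations show that PCE consistently raises baseline performance across model capacities and reasoning depths.
Insight: The contribution is making the assumptions implicit in LLM reasoning explicit and structured (as a decision tree), then scoring each path by scenario likelihood, goal-directed gain, and execution cost, so that rational action selection does not depend on heavy communication. This provides a principled route for turning latent LLM assumptions into reliable uncertainty-aware planning strategies.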
Abstract: Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet, uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs and can disrupt established workflows when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
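The decision-tree scoring described in this abstract can be sketched as a small recursive walk. The abstract names the three factors (scenario likelihood, goal-directed gain, execution cost) but not how they are combined, so the rule `likelihood * gain - cost`, the `Node` fields, and the example scenario below are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str            # an environment assumption (internal node) or an action (leaf)
    prob: float = 1.0     # likelihood of the branch leading to this node
    gain: float = 0.0     # goal-directed gain, meaningful on leaves only
    cost: float = 0.0     # execution cost, meaningful on leaves only
    children: List["Node"] = field(default_factory=list)


def score_paths(
    node: Node, likelihood: float = 1.0, path: Tuple[str, ...] = ()
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return (path, score) for every root-to-leaf path.

    Assumed scoring rule: score = path likelihood * gain - cost.
    """
    likelihood *= node.prob
    path = path + (node.label,)
    if not node.children:
        return [(path, likelihood * node.gain - node.cost)]
    results: List[Tuple[Tuple[str, ...], float]] = []
    for child in node.children:
        results.extend(score_paths(child, likelihood, path))
    return results


def best_action(root: Node) -> str:
    """Pick the leaf action on the highest-scoring path."""
    best_path, _ = max(score_paths(root), key=lambda item: item[1])
    return best_path[-1]


tree = Node("root", children=[
    Node("key is in the kitchen", prob=0.7,
         children=[Node("search kitchen", gain=10.0, cost=2.0)]),
    Node("key is in the bedroom", prob=0.3,
         children=[Node("search bedroom", gain=10.0, cost=1.0)]),
    Node("ask partner", gain=10.0, cost=6.0),   # communication modeled as a costly action
])
action = best_action(tree)
```

In this toy tree the agent searches the kitchen instead of messaging its partner, because the communication cost outweighs the certainty it buys.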
cs.SD [Back]
[97] BASS: Benchmarking Audio LMs for Musical Structure and Semantic Reasoning cs.SD | cs.CLPDF
Min Jang, Orevaoghene Ahia, Nazif Tamer, Sachin Kumar, Yulia Tsvetkov
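TL;DR: This paper introduces BASS, a benchmark for evaluating music understanding and reasoning in audio language models across four task categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. It contains 2658 questions and 1993 unique songs covering over 138 hours of music. The study finds that even frontier multimodal models perform poorly on higher-level reasoning tasks (such as structural segmentation and artist collaboration) and perform best on lyric transcription.
Details
Motivation: Music understanding requires handling both structural and semantic elements of audio, and the music reasoning ability of existing audio language models is unclear. A comprehensive benchmark is therefore needed to evaluate and advance audio language models.
Result: Fourteen open-source and frontier multimodal language models were evaluated on BASS. Models struggle on higher-level reasoning tasks (such as structural segmentation and artist collaboration) and perform best on lyric transcription.
Insight: The contribution is a comprehensive music understanding benchmark covering diverse tasks and music data. The analysis indicates that current models use linguistic priors effectively but remain limited in reasoning over musical structure, vocals, and musicological attributes, which points to directions for improving audio language models.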
Abstract: Music understanding is a complex task that often requires reasoning over both structural and semantic elements of audio. We introduce BASS, designed to evaluate music understanding and reasoning in audio language models across four broad categories: structural segmentation, lyric transcription, musicological analysis, and artist collaboration. BASS comprises 2658 questions spanning 12 tasks, 1993 unique songs and covering over 138 hours of music from a wide range of genres and tracks, crafted to assess musicological knowledge and reasoning in real-world scenarios. We evaluate 14 open-source and frontier multimodal LMs, finding that even state-of-the-art models struggle on higher-level reasoning tasks such as structural segmentation and artist collaboration, while performing best on lyric transcription. Our analysis reveals that current models leverage linguistic priors effectively but remain limited in reasoning over musical structure, vocal, and musicological attributes. BASS provides an evaluation framework with widespread applications in music recommendation and search and has the potential to guide the development of audio LMs.
cs.LG [Back]
[98] Training Data Efficiency in Multimodal Process Reward Models cs.LG | cs.CL | cs.MMPDF
Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu
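TL;DR: This paper studies data efficiency in training multimodal process reward models (MPRMs). Theoretical analysis and experiments show that existing large-scale training data built from Monte Carlo (MC) annotation is redundant: model performance saturates quickly under random subsampling. The authors therefore propose the Balanced-Information Score (BIS), a data selection method that uses existing MC signals at the rollout level to account for both the mixture of positive and negative step labels and label reliability, at no additional cost. On VisualProcessBench with two backbones, InternVL2.5-8B and Qwen2.5-VL-7B, small BIS-selected subsets (for example, only 10% of the data) match or even surpass full-data training.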
Details
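Motivation: Training MPRMs typically requires large-scale MC-annotated datasets, which is costly. Preliminary experiments show that MPRM performance saturates quickly when the training data is randomly subsampled, indicating substantial redundancy in existing MC-annotated corpora. The paper therefore studies how to improve the data efficiency of MPRM training and reduce the amount of training data required.
Result: Experiments use two backbones, InternVL2.5-8B and Qwen2.5-VL-7B, on VisualProcessBench. BIS-selected subsets match or even surpass full-data performance at small fractions. Notably, the BIS subset reaches full-data performance with only 10% of the training data, a relative improvement of 4.1% over random subsampling.
Insight: The contributions are: 1) a theoretical formalization of two key factors behind informative gradient updates, namely the label mixture of positive and negative steps and the label reliability of positive steps (their average MC score); 2) the Balanced-Information Score (BIS) built on this analysis, an efficient data selection strategy with no additional cost that prioritizes rollouts with both good mixture and high reliability. This offers a useful approach for reducing dependence on large-scale annotated data and improving training efficiency.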
Abstract: Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
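Example: The idea of ranking rollouts by both label mixture and label reliability can be sketched as follows. The abstract does not state the BIS formula, so `toy_score` is a stand-in, not BIS itself: it multiplies an assumed mixture term, 4p(1-p) for positive-step fraction p, by the average MC score of the positive steps. The positive-label threshold, the function names, and the data are likewise hypothetical.

```python
def toy_score(mc_scores, threshold=0.0):
    """Hypothetical rollout score: label mixture times positive-step reliability.

    A step counts as positive when its MC score exceeds `threshold` (assumed rule).
    """
    positives = [s for s in mc_scores if s > threshold]
    if not mc_scores or not positives:
        return 0.0
    p = len(positives) / len(mc_scores)
    mixture = 4.0 * p * (1.0 - p)   # 1.0 at an even split, 0.0 if all labels agree
    reliability = sum(positives) / len(positives)
    return mixture * reliability


def select_top_fraction(rollouts, fraction=0.1):
    """Keep the highest-scoring fraction of rollouts (at least one)."""
    k = max(1, round(len(rollouts) * fraction))
    return sorted(rollouts, key=toy_score, reverse=True)[:k]


# Each rollout is a list of per-step MC scores (made-up values).
rollouts = [
    [1.0, 1.0, 1.0, 1.0],    # all positive: no mixture
    [0.0, 0.0, 0.0, 0.0],    # all negative: no mixture
    [1.0, 1.0, 0.0, 0.0],    # mixed, reliable positives
    [0.25, 0.25, 0.0, 0.0],  # mixed, unreliable positives
]
selected = select_top_fraction(rollouts, fraction=0.25)
print(selected)  # prints: [[1.0, 1.0, 0.0, 0.0]]
```

Because the score reuses MC annotations that already exist, this kind of selection adds no annotation cost, which is the property the abstract highlights.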
[99] RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning cs.LG | cs.AI | cs.CL | cs.CR | math.OCPDF
Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu
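TL;DR: This paper proposes Risk-Aware Preference Optimization (RAPO), a framework that aims to improve how well the safe reasoning of large reasoning models (LRMs) generalizes to diverse and complex jailbreak attacks. It guides the model to adaptively identify and address safety risks at an appropriate granularity in its thinking content, strengthening its defenses while preserving general utility.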
Details
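Motivation: Large reasoning models have succeeded with chain-of-thought reasoning, but their safe reasoning process generalizes poorly to complex and diverse jailbreak attacks, so they fail to refuse harmful prompts. The paper addresses this lack of generalization in safe reasoning.
Result: Extensive experiments show that RAPO enables multiple large reasoning models to generalize their safe reasoning adaptively across diverse attack prompts while preserving general utility.
Insight: The core contribution is combining safety risk awareness with preference optimization in a framework that lets the model adaptively adjust the granularity of its attention to safety risks during reasoning. Viewed objectively, this offers a new approach to robust alignment of large reasoning models and emphasizes that safety defenses need to be coupled with the dynamic adaptivity of the reasoning process.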
Abstract: Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs’ safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at https://github.com/weizeming/RAPO.
[100] Rethinking the Trust Region in LLM Reinforcement Learning cs.LG | cs.AI | cs.CLPDF
Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du
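TL;DR: This paper proposes DPPO, a new algorithm that improves the trust region constraint in reinforcement learning for large language models. The authors argue that the ratio clipping mechanism in standard PPO is ill-suited to the large vocabularies of LLMs, so they replace heuristic clipping with a constraint based on a direct estimate of policy divergence and reduce the computational overhead with efficient approximations. Experiments show that DPPO outperforms existing methods in training stability and efficiency.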
Details
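Motivation: Ratio clipping in standard PPO relies on the probability ratio of sampled tokens, which for large-vocabulary LLMs is a very noisy single-sample Monte Carlo estimate. As a result, updates to low-probability tokens are over-penalized while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to inefficient and unstable training.
Result: Extensive empirical evaluation shows that DPPO achieves better training stability and efficiency than existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
Insight: The core contribution is replacing PPO's heuristic ratio clipping with a constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL), together with efficient Binary and Top-K approximations that avoid a huge memory footprint. This controls policy updates in a more principled way and addresses PPO's structural mismatch with large vocabularies.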
Abstract: Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
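Example: The mismatch between the sampled-token ratio and the true policy divergence can be seen on a toy three-token vocabulary. This is a minimal sketch, not the DPPO implementation: the distributions are made up, `outside_clip_range` ignores the advantage sign that real PPO clipping depends on, and `binary_tv` is only one assumed reading of the "Binary" approximation, which the abstract names but does not define.

```python
def total_variation(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sampled_ratio(p_old, p_new, token):
    """Probability ratio of one sampled token, the quantity PPO clips."""
    return p_new[token] / p_old[token]


def outside_clip_range(ratio, eps=0.2):
    """Simplified check: real PPO clipping also depends on the advantage sign."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps


def binary_tv(p_old, p_new, token):
    """Assumed reading of a 'Binary' approximation: TV over {token, everything else}."""
    return abs(p_new[token] - p_old[token])


# Case A: a rare token doubles in probability; the policy barely moves.
old_a, new_a = [0.98, 0.01, 0.01], [0.97, 0.02, 0.01]
# Case B: the dominant token loses 0.15 of its mass; the policy moves a lot.
old_b, new_b = [0.90, 0.05, 0.05], [0.75, 0.20, 0.05]

for name, old, new, tok in [("A", old_a, new_a, 1), ("B", old_b, new_b, 0)]:
    r = sampled_ratio(old, new, tok)
    print(name, round(r, 3), outside_clip_range(r),
          round(total_variation(old, new), 3), round(binary_tv(old, new, tok), 3))
```

In case A the ratio is 2.0, well outside the clip range, although the total variation is only 0.01. In case B the ratio is about 0.833 and stays inside the range, although the total variation is 0.15. This is the over-penalized and under-constrained pattern the abstract describes.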
[101] PromptSplit: Revealing Prompt-Level Disagreement in Generative Models cs.LG | cs.AI | cs.CVPDF
Mehdi Lotfian, Mohammad Jalali, Farzan Farnia
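TL;DR: This paper proposes PromptSplit, a framework for detecting and analyzing prompt-dependent behavioral differences between generative models. It builds a joint tensor-product embedding of prompts and outputs, computes kernel covariance matrices, and uses their eigenspace to identify the main directions along which model behavior differs.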
Details
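Motivation: Prompt-based generative AI models have developed rapidly across vision and language, and differences in training data and architecture make them behave differently. A principled method is needed to identify which types of prompts lead to markedly different model behavior.
Result: Experiments on text-to-image, text-to-text, and image-captioning tasks show that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, providing an interpretable tool for analyzing model behavior.
Insight: The contribution is a scalable kernel-based framework that uses a random-projection approximation to reduce computational complexity to O(nr² + r³), together with a theoretical analysis showing that the approximation error is bounded. This gives an efficient and interpretable way to analyze behavioral differences between generative models at scale.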
Abstract: Prompt-guided generative AI models have rapidly expanded across vision and language domains, producing realistic and diverse outputs from textual inputs. The growing variety of such models, trained with different data and architectures, calls for principled methods to identify which types of prompts lead to distinct model behaviors. In this work, we propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. For each compared model pair, PromptSplit constructs a joint prompt–output representation by forming tensor-product embeddings of the prompt and image (or text) features, and then computes the corresponding kernel covariance matrix. We utilize the eigenspace of the weighted difference between these matrices to identify the main directions of behavioral difference across prompts. To ensure scalability, we employ a random-projection approximation that reduces computational complexity to $O(nr^2 + r^3)$ for projection dimension $r$. We further provide a theoretical analysis showing that this approximation yields an eigenstructure estimate whose expected deviation from the full-dimensional result is bounded by $O(1/r^2)$. Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, offering an interpretable tool for detecting where generative models disagree.
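Example: A tensor-product embedding of a prompt feature and an output feature has one coordinate per pair of dimensions, and its inner product factorizes into the product of the two separate inner products. The sketch below checks that identity for plain linear features. The feature vectors and function names are hypothetical; the abstract does not specify which embeddings or kernels PromptSplit uses.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(x, y):
    """Flattened outer product: one feature per (prompt dim, output dim) pair."""
    return [a * b for a in x for b in y]


def product_kernel(x1, y1, x2, y2):
    """Same value as dot(tensor_product(x1, y1), tensor_product(x2, y2))."""
    return dot(x1, x2) * dot(y1, y2)


# Made-up prompt and output features for two prompt/output pairs.
prompt_1, output_1 = [1, 2], [3, 4, 5]
prompt_2, output_2 = [0, 1], [1, 0, 2]

explicit = dot(tensor_product(prompt_1, output_1),
               tensor_product(prompt_2, output_2))
implicit = product_kernel(prompt_1, output_1, prompt_2, output_2)
print(explicit, implicit)  # prints: 26 26
```

The factorized form never materializes the joint vector, whose length is the product of the two feature dimensions, which is one reason joint prompt-output representations of this kind remain tractable.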
cs.HC [Back]
[102] PersoPilot: An Adaptive AI-Copilot for Transparent Contextualized Persona Classification and Personalized Response Generation cs.HC | cs.CLPDF
Saleh Afzoon, Amin Beheshti, Usman Naseem
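TL;DR: PersoPilot is an adaptive AI copilot that supports both end users and analysts by combining persona understanding with contextual analysis. It includes a transparent chat interface for user interaction and a reasoning-based labeling assistant that analysts use for active-learning-driven classification, with the goal of precise, context-aware personalization.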
Details
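Motivation: Existing systems usually treat persona and context as separate inputs, which limits their ability to generate nuanced, adaptive interactions. The paper addresses this limitation and bridges the gap between raw persona data and actionable, context-aware insights.
Result: The abstract reports no quantitative benchmark results or SOTA comparisons. It describes a feedback loop that enables targeted service recommendations and adaptive personalization, and states that PersoPilot, as an adaptable framework, applies to a broad range of service personalization scenarios.
Insight: The main contribution is a transparent, explainable integration of persona classification with contextual analysis, and a dual feedback loop comprising an end-user chat interface and an analyst labeling assistant. Active learning lets the system adapt over time to newly labeled data, enabling dynamic personalization.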
Abstract: Understanding and classifying user personas is critical for delivering effective personalization. While persona information offers valuable insights, its full potential is realized only when contextualized, linking user characteristics with situational context to enable more precise and meaningful service provision. Existing systems often treat persona and context as separate inputs, limiting their ability to generate nuanced, adaptive interactions. To address this gap, we present PersoPilot, an agentic AI-Copilot that integrates persona understanding with contextual analysis to support both end users and analysts. End users interact through a transparent, explainable chat interface, where they can express preferences in natural language, request recommendations, and receive information tailored to their immediate task. On the analyst side, PersoPilot delivers a transparent, reasoning-powered labeling assistant, integrated with an active learning-driven classification process that adapts over time with new labeled data. This feedback loop enables targeted service recommendations and adaptive personalization, bridging the gap between raw persona data and actionable, context-aware insights. As an adaptable framework, PersoPilot is applicable to a broad range of service personalization scenarios.
[103] WebAccessVL: Making an Accessible Web via Violation-Conditioned VLM cs.HC | cs.AI | cs.CVPDF
Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. Yeh
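TL;DR: This paper proposes WebAccessVL, a system based on a vision-language model (VLM) that automatically edits website HTML to fix violations of the Web Content Accessibility Guidelines 2 (WCAG2). The method formulates the problem as a supervised image-conditioned program synthesis task, and introduces a new dataset and a VLM conditioned on the violation count to guide the correction process.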
Details
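Motivation: The paper addresses web accessibility: automatically detecting and correcting the parts of a website's HTML that violate WCAG2, so that people with disabilities can access the web more easily.
Result: Experiments show that the method reduces the average number of violations per website from 5.34 to 0.44, outperforming commercial LLM APIs such as Gemini and GPT-5. A perceptual study confirms that the corrected websites preserve the original visual appearance and content.
Insight: The main contributions are formulating web accessibility correction as an image-conditioned program synthesis task and proposing a VLM that takes the WCAG2 violation count as an additional condition. Viewed objectively, the collected paired dataset and the conditioning approach that combines the visual rendering, the HTML code, and violation information offer a new approach to structured editing problems of this kind.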
Abstract: We present a vision-language model (VLM) that automatically edits website HTML to address Web Content Accessibility Guidelines 2 (WCAG2) violations. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given the HTML and its rendering. We collected WebAccessVL, a new dataset with manually corrected accessibility violations, establishing paired training data. We then propose a violation-conditioned VLM that additionally conditions on the WCAG2 violation count to guide the correction process. Experiments demonstrate that our method effectively reduces the average number of violations from 5.34 to 0.44 per website, outperforming commercial LLM APIs (Gemini, GPT-5). A perceptual study confirms that our edited websites maintain the original visual appearance and content.