Table of Contents
- cs.CL [Total: 55]
- cs.CV [Total: 122]
- cs.IR [Total: 2]
- eess.AS [Total: 1]
- cs.CR [Total: 1]
- cs.SD [Total: 1]
- cs.AI [Total: 7]
- cs.LG [Total: 8]
- cs.CY [Total: 2]
- cs.RO [Total: 1]
- cs.SE [Total: 2]
- cs.GR [Total: 1]
cs.CL
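[1] TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction cs.CL
Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao
TL;DR: This paper presents TexOCR, an OCR system that reconstructs scientific PDF documents as compilable LaTeX code. It introduces the TexOCR-Bench benchmark and the large-scale TexOCR-Train corpus for evaluating and training models. A 2B-parameter model is trained with supervised fine-tuning followed by reinforcement learning with verifiable rewards, targeting the failure of existing OCR systems to preserve document structure, float placement, and referential integrity.
Motivation: Existing document OCR mainly targets plain text or Markdown and ignores the structural and executable properties that make LaTeX essential in scientific publishing. The paper addresses page-level reconstruction of scientific PDFs into compilable LaTeX so that the full document structure and compilability are preserved.
Result: Experiments with 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants (consistent section structure, correct float placement, valid label-reference links), which undermines compilation reliability and downstream usability. Further analysis shows that reinforcement learning with verifiable rewards yields consistent improvements over supervised fine-tuning alone, particularly on structural and compilation metrics.
Insight: The paper reframes OCR as reconstruction of compilable LaTeX. It contributes a multi-dimensional evaluation suite (transcription fidelity, structural faithfulness, end-to-end compilability) and a verifiable reward for reinforcement learning derived from LaTeX unit tests, which directly enforces compilability and referential integrity.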
Abstract: Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
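The abstract describes verifiable rewards that enforce referential integrity. As a rough illustration only, the sketch below scores a LaTeX string by the fraction of references that resolve to a defined label. The function name, the regular expressions, and the scoring rule are assumptions made for this example and are not taken from the paper; a real reward would also need to compile the document.

```python
import re

# Hypothetical patterns: common label and reference commands only.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|autoref|cref)\{([^}]*)\}")


def referential_integrity_reward(latex_source: str) -> float:
    """Fraction of references that point at a defined label (1.0 if there are none)."""
    labels = set(LABEL_RE.findall(latex_source))
    refs = []
    for group in REF_RE.findall(latex_source):
        # Commands such as \cref accept comma-separated keys.
        refs.extend(key.strip() for key in group.split(","))
    if not refs:
        return 1.0
    resolved = sum(1 for key in refs if key in labels)
    return resolved / len(refs)


page = r"""
\section{Method}\label{sec:method}
See Figure~\ref{fig:arch} and Section~\ref{sec:method}.
\begin{figure}\caption{Architecture}\label{fig:arch}\end{figure}
Equation~\eqref{eq:missing} is undefined.
"""

# Two of the three references resolve; eq:missing has no matching label.
print(referential_integrity_reward(page))
```

A string-level check like this is cheap enough to run on every rollout, which is one reason such tests are attractive as reward signals.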
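[2] AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs cs.CL | cs.LG | cs.PL
Pouya Pezeshkpour, Estevam Hruschka
TL;DR: This paper proposes AutoPyVerifier, a framework that automatically learns compact executable verifiers for assessing the quality of large language model (LLM) outputs. It uses an LLM to synthesize candidate verifier functions and refines them through search over a directed acyclic graph (DAG), producing a minimal set of Python verifiers whose joint satisfaction closely matches a target objective such as correctness.
Motivation: Current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and error-prone, while deterministic executable verifiers are reliable and interpretable but limited in capability. The paper asks how to induce, from a development set of LLM outputs and labels, a compact set of Python verifiers that approximates the target objective.
Result: Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points.
Insight: The contribution is the combination of LLM generation with deterministic DAG search to learn a compact, executable, semantically grounded verifier set. The analysis shows that the most useful verification targets vary by benchmark and model, and that the search shifts the learned sets toward more structural and interpretable checks, which makes them usable for LLM training and inference-time control.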
Abstract: Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code
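To make "a compact verifier set whose joint satisfaction matches the objective" concrete, here is a minimal sketch. It swaps the paper's DAG search for plain greedy forward selection, and every verifier, output string, and label is invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Verifier = Callable[[str], bool]


def f1(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """F1 of boolean predictions against boolean labels."""
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def joint(verifiers: Sequence[Verifier], output: str) -> bool:
    """An output passes only if every selected verifier accepts it."""
    return all(v(output) for v in verifiers)


def greedy_select(candidates: List[Verifier], outputs: List[str],
                  labels: List[bool]) -> Tuple[List[Verifier], float]:
    """Greedy forward selection (an assumption, not the paper's DAG search)."""
    selected: List[Verifier] = []
    best = f1([True] * len(outputs), labels)  # the empty set accepts everything
    while True:
        choice = None
        for v in candidates:
            if v in selected:
                continue
            score = f1([joint(selected + [v], o) for o in outputs], labels)
            if score > best:
                best, choice = score, v
        if choice is None:
            return selected, best
        selected.append(choice)


# Hypothetical candidate verifiers and development data.
def has_prefix(o: str) -> bool: return o.startswith("Answer:")
def ends_digit(o: str) -> bool: return o[-1].isdigit()
def is_short(o: str) -> bool: return len(o) < 100

outputs = ["Answer: 42", "Answer: forty-two", "42", "Answer: 7 maybe"]
labels = [True, False, False, False]
chosen, score = greedy_select([has_prefix, ends_digit, is_short], outputs, labels)
print([v.__name__ for v in chosen], score)
```

The uninformative `is_short` check is never selected because it does not improve F1, which is the sense in which the resulting set stays compact.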
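[3] Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge cs.CL | cs.AI | cs.CV | cs.IR
Mengyu Wang, Xiaoying Zhi, Zhiyi Li, Robin Schmucker, Shay B. Cohen
TL;DR: This paper proposes Self-Knowledge Re-expression (SKR), a task-agnostic adaptation method aimed at the performance bottleneck of large language models (LLMs) on specialized, non-generative tasks. SKR converts the LLM's output from generic token generation into an efficient task-specific expression, improving performance on information retrieval, object detection, and anomaly detection without human annotation or model distillation.
Motivation: The next-token prediction (NTP) paradigm lets LLMs express their intrinsic knowledge, but its sequential nature limits performance on specialized non-generative tasks. The authors attribute this bottleneck to the knowledge expression mechanism rather than to insufficient knowledge acquisition, and propose SKR to adapt that expression to the task.
Result: On a large financial document dataset, SKR improves Recall@1 for information retrieval by over 40%, reduces object detection latency by over 76%, and increases anomaly detection AUPRC by over 33%. On the MMDocRAG dataset, its results surpass leading retrieval models by at least 12.6%.
Insight: The claimed novelty is that SKR is a fully local, task-agnostic adaptation method that needs only unannotated data to re-express an LLM's intrinsic knowledge in a task-specific form. Viewed objectively, it bypasses conventional supervision and distillation and optimizes the expression mechanism directly, giving a lightweight way to adapt LLMs to specialized domains.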
Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs’ knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM’s output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.
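[4] Evaluating Temporal Consistency in Multi-Turn Language Models cs.CL
Yash Kumar Atri, Steven L. Johnson, Tom Hartvigsen
TL;DR: This paper studies temporal consistency of language models in multi-turn dialogue. It introduces the notion of temporal scope stability and builds ChronoScope, a large-scale diagnostic benchmark that tests whether models preserve, override, or transfer time-scoped factual context in controlled multi-turn interactions. An extensive evaluation of state-of-the-art models finds that temporal scope stability is frequently violated, that models drift toward present-day assumptions, and that failures grow with interaction length, exposing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction.
Motivation: In interactive deployments users reason about facts over time rather than in isolation, so models must maintain and update the implicit temporal assumptions established earlier in a conversation. This is the temporal scope stability problem.
Result: On ChronoScope, which contains over one million deterministically generated question chains grounded in Wikidata, state-of-the-art language models frequently violate temporal scope stability in controlled multi-turn settings. Models often drift toward present-day assumptions despite correct underlying knowledge, and the failures intensify with interaction length and persist even under oracle context conditions.
Insight: The contribution is temporal scope stability as a diagnostic lens, together with a large, controllable benchmark that isolates temporal behavior in multi-turn interaction. Viewed objectively, the study exposes how fragile temporal reasoning is under sequential interaction and provides a benchmark for improving temporal consistency.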
Abstract: Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope
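[5] DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining cs.CL | cs.AI | cs.LG
Youze Zheng, Jianyou Wang, Yuhan Chen, Matthew Feng, Longtian Bao
TL;DR: This paper proposes DeepImagine, a framework that trains large language models for biomedical reasoning through successive counterfactual imagining, with the goal of improving clinical trial outcome prediction. It builds natural and approximate counterfactual pairs from real clinical trial data, trains with a combination of supervised fine-tuning and reinforcement learning, and adds synthetic reasoning traces to improve interpretability.
Motivation: Both existing large language models and traditional correlational predictors achieve limited performance on clinical trial outcome prediction. The paper aims to approximate the hidden causal mechanisms of trials through counterfactual reasoning.
Result: Models under 10B parameters (including Qwen3.5-9B) are trained with DeepImagine and evaluated on clinical trial outcome prediction. The abstract states consistent improvement over untuned language models and traditional correlational baselines (such as random forests and logistic regression) as an aim rather than a reported result, and does not say whether SOTA is reached.
Insight: The novelty is training models to infer how trial results change under perturbations of experimental conditions, as a way to approximate causal mechanisms. Reusable ideas include pairing supervised and reinforcement learning for different counterfactual supervision settings, and using synthetic reasoning traces to improve interpretability and scientific usefulness.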
Abstract: Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
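[6] ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents cs.CL
Yating Wu, Yuhao Zhang, Sayan Ghosh, Sourya Basu, Anoop Deoras
TL;DR: This paper proposes ContextWeaver, a selective, dependency-structured memory framework for LLM agents that addresses reasoning failures caused by poor management of history in long-context interactions. It organizes the interaction trace into a graph of reasoning steps and selects and assembles relevant context based on dependencies to support multi-step reasoning.
Motivation: LLM agents struggle in long-context interactions. Sliding windows and prompt compression may drop earlier structured information that later steps depend on, while retrieval-based memory systems overlook the causal and logical structure that multi-step reasoning requires.
Result: On the SWE-Bench Verified and Lite benchmarks, ContextWeaver outperforms a sliding-window baseline in pass@1 while reducing reasoning steps and token usage.
Insight: The contributions are dependency-based memory construction and traversal, compact dependency summarization, and a lightweight validation layer. Modeling logical dependencies gives tool-using LLM agents a stable and scalable memory mechanism.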
Abstract: Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent’s interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.
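The difference between a sliding window and dependency-based selection can be shown on a toy trace. The sketch below is an assumption-laden illustration: the step ids, the `deps` mapping, and the function name are invented here, and the paper's summarization and validation layers are left out.

```python
from typing import Dict, List


def dependency_context(deps: Dict[int, List[int]], step: int) -> List[int]:
    """Return every earlier step that `step` transitively depends on, ascending."""
    seen = set()
    stack = list(deps.get(step, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, []))
    return sorted(seen)


def sliding_window(step: int, size: int) -> List[int]:
    """Baseline: keep only the most recent `size` steps before `step`."""
    return list(range(max(1, step - size), step))


# Hypothetical trace: step -> steps it directly relies on.
deps = {1: [], 2: [1], 3: [], 4: [2], 5: [3], 6: [4]}

print(dependency_context(deps, 6))  # follows 6 -> 4 -> 2 -> 1
print(sliding_window(6, 2))         # keeps 4 and 5, losing 1 and 2
```

In this toy trace the window keeps an unrelated step (5) and drops two ancestors (1 and 2) that the current step relies on, which is the failure mode the abstract attributes to window-based context management.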
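[7] DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding cs.CL
Qiyuan Jin
TL;DR: This paper proposes DARC-CLIP, a CLIP-based adaptive multimodal fusion framework for meme understanding. It introduces Adaptive Cross-Attention Refiners (ACAR) for bidirectional information alignment and Dynamic Feature Adapters (DFA) for task-sensitive semantic adaptation, addressing the inability of static fusion to capture fine-grained dependencies between modalities.
Motivation: Memes convey meaning through the interaction of visual and textual signals and often combine humor, irony, and offense. Existing CLIP-based methods rely on static fusion, which struggles to capture fine-grained cross-modal dependencies, so detecting harmful or sensitive content needs more accurate multimodal modeling.
Result: On the PrideMM benchmark (hate, target, stance, and humor classification), DARC-CLIP achieves highly competitive accuracy, with gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. A generalization test on the CrisisHateMM dataset further supports its effectiveness, and ablation studies confirm that ACAR and DFA are the main contributors to the gains.
Insight: The main contribution is a hierarchical refinement stack that performs dynamic, bidirectional modality alignment and fusion through adaptive cross-attention, combined with task-sensitive feature adaptation. This offers an effective adaptive cross-signal refinement strategy for multimodal analysis of socially sensitive content.
Abstract: Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine-grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners (ACAR) for bidirectional information alignment and Dynamic Feature Adapters (DFA) for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with significant gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. Ablation studies confirm that ACAR and DFA are the main contributors to these gains. These results show that adaptive cross-signal refinement is an effective strategy for multimodal content analysis in socially sensitive classification.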
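[8] Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt cs.CL | cs.AI
Zhenzhen Huang, Chaoning Zhang, Fachrina Dewi Puspitasari, Jiaquan Zhang, Yitian Zhou
TL;DR: This paper proposes optimizing the input to a large language model (LLM) before inference through explicit prompt disambiguation, targeting the problem that semantic ambiguity in natural prompts leads to wrong reasoning paths. The method identifies semantic risks in the prompt, checks their multi-perspective consistency, resolves semantic conflicts, and passes the resulting clear, structured prompt to the LLM. A computationally efficient small language model (SLM) serves as the main executor of the disambiguation.
Motivation: LLM performance depends heavily on open-ended user prompts. Natural prompts often break syntactic rules, producing ambiguous queries with several interpretations, so the model cannot reliably choose the correct reasoning path. Prior work edits the query during inference but does not resolve the root cause of the ambiguity.
Result: Comprehensive experiments on multiple benchmarks show that the method improves reasoning performance by 2.5 points at a cost of only $0.02.
Insight: The main contribution is an explicit pre-inference disambiguation mechanism that turns an ambiguous natural prompt into a logically structured input, letting the LLM focus on the semantically essential tokens. A key engineering insight is that an efficient SLM can carry out most of the disambiguation, balancing performance gains against cost without disturbing the LLM's internal inference mechanism.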
Abstract: Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction following capability. However, the model’s performance is highly dependent on the open-ended characteristics of the users’ input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution to the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.
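[9] Au-M-ol: A Unified Model for Medical Audio and Language Understanding cs.CL | cs.AI
Meizhu Liu, Nistha Mitra, Paul Li, Amine Abdaoui, Adam Ledyard
TL;DR: This paper presents Au-M-ol, a multimodal architecture that extends large language models (LLMs) with audio processing to improve clinically relevant tasks such as medical automatic speech recognition (ASR). It has three main components (an audio encoder, an adaptation layer, and a pretrained LLM) and interprets spoken medical content directly, improving accuracy and robustness. On medical transcription it reduces word error rate (WER) by 56% relative to state-of-the-art baselines and performs well under noise, specialized terminology, and speaker variability.
Motivation: Existing ASR systems lack accuracy and robustness on clinical tasks such as transcription, particularly under noisy environments, specialized terminology, and speaker variability.
Result: On medical transcription tasks, WER is reduced by 56% compared with state-of-the-art baselines. The model also performs well in challenging conditions including noisy environments, domain-specific terminology, and speaker variability.
Insight: The contribution is a unified integration of audio processing with LLMs: an audio encoder and an adaptation layer map acoustic features into the LLM input space, enabling end-to-end medical audio and language understanding. Viewed objectively, this multimodal design is a plausible route to reliable, context-aware audio understanding in clinical settings.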
Abstract: In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
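For readers unfamiliar with the metric, the sketch below computes word error rate as word-level edit distance divided by reference length, plus a relative-reduction helper. The abstract does not say whether the 56% figure is relative or absolute; the relative reading and all example sentences and numbers here are assumptions for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, model: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - model) / baseline


# Hypothetical transcripts: one substituted word out of four.
print(wer("patient denies chest pain", "patient denies chess pain"))
# Hypothetical error rates that would correspond to a 56% relative reduction.
print(relative_reduction(0.25, 0.11))
```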
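[10] $\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction cs.CL | cs.AI
Bingfeng Chen, Chenjie Qiu, Yifeng Xie, Boyan Xu, Ruichu Cai
TL;DR: This paper proposes S^2IT, a stepwise syntax integration tuning framework that progressively injects syntactic structure knowledge into large language models to improve Aspect Sentiment Quad Prediction (ASQP). It decomposes quadruple generation into two stages, global syntax-guided extraction and local syntax-guided classification, and adds fine-grained structural tuning to strengthen the model's grasp of syntactic structure.
Motivation: Syntactic structure information proved effective in earlier extractive paradigms but remains underused in the generative paradigm because of the limited reasoning capabilities of LLMs. The paper asks how to integrate syntactic knowledge into LLMs effectively to improve ASQP.
Result: Experiments show that S^2IT significantly improves on state-of-the-art performance across multiple datasets.
Insight: The contribution is a stepwise tuning framework that splits the complex quadruple generation task into two syntax-guided stages and adds fine-grained structure prediction tasks (element links and node classification) to improve structural understanding, showing one way to use syntactic knowledge in the generative paradigm.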
Abstract: Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model’s understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/S2IT.
[11] Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance cs.CL | cs.LG
Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun
TL;DR: This paper proposes SHEAR (Span-level Hidden state Enabled Advantage Reweighting), a method for fine-grained credit assignment in reinforcement learning with verifiable rewards (RLVR). It computes the Wasserstein distance between span-level hidden-state distributions of correct and incorrect rollouts to locate regions where local reasoning quality diverges, then rescales token-level advantages accordingly, improving on Group Relative Policy Optimization (GRPO).
Motivation: GRPO performs coarse-grained credit assignment in RLVR by giving every token the same advantage. Process reward models provide finer supervision but need step-level annotation or extra reward-model training. The paper aims to extract a signal from hidden-state distributions using only outcome-level correctness labels, achieving fine-grained credit assignment without extra annotation.
Result: Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, with no additional annotation or reward-model training.
Insight: The core finding, supported by a separation theorem, is that the Wasserstein distance between hidden-state distributions of correct and incorrect rollouts grows in regions where local reasoning diverges, so it can serve as a self-supervision signal for fine-grained credit assignment. SHEAR uses this signal to rescale token-level advantages, amplifying updates on the tokens that matter at the divergence, with no additional model and only minimal changes to the training pipeline.
Abstract: Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden-state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
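The reweighting idea can be illustrated on toy vectors. This sketch is not the paper's estimator: it assumes equal-sized samples, approximates the multivariate distance by averaging per-dimension 1-D Wasserstein-1 distances, and normalizes span weights to mean 1. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def w1_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical 1-D Wasserstein-1 distance between equal-sized samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def span_distance(correct: List[List[float]], incorrect: List[List[float]]) -> float:
    """Average per-dimension W1 between two sets of hidden vectors (a simplification)."""
    dims = len(correct[0])
    per_dim = [w1_1d([v[d] for v in correct], [v[d] for v in incorrect])
               for d in range(dims)]
    return sum(per_dim) / dims


def reweight(advantage: float, distances: Sequence[float]) -> List[float]:
    """Scale one rollout-level advantage per span, keeping the mean weight at 1."""
    mean = sum(distances) / len(distances)
    if mean == 0:
        return [advantage] * len(distances)
    return [advantage * d / mean for d in distances]


# Span 0: correct and incorrect rollouts look identical (before divergence).
span0 = span_distance([[0.0, 1.0], [0.2, 1.2]], [[0.0, 1.0], [0.2, 1.2]])
# Span 1: hidden states have separated (after divergence).
span1 = span_distance([[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]])

print(span0, span1)
print(reweight(1.0, [span0, span1]))
```

With a uniform GRPO advantage of 1.0, the post-divergence span receives the larger scaled advantage while the pre-divergence span is down-weighted.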
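[12] Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss cs.CL | cs.SD
Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi
TL;DR: This paper proposes a multimodal retrieval framework that uses cross-modal attention and a hybrid loss function to make semantic alignment between text and long, noisy audio more robust.
Motivation: Existing audio-text retrieval methods rely on contrastive learning and large-batch training, and therefore struggle with long, noisy, and weakly labeled audio.
Result: Experiments on benchmark datasets show improvements over prior methods. The approach is designed to handle long-form and noisy audio (SNR 5 to 15).
Insight: The contributions are a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention, and a hybrid loss blending cosine similarity, L1, and contrastive objectives that keeps training stable under small batches. Long audio is handled efficiently through silence-aware chunking and attention-based pooling.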
Abstract: Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
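The abstract names the three ingredients of the hybrid loss but not how they are combined. The sketch below assumes a weighted sum of a cosine term, a mean absolute (L1) term, and an InfoNCE-style contrastive term over the batch; the weights, the temperature, and the toy embeddings are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; inputs are assumed to be non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hybrid_loss(audio: List[List[float]], text: List[List[float]],
                w_cos: float = 1.0, w_l1: float = 1.0, w_con: float = 1.0,
                temperature: float = 0.1) -> float:
    """Weighted sum of cosine, L1, and contrastive terms over paired embeddings."""
    n = len(audio)
    cos_term = sum(1.0 - cosine(a, t) for a, t in zip(audio, text)) / n
    l1_term = sum(sum(abs(x - y) for x, y in zip(a, t)) / len(a)
                  for a, t in zip(audio, text)) / n
    con_term = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, t) / temperature for t in text]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        con_term += log_denominator - logits[i]
    con_term /= n
    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term


audio_emb = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

print(hybrid_loss(audio_emb, matched_text))  # close to zero
print(hybrid_loss(audio_emb, swapped_text))  # much larger
```

The cosine and L1 terms depend only on each matched pair, so they stay informative even when the batch is too small to supply many negatives for the contrastive term.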
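[13] Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue cs.CL | cs.HC
Yangyang Zhao, Linfan Dai, Li Cai, Bowen Xing, Libo Qin
TL;DR: This paper proposes VLK-RL, a hybrid framework that combines the constraint reasoning of large language models with the long-horizon behavior optimization of reinforcement learning, targeting constraint reasoning and long-horizon action planning in cross-domain task-oriented dialogue. A cross-examination mechanism ensures the reliability of LLM outputs, which are then converted into a structured state representation, improving the generalization and robustness of the dialogue system.
Motivation: Cross-domain task-oriented dialogue must handle implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. LLMs can infer constraints but are unreliable over long horizons; RL optimizes long-horizon behavior but cannot recover constraints from raw dialogue. Naively coupling the two lets unreliable LLM outputs corrupt the state representation and mislead policy learning.
Result: Experiments across multiple benchmarks show that VLK-RL significantly improves generalization and robustness and outperforms strong single-model baselines on long-horizon tasks.
Insight: The contribution is a dual-role cross-examination procedure that suppresses LLM hallucinations and cross-turn inconsistencies, followed by mapping the verified constraints into ontology-aligned slot-value representations that give RL policy optimization a structured, constraint-aware state. This is a systematic way to combine LLM symbolic reasoning with RL sequential decision-making reliably.
Abstract: Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.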
[14] When Chain-of-Thought Fails, the Solution Hides in the Hidden States cs.CL | cs.AI | cs.LG
Houman Mehrafarin, Amit Parekh, Ioannis Konstas
TL;DR: This paper uses activation patching to carry out a mechanistic causal analysis of chain-of-thought (CoT) on GSM8K. It finds that a single token's hidden state can encode enough information to recover the correct answer even when the original CoT is wrong, and that task-relevant information is unevenly distributed, concentrating in mid-to-late layers and in language tokens.
Motivation: To determine whether intermediate reasoning in CoT is computationally useful, that is, whether CoT tokens contain task-relevant information rather than being merely explanatory.
Result: On GSM8K, across models, generating after activation patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, showing that individual CoT tokens can encode sufficient information to recover the correct answer.
Insight: CoT encodes recoverable, token-level problem-solving information. This information is more prevalent in correct than in incorrect CoT runs and is unevenly distributed, concentrating in mid-to-late layers and in language tokens such as verbs and entities, whereas mathematical tokens mainly encode answer-proximal content. Complete reasoning chains are not always necessary: shorter patched outputs can exceed the accuracy of a full trace, which sheds light on how reasoning is represented and where it breaks down.
Abstract: Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, patched language tokens such as verbs and entities carry task-solving information that steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content that rarely succeeds. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.
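[15] VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs cs.CL | cs.HC
Yurui Xiang, Xingyi Mao, Rui Sheng, Zixin Chen, Zelin Zang
TL;DR: VeriLLMed is a system for interactive visual debugging of medical large language models. It integrates external biomedical knowledge graphs to audit and debug a model's diagnostic reasoning, helping developers identify clinically unreliable reasoning patterns.
Motivation: Real-world deployment of medical LLMs is hard to debug for three reasons: developers often lack medical expertise, model errors are diverse and difficult to prioritize, and existing debugging practice is instance-centric, which makes recurring error patterns hard to find.
Result: Case studies and expert evaluation show that VeriLLMed helps developers identify clinically implausible reasoning and produce actionable insights for improving medical LLMs.
Insight: The contribution is converting model outputs into comparable reasoning paths, constructing knowledge-graph-grounded reference paths, and systematically identifying three recurring classes of diagnosis errors (relation errors, branch errors, and missing errors), which together form a structured visual analytics framework.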
Abstract: Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
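[16] Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition cs.CL
Rujing Yao, Yufei Shi, Yang Wu, Ang Li, Zhuoren Jiang
TL;DR: This paper proposes Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework for the trade-off between knowledge utility and user privacy when using cloud-hosted large language models (LLMs). It models the query process as a strategic game in which three components work together: a privacy-aware sub-query generator, an adversarial reconstruction attacker, and a trusted local integrator. The goal is to obtain high-quality knowledge from external models while protecting sensitive intent.
Motivation: Cloud-hosted LLMs offer strong reasoning and dynamic knowledge, but submitting raw queries exposes sensitive user intent. Relying only on local models preserves privacy but degrades answer quality because their knowledge is limited. The paper targets this tension.
Result: On two constructed sensitive-domain benchmarks in the biomedical and legal fields, extensive experiments show that GTKA significantly reduces intent leakage compared with state-of-the-art baselines while maintaining high-fidelity answer quality.
Insight: The contribution is formalizing the utility-privacy trade-off as a game-theoretic framework and optimizing the sub-query generation policy through adversarial training, maximizing knowledge acquisition accuracy while minimizing how well the original sensitive intent can be reconstructed. This points toward safer, more trustworthy interaction with LLMs.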
Abstract: Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.
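[17] Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective cs.CL
Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu
TL;DR: This paper revisits decoding strategies for Visual Question Answering (VQA) and argues, from a calibration perspective, that greedy decoding outperforms stochastic sampling. The authors give a theoretical analysis deriving sufficient conditions under which greedy decoding is optimal for tasks such as VQA, where answer distributions are concentrated and uncertainty is mainly epistemic (for example, missing visual evidence). Experiments show that greedy decoding beats stochastic sampling across multiple benchmarks, and that the proposed Greedy Decoding for Reasoning Models performs better still in multimodal reasoning scenarios.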
Details
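Motivation: Multimodal LLMs (MLLMs) often inherit stochastic sampling strategies from LLMs without task-specific justification. VQA is a closed-ended task whose uncertainty stems mainly from missing or ambiguous visual evidence rather than from a legitimate diversity of answers, so stochastic decoding may not be the best choice.
Result: Extensive experiments on multiple VQA benchmarks provide empirical evidence that greedy decoding achieves higher predictive accuracy than stochastic sampling. The proposed Greedy Decoding for Reasoning Models further outperforms both standard greedy decoding and stochastic sampling in multimodal reasoning scenarios.
Insight: The contribution is a theoretical formalization, from the perspective of model calibration, of sufficient conditions for the optimality of greedy decoding, together with an argument for its suitability given the characteristics of VQA (concentrated answer distributions and the type of uncertainty involved). Viewed objectively, the study highlights the importance of task-specific decoding strategies and offers a new angle on inference optimization for MLLMs: greedy decoding can serve as an efficient yet strong default for VQA.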
Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
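The intuition behind the calibration argument can be illustrated with a toy calculation. The sketch below is not the paper's formalization: it assumes a single question with a hypothetical head-heavy answer distribution whose most likely answer is also the correct one, and compares the accuracy of greedy decoding with the expected accuracy of drawing one sample. The function names and the probabilities are illustrative assumptions.

```python
def greedy_accuracy(probs, correct):
    """Accuracy of always returning the highest-probability answer."""
    best = max(probs, key=probs.get)
    return 1.0 if best == correct else 0.0


def sampling_accuracy(probs, correct, temperature=1.0):
    """Expected accuracy when one answer is sampled from the
    temperature-scaled distribution (p ** (1/T), renormalized)."""
    scaled = {a: p ** (1.0 / temperature) for a, p in probs.items()}
    total = sum(scaled.values())
    return scaled.get(correct, 0.0) / total


# Hypothetical head-heavy answer distribution for one VQA question.
probs = {"red": 0.7, "orange": 0.2, "pink": 0.1}

print(greedy_accuracy(probs, "red"))
print(sampling_accuracy(probs, "red", temperature=1.0))
print(sampling_accuracy(probs, "red", temperature=0.5))
```

Under this assumption, sampling is correct only as often as the probability mass on the top answer, while greedy decoding is always correct; lowering the temperature moves sampling toward the greedy result. When the top answer is wrong, greedy decoding fails outright, which is why the calibration conditions in the paper matter.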
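[18] K-SENSE: A Knowledge-Guided Self-Augmented Encoder for Neuro-Semantic Evaluation of Mental Health Conditions on Social Media cs.CL | cs.AI
Vijay Yadav
TL;DR: This paper proposes K-SENSE, a knowledge-guided self-augmented encoder framework for early detection of mental health conditions (such as stress and depression) from social media text. Through a three-stage encoding pipeline, the framework combines external commonsense reasoning with internal representation robustness and outperforms existing methods on stress and depression detection.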
Details
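Motivation: Existing approaches either use external commonsense knowledge to model mental states explicitly or apply self-augmentation and contrastive training to improve generalization, but rarely combine the two in a unified framework. The paper targets the challenges that figurative language, implicit emotional expression, and high noise pose for detecting mental health conditions in social media text.
Result: On Dreaddit (stress detection) and Depression_Mixed (depression detection), K-SENSE reaches mean F1-scores of 86.1% and 94.3%, improvements of about 2.6 and 1.5 percentage points over the strongest prior baselines. Ablation experiments confirm the contribution of each architectural component, including the temporal knowledge integration strategy and the choice to freeze the knowledge encoder during fine-tuning.
Insight: The main contribution is a unified framework that combines external psychological reasoning knowledge with internal representation robustness. Its core is a three-stage encoding pipeline: extracting commonsense knowledge across five mental state dimensions, constructing a semantic anchor that fuses hidden representations from two encoding streams, and using a supervised contrastive learning objective to align same-class representations and suppress irrelevant knowledge noise. Viewed objectively, the knowledge integration strategy and the contrastive objective are useful references for handling noise and implicit expression in social media text.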
Abstract: Early detection of mental health conditions, particularly stress and depression, from social media text remains a challenging open problem in computational psychiatry and natural language processing. Automated systems must contend with figurative language, implicit emotional expression, and the high noise inherent in user-generated content. Existing approaches either leverage external commonsense knowledge to model mental states explicitly, or apply self-augmentation and contrastive training to improve generalization, but seldom do both in a principled, unified framework. We propose K-SENSE (Knowledge-guided Self-augmented Encoder for Neuro-Semantic Evaluation of Mental Health), a framework that jointly exploits external psychological reasoning and internal representation robustness. K-SENSE adopts a three-stage encoding pipeline: (1) inferential commonsense knowledge is extracted from the COMET model across five mental state dimensions; (2) a semantic anchor is constructed by combining hidden representations from two parallel encoding streams, projected into a shared space before fusion; and (3) a supervised contrastive learning objective aligns same-class representations while encouraging the attention mechanism to suppress irrelevant knowledge noise. We evaluate K-SENSE on Dreaddit (stress detection) and Depression_Mixed (depression detection), achieving mean F1-scores of 86.1 (0.6%) and 94.3 (0.8%), respectively, over five independent runs. These represent improvements of approximately 2.6 and 1.5 percentage points over the strongest prior baselines. Ablation experiments confirm the contribution of each architectural component, including the temporal knowledge integration strategy and the choice to keep the knowledge encoder frozen during fine-tuning.
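[19] RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization cs.CL | cs.LG
Dongxin Guo, Jikun Wu, Siu Ming Yiu
TL;DR: RouteNLP is a closed-loop routing framework for reducing LLM inference cost. It assigns queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints, and integrates three components: a difficulty-aware router, confidence-calibrated cascading based on conformal prediction, and a distillation-routing co-optimization loop.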
Details
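Motivation: Enterprises face high inference costs when deploying LLMs even though a large share of queries are routine tasks that smaller models could handle; existing systems do not make effective use of a tiered model portfolio to reduce cost.
Result: In an 8-week pilot at an enterprise customer-service division handling about 5,000 queries per day, RouteNLP cut inference costs by 58% while maintaining a 91% response acceptance rate and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, it achieved 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks; human evaluation confirmed that 74.5% of routed generation outputs match or exceed frontier-model quality.
Insight: The contribution is the integration of difficulty-aware routing, conformal-prediction-based confidence-calibrated cascading, and a distillation-routing co-optimization loop into a single closed-loop framework. In particular, clustering escalation failures to drive targeted knowledge distillation, followed by automatic router retraining, yields more than twice the cost improvement of untargeted distillation.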
Abstract: Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
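The abstract does not spell out how conformal prediction initializes the cascade threshold. The sketch below shows one common option, split conformal calibration, under stated assumptions: the nonconformity score is `1 - confidence`, the calibration confidences are hypothetical values for held-out queries that the small model answered acceptably, and the names `conformal_quantile` and `route` are illustrative rather than RouteNLP's API.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th
    smallest calibration score (infinite if that rank exceeds n)."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]


def route(confidence, q_hat):
    """Keep the query on the small model if its nonconformity score is
    within the calibrated threshold; otherwise escalate."""
    return "small" if (1.0 - confidence) <= q_hat else "escalate"


# Hypothetical small-model confidences on held-out queries it answered acceptably.
calibration_conf = [0.95, 0.90, 0.85, 0.80, 0.99, 0.70, 0.92, 0.88, 0.97, 0.60]
scores = [1.0 - c for c in calibration_conf]
q_hat = conformal_quantile(scores, alpha=0.2)

print(q_hat)
print(route(0.90, q_hat))
print(route(0.50, q_hat))
```

If calibration and deployment queries are exchangeable, a new query that the small model would answer acceptably stays on the small model with probability at least `1 - alpha`, without any assumption about the score distribution. That distribution-free property is what makes such a threshold a reasonable starting point before closed-loop retraining adjusts it.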
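[20] ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection cs.CL | cs.IR | cs.LG
Dongxin Guo, Jikun Wu, Siu Ming Yiu
TL;DR: This paper presents ComplianceNLP, an end-to-end automated compliance monitoring system that tracks regulatory changes, extracts structured obligations, and identifies compliance gaps between institutional policies and regulations. The system integrates a knowledge-graph-augmented RAG pipeline, multi-task obligation extraction, and compliance gap analysis with severity-aware scoring.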
Details
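Motivation: Financial institutions must process a very large volume of regulatory events every year, which overwhelms manual compliance teams, and the industry has paid enormous fines since the 2008 financial crisis. An automated solution for monitoring regulatory changes and detecting compliance gaps is therefore needed.
Result: On the authors' benchmark, ComplianceNLP achieves 87.7 F1 on gap detection, 3.5 F1 points above GPT-4o+RAG, with 94.2% grounding accuracy and 83.4 F1 under end-to-end error propagation. In a parallel-run deployment at a financial institution it processed 9,847 updates, achieving 96.0% estimated recall and 90.7% precision, with a sustained 3.1x gain in analyst efficiency.
Insight: The main contributions are: 1) knowledge-graph-augmented RAG, where structured regulatory knowledge (such as cross-references) contributes the largest performance gain (+4.6 F1); 2) domain-specific knowledge distillation combined with Medusa speculative decoding, giving a 2.8x inference speedup; and 3) exploiting the low entropy of regulatory text, which yields a 91.3% draft-token acceptance rate. The deployment lessons on trust calibration, GRC integration, and distributional shift monitoring are also useful references.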
Abstract: Financial institutions must track over 60,000 regulatory events annually, overwhelming manual compliance teams; the industry has paid over USD 300 billion in fines and settlements since the 2008 financial crisis. We present ComplianceNLP, an end-to-end system that automatically monitors regulatory changes, extracts structured obligations, and identifies compliance gaps against institutional policies. The system integrates three components: (1) a knowledge-graph-augmented RAG pipeline grounding generations in a regulatory knowledge graph of 12,847 provisions across SEC, MiFID II, and Basel III; (2) multi-task obligation extraction combining NER, deontic classification, and cross-reference resolution over a shared LEGAL-BERT encoder; and (3) compliance gap analysis that maps obligations to internal policies with severity-aware scoring. On our benchmark, ComplianceNLP achieves 87.7 F1 on gap detection, outperforming GPT-4o+RAG by +3.5 F1, with 94.2% grounding accuracy ($r=0.83$ vs. human judgments) and 83.4 F1 under realistic end-to-end error propagation. Ablations show that knowledge-graph re-ranking contributes the largest marginal gain (+4.6 F1), confirming that structural regulatory knowledge is critical for cross-reference-heavy tasks. Domain-specific knowledge distillation (70B $\to$ 8B) combined with Medusa speculative decoding yields $2.8\times$ inference speedup; regulatory text’s low entropy ($H=2.31$ bits vs. $3.87$ general text) produces 91.3% draft-token acceptance rates. In four months of parallel-run deployment processing 9,847 updates at a financial institution, the system achieved 96.0% estimated recall and 90.7% precision, with a $3.1\times$ sustained analyst efficiency gain. We report deployment lessons on trust calibration, GRC integration, and distributional shift monitoring for regulated-domain NLP.
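[21] GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs cs.CL
Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You
TL;DR: This paper proposes GraphPlanner, a heterogeneous graph memory-augmented agentic routing framework for multi-agent LLMs. It models workflow generation as a Markov Decision Process, uses a heterogeneous graph called GARNet to integrate interaction memories among queries, agents, and responses, and is optimized with reinforcement learning to improve both task performance and computational efficiency.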
Details
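Motivation: Existing LLM routing methods focus mainly on single-round, non-agentic settings and struggle with realistic, complex applications that require task planning, multi-round cooperation among heterogeneous agents, and memory utilization. The paper aims to fill this gap by extending routing to agentic LLM settings.
Result: Evaluated on 14 diverse LLM tasks, GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB. It shows strong zero-shot generalization to unseen tasks and LLMs and makes effective use of historical memory to support both inductive and transductive inference.
Insight: The core contribution is to formalize workflow generation for multi-agent routing as an MDP and to introduce a heterogeneous graph (GARNet) that structures and exploits the interaction history among agents as memory, enabling joint optimization of performance and efficiency. Its flexible architecture, which supports both inductive and transductive inference, is a useful design reference.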
Abstract: LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.
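[22] Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge cs.CL | cs.AI | cs.CV
Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen
TL;DR: This paper proposes Agri-CPJ, a training-free, explainable framework for agricultural pest and disease diagnosis built on large language models and vision-language models. The framework generates a structured morphological caption, refines it iteratively through multi-dimensional quality gating, produces complementary candidate answers, and uses an LLM as a judge to select among them, improving both diagnostic accuracy and explainability.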
Details
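Motivation: In agricultural pest and disease diagnosis, models that perform well on benchmarks frequently hallucinate species names, and even when predictions are correct, the reasoning behind them is opaque and hard for practitioners to understand.
Result: On CDDMBench, pairing GPT-5-Nano with captions generated by GPT-5-mini improves disease classification accuracy by 22.7 percentage points and the QA score by 19.5 points over no-caption baselines. On AgMMU-MCQs, GPT-5-Nano reaches 77.84% and Qwen-VL-Chat reaches 64.54%, matching or exceeding most open-source models of comparable scale despite the format shift from open-ended to multiple-choice questions.
Insight: The main contribution is a training-free, prompt-engineering-based diagnostic framework in which structured caption generation with iterative refinement (quality gating) substantially improves downstream performance and an LLM judge selects the final answer. The structured caption and the judge's rationale together provide a readable audit trail for diagnostic decisions, strengthening explainability and trustworthiness.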
Abstract: Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available at https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
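[23] Multimodal QUD: Inquisitive Questions from Scientific Figures cs.CL
Yating Wu, William Rudman, Venkata S Govindarajan, Alexandros G. Dimakis, Junyi Jessy Li
TL;DR: This paper proposes a multimodal Questions Under Discussion (MQUD) framework that extends the text-only QUD theory to multimodal settings in order to generate inquisitive questions about the figures and text of scientific papers. The authors build the MQUD dataset and fine-tune a vision-language model (VLM) to improve its ability to generate high-quality, visually grounded multimodal questions.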
Details
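Motivation: Existing evaluations of VLMs on scientific figure understanding are limited to information-extraction questions, which require little deep reasoning, ignore the context in which a figure appears, and do not reflect the authors' communicative intent. The paper addresses this by generating inquisitive questions conditioned on both the figure and the paper's context, mirroring the in-depth questions humans ask while reading.
Result: Fine-tuning a VLM on MQUD shifts the model from generating generic, low-level visual questions to content-specific, grounded questions that require high-level multimodal reasoning, yielding higher-quality and more visually grounded multimodal QUD generation.
Insight: The contribution is to extend QUD theory from text to a multimodal setting (figures plus text) and to build the MQUD dataset, annotated by the original authors, to encourage models to generate inquisitive questions that require deep reasoning. This offers a new evaluation direction and a new route to model improvement for multimodal understanding.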
Abstract: Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both the figure and the paper’s context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from being text-only to multimodal, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.
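[24] LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models cs.CL
Tianchun Li, Haochen Liu, Vishwa Pardeshi, Xingchen Wang, Tianci Liu
TL;DR: This paper proposes LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines legal reasoning trajectories from a large teacher model and then uses self-reflective verification to adaptively select the most effective training data for a small language model (SLM). The approach substantially improves SLM performance on legal reasoning tasks without costly human annotation.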
Details
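Motivation: Because of their limited capacity, SLMs struggle with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. In addition, the high-quality, concise reasoning trajectories needed to train SLMs are expensive to collect, and standard rejection sampling lacks fine-grained information.
Result: Extensive experiments on several legal benchmarks show that LegalDrill significantly strengthens the legal reasoning of representative SLMs (such as Llama-2-7B), for example reaching a level comparable to GPT-3.5 on CaseHOLD and surpassing baseline methods on LegalBench reasoning tasks.
Insight: The contribution is a diagnosis-driven synthesis framework that extracts reasoning trajectories from a teacher model through fine-grained prompting and refines them iteratively, then uses a self-reflective verification mechanism to adaptively select high-quality data. This supplies scalable, high-quality synthetic data for SLM training (both supervised fine-tuning and direct preference optimization) and avoids the bottleneck of scarce expert annotation.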
Abstract: Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity leaves them struggling with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to collect manually and difficult to curate via standard rejection sampling, which lacks granularity beyond final verdicts. To address these challenges, we propose LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, and then employs self-reflective verification to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.
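[25] One Size Fits None: Heuristic Collapse in LLM Investment Advice cs.CL | cs.LG
Jillian Ross, Andrew W. Lo
TL;DR: This paper studies whether frontier LLMs actually integrate a user's full context when giving investment advice or instead exhibit heuristic collapse, a systematic reduction of complex, multi-factor decisions to a few dominant inputs. It finds that LLM allocation decisions are driven mainly by self-reported risk tolerance while other relevant factors contribute little, and that web search only partially mitigates the problem without resolving it.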
Details
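Motivation: The goal is to examine whether LLMs deployed as advisors in high-stakes domains integrate a user's full context for individualized reasoning or rely only on surface features, particularly in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances.
Result: Analyzing LLM outputs with interpretable surrogate models reveals systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, with other relevant factors contributing minimally. Web search partially attenuates the collapse but does not resolve it.
Insight: The paper's claimed contribution is to identify and define "heuristic collapse" in LLMs' complex decision-making and to show that neither web search augmentation nor model scale alone resolves it. Viewed objectively, the core insight is that deploying LLMs as advisors requires auditing their input sensitivity, not just their output quality, which has important implications for evaluating and designing AI systems in high-stakes domains.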
Abstract: Large language models are increasingly deployed as advisors in high-stakes domains – answering medical questions, interpreting legal documents, recommending financial products – where good advice requires integrating a user’s full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client’s full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.
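[26] Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows cs.CL | cs.SE
Nikita Borovkov, Elisei Rykov, Olga Tsymboi, Sergei Filimonov, Nikita Surnachev
TL;DR: The paper presents a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management platform. Using supervision already generated at scale (structured UI interaction traces and low-overhead copilot feedback), a staged deployment pipeline trains a next-UI-action policy and learns a critic from copilot feedback to calibrate abstention, so that only high-confidence steps run in the background and uncertain decisions are deferred to operators. In production, the system automated 45% of sessions and reduced average handling time by 39% without degrading support quality.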
Details
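Motivation: The aim is scalable, selective automation of enterprise customer support workflows that reduces manual operator workload and improves efficiency.
Result: In production, the system automated 45% of sessions and reduced average handling time by 39% while maintaining the level of support quality.
Insight: The contribution is to train an automation policy quickly from supervision that already exists at scale (UI interaction traces and copilot feedback) and to achieve selective automation through a learned critic that executes only high-confidence steps. This lets one operator supervise multiple concurrent sessions and intervene only when the system is uncertain. It is a practical deployment approach that learns from human feedback and calibrates abstention.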
Abstract: We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.
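The selective-execution rule described in the abstract can be sketched as a simple gate on the critic's score. This is a minimal illustration under stated assumptions, not the deployed system: the `Proposal` type, the `decide` function, the action names, and the fixed threshold of 0.9 are all hypothetical, and the abstract does not say how the critic's output is calibrated or thresholded.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    action: str          # next UI action proposed by the policy
    critic_score: float  # critic's estimate that an operator would accept it


def decide(proposal, threshold=0.9):
    """Execute in the background only when the critic is confident;
    otherwise hand the step to a human operator."""
    if proposal.critic_score >= threshold:
        return ("execute", proposal.action)
    return ("defer_to_operator", proposal.action)


print(decide(Proposal("fill_order_id_field", 0.97)))
print(decide(Proposal("issue_refund", 0.55)))
```

Raising the threshold trades automation rate for safety: fewer steps run unattended, but fewer wrong actions are executed. A real system would presumably tune this per process against the quality targets it must hold.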
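[27] Knowledge Vector of Logical Reasoning in Large Language Models cs.CL
Zixuan Wang, Yuanyuan Lei
TL;DR: This paper studies how LLMs represent knowledge for three forms of logical reasoning (deductive, inductive, and abductive) and finds that each can be captured as a reasoning-specific knowledge vector in a linear space, with the vectors largely independent of one another. Inspired by cognitive science, the authors propose a complementary subspace-constrained refinement framework that uses a complementary loss and a subspace constraint loss to increase knowledge complementarity among the reasoning vectors. Experiments show that the refined vectors improve reasoning performance, and an interpretability analysis reveals features shared across and specific to each reasoning type.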
Details
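Motivation: The goal is to explore how LLMs represent knowledge for logical reasoning (deductive, inductive, abductive) and how these representations relate to one another. Cognitive science theory holds that these forms of reasoning interact closely in the human brain, and the authors observe that one type of reasoning can benefit from the reasoning chain produced by another, which motivates refining the representations in LLMs to encourage complementarity.
Result: Steering experiments along the reasoning vectors show that refined vectors incorporating complementary knowledge yield consistent performance gains. The interpretability analysis reveals shared and type-specific features of different reasoning forms in LLMs. The summary reports no specific benchmarks or state-of-the-art comparisons.
Insight: The contributions are formalizing logical reasoning as linear knowledge vectors and proposing a complementary subspace-constrained refinement framework whose loss design lets reasoning vectors share complementary knowledge while preserving their distinctive characteristics. Viewed objectively, the method offers a new route to controllable refinement of reasoning representations in LLMs and helps in understanding the internal representation of reasoning mechanisms.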
Abstract: Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning in LLMs.
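[28] Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity cs.CL | cs.AI | cs.SC
Yao Wang, Zixu Geng, Jun Yan
TL;DR: This paper proposes the Quantum Knowledge Graph (QKG), a framework for modeling the context-dependent validity of triplet relations in knowledge graphs. It defines the validity of each triplet as a function of context and instantiates and evaluates the idea in medicine on a diabetes-centered PrimeKG subgraph. Experiments show that, in knowledge-graph-grounded medical question answering, QKG validation with context matching improves the reasoning performance of LLMs such as Haiku-4.5 and Qwen-3.6-Plus.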
Details
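Motivation: Standard knowledge graphs treat every relation as globally valid, but in many real settings (such as clinical reasoning) whether a relation should count as evidence depends on the specific context. The paper addresses this context dependence of triplet validity.
Result: The evaluation uses the KG-grounded subset of MedReason (2,788 questions). With Haiku-4.5 as both reasoner and validator, QKG with context matching improves on the no-validator baseline by 1.40 percentage points (pp) and outperforms KG validation without context matching (+0.79 pp). With the stronger validator Qwen-3.6-Plus, the QKG gain over the baseline grows from +1.40 pp to +5.96 pp.
Insight: The contribution is to formalize triplet validity as a function of context and to propose the QKG framework, which explicitly models the conditions under which a relation applies. The core insight is that the value of a knowledge graph in LLM-based clinical reasoning lies not only in storing medical facts but in representing whether those facts apply to a specific patient context.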
Abstract: Knowledge graphs (KGs) are increasingly used to support large language model (LLM) reasoning, but standard triplet-based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet-specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes-centered PrimeKG subgraph, whose 68,651 context-sensitive relations are further annotated with patient-group-specific constraints. We evaluate it in a reasoner–validator pipeline for medical question answering on a KG-grounded subset of MedReason containing 2,788 questions. With Haiku-4.5 as both the Reasoner and the Validator, KG-backed validation significantly improves over a no-validator baseline ($+0.61$ pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching ($+0.79$ pp) and the no-validator baseline ($+1.40$ pp; paired McNemar, all $p<0.05$). Under a stronger validator (Qwen-3.6-Plus), the raw QKG gain over the no-validator baseline grows from $+1.40$ pp to $+5.96$ pp; the context-matching gap is non-significant ($p=0.73$) on the raw set but becomes borderline significant ($p=0.05$) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark-gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM-based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code (https://github.com/HKAI-Sci/QKG).
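The idea of triplet validity as a triplet-specific function of context can be sketched as a small data structure. The example below is an assumption-laden illustration, not the released QKG code: the `ContextualTriplet` type, the `usable_evidence` helper, the placeholder entities `drug_X` and `condition_Y`, and the age and pregnancy constraint are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]


@dataclass(frozen=True)
class ContextualTriplet:
    head: str
    relation: str
    tail: str
    is_valid: Callable[[Context], bool]  # validity depends on the context


def usable_evidence(triplets: List[ContextualTriplet], context: Context):
    """Keep only triplets whose validity function accepts this context."""
    return [(t.head, t.relation, t.tail) for t in triplets if t.is_valid(context)]


# Hypothetical triplets with placeholder entities and constraints.
triplets = [
    ContextualTriplet(
        "drug_X", "indicated_for", "condition_Y",
        is_valid=lambda ctx: ctx["age"] >= 18 and not ctx["pregnant"],
    ),
    ContextualTriplet(
        "condition_Y", "associated_with", "symptom_Z",
        is_valid=lambda ctx: True,  # behaves like an ordinary, globally valid triplet
    ),
]

adult = {"age": 45, "pregnant": False}
child = {"age": 12, "pregnant": False}
print(usable_evidence(triplets, adult))
print(usable_evidence(triplets, child))
```

A standard KG corresponds to the special case where every validity function returns `True`. A validator in a reasoner–validator pipeline could then check a reasoning chain only against the triplets that pass the filter for the patient at hand.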
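[29] EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce cs.CL | cs.AI | cs.DB | cs.LG | cs.MA
Minhyeong Yu, Wonduk Seo
TL;DR: This paper proposes EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. It distills high-cost agentic reasoning into a trainable in-house model, optimized with parameter-efficient fine-tuning followed by reinforcement learning with an agent-based reward. The aim is to overcome the reliance of existing LLM-based and multi-agent methods on expensive external APIs and complex retrieval, which makes them hard to deploy at scale in privacy-sensitive enterprise settings.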
Details
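Motivation: In product mapping, sellers add promotional keywords, platform-specific tags, and bundle descriptions to titles, so the same product appears under many different names. Existing LLM-based and multi-agent methods improve robustness and interpretability but depend on expensive external APIs and complex inference pipelines, making low-cost, large-scale deployment difficult in privacy-conscious enterprise environments.
Result: Preliminary results show that EPM-RL consistently outperforms training with parameter-efficient fine-tuning alone and offers a better quality-cost trade-off than commercial API-based baselines, while supporting private deployment and lower operating cost.
Insight: The contribution is to use reinforcement learning to turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, directly deployable in-house system. Concretely, it combines LLM-generated rationales, human verification, parameter-efficient fine-tuning, and reinforcement learning with an agent-based reward that evaluates output-format compliance, label correctness, and reasoning-preference scores from specially designed judge models.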
Abstract: Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, and reasoning–preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality–cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.
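A reward that jointly scores format compliance, label correctness, and a judge's preference could look roughly like the sketch below. Everything specific here is an assumption for illustration: the abstract does not give the output schema, the weights, or how the three signals are combined, so the JSON fields `label` and `rationale`, the weights `(0.2, 0.5, 0.3)`, and the choice to ignore the judge score for malformed outputs are hypothetical.

```python
import json


def composite_reward(output_text, gold_label, judge_score, weights=(0.2, 0.5, 0.3)):
    """Weighted sum of format compliance, label correctness, and a judge
    score in [0, 1]. Malformed outputs earn nothing (an assumed design choice)."""
    w_format, w_label, w_judge = weights
    try:
        parsed = json.loads(output_text)
        format_ok = isinstance(parsed, dict) and {"label", "rationale"} <= parsed.keys()
    except json.JSONDecodeError:
        parsed, format_ok = None, False
    label_ok = format_ok and parsed["label"] == gold_label
    return (
        w_format * float(format_ok)
        + w_label * float(label_ok)
        + w_judge * judge_score * float(format_ok)
    )


good = '{"label": "match", "rationale": "same model number, different bundle text"}'
wrong = '{"label": "no_match", "rationale": "titles differ"}'
print(composite_reward(good, "match", judge_score=1.0))
print(composite_reward(wrong, "match", judge_score=0.5))
print(composite_reward("not json", "match", judge_score=1.0))
```

Gating the judge term on format compliance keeps the policy from collecting rationale credit for outputs that downstream systems cannot parse; other combinations, such as multiplicative rewards, are equally plausible.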
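[30] Stabilizing Efficient Reasoning with Step-Level Advantage Selection cs.CL | cs.LG
Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun
TL;DR: This paper proposes Step-level Advantage Selection (SAS), a method for stabilizing efficient reasoning in large language models (LLMs). The authors find that post-training with a short context alone (for example, standard GRPO) compresses reasoning length but causes unstable training and accuracy degradation. SAS assigns advantages at the reasoning-step level, selectively keeping high-confidence steps in correct rollouts, and achieves a better accuracy-efficiency trade-off on several mathematical and general reasoning benchmarks.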
Details
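Motivation: LLMs incur high computational overhead at inference time because they generate long, redundant reasoning traces. Existing efficient-reasoning methods (such as length-based rewards or pruning) are usually post-trained with a context window much shorter than the one used in base-model training. The effect of this short-context training has not been studied systematically, and it may cause unstable training and accuracy degradation.
Result: Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, giving a better accuracy-efficiency trade-off.
Insight: The contributions are the finding that short-context post-training by itself induces reasoning compression at the cost of instability, and the SAS method, which stabilizes training through step-level confidence assessment and avoids wrongly penalizing steps when failures are caused by truncation or verifier issues, balancing reasoning accuracy and efficiency more effectively.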
Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression, but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.
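The selection rule stated in the abstract can be written down directly. The sketch below is a minimal reading of that rule under stated assumptions: each reasoning step has a scalar confidence, a fixed threshold separates low from high confidence, and the rollout-level advantage (as in GRPO) is broadcast to the steps that are kept. The function name, the threshold of 0.5, and the confidence values are hypothetical; the abstract does not say how confidence is measured.

```python
def select_step_advantages(step_confidences, rollout_advantage, verifier_passed,
                           threshold=0.5):
    """Zero the advantage for low-confidence steps in correct rollouts and for
    high-confidence steps in verifier-failed rollouts; keep it elsewhere."""
    selected = []
    for confidence in step_confidences:
        high = confidence >= threshold
        if verifier_passed and not high:
            selected.append(0.0)   # do not reinforce an uncertain step
        elif not verifier_passed and high:
            selected.append(0.0)   # do not penalize a confident step
        else:
            selected.append(rollout_advantage)
    return selected


# Correct rollout with positive advantage: the low-confidence step is masked.
print(select_step_advantages([0.9, 0.2, 0.8], 1.0, verifier_passed=True))
# Failed rollout with negative advantage: the high-confidence step is masked.
print(select_step_advantages([0.9, 0.2], -1.0, verifier_passed=False))
```

Masking high-confidence steps in failed rollouts reflects the abstract's observation that such failures often come from truncation or verifier issues rather than from incorrect reasoning, so penalizing those steps would inject noise into the gradient.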
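[31] From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL | cs.AI
Qiliang Liang, Hansi Wang, Zhong Liang, Yang Liu
TL;DR: This paper proposes the Scheduling-Structural-Logical (SSL) representation, an explicit structured representation for LLM agent skills. It disentangles skill knowledge into three levels (scheduling signals, execution structure, and logical evidence) to address the problem that current skill representations rely on textual descriptions that machines cannot use directly. SSL is implemented with an LLM-based normalizer and validated on Skill Discovery and Risk Assessment, where it clearly outperforms text-only baselines.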
Details
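Motivation: Skills in current agent systems (such as SKILL.md documents) rely mainly on textual descriptions, so information about invocation interfaces, execution structure, and side effects is entangled in natural language and hard for machines to acquire and use efficiently. An explicit structured representation is needed to make skills more manageable and usable.
Result: In Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. Both results clearly exceed the text-only baselines.
Insight: Drawing on classical theories of linguistic knowledge representation, the paper proposes what the authors describe as the first explicit structured representation for agent skills that disentangles the scheduling, structural, and logical levels (SSL). It makes skills easier to retrieve and review and offers a practical direction toward more inspectable, reusable, and actionable skill representations, rather than a finished standard or an end-to-end mechanism.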
Abstract: LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agents both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson’s classical work on linguistic knowledge representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill-level scheduling signals, scene-level execution structure, and logic-level action and resource-use evidence: the Scheduling-Structural-Logical (SSL) representation. We instantiate SSL with an LLM-based normalizer and evaluate it on a corpus of skills in two tasks, Skill Discovery and Risk Assessment, where it outperforms the text-only baselines: in Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. These findings reveal that explicit, source-grounded structure makes agent skills easier to search and review. They also suggest that SSL is best understood as a practical step toward more inspectable, reusable, and operationally actionable skill representations for agent systems, rather than as a finished standard or an end-to-end mechanism for managing and using skills.
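For readers unfamiliar with the Skill Discovery metric, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the correct item first appears. The sketch below is a generic implementation with hypothetical queries and skill names; it assumes one relevant skill per query and is not the paper's evaluation code.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1 / rank of the relevant item per query (0 if it is absent)."""
    total = 0.0
    for query, ranked in rankings.items():
        target = relevant[query]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


# Hypothetical retrieval results: ranked skill names per query.
rankings = {
    "resize an image": ["image_resize", "pdf_merge", "csv_clean"],
    "merge two PDFs": ["csv_clean", "pdf_merge", "image_resize"],
    "clean a CSV file": ["image_resize", "pdf_merge", "csv_clean"],
}
relevant = {
    "resize an image": "image_resize",
    "merge two PDFs": "pdf_merge",
    "clean a CSV file": "csv_clean",
}

print(mean_reciprocal_rank(rankings, relevant))
```

Here the relevant skill appears at ranks 1, 2, and 3, so the score is (1 + 1/2 + 1/3) / 3. On this scale, the reported move from 0.573 to 0.707 means the correct skill sits noticeably closer to the top of the ranked list on average.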
[32] The Pragmatic Persona: Discovering LLM Persona through Bridging Inference cs.CL | cs.AIPDF
Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim
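TL;DR: This paper proposes a novel analytical framework based on bridging inference for discovering the inherent personas that large language models (LLMs) exhibit in dialogue. The method models the implicit conceptual relations between utterances (bridging inferences) as structured knowledge graphs, so that stable LLM personas are captured and identified at the level of discourse coherence rather than from surface lexical or stylistic cues.
Details
Motivation: Most existing persona discovery methods rely on surface-level lexical or stylistic cues and treat dialogue as a flat token sequence, so they fail to capture the deeper discourse structures that sustain persona consistency. This paper aims to address that limitation.
Result: Experiments across multiple reasoning backbones and target LLMs (from small-scale models to 80B-parameter systems) show that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines.
Insight: The novelty lies in bringing the concept of bridging inference from Cognitive Discourse Theory into LLM persona analysis. By building graphs of the implicit semantic relations between utterances, the work shows that persona traits are encoded more stably in the structural organization of discourse than in isolated lexical patterns. This provides a new framework for systematically probing, extracting, and visualizing latent LLM personas from the perspectives of computational linguistics, cognitive semantics, and persona reasoning.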
Abstract: Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference – implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo-Yang/Persona_Bridging.git
[33] IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning cs.CLPDF
Navya Gupta, Rishitej Reddy Vyalla, Avinash Anand, Chhavi Kirtani, Erik Cambria
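TL;DR: This paper proposes the IRIS framework, which combines progressive supervised fine-tuning with reverse curriculum reinforcement learning to improve cross-lingual mathematical reasoning, and releases the CL-Math dataset covering English, Hindi, and Marathi.
Details
Motivation: Curriculum learning often fails to produce consistent step-by-step reasoning in cross-lingual mathematical reasoning, especially in multilingual and low-resource settings where transfer from English to Indian languages remains limited.
Result: Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks, substantial gains in low-resource and bilingual settings, and modest improvements in high-resource languages.
Insight: The contributions include a two-axis curriculum framework, a composite reward that combines correctness, step-wise alignment, continuity, and numeric incentives, and optimization with GRPO, which together improve the robustness and efficiency of cross-lingual reasoning.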
Abstract: Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.
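To make the reward and optimizer description above more concrete, the following is a minimal sketch of a weighted composite reward feeding the group-relative advantage normalization commonly used in GRPO. The weights, the component scores, and the use of the population standard deviation are assumptions for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import statistics


def composite_reward(correctness, step_alignment, continuity, numeric,
                     weights=(1.0, 0.5, 0.25, 0.25)):
    """Weighted sum of the four reward components named in the abstract.

    The weights are hypothetical placeholders, not values from the paper.
    """
    w_c, w_s, w_k, w_n = weights
    return w_c * correctness + w_s * step_alignment + w_k * continuity + w_n * numeric


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled solutions to the same problem, scored on each component in [0, 1].
group = [
    composite_reward(1.0, 1.0, 1.0, 1.0),
    composite_reward(0.0, 0.5, 1.0, 0.0),
    composite_reward(1.0, 0.5, 0.5, 1.0),
    composite_reward(0.0, 0.0, 0.0, 0.0),
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, a response is reinforced only to the extent that it beats the other samples for the same problem, which removes the need for a separate value model.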
[34] AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models cs.CL | cs.AIPDF
Yimin Deng, Yejing Wang, Zhenxi Lin, Zichuan Fu, Guoshuai Zhao
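TL;DR: This paper proposes AdapTime, a method for improving the ability of large language models to reason over temporal information. It executes reasoning steps dynamically using three actions (reformulate, rewrite, and review), with an LLM planner guiding the process so that it adapts to different types of temporal questions.
Details
Motivation: Existing approaches to temporal information typically rely on external tools or manual verification and are designed for specific scenarios, so they generalize poorly. Fixed pipelines also cannot adapt to the reasoning strategies that different types of temporal questions require, which leads to redundant processing for simple questions and insufficient reasoning for complex ones.
Result: Extensive experiments demonstrate the effectiveness of AdapTime: it integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning without relying on external support.
Insight: The novelty is an adaptive temporal reasoning framework that dynamically plans reasoning actions (reformulate, rewrite, review) according to question complexity, avoiding the limitations of a fixed pipeline and improving generalization and efficiency.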
Abstract: Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.
[35] MemeScouts@LT-EDI 2026: Asking the Right Questions – Prompted Weak Supervision for Meme Hate Speech Detection cs.CL | cs.AIPDF
Ivo Bueno, Lea Hirlimann, Enkelejda Kasneci
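TL;DR: This paper proposes a prompted weak supervision approach for detecting hate speech in memes, specifically homophobia and transphobia in multilingual settings. It decomposes the complex task of meme understanding into a set of targeted, question-based labeling functions and uses a quantized Qwen3-VLM to extract features by answering those questions, outperforming direct VLM classification.
Details
Motivation: Hate speech detection in memes is highly challenging because of their multimodal nature and subtle cues such as sarcasm and cultural context. This is especially true in multilingual settings, where end-to-end prompting is brittle because a single prediction must resolve target, stance, implicitness, and irony.
Result: In the LT-EDI 2026 shared task, the method ranked 1st in English, 2nd in Chinese, and 3rd in Hindi for homophobia and transphobia detection, and outperformed direct VLM classification, with substantial gains for Chinese and Hindi.
Insight: The core idea is to decompose a complex multimodal classification task into a set of targeted questions with constrained answer options, use prompted weak supervision to build labeling functions and extract features, and then refine them iteratively through error-driven labeling-function expansion and feature pruning. This improves interpretability and generalization in multilingual settings.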
Abstract: Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.
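The decomposition described above can be sketched as follows: each labeling function is a targeted question with a fixed answer set, and the answers are one-hot encoded into a feature vector for a downstream classifier. The question names, answer options, and the `stub_vlm` function are all hypothetical placeholders; the abstract does not list the actual questions or say which classifier consumes the features.

```python
# Hypothetical question-based labeling functions with constrained answer options.
QUESTIONS = {
    "targets_identity_group": ["yes", "no", "unclear"],
    "tone": ["mocking", "neutral", "supportive"],
}


def stub_vlm(question, options, meme):
    """Placeholder for a real VLM call; it always picks the first option.

    A real system would prompt the model with the meme image and text and
    force the answer into one of `options`.
    """
    return options[0]


def encode_answers(answers):
    """One-hot encode each constrained answer into a flat feature vector."""
    features = []
    for question, options in QUESTIONS.items():
        answer = answers.get(question)
        features.extend(1 if answer == option else 0 for option in options)
    return features


answers = {q: stub_vlm(q, opts, meme=None) for q, opts in QUESTIONS.items()}
features = encode_answers(answers)
```

Constraining the answers keeps every feature interpretable, which is what makes the error-driven expansion and pruning steps practical: a misclassified meme can be traced back to the specific question whose answer misled the classifier.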
[36] MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning cs.CL | cs.AIPDF
Yimin Deng, Zhenxi Lin, Yejing Wang, Guoshuai Zhao, Pengyue Jia
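TL;DR: This paper proposes MultiDx, a two-stage multi-source knowledge integration framework for diagnostic reasoning. Drawing on knowledge from web search, SOAP-formatted cases, and a clinical case database, it first generates suspected diagnoses and reasoning paths, then integrates multi-perspective evidence through matching, voting, and differential diagnosis to produce the final prediction.
Details
Motivation: Large language models perform poorly at diagnostic reasoning because of limited domain knowledge. Existing methods rely on internal knowledge or static knowledge bases, which leads to insufficient knowledge and limited adaptability, and they overlook alignment with standard clinical reasoning trajectories.
Result: Extensive experiments on two public benchmarks demonstrate the effectiveness of the method.
Insight: The novelty is a two-stage, multi-source knowledge integration framework for diagnostic reasoning that emphasizes collecting and integrating evidence from dynamic external knowledge sources in order to emulate clinical differential diagnosis, improving both reasoning accuracy and alignment with standard clinical pathways.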
Abstract: Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted cases, and a clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.
[37] Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk cs.CL | cs.AIPDF
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang
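TL;DR: This paper analyzes how frontier image generation models (such as GPT Image 2 and Nano Banana Pro) are moving from artistic synthesis toward synthetic visual evidence and identifies the resulting societal risks, including fake crisis images, celebrity likenesses, and medical scans. It proposes a capability-weighted risk framework that links model affordances to real-world harm, and closes with practical recommendations for layered control.
Details
Motivation: When frontier image generation models produce realistic visual evidence, they weaken society's trust in images as reliable records. The paper analyzes the real-world risks this creates.
Result: The study finds that risk stems not only from photorealism but from the combination of realism, legible text, identity persistence, fast iteration, and distribution context, and builds its risk framework on that finding.
Insight: The novelty is a capability-weighted risk analysis framework that treats risk as the product of several coupled factors rather than of realism alone, together with a systematic set of layered control recommendations spanning model-side restrictions, cryptographic provenance, visible labeling, and more.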
Abstract: Frontier image generation has moved from artistic synthesis toward synthetic visual evidence. Systems such as GPT Image 2, Nano Banana Pro, Nano Banana 2, Grok Imagine, Qwen Image 2.0 Pro, and Seedream 5.0 Lite combine photorealistic rendering, readable typography, reference consistency, editing control, and in several cases reasoning or search-grounded image construction. These capabilities create large benefits for design, education, accessibility, and communication, yet they also weaken one of society’s most common trust shortcuts: the belief that a plausible picture is a reliable record. This paper provides a source-grounded technical and policy analysis of synthetic visual risk. We first summarize the public capabilities of recent image models, then analyze public incidents involving fake crisis images, celebrity and public-figure imagery, medical scans, forged-looking documents, synthetic screenshots, phishing assets, and market-moving rumors. We introduce a capability-weighted risk framework that links model affordances to real-world harm in finance, medicine, news, law, emergency response, identity verification, and civic discourse. Our findings show that risk is driven less by photorealism alone than by the convergence of realism, legible text, identity persistence, fast iteration, and distribution context. We argue for layered control: model-side restrictions, cryptographic provenance, visible labeling, platform friction, sector-grade verification, and incident response. The paper closes with practical recommendations for model providers, platforms, newsrooms, financial institutions, healthcare systems, legal organizations, regulators, and ordinary users.
[38] Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis cs.CL | cs.AI | cs.CE | cs.LG | cs.MAPDF
Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du
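TL;DR: This paper proposes DataPRM, an environment-aware generative process reward model designed to address the weak supervision that general-purpose process reward models provide in dynamic data analysis tasks. DataPRM actively interacts with the environment to probe intermediate execution states and uncover silent errors, and uses a reflection-aware ternary reward strategy to distinguish correctable grounding errors from irrecoverable ones, thereby improving data analysis agents.
Details
Motivation: General-purpose process reward models perform well in static domains such as mathematics but struggle to supervise agents in dynamic data analysis tasks: they cannot detect silent errors and they wrongly penalize exploratory actions. A process reward model designed specifically for data analysis environments is therefore needed.
Result: Experiments show that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. With only 4B parameters it outperforms strong baselines and generalizes robustly across multiple test-time scaling strategies. When integrated into reinforcement learning, it reaches 78.73% accuracy on DABench and 64.84% on TableBench, outperforming outcome-reward baselines.
Insight: The contributions include an environment-aware active verification mechanism (probing for silent errors through interaction) and a reflection-aware ternary reward strategy (distinguishing error types). Viewed objectively, the scalable data construction pipeline (diversity-driven trajectory generation and knowledge-augmented step-level annotation) also offers a useful template for applying process reward models to complex dynamic tasks.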
Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
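Best-of-N inference with a process reward model, as used in the evaluation above, can be sketched roughly as follows: sample N candidate trajectories, score every step, aggregate the step scores, and keep the highest-scoring trajectory. The ternary values (+1, 0, -1), the step labels, and mean aggregation are assumptions made for this illustration; the abstract does not state how DataPRM maps its ternary judgments to numbers or how it aggregates them.

```python
# Hypothetical numeric values for a ternary step judgment.
TERNARY_SCORE = {
    "correct": 1.0,
    "correctable": 0.0,     # e.g. an exploratory step that can still be fixed
    "irrecoverable": -1.0,
}


def score_step(step_label):
    """Stand-in for a process reward model call on a single step."""
    return TERNARY_SCORE[step_label]


def trajectory_score(steps):
    """Aggregate step-level rewards; the mean is one common choice among several."""
    scores = [score_step(step) for step in steps]
    return sum(scores) / len(scores)


def best_of_n(candidates):
    """Return the candidate trajectory with the highest aggregated reward."""
    return max(candidates, key=trajectory_score)


candidates = [
    ["correct", "correctable", "correct"],
    ["correct", "irrecoverable", "correct"],
    ["correct", "correct", "correct"],
]
best = best_of_n(candidates)
```

Scoring "correctable" steps above "irrecoverable" ones is what keeps necessary trial-and-error exploration from being penalized as harshly as a genuine mistake.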
[39] Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer cs.CLPDF
Shun Shao, Binxu Wang, Shay B. Cohen, Anna Korhonen, Yonatan Belinkov
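TL;DR: This paper proposes Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source language model to a larger target model through a learned differentiable alignment, avoiding full circuit discovery on the target model.
Details
Motivation: Existing mechanistic interpretability methods for localizing the circuits behind specific language model behaviors are expensive, model-specific, and hard to scale to larger architectures, so an efficient and scalable method for cross-model circuit transfer is needed.
Result: On six tasks (factual retrieval, multiple-choice reasoning, arithmetic) with Llama-3 and Qwen-2.5 models, DFA performs best on the Llama-3 1B→3B transfer, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens as the source-target gap grows and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scale differences increase. DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution.
Insight: The novelty is cross-model circuit transfer through a learned differentiable alignment, which avoids full circuit discovery on the target model. The analysis suggests that smaller models can provide useful mechanistic priors for larger ones, while node-level cross-model circuit alignment is limited by architectural and scale differences.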
Abstract: Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B→3B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source–target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment. Code is available at https://github.com/jasonshaoshun/dfa-circuits.
[40] DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents cs.CLPDF
Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi
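TL;DR: This paper proposes DPEPO, a new reinforcement learning algorithm that addresses the limited exploration and incomplete environmental understanding of LLM-based agents under the traditional sequential “reason-then-act” paradigm. It introduces a paradigm in which an agent interacts with multiple environments simultaneously and shares cross-trajectory experience, and uses a hierarchical reward scheme to encourage diverse parallel exploration.
Details
Motivation: Traditional sequential “reason-then-act” LLM agents interact with only a single environment per step, which limits exploration and leaves their understanding of the environment incomplete. This paper addresses these problems through parallel interaction and cross-trajectory experience sharing.
Result: Extensive experiments on the ALFWorld and ScienceWorld benchmarks show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines.
Insight: The core contribution is a new paradigm that supports parallel environment interaction and cross-trajectory experience sharing, with the DPEPO algorithm built on top of it. DPEPO uses two-stage training (initial supervised fine-tuning followed by reinforcement learning) and a hierarchical reward scheme consisting of a parallel trajectory-level success reward and two step-level rewards (Diverse Action Reward and Diverse State Transition Reward) that penalize behavioral redundancy and promote broad exploration. This offers a new way to improve the exploration ability and task-solving efficiency of LLM agents.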
Abstract: Large language model (LLM) agents that follow the sequential “reason-then-act” paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
[41] OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents cs.CLPDF
Zheng Wu, Yi Hua, Zhaoyuan Huang, Chenhao Xue, Yijie Lu
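TL;DR: This paper proposes OS-SPEAR, a comprehensive toolkit for systematically evaluating operating system (OS) agents along four dimensions: safety, performance, efficiency, and robustness. It includes four specialized data subsets and an automated analysis tool, and reports an extensive evaluation of 22 popular OS agents that reveals key findings such as a trade-off between efficiency and safety or robustness.
Details
Motivation: OS agents driven by multimodal large language models (MLLMs) currently lack rigorous evaluation of safety, efficiency, and multimodal robustness. Existing benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics, which hinders these agents from becoming trustworthy daily partners.
Result: An extensive evaluation of 22 popular OS agents with OS-SPEAR reveals several key patterns in the current landscape: a prevalent trade-off between efficiency and safety or robustness, better performance from specialized agents than from general-purpose models, and robustness vulnerabilities that vary across modalities.
Insight: The main contribution is a systematic evaluation framework (OS-SPEAR) covering safety, performance, efficiency, and robustness. By building a high-quality performance subset through trajectory value estimation and stratified sampling, and by testing robustness with cross-modal disturbances, it provides a standardized evaluation basis and diagnostic tool for developing the next generation of reliable and efficient OS agents.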
Abstract: The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
[42] Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency cs.CLPDF
Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
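TL;DR: This paper presents a comprehensive study of structured pruning for large vision language models (LVLMs), exploring two paradigms, layerwise and widthwise pruning, combined with lightweight recovery training based on supervised finetuning and knowledge distillation. It finds that widthwise pruning performs better in resource-constrained scenarios, that finetuning only the multimodal projector is sufficient at low compression levels, that combining supervised finetuning with hidden-state distillation gives the best recovery, and that only 5% of the data is needed to recover over 95% of the original performance.
Details
Motivation: Large vision language models are hard to deploy on resource-constrained edge devices because of their computational and memory requirements. Existing methods offer limited flexibility and are computationally costly, so the paper explores a complementary route: compressing existing models through structured pruning.
Result: An empirical study on three representative LVLM families ranging from 3B to 7B parameters shows that effective recovery is possible with only 5% of the original data while retaining over 95% of the original performance.
Insight: The novelty is a systematic comparison of the dynamics of layerwise and widthwise pruning on LVLMs, together with an efficient recovery training strategy that combines supervised finetuning and hidden-state distillation. The study demonstrates that high-performance recovery is feasible with very little data and offers practical guidance for model compression in resource-constrained settings.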
Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.
[43] A Multi-Dimensional Audit of Politically Aligned Large Language Models cs.CLPDF
Lisa Korver, Mohamed Mostagir, Sherief Reda
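TL;DR: This paper proposes a multi-dimensional audit framework, inspired by Habermas' Theory of Communicative Action, for evaluating politically aligned large language models along four dimensions: effectiveness, fairness, truthfulness, and persuasiveness. Auditing nine popular LLMs aligned through fine-tuning or role-playing reveals performance trade-offs: larger models are more effective at role-playing political ideologies and more truthful, but less fair, showing more bias in the form of angry and toxic language toward people of different ideologies; fine-tuned models show lower bias and more effective alignment than role-playing models, but lose performance on reasoning tasks and hallucinate more. Every model is deficient in at least one dimension, underscoring the need for more balanced and robust alignment strategies.
Details
Motivation: As LLMs spread across industries, concern is growing about their misuse in sensitive areas such as political discourse. Aligning LLMs with specific political ideologies through prompt engineering or fine-tuning can be useful in cases such as political campaigns, but requires careful consideration because of heightened risks of performance degradation, misinformation, or more biased behavior. This study aims to ensure that politically aligned LLMs produce legitimate, harmless arguments by providing a framework for evaluating responsible political alignment.
Result: The audit of nine popular LLMs shows that larger models are more effective at role-playing political ideologies and more truthful, but less fair, exhibiting more bias in the form of angry and toxic language. Fine-tuned models show lower bias and more effective alignment than role-playing models, but lose performance on reasoning tasks and hallucinate more. Every model is deficient in at least one dimension.
Insight: The novelty is a multi-dimensional audit framework inspired by Habermas' theory that systematically evaluates politically aligned LLMs with automated, quantitative metrics. The framework exposes inherent trade-offs in model behavior (for example, scale versus fairness, and alignment effectiveness versus reasoning performance), highlights the need for more balanced alignment strategies, and provides an actionable evaluation tool for responsible political alignment.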
Abstract: As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas’ Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance on reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.
[44] Kwai Summary Attention Technical Report cs.CL | cs.AI | cs.IR | cs.LGPDF
Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng
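TL;DR: This paper proposes Kwai Summary Attention (KSA), a novel attention mechanism that targets the quadratic compute and memory overhead of standard softmax attention in long-context modeling. KSA compresses historical context into learnable summary tokens: the KV cache stays linear in the sequence length but is compressed at the semantic level by a ratio k, so that complete, referential, and interpretable long-range dependencies are retained at an acceptable memory cost.
Details
Motivation: Standard softmax attention has quadratic complexity in long-context settings, which drives training and inference costs up sharply. Existing solutions (such as GQA, MLA, SWA, and GDN) either leave the KV cache linearly dependent on the sequence length or trade KV cache efficiency against long-context modeling quality. The paper explores a middle path: keep the KV cache linear in the sequence length but apply semantic-level compression by a ratio k, balancing memory overhead against the retention of long-range dependencies.
Result: The abstract does not report specific quantitative results, benchmarks, or comparisons with SOTA models.
Insight: The novelty is an intermediate path, O(n/k), between fully compressing the KV cache and fully retaining sequence information. It compresses historical context at the semantic level through learnable summary tokens, aiming for complete and interpretable long-range dependency modeling at a controllable memory cost, and offers a new technical direction for long-context LLMs.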
Abstract: Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding and reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, causing the training and inference costs of extremely long sequences to grow rapidly. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, for example through head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Besides these two routes, we argue that there exists an intermediate path that is not well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression through a specific ratio k. This O(n/k) path does not pursue a “minimum KV cache”, but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
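A rough back-of-the-envelope calculation helps show what the O(n/k) path buys in memory. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head) and compares a full cache against one that keeps roughly n/k summary entries. The model dimensions and k = 8 are hypothetical, and the abstract does not say whether KSA also keeps some uncompressed recent tokens, so this is only an order-of-magnitude illustration.

```python
import math


def summary_cache_entries(seq_len, k):
    """Number of cached positions if every k tokens collapse into one summary token."""
    return math.ceil(seq_len / k)


def kv_cache_bytes(entries, layers, kv_heads, head_dim, bytes_per_value=2):
    """Standard KV cache size: 2 tensors (K and V) per layer, per head, per position."""
    return 2 * entries * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape and context length, chosen only for illustration.
seq_len, k = 131_072, 8
shape = dict(layers=32, kv_heads=8, head_dim=128)

full_bytes = kv_cache_bytes(seq_len, **shape)
compressed_bytes = kv_cache_bytes(summary_cache_entries(seq_len, k), **shape)
```

With these assumed dimensions the cache shrinks by exactly the factor k, while still growing linearly with the sequence length, which is the trade-off the abstract describes.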
[45] SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering cs.CLPDF
Yuqing Fu, Yimin Deng, Wanyu Wang, Yuhao Wang, Yejing Wang
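TL;DR: This paper proposes SEARCH-R, a framework that addresses two key challenges in multi-hop question answering: generating correct reasoning paths and accurately retrieving essential knowledge. It trains an end-to-end reasoning path navigator by fine-tuning Llama3.1-8B and designs a dependency tree-based retrieval method that quantitatively evaluates the informational contribution of documents. Experiments on three challenging multi-hop datasets validate the framework.
Details
Motivation: Existing approaches mainly rely on prompt-based methods to generate reasoning paths, combined with traditional sparse or dense retrieval to produce the final answer. Reasoning path generation lacks effective control over the generative process and easily goes astray, while retrieval over-relies on knowledge matching or similarity scores rather than the practical utility of the information, so it returns homogeneous or unhelpful content.
Result: Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework.
Insight: The contributions are: 1) an end-to-end reasoning path navigator (based on fine-tuned Llama3.1-8B) that provides strong sub-question decomposition; 2) a novel dependency tree-based retrieval method that quantitatively evaluates the informational contribution of documents instead of relying only on similarity matching.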
Abstract: Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.
[46] Generating Place-Based Compromises Between Two Points of View cs.CLPDF
Sumanta Bhattacharyya, Francine Chen, Scott Carter, Yan-Ying Chen, Tatiana Lau
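TL;DR: This paper presents methods for using large language models to generate empathically neutral compromises between two opposing viewpoints. It first compares four prompt engineering methods on a dataset of 2,400 contrasting views on shared places and evaluates the acceptability of the generated compromises in a 50-participant study. The best method uses external empathic similarity between a compromise and each viewpoint as iterative feedback and outperforms standard Chain of Thought reasoning. The generated compromises are then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation at inference time.
Details
Motivation: Large language models excel at academic tasks but struggle with social intelligence tasks such as creating good compromises. This paper addresses the problem of generating empathically neutral compromises between two opposing viewpoints.
Result: In the 50-participant study, the method that uses external empathic similarity as iterative feedback produced more acceptable compromises than standard Chain of Thought reasoning. Smaller foundation models trained with margin-based alignment of human preferences improved efficiency.
Insight: The novelty is using external empathic similarity as an iterative feedback signal to guide an LLM toward more acceptable compromises, and then distilling that capability into smaller models through preference alignment, which improves inference efficiency while preserving performance.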
Abstract: Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the generated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming standard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.
[47] Aligned Multi-View Scripts for Universal Chart-to-Code Generation cs.CL | cs.AIPDF
Zhihan Zhang, Lizi Liao
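TL;DR: This paper introduces the Chart2NCode dataset and the CharLuMA model for universal chart-to-code generation. Chart2NCode contains 176K charts paired with aligned, visually equivalent scripts in Python, R, and LaTeX. CharLuMA builds on a LLaVA-style architecture and adds a parameter-efficient adaptation module that augments the multimodal projector with a language-conditioned mixture of low-rank subspaces, so the model shares core chart understanding while specializing code generation to the target language through lightweight routing.
Details
Motivation: Existing chart-to-code methods are largely limited to Python, which restricts practical use and overlooks a critical source of supervision: the same chart can be expressed by semantically equivalent scripts in different plotting languages.
Result: Extensive experiments show consistent gains in code executability and visual fidelity across all languages (Python, R, LaTeX), outperforming strong open-source baselines and remaining competitive with proprietary systems.
Insight: The contributions are a dataset with aligned multi-language supervision and a parameter-efficient adapter module that uses a language-conditioned mixture of low-rank subspaces to share core chart understanding while specializing code generation per language. The analyses also show that balanced multi-language supervision benefits all languages and that the adapter allocates a compact shared core plus language-specific capacity.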
Abstract: Chart-to-code generation converts a chart image into an executable plotting script, enabling faithful reproduction and editable visualizations. Existing methods are largely Python-centric, limiting practical use and overlooking a critical source of supervision: the same chart can be expressed by semantically equivalent scripts in different plotting languages. To fill this gap, we introduce Chart2NCode, a dataset of 176K charts paired with aligned scripts in Python, R, and LaTeX that render visually equivalent outputs, constructed via a metadata-to-template pipeline with rendering verification and human quality checks. Building on a LLaVA-style architecture, we further propose CharLuMA, a parameter-efficient adaptation module that augments the multimodal projector with a language-conditioned mixture of low-rank subspaces, allowing the model to share core chart understanding while specializing code generation to the target language through lightweight routing. Extensive experiments show consistent gains in executability and visual fidelity across all languages, outperforming strong open-source baselines and remaining competitive with proprietary systems. Further analyses reveal that balanced multi-language supervision benefits all languages and that the adapter allocates a compact shared core plus language-specific capacity. Codes and data are available at https://github.com/Zhihan72/CharLuMA.
[48] MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG cs.CL | cs.IR | cs.ITPDF
Xihang Wang, Zihan Wang, Chengkai Huang, Quan Z. Sheng, Lina Yao
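TL;DR: This paper proposes MEG-RAG, a framework that introduces Multi-modal Evidence Grounding (MEG), a semantic-aware metric quantifying the contribution of retrieved evidence, and uses it to train a multimodal reranker that improves evidence selection in multimodal retrieval-augmented generation (MRAG) systems.
Details
Motivation: Current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities.
Result: Extensive experiments on the M²RAG benchmark show that MEG-RAG consistently outperforms strong baselines and generalizes robustly across different teacher models.
Insight: The novelty is the MEG metric based on Semantic Certainty Anchoring, which focuses on high-IDF information-bearing tokens to better capture the semantic core of the answer, guiding the reranker to prioritize high-value content based on semantic grounding rather than token probability distributions.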
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
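The abstract does not give the MEG formula, so the following is only a minimal sketch of the general idea of anchoring an evidence score on high-IDF answer tokens. The function names (`idf_table`, `anchored_gain`), the `top_k` parameter, and the use of a log-probability difference with and without the evidence are all assumptions made for illustration, not the authors' definition.

```python
import math

def idf_table(corpus):
    """Inverse document frequency for every token in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for tok in set(doc):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(n_docs / count) for tok, count in doc_freq.items()}

def anchored_gain(answer_tokens, logp_with, logp_without, idf, top_k=2):
    """Hypothetical evidence score: mean log-probability gain on the top_k
    highest-IDF answer tokens when the evidence is added to the prompt."""
    ranked = sorted(range(len(answer_tokens)),
                    key=lambda i: -idf.get(answer_tokens[i], 0.0))
    anchors = ranked[:top_k]
    return sum(logp_with[i] - logp_without[i] for i in anchors) / len(anchors)

# Toy corpus and made-up per-token log-probabilities (illustrative only).
corpus = [["the", "cat"], ["the", "dog"], ["the", "eiffel", "tower"]]
idf = idf_table(corpus)
score = anchored_gain(["the", "eiffel", "tower"],
                      [-0.1, -0.5, -0.4],   # with the retrieved evidence
                      [-0.1, -3.0, -2.0],   # without it
                      idf, top_k=2)
```

In this toy setup the frequent token "the" has zero IDF and is ignored, so the score reflects only how much the evidence helps the model produce the informative tokens.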
[49] Evaluation of Pose Estimation Systems for Sign Language Translation cs.CLPDF
Catherine O’Brien, Gerard Sant, Mathias Müller, Sarah Ebling
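TL;DR: This paper systematically evaluates multiple pose estimators for sign language translation (SLT). By training a unified SLT pipeline on RWTH-PHOENIX-Weather 2014, it finds that SDPose and Sapiens yield better translation quality than the commonly used MediaPipe baseline, and it analyzes how temporal stability, missing hand keypoints, and robustness to occlusion affect the downstream task.
Details
Motivation: SLT systems often take pose sequences as input, yet the choice of pose estimator is usually treated as an implementation detail and lacks systematic evaluation. The paper aims to quantify the actual impact of different pose estimators on downstream translation performance.
Result: On RWTH-PHOENIX-Weather 2014, SDPose and Sapiens achieve the best translation performance (BLEU ~11.5), outperforming the MediaPipe baseline (BLEU ~10). In occlusion tests on the Signsuisse dataset, Sapiens is correct in all cases (15/15), whereas OpenPifPaf is correct in only 1 of 15 and also has the weakest translation scores.
Insight: The study shows that the choice of pose estimator significantly affects SLT performance, particularly through missing hand keypoints and occlusion robustness. It provides reproducible experiment code that lowers the barrier to using alternative pose estimators, establishing a more rigorous evaluation baseline for the field.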
Abstract: Many sign language translation (SLT) systems operate on pose sequences instead of raw video to reduce input dimensionality, improve portability, and partially anonymize signers. The choice of pose estimator is often treated as an implementation detail, with systems defaulting to widely available tools such as MediaPipe Holistic or OpenPose. We present a systematic comparison of pose estimators for pose-based SLT, covering widely used baselines (MediaPipe Holistic, OpenPose) and newer whole-body/high-capacity models (MMPose WholeBody, OpenPifPaf, AlphaPose, SDPose, Sapiens, SMPLest-X). We quantify downstream impact by training a controlled SLT pipeline on RWTH-PHOENIX-Weather 2014 where only the pose representation varies, evaluating with BLEU and BLEURT. To contextualize translation outcomes, we analyze temporal stability, missing hand keypoints, and robustness to occlusion using higher-resolution videos from the Signsuisse dataset. SDPose and Sapiens achieve the best translation performance (BLEU ~11.5), outperforming the common MediaPipe baseline (BLEU ~10). In occlusion cases, Sapiens is correct in all tested instances (15/15), while OpenPifPaf fails in nearly all (1/15) and also yields the weakest translation scores. Estimators that frequently leave out hand keypoints are associated with lower BLEU/BLEURT. We release code that not only reproduces our experiments but also considerably lowers the barrier for other researchers to use alternative pose estimators.
[50] K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology cs.CL | cs.AIPDF
Soyeon Kim, Cheongwoong Kang, Myeongjin Lee, Eun-Chul Chang, Jaedeok Lee
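TL;DR: This paper introduces K-MetBench, a diagnostic benchmark based on Korean national qualification exams that evaluates (multimodal) LLMs in meteorology along four dimensions: expert visual reasoning, logical validity, Korean geo-cultural comprehension, and fine-grained domain analysis. An evaluation of 55 models reveals a modality gap in interpreting specialized charts and a reasoning gap in which models may hallucinate logic even when their predictions are correct, and finds that Korean models outperform larger global models in local contexts.
Details
Motivation: Developing practical (multimodal) LLM assistants for Korean weather forecasters is hindered by the lack of a multidimensional, expert-level evaluation framework grounded in authoritative sources, so a comprehensive diagnostic benchmark is needed to expose critical capability gaps.
Result: 55 models were evaluated on K-MetBench. Models show a significant modality gap when interpreting specialized charts and deficiencies in logical reasoning (hallucinating even when predictions are correct). A key finding is that Korean models significantly outperform global models with far more parameters on tasks involving local culture.
Insight: The novelty is a multidimensional, fine-grained benchmark built on authoritative qualification exams. It highlights that cultural dependencies cannot be resolved by parameter scaling alone and offers a roadmap for developing reliable, culturally aware expert AI agents.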
Abstract: The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .
[51] DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference cs.CL | cs.AIPDF
Zahra Dehghanighobadi, Asja Fischer
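TL;DR: The paper proposes DepthKV, a layer-dependent KV cache pruning framework that addresses the linear growth of KV cache memory in long-context LLM inference. Rather than applying a uniform pruning ratio, it allocates the global KV cache budget non-uniformly according to each layer's sensitivity to pruning.
Details
Motivation: Existing KV cache pruning methods typically apply the same pruning ratio to all layers, but the authors find that layers differ significantly in their sensitivity to pruning. The uniform-allocation assumption is therefore suboptimal and uses the KV cache budget inefficiently.
Result: Experiments across multiple models and tasks show that, at the same global pruning ratio, DepthKV consistently outperforms uniform pruning, demonstrating that layer-dependent allocation uses the KV cache budget more effectively.
Insight: The novelty lies in revealing that Transformer layers are heterogeneous in their sensitivity to KV cache pruning and in proposing a layer-dependent budget allocation strategy on that basis. This offers a new perspective on KV cache management: resource allocation should account for structural differences inside the model.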
Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
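To make the idea of splitting a fixed global KV budget across layers concrete, here is a minimal sketch. It assumes per-layer sensitivity scores are already available as positive numbers and allocates tokens in proportion to them; the abstract does not say how DepthKV measures sensitivity or which allocation rule it uses, so `allocate_kv_budget` and the proportional rule are illustrative assumptions.

```python
def allocate_kv_budget(sensitivities, seq_len, keep_ratio):
    """Split a fixed global KV budget across layers in proportion to
    (assumed, positive) per-layer sensitivity scores.

    Returns the number of cached tokens each layer may keep. The total equals
    keep_ratio * seq_len * num_layers, the same budget uniform pruning uses.
    """
    num_layers = len(sensitivities)
    total_budget = int(round(keep_ratio * seq_len * num_layers))
    total_sens = sum(sensitivities)
    # Proportional share per layer, capped at the sequence length.
    budgets = [min(seq_len, int(total_budget * s / total_sens))
               for s in sensitivities]
    # Hand out tokens lost to rounding or capping, most sensitive layers first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(num_layers), key=lambda i: -sensitivities[i])
    while leftover > 0:
        progressed = False
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < seq_len:
                budgets[i] += 1
                leftover -= 1
                progressed = True
        if not progressed:  # every layer is already full
            break
    return budgets

# Hypothetical sensitivities for a 4-layer model, keeping 50% of the cache overall.
budgets = allocate_kv_budget([4.0, 2.0, 1.0, 1.0], seq_len=1000, keep_ratio=0.5)
```

With these made-up scores the most sensitive layer keeps its full cache while the least sensitive layers keep a quarter of it, yet the global budget matches a uniform 50% pruning ratio.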
[52] Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation cs.CL | cs.AIPDF
Sercan Karakaş, Yusuf Şimşek
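TL;DR: This paper examines whether source trustworthiness affects the choice of Turkish evidential morphological markers (-DI and -mIs) and whether large language models (LLMs) track this sensitivity. A human experiment shows that native speakers favor -DI in high-trust contexts and -mIs in low-trust contexts, a stable trust effect. Across the 10 LLMs evaluated, however, behavior depends heavily on the specific model and prompting setup, with only weak or unstable trust-consistent shifts that are often obscured by output-compliance problems and base-rate suffix preferences, revealing a clear human-LLM gap in source-sensitive evidential reasoning.
Details
Motivation: To investigate whether Turkish evidential morphology is influenced by source trustworthiness and to test whether LLMs exhibit human-like sensitivity to source reliability.
Result: The human experiment shows relatively more -DI in high-trust contexts (a robust trust effect), whereas LLMs are highly unstable across three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). Some models show only weak or local trust-consistent shifts, effects are often reversed or masked, and none reach human-level behavior.
Insight: The study provides new evidence for a trust-/commitment-based account of Turkish evidentiality and exposes the limits of LLMs on source-sensitive reasoning, suggesting that current LLMs still fall short in modeling subtle pragmatic and social-cognitive factors in language.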
Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.
[53] Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination cs.CLPDF
Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu
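TL;DR: This paper presents ProHist-Bench, a new benchmark based on the Chinese Imperial Examination system, designed to evaluate the ability of large language models (LLMs) to perform professional-level historical reasoning. An evaluation of 18 LLMs finds that even the most advanced current models show a significant proficiency gap on complex historical research questions.
Details
Motivation: Existing benchmarks mainly assess basic knowledge breadth or lexical understanding and fail to capture the higher-order skills central to historical research, such as evidentiary reasoning. A new benchmark is needed to fill this gap and probe the professional historical reasoning of LLMs.
Result: A rigorous evaluation of 18 LLMs on ProHist-Bench (400 expert-curated questions and 10,891 fine-grained evaluation rubrics) shows that even state-of-the-art models have a significant proficiency gap on complex historical research questions.
Insight: The novelty is a challenging historical reasoning benchmark grounded in the Chinese Imperial Examination system and developed through interdisciplinary collaboration. It provides a new tool for evaluating and developing domain-specific reasoning LLMs and reveals their current limitations on complex professional tasks.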
Abstract: While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
[54] The Chameleon’s Limit: Investigating Persona Collapse and Homogenization in Large Language Models cs.CLPDF
Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan
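TL;DR: This paper studies "persona collapse" in LLM-based multi-agent simulation: although each agent is assigned a distinct persona description, agent behavior converges toward homogeneity. The authors propose a quantitative framework measuring coverage of the persona space, uniformity, and behavioral complexity, and evaluate ten LLMs across personality simulation, moral reasoning, and self-introduction. Collapse varies across dimensions and domains, and behavioral variation tends to track coarse demographic stereotypes rather than fine-grained individual differences. Surprisingly, the models with the highest per-persona fidelity produce the most stereotyped populations.
Details
Motivation: To address the problem that, in applications such as multi-agent simulation, LLM agents behave homogeneously despite being given different persona descriptions ("persona collapse"), and thereby support building diverse simulated populations.
Result: Ten LLMs were evaluated on three tasks: personality simulation (BFI-44), moral reasoning, and self-introduction. Models exhibit different degrees of persona collapse across evaluation dimensions (such as coverage and uniformity) and task domains. For example, a model may collapse the most in personality simulation yet be the most diverse in moral reasoning.
Insight: The novelty is an evaluation framework that quantifies persona collapse (coverage, uniformity, complexity) and the finding that behavioral diversity tends to follow stereotypes rather than the specified details. A key counter-intuitive finding is that the models with the highest per-persona fidelity produce the most stereotyped populations, underscoring the importance of evaluating LLM diversity at the population level rather than the individual-persona level.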
Abstract: Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emph{Persona Collapse}: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbf{the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations}. We release our toolkit and data to support population-level evaluation of LLMs.
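The abstract names Coverage and Uniformity but does not define them, so the sketch below uses one plausible, assumed instantiation: agent behavior is embedded as points in the unit hypercube, Coverage is the fraction of grid cells occupied, and Uniformity is the normalized entropy of the occupancy distribution. The grid discretization and the function name are illustrative choices, not the paper's method.

```python
import math
from collections import Counter

def coverage_and_uniformity(points, bins_per_dim):
    """Assumed population-level diversity measures on points in [0, 1]^d.

    coverage:   fraction of grid cells containing at least one agent
    uniformity: entropy of cell occupancy divided by the maximum entropy
    """
    dims = len(points[0])
    cells = Counter(
        tuple(min(int(x * bins_per_dim), bins_per_dim - 1) for x in p)
        for p in points
    )
    total_cells = bins_per_dim ** dims
    coverage = len(cells) / total_cells
    n = len(points)
    entropy = -sum((c / n) * math.log(c / n) for c in cells.values())
    uniformity = entropy / math.log(total_cells) if total_cells > 1 else 1.0
    return coverage, uniformity

# A collapsed population: 16 agents with identical behavior.
collapsed = [(0.5, 0.5)] * 16
# A diverse population: one agent at the center of each cell of a 4x4 grid.
diverse = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]

cov_c, uni_c = coverage_and_uniformity(collapsed, bins_per_dim=4)
cov_d, uni_d = coverage_and_uniformity(diverse, bins_per_dim=4)
```

Both populations have 16 agents, but the collapsed one occupies a single cell while the diverse one covers the whole grid evenly, which is the population-level contrast the framework is meant to expose.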
[55] Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling cs.CL | cs.LGPDF
Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas
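TL;DR: This paper proposes HyLo, a long-context upcycling method that converts pretrained Transformer LLMs into hybrid architectures, improving long-context capability while preserving short-context quality. It combines architectural adaptation, efficient Transformer blocks (MLA), and linear blocks (Mamba2 or Gated DeltaNet), with staged long-context training and teacher-guided distillation for stable optimization.
Details
Motivation: Most existing hybrid sequence models are pretrained from scratch and cannot reuse existing Transformer checkpoints. The paper offers a practical upcycling path that extends the context length of pretrained LLMs and reduces KV-cache memory.
Result: In 1B- and 3B-scale settings, HyLo extends usable context length by up to 32×, reduces KV-cache memory by more than 90%, and supports up to 2M-token prefill and decoding. It significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER, and surpasses JetNemotron on GSM8K, Lm-Harness common sense reasoning, and RULER-64K with less training data.
Insight: The novelty lies in combining architectural adaptation with efficient components and achieving stable optimization through staged training and distillation, offering an efficient, reusable upcycling recipe for long-context extension of LLMs that balances short-context performance with long-context gains.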
Abstract: Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emph{HyLo} (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90\%$, enabling up to 2M-token prefill and decoding in our \texttt{vLLM} inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
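As a rough illustration of why replacing attention layers shrinks the KV cache, the sketch below applies the standard size formula for a standard attention cache (keys plus values, per layer, per KV head, per token). The model dimensions are hypothetical and not taken from the paper, and the estimate ignores MLA compression and the constant-size state kept by linear blocks.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a standard KV cache: keys and values (factor 2) for every
    attention layer, KV head, and cached token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 16-layer model, 8 KV heads of dimension 64, fp16, 64K context.
full = kv_cache_bytes(n_attn_layers=16, n_kv_heads=8, head_dim=64, seq_len=65536)

# Assumed hybrid variant that keeps attention in only 1 of the 16 layers.
hybrid = kv_cache_bytes(n_attn_layers=1, n_kv_heads=8, head_dim=64, seq_len=65536)

reduction = 1 - hybrid / full
```

Under these assumptions the full cache is 2 GiB at 64K tokens and grows linearly with context, while the hybrid variant needs one sixteenth of that; the actual savings reported for HyLo depend on its specific layer mix and MLA configuration.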
cs.CV [Back]
[56] See No Evil: Semantic Context-Aware Privacy Risk Detection for AR cs.CV | cs.AI | eess.SYPDF
Jialu Liu, Yao Li, Zhuoheng Li, Huining Li, Ying Chen
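TL;DR: This paper proposes PrivAR, a context-aware privacy risk detection system for augmented reality (AR) that uses vision language models (VLMs) with chain-of-thought prompting. It infers potentially sensitive information types from visual scene cues (for example, identifying password notes in an office environment) and detects and obfuscates textual content to prevent exposure of sensitive information while preserving the contextual cues needed for VLM inference. It also studies contextually informed warning interfaces to raise user privacy awareness.
Details
Motivation: Existing AR privacy frameworks lack semantic understanding of visual content, which limits their effectiveness at detecting context-dependent privacy risks. The continuous capture of visual data by AR systems creates unique privacy risks that call for smarter detection methods.
Result: Experiments on a real-world AR dataset show that PrivAR achieves higher accuracy (81.48%) and F1-score (84.62%) than baselines and reduces the privacy leakage rate to 17.58%. A user study evaluates the contextually informed warning interfaces and provides insights into effective privacy-aware AR design.
Insight: The novelty lies in combining VLMs with chain-of-thought prompting for semantic, context-aware privacy risk detection, and in detecting and obfuscating textual content to balance protection of sensitive information against preservation of necessary contextual cues. The design also includes context-aware warning interfaces that raise user privacy awareness, offering a new approach to AR privacy protection.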
Abstract: Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their effectiveness in detecting context-dependent privacy risks. We propose PrivAR, which leverages vision language models (VLMs) with chain-of-thought prompting for contextual privacy risk detection in AR environments. PrivAR uses visual scene cues to infer potential sensitive information types, such as identifying password notes in office environments through contextual reasoning. PrivAR detects and obfuscates textual content, preventing exposure of sensitive information while preserving contextual cues necessary for VLM inference. Additionally, we investigate contextually-informed warning interfaces to enhance user privacy awareness. Experiments on a real-world AR dataset show that PrivAR achieves superior accuracy (81.48%) and F1-score (84.62%) compared to baselines, while reducing privacy leakage rate to 17.58%. User studies evaluating contextually-informed warning interfaces provide insights into effective privacy-aware AR design.
[57] FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers cs.CV | cs.AI | eess.IVPDF
Haopeng Jin
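TL;DR: FreqFormer is a frequency-aware heterogeneous attention framework for long-sequence video diffusion Transformers. It splits token features into spectral bands handled by different attention operators (global attention for low frequencies, block-sparse attention for mid frequencies, local attention for high frequencies) and uses a lightweight spectral routing network to allocate compute dynamically, reducing the cost of quadratic self-attention.
Details
Motivation: To address the excessive compute and memory cost of quadratic self-attention in long-sequence video diffusion Transformers by exploiting the spectral structure of video features (low frequencies carry global layout and coarse motion; high frequencies carry texture and detail) for efficient approximation.
Result: In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic while keeping a hardware-friendly compute pattern, pointing to a practical direction for efficient attention in long-video diffusion Transformers.
Insight: The novelty lies in combining spectral analysis with heterogeneous attention: adaptive spectral routing shifts the computational focus across denoising stages (global structure early, detail later), cross-band summary tokens provide cheap residual exchange, and a fused GPU execution plan improves hardware efficiency.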
Abstract: Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.
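The paper's complexity model is not reproduced in the abstract; the back-of-the-envelope sketch below only illustrates why mixing dense, block-sparse, and sliding-window operators can cut attention FLOPs. It assumes the usual estimate of about 4·n·k·d FLOPs for n queries attending to k keys of dimension d, and all band sizes (`low_tokens`, `mid_keys`, `window`) are hypothetical.

```python
def dense_attention_flops(n_tokens, head_dim):
    """QK^T plus attention-weighted V: roughly 4 * n^2 * d FLOPs per head."""
    return 4 * n_tokens * n_tokens * head_dim

def banded_attention_flops(n_tokens, head_dim, low_tokens, mid_keys, window):
    """Assumed three-band cost, each band pessimistically using the full head_dim:
    - low band:  dense attention among low_tokens compressed tokens
    - mid band:  block-sparse, each query attends to mid_keys keys
    - high band: sliding window, each query attends to window keys
    """
    low = 4 * low_tokens * low_tokens * head_dim
    mid = 4 * n_tokens * mid_keys * head_dim
    high = 4 * n_tokens * window * head_dim
    return low + mid + high

# Hypothetical setting: 64K tokens, head dimension 64.
n, d = 65536, 64
dense = dense_attention_flops(n, d)
banded = banded_attention_flops(n, d, low_tokens=4096, mid_keys=2048, window=512)
ratio = banded / dense
```

With these made-up band sizes the three-band estimate is a little over 4% of the dense cost, and because the mid and high bands scale linearly in the token count, the gap widens as sequences grow.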
[58] DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models cs.CV | cs.AIPDF
JiYang Wang, Jiawei Chen, Mengqi Xiao, Yu Cheng, Yangfu Li
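TL;DR: This paper introduces DO-Bench, an attributable benchmark for diagnosing object hallucination in vision-language models. Through structured multimodal interventions, it separates errors caused by perceptual limitations from those caused by contextual textual priors, exposing the underlying failure mechanisms.
Details
Motivation: Existing benchmarks focus on aggregate accuracy and cannot readily distinguish whether object hallucination errors stem from insufficient perceptual ability or from excessive influence of contextual textual priors, leaving the underlying failure mechanisms unclear.
Result: Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, indicating that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
Insight: The novelty is a controlled diagnostic benchmark whose paired design (the Prior Override and Perception-Limited dimensions) and two diagnostic metrics (PriorRobust and PerceptionAbility) attribute errors to prior suppression, perceptual insufficiency, or their interaction, providing a finer-grained tool for analyzing VLM reliability.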
Abstract: Object-level hallucination remains a central reliability challenge for vision-language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism-dependent failure patterns beyond aggregate accuracy.
[59] PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging cs.CV | cs.AIPDF
Zibo Shao, Baochen Xiong, Xiaoshan Yang, Yaguang Song, Qimeng Zhang
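TL;DR: This paper proposes PivotMerge, a framework for post-alignment merging of heterogeneously pre-trained multimodal models. Using Shared-space Decomposition and Filtering together with Alignment-guided Layer-wise Merging, it integrates cross-modal alignment capabilities learned from different data sources and outperforms existing baselines on multiple multimodal benchmarks.
Details
Motivation: Existing model merging research focuses mainly on post-finetuning scenarios, whereas the core of MLLM pre-training is establishing effective cross-modal alignment. The complementary alignment capabilities induced by different datasets in heterogeneous multimodal pre-training remain underexplored, and merging them raises challenges such as cross-domain parameter interference and layer-wise disparity in alignment contribution.
Result: In systematic CC12M-based post-alignment merging scenarios, PivotMerge consistently outperforms existing baselines on multiple multimodal benchmarks, demonstrating its effectiveness and generalization ability.
Insight: The novelty lies in defining the post-alignment merging task and in two components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations, and Alignment-guided Layer-wise Merging, which assigns weights according to alignment contribution. Together they offer a new approach to integrating the cross-modal alignment capabilities of heterogeneously pre-trained models.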
Abstract: Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter interference, where parameter updates learned from different data distributions conflict during merging, and layer-wise alignment contribution disparity, where different layers and projectors contribute unevenly to cross-modal alignment. To address them, we propose \textbf{PivotMerge}, a post-alignment merging framework for cross-modal projectors. PivotMerge incorporates two key components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations and suppresses conflicting directions, and Alignment-guided Layer-wise Merging, which assigns layer-specific merging weights based on differing alignment contributions. We construct systematic CC12M-based post-alignment merging scenarios for evaluation. Extensive experiments on multiple multimodal benchmarks show that PivotMerge consistently outperforms existing baselines, demonstrating its effectiveness and generalization ability.
[60] SGP-SAM: Self-Gated Prompting for Transferring 3D Segment Anything Models to Lesion Segmentation cs.CV | cs.AIPDF
Zixuan Tang, Shen Zhao
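TL;DR: This paper proposes SGP-SAM, a self-gated prompting framework for efficiently transferring 3D SAM-style models to lesion segmentation. Its core is the Self-Gated Prompting Module (SGPM), in which a lightweight multi-channel gating unit predicts whether additional multi-scale feature fusion is needed, conditionally enhancing the spatial context representation. The paper also designs a Zoom Loss that combines Dice loss with a voxel-balanced focal term to strengthen learning on small lesions.
Details
Motivation: Directly transferring 3D SAM-style models to lesion segmentation faces two challenges: intermediate features have weak spatial representational capacity for small, irregular targets, and 3D volumes exhibit extreme foreground-background imbalance.
Result: Experiments on MSD Liver Tumor and MSD Brain Tumor (enhancing tumor) show consistent gains over strong transfer baselines based on SAM-Med3D. On MSD Liver Tumor, SGP-SAM improves mDice by 7.3% over fine-tuning.
Insight: The main contributions are (1) the SGPM, which performs conditional multi-scale spatial enhancement and activates the costlier multi-scale fusion only when needed, improving efficiency; and (2) the Zoom Loss for small lesions, which mitigates class imbalance by up-weighting supervision on lesion regions. Viewed objectively, combining dynamic gating with multi-scale feature fusion and a task-specific loss offers a useful template for efficiently adapting large foundation models to medical image segmentation with imbalanced data distributions.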
Abstract: Large segmentation foundation models such as the Segment Anything Model (SAM) have reshaped promptable segmentation in natural images, and recent efforts have extended these models to medical images and volumetric settings. However, directly transferring a 3D SAM-style model to lesion segmentation remains challenging due to (i) weak spatial representational capacity for small, irregular targets in intermediate features, and (ii) extreme foreground-background imbalance in 3D volumes. We propose SGP-SAM, a self-gated prompting framework for efficient and effective transfer to 3D lesion segmentation. Our key component, the Self-Gated Prompting Module (SGPM), performs conditional multi-scale spatial enhancement: a lightweight multi-channel gating unit predicts whether the current features require additional multi-scale fusion, and only then activates a Multi-Scale Feature Fusion Block to enrich spatial context. To further address small-lesion learning, we design a Zoom Loss that up-weights lesion-focused supervision by combining Dice and a voxel-balanced focal term. Experiments on MSD Liver Tumor and MSD Brain Tumor (enhancing tumor) show consistent gains over strong transfer baselines based on SAM-Med3D. On MSD Liver Tumor, SGP-SAM improves mDice by 7.3% over fine-tuning.
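The abstract describes the Zoom Loss only as a combination of Dice and a voxel-balanced focal term. The sketch below shows one way such a combination could look on flattened voxel lists; the class weights (inverse to each class's voxel share), the focusing parameter `gamma`, and the mixing weight `lam` are assumptions for illustration, not the authors' formulation.

```python
import math

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss on flattened foreground probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(target) + eps)

def balanced_focal_loss(probs, target, gamma=2.0, eps=1e-6):
    """Focal loss with assumed voxel-balanced class weights: the rarer class
    (usually the lesion) gets the larger weight."""
    n = len(target)
    n_pos = sum(target)
    w_pos = (n - n_pos) / n   # weight on foreground voxels
    w_neg = n_pos / n         # weight on background voxels
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if t == 1:
            total += -w_pos * (1.0 - p) ** gamma * math.log(p)
        else:
            total += -w_neg * p ** gamma * math.log(1.0 - p)
    return total / n

def zoom_loss_sketch(probs, target, lam=1.0):
    """Hypothetical combination: Dice plus lam times the balanced focal term."""
    return dice_loss(probs, target) + lam * balanced_focal_loss(probs, target)

# One lesion voxel among four: a perfect and a poor prediction.
target = [1, 0, 0, 0]
good = zoom_loss_sketch([1.0, 0.0, 0.0, 0.0], target)
bad = zoom_loss_sketch([0.1, 0.9, 0.9, 0.9], target)
```

The Dice term is insensitive to the number of background voxels, and the balanced focal term concentrates the remaining gradient on hard, rare lesion voxels, which is the general motivation for pairing the two under foreground-background imbalance.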
[61] Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test cs.CVPDF
Tairan Fu, Francisco Javier Santos-Martín, Javier Conde, Pedro Reviriego, Elena Merino-Gómez
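TL;DR: This paper evaluates state-of-the-art vision-language models, including GPT-5 and Gemini 3, on analyzing dynamic readings of analog gauges. It finds that they cannot accurately handle high-frequency temporal events and needle vibrations and do not meet the strict requirements of metrology and uncertainty quantification.
Details
Motivation: The digital transformation of industrial manufacturing requires autonomous robots that can interact with legacy analog gauges. Although existing VLMs show promise in zero-shot instrument recognition, their inability to accurately analyze dynamic events makes them hard to deploy in measurement systems.
Result: On a newly introduced dataset of video sequences covering multiple gauge types and motion speeds, current VLMs show limited ability to interpret needle trajectories and scale semantics and fall short of the performance required of trustworthy synthetic instruments under IEEE and ISO standards.
Insight: The novelty is a new dataset and evaluation framework focused on dynamic gauge reading, which exposes fundamental limitations of VLMs in processing temporal visual information and meeting strict metrological standards, and identifies key challenges for building reliable industrial vision systems.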
Abstract: The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.
[62] Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics cs.CV | cs.AI | cs.LGPDF
Jiayuan Chen, Ruoqi Liu, Zishan Gu, Ping Zhang
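TL;DR: This paper proposes an intervention-aware distillation framework that uses perturbational transcriptomics to guide microscopy image representation learning, addressing the lack of mechanistic depth in image-based phenotypic profiling and the cell-type and dose variation present in weakly paired multimodal data.
Details
Motivation: Microscopy-based phenotypic profiling is scalable but lacks mechanistic depth, while transcriptomics is mechanistic but costly and scarce. Existing multimodal methods either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data and limiting generalization to unseen interventions.
Result: On the Cell Painting and RxRx datasets paired with L1000, the method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared with self-supervised and alignment baselines.
Insight: The novelty is an intervention-aware distillation framework that emphasizes intervention semantics over identity alignment and explicitly handles dose and cell-type mismatches. It uses a chemistry-aware codebook and a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects, and it shows theoretically that transcriptomic guidance tightens the risk bound for image-based prediction.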
Abstract: Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data, limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.
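The teacher-student setup described above, in which the student matches the teacher's soft distribution over codebook entries, is commonly trained with a KL-divergence objective. The sketch below shows that generic objective on toy logits; the use of KL specifically, the temperature parameter, and the three-entry codebook are assumptions, since the abstract does not state the exact loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) between two distributions over codebook entries."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student) if t > 0)

# Toy logits over a hypothetical 3-entry codebook.
teacher_probs = softmax([2.0, 1.0, 0.1])      # from transcriptome + metadata
student_match = softmax([2.0, 1.0, 0.1])      # image-only student, well aligned
student_off = softmax([0.1, 1.0, 2.0])        # image-only student, misaligned

loss_match = kl_divergence(teacher_probs, student_match)
loss_off = kl_divergence(teacher_probs, student_off)
```

Because the target is a soft distribution rather than a sample identity, two images of the same intervention at different doses or in different cell types can share most of their target mass, which is consistent with the stated goal of emphasizing intervention semantics over identity alignment.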
[63] WebSerial Vision Training for Microcontrollers: A Browser-Based Companion to On-Device CNN Training cs.CV | cs.LGPDF
Jeremy Ellis
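TL;DR: This paper introduces webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense microcontroller. As a browser companion to the on-device Arduino firmware, it provides a fully local, private machine learning pipeline covering firmware flashing, image collection, CNN training, weight export, and live activation visualization, with no software installation beyond a Chromium-based browser.
Details
Motivation: Educators, small businesses, and researchers need to train task-specific visual classifiers under their exact deployment conditions, but lack convenient, private, local training tools; existing options may involve complex installation or uploading data.
Result: A five-run consistency evaluation on a three-class reference problem (0Blank, 1Cup, 2Pen) reports mean accuracy and standard deviation and indicates stable convergence. In-browser CNN training with TensorFlow.js (about 30 images per class, 20 epochs) takes roughly 1 minute versus 9 minutes on-device, so a complete collect-train-deploy cycle fits in under 10 minutes.
Insight: The contributions are: 1) a fully browser-based, zero-install, end-to-end TinyML training and deployment pipeline that keeps data private; 2) in-browser firmware flashing via esptool-js and config.json live-sync, so hyperparameters can be adjusted without recompiling; 3) TensorFlow.js integration for fast browser-side training, which shortens the iteration cycle considerably; 4) visualization tools such as a live Conv2 activation heatmap stream and a confusion matrix; 5) a codebase designed as a living template for LLM-assisted adaptation to new hardware and tasks.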
Abstract: This paper presents webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense (XIAO ML Kit, $15–40 USD). Acting as a browser-based companion to the on-device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium-based browser. The system targets educators, small businesses, and researchers who need to train task-specific visual classifiers under their exact deployment conditions. Key capabilities include: in-browser firmware flashing via esptool-js; an SD card file browser with image preview and inline editing; config.json live-sync for zero-recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; TensorFlow.js CNN training completing a three-class run (~30 images per class, 20 epochs) in approximately 1 minute browser-side versus 9 minutes on-device, enabling a complete collect-train-deploy cycle in under 10 minutes; weight export as myWeights.bin and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five-run consistency evaluation on the three-class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM-assisted adaptation to new hardware and tasks. All source code is MIT-licensed at https://github.com/webmcu-ai/webmcu-vision-web.
[64] ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes cs.CV | cs.AIPDF
Haonan Chen, Kaiwen Xiao, Bin Tian, Jun Fu
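TL;DR: This paper presents ParkingScenes, a comprehensive multimodal simulation dataset designed specifically for end-to-end autonomous parking. Built on the CARLA simulator, it contains structured parking trajectories and rich sensor data (RGB cameras, depth sensors, vehicle states, and bird's-eye views), and aims to address the lack of high-quality structured data in this area.
Details
Motivation: End-to-end learning is promising for autonomous parking, but the lack of high-quality, structured datasets tailored to parking scenarios is a bottleneck. The paper aims to fill this gap with a scalable, reproducible benchmark for learning-based parking systems.
Result: Under identical conditions, models trained on ParkingScenes significantly outperform models trained on unstructured, manually collected simulation data, which supports the value of structured supervision for learning robust and accurate parking policies.
Insight: The main contribution is a parking-specific simulation dataset that combines structured trajectories (generated by a Hybrid A* planner and an MPC controller) with multimodal perception data. Its structured, reproducible data-generation framework supplies high-quality supervision signals and a reliable evaluation benchmark for end-to-end parking policy learning.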
Abstract: Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant bottleneck. To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105,000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird’s-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan-ai/ParkingScenes
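The episode count in the abstract can be checked directly from the stated scenario counts. The short sketch below only reproduces that arithmetic; the variable names are illustrative, and the average frames per episode is a derived figure, not one the authors report.

```python
# Counts taken from the abstract; variable names are illustrative.
reverse_in_scenarios = 16
parallel_scenarios = 6
pedestrian_conditions = 2   # pedestrians present / absent
repeats_per_scenario = 16

episodes = (reverse_in_scenarios + parallel_scenarios) * pedestrian_conditions * repeats_per_scenario
print(episodes)  # 704, matching the abstract

# Derived, approximate: the abstract gives only "approximately 105000 frames".
approx_frames = 105_000
avg_frames_per_episode = approx_frames / episodes
print(round(avg_frames_per_episode))  # about 149
```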
[65] AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method cs.CVPDF
Deshui Miao, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang
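TL;DR: This report presents a Ref-VOS (Referring Video Object Segmentation) system built around Sa2VA and organized with explicit agent roles. An agent loop decides whether the initial semantic hypothesis produced by Sa2VA should be accepted, revised, or refined, and the system finally segments the object described by the text in the video.
Details
Motivation: The report addresses how to handle text-referred object segmentation in video more reliably for the MeViS-Text Track, especially when the target is absent or needs fine spatio-temporal localization; conventional pipelines may rely directly on the initial segmentation without any verification or correction step.
Result: The method placed third in the MeViS-Text Track of the 5th PVUW Challenge. The agent layer performs presence verification, temporal search, confidence-aware revision, and final mask refinement, with the aim of improving segmentation accuracy and robustness.
Insight: The key idea is to treat Sa2VA as a generator of dense semantic priors rather than final answers, and to add a multi-agent collaboration framework (a planner, temporal partition agents, scout agents, refinement agents, and so on) for dynamic decision-making and correction. This improves interpretability and adaptability and may transfer to other vision tasks that require complex spatio-temporal reasoning.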
Abstract: This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection controller repairs weak hypotheses, and a collaboration controller reconciles multiple agent branches. The result is a Ref-VOS system in which Sa2VA is responsible for dense grounded understanding, while the agent layer handles presence verification, temporal search, confidence-aware revision, and final mask refinement.
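The abstract mentions that refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. The report does not give the conversion rule, so the following is only a minimal sketch under assumptions: the mask is a 2D grid of 0/1 values, the box is the tight bounding box, and the point prompt is the foreground pixel nearest the centroid. The function name `mask_to_box_and_point` is hypothetical.

```python
def mask_to_box_and_point(mask):
    """Return ((x_min, y_min, x_max, y_max), (x, y)) for a binary mask, or None if empty.

    Assumption: `mask` is a list of rows containing 0/1 values.
    """
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None  # no foreground, so there is nothing to prompt with
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    box = (min(xs), min(ys), max(xs), max(ys))
    # The centroid can fall outside a non-convex mask, so snap to the nearest foreground pixel.
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    point = min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return box, point


plus_shape = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(mask_to_box_and_point(plus_shape))  # ((0, 0, 2, 2), (1, 1))
```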
[66] From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation cs.CV | cs.AIPDF
Zhong Han Ervin Yeoh, Jiang Kan
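TL;DR: This paper studies few-shot Precise Event Spotting (PES). To address frame-level localization of fine-grained events in fast-paced sports such as tennis, it proposes two complementary distillation strategies: Adaptive Weight Distillation (AWD) at the prediction level and Annealed Multimodal Distillation (AMD-FED) at the representation level. Both use multimodal knowledge transfer to improve generalization under limited supervision.
Details
Motivation: In fast-paced sports such as tennis, motion blur, subtle action differences, and limited annotated data make precise frame-level event localization difficult.
Result: In few-shot k-clip settings on the F3Set-Tennis(sub) dataset, both methods consistently outperform single-modality baselines and prior PES methods; AMD-FED also shows robust performance on a Figure Skating dataset.
Insight: The novelty lies in combining prediction-level and representation-level multimodal distillation. In particular, AMD-FED transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling, which improves generalization in few-shot settings.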
Abstract: Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.
[67] AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards cs.CV | cs.CL | cs.MMPDF
Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao
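TL;DR: This paper presents AeSlides, a reinforcement learning framework with verifiable rewards for improving aesthetic layout in LLM-based slide generation. It designs a suite of verifiable metrics that quantify slide layout quality and uses them to optimize the generation model directly with GRPO-based reinforcement learning, bridging the modality gap between text generation and visual aesthetic evaluation.
Details
Motivation: LLM-based slide generation is text-centric, yet its quality is governed by visual aesthetics; this modality gap often leads to aesthetically suboptimal layouts. Existing solutions rely either on costly visual reflection with limited gains or on fine-tuning with large datasets that offer only weak, indirect aesthetic supervision. Using aesthetic principles explicitly as supervision has not been explored.
Result: With only 5K training prompts on GLM-4.7-Flash, AeSlides raises aspect ratio compliance from 36% to 85% while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. In human evaluation the overall quality score rises from 3.31 to 3.56 (+7.6%), outperforming model-based reward optimization and reflection-based agentic approaches and slightly exceeding Claude-Sonnet-4.5.
Insight: The core contribution is a "verifiable aesthetic paradigm": a set of accurate, efficient, low-cost verifiable metrics that quantify layout problems (aspect ratio, whitespace, collisions, imbalance) and serve directly as reinforcement learning reward signals. This offers an efficient, scalable way to bridge the gap between text generation and visual evaluation without costly visual reflection or weakly supervised large-scale fine-tuning.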
Abstract: Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
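To make the idea of a verifiable layout reward concrete, the sketch below computes two rule-based checks on element bounding boxes: an aspect ratio check and a count of colliding element pairs. This is not the AeSlides metric suite, whose definitions are not given in the abstract; the `(x, y, w, h)` box format, the 16:9 target, the tolerance, and the reward weights are all assumptions made for illustration.

```python
from itertools import combinations


def overlap_area(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)


def collision_count(boxes):
    """Number of element pairs whose boxes overlap with positive area."""
    return sum(1 for a, b in combinations(boxes, 2) if overlap_area(a, b) > 0)


def aspect_ratio_ok(width, height, target=16 / 9, tol=0.01):
    """True if the slide canvas is within a relative tolerance of the target ratio (assumed 16:9)."""
    return abs(width / height - target) <= tol * target


def layout_reward(width, height, boxes, collision_penalty=0.5):
    """Toy scalar reward: +1 for a compliant aspect ratio, minus a hypothetical penalty per collision."""
    reward = 1.0 if aspect_ratio_ok(width, height) else 0.0
    return reward - collision_penalty * collision_count(boxes)


boxes = [(0, 0, 100, 100), (50, 50, 100, 100), (300, 300, 50, 50)]
print(collision_count(boxes))            # 1 (only the first two boxes overlap)
print(layout_reward(1280, 720, boxes))   # 0.5
```

Because every term is computed from geometry alone, such a reward can be evaluated without rendering the slide or querying a vision model, which is the property the abstract highlights.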
[68] ATTN-FIQA: Interpretable Attention-based Face Image Quality Assessment with Vision Transformers cs.CV | eess.IVPDF
Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira, Jan Niklas Kolf, Marco Huber
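TL;DR: This paper proposes ATTN-FIQA, a training-free face image quality assessment method that uses attention scores from pre-trained Vision Transformer-based face recognition models as quality indicators. It extracts the pre-softmax attention matrices of the final transformer block and aggregates multi-head attention over all patches, so an image-level quality score is obtained in a single forward pass with no model modification or extra training. Evaluation on eight benchmark datasets and four face recognition models shows that the method assesses face image quality effectively and provides spatial interpretability.
Details
Motivation: Existing face image quality assessment methods usually require computationally expensive multiple forward passes, backpropagation, or additional training, and only recent work has looked at Vision Transformers. The paper investigates whether the attention patterns inherent in pre-trained Vision Transformers can serve as an efficient, training-free quality indicator.
Result: Experiments on eight benchmark datasets and four face recognition models show that attention-based quality scores correlate effectively with face image quality and reveal which facial regions contribute most to the quality decision, providing spatial interpretability.
Insight: The novelty is using pre-softmax attention scores from a pre-trained Vision Transformer as a training-free quality indicator. The hypothesis is that attention magnitude intrinsically encodes quality: high-quality images yield focused, high-magnitude attention patterns, whereas degraded images yield diffuse, low-magnitude ones. The method needs only a single forward pass, so it is computationally efficient as well as interpretable.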
Abstract: Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies highlighted that these architectures inherently function as saliency learners with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments producing focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging, requiring only a single forward pass through pre-trained models without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
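The averaging step described in the abstract can be illustrated with a small pure-Python sketch. It assumes the pre-softmax scores are the usual scaled dot products between query and key vectors, and that "simple averaging" means a mean over all patch pairs followed by a mean over heads; the paper's exact aggregation may differ, and the function names are hypothetical.

```python
import math


def pre_softmax_scores(queries, keys):
    """Scaled dot-product logits Q K^T / sqrt(d) for one head (lists of equal-length vectors)."""
    d = len(queries[0])
    return [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
        for qi in queries
    ]


def attention_quality(heads):
    """Image-level score: mean pre-softmax logit over all patch pairs, then over heads.

    Assumption: `heads` is a list of (queries, keys) pairs taken from the final transformer block.
    """
    head_means = []
    for queries, keys in heads:
        scores = pre_softmax_scores(queries, keys)
        head_means.append(sum(map(sum, scores)) / (len(scores) * len(scores[0])))
    return sum(head_means) / len(head_means)


# Two patches, one head, 2-dimensional queries and keys.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
print(attention_quality([(q, k)]))  # 0.5 / sqrt(2), about 0.354
```

Scaling the queries by a constant scales the score by the same constant, which matches the intuition that stronger query-key alignment yields a higher quality value.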
[69] EX-FIQA: Leveraging Intermediate Early eXit Representations from Vision Transformers for Face Image Quality Assessment cs.CV | eess.IVPDF
Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira, Jan Niklas Kolf, Andrea Atzori
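TL;DR: This paper proposes EX-FIQA, the first comprehensive study of how intermediate-layer representations of Vision Transformers (ViTs) contribute to face image quality assessment. Through early exit mechanisms and score fusion strategies, it exploits complementary quality information captured at different network depths and achieves significant computational savings while remaining competitive.
Details
Motivation: Existing ViT-based face image quality assessment methods rely only on final-layer representations and ignore quality-relevant information captured at intermediate depths. The paper explores how to use ViT intermediate representations to improve quality assessment.
Result: Extensive evaluation on eight benchmark datasets with four face recognition models shows that the proposed score fusion strategy improves on single-exit approaches, with depth-weighted averaging giving the best quality assessment performance.
Insight: The work challenges the conventional view that only deep features matter for face analysis and shows that intermediate representations carry valuable quality information. It proposes a score fusion framework that requires no architectural changes or extra training, and its early exit analysis identifies favorable performance-efficiency trade-offs.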
Abstract: Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis reveals optimal performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. Our proposed quality fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by effectively leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.
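The depth-weighted averaging mentioned in the abstract assigns progressively higher importance to deeper blocks. The abstract does not state the weighting schedule, so the sketch below assumes linear weights 1, 2, ..., N purely for illustration; `depth_weighted_fusion` is a hypothetical name.

```python
def depth_weighted_fusion(block_scores):
    """Fuse per-block quality scores with weights that grow linearly with depth.

    Assumption: block_scores[0] comes from the shallowest block; the linear
    schedule 1..N is illustrative, not the schedule used in the paper.
    """
    weights = range(1, len(block_scores) + 1)
    return sum(w * s for w, s in zip(weights, block_scores)) / sum(weights)


# Three exits for brevity (the paper analyzes all twelve blocks).
print(depth_weighted_fusion([0.2, 0.4, 0.9]))  # (0.2 + 0.8 + 2.7) / 6, about 0.617
```

The fused value stays within the range of the individual scores, and no retraining is needed because the weights are fixed.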
[70] Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization cs.CVPDF
Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Lina Gokhale
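TL;DR: This paper presents ASTRA, a pan-cancer framework that addresses the fragmentation of tile-level representations produced by existing pathology foundation models. It integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields (classification category, cancer type, and anatomic site), supporting several classification tasks and text-guided tumor localization without pixel-level supervision.
Details
Motivation: The current ecosystem of pathology foundation models yields powerful but fragmented tile-level representations, which limits their use in tasks that require unified slide-level reasoning and interpretable links to clinical information. The paper aims to integrate these heterogeneous representations and ground them semantically with structured annotations from slide metadata, in order to support a broader range of clinical tasks.
Result: Developed on a CHTN cohort of 10,359 whole-slide images spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, reaching up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, the mean Dice is 0.897 on an annotated CHTN subset (n = 380) covering 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) covering four cancer types.
Insight: The novelty is a unified slide-level representation learning framework that integrates and semantically aligns heterogeneous foundation-model representations through sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts. The core insight is that minimal structured pathology annotation fields drawn from slide-level metadata provide effective semantic supervision, enabling both pan-cancer prediction and weakly supervised tumor localization in a single framework.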
Abstract: The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.
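The localization results above are reported as mean Dice. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient on flattened binary masks; treating two empty masks as a perfect match is a convention chosen for this example, not something the abstract specifies.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention for this sketch: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
print(dice([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```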
[71] EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving cs.CV | cs.CL | cs.ROPDF
Finn Rasmus Schäfer, Yuan Gao, Dingrui Wang, Thomas Stauner, Stephan Günnemann
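TL;DR: This paper introduces EgoDyn-Bench, a diagnostic benchmark for evaluating the ego-motion understanding of vision-centric foundation models (such as vision-language models, VLMs) in autonomous driving. By mapping continuous vehicle kinematics to discrete motion concepts, the benchmark decouples a model's internal physical logic from its visual perception. A large-scale empirical audit of more than 20 models reveals a significant "Perception Bottleneck": models hold logical physical concepts but fail to align them accurately with visual observations, and frequently underperform classical non-learned geometric baselines. Providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, indicating that vision and language are functionally disentangled: ego-motion logic comes almost entirely from the language modality, while visual observations add negligible signal.
Details
Motivation: Although VLMs have advanced high-level reasoning in autonomous driving, their ability to ground that reasoning in the underlying physics of ego-motion is poorly understood. The paper asks how to evaluate semantic ego-motion understanding in these models and diagnoses deficits in how physical reasoning is coupled to visual perception.
Result: The audit on EgoDyn-Bench covers more than 20 models, including closed-source MLLMs, open-source VLMs at multiple scales, and specialized VLAs. The models consistently fail to align physical concepts with visual observations and frequently fall below classical non-learned geometric baselines; this failure persists across model scales and domain-specific training. With explicit trajectory encodings, physical consistency is substantially restored for all evaluated models.
Insight: The contribution is a diagnostic benchmark for systematically evaluating ego-motion understanding in foundation models, which exposes a structural deficit in how current architectures couple visual perception with physical reasoning. A key finding is that ego-motion logic in existing models depends mainly on the language modality, with limited contribution from vision. This provides a diagnostic framework and design guidance for physically aligned embodied AI (for example, motion information may need to be integrated explicitly).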
Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model’s internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion, Physical Reasoning, Foundation Models
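A deterministic oracle that maps continuous kinematics to discrete motion concepts could look like the sketch below. The label vocabulary, the thresholds, and the sign convention (positive yaw rate means turning left) are all assumptions made for illustration; the benchmark's actual oracle is not described in the abstract.

```python
def motion_concept(speed_mps, accel_mps2, yaw_rate_rps,
                   stop_speed=0.1, accel_thresh=0.5, yaw_thresh=0.05):
    """Map one kinematic sample to (longitudinal, lateral) labels.

    All thresholds and label names are hypothetical. Positive yaw rate is
    taken as counterclockwise, i.e. a left turn.
    """
    if speed_mps < stop_speed:
        longitudinal = "stopped"
    elif accel_mps2 > accel_thresh:
        longitudinal = "accelerating"
    elif accel_mps2 < -accel_thresh:
        longitudinal = "braking"
    else:
        longitudinal = "cruising"

    if yaw_rate_rps > yaw_thresh:
        lateral = "turning left"
    elif yaw_rate_rps < -yaw_thresh:
        lateral = "turning right"
    else:
        lateral = "straight"
    return longitudinal, lateral


print(motion_concept(10.0, 1.0, 0.0))   # ('accelerating', 'straight')
print(motion_concept(8.0, -2.0, 0.2))   # ('braking', 'turning left')
```

Because the mapping is a pure function of the logged kinematics, the ground-truth label never depends on any model's perception, which is what allows the benchmark to separate physical logic from visual grounding.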
[72] Evaluating Remote Sensing Image Captions Beyond Metric Biases cs.CVPDF
Ziyun Chen, Fan Liu, Liang Yao, Chuanyi Zhang, Yuye Ma
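TL;DR: For remote sensing image captioning, this paper proposes ReconScore, a reference-free evaluation metric that judges caption quality by how well the original visual elements can be reconstructed from the text, removing human annotation bias. It finds that MLLMs without fine-tuning outperform fine-tuned models on zero-shot remote sensing image captioning, and on that basis proposes RemoteDescriber, a training-free generation method that achieves state-of-the-art performance on three datasets.
Details
Motivation: Current captioning evaluation relies on manually curated reference texts, which forces models to mimic particular annotation styles and masks the true descriptive ability of advanced foundation models. This calls the necessity of task-specific fine-tuning into question.
Result: Under ReconScore, MLLMs without fine-tuning surpass their fine-tuned counterparts on zero-shot remote sensing image captioning; RemoteDescriber achieves state-of-the-art performance on three datasets.
Insight: The contributions are the reference-free ReconScore metric that removes human annotation bias, the finding that fine-tuning may be unnecessary, and the training-free RemoteDescriber method, which uses ReconScore as a self-correction mechanism to improve semantic precision.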
Abstract: The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore’s reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
[73] IoT-Enhanced CNN-Based Labelled Crack Detection for Additive Manufacturing Image Annotation in Industry 4.0 cs.CVPDF
Mohsen Asghari Ilani, Yaser Mike Banad
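TL;DR: This paper proposes an IoT-enhanced deep learning framework for automated crack detection on additive manufacturing (AM) surfaces. The framework integrates IoT real-time monitoring, high-resolution imaging, and edge computing to enable continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving the accuracy and efficiency of AM quality control.
Details
Motivation: The paper addresses automated, real-time detection of surface cracks during additive manufacturing, in order to improve the precision and efficiency of quality control and to support predictive maintenance and adaptive control in an Industry 4.0 setting.
Result: On 14,982 images the system achieves 99.54% accuracy, 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation raise accuracy from 32% to 99%. On edge devices, model quantization and batch processing cut inference latency by 47%, and the MQTT-based data streaming system with 5G connectivity lowers transmission overhead by 35%.
Insight: The contributions include a real-time monitoring system that integrates IoT and edge computing, an optimized CNN with lower latency through quantization and batch processing, low-latency data streaming over MQTT and 5G, and Digital Twin (DT) technology for predictive defect analysis and dynamic adjustment of AM parameters. The framework offers a scalable, high-accuracy, low-latency approach to intelligent AM quality control.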
Abstract: This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.
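The reported precision, recall, and F1-score can be cross-checked with the standard harmonic-mean formula. The snippet below is only that consistency check on the figures quoted in the abstract.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract: 96% precision, 98% recall.
f1 = f1_score(0.96, 0.98)
print(round(f1, 4))  # 0.9699, which rounds to the reported 97%
```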
[74] Probing Visual Planning in Image Editing Models cs.CV | cs.AIPDF
Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma
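TL;DR: This paper proposes EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation, addressing the computational inefficiency of existing visual planning methods. To assess intrinsic reasoning ability, the authors introduce AMAZE, a dataset of abstract puzzles including Maze and Queen problems, and evaluate a range of leading image editing models.
Details
Motivation: Fully visual approaches to visual planning follow a step-by-step planning-by-generation paradigm that is computationally inefficient. The authors also want to separate visual planning from visual recognition so that reasoning ability can be evaluated in isolation.
Result: On AMAZE, all leading proprietary and open-source editing models struggle in the zero-shot setting. After fine-tuning on basic scales, models generalize remarkably well to larger in-domain scales and to out-of-domain scales and geometries. Even so, the best model running on high-end hardware fails to match the zero-shot efficiency of human solvers.
Insight: The novelty is the EAR paradigm, which turns multi-step planning into single-step image editing, together with AMAZE, an abstract puzzle dataset that isolates visual reasoning from visual recognition. This gives a new, automatically evaluable benchmark for the visual planning ability of neural networks.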
Abstract: Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, whereas fine-tuning on basic scales enables remarkable generalization to larger in-domain scales and to out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
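Automatic evaluation of logical validity is straightforward for abstract puzzles once the edited image has been decoded into a board state. As an illustration, the sketch below checks a candidate solution under the assumption that the Queen task follows the classical N-Queens rules; the decoding step and the dataset's exact rule set are outside this sketch, and `queens_valid` is a hypothetical helper.

```python
def queens_valid(cols):
    """Check a classical N-Queens placement.

    Assumption: cols[r] is the column of the queen in row r, one queen per row.
    """
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # a column is repeated or out of range
    # Queens share a diagonal when r - c or r + c coincide.
    diag_down = {r - c for r, c in enumerate(cols)}
    diag_up = {r + c for r, c in enumerate(cols)}
    return len(diag_down) == n and len(diag_up) == n


print(queens_valid([1, 3, 0, 2]))  # True
print(queens_valid([0, 1, 2, 3]))  # False (all queens on one diagonal)
```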
[75] Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles cs.CV | eess.SYPDF
Md Tanjemul Islam, Md Rafiul Kabir
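TL;DR: This paper proposes a lightweight visual perception framework for resource-constrained autonomous vehicles that integrates lane detection, lane tracking, and traffic sign recognition. It uses threshold-based lane segmentation combined with perspective transformation and histogram-based curvature estimation for robust lane tracking, and evaluates two lightweight CNNs, EfficientNet-B0 and MobileNetV2, for traffic sign recognition. Experiments show real-time performance with a maximum offset RMSE of only 3.16% for lane tracking; EfficientNet-B0 reaches 98.77% accuracy on the offline test set and 90% in real-time on-device deployment.
Details
Motivation: Implementing reliable real-time perception algorithms on embedded platforms with limited compute is challenging; the paper offers a lightweight visual perception solution for resource-constrained autonomous vehicles.
Result: Lane tracking achieves a maximum offset RMSE of 3.16%. For traffic sign recognition, EfficientNet-B0 reaches 98.77% accuracy on the offline test set and 90% in real-time on-device deployment, outperforming MobileNetV2 in both settings; MobileNetV2 offers slightly faster inference and lower computational cost.
Insight: The contribution is the combination of computationally efficient threshold-based lane segmentation with perspective transformation and histogram-based curvature estimation for robust lane tracking, plus a systematic comparison of the trade-offs between two lightweight CNNs (EfficientNet-B0 and MobileNetV2) for embedded real-time traffic sign recognition. It serves as a practical design reference for lightweight perception pipelines in resource-constrained settings.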
Abstract: Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle’s onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.
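The histogram step in such a lane pipeline is commonly implemented by summing lane pixels per column in the lower half of a thresholded, perspective-warped image and taking one peak on each side. The sketch below shows that common technique on a tiny binary grid; it is not the authors' implementation, and the function names and the normalized offset definition are assumptions.

```python
def lane_bases(binary):
    """Return (left_x, right_x): column-histogram peaks in the lower half of a 0/1 image."""
    height, width = len(binary), len(binary[0])
    lower_half = binary[height // 2:]
    histogram = [sum(row[x] for row in lower_half) for x in range(width)]
    mid = width // 2
    left_x = max(range(mid), key=histogram.__getitem__)
    right_x = max(range(mid, width), key=histogram.__getitem__)
    return left_x, right_x


def center_offset(left_x, right_x, width):
    """Lane-center offset from the image center, as a fraction of image width (assumed definition)."""
    return ((left_x + right_x) / 2 - width / 2) / width


# 4 x 8 toy image with lane pixels in columns 1 and 6 of the lower rows.
image = [[0] * 8 for _ in range(4)]
for y in (2, 3):
    image[y][1] = 1
    image[y][6] = 1

left, right = lane_bases(image)
print(left, right)                    # 1 6
print(center_offset(left, right, 8))  # -0.0625
```

A rule-based steering controller of the kind mentioned in the abstract could then turn proportionally to this offset, although the paper's controller rules are not given here.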
[76] SketchVLM: Vision language models can annotate images to explain thoughts and guide users cs.CV | cs.AIPDF
Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui
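TL;DR: SketchVLM is a training-free, model-agnostic framework that lets vision-language models (VLMs) produce non-destructive, editable SVG overlays on the input image to explain their answers visually. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), it improves both task accuracy and annotation quality.
Details
Motivation: Current VLMs such as Gemini-3-Pro and GPT-5 answer only in text, which can be hard for users to verify. The goal is to let models explain their reasoning visually by pointing, labeling, and drawing, as humans do.
Result: Across the seven benchmarks, SketchVLM improves visual reasoning accuracy by up to 28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while producing annotations that are more faithful to the model's stated answer. Single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens further opportunities for human-AI collaboration.
Insight: The novelty is a framework that needs no additional training and adapts to different VLMs, using editable SVG overlays for visual explanation. This improves the interpretability of model outputs and user interaction, and opens new possibilities for human-AI collaboration.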
Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model’s stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
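An overlay is non-destructive when it lives in a separate SVG layer sized to the image rather than being drawn into the pixels. The sketch below builds such a layer from a few circles and text labels; the element choices, styling, and the `svg_overlay` helper are illustrative assumptions and do not reflect SketchVLM's actual output format.

```python
from xml.sax.saxutils import escape


def svg_overlay(width, height, circles, labels):
    """Build an SVG string matching the image size.

    circles: iterable of (cx, cy, r); labels: iterable of (x, y, text).
    The underlying image is untouched; a viewer stacks this layer on top of it.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
    ]
    for cx, cy, r in circles:
        parts.append(
            f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="red" stroke-width="3"/>'
        )
    for x, y, text in labels:
        # Escape label text so characters such as "<" or "&" keep the SVG well-formed.
        parts.append(f'<text x="{x}" y="{y}" fill="red" font-size="16">{escape(text)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


overlay = svg_overlay(640, 480, [(320, 240, 40)], [(370, 240, "object 1 <cup>")])
print(overlay)
```

Since each annotation is a separate SVG element, a user can move or delete one mark without regenerating the rest, which is the editability the abstract refers to.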
[77] Can Multimodal Large Language Models Truly Understand Small Objects? cs.CV | cs.AIPDF
Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao
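TL;DR: This paper proposes SOUBench, the first comprehensive benchmark for the small object understanding (SOU) capability of multimodal large language models (MLLMs), comprising the SOU-VQA evaluation dataset and the SOU-Train training dataset. It finds that existing MLLMs are weak at SOU and shows through supervised fine-tuning that SOU-Train effectively improves this capability.
Details
Motivation: Existing MLLMs show promise on many understanding tasks, but their capability on small object understanding has not been explored or evaluated, leaving a research gap.
Result: On the SOU-VQA benchmark, which contains 18,204 VQA pairs, six sub-tasks, and three main scenarios, a comprehensive evaluation of 15 state-of-the-art MLLMs reveals weak small object understanding. Supervised fine-tuning of the latest MLLM on SOU-Train effectively improves its small object understanding.
Insight: The novelty lies in building the first comprehensive benchmark and datasets for the small object understanding of MLLMs, together with an automatic visual question-answer generation strategy. This gives the community an empirical foundation and data resources for evaluating and improving small object understanding.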
Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, their capability on Small Object Understanding (SOU) tasks remains unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
[78] Text-Guided Multimodal Unified Industrial Anomaly Detection cs.CVPDF
Zewen Li, Shuo Ye, Zitong Yu, Weicheng Xie, Linlin Shen
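TL;DR: This paper proposes a text-guided, unified multimodal framework for industrial anomaly detection. A geometry-aware cross-modal mapper and an object-conditioned textual feature adaptor address the ambiguous cross-modal alignment and insufficient geometric modeling of existing unsupervised methods, and a single model performs anomaly detection across multiple classes.
Details
Motivation: Existing industrial anomaly detection methods based on RGB-3D multimodal data have two key limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance, and insufficient geometric modeling in RGB-to-3D feature mapping.
Result: Extensive experiments on the MVTec 3D-AD and Eyecandies datasets show that the method achieves state-of-the-art classification and localization performance under unsupervised settings.
Insight: The novelties are text-semantic guidance for aligning multimodal features, a geometry-aware cross-modal mapper that preserves structural information, and a unified learning paradigm that breaks the one-model-one-class constraint so that a single model detects anomalies across classes.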
Abstract: Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.
[79] On the Complementarity of Quantum and Classical Features: Adaptive Hybrid Quantum-Classical Feature Fusion for Breast Cancer Classification cs.CV | cs.AIPDF
Yasmin Rodrigues Sobrinho, João Renato Ribeiro Manesco, João Paulo Papa
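TL;DR: This paper proposes a new hybrid quantum-classical architecture for breast cancer classification. It uses a dual-branch feature-extraction pipeline, explores both trainable and deterministic quantum paradigms, and introduces three progressive feature fusion strategies to unify complementary representations.
Details
Motivation: Quantum machine learning and classical deep learning are hard to unify effectively in medical image analysis because of optimization asymmetries. The paper aims to improve diagnostic performance by fusing complementary features.
Result: On the BreastMNIST dataset, the proposed TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, reaches a peak accuracy of 87.82%, an F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines.
Insight: The novelty lies in the three progressive feature fusion strategies, in particular TSHF, which is inspired by multimodal learning and uses a learnable scalar to dynamically balance hybrid gradient dynamics and resolve optimization bottlenecks. It thereby unifies diverse feature representations, and the authors present it as a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.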
Abstract: The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.
[80] AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide, Mixup Augmentation, and Persistent Patient History Management cs.CV | cs.LG | cs.SEPDF
Rahul Patel
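TL;DR: This paper presents AnemiaVision, a non-invasive anemia screening system based on smartphone photographs. It takes images of the palpebral conjunctiva and fingernail beds and uses a modified EfficientNet-B3 model with several data augmentation and training techniques to detect anemia with high accuracy. The system is deployed as a web application with patient-history management.
Details
Motivation: Anemia affects more than one billion people worldwide and is under-diagnosed, especially in low-resource regions without access to laboratory blood tests. The paper offers a low-cost, non-invasive screening tool.
Result: The system reaches 96.2% accuracy and an AUC-ROC of 0.98 on the validation set, with a sensitivity of 0.96 for the anemic class. This is far above the baseline trained for three epochs on CPU only (44.9% accuracy, AUC-ROC 0.58) and suggests potential as a first-line screening tool.
Insight: The novelties are early stopping governed by validation accuracy, a training recipe that combines four orthogonal accuracy-boosting techniques including TrivialAugmentWide and Mixup, and a lightweight web deployment with persistent patient-history management, which together support robustness and practical feasibility.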
Abstract: Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.
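The Mixup step named in this abstract (alpha=0.2) can be sketched as below. This is the standard Mixup formulation on plain Python lists, not code from the paper; the function name `mixup` and the one-hot label format are assumptions for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with a Beta(alpha, alpha) weight.

    x_a, x_b: flat lists of pixel values; y_a, y_b: one-hot label lists.
    Returns the mixed input, the mixed (soft) label, and the mixing weight.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

With a small alpha such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals.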
[81] CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging cs.CV | cs.AIPDF
Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma
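TL;DR: CheXmix is an early-fusion generative pretraining model for medical imaging. It processes image and text tokens in a single unified sequence, combines the strengths of masked autoencoders and multimodal large language models, is trained on a large corpus of chest X-rays paired with radiology reports, and supports discriminative and generative tasks from coarse to fine granularity.
Details
Motivation: In existing medical multimodal foundation models (two-stage approaches built on CLIP and LLaVA), the projection layer can distort visual features. This is a particular concern in medical imaging, where subtle cues are essential for accurate diagnosis. The paper therefore proposes an early-fusion generative approach that removes this bottleneck and enables joint representation learning.
Result: On the CheXpert classification task, CheXmix outperforms established generative models by 6.0% across all masking ratios and surpasses CheXagent by 8.6% on AUROC at high image masking ratios. It inpaints images over 51.0% better than text-only generative models and outperforms CheXagent by 45% on the GREEN metric for radiology report generation, showing that it captures fine-grained information across a broad range of chest X-ray tasks.
Insight: The novelty is a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with the generative strengths of multimodal large language models. The result is a flexible unified architecture that handles both discriminative and generative tasks, improving accuracy and generalization in medical image analysis.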
Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon’s autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
[82] Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery cs.CV | cs.LGPDF
Sulagna Saha, Arthur Ouaknine, Etienne Laliberté, Carol Altimas, Evan M. Gora
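TL;DR: This study examines the representation gap between spatial scales (high-resolution close-up images versus coarser-resolution top-view aerial images) in tropical tree species classification from unoccupied aerial vehicle (UAV) imagery. Through fine-tuning experiments it quantifies the performance difference between vision foundation models and in-domain generalist plant recognition models on both image types. Classification on close-up images is consistently better than on top-view images, and the gap widens for rare species. The paper proposes that self-supervised representation alignment across spatial scales could integrate fine-grained visual information into canopy-level species classification models based on top-view imagery, improving large-scale monitoring of tropical forest biodiversity.
Details
Motivation: High species diversity and strong visual similarity among tropical tree species make accurate classification from top-view UAV imagery at typical resolutions (centimeters per pixel) challenging, whereas close-up citizen science photographs taken with smartphones yield strong plant species classification. Recent advances in UAV data acquisition make it possible to collect high-resolution close-up images that are spatially registered with top-view imagery, but only with limited coverage. The study evaluates existing methods on paired multi-view UAV imagery and quantifies the representation gap between scales.
Result: Experiments on paired UAV imagery collected in a species-rich tropical forest show that classification performance is consistently higher on close-up images than on top-view aerial images, and that this gap widens further for rare species. Through fine-tuning experiments, the study quantifies the performance difference between vision foundation models (such as CLIP) and in-domain generalist plant recognition models (such as PlantNet) on both image types.
Insight: The novelty is a systematic evaluation of the cross-scale (close-up versus top-view) representation gap in UAV-based tree species classification, which shows that rare species are affected most. The proposed self-supervised cross-scale representation alignment is a promising direction for using fine-grained information to improve canopy-level classification when high-resolution data are limited, and is relevant to large-scale remote-sensing-based biodiversity monitoring.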
Abstract: Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.
[83] From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models cs.CV | cs.AIPDF
Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildirim Yayilgan, Sarang Shaikh
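TL;DR: This study combines strong discriminative models with multimodal explanations for diabetic retinopathy (DR) grading, aiming to convert retinal pixels into clinically interpretable outputs. It evaluates several CNN and Transformer backbones on the APTOS 2019 benchmark, compares ensembling strategies, and uses Grad-CAM++ and vision-language models (VLMs) to provide visual and textual explanations.
Details
Motivation: Current deep-learning classifiers for DR screening are hard to interpret in a clinical context. The study improves the interpretability of grading by combining strong discriminative models with multimodal explanations.
Result: On the APTOS 2019 benchmark, modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) are the strongest single models, with cross-validated QWK of 0.919 and 0.914, respectively. Weighted soft voting is the most consistent ensembling strategy (QWK 0.934 +/- 0.017). VLM explanations are generally grade-consistent but show a trade-off between clinical completeness and semantic similarity.
Insight: The novelties are the combination of CNN-Transformer ensembles with multimodal (visual and textual) explanations to improve clinical interpretability, and the exploration of different ensembling strategies and of VLMs under conservative prompting. Generating textual rationales with VLMs offers a new approach to the interpretability of deep-learning models in medicine.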
Abstract: The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).
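The QWK figures in this abstract refer to quadratic weighted kappa, an agreement measure that penalizes a misgrade more heavily the further it is from the true grade. A minimal pure-Python sketch of the standard definition follows; the function name and argument layout are illustrative and are not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa for integer grades in [0, num_classes).

    Undefined (division by zero) when every sample has the same grade
    in both y_true and y_pred.
    """
    n = len(y_true)
    # Confusion matrix: observed[t][p] counts samples with true grade t, predicted p.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why a score such as 0.934 indicates strong ordinal agreement across the five DR grades.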
[84] INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety cs.CV | cs.ETPDF
Alexander Nikitas Dimopoulos, Joseph Grasso, John Beltz
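TL;DR: This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data, addressing the lack of spatial intelligence infrastructure indoors. Two interchangeable vision stacks (text-prompted segmentation with the SAM3 foundation model, and a traditional CV stack) produce labeled point clouds and scene graphs. Evaluated on the Stanford 2D-3D-S dataset, the pipeline achieves strong data compression and fast transmission, targeting public safety use.
Details
Motivation: Indoor environments lack GPS-like spatial intelligence infrastructure. In particular, first responders in unfamiliar buildings have no machine-readable map of safety equipment. Existing 3D semantic segmentation faces two barriers: scarce labeled data and poor recognition of small safety-critical features by point-cloud methods.
Result: Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with a compression ratio of about 10^4. Role-filtered payloads transmit in under 15 seconds at 1 Mbps over FirstNet Band 14. The paper reports per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks, and code-capped deployable estimates, demonstrating the effectiveness of 2D-to-3D semantic transfer.
Insight: The novelties are a zero-target-domain-annotation pipeline that resolves the labeled-data bottleneck through 2D-to-3D semantic transfer, and the pairing of a SAM3 foundation-model stack with a traditional CV stack (open-set detection, VQA, OCR) that is interchangeable and exposes independently inspectable intermediate outputs. The scene graphs provide building intelligence compact enough for field deployment, giving public safety applications an efficient geometric-semantic hierarchy transfer method.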
Abstract: Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with roughly 10^4x compression; role-filtered payloads transmit in under 15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
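The core geometric step, lifting a 2D segmentation mask into 3D using registered depth, can be sketched with the standard pinhole camera model. This is an illustrative assumption about how such a projection is typically done, not INSIGHT's code: the function names and the `(fx, fy, cx, cy)` intrinsics tuple are hypothetical, and points stay in the camera frame (no pose transform is applied).

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def label_points(mask, depth, intrinsics, label):
    """Return (x, y, z, label) for every masked pixel that has valid depth.

    mask: rows of 0/1 ints from a 2D segmenter; depth: rows of metres (0 = missing).
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, row in enumerate(mask):
        for u, on in enumerate(row):
            d = depth[v][u]
            if on and d > 0:
                points.append((*backproject(u, v, d, fx, fy, cx, cy), label))
    return points
```

Because the labels come from the 2D model, no 3D annotation in the target domain is needed, which is the sense in which the transfer is annotation-free.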
[85] Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label cs.CV | cs.LGPDF
Mengke Li, Haiquan Ling, Yiqun Zhang, Yang Lu, Hui Huang
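TL;DR: This paper proposes Weak Teacher Supervision (WTS), which uses the cross-modal alignment of pre-trained vision-language models to address label-image mismatch in long-tailed data with high label noise, improving model robustness.
Details
Motivation: Real-world data are often long-tailed and contain many noisy labels, which severely degrades deep models. Existing methods overlook the severe label-image mismatch inherent to high-noise settings and are therefore of limited effectiveness.
Result: Extensive experiments on synthetic and real-world datasets show that WTS performs better than prior methods, especially under high-noise conditions.
Insight: The novelty is to use auxiliary text information from the labels, correcting label inconsistencies through the cross-modal alignment of a pre-trained vision-language model. This weak supervisory signal is unaffected by label noise and data distribution bias. Combining text guidance with a noise assessment offers a new approach to long-tailed noisy data.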
Abstract: Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.
[86] CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model cs.CV | cs.AI | q-bio.QMPDF
Syed Ibad Hasnain, Muhammad Faris, Hafiza Syeda Yusra Tirmizi, Rabail Khowaja, Hafsa Israr
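TL;DR: This paper proposes a hybrid deep learning model for brain tumor MRI classification. An adaptive attention gate dynamically fuses a CNN branch (SqueezeNet-style), which extracts local texture and spatial information, with a ViT branch (MobileViT-style), which captures long-range global dependencies, giving a context-sensitive fusion of local and global features.
Details
Motivation: Early detection and classification of brain tumors from MRI is important, but extracting features from medical images is difficult. CNNs are strong at local features and ViTs at global dependencies, yet existing methods do not combine the two effectively. The paper aims to improve classification through dynamically weighted fusion.
Result: Trained and evaluated on the Brain Tumor MRI Dataset from Kaggle, the model reaches a test accuracy of 97.60%, a precision of 97.30%, a recall of 97.50%, an F1-score of 97.40%, and a macro-average AUC of 0.9946. These results are better than single CNN and ViT baselines and current competitive fusion methods, reaching state-of-the-art performance.
Insight: The main novelty is the adaptive attention gate, which performs per-sample, per-feature dynamic weighting rather than fixed weights or simple concatenation. This offers an effective feature-fusion approach for hybrid architectures, particularly in medical image analysis.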
Abstract: Early detection and classification of brain tumors from Magnetic Resonance Imaging (MRI) images is highly important, but extracting discriminative features from medical images is difficult. Convolutional Neural Networks (CNNs) are good at capturing local texture and spatial information, whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. In this paper, we propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch through an Adaptive Attention Gate mechanism. The gate dynamically learns per-sample, per-feature weights that set the contribution of each branch, allowing context-sensitive merging of local and global representations. Trained and evaluated on the Brain Tumor MRI Dataset (Kaggle), the proposed model achieves a test accuracy of 97.60%, a precision of 97.30%, a recall of 97.50%, an F1-score of 97.40%, and a macro-average area under the curve (AUC) of 0.9946. These scores are higher than those of single CNN and ViT baselines and of current competitive fusion methods, showing that dynamic feature weighting is an effective way to classify medical images.
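The abstract does not give the exact form of the Adaptive Attention Gate, so the following is only a sketch of one common gating pattern that matches the description of per-sample, per-feature weights: a sigmoid gate computed from the concatenated branch features, used as a convex mixing weight. The function `gated_fusion` and its `weights` and `bias` parameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(local_feat, global_feat, weights, bias):
    """Fuse two d-dimensional feature vectors with a learned per-feature gate.

    weights: d rows of length 2*d applied to [local_feat, global_feat];
    bias: d values. Returns (fused, gates), with each gate in (0, 1).
    """
    concat = list(local_feat) + list(global_feat)
    fused, gates = [], []
    for i, (loc, glo) in enumerate(zip(local_feat, global_feat)):
        z = sum(w * c for w, c in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(z)
        gates.append(gate)
        # The gate weights the CNN (local) branch; its complement weights the ViT branch.
        fused.append(gate * loc + (1.0 - gate) * glo)
    return fused, gates
```

Because the gate depends on the input features, the mixing ratio changes from sample to sample, unlike a fixed weighted sum or plain concatenation.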
[87] UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks cs.CV | cs.AIPDF
Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan
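TL;DR: This paper proposes UpstreamQA, a modular explicit-reasoning framework for video question answering. Upstream large reasoning models perform object identification and scene context generation, producing enriched reasoning traces that are passed to downstream large multimodal models for final answer generation. Experiments on OpenEQA and NExTQA show that explicit reasoning can significantly improve downstream VideoQA performance and interpretability, but can also degrade performance when the baseline is already strong.
Details
Motivation: Current large multimodal models (LMMs) carry out multi-step reasoning for VideoQA implicitly and opaquely. Large reasoning models (LRMs) generate explicit intermediate logical steps that improve interpretability, but they are not designed for native video understanding and typically rely on static frame sampling. A method is needed that combines explicit reasoning with video understanding.
Result: The framework is evaluated on OpenEQA and NExTQA with two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Explicit reasoning significantly improves downstream VideoQA performance and interpretability in several scenarios, but can reduce performance when the baseline is already high.
Insight: The core novelty is a modular framework that decouples explicit reasoning (object identification, scene understanding) from final answer generation, using interpretable traces from an upstream LRM to strengthen a downstream LMM. It offers a principled way to combine explicit reasoning with multimodal understanding, improving diagnostic transparency along with performance.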
Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task’s inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.
[88] BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning cs.CVPDF
Hongxiang Peng, Dewei Bai, Hong Qu
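TL;DR: BSViT is a burst-spiking Vision Transformer. With a dual-channel burst spiking self-attention mechanism and a local attention masking strategy, it addresses the limited information capacity of binary spike coding and the dense computation of global self-attention in conventional spiking Vision Transformers, improving the expressiveness of visual representation learning while staying energy-efficient.
Details
Motivation: Existing spiking Vision Transformers are limited by the low information capacity of binary spike coding and by the computational cost of the dense token interactions introduced by global self-attention. BSViT addresses both by increasing spike representational capacity and introducing structure-aware sparsity.
Result: On static and event-based vision benchmarks, BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
Insight: The novelties are: 1) a dual-channel burst spiking self-attention mechanism that encodes queries with binary spikes and keys with burst spikes to increase representational capacity, and uses dual excitatory and inhibitory channels for signed modulation; 2) a local attention masking strategy that introduces spatial priors to reduce spike activity and computation; 3) systematic integration of burst spike coding across the network, exceeding the representational capacity of conventional binary spikes, while the whole attention operation remains addition-only and compatible with neuromorphic hardware.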
Abstract: Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
[89] One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition cs.CVPDF
Balaji Darur, Amanmeet Garg, Makarand Tapaswi
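TL;DR: This paper proposes Multimodal Entity Coreference (MEC) to identify key entities consistently across shots and varied appearances in Video Situation Recognition (VidSitu). The authors introduce CineMEC, a multi-stage approach that needs no explicit grounding supervision and unites textual descriptions with visual grounding by linking event role mention groups in text to visual clusters of entities. Evaluated on VidSitu extended with grounding annotations, it improves both captioning and visual grounding.
Details
Motivation: VidSitu aims at a thorough understanding of complex situations in video: who did what to whom, with what, how, and where. Its core challenge is the spatio-temporal localization of entities across shots and varied appearances. The authors argue that coherent video understanding requires consistent identification of entities that play different roles, whereas prior work focuses mainly on descriptions and neglects the synergy with visual grounding.
Result: On the VidSitu dataset extended with grounding annotations, CineMEC improves captioning by 2.5% CIDEr and 7% LEA, and improves visual grounding by 18% HOTA.
Insight: The main novelty is the MEC task and the CineMEC method, which associates textual role mentions with visual entity clusters without grounding supervision and thereby improves captioning and visual grounding together. The key idea is to exploit the mutual reinforcement between the two tasks to obtain more consistent video understanding without explicit grounding supervision.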
Abstract: Video Situation Recognition (VidSitu) addresses the challenging problem of “who did what to whom, with what, how, and where” in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA).
[90] AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval cs.CV | cs.AIPDF
Yihan Wang, Lei Li, Yao Lai, Jing Wang, Yan Lu
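TL;DR: This paper presents AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. After repairing the dataset, it encodes schematics and functional descriptions with a vision-language model and netlists with a graph convolutional network, and maps them into a shared embedding space through curriculum contrastive learning, enabling cross-modal semantic retrieval.
Details
Motivation: Analog circuit design relies heavily on reusing existing intellectual property, but searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are mostly limited to exact matching within a single modality and cannot capture cross-modal semantic relationships.
Result: AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and completes tasks that were previously unsolved.
Insight: The novelties are a high-quality repaired dataset and a tri-modal encoding architecture that combines a vision-language model with a port-aware relational graph convolutional network, aligned across modalities through curriculum contrastive learning. This offers a new approach to unified representation and retrieval of heterogeneous circuit design data.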
Abstract: Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.
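Recall@1 in a shared embedding space, the headline metric here, can be computed as in the following sketch. It assumes that query `i` and gallery item `i` describe the same circuit and that similarity is cosine similarity; both are illustrative assumptions, since the abstract does not state the evaluation details.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(queries, gallery):
    """Fraction of queries whose nearest gallery embedding is their own match."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(gallery)), key=lambda j: cosine(query, gallery[j]))
        hits += int(best == i)
    return hits / len(queries)
```

Running this once per direction (for example netlist to schematic, or description to netlist) and averaging gives a single figure comparable in spirit to the reported average over six directions.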
[91] Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing cs.CVPDF
Masoumeh Chapariniya, Jean-Marc Odobez, Volker Dellwo, Teodora Vuković
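TL;DR: This paper proposes a preprocessing-free identity verification method for talking-head videos. Inter-frame feature differencing captures the driver-specific motion dynamics of micro-expressions, reaching an AUC of 0.877 on the NVFAIR dataset and matching or exceeding landmark-based baselines.
Details
Motivation: Existing identity verification methods for talking-head videos depend on a fixed, non-differentiable landmark extraction stage, which prevents end-to-end optimization from raw pixels to output and limits model performance.
Result: On NVFAIR, the method reaches an overall AUC of 0.877 and matches or exceeds the landmark-based baseline on most cross-generator pairs. Ablations confirm that temporal motion features account for the large majority of discriminative performance.
Insight: The core novelty is the combination of the micro-expression-aware F5C backbone with inter-frame feature differencing: consecutive feature maps are subtracted directly in the learned deep feature space, suppressing stable appearance dimensions while preserving driver-specific motion dynamics, which makes end-to-end optimization possible.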
Abstract: Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.
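The differencing principle itself is simple enough to show in a few lines. The sketch below operates on per-frame feature vectors given as plain lists; the backbone that would produce those features is outside its scope, and the function name `frame_differences` is hypothetical.

```python
def frame_differences(features):
    """Subtract each frame's feature vector from the next one.

    features: list of T per-frame feature vectors of equal length.
    Returns T-1 difference vectors; dimensions that stay constant over time
    (stable appearance) become zero, while time-varying ones (motion) remain.
    """
    return [
        [cur - prev for prev, cur in zip(prev_vec, cur_vec)]
        for prev_vec, cur_vec in zip(features, features[1:])
    ]
```

In the example below, the first dimension plays the role of a static appearance feature and vanishes, while the second carries the motion signal.

```python
frames = [[1.0, 5.0], [1.0, 6.0], [1.0, 8.0]]
print(frame_differences(frames))  # [[0.0, 1.0], [0.0, 2.0]]
```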
[92] SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation cs.CVPDF
Kaiwen Huang, Yi Zhou, Yizhe Zhang, Jingxiong Li, Tao Zhou
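TL;DR: This paper proposes SemiGDA, a generative dual-distribution alignment framework for semi-supervised medical image segmentation. A Dual-distribution Alignment Module (DAM) aligns image and mask feature distributions in the latent space, and a Consistency-Driven Skip Adapter (CDSA) strategy strengthens cross-branch semantic alignment, so that unlabeled data improves model performance.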
Details
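Motivation: Traditional discriminative segmentation relies on segmentation masks and neglects feature-level distribution constraints, which limits robust semantic representation learning and adaptive modeling of unlabeled data when labels are scarce.
Result: Experiments on multiple medical datasets show that the method outperforms other state-of-the-art semi-supervised segmentation methods.
Insight: The novelty is a generative dual-distribution alignment framework: two structurally distinct encoders model the image and mask feature distributions and align them, while skip adapters driven by a consistency loss fuse multi-scale features and reinforce fine-grained semantic consistency, reducing reliance on large labeled datasets.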
Abstract: Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: https://github.com/taozh2017/SemiGDA.
[93] Lightweight and Production-Ready PDF Visual Element Parsing cs.CV | cs.AI | cs.CLPDF
Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li
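TL;DR: This paper presents a lightweight, production-ready framework for parsing visual elements in PDFs. It aims to accurately detect figures, tables, and forms and to reliably associate captions with their elements. The method combines spatial heuristics, layout analysis, and semantic similarity, and achieves high accuracy and efficiency on several benchmarks and on production data.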
Details
Motivation: Existing PDF parsers struggle with complex visual elements: they miss elements, extract non-informative content (e.g., watermarks, logos), produce fragmented results, and fail to reliably associate captions with elements, which degrades downstream tasks such as retrieval and question answering.
Result: On popular benchmark datasets and internal product data, the method achieves ≥96% visual element detection accuracy and 93% caption association accuracy. As a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by more than 2×.
Insight: The novelty is a lightweight ensemble of spatial heuristics, layout analysis, and semantic similarity rather than a single complex model, which yields low latency and easy deployment while keeping accuracy high. This offers an efficient, reliable option for document understanding in production.
Abstract: PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves $\geq 96\%$ visual element detection accuracy and $93\%$ caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over $2\times$. We have deployed the proposed system in a challenging production environment.
[94] Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search cs.CV | cs.MMPDF
Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin
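TL;DR: This paper proposes the Structure-Semantic Decoupled Cascade (SSDC) framework for text-based person anomaly search. Retrieval is split into two stages: a lightweight model first performs coarse retrieval by skeletal similarity, then a multi-agent interaction module comprising a Detective, an Analyst, and a Writer performs semantic verification and synthesis. Finally, candidates are re-ranked by fusing the synthesized captions with structural priors.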
Details
Motivation: Existing pose-aware methods for text-based person anomaly search face a fundamental "Pose-Semantic Gap": semantically different actions can share similar skeletal geometries. Multimodal large language models (MLLMs) can reduce this ambiguity, but using them for large-scale retrieval is computationally prohibitive.
Result: Experiments on the PAB benchmark show that SSDC achieves state-of-the-art (SOTA) performance by balancing efficiency and semantic reasoning.
Insight: The core innovation is a two-stage cascade that decouples compute-intensive semantic understanding (handled by the multi-agent module) from efficient coarse structural filtering, markedly improving efficiency while preserving retrieval accuracy. The division of labor among Detective, Analyst, and Writer mimics human reasoning and is an effective approach to fine-grained semantic verification and evidence synthesis.
Abstract: Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
[95] MetaErr: Towards Predicting Error Patterns in Deep Neural Networks cs.CV | cs.AI | cs.LG | cs.MMPDF
Varun Totakura, Shayok Chakraborty
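TL;DR: This paper proposes MetaErr, a simple yet effective framework for predicting when a deep neural network will fail on a particular data sample. A meta-model is trained to observe the base model's performance on a given learning task and predict its success or failure, without knowledge of the base model's architecture or training parameters. Experiments show that MetaErr outperforms several strong baselines on three benchmark computer vision datasets and improves pseudo-labeling-based semi-supervised learning.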
Details
Motivation: Deep neural networks are widely used in multimedia computing applications, but they can fail abruptly without warning or explanation. While reducing error rates has been the research focus, predicting when a deep learning system will fail has received far less attention; this paper addresses that under-explored problem.
Result: MetaErr outperforms several strong baselines on three benchmark computer vision datasets and is successfully applied to improve pseudo-labeling-based semi-supervised learning, demonstrating its potential and practical value.
Insight: The novelty is a meta-model framework, agnostic to the base model's architecture and training parameters, for predicting the error patterns of deep neural networks; it offers a new approach to failure warning and system optimization in smart multimedia applications.
Abstract: Due to the unprecedented success of deep learning, it has become an integral component in several multimedia computing applications in today's world. Unfortunately, deep learning systems are not perfect and can fail, sometimes abruptly, without prior warning or explanation. While reducing the error rate of deep neural networks has been the primary focus of the multimedia community, the problem of predicting when a deep learning system is going to fail has received significantly less research attention. In this paper, we propose a simple yet effective framework, MetaErr, to address this under-explored problem in deep learning research. We train a meta-model whose goal is to predict whether a base deep neural network will succeed or fail in predicting a particular data sample, by observing the base model's performance on a given learning task. The meta-model is completely agnostic of the architecture and training parameters of the base model. Such an error prediction system can be immensely useful in a variety of smart multimedia applications. Our empirical studies corroborate the promise and potential of our framework against competing baselines. We further demonstrate the usefulness of our framework to improve the performance of pseudo-labeling-based semi-supervised learning, and show that MetaErr outperforms several strong baselines on three benchmark computer vision datasets.
[96] STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning cs.CV | cs.LGPDF
Yanpei Gong, Beichen Zhang, Hao Wang, Zhaobo Qi, Xinyan Liu
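TL;DR: This paper proposes STAND for remote sensing image change captioning. Through a semantic anchoring constraint and dual-granularity disambiguation, it progressively resolves the inherent ambiguities in viewpoint, scale, and prior knowledge in remote sensing images.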
Details
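Motivation: Existing methods that adopt video modeling overlook the inherent viewpoint, scale, and prior-knowledge ambiguities of remote sensing images and lack effective constraints on the encoder, leading to inaccurate captions.
Result: Extensive experiments on remote sensing image change captioning verify the superiority of STAND and its effectiveness in resolving ambiguities, achieving advanced performance.
Insight: Innovations include an interpretable constraint that regularizes temporal representations; a dual-granularity disambiguation module that couples macro-level global context aggregation with micro-level frequency-refocused attention; and a semantic concept anchoring module that uses language categorical priors to improve decoding accuracy.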
Abstract: Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.
[97] Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM cs.CVPDF
Jingxuan Kang, Ziqi Zhang, Shaoming Zheng, Shuang Li, Uday Bharat Patel
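TL;DR: This paper proposes SPD (Saliency-Guided Prompt Distillation), a framework addressing the performance drop of foundation models such as the Segment Anything Model (SAM) in medical image segmentation when prompts are noisy, ambiguous, or imprecise. SPD learns data-driven anatomical priors through a lightweight saliency head to obtain reliable localization maps. Contextual prompt distillation then validates and enriches noisy prompts using adjacent slices, producing a consensus prompt set consistent with expert reasoning, and a pairwise slice consistency objective reinforces local anatomical coherence during segmentation.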
Details
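Motivation: In clinical workflows, available annotations such as centerline points are typically coarse, ambiguous, and noisy. These imperfect prompts cause powerful foundation models like SAM to produce inconsistent or incomplete segmentations, limiting reliable deployment in medical imaging.
Result: Experiments on four challenging MRI and CT benchmarks show that SPD consistently outperforms existing SAM adaptations and supervised baselines on both region-based and boundary-based metrics, with large gains.
Insight: The novelty is a framework combining saliency guidance with contextual prompt distillation, which turns unreliable prompts into robust guidance and exploits anatomical consistency between adjacent slices to make segmentation more robust. It offers a practical and principled path to reliable foundation model deployment in clinical settings where only imperfect prompts are available.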
Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.
[98] KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition cs.CVPDF
Zhaoxiang Liu, Zhicheng Ma, Kaikai Zhao, Kai Wang, Shiguo Lian
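TL;DR: This paper proposes a novel Kolmogorov-Arnold convolutional layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution, and builds the efficient KAConvNet architecture on top of it. The method targets the weak theoretical grounding and computational inefficiency of existing approaches that combine KANs with convolution. On vision recognition tasks it outperforms existing KAN-convolution hybrids and is competitive with mainstream ViTs and CNNs.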
Details
Motivation: Integration of the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision has seen limited exploration, and existing attempts merely swap out activation functions, undermining KANs' theoretical foundation and limiting their potential. In addition, the B-spline curves used in KANs are computationally inefficient and prone to overfitting.
Result: On vision recognition tasks, KAConvNet outperforms existing methods that combine KAN and convolution and is competitive with mainstream Vision Transformers (ViTs) and convolutional neural networks (CNNs).
Insight: The core innovation is a more theoretically grounded Kolmogorov-Arnold convolutional layer whose design aligns with the theorem, giving stronger method interpretability. It suggests a new direction for more innovative CNN architectures, particularly in combining classical mathematical theorems with deep learning models.
Abstract: Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold neural networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs’ theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer provides stronger method interpretability because it is based on established mathematical theorems and its design has theoretical alignment. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at https://github.com/UnicomAI/KAConvNet.
[99] EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence cs.CV | cs.AI | eess.IVPDF
Yahui Li, Yinfeng Yu, Liejun Wang, Shengjie Shen
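TL;DR: This paper proposes EAD-Net, a diffusion-based, emotion-aware talking-head video generation network that aims to produce portrait videos with accurate lip synchronization and rich emotional facial expressions. It introduces SyncNet supervision and temporal representation alignment to mitigate lip-sync degradation caused by multi-modal fusion, uses a spatio-temporal directional attention mechanism to model complex dependencies in long video sequences, and explicitly models inter-frame temporal coherence with a temporal frame graph reasoning module. A large language model also extracts textual descriptions from real videos as high-level semantic guidance to strengthen emotion control.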
Details
Motivation: Existing methods rely on simple emotion labels, which carry insufficient semantic information; introducing high-level semantics improves expressiveness but easily degrades lip sync. Mainstream methods also struggle to balance computational efficiency and global motion awareness in long videos, and their temporal coherence is poor.
Result: Experiments on the HDTF and MEAD datasets show that the method outperforms existing methods in lip-sync accuracy, temporal consistency, and emotional accuracy.
Insight: Innovations include: 1) combining SyncNet supervision with temporal representation alignment to mitigate lip-sync degradation; 2) a spatio-temporal directional attention mechanism that captures global motion patterns through strip attention; 3) a temporal frame graph reasoning module that explicitly models temporal coherence through graph structure learning; 4) using a large language model to extract textual descriptions as high-level semantic guidance, enriching the semantics of emotion control.
Abstract: Emotional talking-head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.
[100] H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading cs.CVPDF
Chandravardhan Singh Raghaw, Anushka Parwal, Shahid Shafi Dar, Prajakta Darade, Nagendra Kumar
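TL;DR: This paper proposes H-SemiS, a framework that hierarchically fuses semi-supervised and self-supervised learning for grading osteoarthritis severity from knee X-rays. It decomposes multi-class grading into a sequence of binary sub-tasks and combines an adversarial self-supervised reconstruction module with quantum-inspired feature mixing to handle class imbalance and noisy labels when annotated data is limited.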
Details
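Motivation: Existing methods for automated knee osteoarthritis severity grading depend heavily on large labeled datasets and are sensitive to class imbalance, noisy samples, and variability in clinical annotations.
Result: Comprehensive evaluation on two challenging multi-class datasets and two binary-class datasets shows that H-SemiS outperforms several competing baselines and state-of-the-art methods across multiple metrics, demonstrating its superiority and generalizability.
Insight: The main innovations are the hierarchical decomposition of multi-class grading into binary sub-tasks, which directly mitigates class imbalance; an adversarial self-supervised reconstruction module that learns robust anatomical features; and quantum-inspired feature mixing that sharpens class boundaries under noisy pseudo-labels.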
Abstract: Knee osteoarthritis (KOA) is a degenerative joint disease that can lead to chronic pain, reduced mobility, and long-term disability. Automated severity grading from knee radiographs can support early assessment, but current methods heavily depend on large labeled datasets and remain sensitive to class imbalance, noisy samples, and variability in clinical annotations. To alleviate these limitations, we propose a Hierarchical fusion of Semi-Supervised framework with Self-Supervision (H-SemiS) for KOA severity grading in knee X-ray samples using limited annotated data. Rather than treating severity grading as a flat multi-class problem, H-SemiS decomposes the task into a sequence of binary sub-tasks within a semi-supervised teacher-student architecture, directly mitigating the impact of class imbalance. To further enhance feature learning from unlabeled data, the framework integrates an adversarial self-supervised reconstruction module that encourages the network to capture robust anatomical structures. In parallel, a teacher-student design with quantum-inspired feature mixing improves discrimination boundaries between adjacent grades when pseudo-labels are noisy. We comprehensively evaluate H-SemiS on two challenging multi-class datasets and assess its generalizability on two binary-class datasets. Our experimental results demonstrate the superiority of the proposed H-SemiS framework across multiple evaluation metrics, consistently outperforming several competing baselines and state-of-the-art methods. The code is publicly available at https://github.com/chandravardhan-singh-raghaw/H-SemiS.
[101] Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection cs.CVPDF
Sanghoon Lee, Geon Lee, Hyekang Park, Bumsub Ham
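TL;DR: This paper proposes a novel pseudo-labeling framework for open-vocabulary object detection (OVD) that addresses inaccurate class label assignment and unreliable objectness scores from region proposal networks (RPNs) in existing methods. It introduces hierarchical confidence calibration (HCC), which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super-category, and sub-category), and LoCLIP, a parameter-efficient adaptation of CLIP that adds an objectness token to mitigate the base-class bias of RPNs and provide reliable objectness estimates for novel classes.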
Details
Motivation: Existing OVD methods have two critical drawbacks: vision-language models (VLMs) are optimized for image-level prediction, so the region-level predictions used for pseudo-labeling are inaccurate; and RPNs trained only on base classes yield objectness scores that are unreliable for novel classes.
Result: Extensive experiments on standard OVD benchmarks, including COCO and LVIS, show that the method clearly sets a new state of the art (SOTA).
Insight: The main innovations are: 1) hierarchical confidence calibration (HCC), which uses the semantic hierarchy to make pseudo labels more reliable; 2) LoCLIP, a lightweight adaptation of CLIP that introduces an objectness token to decouple and calibrate objectness estimation, effectively mitigating base-class bias.
Abstract: Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super-category, and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate the base-class bias of RPNs and provide reliable objectness estimates for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach clearly sets a new state of the art. Project site: https://cvlab.yonsei.ac.kr/projects/HCC
[102] EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs cs.CV | cs.AIPDF
He Hu, Tengjin Weng, Zebang Cheng, Yu Wang, Jiachen Luo
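TL;DR: This paper presents EmoTrans, a benchmark for evaluating how well multimodal large language models understand, reason about, and predict emotion dynamics in video. It contains 1,000 manually annotated video clips covering 12 real-world scenarios and over 3,000 task-specific QA pairs across four progressive tasks: emotion change detection, emotion state identification, emotion transition reasoning, and next emotion prediction.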
Details
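Motivation: Existing benchmarks mainly treat emotion understanding as static recognition, and it remains unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across social contexts. A dedicated benchmark is needed to fill this gap.
Result: A comprehensive evaluation of 18 state-of-the-art MLLMs shows relatively strong performance on coarse-grained emotion change detection but continued difficulty with fine-grained emotion dynamics modeling. Socially complex settings, especially multi-person scenarios, remain substantially challenging, and reasoning-oriented variants do not yield consistent, clear improvements.
Insight: The novelty is the first multimodal video benchmark focused on emotion dynamics understanding, with a progressive evaluation framework from detection to reasoning and prediction. The work underscores the importance of modeling emotion as a dynamic process and provides a systematic tool for assessing fine-grained emotion understanding in complex social contexts.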
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.
[103] PushupBench: Your VLM is not good at counting pushups cs.CV | cs.AIPDF
Shengzhi Li, Jiarun Chen, Karun Sharma, Jiaqi Su, Shichao Pei
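TL;DR: This paper introduces PushupBench, a benchmark of 446 long-form video clips (36.7 s on average) for evaluating large vision-language models on repetition counting, such as counting pushups. The best frontier model reaches only 42.1% counting accuracy, and open-source models do far worse at about 6%. The paper shows that accuracy alone is misleading and that fine-tuning on counting improves performance on broader video understanding tasks.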
Details
Motivation: Large vision-language models can recognize what happens in a video but perform poorly at counting, for example counting action repetitions. The authors created PushupBench to evaluate this weakness in temporal reasoning and counting and to drive progress on it.
Result: On PushupBench, the best frontier model achieves only 42.1% exact accuracy; open-source 4B-parameter models score about 6%, on par with supervised baselines. Fine-tuning on counting with 1k samples also yields gains on general video understanding benchmarks: MVBench (+2.15), PerceptionTest (+1.88), and TVBench (+4.54).
Insight: The core contribution is a long-video benchmark dedicated to repetition counting, which exposes a key weakness of current VLMs in temporal reasoning. An important finding is that counting can serve as a proxy task for broader temporal reasoning: fine-tuning on it improves general video understanding, suggesting a new training angle for strengthening temporal understanding in VLMs.
Abstract: Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads: weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com.
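The claim that exact accuracy alone can mislead is easy to reproduce on toy data. The sketch below is illustrative only: the counts are invented, and pairing exact accuracy with mean absolute error is an assumption here, not necessarily the metric set used by PushupBench.

```python
from collections import Counter


def exact_accuracy(preds, targets):
    """Fraction of predictions that equal the true count exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def mean_absolute_error(preds, targets):
    """Average absolute gap between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# Hypothetical ground-truth repetition counts (not from the benchmark).
targets = [10, 10, 10, 10, 5, 20, 15, 10, 30, 12]

# Baseline that ignores the video and always answers the most common count.
modal_count = Counter(targets).most_common(1)[0][0]
modal_preds = [modal_count] * len(targets)

# Hypothetical model that tracks the video but is often off by one.
near_preds = [11, 10, 9, 10, 6, 19, 16, 11, 29, 13]

print("modal:", exact_accuracy(modal_preds, targets), mean_absolute_error(modal_preds, targets))
print("near: ", exact_accuracy(near_preds, targets), mean_absolute_error(near_preds, targets))
```

On this toy data the modal-count baseline wins on exact accuracy while being far worse on absolute error, which is the kind of shortcut the abstract warns about.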
[104] A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis cs.CVPDF
Md. Afzalur Rahaman, Tahmid Rahman
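TL;DR: This paper proposes DualStreamHybrid, a heterogeneous two-stream framework for video action recognition that gives the RGB and optical flow modalities different backbones (ViT-Tiny and MobileNetV2) and studies five fusion strategies on UCF11 and UCF50. Cross-attention fusion works best on the smaller dataset, weighted fusion is the most consistent across both, and the optimal fusion strategy depends on dataset scale.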
Details
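Motivation: Existing two-stream action recognition networks typically use the same convolutional backbone for both RGB and optical flow, ignoring the fundamentally different structural properties of the two modalities: optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context.
Result: On UCF11, cross-attention fusion reaches 98.12% test accuracy, above the RGB-only ViT-Tiny baseline (95.94%); on UCF50, weighted fusion reaches 96.86% and is the most consistent strategy across benchmarks. The learned stream weights show near-equal modality contributions on UCF11 (RGB: 0.507, flow: 0.493), while UCF50 leans toward the RGB stream (RGB: 0.554, flow: 0.446).
Insight: The core innovation is a heterogeneous two-stream architecture that lets each modality use a suitable backbone and lets features interact through a projection layer rather than forcing architectural symmetry. The study systematically compares fusion strategies and finds that the optimal choice depends on dataset scale: smaller datasets benefit from explicit cross-modal attention, larger and more complex datasets favor weighted fusion, and the RGB stream gains importance as visual diversity grows.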
Abstract: Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
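A minimal sketch of what a weighted fusion of two already-projected streams might look like. The abstract does not specify how the stream weights are parameterized; the softmax over two learnable logits below is an assumption, chosen only because the reported weight pairs sum to 1. All names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(rgb_feat, flow_feat, logits):
    """Convex combination of two equally sized feature vectors.

    `logits` stands in for two learnable scalars (assumed parameterization).
    """
    w_rgb, w_flow = softmax(logits)
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_feat, flow_feat)]


# Toy features, assumed to be already projected to a common dimensionality.
rgb_feat = [1.0, 0.0, 2.0]
flow_feat = [0.0, 1.0, 2.0]

# Equal logits give each stream a weight of 0.5.
fused = weighted_fusion(rgb_feat, flow_feat, [0.0, 0.0])
print(fused)
```

Under this parameterization, inspecting the softmax output after training would yield a pair of per-stream weights of the kind the abstract reports.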
[105] Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations cs.CV | cs.AIPDF
Soulayma Gazzeh, Giuseppe Mazzola, Liliana Lo Presti, Marco La Cascia
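TL;DR: This paper introduces Sphere-Depth, a new public benchmark for systematically evaluating the robustness of monocular depth estimation models on equirectangular images under varying spherical camera poses. It simulates camera pose perturbations, tests models including Depth Anything and Depth Anywhere, and proposes a depth calibration-based error protocol to make evaluation metrics comparable. Experiments show that even models designed for spherical images degrade significantly when the camera pose deviates from the canonical pose.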
Details
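Motivation: Reliable depth estimation from spherical images is crucial for robotic navigation and immersive scene understanding. On real robotic platforms, however, the spherical camera can undergo unintentional pose changes which, together with the geometric distortions inherent in equirectangular projection, severely affect depth estimation. No benchmark has systematically evaluated model robustness to such pose changes.
Result: Experiments on Sphere-Depth evaluate the perspective-based model Depth Anything and spherical-aware models (e.g., Depth Anywhere, ACDNet). All tested models show substantial performance degradation under camera pose changes, including those designed for spherical images. The proposed depth calibration protocol ensures consistent evaluation across models.
Insight: Innovations include: 1) Sphere-Depth, the first public benchmark to systematically evaluate how spherical camera pose changes affect depth estimation robustness; 2) a depth calibration error protocol based on supervised learned scaling factors, which converts predicted relative depth into metric depth for fair cross-model comparison; 3) evidence that spherical-aware models are sensitive to camera pose perturbations, offering important insights for designing robust spherical vision methods.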
Abstract: Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth
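The idea of converting relative depth to metric depth with a per-model scaling factor can be sketched as follows. The abstract does not say how the factors are fitted; the closed-form least-squares scale used here is one common choice and is an assumption, as are the function name and toy values.

```python
def fit_scale(pred, gt):
    """Least-squares scalar s minimizing sum((s * pred - gt)^2).

    Closed form: s = sum(pred * gt) / sum(pred^2).
    """
    num = sum(p * g for p, g in zip(pred, gt))
    den = sum(p * p for p in pred)
    return num / den


# Hypothetical calibration pixels: relative predictions vs. metric ground truth (metres).
pred = [0.1, 0.2, 0.4]
gt = [0.5, 1.0, 2.0]

scale = fit_scale(pred, gt)
metric_pred = [scale * p for p in pred]

# Error is then measured in metric units after calibration.
mae = sum(abs(m - g) for m, g in zip(metric_pred, gt)) / len(gt)
print(scale, mae)
```

Fitting one factor per model on held-out calibration data, then reporting errors on the scaled predictions, puts models with different output ranges on a common metric footing.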
[106] From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers cs.CV | cs.LGPDF
Jainum Sanghavi
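TL;DR: This paper studies how a Vision Transformer (ViT) pretrained only on image classification spontaneously encodes spatial structure. Linear probes decode local boundary and global depth information from each layer of ViT-B/16 and reveal a clear hierarchical spatial representation: boundaries are linearly decodable at the middle layers (5-6), depth peaks slightly later (layer 8), and both vanish at the final classification layer. Causal interventions further show that the depth signal is actively maintained along a specific direction rather than passively carried over.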
Details
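Motivation: The motivation is to find out how and where a ViT pretrained only on image classification, with no explicit spatial supervision, encodes the structural information that downstream tasks such as spatial understanding depend on, and thereby to understand the hierarchy of its internal representations.
Result: On BSDS500 boundary detection, linear decoding from layers 5-6 of ViT-B/16 reaches an AP of 0.833; on NYU Depth V2 depth estimation, linear decoding from layer 8 reaches an MAE of 0.0875, both indicating strong linear decodability. Random-weight controls confirm that these encodings are learned rather than inherent to the architecture.
Insight: The novelty is revealing a spontaneously formed, actively maintained spatial hierarchy in a classification-trained ViT, whose local-to-global progression resembles the processing stream of the primate visual cortex. Methodologically, combining linear probing with causal interventions (direction ablation and activation patching) offers a new lens on Transformer internal representations.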
Abstract: Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.
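Direction ablation, as described in the abstract, amounts to projecting a single direction out of an activation vector. The sketch below shows the generic operation on a toy 3-dimensional activation; the vectors are hypothetical, and the paper's exact intervention procedure may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate(x, direction):
    """Remove the component of x along `direction`: x - (x . u) u, with u the unit vector."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    coeff = dot(x, unit)
    return [xi - coeff * ui for xi, ui in zip(x, unit)]


# Hypothetical linear probe weights and one token activation.
probe_w = [3.0, 4.0, 0.0]
x = [1.0, 2.0, 5.0]

before = dot(probe_w, x)                                  # probe readout on the intact activation
after_probe = dot(probe_w, ablate(x, probe_w))            # ablate the direction the probe reads
after_other = dot(probe_w, ablate(x, [0.0, 0.0, 1.0]))    # ablate an orthogonal direction

print(before, after_probe, after_other)
```

Ablating the probe's own direction zeroes its readout, while ablating an orthogonal direction leaves the readout untouched. In a real network, later layers can partially rebuild the signal, which is what the activation-patching result in the abstract speaks to.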
[107] Leveraging Spatial Transcriptomics as Alternative to Manual Annotations for Deep Learning-Based Nuclei Analysis cs.CV | cs.LGPDF
Kazuya Nishimura, Ryoma Bise, Haruka Hirose, Yasuhiro Kojima
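TL;DR: This paper proposes a deep learning framework for nuclei segmentation and classification in pathology images that uses spatial transcriptomics (ST) data as supervision, reducing reliance on large-scale pixel-level manual annotation. It integrates cell-level ST data, converts gene expression profiles into cell-type labels, and introduces an image-oriented classification method that bridges gene expression-based cell typing and image-based cell classification.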
Details
Motivation: 解决病理图像中细胞核分割与分类任务依赖成本高昂且难以获取的大规模像素级人工标注的问题,尤其是在多样组织和染色条件下。
Result: 在未见过的器官上进行分割实验,相比传统监督模型,本方法在训练器官类型更少的情况下实现了更高的分割精度,展示了强迁移性;分类实验也显示出对现有方法的持续改进。
Insight: 创新性地利用空间转录组学数据作为替代监督源,并设计了一种桥接基因表达与图像识别的分类方法,为病理图像分析提供了数据高效且可迁移的解决方案,减少了人工标注依赖。
Abstract: Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorporating cell-level ST data, we obtain gene expression profiles and corresponding nuclear masks from histopathological images. Gene expression profiles are converted into cell-type labels and used as training data for image-based classification. Because existing gene expression-based cell-type classification methods are not designed for image recognition, we introduce an image-oriented classification approach that bridges gene expression-based cell typing and image-based cell classification. To evaluate generalization, we conduct segmentation experiments on previously unseen organs and compare our method with conventional supervised models. Despite being trained on fewer organ types, our framework achieves higher segmentation accuracy, demonstrating strong transferability. Classification experiments further show consistent improvements over existing approaches.
[108] BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors cs.CVPDF
Dong Huo, Tristan Aumentado-Armstrong, Samrudhdhi B. Rangrej, Maitreya Suin, Angela Ning Ye
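TL;DR: This paper proposes BurstGP, a diffusion-based method for raw burst image super-resolution (BISR) that augments a conventional BISR approach with generative priors from pretrained foundation models. It introduces a degradation-aware conditioning mechanism and an sRGB-to-lRGB inverter to recover richer textures and finer details.
Details
Motivation: Conventional burst super-resolution methods tend to oversmooth complex textures, while existing diffusion-based approaches are mostly task-specific models trained from scratch on single-frame reconstructions, so they do not fully exploit the temporal redundancy of the burst or video priors.
Result: BurstGP outperforms existing state-of-the-art methods both quantitatively (notably on perceptual metrics such as MUSIQ and LPIPS) and qualitatively, with particularly strong perceptual quality.
Insight: Contributions include bringing video generative priors to BISR, a degradation-aware conditioning mechanism that controls detail synthesis, and an sRGB-to-lRGB inverter that makes sRGB priors compatible with raw input. The core idea is to use the generative capacity of pretrained foundation models to improve detail recovery with minimal loss of fidelity.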
Abstract: Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.
[109] Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model cs.CV | cs.AIPDF
Jingni Huang, Peter Bloodsworth
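TL;DR: This paper studies facial expression-derived emotion embeddings as an auxiliary conditioning signal for short-horizon human pose forecasting. The authors propose a lightweight autoregressive predictive world model that fuses pose keypoints and emotion embeddings through a learnable gating mechanism, and evaluate it on two small-scale pose-emotion video datasets. Normalized gated fusion significantly improves prediction on emotion-driven motion sequences, and counterfactual perturbation experiments support the emotion embedding as an effective conditioning signal.
Details
Motivation: Existing trajectory prediction models rely mainly on geometric motion cues and overlook the latent emotional signals that influence human motion dynamics. The paper asks whether facial expression-derived emotion embeddings can serve as an auxiliary conditioning signal for short-term pose prediction.
Result: On two small-scale pose-emotion video datasets, simple multimodal fusion does not consistently improve prediction accuracy, whereas normalized gated fusion significantly improves performance on emotion-driven motion sequences. Counterfactual perturbation experiments show that predicted trajectories are measurably sensitive to changes in the multimodal input.
Insight: The novelty lies in introducing facial expression emotion embeddings as a conditioning signal for short-term pose prediction, together with a lightweight autoregressive predictive world model and a learnable gated fusion mechanism. The study offers a feasible architecture for emotion-conditioned forecasting and evidence that emotion embeddings act as non-redundant auxiliary features.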
Abstract: Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction [1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics [4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditioning signals for short-term pose prediction. To further evaluate multimodal conditioning in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive rollout prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes, and natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances performance on emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in the multimodal input, suggesting that facial expression embeddings act as auxiliary conditioning signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into short-term pose prediction with a lightweight predictive world model is a feasible approach.
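A learnable gate of the kind the abstract describes could look roughly like the sketch below. It rests on several assumptions that the abstract does not confirm: both feature vectors are already projected to the same dimension, "normalized" means L2 normalization of each modality before gating, and the gate is a per-dimension sigmoid over the concatenated features. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2_normalize(v, eps=1e-8):
    """Scale v to (approximately) unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def gated_fusion(pose_feat, emo_feat, gate_w, gate_b):
    """Blend pose and emotion features with a per-dimension sigmoid gate.

    gate_w has one row per output dimension, each of length 2 * dim,
    applied to the concatenation [pose, emotion]. Assumed layout.
    """
    p = l2_normalize(pose_feat)
    e = l2_normalize(emo_feat)
    joint = p + e  # list concatenation
    fused = []
    for i in range(len(p)):
        g = sigmoid(sum(w * x for w, x in zip(gate_w[i], joint)) + gate_b[i])
        fused.append(g * p[i] + (1.0 - g) * e[i])
    return fused


# With zero gate parameters the gate is 0.5, so the output is the plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion([3.0, 4.0], [0.0, 2.0], zero_w, [0.0, 0.0])
```

In a trained model the gate parameters would be learned together with the LSTM, so the network can suppress the emotion branch on sequences where facial expression carries little information.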
[110] $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models cs.CVPDF
Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Kaishen Yuan
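TL;DR: This paper proposes $Z^2$-Sampling (Zero-cost Zigzag Sampling), a method for improving the quality and efficiency of text-aligned generation in diffusion models. It shows theoretically that explicit zigzag trajectories are topologically reducible and, by combining implicit algebraic elimination with a dynamically cached Temporal Semantic Surrogate, probes the curvature of the data manifold without adding Neural Function Evaluation (NFE) cost, which significantly improves semantic alignment.
Details
Motivation: Standard Classifier-Free Guidance (CFG) relies only on instantaneous gradients and ignores the intrinsic curvature of the data manifold. Methods such as Z-Sampling probe this curvature by explicitly traversing multi-step forward-backward trajectories, which improves semantic alignment but triples the NFE cost and introduces truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. The paper targets this trade-off between efficiency and accuracy.
Result: Extensive evaluations indicate that $Z^2$-Sampling advances the performance-efficiency Pareto frontier. The method is validated across architectures (U-Nets, DiTs) and modalities (image and video), combines orthogonally with alignment frameworks such as AYS and Diffusion-DPO, and restores the standard 2-NFE baseline without sacrificing semantic exploration.
Insight: Main contributions: 1) a theoretical proof that the explicit zigzag sequence is topologically reducible, leading to Implicit Z-Sampling, which algebraically eliminates intermediate states via operator dualities and thereby removes off-manifold approximation errors; 2) exploitation of the temporal coherence of the Probability Flow ODE, coupling the implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate to achieve curvature exploration at zero additional cost; 3) a formal proof via Backward Error Analysis that this discrete collapse inherently synthesizes a directional-derivative curvature penalty.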
Abstract: Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE’s temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).
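For readers unfamiliar with the 2-NFE baseline the abstract refers to, the sketch below shows the standard Classifier-Free Guidance combination and the evaluation budget it implies. The CFG formula is the usual one; the `cost_multiplier` bookkeeping is only an illustration of the "triple the NFE cost" statement, and nothing here reproduces the $Z^2$-Sampling algorithm itself.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# One conditional and one unconditional network call per sampling step.
NFE_PER_STEP_CFG = 2


def total_nfe(num_steps, cost_multiplier=1):
    """Network evaluations for a sampler whose per-step cost is a multiple of plain CFG."""
    return num_steps * NFE_PER_STEP_CFG * cost_multiplier


guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
baseline_budget = total_nfe(50)                    # plain CFG
explicit_zigzag_budget = total_nfe(50, 3)          # tripled cost, as stated in the abstract
```

Under these assumptions a 50-step run costs 100 evaluations with plain CFG and 300 with an explicit zigzag traversal, which is the gap the paper claims to close.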
[111] Spatiotemporal Degradation-Aware 3D Gaussian Splatting for Realistic Underwater Scene Reconstruction cs.CVPDF
Shaohua Liu, Ning Gao, Zuoya Gu, Hongkun Dou, Yue Deng
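TL;DR: This paper proposes MarineSTD-GS, a 3D Gaussian Splatting framework for reconstructing realistic underwater scenes from underwater video. It introduces paired Intrinsic and Degraded Gaussian primitives and a Spatiotemporal Degradation Modeling module that disentangles the true scene appearance from degraded images in a self-supervised manner, handling the spatial and temporal degradations inherent in underwater imaging at the same time.
Details
Motivation: Existing 3D reconstruction methods often produce inaccurate geometry and appearance on underwater video because of inherent spatiotemporal degradations such as caustics, flickering, attenuation, and backscattering. Prior work usually addresses spatial or temporal degradation alone and cannot cope with real underwater scenes where both occur together.
Result: Experiments on simulated and real-world datasets show that MarineSTD-GS handles spatiotemporal degradations robustly, outperforms existing methods in novel view synthesis, and reconstructs realistic, water-free scene appearances.
Insight: The novelty is a 3D Gaussian Splatting framework that models spatial and temporal degradation jointly, achieving self-supervised disentanglement through paired Gaussian primitives and a physically inspired degradation modeling module. A Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy support stable training and accurate geometry, and a simulated benchmark with diverse degradations and ground-truth appearances enables comprehensive evaluation.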
Abstract: Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more real-world underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.
[112] PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics cs.CVPDF
Tianyidan Xie, Zhentao Huang, Mingjie Wang, Xin Huang, Jun Zhou
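TL;DR: PhysLayer is a language-guided layered animation framework that turns static images into video through depth-aware physics simulation. It uses vision foundation models to decompose a scene into depth-based layers, extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, and combines the simulated trajectories with scene-aware relighting to produce temporally coherent video.
Details
Motivation: Existing image-to-video generation methods often produce physically implausible motion and lack precise control over object dynamics. Earlier approaches that incorporate physics simulators are confined to 2D planar motion and fail to capture depth-aware spatial interactions.
Result: Experiments report improvements of 2.2% in CLIP-Similarity, 9.3% in FID score, and 3% in Motion-FID; human evaluation shows a 24% gain in physical plausibility and a 35% gain in text-video alignment.
Insight: The novelty lies in depth layering driven by language-guided scene understanding, which enables depth-aware physics simulation without full 3D reconstruction and balances computational efficiency against physical realism, offering a new route to controllable image animation.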
Abstract: Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
[113] Identity-Decoupled Anonymization for Visual Evidence in Multi-modal Retrieval-Augmented Generation cs.CV | cs.IRPDF
Zehua Cheng, Wei Dai, Jiahao Sun
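TL;DR: This paper proposes Identity-Decoupled MRAG, a framework for anonymizing visual evidence containing human faces in multi-modal retrieval-augmented generation (MRAG) systems. It inserts a generative anonymization module between retrieval and generation that factorizes each face into an identity code and an attribute code and replaces the identity code to synthesize an anonymized face, protecting privacy while preserving the visual cues that downstream tasks need.
Details
Motivation: In MRAG systems, retrieved images often contain sensitive personal information such as facial identity, yet existing anonymization techniques either destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees.
Result: Privacy is enforced through an ensemble verifier composed of multiple face recognition models, using a hinge-based loss that halts optimization once identity similarity drops below the impostor threshold, which is intended to provide a principled privacy guarantee.
Insight: Contributions include: 1) a disentangled variational encoder that factorizes each face into an identity code and a spatially structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; 2) a manifold-aware rejection sampler that replaces the identity code with a synthetic one that is both distinct from the original and realistic; 3) a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment.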
Abstract: Multi-modal retrieval-augmented generation (MRAG) systems retrieve visual evidence from large image corpora to ground the responses of large multi-modal models, yet the retrieved images frequently contain human faces whose identities constitute sensitive personal information. Existing anonymization techniques either destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees. We propose Identity-Decoupled MRAG, a framework that interposes a generative anonymization module between retrieval and generation. Our approach consists of three components: (i) a disentangled variational encoder that factorizes each face into an identity code and a spatially-structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; (ii) a manifold-aware rejection sampler that replaces the identity code with a synthetic one guaranteed to be both distinct from the original and realistic; and (iii) a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment. Privacy is enforced through a multi-oracle ensemble of face recognition models with a hinge-based loss that halts optimization once identity similarity drops below the impostor-regime threshold.
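The hinge-based privacy term can be sketched as follows. This is a minimal illustration under stated assumptions: identity similarity is taken to be cosine similarity between face embeddings, the ensemble is aggregated by a simple mean, and the threshold value is invented; the abstract specifies none of these details, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_hinge_loss(anon_embeddings, orig_embeddings, threshold):
    """Mean over oracles of max(0, similarity - threshold).

    Each list holds one embedding per face recognition oracle. The loss is
    zero for an oracle once similarity falls below the threshold, so that
    oracle stops contributing gradient.
    """
    losses = [
        max(0.0, cosine_similarity(a, o) - threshold)
        for a, o in zip(anon_embeddings, orig_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy embeddings: oracle 1 still sees the same identity, oracle 2 does not.
orig = [[1.0, 0.0], [1.0, 0.0]]
anon = [[1.0, 0.0], [0.0, 1.0]]
loss = identity_hinge_loss(anon, orig, threshold=0.3)
```

The hinge shape is what makes the optimization stop rather than keep pushing the anonymized face further from the original, which would otherwise risk damaging the preserved attributes.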
[114] Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling cs.CV | cs.CL | cs.MM | cs.SD | eess.ASPDF
Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang
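TL;DR: Talker-T2AV is an autoregressive diffusion model for joint generation of talking audio and video. A shared autoregressive language model performs high-level cross-modal reasoning over audio and video in a unified patch-level token space, and two lightweight diffusion transformer heads then decode the hidden states into frame-level audio and video latents, preserving cross-modal consistency while avoiding unnecessary entanglement of low-level details.
Details
Motivation: Existing joint audio-video generation models couple the modalities throughout denoising via pervasive attention, fully entangling high-level semantics with low-level details. This is suboptimal for talking head synthesis: audio and facial motion are semantically correlated, but their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Joint modeling at every level causes unnecessary entanglement and reduces efficiency.
Result: On talking portrait benchmarks, Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines.
Insight: The core contribution is a decoupled joint generation framework: high-level cross-modal modeling takes place in a shared backbone, while low-level refinement uses modality-specific decoders. The design maintains cross-modal semantic consistency while respecting the independence of each modality's low-level rendering, improving generation efficiency and quality.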
Abstract: Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
[115] Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation cs.CV | cs.ROPDF
Simone Mosco, Daniel Fusaro, Alberto Pretto
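TL;DR: This paper proposes a new method for 3D LiDAR anomaly segmentation that operates directly in feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. To address the limitations of the existing dataset (simple scenes, few anomaly instances, and a domain gap caused by sensor resolution), the authors build a set of mixed real-synthetic datasets on top of established semantic segmentation benchmarks, with multiple out-of-distribution objects and complex environments. Extensive experiments show state-of-the-art results on the existing real-world dataset and competitive results on the new mixed datasets.
Details
Motivation: Distinguishing known classes from previously unseen objects (anomaly segmentation) is crucial in autonomous driving and robotic perception, but research in 3D remains limited and most existing methods apply post-processing techniques from 2D vision. In addition, the only publicly available 3D LiDAR anomaly segmentation dataset has simple scenes, few anomaly instances, and a severe domain gap due to its sensor resolution.
Result: The method achieves state-of-the-art (SOTA) results on the existing real-world dataset and competitive results on the newly proposed mixed datasets, supporting the effectiveness of the method and the utility of the datasets.
Insight: Contributions include: 1) an efficient method that models the inlier class distribution directly in feature space to constrain anomalous samples; 2) mixed real-synthetic datasets with multiple out-of-distribution objects and complex environments, which address the shortcomings of the existing dataset. The method avoids complex post-processing, and the datasets give the field a more comprehensive evaluation benchmark.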
Abstract: Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects, the task addressed by anomaly segmentation, is crucial in real-world environments. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To fill this gap, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.
[116] A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification cs.CVPDF
Peng Chen, Wenxuan He, Feng Qian, Guangyao Shi, Jingwen Yan
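TL;DR: This paper proposes a synergistic CNN-Transformer network with pooling attention fusion for hyperspectral image classification. A parallel twin-branch feature extraction module handles spatial and spectral features, and a hybrid pooling attention module and a cross-layer feature fusion module are designed to exploit spatial-spectral information jointly and reduce information loss between layers.
Details
Motivation: Existing hyperspectral image classification methods struggle to use spatial and spectral information jointly and lose information across layers during feature propagation. The paper aims to address both problems.
Result: Extensive experiments on several representative datasets indicate that the method outperforms existing state-of-the-art (SOTA) approaches.
Insight: The novelty is a synergistic CNN-Transformer architecture in which the TBFE module extracts features in parallel, the HPA module aggregates spatial attention, and the CFF module reduces information loss, so that spatial and spectral features are processed separately and then fused effectively.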
Abstract: In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.
[117] Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation cs.CV | cs.MM | cs.SDPDF
Chunyu Li, Jiaye Li, Ruiqiao Mei, Haoyuan Xia, Hao Zhu
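TL;DR: Hallo-Live is a streaming framework for real-time text-driven audio-video avatar generation. It combines asynchronous dual-stream diffusion with human-centric preference-guided distillation to jointly synthesize portrait video and speech with high fidelity and precise synchronization.
Details
Motivation: Existing audio-visual diffusion models are too slow for interactive use and degrade noticeably after aggressive acceleration, so a framework is needed that generates in real time while preserving quality and synchronization.
Result: On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds of latency, a 16.0x throughput increase and a 99.3x latency reduction relative to the teacher model Ovi. In generation quality, its VideoAlign overall score and Sync Confidence score are comparable to Ovi's, and it outperforms other accelerated baselines in the overall quality-efficiency trade-off.
Insight: Contributions include: 1) Future-Expanding Attention, which reduces articulation lag in causal generation; 2) Human-Centric Preference-Guided DMD (HP-DMD) distillation, which reweights training samples using rewards for visual fidelity, speech naturalness, and audio-visual synchronization; 3) according to the authors, the first combination of streaming dual-stream diffusion with preference-guided distillation for real-time text-driven audio-visual generation.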
Abstract: Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.
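The reported speedups imply rough figures for the teacher model, which the abstract does not state directly. The short calculation below backs them out, assuming both ratios were measured on the same hardware and that "lower latency" means a simple ratio of latencies; the derived values are estimates, not numbers from the paper.

```python
# Figures reported in the abstract for Hallo-Live.
student_fps = 20.38
student_latency_s = 0.94
throughput_gain = 16.0      # "16.0x higher throughput"
latency_reduction = 99.3    # "99.3x lower latency"

# Implied teacher (Ovi) figures under the assumptions above.
teacher_fps = student_fps / throughput_gain
teacher_latency_s = student_latency_s * latency_reduction

print(f"implied teacher throughput: {teacher_fps:.2f} FPS")
print(f"implied teacher latency: {teacher_latency_s:.1f} s")
```

Under these assumptions the teacher would run at roughly 1.27 FPS with about 93 seconds of latency, which shows why it is unsuitable for interactive use.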
[118] RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing cs.CV | cs.AIPDF
Pritesh Jha
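TL;DR: RaV-IDP is a reconstruction-as-validation framework for intelligent document processing. After each entity is extracted, it is reconstructed into a form comparable to the original document region and compared against it, producing a label-free fidelity score that checks whether the extraction is faithful; when fidelity is too low, a GPT-4.1 vision fallback is triggered.
Details
Motivation: Existing document processing pipelines have no intrinsic mechanism for verifying that extraction output is faithful to the source document. Model-internal confidence measures inference certainty rather than correspondence to the document, so extraction errors pass silently into downstream systems.
Result: The paper proposes a per-stage evaluation framework that pairs each pipeline component with an appropriate benchmark, but the abstract reports no quantitative results or comparisons with SOTA.
Insight: Reconstruction is treated as a core architectural component: comparing the reconstruction with the original document region yields a fidelity score, a validation signal grounded in the document itself that requires no manual labels. A bootstrap constraint keeps the validation from becoming circular, and a structured fallback handles low-fidelity cases.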
Abstract: Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.
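The extract, reconstruct, compare, and fallback loop described in the abstract can be sketched with plain callables. This is an illustration only: the function signature, the `max_fallbacks` cap, and the toy character-overlap comparator are assumptions made for the example, not the interfaces of the published RaV-IDP code.

```python
def validate_entity(region, entity_type, extract, reconstruct, compare,
                    fallback_extract, thresholds, max_fallbacks=1):
    """Return (extraction, fidelity, fallbacks_used) for one document region."""
    extraction = extract(region)
    fallbacks_used = 0
    while True:
        # Bootstrap constraint: always score against the original region,
        # never against the extraction itself.
        fidelity = compare(reconstruct(extraction), region)
        if fidelity >= thresholds[entity_type] or fallbacks_used >= max_fallbacks:
            return extraction, fidelity, fallbacks_used
        extraction = fallback_extract(region)  # e.g. a vision-model fallback
        fallbacks_used += 1


def char_overlap(a, b):
    """Toy comparator: fraction of positions where two strings agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b), 1)


# Toy run: the primary extractor makes one character error, the fallback is exact.
result = validate_entity(
    region="abc",
    entity_type="text",
    extract=lambda r: "abx",
    reconstruct=lambda e: e,
    compare=char_overlap,
    fallback_extract=lambda r: r,
    thresholds={"text": 0.9},
)
```

Capping the number of fallbacks keeps the loop from running indefinitely on a region that no extractor can reproduce; the final fidelity score is still returned so a downstream consumer can decide whether to trust the entity.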
[119] Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation cs.CVPDF
Navid Aslankhani Khameneh, Marco Carletti, Cigdem Beyan
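TL;DR: This paper proposes a geometry-conditioned diffusion model (Pose-LDM) to make in-bed human pose estimation more robust to blanket occlusion. The method synthesizes occluded images directly from skeletal keypoints, without paired supervision or pixel-level source-image conditioning, producing diverse pose data for augmentation.
Details
Motivation: In-bed pose estimation under heavy blanket occlusion suffers from a lack of reliable labeled training data. Existing approaches rely on multi-modal sensing or on image translation frameworks conditioned on visible source imagery, which limits scalability and pose diversity.
Result: Evaluated through downstream pose estimation with a fixed backbone, Pose-LDM achieves the highest strict localization accuracy under severe occlusion while keeping overall detection performance comparable to paired diffusion models and approaching fully supervised training.
Insight: The novelty is to recast occlusion-aware data augmentation as a geometry-conditioned generative modeling task and to propose Pose-LDM, which generates occluded images directly from pose keypoints. This removes the dependence on paired supervision and pixel-level source-image conditioning and provides an efficient augmentation route that leaves the sensing pipeline unchanged.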
Abstract: Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose-LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust in-bed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/GeoDiffPose.
[120] BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments cs.CVPDF
Guoxi Huang, Ruirui Lin, Yini Li, David R. Bull, Nantheera Anantrasirichai
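TL;DR: This paper proposes BVI-Mamba, a video enhancement framework for the distortions common in low-light and underwater video, such as noise, low contrast, color imbalance, and blur. It uses the Visual State Space (VSS) model to reduce memory usage and computation time, and consists of a feature alignment module and a UNet-like enhancement module built from VSS blocks. Experiments show that it outperforms Transformer-based and convolution-based models on low-light and underwater video enhancement.
Details
Motivation: Video captured in low-light and underwater environments suffers from multiple distortions that degrade both visual quality and the performance of tasks such as automatic detection. Existing AI-based video enhancement tools are typically compute-intensive and slow, so a more efficient method is needed.
Result: On low-light and underwater video enhancement tasks, the Visual Mamba technique outperforms Transformer-based and convolution-based models.
Insight: The main novelty is bringing the Visual State Space (VSS) model to video enhancement as a replacement for conventional convolutional layers, with the aim of reducing computational overhead while maintaining or improving performance. Architecturally, a feature alignment module handles spatio-temporal displacement between frames, and a UNet-like network built from VSS blocks performs denoising and brightness adjustment, applying efficient sequence modeling to a vision task.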
Abstract: Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available online at https://github.com/russellllaputa/BVI-Mamba.
[121] SolarFCD: A Large-Scale Dataset and Benchmark for Solar Fault Classification in Photovoltaic Systems cs.CVPDF
Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Arooj Zaib, Sebastian Vollmer
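TL;DR: This paper introduces SolarFCD, a large-scale, multi-modal public dataset and benchmark for solar panel fault classification in photovoltaic systems. Built by consolidating three public datasets, it contains 4,435 images across two imaging modalities (RGB/drone and thermal infrared), unified into four classes: healthy, surface obstruction, structural fault, and electrical fault. The study provides systematic data splits and baseline model evaluations, and releases the dataset and code.
Details
Motivation: Growing global deployment of photovoltaic systems calls for robust, scalable automated inspection, but the lack of large-scale, multi-modal, publicly available annotated datasets has held the field back.
Result: Sixteen classification architectures from five design families were evaluated on SolarFCD. ResNet101V2 achieved the best overall performance, with 86.68% accuracy, 88.65% precision, 88.62% recall, and an 88.17% F1-score. Detection performance was balanced across all four classes, varying by less than 1.2 percentage points.
Insight: By systematically consolidating several public datasets under unified labels, the work builds a large-scale, multi-modal photovoltaic fault classification dataset with reproducible benchmarks; targeted augmentation of minority classes improves the dataset's balance and practical value, making it a useful resource for automated PV inspection research.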
Abstract: The increasing global deployment of solar photovoltaic (PV) systems calls for robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating conditions. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle preventing advancement in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/drone images and thermal infrared. The dataset consists of 4,435 images arranged under four unified classes: healthy, surface obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four defect categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.
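A quick check on the reported metrics: the harmonic mean of the reported precision and recall is about 88.63%, slightly above the reported F1-score of 88.17%. One possible explanation, offered here only as an assumption since the abstract does not state the averaging scheme, is that F1 was computed per class and then averaged (macro-F1), which is never larger than the harmonic mean of macro-precision and macro-recall. The per-class values in the toy example are invented to show the effect.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def macro_f1(per_class):
    """Average of per-class F1 scores; per_class is a list of (precision, recall)."""
    return sum(f1_from(p, r) for p, r in per_class) / len(per_class)


# Harmonic mean of the aggregate precision and recall reported in the abstract.
harmonic_of_reported = f1_from(88.65, 88.62)

# Invented two-class example: same macro precision and recall (0.75 each),
# but the averaged per-class F1 is lower than their harmonic mean.
toy_classes = [(1.0, 0.5), (0.5, 1.0)]
toy_macro_f1 = macro_f1(toy_classes)
toy_harmonic = f1_from(0.75, 0.75)
```

Whatever the actual cause, the gap is under half a percentage point and does not change the ranking reported for ResNet101V2.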
[122] HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA cs.CV
Francesco Dibitonto, Cigdem Beyan, Vittorio Murino
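TL;DR: This paper proposes HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that moves a pretrained CLIP model into hyperbolic space via lightweight fine-tuning and applies it to zero-shot visual question answering. The method surpasses Euclidean baselines and prior hyperbolic approaches on multiple VQA benchmarks, with the largest gains on reasoning-intensive tasks.
Details
Motivation: Existing hyperbolic CLIP variants must be trained from scratch, which is computationally expensive and resource-intensive. HAC aims to reuse pretrained CLIP in a parameter-efficient way, adapting it to hyperbolic geometry to capture hierarchical structure and improve zero-shot VQA performance.
Result: Across a diverse set of VQA benchmarks spanning General, Reasoning, and OCR categories, both HAC-S and HAC-B consistently surpass Euclidean baselines and prior hyperbolic methods. HAC-B improves on CLIP-B by up to 1.9 points on average on reasoning-intensive tasks.
Insight: The contributions include a parameter-efficient framework for adaptation to hyperbolic space, a strict zero-shot evaluation paradigm whose training data has no overlap with the VQA benchmarks, and the use of hyperbolic geometry to strengthen CLIP's ability to represent hierarchical structure. The approach may carry over to efficient adaptation for other multimodal tasks.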
Abstract: Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC’s training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC’s task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
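To make "transition into hyperbolic space" concrete, the sketch below maps a Euclidean embedding onto the Poincaré ball with the exponential map at the origin and measures geodesic distance there. This is an illustration only: the abstract does not say which hyperbolic model HAC uses (Poincaré ball, Lorentz, or other), and the function names and the curvature parameter `c` are assumptions, not taken from the HAC code.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball."""
    def sq(u):
        return sum(a * a for a in u)
    diff = sq([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)
```

Under this convention a point mapped from a tangent vector `v` lies at distance `2 * ||v||` from the origin, so embeddings with larger norm land closer to the boundary of the ball, which is the property usually exploited to encode hierarchy depth.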
[123] ZID-Net: Zero-Inference Diffusion Prior Decoupling Network for Single Image Dehazing cs.CV | eess.IV
Xinheng Li, Minghao Chen, Mengqing Wu, Yan Liu, Guanying Huo
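TL;DR: This paper proposes ZID-Net, a new framework for single image dehazing that targets the trade-off between restoration quality and computational efficiency. It decouples diffusion supervision from feed-forward inference: a frequency-spatial decoupled feed-forward backbone does the restoration, while a Zero-Inference Prior Propagation Head, used only during training, supplies physical priors, giving efficient, high-quality dehazing at test time.
Details
Motivation: Existing methods have limitations: CNNs struggle to learn robust priors for dense, non-homogeneous haze, while diffusion models offer strong generative priors but suffer from high inference latency and unstable sampling. The paper aims to design a dehazing network that exploits diffusion priors while keeping inference purely feed-forward and efficient.
Result: ZID-Net reaches 40.75 dB PSNR on the synthetic RESIDE dataset, outperforms existing methods by 1.13 dB PSNR on real-world datasets, and obtains a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of only 19.35 ms.
Insight: The core innovation is decoupling diffusion supervision (which supplies physical priors) from the feed-forward inference architecture, so that the prior adds no inference cost. The specific components are a frequency-spatial decoupled backbone, a Channel-Spatial Laplacian Mask (CSLM) that extracts purified structural details, Lightweight Global Context Blocks (LGCBs) that establish long-range dependencies, and a Dynamic Feature Arbitration Block (DFAB) that adaptively fuses features. The key piece is the Zero-Inference Prior Propagation Head (ZI-PPH) used during training: it uses a conditional diffusion process to predict residual noise and supervise the backbone, and the diffusion branch is discarded at test time, integrating diffusion priors into a purely feed-forward architecture.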
Abstract: Single image dehazing is often constrained by a trade-off between restoration quality and computational efficiency. While efficient, CNN networks struggle to learn robust priors for dense and non-homogeneous haze. Conversely, diffusion models provide strong generative priors but suffer from severe inference latency and sampling instability. To address these limitations, we propose ZID-Net, a novel framework that explicitly decouples diffusion supervision from feed-forward inference. For efficient inference, we design a frequency-spatial decoupled feed-forward backbone. Within this backbone, a Channel-Spatial Laplacian Mask (CSLM) filters haze-amplified noise to extract purified structural details, while Lightweight Global Context Blocks (LGCBs) establish long-range spatial dependencies to capture the global variations of haze. A Dynamic Feature Arbitration Block (DFAB) then adaptively fuses these semantic and structural features for robust reconstruction. To provide this backbone with physical priors without the inference cost, we introduce a Zero-Inference Prior Propagation Head (ZI-PPH) during training. ZI-PPH leverages a conditional diffusion process to predict residual noise, providing degradation-aware structural supervision to the backbone. By discarding the diffusion branch at test time, ZID-Net integrates diffusion priors into a pure feed-forward architecture for accurate and efficient restoration. ZID-Net achieves 40.75 dB PSNR on the synthetic RESIDE dataset and outperforms existing methods with a 1.13 dB gain on real-world datasets. Additionally, it yields a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of just 19.35 ms. The project code is available at: https://github.com/XoomitLXH/ZID-Net.
[124] Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference cs.CV | cs.AI
Xiaowei Mao, Bowen Sui, Weijie Zhang, Yawen Yang, Shengnan Guo
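TL;DR: This paper proposes VIBES, a framework for far-field anomaly detection in expressway surveillance video. An online Bayesian inference module evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behavior, which localizes anomalous regions in space and time. A vision-language model then performs semantic reasoning only on those local regions, addressing attention dilution on far-field targets and high computational cost.
Details
Motivation: The work targets anomaly detection for far-field targets in expressway video, such as subtle abnormal vehicle motions. It aims to overcome the attention dilution and excessive computational cost that existing methods incur when processing full frames, and to improve cross-scene generalization.
Result: Extensive evaluations indicate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability and generalizing well across diverse expressway conditions.
Insight: The novelty lies in the asynchronous collaboration between online Bayesian inference and a vision-language model: probabilistic boundaries dynamically trigger VLM reasoning and focus it on local regions, balancing detection accuracy, computational efficiency, and cross-scene generalization.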
Abstract: Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.
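One way to picture a boundary of normal behavior that is updated online and doubles as a trigger is the minimal sketch below. It is not the VIBES module: the abstract gives no model details, so the sketch assumes a single scalar trajectory feature (for example speed), a conjugate Normal prior on its mean with known observation variance, and a hypothetical `k`-sigma predictive interval as the boundary.

```python
import math

class OnlineGaussianBoundary:
    """Conjugate Normal update of a mean with known observation variance (illustrative)."""

    def __init__(self, prior_mean, prior_var, obs_var, k=3.0):
        self.mean = prior_mean
        self.var = prior_var
        self.obs_var = obs_var
        self.k = k

    def bounds(self):
        # The posterior predictive for the next observation is Normal(mean, var + obs_var).
        half = self.k * math.sqrt(self.var + self.obs_var)
        return self.mean - half, self.mean + half

    def observe(self, x):
        """Return True if x falls outside the current boundary; otherwise absorb it."""
        lo, hi = self.bounds()
        if x < lo or x > hi:
            return True  # trigger: this region would be handed to the VLM
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var + x / self.obs_var) / precision
        self.var = 1.0 / precision
        return False
```

In this sketch, observations inside the boundary tighten the posterior, and an observation outside it is reported without being absorbed, so a single outlier does not widen the notion of normal behavior. Only triggered regions would need the expensive VLM call.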
[125] ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction cs.CV | cs.AI
Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen
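TL;DR: This paper proposes ESIA, an energy-based spatiotemporal interaction-aware framework for pedestrian intention prediction. It models intention prediction as a structured prediction problem over a unified graph representation: unary potentials on nodes capture individual intentions, pairwise potentials on edges encode social and environmental interactions, and a global energy function enforces scene-level behavioral consistency.
Details
Motivation: Existing work on pedestrian intention prediction is limited by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromises robustness and interpretability.
Result: Extensive experiments on standard benchmarks show that ESIA achieves state-of-the-art performance with better interpretability than existing methods.
Insight: The contributions are: 1) formulating intention prediction as a structured prediction problem based on conditional random fields, jointly modeling individual, social, and environmental factors; 2) introducing structural consistency terms that penalize logical contradictions without supervision; 3) proposing the Unary-Seeded Simulated Annealing algorithm, which uses high-confidence unary priors to converge quickly to a high-quality solution.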
Abstract: Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
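The combination of unary potentials, pairwise potentials, and annealing seeded from the unary terms can be sketched on a toy graph as follows. This is a generic illustration under stated assumptions, not the paper's U-SSA: the pairwise term is a simple Potts penalty (a cost `w` when two connected nodes disagree), the cooling schedule is linear, and all names and parameters are hypothetical.

```python
import math
import random

def energy(labels, unary, edges):
    """Total energy: unary cost of each node's label plus a penalty on disagreeing edges."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(w for i, j, w in edges if labels[i] != labels[j])
    return e

def unary_seeded_annealing(unary, edges, steps=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    # Seed: every node starts at its lowest-cost unary label.
    labels = [min(range(len(u)), key=u.__getitem__) for u in unary]
    cur_e = energy(labels, unary, edges)
    best, best_e = labels[:], cur_e
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-6  # linear cooling, kept strictly positive
        i = rng.randrange(len(labels))
        new = rng.randrange(len(unary[i]))
        if new == labels[i]:
            continue
        old = labels[i]
        labels[i] = new
        new_e = energy(labels, unary, edges)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if new_e <= cur_e or rng.random() < math.exp((cur_e - new_e) / t):
            cur_e = new_e
            if cur_e < best_e:
                best, best_e = labels[:], cur_e
        else:
            labels[i] = old
    return best, best_e
```

In a three-node chain where the third node weakly prefers a different label from its neighbors, the unary seed gives a locally plausible but globally inconsistent labeling, and the pairwise term pulls it back into agreement with the rest of the scene.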
[126] DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection cs.CV
Yanqi Wu, Xinhua Lu, Runhe Lai, Qichao Chen, Jia-Xin Zhuang
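TL;DR: This paper proposes DynProto, a new method for improving out-of-distribution (OOD) detection in vision-language models (VLMs). Its core idea is to learn OOD prototypes dynamically using only in-distribution (ID) information, treating easy-to-detect OOD samples as "anchors" to find similar samples that are harder to detect. This addresses the failure of existing methods when the predefined OOD label set is too limited.
Details
Motivation: Existing methods rely on potential OOD labels drawn from large corpora as auxiliary information, but they often fail when real-world OOD samples fall outside the predefined label set. The paper aims to overcome this limitation with a method that learns OOD prototypes dynamically at test time using only ID information, with no predefined OOD labels.
Result: DynProto significantly outperforms prior methods across multiple benchmarks. On the ImageNet OOD benchmark, it reduces FPR95 by 11.60% and improves AUROC by 4.70%, reaching a leading level of performance.
Insight: The novelty rests on the key observation that OOD samples predicted as the same ID class tend to cluster in feature space. Based on it, the paper designs two modules, coarse OOD pattern capturing and fine-grained OOD pattern refinement, that dynamically produce representative OOD prototypes. Viewed objectively, this framework, which evolves prototypes dynamically from ID information alone and needs no external OOD labels, is architecture-agnostic and can be integrated into various backbones, offering a new direction for OOD detection.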
Abstract: Recent studies show that using potential out-of-distribution (OOD) labels from large corpora as auxiliary information can improve OOD detection in vision-language models (VLMs). However, these methods often fail when real-world OOD samples fall outside the predefined OOD label set. To address this limitation, we propose DynProto, a novel approach that learns OOD prototypes dynamically during testing using only in-distribution (ID) information. DynProto is inspired by a key observation: OOD samples predicted as the same ID class tend to cluster in the feature space. With this insight, we leverage easy-to-detect OOD samples as "anchors" to find their harder-to-detect, similar counterparts. To this end, DynProto introduces two modules: the Coarse OOD Pattern Capturing Module caches OOD patterns that are easily confused with each ID class during testing, and the Fine-grained OOD Pattern Refinement Module subsequently clusters these patterns within each cache and aggregates them into representative OOD prototypes. By measuring similarity to ID and dynamic OOD prototypes, DynProto enables accurate OOD detection. DynProto significantly outperforms prior methods across multiple benchmarks. On the ImageNet OOD benchmark, DynProto reduces FPR95 by 11.60% and improves AUROC by 4.70%. Moreover, the framework is architecture-agnostic and can be integrated into various backbones.
[127] ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents cs.CV | cs.SE
Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu
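TL;DR: This paper presents ClawMark, a benchmark for evaluating multimodal coworker agents on multi-turn, multi-day work in a changing environment. It contains 100 tasks across 13 professional scenarios, scored against five stateful sandboxed services by 1,537 deterministic checkers, with no LLM acting as judge. Seven frontier agent systems were tested: the strongest model reaches a weighted score of 75.8, but the full-task success rate is only 20%, showing that partial progress is common while completing end-to-end workflows remains hard.
Details
Motivation: Existing benchmarks typically run in a single static environment and are text-centric, so they cannot adequately evaluate how coworker agents adapt to long-running, changing work environments (for example, incoming email, calendar changes, and changes to multimodal documents). A new evaluation framework is therefore needed.
Result: Seven frontier agent systems were evaluated on ClawMark. The strongest model reaches a weighted score of 75.8, but strict task success is only 20%. Performance drops markedly after the first exogenous environment update, highlighting adaptation to changing state as a key challenge.
Insight: The novelty lies in the multi-turn, multi-day tasks, the stateful sandbox environment (filesystem, email, calendar, and so on), and the rule-based verification system. Viewed objectively, the study is the first to systematically quantify the adaptation bottleneck of coworker agents in long-running dynamic environments, and it provides a key evaluation tool for real-world deployment.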
Abstract: Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a weighted score of 75.8, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
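The gap between a high weighted score and a low strict success rate follows from how the two are aggregated, which the sketch below illustrates. The abstract does not define the weighting scheme or the checker interface, so the `(weight, predicate)` pairs, the state dictionary, and the 0-100 scale are assumptions made for illustration.

```python
def score_task(state, checkers):
    """Score one task from post-execution state.

    checkers: list of (weight, predicate) pairs, where each predicate is a
    deterministic function of the state (hypothetical interface).
    Returns (weighted score on a 0-100 scale, strict success flag).
    """
    results = [(weight, bool(check(state))) for weight, check in checkers]
    total = sum(weight for weight, _ in results)
    weighted = 100.0 * sum(weight for weight, ok in results if ok) / total
    strict = all(ok for _, ok in results)
    return weighted, strict
```

A task with three of four weight units satisfied scores 75.0 but still counts as a strict failure, so an agent can accumulate substantial partial credit across a benchmark while completing few workflows end to end.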
[128] MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks cs.CV | cs.HC
Jui-Cheng Chiu, Yu-Chao Wang, Shengyang Luo, Tongyan Wang, Qi Yang
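TL;DR: MIRAGE is a framework for analyzing micro-interactions between figures in multi-figure paintings. It builds a structured intermediate representation capturing identities, poses, and gaze hypotheses, separating spatial grounding from narrative generation and improving the reliability and transparency of vision-language model interpretations of complex art scenes.
Details
Motivation: When handling multi-figure paintings, existing vision-language models often produce unreliable, ungrounded narrative interpretations because they lack traceable visual evidence, and they cannot systematically identify the interactions that figures establish through subtle cues such as gaze, gesture, and spatial arrangement.
Result: Under a blind assessment protocol, MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases coverage of subtle interactions compared with painting-only VLM baselines.
Insight: The novelty is the structured intermediate representation used as a verifiable evidence layer, which decouples spatial grounding from narrative generation. This gives complex visual narratives a more reliable, transparent, and human-centered interaction control layer, and may transfer to visual understanding tasks that need fine-grained relational reasoning.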
Abstract: Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these “micro-interactions” in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.
[129] MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation cs.CV
Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei
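TL;DR: This paper introduces MuSS, a large-scale dual-track dataset designed for multi-shot video and subject-to-video (S2V) generation, targeting three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma in S2V generation. Sourced from over 3,000 movies, the dataset supports complex montage transitions and subject-centric narratives. The authors also propose the Cinematic Narrative Benchmark, with a visual-logic-driven paradigm and a new Anti-Copy-Paste Variance (ACP-Var) metric, to rigorously assess continuous storytelling and 3D structural consistency. Experiments show that a MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Details
Motivation: Current video foundation models excel at single-shot generation, but real-world cinematic storytelling relies on complex multi-shot sequences. Progress is limited by the absence of datasets that address authentic narrative logic, spatiotemporal alignment conflicts, and the "copy-paste" dilemma in S2V generation.
Result: Extensive experiments on the proposed Cinematic Narrative Benchmark show that current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, whereas the MuSS-augmented model achieves state-of-the-art (SOTA) narrative effectiveness and cross-shot identity preservation.
Insight: The contributions include: 1) the large-scale, dual-track MuSS dataset, which explicitly supports complex montage and subject-centric narratives; 2) a progressive captioning pipeline that ensures local shot-level accuracy before enforcing global narrative coherence, eliminating contextual conflicts; 3) a cross-shot matching mechanism intended to eliminate the S2V copy-paste shortcut at its root; 4) the visual-logic-driven Cinematic Narrative Benchmark and the new ACP-Var metric for evaluating continuous storytelling and 3D consistency.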
Abstract: While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the “copy-paste” dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
[130] VitaminP: cross-modal learning enables whole-cell segmentation from routine histology cs.CV
Yasin Shokrollahi, Karina B. Pinao Gonzales, Elizve N. Barrientos Toro, Paul Acosta, Patient Mosaic Team
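TL;DR: VitaminP is a cross-modal learning framework that enables whole-cell segmentation from routine H&E-stained histopathology images. Using paired H&E-mIF data, it transfers molecular boundary information from mIF to H&E images to overcome the limited cytoplasmic contrast of H&E.
Details
Motivation: Routine H&E staining has limited cytoplasmic contrast, which restricts analysis to nuclei, while mIF is costly and not widely accessible. The goal is accurate whole-cell segmentation from widely available H&E images.
Result: Trained on 14 public datasets covering 34 cancer types and over 7 million instances, VitaminP outperforms four state-of-the-art methods and generalizes well to unseen datasets, including an in-house dataset spanning 24 rare cancer types.
Insight: The novelty is proposing cross-modal supervision as a general strategy for recovering missing biological structure, learning boundary information from mIF to strengthen segmentation on H&E images. The authors also build a large-scale segmentation resource and develop the open-source platform VitaminPScope to encourage broad adoption.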
Abstract: Accurate whole-cell and nuclear segmentation is essential for precision pathology and spatial omics, yet routine hematoxylin and eosin (H&E) staining provides limited cytoplasmic contrast, restricting analyses to nuclei. Multiplex immunofluorescence (mIF) facilitates precise whole-cell delineation but remains constrained by cost and accessibility. We introduce VitaminP, a cross-modal learning framework enabling whole-cell segmentation from H&E images. By learning from paired H&E-mIF data, VitaminP transfers molecular boundary information from mIF to overcome the limited cytoplasmic contrast in H&E, establishing cross-modal supervision as a general strategy for recovering missing biological structure. We train VitaminP on 14 public datasets covering 34 cancer types and over 7 million instances, integrating publicly available labels with extensive annotations generated in this study, forming one of the largest resources for segmentation. VitaminP outperforms four state-of-the-art methods and generalizes to unseen datasets, including an in-house dataset spanning 24 rare cancer types. We further developed VitaminPScope, an open-source platform providing an interface for scalable inference and enabling broad adoption.
[131] Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction cs.CV
Jan Warchocki, Xi Wang, Jonas Kulhanek, Jan van Gemert
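TL;DR: This paper evaluates dynamic monocular 3D Gaussian Splatting (3DGS) models for scene reconstruction on egocentric and exocentric video, using paired recordings from the EgoExo4D dataset. Reconstruction quality is consistently lower for egocentric views, and the gap stems mainly from reconstruction of static rather than dynamic content.
Details
Motivation: Egocentric video is increasingly important for augmented reality, robotics, and assistive technologies, but its rapid camera motion and complex scene dynamics make 3D reconstruction challenging. Existing dynamic monocular 3DGS models are rarely evaluated on egocentric video, so it is unclear whether they generalize to this setting or whether dedicated solutions are needed.
Result: Evaluation on EgoExo4D shows that reconstruction quality for egocentric views, measured by peak signal-to-noise ratio, is consistently lower than for exocentric views, and that the difference comes mainly from static content.
Insight: The novelty is the first systematic evaluation of dynamic 3DGS on egocentric video, which reveals static content reconstruction as the main bottleneck, underscores the need for egocentric-specific methods, and shows the value of evaluating static and dynamic video regions separately.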
Abstract: Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.
[132] ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction cs.CV | cs.CL
Zichun Guo, Yuling Shi, Wenhao Zeng, Chao Hu, Haotian Lin
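TL;DR: This paper introduces ShredBench, a benchmark for evaluating the semantic reasoning of multimodal large language models (MLLMs) on reconstructing documents from shredded fragments. An automated generation pipeline renders fragmented documents directly from Markdown, covering four scenarios (English, Chinese, code, and tables) and three fragmentation granularities. Experiments show that current MLLMs perform well on intact documents but degrade sharply once documents are fragmented, exposing weaknesses in fine-grained cross-modal reasoning.
Details
Motivation: MLLMs perform strongly on visually rich document understanding tasks, but they are mostly evaluated on intact, well-structured document images. There is no systematic evaluation of challenging settings such as reconstruction from shredded fragments, which require combining visual pattern recognition with semantic reasoning.
Result: Empirical evaluation of SOTA MLLMs on ShredBench shows that the models are effective on intact documents, but restoration performance falls substantially once a document is shredded, with normalized edit distance deteriorating sharply as the number of fragments grows, indicating that current models struggle with visual discontinuities.
Insight: The novelty is a fragmented-document benchmark that can be generated automatically and guards against training data contamination, together with evidence of a key deficiency of MLLMs in fine-grained cross-modal reasoning, pointing the way for robust VRDU research.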
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of recent or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: the models are effective on intact documents, but once a document is shredded, restoration becomes a significant challenge, with NED (normalized edit distance) dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
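For readers unfamiliar with the metric, a normalized edit distance between a restored text and its reference can be computed as in the sketch below. The abstract does not give ShredBench's exact definition (for example whether it reports the distance itself or a similarity derived from it), so this is one common convention, assumed here: Levenshtein distance divided by the length of the longer string.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

Under this convention 0.0 means a perfect restoration and 1.0 means nothing was recovered. A benchmark that wants a higher-is-better score can report one minus this value.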
[133] Latent Inter-Frame Pruning: A Training-Free Method Bridging Traditional Video Compression and Modern Diffusion Transformers for Efficient Generation cs.CV
Dennis Menn, Chih-Hsien Chou
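TL;DR: This paper proposes a training-free Latent Inter-frame Pruning framework that identifies and prunes redundant latent patches along the temporal axis of videos encoded by latent diffusion models, reducing computation and raising generation throughput. To address the visual artifacts caused by direct pruning, it introduces an Attention Recovery mechanism that bridges the gap between training and inference.
Details
Motivation: Video generation can produce realistic videos but is computationally expensive and slow, which hinders real-time use. The authors observe that video latents encoded by the autoencoder in the latent diffusion model framework are redundant along the temporal axis and, by analogy with traditional video compression algorithms that avoid transmitting redundant frame data, aim to cut computation through pruning.
Result: The method increases video editing throughput by 1.44×, reaching 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality.
Insight: The novelty is bringing ideas from traditional video compression, such as eliminating inter-frame redundancy, into modern diffusion-transformer video generation through training-free latent inter-frame pruning with attention recovery, suggesting a way to integrate traditional video compression methods with modern generation pipelines.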
Abstract: Video generation, while capable of generating realistic videos, is computationally expensive and slow, prohibiting real-time applications. In this paper, we observe that video latents encoded via an autoencoder under the Latent Diffusion Model (LDM) framework contain redundancy along the temporal axis. Analogous to how traditional video compression algorithms avoid transmitting redundant frame data, we propose the Latent Inter-frame Pruning framework to prune (skip the re-computation of) duplicated latent patches, thereby reducing computational burden and increasing throughput. However, direct pruning results in visual artifacts due to the discrepancy between full-sequence training and pruned inference. To resolve these artifacts, we propose an Attention Recovery mechanism to bridge the train-inference gap. With our proposed method, we increase video editing throughput by 1.44×, achieving 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality. We hope our work inspires further research into integrating traditional video compression methods with modern video generation pipelines. This is preliminary work on training-free Latent Inter-Frame Pruning with Attention Recovery.
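The idea of skipping re-computation for duplicated latent patches can be sketched as a per-patch change test against the previous frame with a cache of earlier results. This is a toy illustration under stated assumptions, not the paper's method: frames are lists of patches, patches are lists of floats, `tol` and `compute` are hypothetical, and the Attention Recovery step that the authors use to remove artifacts is not modeled.

```python
def prune_mask(prev_frame, cur_frame, tol=1e-3):
    """Return one boolean per patch: True means the patch changed and must be recomputed."""
    mask = []
    for p_prev, p_cur in zip(prev_frame, cur_frame):
        diff = max(abs(a - b) for a, b in zip(p_prev, p_cur))
        mask.append(diff > tol)
    return mask

def process_video(frames, compute, tol=1e-3):
    """Run `compute` on changed patches only and reuse the cached result otherwise."""
    outputs, calls = [], 0
    cache = []
    for t, frame in enumerate(frames):
        if t == 0:
            # The first frame has no predecessor, so every patch is computed.
            cache = [compute(p) for p in frame]
            calls += len(frame)
        else:
            for i, changed in enumerate(prune_mask(frames[t - 1], frame, tol)):
                if changed:
                    cache[i] = compute(frame[i])
                    calls += 1
        outputs.append(list(cache))
    return outputs, calls
```

The returned call count makes the saving visible: for a mostly static clip it stays close to the number of patches in a single frame instead of growing with the number of frames.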
[134] Exploring Audio Hallucination in Egocentric Video Understanding cs.CV | cs.AI
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang
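TL;DR: This paper studies audio hallucination in large audio-visual language models (AV-LLMs) for egocentric video understanding, where models tend to infer sounds from visual cues that are visible but not heard. The authors propose a systematic, automatic evaluation framework that analyzes audio hallucination through a targeted question-answering protocol, and build a dataset of 300 videos and 1,000 sound-focused questions. The evaluation finds that advanced AV-LLMs such as Qwen2.5 Omni perform poorly with respect to audio hallucination.
Details
Motivation: In egocentric video, sound is a key cue for understanding user activities and surroundings, especially when visual information is unstable or occluded. However, current state-of-the-art AV-LLMs are prone to audio hallucination when generating multimodal descriptions, wrongly inferring sounds from visual information, which undermines their reliability.
Result: On the authors' dataset, advanced AV-LLMs such as Qwen2.5 Omni reach only 27.3% and 39.5% accuracy on questions about foreground action sounds and background ambient sounds, respectively, indicating high hallucination rates.
Insight: The novelty is the first systematic exploration and evaluation of audio hallucination in AV-LLMs on egocentric video, with an automatic evaluation framework based on a targeted question-answering protocol and a fine-grained taxonomy that separates foreground from background sounds. The work stresses that multimodal model development needs robust hallucination evaluation to measure response reliability.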
Abstract: Egocentric videos provide a distinctive setting in which sound serves as a crucial cue for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user's activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.
[135] AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance cs.CV
Benjamin Klein, Kazi Ruslan Rahman, Sanchita Ghose
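TL;DR: AMAVA is an adaptive, motion-aware video-to-audio framework designed to help visually impaired people perceive dynamic environments. A lightweight AI model distinguishes low-motion from high-motion scenes. AMAVA generates spoken descriptions in static environments and prioritizes safety-oriented sound cues in high-motion scenes, reducing cognitive load and improving environmental awareness.
Details
Motivation: Existing navigation aids struggle to convey dynamic real-world environments, and their continuous, undifferentiated feedback causes cognitive overload for visually impaired users. A real-time system is needed that adapts to environmental change and provides contextually relevant audio feedback.
Result: In a real-time navigation study, using AMAVA together with a white cane significantly increased user confidence and perceived safety compared with a white cane alone, indicating the framework's effectiveness in practice.
Insight: The contributions include a motion-aware adaptive audio generation pipeline, a vision-language model with mixture-of-experts and cross-modal attention, and prompt-based caching with category-specific throttling to reduce auditory clutter and latency. These designs may be useful for multimodal assistive systems and real-time environment perception applications.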
Abstract: Navigational aids for blind and low vision individuals struggle to convey dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline that uses a lightweight AI classification model to distinguish between low and high-movement scenes, followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone versus with AMAVA, which shows a significant increase in user confidence and perceived safety.
[136] 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA cs.CV
Zhiyu Wang, Xudong Kang, Shutao Li
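TL;DR: This paper proposes ASR-SaSaSa2VA, a framework for audio-guided video object segmentation. It converts audio input into textual motion descriptions via automatic speech recognition, then uses a pretrained text-based referring video segmentation model for pixel-level prediction, and adds an audio-based multimodal large language model that detects audio clips with no target expression to improve robustness.
Details
Motivation: Existing audio-driven video segmentation methods are computationally heavy, struggle to align temporal audio cues with dynamic video content, and depend on large paired audio-video datasets.
Result: The method achieved a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), placing second.
Insight: The novelty is converting the audio modality into text descriptions so that strong pretrained text-video segmentation models can be reused, avoiding the high computational cost of end-to-end multimodal fusion. A no-target expression detection module handles ambiguous or irrelevant audio input, making the system more practical.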
Abstract: Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
[137] GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction cs.CV
Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
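TL;DR: This paper proposes GoClick, a lightweight vision-language model with only 230M parameters for graphical user interface (GUI) element grounding, addressing the fact that existing large models cannot be deployed on resource-constrained devices such as mobile phones. With an encoder-decoder architecture and a Progressive Data Refinement pipeline, the model reaches grounding accuracy comparable to much larger models while staying small and fast at inference.
Details
Motivation: Current vision-language models for GUI element grounding are typically very large (more than 2.5B parameters) and cannot be deployed with low latency on mobile phones and other resource-constrained devices, so lightweight and efficient alternatives are needed.
Result: Experiments show that GoClick performs well on multiple GUI element grounding benchmarks, with grounding accuracy comparable to significantly larger models while maintaining a small model size and high inference speed. When integrated into a device-cloud collaboration framework, it also helps cloud-based task planners localize elements more precisely, raising task success rates.
Insight: The contributions are: 1) showing that, for GUI grounding at small parameter scales, an encoder-decoder architecture outperforms a decoder-only one; 2) a Progressive Data Refinement pipeline that uses task type filtering and data ratio adjustment to extract a high-quality core training set from a large raw dataset, improving small-model performance. This offers a useful direction for developing lightweight GUI agent models.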
Abstract: Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge: current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.
[138] LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CVPDF
Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang
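TL;DR: This paper proposes LearnPruner, a two-stage visual token pruning framework that targets the heavy computational burden that long visual input sequences impose on vision-language models. It first removes redundant vision tokens with a learnable pruning module placed after the vision encoder, then retains only task-relevant tokens in a middle layer of the LLM. Experiments show that it preserves about 95% of the original performance while using only 5.5% of the vision tokens, with a 3.2× inference speedup.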
Details
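Motivation: Vision-language models perform well at visual understanding and reasoning, but long visual input sequences make them computationally expensive. Existing token pruning methods based on attention scores have limitations: vision encoders suffer from attention sink, whereas in LLMs, although attention is biased toward token positions, text-to-vision attention resists this bias and can guide pruning effectively.
Result: In the reported experiments, LearnPruner preserves about 95% of the original model's performance while using only 5.5% of the vision tokens and achieves a 3.2× inference speedup, a superior accuracy-efficiency trade-off.
Insight: The contribution is a systematic analysis of how well attention in vision encoders and in LLMs guides token pruning, and a two-stage pruning framework designed from it. The core insight is that vision-encoder attention suffers from attention sink, while text-to-vision attention in the LLM's middle layers resists positional bias and so guides pruning more effectively. Reusable ideas are splitting pruning into a redundancy-removal stage and a task-relevant retention stage, and introducing a learnable pruning module.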
Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM’s middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
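The score-based selection step that token pruning relies on can be illustrated with a minimal sketch. It assumes one importance score per vision token is already available (for example, averaged text-to-vision attention); the function name `keep_top_tokens` and its parameters are hypothetical, and the sketch does not reproduce LearnPruner's learnable pruning module or its layer choice.

```python
import math


def keep_top_tokens(scores, keep_ratio):
    """Return the sorted indices of the highest-scoring vision tokens.

    scores:     one importance score per token (hypothetical input, e.g.
                text-to-vision attention averaged over heads and text tokens).
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token so the sequence never becomes empty.
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so positional structure is preserved.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    print(keep_top_tokens([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5))
```

Returning the kept indices in their original order is a design choice of this sketch: downstream layers then see the surviving tokens in the same relative positions as before pruning.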
[139] LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization cs.CVPDF
Bokang Zeng, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Jiaojiao Jiang
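TL;DR: This paper proposes LAVA, a layered audio-visual anti-tampering watermarking framework for robust deepfake detection and localization. Through cross-modal watermark fusion and calibration-aware alignment, it addresses the fragile localization of existing methods, which stems from decoupled audio and visual evidence, watermark signals that become unreliable under real-world degradations, and compression distortions.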
Details
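Motivation: Existing proactive watermarking methods for deepfake detection and localization often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, which leaves them vulnerable to multimodal misalignment and compression distortions. In addition, existing semi-fragile visual watermarking methods degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions.
Result: Extensive experiments show that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.
Insight: The contribution is a calibration-aware audio-visual watermark fusion framework whose layered design and cross-modal fusion keep tamper evidence consistent under compression and audio-visual asynchrony, making localization more robust. Viewed objectively, decoupling watermark embedding from compression-sensitive frequency bands is a useful idea for improving compression resistance.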
Abstract: Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.
[140] Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification cs.CVPDF
Xiaoliu Luo, Minxue Xiao, Ting Xie, Mengzhu Wang, Huiqing Qi
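TL;DR: This paper proposes Multi-View Synergistic Learning (MVSL), a unified framework for biomedical image classification under low-resource conditions. By decoupling the adaptation of the visual and textual encoders, introducing multi-granularity contrastive learning, and using structured supervision from large language models, it achieves more stable cross-modal alignment and better fine-grained discrimination.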
Details
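Motivation: The motivation is to address limited annotations, subtle inter-class visual differences, and complex disease semantics in low-resource biomedical image classification, and to overcome the limitations of existing vision-language models in parameter-efficient tuning and in learning fine-grained, semantically consistent representations.
Result: Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions show that MVSL consistently outperforms state-of-the-art methods in both few-shot and zero-shot classification settings.
Insight: Contributions include: decoupling the adaptation of the visual and textual encoders to respect their distinct representational characteristics; explicitly modeling both global image semantics and localized lesion-level evidence through multi-granularity contrastive learning; and preserving disease-level semantic structure with structured supervision generated by large language models, which indirectly regularizes the visual embeddings.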
Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision–language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on $11$ public biomedical datasets spanning $9$ imaging modalities and $10$ anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.
[141] Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis cs.CVPDF
Xuemei Qiu, Dawei Fan, Yebin Huang, Yanping Chen, Lifang Wei
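TL;DR: This paper proposes Hierarchical Prototype-based Domain Priors (HPDP), a unified multimodal framework for joint histopathology diagnosis and prognosis. It introduces interpretable morphological priors and spatial structure modeling through a Morphologically Anchored Prototype System (MAPS) and a Sinusoidal Positional Encoder (SPE), and bridges the semantic gap with a Hierarchical Cross-Modal Alignment (HCMA) module that uses descriptions generated by a large language model (LLM), improving the understanding of complex tumor microenvironments in whole slide images.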
Details
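Motivation: Existing pathology image analysis methods based on Multiple Instance Learning (MIL) typically treat a whole slide image as an unordered set of patches, ignoring critical morphological semantics and spatial geometry. As a result, they tend to overfit to background noise, and their visual features are disconnected from high-level diagnostic knowledge.
Result: Extensive experiments on seven cancer cohorts show that HPDP achieves state-of-the-art (SOTA) performance on both diagnosis and prognosis tasks, with superior robustness and interpretability.
Insight: The contribution is introducing morphological priors (MAPS) and spatial structure encoding (SPE) into the MIL framework as inductive biases to counter the data-driven "black box" problem, and using LLM-generated text descriptions with a hierarchical alignment mechanism (HCMA) for cross-modal semantic guidance, which improves both interpretability and performance.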
Abstract: Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven “black box” issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.
[142] SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs cs.CVPDF
Zi-Hao Bo, Yaqian Li, Anzhou Hou, Rinyoichi Takezoe, Ertao Zhao
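TL;DR: This paper proposes SMoES (Soft Modality-guided Expert Specialization), which improves expert routing in Mixture-of-Experts (MoE) large vision-language models (VLMs). It captures layer-dependent modality fusion patterns with dynamic soft modality scores and combines an expert binning mechanism with inter-bin mutual information regularization to guide experts toward modality specialization. Experiments on four MoE-VLMs and 16 benchmarks confirm gains in both effectiveness and efficiency.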
Details
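Motivation: Routing strategies in existing MoE-VLMs are mostly hand-crafted or modality-agnostic. They ignore the layer-dependent modality fusion patterns in the model and cannot effectively guide expert specialization, so a method that uses modality-specific signals to guide routing is needed.
Result: Experiments on four MoE-VLMs and 16 benchmarks show average gains of 0.9% on multimodal tasks and 4.2% on language-only tasks, a 56.1% reduction in expert-parallel communication overhead, and a 12.3% throughput improvement under realistic deployment, confirming the method's effectiveness and efficiency.
Insight: The contribution is introducing dynamic soft modality scores to capture layer-dependent fusion patterns and combining expert binning with mutual information regularization to promote modality specialization. Viewed objectively, the method aligns the routing strategy with modality-aware expert specialization, offering a new approach to improving the capacity and efficiency of MoE-VLMs.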
Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.
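To make the mutual information term concrete, the sketch below computes the mutual information between a token's modality and the expert bin it is routed to, from a joint table of routing mass. This is an assumption about which quantity is being regularized: the abstract does not give the exact objective, and the `joint[m][b]` layout and the function name are hypothetical.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint table.

    joint[m][b] is the probability mass (or raw count) of tokens with
    modality m that are routed to expert bin b. This layout is an
    assumption made for illustration only.
    """
    total = sum(sum(row) for row in joint)
    # Marginals over modalities (rows) and expert bins (columns).
    p_m = [sum(row) / total for row in joint]
    p_b = [sum(row[j] for row in joint) / total for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, value in enumerate(row):
            if value > 0:  # zero cells contribute nothing
                p = value / total
                mi += p * math.log(p / (p_m[i] * p_b[j]))
    return mi


if __name__ == "__main__":
    specialized = [[0.5, 0.0], [0.0, 0.5]]   # each modality owns one bin
    mixed = [[0.25, 0.25], [0.25, 0.25]]     # routing ignores modality
    print(mutual_information(specialized), mutual_information(mixed))
```

Under this reading, fully specialized routing gives the maximum value (log 2 for two balanced modalities) and modality-agnostic routing gives zero, which is why such a term can serve as a specialization signal.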
[143] ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services cs.CVPDF
Fengxian Ji, Jingpu Yang, Zirui Song, Lang Gao, Junhong Liang
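TL;DR: This paper presents ServImage, a benchmark for assessing the economic value of image generation and editing models in real commercial design projects. It comprises a dataset of 1.07k paid commercial design tasks (ServImageBench), an integrated scoring system that combines three quality dimensions (ServImageScore), and a payment prediction model built on that scoring system (ServImageModel), which reaches 82.00% accuracy in predicting human payment decisions.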
Details
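Motivation: Current image generation and editing models perform well on academic benchmarks, but their performance and economic value on paid, real-world commercial design projects remain unclear, so a benchmark that links model outputs to economic value is needed.
Result: On the ServImage benchmark, the proposed payment prediction model (ServImageModel) reaches 82.00% accuracy in predicting human payment decisions and produces calibrated payment probabilities.
Insight: The contribution is building, for the first time, a benchmark that directly links image generation model outputs to commercial economic value. It quantifies commercial acceptability by integrating three dimensions (baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction), providing a comprehensive basis for assessing the commercial viability of models.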
Abstract: Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce ServImage, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) ServImageBench: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over $295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) ServImageScore: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) ServImageModel: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems. GitHub: https://github.com/FengxianJi/ServImage
[144] DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery cs.CV | cs.CL | cs.IR | cs.MMPDF
Jiawei Wang, Ming Lei, Yaning Yang, Xinyan Lin, Yuquan Le
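TL;DR: DeepTaxon is a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable retrieval and reasoning. It fetches exemplar images of candidate species from a retrieval index and performs chain-of-thought comparative reasoning, recasting discovery as a retrieval-based decision problem rather than a parametric memory problem, so that classification or discovery labels are generated automatically without manual annotation.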
Details
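Motivation: The work addresses the challenge, in biodiversity research, of identifying known species among tens of thousands of visually similar taxa while discovering unknown species in open-world environments. Existing methods treat identification and discovery as separate problems: classification models assume closed sets, while discovery relies on threshold-based rejection.
Result: Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets show consistent improvements on both identification and discovery. Ablation studies further show effective test-time scaling with the candidate count k and exemplar count n, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders.
Insight: The core contribution is recasting species discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem, which unifies identification and discovery. Retrieval-augmented supervised fine-tuning followed by reinforcement learning converts high-recall retrieval into high-precision decisions and yields an interpretable reasoning process.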
Abstract: Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
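The automatic supervision described in the abstract can be sketched as a small labeling rule. The sketch assumes that "sufficient evidence" simply means the true species appears among the retrieved top-k candidates; that reading, the function name `supervision_label`, and the tuple format are illustrative assumptions, not the authors' implementation.

```python
def supervision_label(true_species, retrieved_candidates):
    """Derive a training target from one retrieval result.

    true_species:         ground-truth species name of the query image.
    retrieved_candidates: species names returned by the retrieval index.

    Assumption: the index holds "sufficient evidence" exactly when the
    true species is among the retrieved candidates.
    """
    if true_species in retrieved_candidates:
        # The index can support identification: supervise a classification.
        return ("identify", true_species)
    # The index lacks the species: the sample is novel relative to it.
    return ("discover", None)


if __name__ == "__main__":
    candidates = ["Parus major", "Cyanistes caeruleus", "Periparus ater"]
    print(supervision_label("Parus major", candidates))
    print(supervision_label("Poecile montanus", candidates))
```

Because the target depends only on the query's known species and the retrieval output, holding a species out of the index is enough to produce discovery examples without extra annotation.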
[145] Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues cs.CV | eess.IVPDF
Beomchan Park, Seongho Kim, Hyunjun Kim, Sungjune Park, Yong Man Ro
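TL;DR: This paper proposes Language-Guided Semantic Cues (LGSCs), a new method for improving the grounding robustness of multimodal large language models (MLLMs) in crowded scenes. A Semantic Cue Extractor (SCE) derives object semantic cues from the MLLM's visual pipeline; these cues are guided by the corresponding text embeddings to produce LGSCs as linguistic semantic priors, which are then reintegrated into the visual pipeline to refine object semantics, addressing the challenges posed by occlusion and small objects.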
Details
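Motivation: MLLMs have enhanced grounding capabilities in general scenes, but their robustness in crowded scenes, which pose visual challenges such as occlusion and small objects, remains underexplored. These challenges impair object semantics and degrade grounding performance.
Result: Extensive experiments and analyses show that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
Insight: The contribution is exploiting the fact that language expressions are immune to visual degradation: language-guided semantic cues (LGSCs) serve as priors that strengthen visual semantics and so improve robustness under difficult visual conditions. Viewed objectively, the method combines the strengths of the visual and language modalities and offers a novel cross-modal enhancement approach to the problems of occlusion and small objects.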
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
[146] QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering cs.CV | cs.AIPDF
Woojun Jung, Junyeong Kim
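TL;DR: This paper proposes QEVA, a reference-free evaluation metric for narrative video summarization that uses multimodal question answering to assess a candidate summary directly against the source video along three dimensions: Coverage, Factuality, and Chronology. The authors also build the MLVU(VS)-Eval benchmark, which contains 800 summaries of 200 videos, for transparent evaluation. Experiments show that QEVA correlates better with human judgments than existing methods.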
Details
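Motivation: Traditional metrics based on n-gram overlap and recent methods based on large language models depend heavily on human-written reference summaries, which limits their practicality and their sensitivity to nuanced semantic aspects. An evaluation method that does not rely on reference summaries is therefore needed.
Result: On the MLVU(VS)-Eval benchmark, QEVA shows higher correlation with human judgments than existing methods, as measured by Kendall's τ_b, Kendall's τ_c, and Spearman's ρ.
Insight: The contribution is a reference-free evaluation framework based on multimodal question answering that turns summary evaluation into quantitative measurements of coverage, factuality, and chronology, together with a public benchmark to support research in this area.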
Abstract: Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall’s $τ_b$, $τ_c$, and Spearman’s $ρ$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
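For readers unfamiliar with the correlation measures named above, the following minimal sketch computes Kendall's $τ_b$ between metric scores and human ratings in pure Python. It is a textbook O(n²) implementation for illustration, not the authors' evaluation code; the example scores are invented, and $τ_c$ and Spearman's $ρ$ are not covered.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both rankings: excluded entirely
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1     # same ordering in both rankings
        else:
            discordant += 1     # opposite ordering
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    metric_scores = [0.62, 0.80, 0.45, 0.91]   # invented metric outputs
    human_ratings = [3, 4, 2, 5]               # invented human scores
    print(kendall_tau_b(metric_scores, human_ratings))
```

The tie correction in the denominator matters for human ratings on coarse scales, where many summaries share the same score.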
[147] SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation? cs.CVPDF
Yichi Zhang, Le Xue, Bichun Xu, Judong Luo, Zhigang Wu
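TL;DR: This paper proposes SemiSAM-O1, an annotation-efficient medical image segmentation framework that performs semi-supervised learning with only one annotated template image. By fully exploiting the feature representation capability of a foundation model, it extends the specialist-generalist collaborative learning framework to the extreme one-label setting, with two stages: coarse pseudo-label generation and iterative training and refinement. The goal is to narrow the performance gap between one-label semi-supervised learning and full supervision.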
Details
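Motivation: Deep learning-based medical image segmentation models carry a heavy annotation burden, and existing foundation model-driven semi-supervised methods are not robust in extremely limited annotation scenarios, especially for complex imaging modalities. The paper therefore explores how to segment effectively when only one annotated image is available.
Result: Extensive experiments on segmentation tasks across multiple modalities and anatomical targets show that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision while significantly reducing the computational overhead of online foundation model inference.
Insight: The contribution is pushing the specialist-generalist collaborative learning framework to the extreme one-label setting and fully exploiting the feature representation capability of the foundation model's encoder, beyond its prompting interface, to generate and iteratively refine pseudo-labels. In particular, an uncertainty-guided refinement step uses the global feature space to aggregate information from similar neighbors and correct high-uncertainty regions, forming a virtuous cycle in which the model and the pseudo-labels improve each other. Viewed objectively, treating the foundation model as a feature extractor rather than only a prompting interface, combined with iterative refinement driven by uncertainty estimates, is the key to using so few annotations efficiently.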
Abstract: Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model’s feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model’s encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model’s global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
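The first stage (class prototypes from the annotated template, propagated by feature similarity) can be sketched as follows. The sketch assumes per-voxel feature vectors are already extracted, uses mean features as prototypes and cosine similarity for matching; these choices and all function names are assumptions for illustration, since the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def class_prototypes(features, labels):
    """Average the template's features per class to get one prototype each."""
    sums, counts = {}, {}
    for feat, cls in zip(features, labels):
        acc = sums.setdefault(cls, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: [v / counts[cls] for v in acc] for cls, acc in sums.items()}


def pseudo_label(feature, prototypes):
    """Assign the class whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda cls: cosine(feature, prototypes[cls]))


if __name__ == "__main__":
    # Toy 2-D features standing in for encoder outputs of the template.
    template_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    template_labels = ["organ", "organ", "background"]
    protos = class_prototypes(template_feats, template_labels)
    print(pseudo_label([0.8, 0.2], protos), pseudo_label([0.1, 0.9], protos))
```

Labels produced this way are coarse by construction, which is consistent with the abstract's second stage refining them over multiple training rounds.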
[148] TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations cs.CVPDF
Yifeng Bai, Zhirong Chen, Erkang Cheng, Haibin Ling
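TL;DR: This paper proposes TopoHR, an end-to-end framework for topology reasoning in driving scenes. Through a hierarchical centerline representation (point queries, instance queries, and semantic representations) and a hierarchical topology reasoning module, it establishes cyclic interaction between centerline detection and topology reasoning so that the two enhance each other.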
Details
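Motivation: Existing methods focus mainly on instance-level centerline detection followed by sequential topology reasoning with simplified MLP layers, and they neglect the importance of point-to-instance (P2I) relationships in topology reasoning. TopoHR aims to address these limitations and improve performance by establishing cyclic interaction between centerline detection and topology reasoning.
Result: On the OpenLane-V2 benchmark, TopoHR sets a new state of the art: on subset_A, DET_l improves by +3.8 and TOP_ll by +5.4; on subset_B, DET_l improves by +11.0 and TOP_ll by +7.9.
Insight: Contributions include: 1) a hierarchical centerline representation and decoder that integrate point, instance, and semantic features; 2) a hierarchical topology reasoning module that captures fine-grained P2I relationships and global I2I connections in a unified architecture; 3) a cyclic interaction mechanism between centerline detection and topology reasoning within the end-to-end framework, which lets the two refine each other iteratively.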
Abstract: Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{l}$, +5.4 in $\mathrm{TOP}_{ll}$ on subset_A and +11.0 in $\mathrm{DET}_{l}$, +7.9 in $\mathrm{TOP}_{ll}$ on subset_B, validating the effectiveness of the proposed components. The code will be shared publicly at https://github.com/Yifeng-Bai/TopoHR.git.
[149] FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs cs.CVPDF
Jiayi Wang, Lichun Zhang, Xiaoqi Zhuang, Jiaqi Zhang, Lu Yu
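TL;DR: This paper proposes FDIM, a feature-distance-based generic video quality metric that applies to both traditional and neural video codecs and supports SDR and HDR formats. FDIM uses a hybrid architecture that combines deep and hand-crafted features to capture distortions ranging from structural and textural degradation to high-level semantic deviations.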
Details
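Motivation: As video technology advances toward ultra high definition and high dynamic range, the demand for efficient compression grows, and neural video codecs are developing rapidly. Their coding artifacts differ from those of traditional codecs, exhibiting content-varying and generative characteristics that traditional video quality assessment methods struggle to capture. A metric that generalizes across codecs, content types, and dynamic ranges is therefore needed.
Result: FDIM is trained on DCVQA, a large-scale subjective quality assessment dataset of over 16k video sequences. Extensive experiments on ten SDR/HDR VQA datasets show that FDIM generalizes well and correlates highly with subjective assessment, reaching an advanced level of performance.
Insight: The contribution is a hybrid architecture in which deep features learn multi-scale representations and hand-crafted features supply stable complementary cues, improving generalization across codecs and formats. Viewed objectively, this combination handles the new challenges raised by neural codecs and provides a generic solution for video quality assessment.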
Abstract: Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at https://github.com/MCL-ZJU/FDIM.
[150] Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images cs.CVPDF
Jinkun Dai, Yuanxin Ye, Peng Tang, Tengfeng Tang, Xianping Ma
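TL;DR: This paper proposes TSMNet, a text-supervised multimodal open-vocabulary semantic segmentation network for multimodal remote sensing images. A dual-branch text encoder extracts scene-level semantic and object-level label information, and a text-guided visual semantic fusion module enables dynamic cross-modal interaction, improving open-vocabulary semantic segmentation performance.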
Details
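Motivation: Current semantic segmentation methods for multimodal remote sensing images focus mainly on the complementarity of visual modalities and neglect non-visual textual data, which can bridge the semantic gap between visual patterns and real-world concepts.
Result: In experiments on two new multimodal datasets built by the authors, TSMNet achieves higher segmentation accuracy than other state-of-the-art semantic segmentation models and generalizes well across diverse geographical and sensor-specific scenarios.
Insight: Contributions include a dual-branch text encoder that extracts multi-granularity text features and a text-guided visual semantic fusion module that enables dynamic cross-modal interaction. The work offers a new paradigm for explainable remote sensing analysis and shows that integrating textual knowledge significantly improves model generalizability.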
Abstract: Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal open-vocabulary semantic segmentation network that synergistically integrates textual supervision with visual representation for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features dynamically interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments to make a comprehensive comparison between the proposed method and other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet
[151] EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT cs.CVPDF
Xuguang Bai, Mingxuan Liu, Tongxi Song, Yifei Chen, Hongjia Yang
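TL;DR: This paper proposes EXACT, an explainable, anomaly-aware vision foundation model for 3D chest CT. It learns spatially resolved representations from paired clinical scans and radiology reports, jointly learning organ segmentation and multi-instance anomaly localization without voxel-level annotations, and shows broad and consistent gains on multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation.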
Details
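Motivation: Existing vision-language foundation models compress scans and reports into global image-text representations, which prevents them from preserving spatial evidence and supporting clinically meaningful interpretation. The aim is clinically useful AI that recognizes disease, localizes abnormalities, and provides interpretable visual evidence.
Result: In retrospective multinational, multi-center evaluations, EXACT shows broad and consistent improvements across clinically relevant CT tasks and outperforms existing 3D medical foundation models.
Insight: The contribution is anatomy-aware weakly supervised pre-training that jointly learns organ segmentation and anomaly localization and produces organ-specific anomaly-aware maps. These maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, encoding both lesion extent and organ-level context and establishing a scalable paradigm for trustworthy volumetric medical AI.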
Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.
[152] PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms cs.CVPDF
Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmüller
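TL;DR: This paper proposes PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds that needs no custom CUDA operators or external libraries and is portable and efficient across hardware platforms (NVIDIA GPUs, AMD GPUs, and CPUs). By introducing the 3D-GS-RoPE rotary positional embedding to model spatial relationships directly, replacing sparse convolutional patch embedding with a linear projection, and scaling attention windows at inference time, it substantially reduces parameters and memory while keeping accuracy competitive.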
Details
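Motivation: Existing 3D point cloud perception methods rely heavily on custom CUDA operators for spatial operations, which limits their portability and efficiency on non-NVIDIA hardware such as AMD GPUs and embedded devices. This paper aims to remove that limitation.
Result: On ScanNet, PTX reaches 98.7% of PointTransformer V3's accuracy with 79.2% fewer parameters, runs 1.6× faster, and needs only 253 MB of memory, achieving SOTA-level efficiency across hardware platforms.
Insight: Contributions include: the 3D-GS-RoPE rotary positional embedding, which encodes 3D spatial relationships directly in self-attention without neighborhood construction; replacing sparse convolutional patch embedding with a linear projection to simplify computation; an inference-time attention window scaling strategy that improves accuracy without retraining; and a fully PyTorch-native design that removes hardware dependencies, providing an efficient and portable solution for point cloud processing.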
Abstract: 3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds, removing all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7% of PointTransformer V3’s accuracy on ScanNet with 79.2% fewer parameters and executing 1.6$\times$ faster while requiring just 253 MB memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.
[153] POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation cs.CVPDF
Yaohou Fan, Qingzhong Wang, Yongsong Huang, Junyi Liu, Tomo Miyazaki
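TL;DR: This paper proposes POCA, a framework that uses Pareto-optimal set identification and an adaptive curriculum alignment strategy to address the trade-off between text accuracy and overall image coherence in visual text generation models. It avoids the instability of conventional weighted-sum optimization and converges better on multi-reward datasets.
Details
Motivation: Existing visual text generation models trade text accuracy against aesthetic quality and instruction following. Conventional reinforcement learning methods optimize multiple rewards as a weighted sum, which is unstable and makes the weights hard to balance, and the selection of training prompts is inefficient.
Result: Experiments show that POCA significantly improves all metrics, including CLIP score, HPS score, and sentence accuracy, for an overall gain under multi-objective optimization.
Insight: The novelty lies in modeling multi-reward alignment as a Pareto-optimality problem rather than a simple scalarization, combined with an adaptive curriculum strategy that assesses difficulty automatically to order the learning sequence. Working in a unified reward space removes inconsistent signals and yields an easy-to-hard optimization path.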
Abstract: Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.
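To make the contrast with weighted-sum scalarization concrete, here is a minimal sketch of Pareto-optimal set identification over reward vectors. It shows only the standard non-dominated filter, not POCA's full pipeline or its curriculum strategy; the function names and the reward values are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b on every reward
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(samples):
    """Return the indices of samples that no other sample dominates."""
    front = []
    for i, a in enumerate(samples):
        if not any(dominates(b, a) for j, b in enumerate(samples) if j != i):
            front.append(i)
    return front

# Hypothetical (text accuracy, aesthetic score) pairs for four generated images.
rewards = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(rewards)  # (0.5, 0.5) is dominated by (0.6, 0.6)
```

No weights are needed here: a sample is kept as long as nothing beats it on every reward at once, so extreme trade-offs such as (0.9, 0.2) survive alongside balanced ones.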
[154] Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning cs.CVPDF
Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin
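TL;DR: This paper proposes Omni-o3, a deep nested omnimodal deduction framework for deliberative audio-visual reasoning. It models reasoning as a dynamic recursive search that shares reasoning prefixes across branches and iteratively executes four atomic cognitive actions (expansion, selection, simulation, and backpropagation) to cope with the massive, redundant search space of omnimodal understanding.
Details
Motivation: Current reasoning paradigms (sequential generation or parallel sampling) produce isolated reasoning trajectories that cannot share promising intermediate paths, which limits exploration efficiency and causes compounding errors in complex audio-visual tasks. The paper aims to break this bottleneck and enable more efficient, deliberative omnimodal reasoning.
Result: Extensive experiments on 11 benchmarks show that Omni-o3 achieves competitive performance and unlocks advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
Insight: The core novelty is a deep nested deduction policy that formalizes reasoning as a dynamic recursive search, together with a two-stage training paradigm: cold-start supervised fine-tuning followed by exploratory reinforcement learning driven by nested group rollouts. This offers a structured, efficient way to explore the complex search space of cross-modal interactions.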
Abstract: Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
[155] Computer Vision-Based Early Detection of Container Loss at Sea cs.CVPDF
Vishakha Lall, Capt. Stanley S Pinto, Capt. Chu Xing Peng, Wu Kaiwen
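TL;DR: This study proposes a low-cost, retrofittable computer vision system that uses existing onboard cameras for early detection of destabilised containers at sea. It quantifies the relative motion of containers through object segmentation, optical-flow-based temporal object tracking, and residual motion extraction, and is validated on real onboard footage under varying sea states and visibility.
Details
Motivation: Containerised shipping is the backbone of global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Even when Cargo Securing Manuals are followed, dynamic conditions such as vessel motion, wind loading, and severe sea states can gradually destabilise container stacks until they go overboard. With the International Maritime Organisation’s new mandatory reporting requirements for lost containers, a reliable, evidence-based early detection solution is urgently needed.
Result: Evaluation on real onboard footage shows that the pipeline effectively isolates container-level motion under challenging, varying sea states and visibility conditions.
Insight: The novelty is a low-cost, retrofittable vision framework that combines object segmentation, optical flow tracking, and residual motion extraction specifically to quantify the relative motion of container stacks for early warning, improving cargo safety, operational resilience, and regulatory compliance. Viewed objectively, the method reuses existing infrastructure (onboard cameras) and turns a complex physical destabilisation problem into a quantifiable motion analysis problem, which makes practical deployment feasible.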
Abstract: Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the new International Maritime Organisation’s (IMO) mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence-based early detection solution for destabilised containers. This study showcases a low-cost, retrofittable computer vision-based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow and individual objects’ residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container-level motion under challenging conditions of varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.
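The residual motion step can be illustrated with a small sketch. The abstract does not say how the shared vessel or camera motion is estimated, so this example assumes the per-axis median of the tracked objects' displacements as that estimate; the stack names, displacement values, and threshold are all hypothetical.

```python
import math
from statistics import median

def residual_motion(displacements):
    """Subtract the per-axis median displacement, taken here as an assumed
    estimate of motion shared by all objects (vessel and camera motion)."""
    common_dx = median(d[0] for d in displacements.values())
    common_dy = median(d[1] for d in displacements.values())
    return {
        name: (dx - common_dx, dy - common_dy)
        for name, (dx, dy) in displacements.items()
    }

def flag_unstable(displacements, threshold):
    """Return names of objects whose residual motion magnitude exceeds threshold."""
    residuals = residual_motion(displacements)
    return sorted(
        name for name, (rx, ry) in residuals.items()
        if math.hypot(rx, ry) > threshold
    )

# Hypothetical mean optical-flow displacement (pixels per frame) per tracked stack.
frame_flow = {
    "stack_a": (2.0, -1.0),
    "stack_b": (2.1, -0.9),
    "stack_c": (1.9, -1.1),
    "stack_d": (4.5, 0.5),  # moves differently from the rest
}
flagged = flag_unstable(frame_flow, threshold=1.0)
```

The point of subtracting a shared component is that every stack appears to move when the ship rolls; only motion relative to the other stacks suggests that a particular stack is shifting.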
[156] Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking cs.CVPDF
Yin Lin, Domenico Aquino, Alberto Redaelli, Massimiliano Del Bene, Riccardo Barbieri
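TL;DR: This paper presents a touchless intraoperative image access system based on vision-based hand tracking. A single RGB camera captures gestures in real time, MediaPipe Hands provides a 2.5D estimate of hand landmarks, and intuitive gestures are mapped to translation, rotation, and zoom commands for natural interaction with a medical image viewer. The system needs no additional hardware or user-specific training, its architecture is independent of the visualization software (PyVista is used as the example integration), and performance is evaluated through frame-level logging and metrics for latency, stability, and interaction robustness.
Details
Motivation: To meet the growing need for touchless interaction with medical images in surgery, where sterility and workflow continuity are essential, with a low-cost solution that needs no dedicated hardware.
Result: Experiments show real-time behavior with low latency and stable control, in line with the requirements of fluid interaction, demonstrating the feasibility of a low-cost touchless solution for intraoperative image access.
Insight: The novelty lies in using a monocular RGB camera and the off-the-shelf MediaPipe Hands library for real-time 2.5D hand tracking, mapping simple, intuitive gestures to continuous image manipulation commands, and decoupling the system architecture from the visualization software, which lays the groundwork for future clinical evaluation. Viewed objectively, its low cost and ease of deployment give it practical value in medical settings.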
Abstract: Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped into translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent from the visualization software and, for implementation simplicity, in this study it was integrated with PyVista. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with reduced latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.
[157] ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning cs.CVPDF
Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen
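TL;DR: This paper proposes ReVSI, a benchmark that rebuilds visual spatial intelligence evaluation to assess the 3D reasoning of vision-language models (VLMs) more accurately. By re-annotating scene objects and geometry across multiple datasets and generating question-answer pairs that are answerable and correct, it fixes systematic problems in existing evaluations caused by annotation errors and input limits.
Details
Motivation: Existing spatial intelligence evaluations are systematically invalid under modern VLM settings. First, QA pairs derived from point-cloud annotations can be wrong or ambiguous in video-based evaluation because of reconstruction or annotation artifacts. Second, evaluations often assume full-scene access, whereas VLMs actually receive sparsely sampled frames, leaving many questions unanswerable.
Result: Evaluating general and domain-specific VLMs on ReVSI reveals systematic failure modes that earlier benchmarks obscured, giving a more reliable and diagnostic assessment of spatial intelligence.
Insight: Contributions include: re-annotation and human verification to ensure each QA pair is correct and answerable under the model’s actual inputs; variants across multiple frame budgets and fine-grained object visibility metadata for more controllable evaluation; and bias mitigation supported by professional 3D annotation tools.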
Abstract: Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model’s actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
[158] Don’t Pause! Every prediction matters in a streaming video cs.CVPDF
Dibyadip Chatterjee, Zhanzhong Pang, Fadime Sener, Yale Song, Angela Yao
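TL;DR: The paper proposes SPOT-Bench, a new benchmark for evaluating the real-time perception and assistive capabilities of streaming video models. It features multi-turn proactive queries and introduces Timeliness-F1, a consolidated metric for the temporal precision and coverage of predictions. The study finds that offline models spam predictions, while post-training for silence makes models unresponsive, and it proposes AsynKV, a training-free streaming adaptation that uses a long-short term memory and scales compute efficiently during "dead-time". AsynKV outperforms existing streaming models on SPOT-Bench and reaches state-of-the-art results on retrospective benchmarks.
Details
Motivation: Existing online VideoQA benchmarks are retrospective: they pause the video at fixed timestamps and then ask questions, so they do not test a model’s real-time predictions over a continuous stream. A new benchmark is needed to evaluate the perception and responsiveness that a streaming video model needs as a real-time assistant.
Result: On SPOT-Bench, AsynKV outperforms existing streaming models, and it achieves state-of-the-art (SOTA) results on retrospective benchmarks.
Insight: The novelty lies in SPOT-Bench, described as the first benchmark focused on real-time streaming video perception, its consolidated Timeliness-F1 metric, and the notion of "dead-time" (the parts of a video that expect no response). AsynKV is a training-free streaming adaptation that uses a long-short term memory and scales compute during dead-time, balancing accurate event perception against responsive streaming behavior.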
Abstract: Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
[159] An Affordable, Wearable Stereo-Eye-Tracking Platform cs.CVPDF
Alexander Zimmer, Yasmeen Abdrabou, Enkelejda Kasneci
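TL;DR: This paper presents an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printed components, aimed at the limited flexibility of existing eye trackers for algorithm development and comparative evaluation. The system comprises four infrared eye cameras, infrared illumination, an optional scene camera, and software for calibration and synchronized data acquisition, and it supports multiple eye-tracking paradigms.
Details
Motivation: Existing wearable eye trackers, both commercial and open-source, offer limited flexibility for algorithm development and comparative evaluation. The paper addresses this with a modular design.
Result: A prototype implementation demonstrates feasibility, but the abstract reports no quantitative results or benchmark performance.
Insight: The novelty is a modular, extensible hardware architecture that supports multiple eye-tracking paradigms (stereo, glint-based, binocular) in a single hardware configuration, with all hardware designs openly available for reuse and comparison in research.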
Abstract: Research on video-based eye-tracking has long explored stereo and glint-based methods, yet existing wearable eye trackers - both commercial and open-source - offer limited flexibility for algorithm development and comparative evaluation. We present an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printable components that explicitly targets this gap. The system combines four infrared eye cameras, infrared illumination, an optional scene camera, and software support for calibration and synchronized data acquisition. By design, the platform supports multiple eye-tracking paradigms, including stereo, glint-based, and binocular approaches, within a single hardware configuration. Rather than optimizing for end-user robustness, the platform prioritizes modularity and extensibility for research use. This paper focuses on the hardware architecture and calibration pipeline and demonstrates the feasibility of the approach using a prototype implementation. All hardware designs and documentation are made openly available.
[160] See Further, Think Deeper: Advancing VLM’s Reasoning Ability with Low-level Visual Cues and Reflection cs.CV | cs.AIPDF
Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang
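TL;DR: This paper proposes ForeSight, a unified multimodal interleaved reasoning framework that strengthens the reasoning of vision-language models (VLMs) by introducing low-level visual cues and a mask-based visual feedback mechanism. Reinforcement learning drives the model to decide autonomously on tool invocation and answer verification, and the framework is validated on the newly constructed CG-SalBench dataset.
Details
Motivation: Existing RL-based VLM reasoning methods have two key limitations: they make no use of low-level visual information and they lack an effective visual feedback mechanism, which limits the model’s perception of fine-grained visual features and its ability to revise answers dynamically.
Result: The ForeSight-7B model significantly outperforms other models of the same parameter scale and surpasses current state-of-the-art closed-source models on certain metrics of the newly constructed CG-SalBench dataset.
Insight: The main novelty is integrating low-level visual tools (such as edge and texture analysis) into the reasoning chain to strengthen visual perception (See Further), and designing a mask-based visual feedback mechanism that lets the model re-examine and update its answers through visual reflection (Think Deeper). Together they form a closed-loop reasoning framework that is driven by reinforcement learning and makes its own decisions.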
Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
[161] SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters cs.CV | cs.AIPDF
Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai
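TL;DR: The paper studies how reliable small open-weight vision-language models are when scoring the alignment between images and text descriptions, and finds clear "sycophantic" behavior: high scores given without grounding in visual evidence. The authors propose the "Bluffing Coefficient" to quantify this and test six models on a benchmark of about 173,000 AI-generated character portraits paired with text descriptions. Model size and sycophancy rate show a significant negative correlation.
Details
Motivation: VLMs are increasingly used as evaluators for tasks requiring nuanced image understanding, but their reliability in scoring image-text alignment is underexplored, especially the possible sycophantic behavior of small open-weight models.
Result: On a benchmark of 173,810 AI-generated character portraits with text descriptions, six open-weight VLMs from 450M to 8B parameters were tested. Model size and sycophancy rate are significantly negatively correlated (r = -0.96, p = 0.002): the smallest model, LFM2-VL (450M), has a sycophancy rate of 22.3%, versus 6.0% for the largest, LLaVA-1.6 (7B).
Insight: The paper introduces the Bluffing Coefficient, a new metric that quantifies the mismatch between a model’s score and its evidence recall. It shows that when small open-weight VLMs act as automated evaluators, particularly for attribute-rich synthetic images, there is a measurable and consequential gap between their scores and the visual evidence, a key consideration for deploying such models reliably.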
Abstract: Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit sycophantic behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the Bluffing Coefficient, a metric that measures the mismatch between a model’s score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate (r = -0.96, p = 0.002), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3% of cases, compared to 6.0% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.
[162] Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation cs.CV | cs.AIPDF
Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng
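TL;DR: The paper proposes Positive-and-Negative Decoding (PND), a training-free inference framework for mitigating object hallucination in vision-language models (VLMs). It intervenes during decoding and uses a dual-path contrast mechanism to strengthen visual fidelity, producing descriptions that are more faithful to the image.
Details
Motivation: VLMs hallucinate objects, generating content that contradicts visual reality, because they over-rely on linguistic priors. The authors trace the root cause to a critical attention deficit: visual features are under-weighted during decoding.
Result: Extensive experiments on POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% higher accuracy, substantially reducing object hallucination and enriching descriptive detail without any model retraining. The method generalizes across VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
Insight: The claimed novelty is a training-free dual-path contrastive inference framework (PND) that intervenes directly in decoding: the positive path amplifies salient visual evidence and the negative path degrades the core object’s features, correcting the attention deficit. Viewed objectively, the core insight is identifying and quantifying the attention deficit in VLMs and designing a contrast-based decoding-time intervention that forces the model to attend to visual evidence, an efficient and general approach to hallucination mitigation.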
Abstract: Vision-Language Models (VLMs) are frequently undermined by object hallucination, that is, generating content that contradicts visual reality, due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: the positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object’s features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model’s outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail, all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
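The abstract does not give PND's exact combination rule, so the sketch below uses the generic contrastive-decoding form, (1 + alpha) * positive - alpha * negative, purely as an assumed stand-in for "contrasting the model's outputs from these two perspectives at each step". The vocabulary, the logits, and alpha are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrast_logits(pos_logits, neg_logits, alpha=1.0):
    """Generic contrastive combination (an assumption, not PND's published rule):
    boost tokens favored by the positive path, penalize tokens that stay likely
    even when the object's visual evidence is degraded."""
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]

vocab = ["dog", "cat", "frisbee"]
# Hypothetical next-token logits from the two paths.
pos = [2.0, 1.0, 1.8]  # visual evidence amplified
neg = [2.0, 1.0, 0.2]  # core object degraded: "dog" stays likely from priors alone
probs = softmax(contrast_logits(pos, neg, alpha=1.0))
best = vocab[probs.index(max(probs))]
```

In this toy case "dog" scores the same with or without the object's visual features, which marks it as prior-driven, while "frisbee" drops sharply once the object is degraded, so the contrast promotes it.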
[163] DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation cs.CVPDF
Md Shohel Rana, Andrew H. Sung
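TL;DR: DYMAPIA is a multi-domain framework for detecting AI-generated video manipulation that fuses spatial, spectral, and temporal cues to capture subtle traces of tampering in visual data. It builds dynamic anomaly masks from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, highlighting tampered regions with fine spatial accuracy. The masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification.
Details
Motivation: The rapid advance of AI-generated media raises pressing concerns about content authenticity and digital trust, calling for effective Deepfake detection.
Result: The framework achieves state-of-the-art (SOTA) results on the FF++, Celeb-DF, and VDFD benchmarks, with accuracy and F1-scores above 99%, while staying compact enough for real-time use.
Insight: The novelty is a dynamic anomaly mask built by multi-domain fusion (spatial, spectral, temporal), together with DistXCNet, a lightweight region-focused classifier distilled from Xception. The design balances high accuracy with efficiency and shows potential for deployment in real-time forensic tasks.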
Abstract: AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multidomain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.
[164] AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark cs.CVPDF
Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen
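TL;DR: This paper proposes AutoGUI-v2, a comprehensive multimodal benchmark for evaluating GUI agents’ deep functionality understanding and interaction outcome prediction. Built with a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions, it provides 2,753 tasks across six operating systems, covering region- and element-level semantics, functional grounding, and dynamic state prediction.
Details
Motivation: Existing GUI agent benchmarks are split between black-box task completion and static, shallow grounding, so they cannot tell whether an agent truly understands the implicit functionality and state-transition logic of a GUI. Closing this gap requires a benchmark that evaluates deep understanding of GUI functionality and prediction of interaction outcomes.
Result: The evaluation reveals a striking dichotomy among VLMs: open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, whereas commercial models (e.g., Gemini-2.5-Pro-Thinking) lead in functionality captioning. All models struggle with the complex interaction logic of uncommon actions, showing that deep functional understanding remains a major challenge.
Insight: The novelty is a systematic, multi-level evaluation framework for GUI functionality understanding that goes beyond element matching or task completion. The VLM-human collaborative recursive parsing pipeline, and the focus on predicting the "digital world state" after an interaction, provide a new yardstick and direction for the next generation of GUI agents.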
Abstract: Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the “digital world state” resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
[165] TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering cs.CVPDF
Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang
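TL;DR: This paper introduces TextGround4M, a large-scale dataset of over 4 million prompt-image pairs that targets inaccurate spatial layout of rendered text in text-to-image generation. The dataset provides fine-grained span-level text annotations with bounding boxes to support layout-aware text rendering. The authors also propose a lightweight training strategy that adds layout-aware span tokens to autoregressive T2I models without changing model architecture or inference behavior. In addition, they build a benchmark with stratified layout complexity and introduce two layout-aware metrics to evaluate the spatial accuracy of text rendering in a zero-shot setting.
Details
Motivation: Current text-to-image models struggle to render prompt-specified text with the correct spatial layout, especially in multi-span, structured settings. The difficulty stems from the lack of datasets that align prompts with the exact text and layout in the image, and from the absence of effective metrics for layout quality.
Result: In a zero-shot evaluation on the stratified layout complexity benchmark, models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency.
Insight: Contributions include: 1) TextGround4M, a large-scale dataset with fine-grained annotations that supervises layout-aware text rendering; 2) a lightweight training strategy that adds layout-aware span tokens without modifying the architecture; 3) a benchmark with stratified layout complexity and two new layout-aware metrics that fill the gap in spatial evaluation of text rendering. Viewed objectively, the work improves layout control in text-to-image generation systematically through data and evaluation.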
Abstract: Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout – especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.
[166] Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data cs.CVPDF
Mohammadmehdi Ataei, Farzaneh Askari, Kamal Rahimi Malekshan, Pradeep Kumar Jayaraman
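TL;DR: Zero-to-CAD is a scalable framework that embeds a large language model in a feedback-driven CAD environment and synthesizes executable CAD construction sequences through agentic search. It produces about one million executable, readable, editable CAD programs, and a vision-language model fine-tuned on this synthetic data reconstructs editable CAD programs from multi-view images, outperforming baselines that include GPT-5.2.
Details
Motivation: Existing large-scale 3D datasets (B-Reps or meshes) lack the parametric construction history that encodes design intent. The work aims to close the gap between geometric scale and parametric interpretability.
Result: A vision-language model fine-tuned on the synthetic data outperforms strong baselines, including GPT-5.2, at reconstructing editable CAD programs from multi-view images, effectively bootstrapping sequence generation without real construction-history training data.
Insight: The agentic search framework embeds an LLM in a CAD environment to generate, execute, and validate code iteratively, which promotes geometric validity and operation diversity. The resulting large-scale executable CAD sequences go beyond sketch-and-extrude workflows and provide a key resource for CAD AI.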
Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset’s utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
[167] CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping cs.CVPDF
Md Shohel Rana, Tanoy Debnath
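TL;DR: This paper proposes CA-IDD, a diffusion-based face swapping method that integrates multi-modal guidance (gaze, identity, and facial parsing) through multi-scale cross-attention. Precomputed identity embeddings are injected into the denoising process for accurate, consistent identity transfer, and expert-guided supervision improves semantic coherence and visual quality.
Details
Motivation: Existing face swapping methods, especially GAN-based ones, struggle to balance identity preservation against visual realism because of limited controllability and mode collapse.
Result: CA-IDD achieves an FID of 11.73, surpassing baselines such as FaceShifter and MegaFS, and qualitative results show better identity retention across diverse poses.
Insight: The paper presents CA-IDD as the first diffusion-based face swapping approach to integrate multi-modal guidance through multi-scale cross-attention, enabling fine-grained regional control. Viewed objectively, its hierarchical attention mechanism and stable diffusion framework offer a new direction for identity-consistent editing.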
Abstract: Face swapping aims to generate realistic facial images by transferring the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, exceeding established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
[168] RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting cs.CVPDF
Jinghao Shi, Mengqi Lei, Kunliang He, Yun Li, Wei Bao
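TL;DR: This paper proposes RACANet, a two-stage fusion framework for RGB-T (visible and thermal infrared) crowd counting. A lightweight cross-modal alignment pretraining stage first learns cross-modal semantic correspondences explicitly. The formal training stage then introduces a Local Anchor Fusion Module that uses the learned priors to aggregate features from highly reliable regions and redistribute features adaptively at the pixel level, addressing the weaknesses of existing methods in modeling local spatial discrepancies and characterizing modality reliability at a fine grain.
Details
Motivation: Most existing RGB-T crowd counting methods rely on implicit cross-modal fusion. They lack explicit modeling of local spatial discrepancies and a fine-grained, position-level characterization of modality reliability, which limits the accuracy and interpretability of fusion.
Result: Experiments on two widely used benchmarks, RGBT-CC and Drone-RGBT, show that RACANet outperforms existing methods.
Insight: Contributions include: 1) a two-stage framework that learns semantic correspondences through explicit cross-modal alignment pretraining; 2) a Local Anchor Fusion Module that uses the pretrained priors to generate local semantic anchors and perform adaptive pixel-level fusion; 3) a discrepancy-aware consistency constraint that dynamically coordinates the reliability of regions where the modal representations agree. Viewed objectively, the method decouples modality alignment from fusion and models local reliability explicitly, making fusion finer-grained and more interpretable.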
Abstract: RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.
[169] Improving Vision-language Models with Perception-centric Process Reward Models cs.CVPDF
Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng
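TL;DR: This paper proposes Perceval, a perception-centric process reward model for improving the reasoning of vision-language models. It extracts image-related claims from a model response and checks them one by one against the visual evidence to locate perceptual errors. Perceval is integrated into reinforcement learning training to provide fine-grained supervision, and at inference time it helps the model truncate erroneous content and reflect, enabling test-time scaling. Experiments show significant gains on benchmarks from multiple domains across several RL-trained reasoning VLMs.
Details
Motivation: Existing reinforcement learning with verifiable rewards supervises vision-language models too coarsely to diagnose and correct errors inside the reasoning chain. A method for fine-grained, perception-level error localization and correction is therefore needed.
Result: On benchmarks from multiple domains, models trained with Perceval improve significantly. For test-time scaling, the method also yields consistent gains over other strategies such as majority voting.
Insight: The core contribution is a process reward model focused on perception that localizes perceptual errors in model responses at fine granularity. This gives RL training a more precise supervision signal and, through test-time truncation and reflection, extends performance at inference, making it a general-purpose perception-centric supervision strategy.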
Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model’s response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
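The contrast between sequence-level and token-level advantages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `penalty` value, and the representation of flagged spans as `(start, end)` token index pairs are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Sequence-level advantages, normalized within a group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def token_advantages(seq_adv, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast one sequence-level advantage to every token, then subtract
    a penalty on tokens inside flagged spans.

    flagged_spans is a hypothetical stand-in for the hallucinated spans a
    process reward model would return: (start, end) pairs, end exclusive.
    """
    adv = [seq_adv] * num_tokens
    for start, end in flagged_spans:
        for i in range(max(0, start), min(end, num_tokens)):
            adv[i] -= penalty
    return adv
```

With `token_advantages(0.5, 4, [(1, 3)])`, tokens 1 and 2 receive a lower advantage than the rest of the response, which is the fine-grained signal a single sequence-level value cannot express.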
[170] Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift cs.CVPDF
Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi
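TL;DR: This paper proposes MG-MTTA, a test-time adaptation method for vision-language models whose visual and textual modalities shift asymmetrically at deployment. Taking a majorization view of multimodal posteriors, it casts adaptation as a constrained de-mixing problem on the fused prediction. The objective combines fused-posterior entropy minimization with an anchor-based, reliability-aware gate prior, and only a lightweight gate or adapter is updated while the backbone stays frozen, so that modality reliability is controlled and errors are reduced.
Details
Motivation: Vision-language models generalize well in zero-shot settings, but in real deployments the visual and textual branches often shift asymmetrically. Under this condition, purely entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. The paper studies this failure mode and addresses the control of modality reliability.
Result: On the ImageNet-based benchmark, MG-MTTA raises top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive on the visual-only benchmark.
Insight: The paper analyzes multimodal posteriors through majorization, formalizes test-time adaptation as a constrained de-mixing problem, and proposes a lightweight adaptation method that combines entropy minimization with a reliability-aware gate prior. The core insight is that multimodal test-time adaptation should control modality reliability, not just prediction entropy; the analysis supplies theoretical conditions and a practical threshold for preventing an unreliable modality from dominating fusion.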
Abstract: Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
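As a rough illustration of why an entropy term alone is not enough, the sketch below pairs the entropy of a fused posterior with a penalty that pulls a scalar gate toward a reliability prior. The convex-mixture fusion, the quadratic penalty, and the names `gate`, `gate_prior`, and `lam` are assumptions for this example; the abstract does not specify the actual form used by MG-MTTA.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def fuse(p_visual, p_text, gate):
    """Assumed fusion rule: convex mixture of the two modality posteriors."""
    return [gate * v + (1.0 - gate) * t for v, t in zip(p_visual, p_text)]


def objective(p_visual, p_text, gate, gate_prior, lam=1.0):
    """Fused-posterior entropy plus an assumed quadratic pull toward the prior."""
    fused = fuse(p_visual, p_text, gate)
    return entropy(fused) + lam * (gate - gate_prior) ** 2
```

If the visual posterior is confidently wrong, minimizing entropy alone would push `gate` toward 1 and sharpen the wrong answer; the prior term makes that move costly whenever the prior marks the visual branch as unreliable.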
[171] CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV | cs.AIPDF
Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang
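TL;DR: This paper proposes CF-VLA, a coarse-to-fine action generation method for vision-language-action policies that targets the inefficiency of multi-step inference in flow-based VLA policies. Generation proceeds in two stages: a coarse initialization builds an action-aware starting point, and a single-step local refinement corrects the residual error, which substantially improves inference efficiency while preserving performance.
Details
Motivation: Flow-based vision-language-action policies are expressive for action generation but fundamentally inefficient: recovering action structure from uninformative Gaussian noise requires multi-step inference, which forces a poor trade-off between efficiency and quality under real-time constraints.
Result: On the CALVIN and LIBERO benchmarks, CF-VLA establishes a strong efficiency-performance frontier in the low-NFE (number of function evaluations) regime. It consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 π0.5 baseline on several metrics, reduces action sampling latency by 75.4%, and reaches the best average real-robot success rate of 83.0%, exceeding MIP by 19.5 points and π0.5 by 4.0 points.
Insight: The paper rethinks the role of the starting point in generative action modeling and restructures action generation as a structured coarse initialization followed by a single refinement step. Viewed objectively, its stepwise training strategy (first learning a controlled coarse predictor, then optimizing jointly) stabilizes training and balances efficient inference with strong performance.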
Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $π_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $π_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
[172] Meta-CoT: Enhancing Granularity and Generalization in Image Editing cs.CV | cs.AI | cs.LG | cs.MMPDF
Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He
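TL;DR: This paper proposes the Meta-CoT paradigm, which decomposes any single-image editing operation into a (task, target, required understanding ability) triplet and further decomposes editing tasks into five meta-tasks for training, improving the model's fine-grained understanding of editing operations and its generalization. It also introduces a CoT-Editing Consistency Reward to align reasoning with editing behavior. Experiments show an average improvement of 15.8% across 21 editing tasks and effective generalization to unseen tasks.
Details
Motivation: Unified multimodal understanding/generation models improve image editing by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process, but which forms of CoT and which training strategies jointly improve understanding granularity and generalization remains underexplored.
Result: The method achieves an overall 15.8% improvement across 21 editing tasks and generalizes effectively to unseen editing tasks after training on only a small set of meta-tasks.
Insight: The core contribution is a two-level decomposition paradigm built on two properties, decomposability and generalizability: editing intentions are modeled as triplets to sharpen understanding granularity, and editing tasks are broken into five fundamental meta-tasks to obtain strong generalization. A CoT-Editing Consistency Reward additionally ensures that the reasoning is used accurately and effectively during editing.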
Abstract: Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model’s understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model’s editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
[173] NeuroClaw Technical Report cs.CVPDF
Cheng Wang, Zhibin He, Zhihao Peng, Shengyuan Liu, Yufan Hu
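TL;DR: This paper introduces NeuroClaw, an executable and reproducible multi-agent research assistant designed for neuroimaging research. It works directly on raw multimodal neuroimaging data and uses BIDS metadata, so users do not need to prepare curated inputs or bespoke code. Engineering features such as a tiered skill hierarchy, environment management, and checkpointing make workflows more transparent and reproducible.
Details
Motivation: Neuroimaging involves heterogeneous data modalities, long and complex multi-stage pipelines, and high reproducibility risk, all of which hinder AI agents from accelerating scientific workflows in this field.
Result: On the system-level NeuroBench benchmark, NeuroClaw-enabled runs achieve consistent and substantial score improvements over direct agent invocation in executability, artifact validity, and reproducibility readiness.
Insight: The contribution lies in combining domain specialization (neuroimaging semantics and BIDS), a tiered multi-agent architecture, and a reproducibility-oriented engineering layer (environment management, checkpointing, audit traces) into an end-to-end executable research framework, together with NeuroBench, a system-level benchmark for the domain.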
Abstract: Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
[174] WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring cs.CVPDF
Vandita Shukla, Fabio Remondino, Blair Costelloe, Benjamin Risse
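TL;DR: WildLIFT is a computational framework that lifts monocular drone video into 3D space for species-agnostic wildlife monitoring. It integrates 3D scene geometry from monocular video with open-vocabulary 2D instance segmentation to perform 3D detection and tracking, and it produces oriented 3D bounding box labels with semantic face information to support downstream ecological analyses.
Details
Motivation: Most current analyses of drone-based wildlife monitoring with monocular RGB cameras are confined to 2D image space and leave the geometric information in video underused, which limits their value for behavioural research and population monitoring.
Result: The framework is validated on 2,581 manually curated frames covering four large mammal species and more than 6,700 3D detections. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement.
Insight: The contribution is to combine 3D scene geometry recovered from monocular video with open-vocabulary 2D segmentation, enabling 3D detection and tracking without species-specific priors, and to use the semantic face information of oriented 3D bounding boxes to quantify viewpoint coverage and inter-animal occlusion, yielding structured 3D metadata for ecological analysis.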
Abstract: Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.
[175] OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer cs.CVPDF
Boyang Wang, Guangyi Xu, Zhipeng Tang, Jiahui Zhang, Zezhou Cheng
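TL;DR: This paper proposes OmniShotCut, a new shot boundary detection (SBD) method based on a shot-query Transformer that models SBD as a structured relational prediction task, jointly estimating intra-shot and inter-shot relations. It also introduces a fully synthetic transition synthesis pipeline that produces precise annotations, and builds OmniShotCutBench, a modern wide-domain benchmark.
Details
Motivation: Existing state-of-the-art SBD methods produce non-interpretable boundaries, miss subtle yet harmful discontinuities, and depend on noisy, low-diversity annotations and outdated benchmarks.
Result: The method is evaluated on the newly proposed OmniShotCutBench, which is designed for holistic and diagnostic evaluation. The abstract reports no specific quantitative results or comparison with the state of the art.
Insight: The core contribution is to recast SBD as structured relational prediction and to model it jointly with a shot-query-based dense video Transformer. A fully synthetic pipeline generates high-quality, parameterized training data, which addresses annotation noise.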
Abstract: Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD was widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut to formulate SBD as structured relational prediction, jointly estimating shot ranges with intra-shot relations and inter-shot relations, by a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
[176] Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CVPDF
Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li
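TL;DR: Tuna-2 is a native unified multimodal model that performs visual understanding and generation directly on pixel embeddings, discarding modular vision encoder designs such as the VAE or a representation encoder. Experiments show that it reaches state-of-the-art performance on multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation.
Details
Motivation: Conventional unified multimodal models rely on pretrained vision encoders and use separate visual representations for understanding and generation, which misaligns the two tasks and prevents fully end-to-end optimization.
Result: Tuna-2 achieves state-of-the-art (SOTA) performance on multimodal benchmarks. At scale, its encoder-free design yields stronger multimodal understanding, particularly on tasks that require fine-grained visual perception.
Insight: The contribution is a native unified multimodal architecture built entirely on pixel embeddings. It shows that pretrained vision encoders are not necessary for multimodal modelling, and that end-to-end pixel-space learning offers a scalable path to stronger visual representations for both generation and perception.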
Abstract: Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2’s encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
[177] World-R1: Reinforcing 3D Constraints for Text-to-Video Generation cs.CVPDF
Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang
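TL;DR: This paper proposes World-R1, a framework that uses reinforcement learning to align text-to-video generation with 3D constraints, addressing the geometric inconsistencies common in existing video foundation models. It optimizes the model with feedback from pretrained 3D foundation models and vision-language models without modifying the underlying architecture, and it uses a periodic decoupled training strategy to balance geometric consistency with fluid scene dynamics.
Details
Motivation: Existing video generation models often exhibit geometric inconsistencies, while methods that inject 3D priors through architectural changes are computationally expensive and scale poorly. An efficient, scalable way to strengthen 3D consistency in video generation is therefore needed.
Result: Extensive evaluation shows that the method significantly improves 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Insight: The contributions are the use of reinforcement learning (Flow-GRPO) with feedback from pretrained models to align 3D constraints without architectural changes; a specialized pure-text dataset for world simulation; and a periodic decoupled training strategy that balances geometric rigidity with dynamic fluidity. Together they offer an efficient and scalable route to better 3D consistency in generative models.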
Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
cs.IR [Back]
[178] Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking cs.IR | cs.AI | cs.CLPDF
Eyhab Al-Masri
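TL;DR: This paper presents a unified benchmarking framework for quantifying how large language models (LLMs), acting as autonomous agents, differ in discovering and ranking external APIs (inter-model divergence). The study covers 15 canonical API domains and 5 major model families and uses a range of set-, rank-, and consensus-based metrics. Models show moderate overall agreement, but divergence depends strongly on the task domain: structured tasks are relatively stable, while open-ended tasks diverge more.
Details
Motivation: As LLMs increasingly act as autonomous agents that call external APIs to carry out complex tasks, their reliability and agreement remain poorly characterized. The paper asks how to quantify differences between LLMs in API discovery and ranking in order to assess reliability and coordination.
Result: Across 15 API domains and 5 major model families, overall agreement is moderate (Average Overlap about 0.50, Kendall's tau about 0.45), but divergence is strongly domain-dependent: structured tasks (e.g., Weather, Speech-to-Text) are stable, while open-ended tasks (e.g., Sentiment Analysis) diverge substantially more. Consensus analysis shows that agreement is higher in data-bound domains and drops for abstract reasoning tasks.
Insight: The paper contributes a unified benchmarking framework for quantifying divergence in inter-LLM communication and reveals a systematic, domain-dependent pattern of divergence. Viewed objectively, the study supports consensus-weighted, reliability-aware orchestration in multi-agent systems and shows that surface agreement can mask instability in action-relevant rankings, which motivates diagnostic benchmarks for detecting safety risks before deployment.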
Abstract: Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking framework to quantify inter-LLM divergence, defined as the extent to which models differ in API discovery and ranking under identical tasks. Across 15 canonical API domains and 5 major model families, we measure pairwise and group-level agreement using set-, rank-, and consensus-based metrics including Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall’s tau, Kendall’s W, and Cronbach’s alpha. Results show moderate overall alignment (AO about 0.50, tau about 0.45) but strong domain dependence: structured tasks (Weather, Speech-to-Text) are stable, while open-ended tasks (Sentiment Analysis) exhibit substantially higher divergence. Volatility and consensus analyses reveal that coherence clusters around data-bound domains and degrades for abstract reasoning tasks. These insights enable reliability-aware orchestration in multi-agent systems, where consensus weighting can improve coordination among heterogeneous LLMs. Beyond performance benchmarking, our results reveal systematic failure modes in multi-agent LLM coordination, where apparent agreement can mask instability in action-relevant rankings. This hidden divergence poses a pre-deployment safety risk and motivates diagnostic benchmarks for early detection.
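Two of the agreement metrics named in the abstract, Jaccard similarity and Kendall's tau, can be sketched in a few lines. This is a minimal illustration rather than the paper's evaluation code: it assumes each model returns a ranked list of unique API names, and it computes tau only over the items both lists share, which is one of several possible conventions for partially overlapping rankings.

```python
from itertools import combinations


def jaccard(list_a, list_b):
    """Set overlap between two retrieved API lists, ignoring order."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def kendall_tau(rank_a, rank_b):
    """Kendall's tau over items present in both rankings (no ties assumed)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    common = [item for item in rank_a if item in pos_b]
    if len(common) < 2:
        raise ValueError("need at least two shared items")
    concordant = discordant = 0
    # combinations() preserves rank_a order, so x always precedes y in rank_a.
    for x, y in combinations(common, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

Two models can have a high Jaccard score while tau is low, meaning they retrieve the same APIs but order them differently. That is the kind of hidden divergence the abstract warns about when apparent agreement masks instability in action-relevant rankings.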
[179] Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval cs.IR | cs.CVPDF
Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo
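TL;DR: This paper evaluates how modern self-supervised vision learning methods perform in a typical retrieval stack built on vector databases and nearest neighbor search, and finds that geometric properties of the latent space, such as anisotropy and local purity, significantly affect approximate nearest neighbor indexing performance.
Details
Motivation: Self-supervised visual representations are rarely applied in content-based image retrieval. The paper explores how their geometric properties affect retrieval performance.
Result: Experiments show that representations with high anisotropy and skewness degrade partition-based and hashing-based search, while representations with higher isotropy and local purity improve semantic retrieval.
Insight: The paper links the geometry of self-supervised representations to retrieval performance and highlights the importance of isotropy and local purity, offering a new perspective for improving retrieval systems.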
Abstract: Content-based image retrieval (CBIR) systems enable users to search images based on visual content instead of relying on metadata. The text domain has benefited from vector search of representations created with unsupervised methods such as BERT. However, modern self-supervised learning methods for vision are mostly not reported in CBIR-related literature, instead relying on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern self-supervised learning methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that the latent space geometry impacts approximate nearest neighbor (ANN) indexing. Specifically, highly anisotropic representations with high skewness produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even if their own linear probe or K-NN accuracy is not affected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance.
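One common proxy for anisotropy is the mean pairwise cosine similarity of a sample of embeddings: values near 0 suggest directions are spread out, while values near 1 suggest the embeddings crowd into a narrow cone. The sketch below uses that proxy as an assumption; the abstract does not state how the authors measure anisotropy, skewness, or local purity.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A high value here is a warning sign for partition-based and hashing-based indexes, whose distance-based assumptions the abstract says anisotropic representations violate.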
eess.AS [Back]
[180] In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions eess.AS | cs.CL | cs.LG | cs.SDPDF
Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury
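TL;DR: This paper proposes In-Sync, which extends an existing speech-aware large language model to predict word-level timestamps directly while preserving transcription quality. The authors introduce a set of lightweight training strategies that make alignment more robust; experiments show that the approach improves timestamp accuracy and overall automatic speech recognition (ASR) performance.
Details
Motivation: Speech-aware language models can produce rich outputs, but word-level timestamp prediction is usually delegated to external alignment tools, which limits efficiency and precision in applications such as captioning, media search, and multimodal synchronization.
Result: Experiments on multiple datasets show that the proposed strategies improve timestamp accuracy and also raise overall ASR performance, giving an efficient, unified approach to speech recognition with precise timestamp prediction.
Insight: The contribution is to integrate timestamp prediction directly into a speech-aware LLM and to achieve robust alignment through lightweight training strategies, removing the dependence on external alignment tools and providing a unified framework for end-to-end recognition and timestamp generation.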
Abstract: Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.
cs.CR [Back]
[181] Jailbreaking Frontier Foundation Models Through Intention Deception cs.CR | cs.AI | cs.CLPDF
Xinhe Wang, Katia Sycara, Yaqi Xie
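TL;DR: This paper proposes a new multi-turn jailbreaking method that attacks frontier foundation models through intention deception. It exploits the vulnerability introduced when models move from refusal-based safeguards to safe completion: the attacker gradually builds trust over the conversation and finally induces harmful output. The work also reveals a new vulnerability class called "para-jailbreaking".
Details
Motivation: The safety mechanisms of frontier foundation models (such as GPT-5) have shifted from refusal to safe completion, which aims to maximize helpfulness while obeying safety constraints. This mechanism can be exploited when an attacker disguises their intent as benign, especially in multi-turn conversation, where the attacker can repeatedly reinforce the deceptive intent. The paper uses this vulnerability to mount jailbreak attacks.
Result: The method achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Experiments on multimodal vision-language models show that it outperforms the existing state of the art.
Insight: The main contributions are a multi-turn jailbreaking strategy that exploits intent inversion and the model's consistency property, and the first identification and definition of para-jailbreaking, in which the model does not give a directly harmful reply but the information it reveals is still harmful.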
Abstract: Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user’s intent. This binary training regime often leads to brittleness, since user intent cannot be evaluated reliably, especially if the attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion can be exploited when a user pretends their intention is benign. Specifically, this intent inversion is effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability, which we call para-jailbreaking, that has gone unnoticed until now. Para-jailbreaking describes the situation where the model does not give a directly harmful reply to the attack query, yet the information it reveals is nevertheless harmful. Our contributions are threefold. First, our approach achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLM models showed that our approach outperformed state-of-the-art models.
cs.SD [Back]
[182] HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models cs.SD | cs.CLPDF
Peize He, Yaodi Luo, Xiaoqian Liu, Xuyang Liu, Jiahang Deng
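TL;DR: This paper proposes HeadRouter, a dynamic head-weight routing method for task-adaptive audio token pruning in large audio language models. By accounting for the differing importance of attention heads across audio tasks, it maximizes the retention of crucial tokens, reducing inference cost while preserving model performance.
Details
Motivation: Large audio language models incur high inference cost on long sequences. Existing token compression methods usually assume that all attention heads contribute equally to every audio task and compute token importance by averaging scores across heads. The authors' analysis shows instead that attention heads behave differently across audio domains, and that only a sparse subset of heads responds actively to audio, with very different performance on semantic versus acoustic tasks.
Result: Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks show that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and reaching 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
Insight: The paper reveals the heterogeneity of attention heads across audio tasks and builds on it to propose HeadRouter, a training-free, head-importance-aware token pruning method applicable to various large audio language models, using dynamic routing to achieve efficient task-adaptive compression.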
Abstract: Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, with completely different performance when handling semantic and acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
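The difference between averaging attention scores across heads and weighting heads by task relevance can be sketched as follows. This is an illustrative toy, not HeadRouter itself: `head_scores`, `head_weights`, and `keep_ratio` are hypothetical names, and how the method actually derives per-task head weights is not described in the abstract.

```python
def token_importance(head_scores, head_weights=None):
    """Combine per-head token scores into a single importance per token.

    head_scores[h][t] is the score head h assigns to audio token t.
    With head_weights=None, every head counts equally (the baseline
    averaging that the abstract attributes to existing methods).
    """
    num_heads = len(head_scores)
    if head_weights is None:
        head_weights = [1.0 / num_heads] * num_heads
    num_tokens = len(head_scores[0])
    return [
        sum(w * scores[t] for w, scores in zip(head_weights, head_scores))
        for t in range(num_tokens)
    ]


def prune(importance, keep_ratio):
    """Return indices of the kept tokens, in their original order."""
    k = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])
```

For `head_scores = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.2, 0.8]]` and `keep_ratio=0.5`, uniform averaging keeps tokens 0 and 3, whereas putting all weight on the second head keeps tokens 2 and 3. The retained set therefore depends on which heads are trusted for the task at hand.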
cs.AI [Back]
[183] The Power of Power Law: Asymmetry Enables Compositional Reasoning cs.AI | cs.CL | cs.LGPDF
Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu
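TL;DR: The paper studies how the data distribution affects a model's compositional reasoning and finds that training on power-law-distributed data, as natural language is, outperforms training on uniformly distributed data. Through theory and experiments, the authors show that the power law introduces a beneficial asymmetry that improves the loss landscape, letting the model learn high-frequency skill compositions efficiently and use them as a stepping stone to rare long-tail skills.
Details
Motivation: Natural language data follows a power law, with most knowledge and skills appearing at very low frequency. The paper asks how to train models efficiently for compositional reasoning (such as state tracking and multi-step arithmetic) in this setting, and challenges the common intuition that reweighting or curating data toward uniformity helps with long-tail skills.
Result: Across a wide range of compositional reasoning tasks (such as state tracking and multi-step arithmetic), training under a power-law distribution consistently outperforms training under a uniform distribution. The theoretical analysis proves that learning under a power law requires significantly less training data.
Insight: The claimed contribution is the mechanism behind the advantage of power-law training: the asymmetry it induces improves a pathological loss landscape, so the model can acquire skills efficiently in stages. Viewed objectively, this offers a new perspective on what makes a training distribution effective, challenges the practice of flattening long-tail data, and highlights how the inherent statistics of data can help learning dynamics.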
Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
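To make the two training regimes concrete, the sketch below draws skill indices from a Zipf-like power law and, by setting the exponent to zero, from a uniform distribution. The exponent `alpha`, the rank-based weighting, and the function names are assumptions for illustration; the paper's actual task construction is not given in the abstract.

```python
import random


def power_law_weights(num_skills, alpha=1.0):
    """Normalized weights proportional to 1 / rank**alpha (rank starts at 1).

    alpha=0 reduces to the uniform distribution.
    """
    raw = [1.0 / (rank ** alpha) for rank in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_skills(num_skills, num_samples, alpha=1.0, seed=0):
    """Draw skill indices (0-based) with power-law frequencies."""
    rng = random.Random(seed)
    weights = power_law_weights(num_skills, alpha)
    return rng.choices(range(num_skills), weights=weights, k=num_samples)
```

Under `alpha=1.0` the first few skills dominate the sample while tail skills appear only occasionally. This is the asymmetry the abstract credits with letting a model master frequent compositions first and reuse them for rare ones.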
[184] FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean cs.AI | cs.CLPDF
Jordan Meadows, Lan Zhang, Andre Freitas
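TL;DR: This paper presents FormalScience, a domain-agnostic, human-in-the-loop autoformalisation pipeline that turns informal mathematical reasoning, particularly in scientific fields such as physics, into formally verifiable code (e.g., Lean4). By building the FormalPhysics dataset and evaluating several LLM approaches, the paper systematically analyzes semantic drift in physics autoformalisation, and it releases an interactive tool to support theorem proving in a wider range of scientific domains.
Details
Motivation: Large language models struggle to convert informal scientific reasoning into formally verifiable code, especially when domain-specific notation and machinery (such as Dirac notation and vector calculus) are involved. The paper also aims to lower the cost and barrier for domain experts to carry out formalisation.
Result: The authors build FormalPhysics, a dataset of 200 university-level physics problems with Lean4 formal representations; it achieves perfect formal validity and has greater statement complexity than existing formal math benchmarks. They evaluate open-source and proprietary LLMs under zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, exposing the limitations of current LLM-based autoformalisation.
Insight: The contributions are: 1) a scalable human-in-the-loop agentic pipeline that lets a domain expert without deep formal-language experience produce syntactically correct and semantically aligned formal proofs at low cost; 2) the first systematic characterisation of semantic drift in physics autoformalisation (for example, notational collapse and abstraction elevation), which reveals what formal verification actually checks when full semantic preservation is unattainable; 3) an interactive UI-based system that makes the approach easy to extend to scientific domains beyond physics.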
Abstract: Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (e.g., Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation, which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics. https://github.com/jmeadows17/formal-science
[185] Discovering Agentic Safety Specifications from 1-Bit Danger Signals cs.AI | cs.CL
Víctor Gallego
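TL;DR: This paper proposes the EPO-Safe framework, in which a large language model (LLM) agent autonomously discovers and evolves a natural-language behavioral safety specification purely by interacting with an environment and receiving a sparse 1-bit danger signal (indicating only whether an action was unsafe). The method is validated on AI Safety Gridworlds and their text-based scenario analogs, where it discovers safe behavior within a few rounds and produces interpretable safety hypotheses.
Details
Motivation: To address how an LLM agent in a structured environment can autonomously discover hidden safety constraints and objectives when it observes only a sparse, low-dimensional danger signal rather than a full reward function or detailed feedback.
Result: Across five AI Safety Gridworlds and five text-based scenario analogs, EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes) and produces human-readable safety specifications. Even under noise in which 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only about 15%. The experiments show that reflection on reward alone exacerbates reward hacking, whereas EPO-Safe avoids this through a dedicated safety channel.
Insight: The novelty lies in showing that LLMs can perform safety reasoning and evolve a specification from an extremely sparse 1-bit danger signal, without relying on rich textual feedback. Cross-episode reflection naturally filters out inconsistent signals, which gives robustness to noise. The resulting specification is an autonomously discovered, auditable set of behavioral rules, unlike Constitutional AI, where the rules are written by humans, and offers a route to safety alignment learned from experience.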
Abstract: Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., “X cells are directionally hazardous: entering from the north is dangerous”). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15%, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
[186] Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models cs.AI | cs.CL | cs.LG
Sharan Ramjee
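TL;DR: This paper studies the detection of misaligned latent reasoning in continuous thought models. By building the MoralChain benchmark and training a model under a dual-trigger paradigm, it shows that continuous thought models can hide misaligned reasoning in latent space while producing aligned outputs, and it proposes an early-detection method based on linear probes.
Details
Motivation: Continuous thought models reason in latent space and so overcome the expressive-bandwidth limit that Chain-of-Thought inherits from natural language, but their lack of interpretability raises a safety concern: how can misaligned reasoning be detected in an uninterpretable latent space?
Result: Experiments on the MoralChain benchmark (12,000 social scenarios) show that linear probes trained on behaviorally distinguishable conditions ([T][O] vs [O]) transfer with high accuracy to detecting armed-but-benign states ([T] vs baseline), and that misalignment is encoded in early latent thinking tokens.
Insight: The contributions include a dual-trigger paradigm (an arming trigger [T] and a release trigger [O]) for modelling misaligned latent reasoning; the finding that aligned and misaligned reasoning occupy geometrically distinct regions of latent space; and the argument that safety monitoring should target the “planning” phase of latent reasoning.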
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model’s expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the “planning” phase of latent reasoning.
[187] Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus cs.AI | cs.CL
Johannes Moll, Jannik Lübberstedt, Christoph Nuernbergk, Jacob Stroh, Luisa Mertens
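TL;DR: This study evaluates how well a large language model (LLM)-based agentic reasoning system can synthesise the longitudinal clinical records of multiple myeloma patients. When answering complex clinical questions, its concordance with expert consensus (79.6%) exceeds that of baselines such as retrieval-augmented generation (RAG) and full-context input, with the clearest advantage on complex questions and long records. However, the system’s errors are more clinically severe than disagreements between experts, so clinical use still requires prospective evaluation.
Details
Motivation: Treatment decisions in multiple myeloma depend on a cumulative disease history spread across many heterogeneous clinical documents, and it has not been established whether LLM-based systems can synthesise this evidence at a level approaching expert consensus.
Result: Evaluated on the longitudinal records of 811 patients (44,962 documents), the agentic reasoning system reached 79.6% concordance with expert consensus, significantly better than iterative RAG (75.4%) and full-context input (75.8%). Gains were largest on criteria-based synthesis questions (+9.4 percentage points) and on the longest records (top decile, +13.5 percentage points). The system error rate (12.2%) was comparable to the expert disagreement rate (13.6%), but 57.8% of system errors were clinically significant versus only 18.8% of expert disagreements.
Insight: The novelty is an agentic reasoning system that, through iterative, active querying and reasoning, breaks through the performance ceiling that conventional RAG and full-context input hit when synthesising complex, long-sequence clinical data. Viewed objectively, its main value is in demonstrating the potential of agentic architectures for longitudinal medical record analysis that requires deep, multi-step reasoning, while also highlighting the high clinical risk of system errors, an important caution for the safe deployment of future medical AI systems.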
Abstract: Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
[188] Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations cs.AI | cs.CL | cs.CY
Bowen Jian, Rongjie Yu, Hong Wang, Liqiang Wang, Zihang Zou
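TL;DR: This paper proposes a novel pipeline that lets autonomous driving systems derive scenario-specific driving requirements from traffic laws and regulations. By anchoring the reasoning of a large language model in a structured taxonomy of traffic scenarios, the method substantially improves law-scenario matching and the accuracy of the derived mandatory and prohibitive requirements.
Details
Motivation: Autonomous vehicles can violate traffic laws in real-world scenarios, and the conventional approach of encoding legal requirements in formal logic is labor-intensive, hard to scale, and costly to maintain. Using large language models to derive legal requirements automatically is promising, but without explicit reasoning over structured scenarios they tend to retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements.
Result: On Chinese traffic laws and the OnSite dataset (5,897 scenarios), the method improves law-scenario matching by 29.1% and raises the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively.
Insight: The core novelty is a pipeline that explicitly grounds LLM reasoning in a structured traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. This addresses the lack of scenario context in LLM-based interpretation of regulations and offers a practical basis for building a law-compliance layer and a real-time monitor for autonomous driving.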
Abstract: Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicit grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and the OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.
[189] LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People cs.AI | cs.CV | cs.HC | cs.MA
Aydin Ayanzadeh, Tim Oates
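TL;DR: This paper proposes a large language model (LLM)-guided agentic framework that parses floor plans to support accessible indoor navigation for blind and low-vision (BLV) people. A multi-agent module parses the floor plan into a spatial knowledge graph, a path planner generates safe navigation instructions, and a safety evaluator agent then assesses potential hazards.
Details
Motivation: Existing indoor navigation solutions depend on costly per-building infrastructure; the goal is a lightweight, scalable accessible-navigation solution for BLV people.
Result: The system was evaluated on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, success rates for short, medium, and long routes reached 92.31%, 76.92%, and 61.54%, outperforming the strongest single-call baseline (Claude 3.7 Sonnet). MP-3 also showed clear gains, demonstrating a consistent advantage over single-call LLM baselines.
Insight: The novelty is an agentic framework that combines multi-agent collaboration, a self-correcting pipeline, and iterative retry loops to parse a floor plan into a structured knowledge base, with an integrated safety-evaluation mechanism. Viewed objectively, using the LLM as an agent coordinator, improving parsing accuracy through feedback loops, and focusing on the specific safety needs of accessible navigation make it a promising and scalable solution.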
Abstract: Indoor navigation remains a critical accessibility challenge for blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.
cs.LG
[190] KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning cs.LG | cs.AI | cs.CL
Cheng Gao, Cheng Huang, Kangyang Luo, Ziqing Qiao, Shuzheng Si
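TL;DR: This paper proposes KARL, a framework that mitigates hallucinations in large language models (LLMs) through knowledge-boundary-aware reinforcement learning. It comprises a dynamic reward mechanism that estimates the knowledge boundary online from within-group response statistics, and a two-stage training strategy, with the aim of having the model abstain appropriately when a question lies beyond its knowledge while maintaining high accuracy.
Details
Motivation: Existing reinforcement learning methods do encourage autonomous abstention, but their static reward mechanisms are unaware of the model’s knowledge boundary and tend to make the model overly cautious, which hurts answer accuracy. A method is therefore needed that dynamically aligns the model’s abstention behavior with its evolving knowledge boundary.
Result: Extensive experiments on multiple benchmarks show that KARL achieves a superior trade-off between accuracy and hallucination suppression, effectively suppressing hallucinations while maintaining high accuracy in both in-distribution and out-of-distribution scenarios.
Insight: The core innovations are the Knowledge-Boundary-Aware Reward and the Two-Stage RL Training Strategy. The former provides a dynamic reward through online knowledge boundary estimation; the latter first explores the boundary and then converts incorrect answers into abstentions, which avoids the “abstention trap” and reduces hallucinations without sacrificing accuracy.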
Abstract: Enabling large language models (LLMs) to appropriately abstain from answering questions beyond their knowledge is crucial for mitigating hallucinations. While existing reinforcement learning methods foster autonomous abstention, they often compromise answer accuracy because their static reward mechanisms, agnostic to models’ knowledge boundaries, drive models toward excessive caution. In this work, we propose KARL, a novel framework that continuously aligns an LLM’s abstention behavior with its evolving knowledge boundary. KARL introduces two core innovations: a Knowledge-Boundary-Aware Reward that performs online knowledge boundary estimation using within-group response statistics, dynamically rewarding correct answers or guided abstention; and a Two-Stage RL Training Strategy that first explores the knowledge boundary and bypasses the “abstention trap”, and subsequently converts incorrect answers beyond the knowledge boundary into abstentions without sacrificing accuracy. Extensive experiments on multiple benchmarks demonstrate that KARL achieves a superior accuracy-hallucination trade-off, effectively suppressing hallucinations while maintaining high accuracy across both in-distribution and out-of-distribution scenarios.
[191] Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning cs.LG | cs.CL
Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng
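TL;DR: This paper proposes an auxiliary-loss-free supervised fine-tuning framework for MoE models that combines bias-driven sparsification with always-active gated condenser experts. It targets the imbalanced expert activation caused by fragile router layers in MoE models, so that long-tailed expert information is better preserved.
Details
Motivation: During supervised fine-tuning of MoE architectures the router layers are prone to collapse, and existing methods such as DenseMixer and ESFT introduce noisy gradients that degrade performance. The authors observe that even rarely activated experts encode knowledge that matters for downstream tasks, so a new method is needed to preserve this long-tailed expert information.
Result: Experiments on large-scale MoE models show an average gain of more than 2.5% on mathematical reasoning and commonsense QA benchmarks, outperforming state-of-the-art SFT baselines such as DenseMixer and ESFT.
Insight: The novelty is the combination of bias-driven sparsification with always-active gated condenser experts. Rather than forcing balanced activation across all experts, the design encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity, which alleviates gradient starvation, consolidates fragmented information, and preserves long-tailed expert knowledge under sparse routing.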
Abstract: Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for MoE architectures remains difficult because their router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that while certain super experts are activated far more frequently, discarding less-used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity. The condenser experts provide a persistent, learnable pathway that alleviates gradient starvation and facilitates consolidation of information that would otherwise remain fragmented across sparsely activated experts. Analysis further suggests that this design better preserves long-tailed expert information under sparse routing. Experiments on large-scale MoE models demonstrate that our approach outperforms state-of-the-art SFT baselines such as DenseMixer and ESFT, achieving an average gain of 2.5%+ on both mathematical reasoning and commonsenseQA benchmarks.
[192] Process Supervision of Confidence Margin for Calibrated LLM Reasoning cs.LG | cs.CL
Liaoyaqi Wang, Chunsheng Zuo, William Jurayj, Benjamin Van Durme, Anqi Liu
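TL;DR: This paper proposes RLCM (Reinforcement Learning with Confidence Margin), a calibration-aware reinforcement learning framework that addresses the overconfidence that outcome-based rewards induce in large language models (LLMs) during reasoning. By introducing a process-supervision reward based on a confidence margin, it jointly optimizes reasoning correctness and confidence reliability, which reduces hallucinations, improves calibration, and supports more efficient confidence-based control and aggregation.
Details
Motivation: Scaling test-time computation with reinforcement learning (RL) has become a reliable way to improve LLM reasoning, but outcome-based rewards often make models overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation.
Result: Across mathematical, code, logic, and science benchmarks, RLCM substantially improves calibration while maintaining or improving accuracy, and enables more efficient conformal risk control and effective confidence-weighted aggregation.
Insight: The core novelty is building the calibration objective into the RL framework: instead of simply aligning confidence with correctness likelihoods, the method encourages a wider confidence margin between correct and incorrect steps within a single reasoning trajectory, jointly optimizing correctness and confidence reliability and providing a more robust confidence signal for LLM reasoning.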
Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet outcome-based rewards often incentivize models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages a wider confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
[193] SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning cs.LG | cs.AI | cs.CL
Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, Valentina Pyatkin
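TL;DR: This paper shows that recent studies claiming mixed-policy optimization methods outperform the standard SFT-then-RL pipeline rest on a faulty baseline. The fault stems from a CPU-offloaded optimizer bug in DeepSpeed (which silently drops intermediate micro-batches during gradient accumulation) and a loss aggregation bug in OpenRLHF. With these bugs fixed, the standard SFT-then-RL pipeline surpasses every mixed-policy method evaluated on math reasoning benchmarks.
Details
Motivation: The paper sets out to correct a mistaken conclusion in recent work on LLM reasoning optimization, namely that mixed-policy methods outperform the standard SFT-then-RL pipeline, by showing that the conclusion stems from framework bugs that suppress baseline performance.
Result: With the bugs fixed, the standard SFT-then-RL pipeline surpasses every published mixed-policy method evaluated on math benchmarks, by +3.8 points with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with only 50 RL steps outperforms the mixed-policy methods on math benchmarks while using fewer FLOPs.
Insight: The paper’s core contribution is identifying and correcting two framework bugs that distort the evaluation of LLM training, which restores the effectiveness of the standard SFT-then-RL pipeline. It is a reminder that, when comparing new methods, baselines must be implemented correctly, since bugs in the underlying toolchain can severely distort performance comparisons. The results also suggest that a carefully tuned standard pipeline can be simpler and more effective than complex mixed-policy methods.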
Abstract: Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
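To make the two failure modes concrete, here is a toy Python sketch. It is not the OpenRLHF or DeepSpeed code, and the abstract does not give the exact faulty formulas. The first pair of helpers only assumes the generic pitfall that averaging per-mini-batch mean losses weights tokens unequally when mini-batches differ in size. The last helper mimics, again as an assumption, an accumulator that keeps only the final micro-batch gradient. All names and numbers are hypothetical.

```python
def global_token_mean(mini_batches):
    """Average loss over all tokens, so every token carries equal weight."""
    total = sum(sum(mb) for mb in mini_batches)
    count = sum(len(mb) for mb in mini_batches)
    return total / count


def mean_of_means(mini_batches):
    """Average the per-mini-batch means; tokens in small batches get more weight."""
    return sum(sum(mb) / len(mb) for mb in mini_batches) / len(mini_batches)


def accumulate(grads, drop_intermediate=False):
    """Average scalar per-micro-batch gradients.

    drop_intermediate=True imitates (as an assumption) an accumulator that
    silently keeps only the last micro-batch.
    """
    used = grads[-1:] if drop_intermediate else grads
    return sum(used) / len(used)


# Hypothetical per-token losses: one mini-batch of 4 tokens, one of 1 token.
losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(global_token_mean(losses))  # token-weighted average
print(mean_of_means(losses))      # the single high-loss token dominates

# Hypothetical scalar gradients from three micro-batches.
print(accumulate([1.0, 2.0, 6.0]))
print(accumulate([1.0, 2.0, 6.0], drop_intermediate=True))
```

In both cases the training step sees a different signal from the one intended, which is the kind of silent discrepancy the paper attributes the suppressed SFT baseline to.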
[194] On-Device Vision Training, Deployment, and Inference on a Thumb-Sized Microcontroller cs.LG | cs.CV
Jeremy Ellis
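TL;DR: This paper presents a complete end-to-end on-device vision machine learning pipeline, covering data acquisition, training of a two-layer CNN with Adam optimization, and real-time inference, that runs entirely on a microcontroller-class device costing $15-40 USD. The system implements every step of the core ML lifecycle in about 1,750 lines of readable C++, compiles in under one minute, and needs no external ML dependencies. On the Seeed Studio ESP32-S3 XIAO ML Kit it performs three-class 64x64 image classification, with about 9 minutes per training run and real-time inference at 6.3 FPS.
Details
Motivation: Cloud-based workflows require external infrastructure and hide the computational pipeline from the practitioner. The aim is a transparent, accessible end-to-end vision ML system that runs entirely on a microcontroller, lowering the barrier and cost of deployment.
Result: On the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the system achieves three-class 64x64 image classification with about 9 minutes per training run and real-time inference at 6.3 FPS, demonstrating feasibility and efficiency on a resource-constrained device.
Insight: The contributions include correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. Viewed objectively, these contributions streamline the ML workflow on microcontrollers and improve resource utilization and deployment flexibility.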
Abstract: This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu-ai/on-device-vision-ai
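The pre-computed resize lookup table can be pictured with a short sketch. The firmware itself is C++, so this Python version is illustrative only; nearest-neighbour sampling and the function names are assumptions, since the abstract does not state the interpolation method.

```python
def build_resize_lut(src_size, dst_size):
    """Source index to read for each destination index (nearest-neighbour, assumed)."""
    return [(i * src_size) // dst_size for i in range(dst_size)]


def resize_nearest(image, dst_w, dst_h):
    """Resize a 2D list of pixels using tables built once, not per pixel."""
    src_h, src_w = len(image), len(image[0])
    lut_x = build_resize_lut(src_w, dst_w)  # computed once, reusable per frame
    lut_y = build_resize_lut(src_h, dst_h)
    return [[image[sy][sx] for sx in lut_x] for sy in lut_y]


# Hypothetical 4x4 frame whose pixel value encodes its position.
frame = [[row * 4 + col for col in range(4)] for row in range(4)]
print(resize_nearest(frame, 2, 2))
```

The point of the table is that the integer multiply and divide happen once per axis rather than once per pixel per frame, which matters on a microcontroller.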
[195] Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data cs.LG | cs.CV
Long Jing, Zhixiong Yang, Yajun Zhang, Xinlong Feng
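TL;DR: This paper proposes CLMM, a contrastive learning framework that addresses data heterogeneity and label scarcity in multimodal human activity recognition. It uses a two-stage training strategy: the first stage learns cross-modal shared information with a CNN-DiffTransformer encoder and a hard-positive sample weighting algorithm; the second stage captures modality-specific information with a dual-branch architecture of quality-guided attention and bidirectional gated units, and fuses the two with a primary-auxiliary collaborative training strategy. Experiments show that CLMM substantially improves recognition accuracy and convergence on three public datasets.
Details
Motivation: Multimodal human activity recognition faces highly heterogeneous data across modalities and scarce labels, which leaves a gap between existing solutions and real-world needs. The paper aims to design a general framework that achieves effective multimodal recognition with limited labeled data.
Result: Experiments on three public datasets show that CLMM significantly outperforms state-of-the-art (SOTA) baselines in both recognition accuracy and convergence performance.
Insight: The contributions are: (1) a two-stage training strategy that handles shared and modality-specific information separately; (2) a CNN-DiffTransformer encoder combined with a hard-positive sample weighting algorithm to strengthen shared learning; (3) a dual-branch architecture with a primary-auxiliary collaborative training strategy for information fusion. Viewed objectively, the framework uses contrastive learning to ease label scarcity and includes modules designed for heterogeneous multimodal data, which gives it good generality and practicality.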
Abstract: Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive sample weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly outperforms state-of-the-art baselines in both recognition accuracy and convergence performance.
[196] V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think cs.LG | cs.CV
Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt
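TL;DR: This paper proposes V-GRPO, which integrates an ELBO-based surrogate objective with the GRPO algorithm and adds techniques for reducing variance and controlling gradient steps, enabling efficient online reinforcement learning alignment of denoising generative models such as diffusion models. It achieves SOTA performance in text-to-image synthesis while substantially speeding up training.
Details
Motivation: Aligning denoising generative models (such as diffusion models) with human preferences or verifiable rewards remains challenging; existing methods based on MDP trajectories or ELBO surrogates are either inefficient or underperform.
Result: V-GRPO reaches SOTA performance in text-to-image synthesis, with a 2x training speedup over MixGRPO and a 3x speedup over DiffusionNFT.
Insight: The novelty is showing that ELBO-based surrogates can be made stable and efficient through variance reduction and gradient-step control, and designing the V-GRPO framework around this; it is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods.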
Abstract: Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
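For context, GRPO scores each sample against the other samples drawn for the same prompt. The sketch below shows only that group-relative normalization in plain Python; it does not implement the ELBO surrogate or the variance-reduction techniques of V-GRPO, and the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not taken from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is the group mean, no learned value function is needed, and a group whose samples all receive the same reward contributes zero advantage.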
[197] ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers cs.LG | cs.CV
Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
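TL;DR: This paper proposes ELSA, an exact linear-scan attention algorithm that gives Vision Transformers fast, memory-efficient exact attention. By algorithmically reformulating online softmax attention, it preserves exact softmax semantics while achieving O(log n) parallel depth and O(n) extra memory. It does not depend on Tensor Cores and can be deployed as a drop-in replacement on hardware ranging from the A100 to the Jetson TX2.
Details
Motivation: Existing attention accelerators typically give up exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences; a hardware-agnostic, high-precision, efficient exact-attention implementation has been missing.
Result: On A100 FP32 benchmarks (1K-16K tokens), ELSA is 1.3-3.5x faster than memory-efficient SDPA and 1.97-2.27x faster on BERT; on the Jetson TX2 it is 1.5-1.6x faster than Math (64-900 tokens) and yields 17.8-20.2% throughput gains under LLaMA-13B offloading (≥32K tokens). In FP16, ELSA approaches hardware-fused baselines on long sequences while retaining full FP32 capability.
Insight: The core novelty is recasting the online softmax update as a prefix scan over an associative monoid, which preserves exact softmax semantics in theory with an O(u log n) FP32 relative error bound, and delivers hardware independence, O(log n) parallel depth, and drop-in use, giving a unified attention kernel for high-precision inference across platforms.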
Abstract: Existing attention accelerators often trade away exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present ELSA, an algorithmic reformulation of online softmax attention that (i) preserves exact softmax semantics in real arithmetic with a provable $\mathcal{O}(u\log n)$ FP32 relative error bound; (ii) casts the online softmax update as a prefix scan over an associative monoid $(m,S,W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii) is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a drop-in replacement requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2, making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to $O(\log n)$ at full precision. On A100 FP32 benchmarks (1K–16K tokens), ELSA delivers $1.3$–$3.5\times$ speedup over memory-efficient SDPA and $1.97$–$2.27\times$ on BERT; on Jetson TX2, ELSA achieves $1.5$–$1.6\times$ over Math (64–900 tokens), with $17.8$–$20.2\%$ throughput gains under LLaMA-13B offloading at $\ge$32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
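As a minimal sketch of the prefix-scan idea, the Python below treats each key as a triple (m, S, W) and merges triples with an associative operator. The abstract names the monoid but not its fields, so the reading here (m as the running maximum score, S as the rescaled sum of exponentials, W as the rescaled weighted sum of values) is an assumption borrowed from the standard online softmax recurrence, and the function names are hypothetical. It is not the ELSA Triton or CUDA kernel.

```python
import math
from functools import reduce


def lift(score, value):
    """Map one (score, value) pair to an element (m, S, W)."""
    return (score, 1.0, list(value))


def combine(a, b):
    """Associative merge: rescale both sides to the shared maximum."""
    m1, s1, w1 = a
    m2, s2, w2 = b
    m = max(m1, m2)
    c1, c2 = math.exp(m1 - m), math.exp(m2 - m)
    return (m, s1 * c1 + s2 * c2, [x * c1 + y * c2 for x, y in zip(w1, w2)])


def tree_reduce(elems):
    """Balanced pairwise reduction, the order a log-depth parallel scan would use."""
    while len(elems) > 1:
        merged = [combine(elems[i], elems[i + 1]) for i in range(0, len(elems) - 1, 2)]
        if len(elems) % 2:
            merged.append(elems[-1])
        elems = merged
    return elems[0]


def attention_row_scan(scores, values):
    """One query's attention output via a left-to-right fold of the monoid."""
    m, s, w = reduce(combine, (lift(sc, v) for sc, v in zip(scores, values)))
    return [x / s for x in w]


def attention_row_reference(scores, values):
    """Plain max-subtracted softmax attention for comparison."""
    mx = max(scores)
    weights = [math.exp(sc - mx) for sc in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(wt * v[d] for wt, v in zip(weights, values)) / total for d in range(dim)]
```

Because `combine` is associative, the balanced `tree_reduce` order and the left-to-right `reduce` order agree up to floating-point rounding, which is what lets a parallel scan reach logarithmic depth without approximating the softmax.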
cs.CY
[198] When VLMs ‘Fix’ Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR cs.CY | cs.AI | cs.CV | cs.LG
Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim
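TL;DR: This paper presents the first systematic study of optical character recognition (OCR) for multi-line handwritten mathematics and reveals a critical failure mode of vision-language models (VLMs): over-correction. The authors propose PINK (Penalized INK-based score), a semantic evaluation metric that uses a large language model (LLM) for rubric-based grading and explicitly penalizes over-correction. An evaluation of 15 state-of-the-art VLMs on the FERMAT dataset shows substantial ranking reversals relative to BLEU, and a human expert study shows that PINK agrees better with human judgment.
Details
Motivation: Current benchmarks fail to evaluate handwritten math transcription properly: prior studies mostly focus on single-line expressions and rely on lexical metrics such as BLEU, which cannot assess semantic reasoning across multi-line student solutions. When transcribing student work, VLMs often “fix” errors and so hide the very mistakes an educational assessment aims to detect; this is the over-correction problem.
Result: A comprehensive evaluation of 15 SOTA VLMs on the FERMAT dataset shows major ranking reversals relative to BLEU: models such as GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. A human expert study shows that PINK agrees significantly better with human judgment (55.0% preference vs. 39.5% for BLEU).
Insight: The paper’s novelty is the first systematic study of VLM over-correction in multi-line handwritten math OCR, together with PINK, a new semantic evaluation metric that uses an LLM for rubric-based scoring and explicitly penalizes over-correction, giving a more reliable evaluation framework for handwritten math OCR in educational settings that agrees better with human judgment.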
Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student’s work, these models often “fix” errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU’s 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.
[199] A satellite foundation model for improved wealth monitoring cs.CY | cs.CV
Zhuo Zheng, Iván Higuera-Mendieta, Richard Lee, David Newhouse, Talip Kilic
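TL;DR: This paper introduces Tempov, a satellite foundation model for predicting and monitoring wealth and poverty from satellite imagery at large scale and high resolution. The model is pretrained by self-supervision on three million bi-temporal Landsat image pairs and adapted to sparse survey labels with parameter-efficient fine-tuning. It supports zero-shot nowcasting, retrospective hindcasting, and decadal change tracking, reaches competitive accuracy in low-label regimes with only 10% of survey samples, and is used to produce high-resolution maps of wealth change for the African continent.
Details
Motivation: Traditional poverty statistics (such as censuses and household surveys) in low- and middle-income countries are costly, infrequent, quickly outdated, and error-prone. Satellite imagery offers global coverage and the possibility of prediction at scale, but existing approaches fall short in reliably identifying local-level variation and in coping with temporal shift.
Result: Tempov outperforms existing neural network and geospatial foundation-model baselines on wealth prediction. In low-label regimes it reaches competitive accuracy with only 10% of survey samples. The model generalizes well across populous countries within and outside Africa and scales to a unified Africa-wide model with strong performance ($R^2=0.63$, $r^2=0.68$), from which high-resolution decadal maps of wealth and wealth change for Africa are generated.
Insight: The novelty is a self-supervised pretrained foundation model (Tempov) designed for temporal satellite imagery, adapted to sparse supervision with parameter-efficient fine-tuning, which substantially reduces dependence on expensive label collection. The approach enables large-scale, high-resolution monitoring of wealth dynamics (nowcasting, hindcasting, and change tracking) and offers an open-source route to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.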
Abstract: Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ($R^2=0.63$, $r^2=0.68$), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.
cs.RO [Back]
[200] Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training cs.RO | cs.CVPDF
Suning Huang, Jiaqi Shao, Ke Wang, Qianzhong Chen, Jiankai Sun
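TL;DR: This paper proposes DeLock, a method that addresses the 'lock-in' phenomenon that arises when vision-language-action (VLA) policies undergo supervised fine-tuning (SFT) on a small amount of demonstration data: the policy over-specializes to the fine-tuning data and stops responding to new instructions. DeLock mitigates lock-in by preserving visual grounding during fine-tuning and applying contrastive prompt guidance at test time, and it performs strongly across eight simulation and real-world evaluations.
Details
Motivation: Address 'concept lock-in' (fixation on training objects/attributes) and 'spatial lock-in' (fixation on training spatial targets) in generalist VLA policies after low-data fine-tuning, so that the policy generalizes to novel instructions while retaining its performance on the fine-tuning data.
Result: Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds a state-of-the-art generalist policy fine-tuned on substantially more curated demonstrations.
Insight: The key finding is that the policy's internal pre-trained knowledge is sufficient, with no need for additional supervision signals or data augmentation. DeLock counters lock-in by preserving visual grounding during fine-tuning and applying contrastive prompt guidance at test time, which makes the approach both efficient and data-efficient.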
Abstract: Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy’s internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy’s denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.
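The abstract does not give the formula behind "test-time contrastive prompt guidance". The sketch below assumes it resembles classifier-free guidance: the denoiser is run once under the novel instruction and once under a contrast prompt, and the output is extrapolated away from the contrast prediction. The function name, the choice of contrast prompt, and the guidance scale are all hypothetical and are not taken from the paper.

```python
def contrastive_guidance(pred_target, pred_contrast, scale):
    """Extrapolate from the contrast-prompt prediction toward the target-prompt one.

    ASSUMPTION: a classifier-free-guidance-style rule, not DeLock's actual update.
    scale = 1.0 returns the target prediction unchanged; scale > 1.0 amplifies
    the direction in which the two predictions differ.
    """
    return [c + scale * (t - c) for t, c in zip(pred_target, pred_contrast)]


# Hypothetical per-step denoiser outputs for a 2-D action.
pred_novel_instruction = [1.0, 0.0]     # conditioned on the new instruction
pred_training_instruction = [0.5, 0.5]  # conditioned on a prompt seen in post-training

unguided = contrastive_guidance(pred_novel_instruction, pred_training_instruction, 1.0)
guided = contrastive_guidance(pred_novel_instruction, pred_training_instruction, 2.0)
```

With a scale of 2.0, the guided output moves past the novel-instruction prediction, away from the behavior tied to the training prompt. That is the intuition behind steering the denoising dynamics with a contrast.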
cs.SE [Back]
[201] Code Broker: A Multi-Agent System for Automated Code Quality Assessment cs.SE | cs.AI | cs.CL | cs.PLPDF
Samer Attrah
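TL;DR: This paper introduces Code Broker, a multi-agent system built with Google Agent Development Kit that automates quality assessment of Python code. A hierarchical architecture coordinates several specialised agents, combining LLM reasoning with static-analysis tools to produce reports that cover four dimensions (correctness, security, style, and maintainability), rendered in Markdown and HTML.
Details
Motivation: Address the shortcomings of traditional code quality tools in feedback readability, comprehensiveness, and degree of automation, with the aim of giving developers more intuitive, actionable code quality analysis and improvement suggestions.
Result: A preliminary qualitative evaluation on representative Python codebases suggests that parallel specialised agents produce readable, developer-oriented feedback, while the system currently has limitations in evaluation depth, security tooling, large-repository handling, and memory persistence.
Insight: Contributions include a hierarchical multi-agent architecture that coordinates LLM and static-analysis signals, asynchronous execution with retry logic for robustness, and lightweight session memory for managing prior assessment context. From a system-design standpoint, the paper offers reusable engineering practice for automating code quality assessment.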
Abstract: We present Code Broker, a multi-agent system built with Google Agent Development Kit (ADK) that analyses Python code from files, local directories, or GitHub repositories and generates actionable quality assessment reports. The system employs a hierarchical five-agent architecture in which a root orchestrator coordinates a sequential pipeline agent, which in turn dispatches three specialised agents in parallel (a Correctness Assessor, a Style Assessor, and a Description Generator) before synthesising findings through an Improvement Recommender. Reports score four dimensions (correctness, security, style, and maintainability) and are rendered in both Markdown and HTML. Code Broker combines LLM-based reasoning with deterministic static-analysis signals from Pylint, uses asynchronous execution with retry logic to improve robustness, and explores lightweight session memory for retaining and querying prior assessment context. We position the paper as a technical report on system design and prompt or tool orchestration, and present a preliminary qualitative evaluation on representative Python codebases. The results suggest that parallel specialised agents produce readable, developer-oriented feedback, while also highlighting current limitations in evaluation depth, security tooling, large-repository handling, and the current use of only in-memory persistence. All code and reproducibility materials are available at: https://github.com/Samir-atra/agents_intensive_dev.
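The orchestration pattern described in the abstract (three assessors dispatched in parallel, each wrapped in retry logic, followed by a synthesis step) can be sketched with Python's standard `asyncio` module. This sketch does not use the Google ADK API, and it is not Code Broker's implementation. Every function name, the backoff schedule, and the placeholder findings are hypothetical stand-ins for LLM and Pylint calls.

```python
import asyncio


async def with_retry(make_call, attempts=3, base_delay=0.01):
    """Await make_call(); on failure, back off exponentially and try again."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


# Hypothetical stand-ins for the three specialised agents.
async def correctness_assessor(code):
    return {"agent": "correctness", "note": f"{len(code.splitlines())} line(s) checked"}


async def style_assessor(code):
    return {"agent": "style", "note": "placeholder for Pylint-derived signals"}


description_calls = {"count": 0}


async def description_generator(code):
    # Fails on its first call to exercise the retry path.
    description_calls["count"] += 1
    if description_calls["count"] == 1:
        raise RuntimeError("simulated transient failure")
    return {"agent": "description", "note": "placeholder summary"}


async def improvement_recommender(findings):
    # Synthesis step: runs only after all parallel agents have finished.
    return "Findings from: " + ", ".join(f["agent"] for f in findings)


async def assess(code):
    # asyncio.gather runs the three agents concurrently and keeps argument order.
    findings = await asyncio.gather(
        with_retry(lambda: correctness_assessor(code)),
        with_retry(lambda: style_assessor(code)),
        with_retry(lambda: description_generator(code)),
    )
    return await improvement_recommender(findings)


report = asyncio.run(assess("print('hi')\n"))
```

Wrapping each agent separately means one transient failure is retried without rerunning the agents that already succeeded.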
[202] AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking cs.SE | cs.CLPDF
Dongxin Guo, Jikun Wu, Siu Ming Yiu
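TL;DR: This paper proposes AgentEval, a framework for structured evaluation of multi-step agentic workflows. It models agent executions as directed acyclic graphs (DAGs) in which each node carries quality metrics assessed by a calibrated LLM judge and classified through a hierarchical failure taxonomy, and it tracks error propagation to automate root-cause localization. Experiments show higher failure detection recall than end-to-end evaluation, and higher recall and root cause accuracy than flat step-level evaluation.
Details
Motivation: Current evaluation practices for agentic systems, such as end-to-end outcome checks and ad-hoc trace inspection, systematically mask intermediate-step failures, which dominate real-world error budgets. A framework is needed that evaluates the intermediate steps of agentic workflows systematically and tracks error propagation.
Result: Across three production workflows (450 test cases), AgentEval achieves 2.17x the failure detection recall of end-to-end evaluation (0.89 vs. 0.41), a Cohen’s kappa of 0.84 against human experts, and 72% root cause accuracy (human ceiling: 81%). Cross-system evaluation on tau-bench and SWE-bench traces shows transferability (failure detection recall >= 0.78). An ablation shows that DAG-based dependency modeling alone contributes +22 percentage points to failure detection recall and +34 percentage points to root cause accuracy.
Insight: The core innovation is formalizing agentic workflows as evaluation DAGs and combining them with a hierarchical failure taxonomy and error propagation tracking, which yields fine-grained, interpretable step-level evaluation. Structured dependency modeling substantially improves failure detection and root cause analysis, transfers across workflows and benchmarks, and provides a practical evaluation tool for CI/CD of agentic systems.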
Abstract: Agentic systems that chain reasoning, tool use, and synthesis into multi-step workflows are entering production, yet prevailing evaluation practices like end-to-end outcome checks and ad-hoc trace inspection systematically mask the intermediate failures that dominate real-world error budgets. We present AgentEval, a framework that formalizes agent executions as evaluation directed acyclic graphs (DAGs), where each node carries typed quality metrics assessed by a calibrated LLM judge (GPT-4o), classified through a hierarchical failure taxonomy (3 levels, 21 subcategories), and linked to upstream dependencies for automated root cause attribution. An ablation study isolates the impact of DAG-based dependency modeling: it alone contributes +22 percentage points to failure detection recall and +34 pp to root cause accuracy over flat step-level evaluation with identical judges and rubrics. Across three production workflows (450 test cases, two agent model families, predominantly sequential architectures with a 12% non-DAG trace rate), AgentEval achieves 2.17x higher failure detection recall than end-to-end evaluation (0.89 vs. 0.41), Cohen’s kappa = 0.84 agreement with human experts, and 72% root cause accuracy against an 81% human ceiling. Cross-system evaluation on tau-bench and SWE-bench traces confirms transferability (failure detection recall >= 0.78) without taxonomy or rubric modification. A 4-month pilot with 18 engineers detected 23 pre-release regressions through CI/CD-integrated regression testing, reducing median root-cause identification time from 4.2 hours to 22 minutes and driving measurable failure rate reductions in two workflows.
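The abstract links each failing node to its upstream dependencies for root cause attribution. The sketch below illustrates that idea under a simple assumed rule: a failed step counts as a root cause only if none of its ancestors in the DAG also failed. The rule, the function name, and the toy workflow are hypothetical; AgentEval's actual attribution procedure, taxonomy, and judge are not reproduced here.

```python
def root_causes(parents, failed):
    """Return the failed steps that have no failed ancestor in the DAG.

    parents maps each step to the list of steps it directly depends on.
    ASSUMPTION: failures downstream of another failure are treated as propagated.
    """
    def has_failed_ancestor(node, seen):
        for parent in parents.get(node, []):
            if parent in seen:
                continue  # already explored without finding a failure
            seen.add(parent)
            if parent in failed or has_failed_ancestor(parent, seen):
                return True
        return False

    return sorted(n for n in failed if not has_failed_ancestor(n, set()))


# Hypothetical five-step workflow: plan -> (search, lookup) -> synthesize -> answer.
parents = {
    "plan": [],
    "search": ["plan"],
    "lookup": ["plan"],
    "synthesize": ["search", "lookup"],
    "answer": ["synthesize"],
}
failed = {"search", "synthesize", "answer"}  # step-level judge verdicts (made up)

causes = root_causes(parents, failed)

# The reported recall ratio, recomputed from the two figures in the abstract.
recall_ratio = round(0.89 / 0.41, 2)
```

An end-to-end check on this trace would see only that `answer` failed, and a flat step-level check would flag three failures with no ordering. Following the dependency edges narrows attribution to `search`.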
cs.GR [Back]
[203] Personalizing Causal Audio-Driven Facial Motion via Dynamic Multi-modal Retrieval cs.GR | cs.CVPDF
Xuangeng Chu, Yu Han, Wei Mao, Shih-En Wei
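TL;DR: This paper proposes an end-to-end causal framework that personalizes causal facial motion generation through dynamic multi-modal style retrieval. It uses unstructured style references at ultra-low latency, addressing the trade-off in existing methods between real-time streaming and high-fidelity personalization.
Details
Motivation: Existing audio-driven facial animation frameworks cannot reconcile real-time streaming with high-fidelity personalization: they typically rely on latency-inducing audio look-ahead or require users to pre-encode static embeddings, which fail to capture dynamic idiosyncrasies.
Result: The method significantly outperforms state-of-the-art approaches in lip-sync accuracy, identity consistency, and perceived realism, as supported by extensive quantitative evaluations and user studies.
Insight: Innovations include: (1) a temporal hierarchical motion representation that preserves decoding causality while capturing global temporal context and high-frequency details; and (2) a multi-modal style retriever that jointly queries audio and motion to extract stylistic priors dynamically without breaking causality, enabling scalable personalization.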
Abstract: Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on latency-inducing audio look-ahead, or require high user compliance to pre-encode static embeddings that fail to capture dynamic idiosyncrasies. We present an end-to-end causal framework for personalizing causal facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while uniquely leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality. This mechanism allows for scalable personalization with total flexibility regarding the number and contents of templates. By integrating these components into a causal autoregressive architecture, our method significantly outperforms state-of-the-art approaches in lip-sync accuracy, identity consistency, and perceived realism, supported by extensive quantitative evaluations and user studies.